Format & storage
sysuse census.dta, clear
generate urban_share = popurban/pop
* Format
// Sometimes it is useful to fix how variables are displayed
// You can set the display format of one or more variables with the command "format"
help format
* Create copies for comparison
gen urban_share1 = urban_share
gen urban_share2 = urban_share
gen urban_share3 = urban_share
gen urban_share4 = urban_share
gen urban_share5 = urban_share
br urban_share*
* Format "general" --> how much space should be used to display the content
format urban_share1 %12.0g
// total display: 12 units (some might be "empty", i.e., blanks), Stata chooses how many after the decimal point
format urban_share2 %6.0g
// total display: 6 units, Stata chooses how many after the decimal point
* Format "fixed" --> fixes how many digits after the decimal point will be displayed
format urban_share3 %9.2f
// total display: 9 units (some might be "empty"), after the decimal point: exactly 2
format urban_share4 %9.3f
// total display: 9 units, after the decimal point: exactly 3
format urban_share5 %-9.3f
// same as before, but left-justified
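* Check: formats change the display only, not the stored values
// A minimal sketch using the copies created above (not part of the original example):
// "describe" lists the display format of each variable, and "assert" confirms
// that two differently formatted copies still hold identical values
describe urban_share*
assert urban_share3 == urban_share4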
* Storage types
// Variables need different amounts of storage. Compare, for example:
gen dummy1 = 1 // --> only little storage needed
gen dummy2 = 132224238 // --> more storage needed
gen dummy3 = sqrt(pop) // --> even more storage needed
help data types
// You can set the data type of the variables to balance storage & precision
* Set data type when generating variables
gen byte dummy4 = 1 // "byte" is sufficient for integer values between -127 and 100
* Compress data
compress
// minimize storage without loss of information
* Change storage type
recast double dummy2 // you can always change to a higher storage type
recast byte dummy2 // you cannot always change to a lower storage type without information loss
recast byte dummy2, force // use "force" only if you are willing to lose information
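// To see what "force" did (a small sketch, not part of the original example):
// check the new storage type and count how many values no longer fit into a byte.
// Values outside the byte range are expected to have been set to missing.
describe dummy2
count if missing(dummy2)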
* Common mistake
sysuse census.dta, clear
gen urban_share = popurban/pop
// Let's assume you know the size of the urban population and the total population in Alabama
list popurban pop if state=="Alabama"
// So, you should be able to find the observation for which urban_share== 2,337,713 / 3,893,888
tab state if urban_share== 2337713 / 3893888
// No observation was found
/*
Stata calculates with a high precision, but the default storage of variables is
float, which is not as precise. Most of the time, this doesn't matter, but in
some cases, this can be crucial. In these cases, make sure that variables are
created as double (or are imported as double).
*/
gen double urban_share_prec = popurban/pop
tab state if urban_share_prec== 2337713 / 3893888
// If you only want to filter, use the float() function to round the precise calculation to float precision:
tab state if urban_share== float(2337713 / 3893888)
// This can also happen when you don't expect it
gen dummy5 = 0.1
count if dummy5==0.1
count if dummy5==float(0.1)
// The reason is that Stata works with binary, not decimal numbers
// --> 0.1 has an infinitely repeating binary representation, just as 1/11 does in decimal (0.0909...)
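// A quick way to see this (a sketch, not part of the original example):
// display 0.1 in double precision and rounded to float precision, with 18 decimals each
display %20.18f 0.1
display %20.18f float(0.1)
// The two results differ from about the ninth decimal place onwards,
// which is why the comparison dummy5==0.1 finds no observation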
/*
You can also change the default with "set type double", so that Stata always
creates variables as double, but a double needs twice as much storage as a float.
Make sure to use "compress" to save storage for all variables that are stored in
a higher type than needed.
*/
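// A minimal sketch of changing the default (the variable name dummy6 is only for illustration)
set type double
gen dummy6 = 0.1
count if dummy6==0.1 // all observations match, because dummy6 is stored as double
set type float // restore Stata's default afterwards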
Exercise
Load the pre-installed dataset auto.
- Change the display format of the variable “price” such that two digits after the decimal point are displayed.
- You can also tell the display command in which format it should display something (see help file). Display the number 1/11 using different formats: 9 units in the general format, 9 units (none after the decimal point) in the fixed format, and 9 units (7 after the decimal point) in the fixed format. What do you observe?
- Which storage type does the variable “rep78” have? Change its storage to the lowest storage format possible without loss of information.
- Summarize the variable “gear_ratio”. Use the command “tab” or “list” to show the name of the car with the lowest gear ratio.