Format & storage

sysuse census.dta, clear
generate urban_share = popurban/pop
* Format
// Sometimes it is useful to fix how the values of a variable are displayed
// You can set the display format of one or more variables with the command format
help format

* Create copies for comparison
gen urban_share1 = urban_share
gen urban_share2 = urban_share
gen urban_share3 = urban_share
gen urban_share4 = urban_share
gen urban_share5 = urban_share
br urban_share*

* Format "general" --> how much space should be used to display the content

format urban_share1 %12.0g // total display width: 12 characters (some might be "empty", i.e., blanks), Stata chooses how many digits follow the decimal point
format urban_share2 %6.0g // total display width: 6 characters, Stata chooses how many digits follow the decimal point

* Format "fixed" --> fixes how many digits after the decimal point will be displayed
format urban_share3 %9.2f // total display: 9 units (some might be "empty"), after the decimal point: exactly 2
format urban_share4 %9.3f // total display: 9 units, after the decimal point: exactly 3
format urban_share5 %-9.3f // same as before, but left-justified
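
* A short check, added here as an illustration: a format changes only how
* values are displayed, never what is stored.
describe urban_share* // the "display format" column shows the formats set above
list urban_share1-urban_share5 in 1/5 // same values, displayed differently
assert urban_share3 == urban_share // the stored values are still identical
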
* Storage types
// Variables need different amounts of storage; compare, for example:
gen dummy1 = 1 // --> needs very little storage (a byte would suffice)
gen dummy2 = 132224238 // --> needs more storage (too large for an int)
gen dummy3 = sqrt(pop) // --> needs even more storage (non-integer values)
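* A quick check, added as an illustration: without an explicit type, generate
* stores new variables as float (unless the default was changed with "set type").
describe dummy1 dummy2 dummy3
* A float holds only about 7 significant digits, so the large integer in dummy2
* is not stored exactly; a wide fixed format makes the rounding visible.
format dummy2 %12.0f
list dummy2 in 1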
help data types // You can set the data type of the variables to balance storage & precision
* Set data type when generating variables
gen byte dummy4 = 1 // "byte" sufficient for values between -127 and 100
* Compress data
compress // minimize storage without loss of information
* Change storage type
recast double dummy2 // you can always change to a higher storage type
recast byte dummy2 // you cannot always change to a lower storage type without information loss; here Stata refuses and leaves the variable unchanged
recast byte dummy2, force // use "force" if you are willing to give up precision (values that do not fit become missing)
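* A minimal sketch, not part of the original steps: check what recast did.
* dummy_chk is a hypothetical variable name used only for this illustration.
gen long dummy_chk = 132224238 // a long stores this integer exactly
recast int dummy_chk // refused without "force": the value does not fit into an int
describe dummy_chk // the storage type should still be long
recast double dummy_chk // changing to a higher storage type loses nothing
describe dummy_chk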
* Common mistake
sysuse census.dta, clear
gen urban_share = popurban/pop // Let's assume you know the size of the urban population and the total population in Alabama
list popurban pop if state=="Alabama" // So, you should be able to find the observation for which urban_share== 2,337,713 / 3,893,888
tab state if urban_share== 2337713 / 3893888 // No observation was found
/*
Stata performs calculations in double precision, but the default storage type of new variables is float, which is less precise. Most of the time this does not matter, but in some cases, such as tests for exact equality, it is crucial. In these cases, make sure that variables are created as double (or are imported as double).
*/

gen double urban_share_prec = popurban/pop
tab state if urban_share_prec== 2337713 / 3893888

// If you only want to filter, use the float() function to round the precise calculation to float precision:
tab state if urban_share== float(2337713 / 3893888)

// This can also happen when you don't expect it
gen dummy5 = 0.1
count if dummy5==0.1
count if dummy5==float(0.1) // The reason is that Stata works with binary, not decimal numbers
// --> 0.1 has an infinitely repeating binary expansion, just as 1/11 has in decimal numbers
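
* Illustration, added as a sketch: show both values with many decimals.
* The double and the float version of 0.1 agree only in the first few digits.
display %20.18f 0.1
display %20.18f float(0.1)
* A possible alternative to float() is to compare with a small tolerance.
* The tolerance 1e-6 is an arbitrary assumption; choose one that suits your data.
count if abs(dummy5 - 0.1) < 1e-6
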
/*
You can also change the default with "set type double" so that Stata always creates new variables as double, but this needs much more storage (8 instead of 4 bytes per value). Make sure to use "compress" to save storage for all variables that are stored in a higher type than needed.
*/
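
* A minimal sketch of changing the default type, added as an illustration.
* dummy6 is a hypothetical variable name that does not appear elsewhere.
set type double // new variables are now created as double
gen dummy6 = 0.1
describe dummy6 // the storage type should be double
count if dummy6 == 0.1 // now matches, because nothing was rounded to float
set type float // restore Stata's default for the rest of the session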

Exercise

Load the pre-installed dataset auto.

  1. Change the display format of the variable “price” so that two digits after the decimal point are displayed.
  2. You can also tell the display command in which format it should display something (see help file). Display the number 1/11 using different formats: 9 units in the general format, 9 units (none after the decimal point) in the fixed format, and 9 units (7 after the decimal point) in the fixed format. What do you observe?
  3. Which storage type does the variable “rep78” have? Change its storage to the lowest storage format possible without loss of information.
  4. Summarize the variable “gear_ratio”. Use the command “tab” or “list” to show the name of the car with the lowest gear ratio.