Common Mistakes

*************************************
*** Missings are infinitely large ***

sysuse census.dta, clear 

gen urban_share = popurban/pop
replace urban_share = . if region==3 	// generate missings for pedagogic purposes 

// You want to create a variable indicating a majority of urban population
gen urban_majority_wrong = 1 if urban_share>0.5
replace urban_majority_wrong = 0 if urban_share<=0.5
tab urban_majority_wrong
tab region urban_majority_wrong
/*
The variable should be missing for region 3 (South), but it is coded as 1.
Stata treats missings as infinitely large, hence "urban_share>0.5" is true
for observations with an urban share above 50%, and also for observations with
a missing urban share! To prevent this, exclude missings in the if-expression
or replace the new variable with "." for all observations that are missing in
the urban_share variable.
*/
gen urban_majority_right = 1 if urban_share>0.5 & urban_share<.	
replace urban_majority_right = 0 if urban_share<=0.5
// --> results in 0's, 1's and missings
// as an alternative to "if urban_share<.", you can use "if !missing(urban_share)"

// Same problem if using the shortcut:
gen urban_majority_shortcut_wrong = (urban_share>0.5)
// --> generates only 0's and 1's: missings are wrongly coded as 1
gen urban_majority_shortcut_right = (urban_share>0.5) if urban_share<.
// --> generates 0's, 1's and missings

// ---> Always make sure that missings are excluded from your variable definition
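
// A possible check (a sketch, not part of the original example): tabulate with
// the "missing" option so that missings show up in the table, and assert that
// the new variable is missing exactly where urban_share is missing.
tab region urban_majority_shortcut_right, missing
assert missing(urban_majority_shortcut_right) == missing(urban_share)
// For comparison, count the missings that the wrong shortcut coded as 1:
count if urban_majority_shortcut_wrong==1 & missing(urban_share)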




***********************************
*** sum() creates a running sum ***

sysuse census.dta, clear 

// You want to create a variable with the total US population & find the function sum()
gen total_pop = sum(pop)
br state pop total_pop
// But this creates the running (cumulative) sum, which is not constant over observations!

// Instead, use the egen function total() for the overall sum
egen total_pop2 = total(pop)
egen total_pop3 = sum(pop)	// sum() is an older name for the egen function total()
br state pop total_pop*
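
// A hedged cross-check (run_pop and tot_pop are illustrative names, not part
// of the original example): the last value of the running sum should equal the
// constant total. Both are created as double here, because the US total is too
// large to be stored exactly in a float.
gen double run_pop = sum(pop)
egen double tot_pop = total(pop)
assert run_pop[_N] == tot_pop
// If you only need the number, it is also available without a new variable:
summarize pop, meanonly
display %15.0fc r(sum)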




**************************************
*** Variable lists depend on order ***

sysuse census.dta, clear 

// You want to generate a variable that sums up the population groups
egen pop_sum = rowtotal(poplt5-pop65p)

// You make some changes further above in the do-file, changing the order of variables
order pop18p
// This can also happen if you change the order in which variables are created!

// Now, the command no longer does what was intended, but there is no warning
egen pop_sum_new = rowtotal(poplt5-pop65p)

// ---> Always check the results of the variable generation
// ---> Only use order-dependent varlists if you are sure about the order
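
// One way to guard against this (a sketch; pop_sum_explicit is an illustrative
// name): display what the range currently expands to, and spell the variables
// out explicitly so that the result no longer depends on the variable order.
ds poplt5-pop65p
egen pop_sum_explicit = rowtotal(poplt5 pop5_17 pop18p pop65p)
assert pop_sum_explicit == pop_sum
// Count the observations for which the order-dependent version now differs:
count if pop_sum_new != pop_sum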




***************************************
*** Some expressions depend on sort ***

sysuse census.dta, clear

// You want to create a state ID
gen state_id = _n

// You know that the population for Arizona is wrong & want to set it to missing
replace pop = . if state_id==3

// You make some changes further above in the do-file, changing the sort order of the observations
sort pop
// This can also happen if you use bysort to create variables!

// Now, states get different IDs
// --> might cause problems if they were used for identification/matching
gen state_id_new = _n
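
// A possible safeguard (a sketch): identify observations by a variable that
// does not depend on the sort order, and verify that it is a unique identifier.
isid state
list state state_id state_id_new if state=="Arizona"
// Missings are sorted last, so after "sort pop" Arizona is the last observation:
assert state_id_new == _N if state=="Arizona"
// Conditions based on such a key keep working regardless of the sort order:
count if missing(pop) & state=="Arizona"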




*******************************
*** 0.1 is not equal to 0.1 ***

sysuse census.dta, clear

gen urban_share = popurban/pop

// Let's assume you know the size of the urban population and the total population in Alabama
list popurban pop if state=="Alabama"

// So, you should be able to find the observation for which urban_share== 2,337,713 / 3,893,888
tab state if urban_share== 2337713 / 3893888
// No observation was found

/*
Stata calculates in double precision, but the default storage type of variables
is float, which is less precise. Most of the time, this doesn't matter, but in
some cases it is crucial. In these cases, make sure that variables are
created as double (or are imported as double).
*/

gen double urban_share_prec = popurban/pop
tab state if urban_share_prec== 2337713 / 3893888

// If you only want to filter, use float() to round the precise calculation to float precision:
tab state if urban_share== float(2337713 / 3893888)

// This can also happen when you don't expect it
gen dummy = 0.1
count if dummy==0.1
count if dummy==float(0.1)
// The reason is that Stata works with binary, not decimal numbers
// --> 0.1 has infinitely many binary digits, just like 1/11 in decimal numbers

/* 
You can also change the default with "set type double", so that Stata always
creates variables as double, but this needs much more storage. Make sure to use
"compress" to save storage for all variables that are stored in a larger
storage type than needed.
*/
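
// To make the difference visible, display both numbers with many digits
// (a sketch, not part of the original example):
display %21.18f 0.1
display %21.18f float(0.1)
assert 0.1 != float(0.1)
assert dummy == float(0.1)
// Another option when filtering is to compare within a tolerance; the
// threshold 1e-6 is an arbitrary choice for illustration:
count if abs(urban_share - 2337713/3893888) < 1e-6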




**************************************************
*** Large numbers need to be stored in doubles ***

// Precision is not only relevant for decimals, but also for large numbers.
// Imagine the following setup: You have a dataset with many observations. 
// You wish to create a unique identifier based on the number of each observation:

clear all

// Let's create a really large dataset: 22,500,350 observations
set obs 22500350

// Let's create an id based on the number of each observation
gen id = _n

// Now, check whether the id is a unique identifier (which it should be)
capture noisily isid id	// "capture noisily" shows the error without stopping the do-file
//!\\ The id is NOT a unique identifier //!\\

// A full description with the codebook command:
codebook id
// --> Stata identifies only 19,638,783 unique values

// To see what happened, we format the id variable and look at it at high values
format id %12.0g
browse in 20000000/22500350
// --> There are several entries for some numbers, and none for others
// This is because our id variable is stored as a float, which represents
// integers exactly only up to 16,777,216 (2^24)

// Solution: Store the number of each observation in a double
gen double id2 = _n

// Let's look at the codebook again
codebook id2
// Now the number of unique values is correct

// Direct comparison:
browse in 20000000/22500350

/*
To sum up, whenever you want to store large values in a variable, make it a
double. Before storing the dataset, you can use "compress" to minimize the
storage without loss of information - this means that the variable will stay
a double if necessary, but become a long (or a smaller integer type) if possible.
*/
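
// The limit can be checked directly (a sketch): above 2^24 = 16,777,216,
// a float can no longer represent every integer.
display %12.0f float(16777216)
display %12.0f float(16777217)
assert float(16777217) == 16777216
// For integer ids, the storage type long is an alternative to double
// (id3 is an illustrative name); long holds integers up to 2,147,483,620.
gen long id3 = _n
isid id3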




********************************************
*** Time variables need a high precision ***

// Time variables are a special case of the precision problem
// Let's assume you have some exact time information, such as start of an interview
clear
set obs 10
gen time = clock("15. April 2021 12:00:00","DMYhms")
format time %tc
br time
// According to the variable, the interview started several seconds after 12:00:00, how can this happen?

// Time variables (milliseconds since 1960) need more digits than a float can store --> need double
gen double exact_time = clock("15. April 2021 12:00:00","DMYhms")
format exact_time %tc
br *time
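
// A quick way to quantify the problem (a sketch; diff_seconds is an
// illustrative name): compare both variables and express the rounding error
// of the float version in seconds.
gen double diff_seconds = (time - exact_time)/1000
summarize diff_seconds
// The double version matches the exact datetime value:
assert exact_time == tc(15apr2021 12:00:00)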




**********************************************
*** Strings can contain unicode characters ***

// Most string functions exist twice, once as "standard", once as "unicode" function
// The standard functions count bytes, the unicode functions count characters;
// characters beyond plain ASCII (such as ö) take more than one byte
display strlen("Hallöchen")
display ustrlen("Hallöchen")
/*
If you have strings containing unicode characters, such as country names, use 
the respective function. The unicode functions also work for the "normal" strings, 
so if you are unsure what your strings contain, use the unicode functions.
*/
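
// The same applies to other string functions, for example when extracting
// substrings (a sketch, assuming the do-file is saved in UTF-8):
display substr("Hallöchen", 1, 5)	// cuts after 5 bytes, i.e. in the middle of "ö"
display usubstr("Hallöchen", 1, 5)	// cuts after 5 characters
assert usubstr("Hallöchen", 1, 5) == "Hallö"
assert strlen("Hallöchen") == ustrlen("Hallöchen") + 1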




***********************************
*** For replication, set a seed ***

/*
If you create random variables, or use commands which are based on random draws, 
Stata uses an algorithm to draw the random numbers. The result of the draw depends
on the start value for the algorithm. If you run the same command again, you will
get different results. To make sure that you can replicate your code, set a seed 
directly before the random function or command.
*/


clear 
set obs 500000
gen standard_normal = rnormal()
gen standard_normal2 = rnormal()
br standard_normal*
// Both contain different values

set seed 31459176
gen standard_normal3 = rnormal()
// Again, different values, but now we can replicate them

set seed 31459176
gen standard_normal4 = rnormal()
br standard_normal*
// standard_normal3 and standard_normal4 are identical!

gen standard_normal5 = rnormal()
br standard_normal*
// standard_normal5 is again different: the random draw continues, as no new seed
// was set, i.e. no new start value was defined. But if you run the complete code
// again, you will see that standard_normal5 (and all other variables generated
// since the seed) will always look the same
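
// Instead of browsing, the comparison can also be checked in code (a sketch):
assert standard_normal3 == standard_normal4
// Count how many observations of standard_normal5 coincide with standard_normal3:
count if standard_normal3 == standard_normal5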