Variable Types
Binary variables
sysuse census.dta, clear
gen urban_share = popurban/pop
* Intuitive way to create indicator/binary/dummy variables
gen urban_majority = 1 if urban_share>0.5
replace urban_majority = 0 if urban_share<=0.50
tab urban_majority
* Shortcut
gen urban_majority_shortcut = urban_share>0.5
tab urban_majority urban_majority_shortcut
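/*
Why the shortcut works (a small sketch, not part of the original example):
Stata evaluates a logical expression to 1 if it is true and to 0 if it is false,
so the result of the comparison can be stored directly in a new variable.
*/
display 3>2 // displays 1
display 3<2 // displays 0
// Both versions should be identical; assert exits with an error if they are not
assert urban_majority == urban_majority_shortcut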
* Common mistake: Missings are treated as infinitely large
// generate missings for pedagogic purposes
replace urban_share = . if region==3
// You want to create a variable indicating a majority of urban population
gen urban_majority_wrong = urban_share>0.5
tab urban_majority_wrong
tab region urban_majority_wrong
/*
The variable should be missing for region 3 (South), but is coded as 1.
Stata treats missings as infinitely large, hence "urban_share>0.5" is true
for those with an urban share above 50%, and is also true for those with missing
values for urban share! To prevent this, use an if-expression or replace:
*/
gen urban_majority_right = (urban_share>0.5) if urban_share<.
// --> results in 0's, 1's and missings
// instead of "if urban_share<.", you can also use "if !missing(urban_share)"
// Check results
tab region urban_majority_right
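/*
The "replace" route mentioned above could look like the following sketch.
The variable name urban_majority_alt is made up for this illustration.
*/
gen urban_majority_alt = urban_share>0.5
replace urban_majority_alt = . if missing(urban_share)
// Both approaches should agree (missing equals missing in a comparison);
// assert exits with an error if they do not
assert urban_majority_alt == urban_majority_right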
* Turn categories into indicator/binary/dummy variables with tabulate
tab region, generate(region_dummy)
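/*
tabulate with generate() creates one dummy per category and names them by
appending 1, 2, ... to the stub. A quick way to inspect the result (a sketch;
region has four categories in census.dta, so four dummies are expected):
*/
describe region_dummy*
list region region_dummy* in 1/5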
* Other useful functions
gen state_list = inlist(state,"Alabama","Oklahoma") // abbrev. for (state=="Alabama" | state=="Oklahoma")
br state state_list
gen pop_range = inrange(pop,1e+6,2e+6) // abbrev. for (pop>=1e+6 & pop<=2e+6)
br pop pop_range
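gen pop_high = cond(missing(pop),.,pop>10e+6) // very flexible function
// Note: cond(x,a,b,c) returns c only if x itself evaluates to missing. A comparison
// such as pop>10e+6 never evaluates to missing, so cond(pop>10e+6,1,0,.) would code
// missing values of pop as 1. Testing missing(pop) explicitly avoids this.
// could specify any if-then outcome, e.g. cond(pop>10e+6,10,-10)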
br pop pop_high
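/*
cond() can also be nested to build more than two categories. A minimal sketch;
the variable name pop_size and the cut-offs are chosen for illustration only:
1 = below 1 million, 2 = 1 to 10 million, 3 = above 10 million, missing stays missing.
*/
gen pop_size = cond(missing(pop), ., cond(pop>10e+6, 3, cond(pop>=1e+6, 2, 1)))
tab pop_size, missing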
Exercise
Load the pre-installed dataset auto.
- Generate a variable which indicates whether a car had a repair record above three. For how many cars was this the case?
- Summarize the variable “price” in detail. Generate a variable which indicates whether the car’s price is above the median (50th percentile). Check your results.
- Generate a variable which indicates whether a car costs between 4,000 and 6,000. How many cars are within this price range?