Categorical variables & value labels
Categorical variables
sysuse census.dta, clear
* Variant 1: Own definition
// categorical variable for population size:
gen pop_cat = .
replace pop_cat = 1 if pop<2e+6
replace pop_cat = 2 if pop>=2e+6 & pop<4e+6
replace pop_cat = 3 if pop>=4e+6 & pop<.
label define lblPop 1 "Small" 2 "Medium" 3 "Large" // define a value label
label value pop_cat lblPop // attach value label to variable
codebook pop_cat
// a nicer alternative is the user-written command "fre"
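// A possible way to try this (a sketch; installing "fre" assumes internet
// access, because the package is downloaded from SSC):
// ssc install fre              // one-time installation, uncomment if needed
capture which fre               // sets _rc to 0 only if fre is installed
if _rc == 0 {
    fre pop_cat
}
else {
    tabulate pop_cat, missing   // built-in fallback
}
// quick plausibility check of the cut-offs used above
tabstat pop, by(pop_cat) statistics(min max n)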
* Variant 2: Group variable is string
gen region_str = ""
replace region_str = "North East" if region==1
replace region_str = "North Central" if region==2
replace region_str = "South" if region==3
replace region_str = "West" if region==4
* a) egen
egen region_no_2a = group(region_str), label
des region_no_2a
label list region_no_2a
* b) encode
encode region_str, gen(region_no_2b)
des region_no_2b
label list region_no_2b
codebook region_no*
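// Both egen group() and encode number the categories in sort order, which for
// strings is alphabetical ("North Central" = 1, "North East" = 2, ...), not the
// order of region. A sketch of one way to keep the original order: define the
// value label first and pass it to encode. The names lblRegion and
// region_no_2c are made up for this illustration.
label define lblRegion 1 "North East" 2 "North Central" 3 "South" 4 "West"
encode region_str, gen(region_no_2c) label(lblRegion)
tabulate region region_no_2c
assert region_no_2c == region   // codes now match the original variable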
* Variant 3: Combination of variables indicates groups
tab region, gen(reg_)
egen region_no_3 = group(reg_1-reg_4), label
// the group numbers follow the sort order of reg_1-reg_4, so they are not in the same order as region
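// A quick way to check which group number belongs to which region (a sketch):
tabulate region region_no_3
label list region_no_3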
* Variant 4: Group variable is numeric
su pop, detail
egen coded_pop_1 = cut(pop), at(0,2e+6,4e+6,3e+7) icodes
// cut() bins a continuous variable: each interval includes its left boundary but not its right one; icodes numbers the bins 0, 1, 2, ...
recode pop (0/2e+6=0 Small) (2e+6/4e+6=1 Medium) (4e+6/max=2 Large), gen(coded_pop_2)
// similar, but recode is actually intended for categorical variables
/*
What's the difference? "recode" is documented as a command for recoding
categorical variables, rather than for generating a categorical variable from a
continuous one. Still, it is quite useful for this purpose, especially because
it can define value labels directly, understands "min" and "max", and can set
missing codes (see below). However, it requires more care in some circumstances:
the user needs to make sure that the categories neither overlap nor leave gaps
(the option "test" helps with this). Also, a "recode" range includes both of its
end points, i.e., 0/2e+6 covers 0 and 2e+6, while "cut" includes only the left
end point, i.e., >=0 & <2e+6. So what happens to the second 2e+6 in recode? It is
effectively ignored: the first matching rule wins, so a value of exactly 2e+6 is
coded by the rule 0/2e+6. In other words, "recode" and "cut" do not generate
the same variable if there are observations exactly at a boundary.
*/
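// The boundary behaviour can be seen on a toy variable. This is only a sketch
// with made-up data; x, x_cut and x_rec are hypothetical names, and
// preserve/restore keeps the census data in memory untouched.
preserve
clear
input x
0
1
2
3
4
5
end
egen x_cut = cut(x), at(0,2,4,6) icodes
recode x (0/2=0) (2/4=1) (4/max=2), gen(x_rec)
list, noobs
// the two variables should agree everywhere except at the boundaries 2 and 4
assert x_cut == x_rec if !inlist(x, 2, 4)
restore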
* More about labels
help label
help label language // for multiple languages
help labelbook // for an overview of labels
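// A few of these commands applied to the value label defined in Variant 1
// (a sketch):
label dir            // names of all value labels in memory
label list lblPop    // values and labels of one value label
labelbook lblPop     // more detailed summary of that label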
* Recode & missings
codebook region
recode region (1=2 NE) (2=1 "N Cntrl"), gen(region_new)
// useful for swapping the values of categories or for combining categories
codebook region_new
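// Combining categories works the same way. A sketch that merges the two
// northern regions (region_3cat is a name chosen for this illustration):
recode region (1 2 = 1 "North") (3 = 2 "South") (4 = 3 "West"), gen(region_3cat)
tabulate region region_3cat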
replace pop = 997 in 1/25 if region==3
replace pop = 999 in 26/50 if region==3
replace pop = 888 if region==4
mvdecode pop, mv(997 999=.d \ 888=.r)
codebook pop
br pop
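// Extended missing values (.a to .z) are all larger than any number and than
// ".", which is why a condition such as "pop<." excludes every kind of missing.
// A quick check (sketch):
count if pop == .d
count if pop == .r
count if missing(pop)
assert missing(pop) == (pop >= .)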
mvencode pop, mv(.r=888)
mvencode pop in 1/25, mv(.d=997)
mvencode pop in 26/50, mv(.d=999)
// mvencode is a special case of recode, but with built-in protection: it stops with an error if the target value already exists in the variable
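// This protection can be seen on made-up data (a sketch; y is a hypothetical
// variable, and preserve/restore keeps the census data untouched):
preserve
clear
input y
1
888
.
end
capture noisily mvencode y, mv(888)   // refused, because 888 is already used in y
display _rc                           // nonzero return code after the refusal
mvencode y, mv(888) override          // override forces the change
list, noobs
restore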
Exercise
Load the pre-installed dataset auto.
1. Generate a variable that groups the cars into those with a weight below 2000, those with a weight of 2000 or above but below 4000, and those with a weight of 4000 or above. Label the values as “light”, “medium”, and “heavy”. How many cars fall into each category?
2. Generate a variable that divides the observations into groups based on the combination of the variable “foreign” and the variable generated in (1). Check your result by summarizing the new variable across the variable “foreign” and the variable from (1) (use tabulate, summarize).
3. Create value labels describing the combinations of values for the variable from (2), e.g. “Low weight, domestic”, and attach the labels to the variable.