String variables

Note

If you do not work with string variables or strings in general, you can skip this subchapter. See the PDF from Coder’s Corner on how to use string functions to clean and match country names. Asjad Naqvi has written a detailed guide to string functions and regular expressions on Medium (there is no stable link to the article; search for "Regular expressions (regex) in Stata"). He also provides a cheat sheet for regular expressions, which you can find below.


sysuse census.dta, clear

* Generate string variables through text input
gen country = "USA"
gen country2 = "US" + "A"
br country*

* Generate string variables from other string variables
gen code = country + state2
br country* code
replace code = country + " " + state2	// also see egen --> concat
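* Sketch (not in the original): the same concatenation with egen's concat() function.
* code2 is a hypothetical variable name; punct() sets the separator between the parts.
egen code2 = concat(country state2), punct(" ")
br code code2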

help string functions
help egen

* Identify observations based on string information
tab state if regexm(state,"New")==1
gen name_length = strlen(state)
tab state if name_length>=10
tab state if strlen(state)>=10
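* Sketch (not in the original): two variations on the regexm() call above
tab state if strpos(state,"New")>0		// plain substring search, no regular expression needed
tab state if regexm(state,"^New")		// ^ anchors the match at the start of the string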

	* Unicode
	// Most string functions exist twice: once as a "standard" and once as a "Unicode" function
	// Non-ASCII characters such as ö occupy more than one byte in Stata's UTF-8 encoding:
	// strlen() counts bytes, ustrlen() counts characters
	display strlen("Hallöchen")		// 10 (ö takes two bytes)
	display ustrlen("Hallöchen")	// 9
	help unicode
	/*
	If your strings contain non-ASCII characters (country names often do), use
	the corresponding Unicode function. The Unicode functions also work for plain
	ASCII strings, so if you are unsure what your strings contain, use the Unicode functions.
	*/
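	// Sketch (not in the original; assumes Stata 14 or newer and a UTF-8 do-file):
	// the byte/character difference also affects other function pairs
	display substr("Hallöchen", 1, 5)		// byte-based: the first 5 bytes end in the middle of ö
	display usubstr("Hallöchen", 1, 5)		// character-based: Hallö
	display strupper("Hallöchen")			// converts ASCII letters only: HALLöCHEN
	display ustrupper("Hallöchen")			// converts ö as well: HALLÖCHEN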

* Modify string variables
gen state_rep = regexr(state,"New","Old")
br state state_rep
replace state = strupper(state)
di stritrim("Very useful to delete     inner blanks")
di strupper(stritrim("Functions can    be nested"))
split state, gen(state_)
br state*

* Obtain string from numeric variables (and vice versa)
gen year = 1980
gen year_str1 = string(year)			// function
tostring year, gen(year_str2)			// more options (see help-file)
br year*
gen year_num1 = real(year_str1)			// function
destring year_str2, gen(year_num2)		// more options (see help-file)
br year*
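* Sketch (not in the original): amount_str, amount_num and year_str3 are hypothetical names
gen amount_str = "1,980"
destring amount_str, gen(amount_num) ignore(",")	// ignore() drops the comma before converting
gen year_str3 = string(year, "%6.1f")				// string() with a display format: 1980.0
br amount* year_str3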

* Obtain string from numeric variable with value labels (and vice versa)
decode region, gen(region_str)
br region*
replace region_str = "North" if regexm(region_str,"N")
encode region_str, gen(region_new)

Exercise

Load the pre-installed dataset auto.

  1. The variable “make” consists of the car brand and the car model. Use a string function to display all values of “make” of the brand “Merc.” (Mercedes).
  2. Use a string function to create a new variable containing the brand of each car model (hint: you need a function which gives you the first word of “make”). How many cars of each brand are in the dataset?
  3. Replace the new variable with “VW/Audi” if the brand is “Audi” or “VW”.
  4. Generate a new numeric variable which contains a different value for each brand (i.e., a categorical variable) and has appropriate value labels.