Merge data

Another way of combining data is to add additional variables. For this, the dimensions of the datasets matter:
merge

Merge & joinby

Datasets

household_members_all.dta
household_members_north.dta
household_members_south.dta
households.dta

			// We have another household dataset, "households", with additional information

use "households.dta", clear 

describe
summarize
browse
// 100 observations, one observation per household

/*
Let's assume we know that the household id is the same for both datasets.
We would like to add information (variables) from "households" dataset to the 
member dataset
--> merge
Which variable(s) can we use to combine the data?
	hhid
Are the datasets long or wide with respect to hhid?
	households: 	one observation per household --> wide
	member: 	several observations per household --> long
Which is the dataset currently used, which one needs to be loaded?
	households: 	currently in use --> master
	member: 	needs to be loaded --> using
*/

merge 1:m hhid using "household_members_all.dta"

/*
Not all observations could be matched: there are some which only occur in the
"master" data (_merge==1) and some which only occur in the "using" data (_merge==2).
*/

browse if _merge==1
// Households with an ID above 100 are only part of the household data 
// --> would need to check why this is the case, whether data in the membership dataset is missing, etc.
// Let's assume it is clear why they are missing, and you decide you don't need them
drop if _merge==1

browse if _merge==2
su hhid member age employed hourly_wage if _merge==2
su hhid member age employed hourly_wage if _merge==3
// No clear pattern, would need to run further checks

// Further potential check: hhsize variable 
bysort hhid: gen hhsize_test=_N
count if hhsize!=hhsize_test
count if hhsize!=hhsize_test & _merge==3
// All discrepancies come from observations which couldn't been matched

// Preferred alternative: Use assert (very useful for automated checks)
assert hhsize==hhsize_test if _merge==3



// The other way round: member as "master" data set, household as "using"
use "household_members_all.dta", clear
// Now, this is a m:1 merge (many observations per household in member data)
merge m:1 hhid using "households.dta"

/*
Main difference: Now, all value definitions etc. are taken from the member data.
If both datasets contain the same variable, the observations from the master 
data will be taken for all matched cases (unless "update" or "replace" is specified). 
Unmatched cases will contain observations from the respective datasets
--> very dangerous if variable definitions differ.
*/



*** Important things to consider ***

/*
1. Always make sure that variables in both datasets have the same units/categories/definitions
2. For labels & variables with the same name, append/merge always use the definition/observations of the master file (unless specified differently)
3. Always explore whether appending/merging was correctly done
4. Never use a m:m merge. Use joinby instead.
*/