Data manipulation is an important part of data
analysis, and doing it carefully helps ensure the accuracy of your results. Stata is one of the
best packages available for data manipulation: you can
choose between command-based and menu-based options to manipulate a data set, and you
can pick from a wide range of commands to manipulate data according to
your specific needs. Below is a list of common data manipulation commands
you should know by now:
generate, replace, recode, clonevar, label, note, drop, keep, encode
In this tip, I want to introduce two more commands that are
useful for data manipulation: mvencode and mvdecode.
Description: mvencode changes missing
values to a specified numeric value in any number of variables in your data
set; mvdecode does the opposite, and changes
specified numeric values in any number of variables to missing values.
Let’s learn mvencode and mvdecode in context,
using examples.
NOTE: I am using The Asia Foundation’s A
Survey of the Afghan People (2014) data in my examples. If you do not have this data,
you can download it from the following link:
Examples:
Suppose you want to get rid of “don’t
know” responses in your data set, which are coded 99. Why
would you want to do that? Perhaps you want to run a regression, a correlation, or a t test. The results
will be biased if the “don’t know” responses stay in the data as the number 99, because Stata
treats 99 as a real point on the response scale even though it is meaningless for your analysis.
There is a question (Q57) in A
Survey of the Afghan People (2014) asking how much sympathy the respondent has
for armed opposition groups. The responses to the question include: “a
lot of sympathy,” “some sympathy,” and “no
sympathy at all.” But some survey respondents refused to answer this
question, or responded “don’t know.” As you see
below, the responses “refused” and “don’t know” are coded 98 and 99 respectively, and by
including these responses in our analysis, we would bias our results.
So in this case, and in other similar cases in which you
need to get rid of all 99s, or any other value, a number of commands can be
used: replace, recode, and mvdecode. In the
following, I explain how to use each of these commands for our purpose.
Generally, replace is used to
change the contents of a variable. For instance, to change every 99 in your
variable q57 to a missing value, use replace in the following manner:
replace q57=. if q57==99
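As a minimal sketch of the same idea, the lines below build a small made-up data set (the values are illustrative, not taken from the survey) and use replace with inlist() to set both 98 and 99 to missing in one step. It is only one way to write the condition.

```stata
* Toy data standing in for q57 (values are made up for illustration)
clear
input q57
1
2
3
98
99
end

* Set both "refused" (98) and "don't know" (99) to system missing
replace q57 = . if inlist(q57, 98, 99)

* Count how many observations are now missing
count if missing(q57)
```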
recode is a very
powerful command for data manipulation (see Stata tips #3 and #16 for
more detailed information). recode changes the values of numeric
variables according to rules that you specify. Beyond what replace does, recode allows for more
convenient data manipulation, and can create value labels.
To change every 99 to a
missing value in variable q57 using recode, type the following:
recode q57 (99=.)
This command leaves unchanged any value that does not match
one of the rules, i.e. here any value except 99.
In other words, values 1, 2, 3 and 98 will be left unchanged.
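The sketch below illustrates recode writing its result to a new variable and attaching value labels in the same step. The data are made up, the name q57_clean is hypothetical, and the mapping of 1, 2 and 3 to the three sympathy categories is an assumption for illustration only; check the codebook before using it on the real data.

```stata
* Toy data standing in for q57 (values are made up for illustration)
clear
input q57
1
2
3
98
99
end

* Recode into a new variable, labeling the kept values (label mapping is assumed)
recode q57 (1=1 "A lot") (2=2 "Some") (3=3 "None") (98 99=.), generate(q57_clean)

* Compare the original and recoded values side by side
tabulate q57 q57_clean, missing
```

Using generate() leaves the original q57 untouched, which makes it easy to check the result before dropping anything.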
Another command that you could use to change numeric
values (e.g. 99) to missing is mvdecode. While replace and recode are general
purpose commands with their own merits and shortcomings, mvdecode is a command
specially designed for changing values to missing. In other words, the only
function of mvdecode is to change numeric values to
missing values.
To change a numeric value to a missing value using mvdecode,
type the command followed by a variable list, then a comma and the mv() option
with a list of numbers inside the parentheses:
mvdecode q57, mv(99)
This command changes every 99 in variable
q57 to a missing value.
If you want to change more values, for
instance 98 too, add them inside the parentheses next to 99. If you
want to apply the change to more than one variable,
type more variable names after mvdecode, next to q57.
And if you want to change the 98 and 99 values of all variables to missing, you can
use the shortcut _all, which means all variables. Both cases are shown here:
mvdecode q1 q57, mv(98 99)
This command changes the values 98 and 99 in variables q1 and q57
to missing values.
mvdecode _all, mv(98 99)
This command changes the values 98 and 99 to missing
values in all variables (_all) of your data set.
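If you would rather keep “refused” and “don’t know” distinguishable after the change, mvdecode also accepts rules of the form number=missing-value-code inside mv(), separated by a backslash. The following is a small sketch on made-up data; the choice of .a for 98 and .b for 99 is an arbitrary assumption.

```stata
* Toy data standing in for q1 and q57 (values are made up for illustration)
clear
input q1 q57
1 1
2 98
98 99
99 3
end

* Send 98 to .a and 99 to .b so the two reasons stay distinguishable
mvdecode q1 q57, mv(98=.a \ 99=.b)

* .a and .b are both treated as missing, but can still be told apart
count if q57 == .a
```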
Now that you are able to change numeric values to
missing values using replace, recode and mvdecode, how about
changing missing values to a numeric value?
As you might have already guessed, the commands replace and recode can change
missing values to a numeric value too. The procedure is similar to what we
did earlier when changing every 99 in variable q57 to a missing value. Now let’s
do the opposite, and change the missing values to 99 using replace and recode:
replace q57=99 if q57==.
recode q57 (.=99)
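One detail worth keeping in mind: a condition written as q57==. matches only the system missing value (.), not the extended missing values .a through .z. The sketch below, on made-up data, uses the missing() function to catch all of them; whether you want that depends on how the missing values in your data were created.

```stata
* Toy data with one system missing (.) and one extended missing (.a)
clear
input q57
1
.
.a
3
end

* q57 == . matches only the system missing value, so .a is not counted here
count if q57 == .

* missing() matches ., .a, ..., .z, so both observations are changed
replace q57 = 99 if missing(q57)
count if q57 == 99
```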
To change missing values to a numeric value, as
I just did, there is a special command called mvencode. Notice the similarity
between mvencode and mvdecode? They do exactly the opposite of each
other, i.e. mvdecode changes numeric values to missing
values, while mvencode changes missing values to numeric
values, making them inverse twins.
The syntax of mvencode is very
similar to that of mvdecode too, but it works in the opposite direction:
mvencode q57, mv(99)
The above command changes missing values to 99.
You can also use more than one variable with mvencode. With the plain mv(#) form
shown above you give a single number, such as 99, and every missing value is converted to it;
to send different missing-value codes to different numbers, mv() also accepts rules of the form mvc=#. You
can also use _all if you want the operation to apply to all variables
in your data set.
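As a rough sketch of the rule form of mv(), the lines below map two extended missing values back to separate numbers on made-up data. The pairing of .a with 98 and .b with 99 is an assumption chosen for illustration.

```stata
* Toy data with two different extended missing values
clear
input q57
1
.a
.b
3
end

* Map each missing-value code to its own number
mvencode q57, mv(.a=98 \ .b=99)

* Check the result
count if q57 == 98
```

Note that mvencode refuses to make the change if the target number is already used in the variable, which protects you from mixing real 99s with former missing values; the override option turns this protection off.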
A general tip: as you have noticed,
you can usually reach your goal in Stata with several different commands. But that
does not mean you should learn only one of them. Each command or
method has its own merits and shortcomings, and learning all of these commands
will help you perform various tasks and save you time.