Tuesday, July 28, 2015

Stata tip #35: Changing missing values to numeric values and vice versa

Data manipulation is an important part of data analysis, and doing it carefully helps ensure the accuracy of your results. Stata is one of the best packages available for data manipulation: it not only lets you choose between command-based and menu-based options for manipulating a data set, it also offers a wide range of commands to suit your specific needs. Below is a list of common data manipulation commands you should know by now:
generate, replace, recode, clonevar, label, note, drop, keep, encode

In this tip, I want to introduce two more commands that are useful for data manipulation: mvencode and mvdecode.

Description: mvencode changes missing values to a specified numeric value in any number of variables in your data set; mvdecode does the opposite, changing specified numeric values in any number of variables to missing values.
Let’s learn mvencode and mvdecode in context, using examples.

NOTE: I am using The Asia Foundation’s A Survey of the Afghan People (2014) data in my examples. If you do not have this data, you can download it from the following link:

Examples:
Suppose you want to get rid of “don’t know” responses in your data set, which are coded 99. Why would you want to do that? Perhaps you want to run a regression, a correlation, or a t test. The results will be biased if the “don’t know” responses stay in as the number 99, because 99 is only a code, and treating it as a real value distorts means and variances.

There is a question (Q57) in A Survey of the Afghan People (2014) asking how much sympathy the respondent has with armed opposition groups. The responses to the question include: “a lot of sympathy,” “some sympathy,” and “no sympathy at all.” But some survey respondents refused to answer this question, or responded “don’t know.” As you see below, the responses “refused” and “don’t know” are coded 98 and 99 respectively, and by including these responses in our analysis, we bias our results.
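To check which numeric codes sit behind the labels, tabulate can be run with and without the nolabel option. The sketch below builds a small toy stand-in for q57; the observations, the label name q57lbl, and the assignment of codes 1 to 3 to the three answers are my assumptions for illustration, not the survey's actual coding.

```stata
* Toy stand-in for q57 (made-up observations, not the survey data)
clear
input q57
1
2
3
98
99
3
end

* Hypothetical value labels; the real codebook may order them differently
label define q57lbl 1 "A lot of sympathy" 2 "Some sympathy" 3 "No sympathy at all" 98 "Refused" 99 "Don't know"
label values q57 q57lbl

* Frequencies shown with the value labels
tabulate q57

* Same table, but showing the underlying numeric codes
tabulate q57, nolabel
```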

So in this case, and in other similar cases in which you need to get rid of all 99s or any other value, a number of commands can be used: replace, recode, and mvdecode. In the following, I explain how to use each of these commands for our purpose.
Generally, replace is used to change the contents of a variable. For instance, to change every 99 in your variable q57 to a missing value, use replace in the following manner:
replace q57=. if q57==99
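A minimal sketch of why this matters, using four made-up observations rather than the survey data: the mean of q57 changes sharply once the 99 is set to missing.

```stata
* Toy data: three real answers and one "don't know" coded 99
clear
input q57
1
2
3
99
end

* Mean with 99 treated as a real answer: (1 + 2 + 3 + 99) / 4 = 26.25
summarize q57

* Set the "don't know" code to missing
replace q57 = . if q57 == 99

* Mean over the three remaining answers: (1 + 2 + 3) / 3 = 2
summarize q57
```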

recode is a very powerful command for data manipulation (see Stata tips #3 and #16 for more detailed information). recode changes the values of numeric variables according to rules that you specify. In addition to doing what replace does, recode allows for more convenient data manipulation and can create value labels.
Using recode, to change every 99 in variable q57 to a missing value, you do the following:
recode q57 (99=.)
This command leaves unchanged any value that does not match one of the rules, i.e. here any value except 99. In other words, values 1, 2, 3 and 98 are left unchanged.
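If you would rather keep the original variable intact, recode accepts a generate() option that writes the result to a new variable, and one rule can list several values. A small sketch on toy data, where q57_clean is a hypothetical variable name:

```stata
* Toy data standing in for q57
clear
input q57
1
2
3
98
99
end

* Send both 98 and 99 to missing, but store the result in a new variable
* (q57_clean is a made-up name) so that q57 itself is not modified
recode q57 (98 99 = .), generate(q57_clean)

* Compare the original and the recoded copy side by side
list q57 q57_clean
```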

Another command that you could use to change numeric values (e.g. 99) to missing is mvdecode. While replace and recode are general purpose commands with their own merits and shortcomings, mvdecode is a command specially designed for changing values to missing. In other words, the only function of mvdecode is to change numeric values to missing values.

To change a numeric value to a missing value using mvdecode, type the command followed by the variable list, then a comma and the mv() option with the list of numbers in parentheses:
mvdecode q57, mv(99)
This command changes every 99 in variable q57 to a missing value.

If you want to change more values, for instance 98 too, add them inside the parentheses beside 99. If you want to apply the change to more than one variable, type more variable names after mvdecode, next to q57. And if you want to change the 98 and 99 values of all variables to missing, you can use the shortcut _all, which means all variables:
mvdecode q1 q57, mv(98 99)
This command changes values 98 and 99 in variables q1 and q57 to missing values.
mvdecode _all, mv(98 99)
This command changes values 98 and 99 to missing values in all variables (_all) of your data set.
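mvdecode can also send each code to its own extended missing value (.a to .z), which keeps “refused” and “don’t know” distinguishable while still excluding both from analysis. A sketch on toy data; the choice of .a for 98 and .b for 99 is arbitrary:

```stata
* Toy data for two variables that use 98 and 99 as special codes
clear
input q1 q57
1 2
98 3
2 99
99 98
end

* Map 98 to .a and 99 to .b in both variables
mvdecode q1 q57, mv(98=.a \ 99=.b)

* Extended missing values still count as missing; the missing option
* makes tabulate show them as separate rows
tabulate q57, missing
```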

Now that you are able to change numeric values to missing values using replace, recode and mvdecode, how about changing missing values to a numeric value?
As you might have already guessed, replace and recode can change missing values to a numeric value too. The procedure is similar to what we did earlier when changing the 99s in variable q57 to missing values. Now let’s do the opposite, and change the missing values to 99 using either replace or recode:
replace q57=99 if q57==.
recode q57 (.=99)
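One caveat: a condition such as q57==. and a rule such as (.=99) match only the system missing value. If a variable also holds extended missing values such as .a, the missing() function and recode's missing keyword catch them all. A sketch on toy data, where q57_r and q57_m are hypothetical copies made only for comparison:

```stata
* Toy data with one system missing (.) and one extended missing (.a)
clear
input q57
1
.
.a
3
end

* Copies of q57 (made-up names) so the three approaches can be compared
generate q57_r = q57
generate q57_m = q57

* Matches only the system missing value, so .a is left alone
replace q57_r = 99 if q57_r == .

* missing() is true for . and for .a through .z
replace q57 = 99 if missing(q57)

* recode's missing keyword also covers every kind of missing value
recode q57_m (missing = 99)

list
```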

To change missing values to a numeric value, as I just did, there is a special command called mvencode. Notice the similarity between mvencode and mvdecode? They do exactly the opposite of each other, i.e. mvdecode changes numeric values to missing values, while mvencode changes missing values to numeric values, making them inverse twins.

The syntax of mvencode is very similar to that of mvdecode, but the command does exactly the opposite:
mvencode q57, mv(99)
The above command changes the missing values in q57 to 99.

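You can also use more than one variable with mvencode. In its simplest form, mv() takes a single number, such as 99, to which all missing values are converted; if your data use extended missing values (.a, .b, and so on), each of them can be mapped to its own number. You can also use _all if you want the operation to apply to all variables in your data set.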
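Two details of mvencode are worth sketching on toy data. First, as I understand the documentation, mvencode refuses to make the change when the target number is already used in the variable, unless you add the override option. Second, extended missing values can each be given their own code. The data and the particular mappings below are made up for illustration.

```stata
* Toy data: q57 holds extended missing values, q1 already contains a 99
clear
input q57 q1
1 1
.a 99
.b .
3 2
end

* 99 is already used in q1, so mvencode should refuse and leave q1 alone;
* capture noisily shows the message but lets a do-file keep running
capture noisily mvencode q1, mv(99)

* override switches that protection off
mvencode q1, mv(99) override

* Give each extended missing value its own numeric code
mvencode q57, mv(.a=98 \ .b=99)

list
```

Use override with care: after the change, you can no longer tell the original 99s from the former missing values.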


A general tip: as you may have noticed, you can usually reach the same goal in Stata with several different commands. That does not mean you should learn only one of them. Each command or method has its own merits and shortcomings, and learning all of them will help you perform various tasks and save you time.

3 comments:

  1. Thank you so so so much you saved me big time

    1. Glad to be of help. I should probably continue doing this.

  2. This is very helpful Masood. I learnt a lot. Please continue.
