Tuesday, July 28, 2015

Stata tip #35: Changing missing values to numeric values and vice versa

Data manipulation is an important part of data analysis, because clean and correctly coded data underpin accurate results. Stata is one of the best packages available for data manipulation: it lets you choose between command-based and menu-based options to manipulate a data set, and it offers a wide range of commands to suit your specific needs. Below is a list of common commands for data manipulation you should know by now:
generate, replace, recode, clonevar, label, note, drop, keep, encode

In this tip, I want to introduce to you two new commands useful for data manipulation. These two commands are mvencode and mvdecode.

Description: mvencode changes missing values to a specified numeric value for any number of variables in your data set; mvdecode does the opposite, and changes specified numeric values to missing values for any number of variables.
Let’s learn mvencode and mvdecode in context, using examples.

NOTE: I am using The Asia Foundation’s A Survey of the Afghan People (2014) data in my examples. If you do not have this data, you can download it from the link below:

Examples:
Suppose you want to get rid of “don’t know” responses in your data set, which are coded 99. Why would you want to do that? Perhaps you want to run a regression, a correlation, or a ttest, whose results will be biased if “don’t know” responses stay in the data as the number 99, since that code is not a point on the response scale and its contribution to the variance is meaningless for your analysis.

There is a question (Q57) in A Survey of the Afghan People (2014) asking how much sympathy the respondent has with armed opposition groups. The responses to the question include: “a lot of sympathy,” “some sympathy,” and “no sympathy at all.” But some survey respondents refused to answer this question, or responded “don’t know.” Responses “refused” and “don’t know” are coded 98 and 99 respectively, and by including these responses in our analysis, we bias our results.

So in this case, and in other similar cases in which you need to get rid of all 99s or any other value, a number of commands can be used: replace, recode, and mvdecode. In the following, I explain how to use each of these commands for our purpose.
Generally, replace is used to change the contents of a variable. For instance, to change all 99s in your variable q57 to missing values, replace does it in the following manner:
replace q57=. if q57==99

recode is a very powerful command for data manipulation (see Stata tips #3 and #16 for more detailed information). recode changes the values of numeric variables according to rules that you specify. In addition to what replace does, recode allows for more convenient data manipulation, and it can create value labels.
To change all 99s to missing values in variable q57 using recode, you do the following:
recode q57 (99=.)
this command leaves unchanged any value that matches none of the rules, i.e. here any value except 99. In other words, values 1, 2, 3 and 98 will be left unchanged.
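
If you would rather keep the original variable intact, recode also has a generate() option that writes the result to a new variable. The lines below are only a sketch on a small made-up data set; the name q57_clean is hypothetical and the values are not taken from the survey:

```stata
* Toy data standing in for q57; the values are made up for illustration
clear
input q57
1
2
99
98
3
end

* Leave q57 as it is and store the recoded copy in q57_clean (hypothetical name)
recode q57 (99 = .), generate(q57_clean)

* Compare the original and the recoded copy side by side
list q57 q57_clean
```

Working on a copy makes it easy to check what was changed before you drop the original.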

Another command that you could use to change numeric values (e.g. 99) to missing is mvdecode. While replace and recode are general purpose commands with their own merits and shortcomings, mvdecode is a command specially designed for changing values to missing. In other words, the only function of mvdecode is to change numeric values to missing values.

To change a numeric value to missing value using mvdecode, you have to type the command followed by variable list, and then followed by a comma and list of numbers in parentheses:
mvdecode q57, mv(99)
this command changes all value 99s in variable q57 to missing values.

If you want to change more values, for instance 98 too, you need to add them inside the parentheses beside 99. If you want to apply the change to more than one variable, you can type more variable names after mvdecode, next to q57. And if you want to change values 99 and 98 to missing in all variables, you can use the shortcut _all, which means all variables:
mvdecode q1 q57, mv(98 99)
this command changes values 98 and 99 to missing values in variables q1 and q57.
mvdecode _all, mv(98 99)
this command changes values 98 and 99 to missing values in all variables (_all) of your data set.
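
If you want “refused” and “don’t know” to stay distinguishable after they become missing, mvdecode can map each code to its own extended missing value (.a, .b, and so on). A minimal sketch, assuming a small made-up data set rather than the survey file:

```stata
* Toy data standing in for q1 and q57; the values are made up for illustration
clear
input q1 q57
1 2
98 99
3 98
2 1
99 3
end

* Map 98 to .a and 99 to .b so the two reasons for missingness stay apart
mvdecode q1 q57, mv(98=.a \ 99=.b)

* Both codes are now treated as missing by Stata
count if missing(q57)
```

The choice of .a and .b is arbitrary; any of the extended missing values .a through .z can be used.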

Now that you are able to change numeric values to missing values using replace, recode and mvdecode, how about changing missing values to a numeric value?
As you might have already guessed, commands replace and recode can change missing values to a numeric value too. The procedure is similar to what we did earlier when changing the 99s in variable q57 to missing values. Now let’s do the opposite, and change the missing values to value 99 using replace and recode:
replace q57=99 if q57==.
recode q57 (.=99)
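
One caveat: the condition q57==. matches only the system missing value “.”. If a variable may also hold extended missing values such as .a, the missing() function catches them all. A short sketch on made-up data:

```stata
* Toy data with one system missing value (.) and one extended missing value (.a)
clear
input q57
1
.
.a
3
end

* missing() is true for . and for .a through .z, so both rows are changed
replace q57 = 99 if missing(q57)

list q57
```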

To change missing values to a numeric value, like what I just did, there is a special command called mvencode. Notice the similarity between mvencode and mvdecode? They do exactly the opposite of each other: mvdecode changes numeric values to missing values, while mvencode changes missing values to numeric values, making them inverse twins.

The syntax for mvencode is extremely similar to mvdecode too, but does exactly the opposite of it:
mvencode q57, mv(99)
the above command changes missing values to 99.

You can also use more than one variable with mvencode, but in this simple form you can give only one numeric value, such as 99, to convert missing values to. You can also use _all if you want the operation to apply to all variables in your data set.
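
Here is a small sketch of mvencode on made-up data, followed by two counts to check the result. Note that mvencode is documented to refuse the change when the target number already appears in the variable, unless option override is specified; the toy data avoid that situation:

```stata
* Toy data standing in for q57; the values are made up for illustration
clear
input q57
1
.
3
.
2
end

* Turn every missing value in q57 into 99
mvencode q57, mv(99)

* Check: no missing values remain, and two observations now hold 99
count if missing(q57)
count if q57 == 99
```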


A general tip: as you have noticed, you can usually reach your goal in Stata with several different commands. That does not mean you should learn only one of them. Each command or method has its own merits and shortcomings, and learning all of them will help you perform various tasks and save you time.

Sunday, July 26, 2015

Stata tip #34: tab1

Command tabulate, or tab, is perhaps the first and one of the most useful commands for any Stata user. When used with one variable, tab displays a table of frequencies and percentages for that variable. When used with two variables, it produces a two-way table of frequencies. You can add percentages to the two-way table by requesting column and row percentages.

Recently I came across a command called tab1, an offspring of command tabulate, which creates tables of frequencies and percentages for more than one variable in one command. In other words, if you would otherwise run tabulate several times and type as many lines of code as there are variables, tab1 saves time by doing the same with just one line of code.

You could also think of tab1 as a loop – see Stata tip #33 for loops – which repeats command tabulate for a list of variables.

Let’s see some examples now.

NOTE: I am using The Asia Foundation’s A Survey of the Afghan People (2014) data in my examples. If you do not have this data, you can download it from the link below:

Examples:
Suppose you want to see the frequency distribution (table of frequencies) of questions q6a, q6b, q6c, q6d, q6e and q6f. The long way is to type six commands for the six variables:
tab q6a
tab q6b
tab q6c
tab q6d
tab q6e
tab q6f
Another way of running the above 6 commands is to use a loop with command foreach (see Stata tip #33):
foreach x in q6a q6b q6c q6d q6e q6f {
    tab `x'
}


A shorter and more straightforward command for doing tab multiple times is to use tab1. For tab1, all you have to do is list all the variables you want after tab1, like below:
tab1 q6a q6b q6c q6d q6e q6f






Although loops are usually the most convenient way of performing a single command over a number of variables, tab1 is the easiest way of obtaining tables of frequencies and percentages, i.e. looping tab, for a number of variables.
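
Because tab1 takes a variable list, the usual varlist shorthand should work as well, so the names need not be typed out one by one when the variables sit next to each other in the data set. A minimal sketch on made-up data (three hypothetical q6 items, not the survey values):

```stata
* Toy data with three adjacent variables; the values are made up for illustration
clear
input q6a q6b q6c
1 2 1
2 2 3
1 1 3
3 2 1
end

* q6a-q6c means every variable from q6a through q6c in data set order
tab1 q6a-q6c
```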

There are some options available with command tab1, as well as with tab. The most common are missing, nolabel, plot and sort. As a general rule in Stata, options come after a comma.

Option missing requests that missing values, if any, be treated like other values in counts, percentages and other statistics (see Stata tip #5 for the composite command tab, sum( )). The command with missing would look like:
tab1 q6a q6b q6c q6d q6e q6f, missing

Option nolabel causes the numeric codes to be displayed rather than the value labels:
tab1 q6a q6b q6c q6d q6e q6f, nolabel

Option plot adds a simple bar chart of relative frequencies, while omitting the percentage and cumulative columns, to the command tab1:
tab1 q6a q6b q6c q6d q6e q6f, plot

And finally, option sort puts the tables in descending order of frequency/percentage.
Note that there is no limit to the number of options you can use with one command, so you can combine several options at once.
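
As an illustration of combining options, the sketch below asks for missing values to be counted and for each table to be sorted by frequency in the same command. The data are made up; only the variable names follow the survey:

```stata
* Toy data with a missing value in each variable; values are made up
clear
input q6a q6b
1 2
2 2
1 .
. 2
1 1
end

* missing: count missing values as a category; sort: most frequent value first
tab1 q6a q6b, missing sort
```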

Thursday, July 23, 2015

Stata tip #33: foreach Loops


We often need to execute a command or perform a similar action repeatedly for a large number of variables. For example, we might want to see the table of frequencies for 10 variables in our data set, or we might want to change the values 98 and 99 to missing for all variables. One way to do so is to type command tab 10 times for the first example, and to type recode once for each variable (however many there are) for the second example. An easier way is to use a “loop” and let Stata create those 10 frequency tables, or recode all 98s and 99s to missing.

A loop is used to execute an action or a command repeatedly for many variables (as many as you like!) at once. Stata has three commands for performing loops: foreach, forvalues and while.

In this tip, I will teach you how to loop using foreach loop, which is perhaps the most common one. You will receive tips about forvalues and while loops soon.

Note: I am using The Asia Foundation’s A Survey of the Afghan People (2014) data in my examples. If you do not have this data, you can download it from the link below:

Examples:

Suppose you want to see frequency distribution (table of frequencies) of q27a, q27b, q27c, q27d and q27e questions. The long way is to type five commands for five mentioned variables:
tab q27a
tab q27b
tab q27c
tab q27d
tab q27e

Another way of running the above 5 commands is to use a loop with command foreach:
foreach x in q27a q27b q27c q27d q27e {
    display "`x'"
    tab `x'
}
The above loop repeats command tab for each variable listed after in on the first line. Strictly speaking, the first line creates a local macro called x, which holds q27a, q27b, q27c, q27d and q27e one at a time, one per pass through the loop. If you recall Stata tip #31, macros are used as a shorthand for a list of variables (or strings/text).

The above loop has four lines of code. The first line, foreach, creates the local macro x and sets it in turn to q27a, q27b, q27c, q27d and q27e. The second line executes command display, which shows the current element of x; the double quotes make Stata display it as text. The third line executes command tab for the current element of x. The last line closes the loop with a curly bracket.

Rules:

There are some general rules that apply with loops: foreach, forvalues and while:
  1. Curly brackets must be used to specify the beginning and end of the loop;
  2. The open brace must appear on the same line as the loop command foreach, forvalues or while;
  3. Nothing may follow the open brace except comments; the first command to be executed must appear on a new line; and
  4. The close brace must appear on a line by itself at the end.
Remember, each line is typed and entered separately. As long as you follow the above rules, Stata will not execute anything when you enter a line until the loop is closed with the curly bracket on the last line.
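
The rules can be seen together in the short sketch below, which runs on a made-up data set. The counter shown is a hypothetical local macro added only to confirm how many times the body ran:

```stata
* Toy data with two of the q27 items; the values are made up for illustration
clear
input q27a q27b
1 2
2 2
3 1
end

* Counter (hypothetical helper) to record how many passes the loop makes
local shown 0

* Rule 2: the open brace sits on the same line as foreach
foreach x in q27a q27b {
    * Rule 3: the first command starts on a new line after the brace
    display "`x'"
    tab `x'
    local ++shown
}
* Rule 4: the close brace above is on a line by itself

display "tables shown: `shown'"
```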

The above example demonstrates one way of using foreach in terms of syntax, the structure of which can be written as below:
1.         foreach  local-macro-name  in  list-of-variables {

Stata has other variants of foreach in terms of syntax, which also loop over a list. Here are four other variants. The keywords to type literally are of local, of global, of numlist and of varlist; for variants 2 and 3 the macro is given by name only, without the `' quotes or the $ sign:
2.         foreach  local-macro-name  of  local  name-of-existing-local-macro {
3.         foreach  local-macro-name  of  global  name-of-existing-global-macro {
4.         foreach  local-macro-name  of  numlist  list-of-numbers {
5.         foreach  local-macro-name  of  varlist  list-of-variables {

Syntaxes 2 and 3 take their list of variables from a local or global macro, which must already be defined. If I perform the q27 example using syntaxes 2 and 3, it looks as follows:
2.         
local  lmacname  q27a  q27b  q27c  q27d  q27e
foreach  x  of  local  lmacname {
    display  "`x'"
    tab  `x'
}

3.         
global  gmacname  q27a  q27b  q27c  q27d  q27e
foreach  x  of  global  gmacname {
    display  "`x'"
    tab  `x'
}

Syntax 4 with numlist is different, since it takes only a list of numbers to loop over.
For example, I want to see how many people of various ages (d2) live in rural or urban areas (m6b). Using foreach, I tabulate the rural/urban variable for each year of age from 18 to 22:
4.         
foreach  y  of  numlist  18/22 {
    display  "Age = `y'"
    tab  m6b  if  d2==`y'
}

In the above command, the first line says: create a local macro y that takes the numbers from 18 to 22 in turn (here the slash “/” indicates all the integers from 18 to 22, i.e. 18, 19, 20, 21 and 22), and open the foreach loop. The second line says: display “Age = ” followed by the current value of y, i.e. 18, 19, 20, 21 or 22. The third line says: tabulate variable m6b on the condition that age equals the current value of local macro y. The last line closes the loop. Since the list holds 5 numbers, the two commands display and tab run 5 times, once for each value of y.
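
A numlist does not have to be one continuous range; ranges and single numbers can be mixed. The sketch below loops over ages 18, 19, 25 and 30 in a made-up data set and adds up how many observations were matched. The local macro total is a hypothetical helper, not part of the foreach syntax:

```stata
* Toy data: age (d2) and rural/urban (m6b); the values are made up
clear
input d2 m6b
18 1
19 2
25 1
30 2
18 2
end

* Running total of matched observations (hypothetical helper)
local total 0

* 18/19 expands to 18 19; 25 and 30 are added as single numbers
foreach y of numlist 18/19 25 30 {
    display "Age = `y'"
    count if d2 == `y'
    local total = `total' + r(N)
}

display "observations matched: `total'"
```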

The last variant of the foreach loop is syntax 5 from above. It is similar to the first syntax with in, with a slight difference. foreach with in accepts a general list: the elements are taken exactly as typed, separated by spaces. foreach with of varlist tells Stata that the list is a variable list, so Stata interprets it as one, expanding abbreviations into full variable names and checking that the variables exist. Therefore, syntax 5 allows for variable abbreviations.

Some common variable abbreviations include:
-          q27*: an abbreviation for all variables with the prefix q27, meaning every variable whose name starts with q27.
-          q27a-q27e: meaning all variables q27a through q27e in the order that the variables are recorded in the data set.
-          _all or *: meaning all variables.

So, let’s go back to the question of recoding all 98s and 99s to missing in our data set. Remember, Stata does not have an undo option per se. To undo an action, you need to execute command preserve before your action, and then execute restore to undo any changes made after preserve. See tip #27 about restore and preserve. So, I recommend you execute preserve before performing this action and restore afterwards if you want the original values back, or simply do not save your data set after your work, because otherwise your original data, which included values 98 and 99, will be lost.
foreach  w  of  varlist  _all {
    capture  recode  `w'  (98  99=.)
}

The first line says: create a local macro w that takes each variable in turn, and open a foreach loop. The second line recodes all 98s and 99s to missing in the current variable. Notice that I have put capture before command recode. capture executes command recode (or any other command for that matter) while suppressing all error messages, if any. Without capture, if recode ran into a problem with one of the variables (maybe the variable is not numeric, and thus cannot hold 98 or 99, which is the case with some variables in our data set), Stata would stop and leave the loop unfinished. With capture, Stata suppresses that error and continues with the remaining variables. The last line has only the closing curly bracket and closes the loop.
There are other ways to recode all 98s and 99s to missing. For instance, using an asterisk *:
foreach  w  of  varlist  * {
    capture  recode  `w'  (98  99=.)
}
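
An alternative to capture is to loop over the numeric variables only. The sketch below assumes command ds with option has(type numeric), which leaves the matching names in r(varlist). The data are made up, with one string variable included on purpose, and preserve and restore wrap the change so that the original values come back at the end:

```stata
* Toy data: two numeric variables and one string variable; values are made up
clear
input q1 q57 str5 name
98 1 "a"
2 99 "b"
3 2 "c"
end

* Keep a copy of the data so the change can be undone
preserve

* List the numeric variables only; their names are returned in r(varlist)
ds, has(type numeric)
foreach w of varlist `r(varlist)' {
    recode `w' (98 99 = .)
}

* Two values were turned into missing, one in q1 and one in q57
count if missing(q1) | missing(q57)

* Bring back the original data, including the 98 and 99 codes
restore
```

Since no string variable enters the loop, any error that does occur is a real problem rather than one hidden by capture.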

Wednesday, July 22, 2015

Stata tip #32: Sort & By

Sorting data is an important function of any data analysis package. It is common for users to want to arrange data in ascending or descending order. Stata commands sort and gsort are two commands that you can use for that purpose. However, since you do not need to sort data for most analysis in Stata, sort and gsort are mostly used with by, which repeats a command for each unique value of the variable it is used with. After we practice sort, gsort and by, I will introduce you to a simpler command, bysort, which combines by and sort.

Note: I am using The Asia Foundation’s A Survey of the Afghan People (2014) data. If you do not have this data, you can download it from the link below:

Example:

After loading your data set, you can look at the raw data in a spreadsheet format by executing command browse or simply br. If you want to see one or more specific variables, type the variable names after this command. For instance, I want to see variables d1, d2, m7, and m6b.
            browse  d1  d2  m7  m6b         or         br  d1  d2  m7  m6b

As you will observe, the data are not arranged in any particular order; they appear in the order in which they were entered. Now, if I want to arrange the data based on a variable, perhaps age (d2) of respondents, I use sort:
            sort d2

The above command sorts the data by age in ascending order, from the youngest to the oldest respondent. Execute browse to see.

You can also have more than one variable sorted. For instance, I want to sort observations by province; then within each province (m7), I want to sort them by rural /urban (m6b); then sort them by gender (d1); and finally sort them by age (d2). To do so, the variable names should be typed with sort in that order:
            sort  m7  m6b  d1  d2

What command sort cannot do is arrange the data in descending order, from the largest to the smallest number. For this purpose there is command gsort, which has the flexibility of arranging in either ascending or descending order.

In order to arrange data in ascending order, gsort should be used with a plus sign “+” before the variable name; for descending order, gsort should be used with a minus sign “-”.
            gsort  -d2
            gsort  +d2

It is also possible to sort data by more than one variable with gsort, just as with sort. The advantage of using gsort for this purpose is that it lets you choose ascending or descending order separately for each variable in the command. For instance, I want to sort province (m7) in descending order; within each province, rural / urban (m6b) in ascending order; within each province and rural / urban group, gender (d1) in descending order; and within all those, age (d2) in ascending order:
            gsort -m7 +m6b -d1 +d2
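
A quick way to convince yourself of the order gsort produces is to try it on a handful of rows. The sketch below uses made-up values for province (m7) and age (d2):

```stata
* Toy data: province (m7) and age (d2); the values are made up
clear
input m7 d2
1 30
2 25
1 45
2 60
end

* Province descending, then age ascending within each province
gsort -m7 +d2

list m7 d2
```

After sorting, the rows for province 2 come first (ages 25 and 60), followed by province 1 (ages 30 and 45).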


In my experience working with Stata, sort and gsort are rarely useful on their own without by. As mentioned earlier, by repeats a command for each unique value of the variable used with by.
Let’s start with a simple example. Suppose you want to see the average (mean) age for male and female (d1) respondents, using sort, by and summarize:
            sort  d1
            by  d1:  sum  d2

So, the above two lines tell Stata to first sort the data by gender, and then calculate summary statistics of age separately for each value of the sorted variable, gender.

There are numerous alternatives to get the above result, of which one of the easiest is the bysort command. bysort is basically a combination of by and sort, so instead of typing two lines of commands, with bysort you only have to type one:
            bysort  d1:  sum  d2

It is also possible to use more than one variable with bysort. Here is an example:
            bysort  d1  m6b:  sum  d2

Also, it is possible to use bysort with most commands you know, such as tab, mrtab, reg, pwcorr, etc.:
            bysort  d1:  tab  q74 m6b,  col  nofreq
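
bysort is also handy for creating group-level variables. The sketch below is an illustration on made-up data, not part of the survey workflow: it stores the mean age of each gender group with egen, and numbers respondents within each group from youngest to oldest. The variable names mean_age and order_in_group are hypothetical. Putting d2 in parentheses asks Stata to sort by it within the groups without forming separate groups for each age:

```stata
* Toy data: gender (d1) and age (d2); the values are made up
clear
input d1 d2
1 30
2 25
1 45
2 60
end

* Mean age within each gender group, stored on every row of the group
bysort d1: egen mean_age = mean(d2)

* Sort by age inside each gender group and number the rows 1, 2, ...
bysort d1 (d2): generate order_in_group = _n

list d1 d2 mean_age order_in_group
```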