Wednesday, July 22, 2015

Stata tip #32: Sort & By

Sorting data is a basic function of any data analysis package, and users often want to arrange data in ascending or descending order. Stata provides two commands for this purpose: sort and gsort. However, since Stata does not generally need sorted data to run an analysis, sort and gsort are mostly used together with by, which repeats a command for each unique value of the variable (or variables) it is used with. After we practice sort, gsort, and by, I will introduce you to a simpler command, bysort, which combines by and sort.

Note: I am using The Asia Foundation’s A Survey of the Afghan People (2014) data. If you do not have this data, you can download it from the link below:

Example:

After loading your data set, you can look at the raw data in a spreadsheet format by executing the command browse, or simply br. If you only want to see one or more specific variables, type their names after the command. For instance, I want to see variables d1, d2, m7, and m6b:
            browse  d1  d2  m7  m6b         or         br  d1  d2  m7  m6b

As you will observe, the data are not arranged in any particular order; they appear in the order in which they were entered. Now, if I want to arrange the data based on a variable, perhaps the age (d2) of respondents, I use sort:
            sort d2

The above command sorts the data by age in ascending order, from the smallest to the largest age. Execute browse to see the result.

You can also sort by more than one variable. For instance, I want to sort observations by province (m7); then, within each province, by rural/urban (m6b); then by gender (d1); and finally by age (d2). To do so, type the variable names after sort in that order:
            sort  m7  m6b  d1  d2
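
If you do not have the survey data at hand, the following is a minimal sketch of the same multi-variable sort on a tiny made-up data set. It reuses the variable names from this post, but the values are invented for illustration and are not taken from the survey.

```stata
* Toy data: values are invented; only the variable names follow the post
clear
input m7 m6b d1 d2
2 1 1 40
1 2 2 25
1 1 1 33
2 1 2 19
1 1 1 21
1 2 1 52
end

* Sort by province, then rural/urban, then gender, then age
sort m7 m6b d1 d2

* Draw a separator line each time the province changes
list, sepby(m7)
```

In the listing, all observations of the first province should come before those of the second, and the later variables only decide the order among observations that are tied on the earlier ones.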

What sort cannot do is arrange the data in descending order, from the largest to the smallest value. For this purpose there is gsort, which can arrange data in either ascending or descending order.

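To arrange data in ascending order, use gsort with a plus sign “+” before the variable name (a name with no sign is also treated as ascending); for descending order, use a minus sign “-”. Type the minus sign as a plain hyphen; a typographic dash pasted from a web page will not be recognized by Stata.
            gsort  -d2
            gsort  +d2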

It is also possible to sort data by more than one variable with gsort, just as with sort. The advantage of gsort is that it lets you choose ascending or descending order separately for each variable in the command. For instance, I want to sort province (m7) in descending order; within each province, rural/urban (m6b) in ascending order; within each province and rural/urban group, gender (d1) in descending order; and within all those, age (d2) in ascending order:
            gsort -m7 +m6b -d1 +d2
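
To see the effect of the signs without the survey data, here is a small sketch on made-up values (again, only the variable names come from this post; the numbers are invented). Note that the minus signs are typed as plain hyphens.

```stata
* Toy data: values are invented; only the variable names follow the post
clear
input m7 m6b d1 d2
2 1 1 40
1 2 2 25
1 1 1 33
2 1 2 19
1 1 1 21
1 2 1 52
end

* Oldest respondents first
gsort -d2
list in 1/3

* Province descending, rural/urban ascending, gender descending, age ascending
gsort -m7 +m6b -d1 +d2
list, sepby(m7)
```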


In my experience working with Stata, sort and gsort are rarely useful on their own; they are mostly used together with by. As mentioned earlier, by repeats a command for each unique value of the variable used with by, and it requires the data to be sorted by that variable first.
Let’s start with a simple example. Suppose you are interested in the average (mean) age (d2) of male and female (d1) respondents. You can get it using sort, by, and summarize:
            sort  d1
            by  d1:  sum  d2

So, the two commands above tell Stata to first sort the data by gender, and then to calculate summary statistics of age separately for each value of the sorted variable, gender.
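
The sort step is not optional: by refuses to run when the data are not sorted by the by-variable. The sketch below illustrates this on a small made-up data set (the values are invented; only the variable names follow this post). It uses capture noisily so that the error message is shown without stopping a do-file.

```stata
* Toy data entered in no particular order; values are invented
clear
input d1 d2
2 25
1 40
2 31
1 22
end

* The data are not sorted by d1 yet, so by should stop with a
* "not sorted" error; capture noisily displays it and carries on
capture noisily by d1: summarize d2
display _rc

* After sorting, the same by command runs once per value of d1
sort d1
by d1: summarize d2
```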

There are numerous alternatives for getting the above result, of which one of the easiest is the bysort command. bysort is basically a combination of by and sort; thus, instead of typing two lines of commands, you only have to type one:
            bysort  d1:  sum  d2

It is also possible to use more than one variable with bysort. Here is an example:
            bysort  d1  m6b:  sum  d2
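
As a side note, bysort can also be written as by with the sort option. The sketch below, on made-up values that only borrow this post's variable names, runs both spellings; they should give the same output, one block of summary statistics per combination of d1 and m6b.

```stata
* Toy data: values are invented; only the variable names follow the post
clear
input d1 m6b d2
1 1 40
2 2 25
1 1 33
2 1 19
1 2 21
2 2 52
1 2 47
2 1 30
end

* by with the sort option sorts first, then repeats the command per group
by d1 m6b, sort: summarize d2

* bysort is the shorter spelling of the same thing
bysort d1 m6b: summarize d2
```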

Also, it is possible to use bysort with most commands you know, such as tab, mrtab, reg, pwcorr, etc.:
            bysort  d1:  tab  q74 m6b,  col  nofreq
