Sorting data is an important function of any data analysis package. Users often want to arrange data in ascending or descending order, and Stata provides two commands for that purpose: sort and gsort. However, since you do not need to sort data to run analyses in Stata, sort and gsort are mostly used together with by, a prefix that repeats a command for each unique value of the variable it is used with. After we practice sort, gsort, and by, I will introduce a simpler command, bysort, which combines by and sort.
Note: I am using The Asia Foundation’s A Survey of the Afghan People (2014) data. If you do not have this data, you can download it from the link below:
Example:
After loading your data set, you can look at the raw data in a spreadsheet format by executing the command browse, or simply br. If you are interested in seeing only one or more specific variables, type the variable names after the command. For instance, I want to see variables d1, d2, m7, and m6b.
browse d1 d2 m7 m6b
or
br d1 d2 m7 m6b
As you will observe, the data are not sorted; the observations appear in the order in which they were entered. Now, if I want to arrange the data based on a variable, perhaps the age (d2) of respondents, I use sort:
sort d2
The above command sorts the data by age in ascending order, from the smallest to the largest age. Execute browse to see the result.
You can also sort on more than one variable. For instance, I want to sort observations by province (m7); then, within each province, by rural/urban (m6b); then by gender (d1); and finally by age (d2). To do so, type the variable names after sort in that order:
sort m7 m6b d1 d2
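If you want to try multi-variable sorting without the survey file, here is a minimal sketch using the auto data set that ships with Stata. The variables foreign, rep78, and price belong to that example data set, not to the survey, so treat them as stand-ins for m7, m6b, and d2.

```stata
* Load Stata's bundled example data (an assumption: a stand-in for the survey data)
sysuse auto, clear

* Sort by origin, then by repair record within origin, then by price
sort foreign rep78 price

* Inspect the first few observations to confirm the ordering
list foreign rep78 price in 1/10
```

Missing values of rep78 count as larger than any number, so with sort they end up last within each group.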
What sort cannot do is arrange the data in descending order, from the largest to the smallest number. For this purpose there is gsort, which can arrange data in either ascending or descending order.
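To arrange data in ascending order, put a plus sign “+” before the variable name; for descending order, put a minus sign “-” before it. If no sign is given, ascending order is assumed. Type the minus sign as a plain hyphen; a typographic dash copied from formatted text will not be accepted by Stata.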
gsort -d2
gsort +d2
It is also possible to sort data by more than one variable with gsort, just as with sort. The advantage of gsort is that you can choose ascending or descending order separately for each variable in the command. For instance, I want to sort by province (m7) in descending order; within each province, by rural/urban (m6b) in ascending order; within each province and rural/urban group, by gender (d1) in descending order; and within all of those, by age (d2) in ascending order:
gsort -m7 +m6b -d1 +d2
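The same mixed ordering can be tried on Stata's bundled auto data. This is only a sketch: foreign, rep78, and price are variables of that example data set and are assumed here as stand-ins for the survey variables.

```stata
* Load Stata's bundled example data (stand-in for the survey data)
sysuse auto, clear

* Foreign cars first (descending), repair record ascending,
* and the most expensive car first within each repair record
gsort -foreign +rep78 -price

* Check the resulting order
list foreign rep78 price in 1/10
```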
In my experience working with Stata, sort and gsort are rarely very useful on their own, without by. As mentioned earlier, by repeats a command for each unique value of the variable used with it.
Let’s start with a simple example. Suppose you are interested in the average (mean) age of male and female (d1) respondents. Using sort, by, and summarize:
sort d1
by d1: sum d2
So, the above two lines tell Stata to first sort the data by gender, and then to calculate summary statistics of age separately for each value of the sorted variable, gender.
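The sort step matters because by expects the data to be sorted on the by-variable already. The following sketch illustrates this on Stata's bundled auto data, which is an assumption standing in for the survey data; the variables foreign, price, and mpg belong to that example data set.

```stata
* Load Stata's bundled example data (stand-in for the survey data)
sysuse auto, clear

* Put the data in an order that is not grouped by foreign
sort price

* by should now fail with a "not sorted" error;
* capture noisily shows the message but lets a do-file continue
capture noisily by foreign: summarize mpg
display "return code: " _rc

* After sorting on foreign, the same command runs once per group
sort foreign
by foreign: summarize mpg
```

A nonzero return code after the first by signals that the command was refused, which is exactly the situation the preceding sort avoids.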
There are several other ways to get the above result, and one of the easiest is the bysort command. bysort is basically a combination of by and sort, so instead of typing two lines of commands, you only have to type one:
bysort d1: sum d2
It is also possible to use more than one variable
with bysort. Here is an example:
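As a minimal sketch on Stata's bundled auto data (an assumption; with the survey data the analogous command would be something like bysort m7 d1: sum d2), grouping by two variables could look like this. The variable cheapest is a hypothetical name introduced only for illustration.

```stata
* Load Stata's bundled example data (stand-in for the survey data)
sysuse auto, clear

* One summary of price for every combination of foreign and rep78
bysort foreign rep78: summarize price

* Variables in parentheses are used for sorting only, not for grouping:
* within each foreign group, order by price and flag the cheapest car
bysort foreign (price): generate cheapest = (_n == 1)
list make foreign price if cheapest
```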
Also, it is possible to use bysort with most commands you know, such as tab, mrtab, reg, pwcorr, etc.:
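A few such combinations, sketched on Stata's bundled auto data rather than the survey data (the variables foreign, rep78, price, mpg, and weight come from that example data set and are assumptions here):

```stata
* Load Stata's bundled example data (stand-in for the survey data)
sysuse auto, clear

* Frequency table of repair record, separately for each origin
bysort foreign: tabulate rep78

* One regression of price on mpg and weight per origin
bysort foreign: regress price mpg weight

* Pairwise correlations per origin
bysort foreign: pwcorr price mpg weight
```

mrtab is a user-written command, so it is left out of this sketch.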