A Short Introduction To STATA
1) Introduction:
This session aims to take everyone from theoretical equations to tangible results using Stata.
Stata is a statistical package with a wide range of capabilities, including data management,
statistical and econometric analysis, and graphics. The user interface consists of the following windows
(see Figure 1):
• Command Window (highlighted in red): the window where we type all commands.
• Results Window (highlighted in blue): the window that displays all the results and output generated
by the commands we have typed.
• Variables Window (highlighted in orange): the window that lists all the variables currently stored
in Stata’s memory. We can view these variables as in a spreadsheet by typing browse (or br) in the
Command Window, followed by the variables to be displayed (if no variables are
specified, Stata shows all of them). To make changes to the data, type
edit in the Command Window instead.
• Command History (highlighted in green): the window that keeps a record of all the commands used
in the current session.
• Current Working Directory (highlighted in black): shows the directory on
your computer from which Stata reads files and to which it saves them. It can be changed by typing
cd path_to_the_new_directory in the Command Window (e.g. cd c:\desktop\Stata11\session1,
or cd "c:\desktop\Stata11\session 1" if the path contains a space), or from the Stata menu:
File/Change Working Directory.
To clear all the variables left in Stata’s memory from a previous session, we can type clear in the Command
Window.
When we need to learn how a command is used, which options it allows, or to see some examples of
its use, we can type help name_of_the_command or findit name_of_the_command in the Command
Window. Try help reg and findit reg, and see the differences.
If we are not sure about the name of the command we need, we can type search followed by a keyword instead.
Any line in Stata that begins with a star (*) is treated as a comment and is not
executed.
Stata can also be used as a calculator through the command display (e.g. display 4+5).
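The basic commands above can be tried together in one short session. This is only a sketch: pwd (which prints the current working directory) and the second display line are additions for illustration and are not part of the original handout.

```stata
* A line that starts with a star is a comment and is not executed.
* Show the current working directory (change it with cd if needed).
pwd
* Empty Stata's memory before starting.
clear
* Look up the documentation of a command, or search by keyword.
help regress
search linear regression
* Use Stata as a calculator.
display 4+5
display sqrt(16) + 2^3
```

The two display lines print 9 and 12 respectively.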
3) Entering Data:
I. Input from .xls or .xlsx files
Suppose your original data source is an Excel file or workbook.
Econ526 students may recognize the data set from C. Dougherty’s textbook Introduction to
Econometrics, with eaef21.xls as its file name. The command to load it into Stata is import excel, used here
with the options firstrow and case(lower).
Here, excel cannot be omitted, because import also reads other formats, such as text files.
firstrow tells Stata to treat the first row of the Excel file as the variable names. Notice that they
are all in upper case in the file, so case(lower) is added to obtain lower-case
variable names. Stata is case sensitive: a capital letter and the same lower-case letter name different variables.
Likewise, case(preserve) keeps the names exactly as they are in the Excel file, and case(upper) gives
upper-case names.
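A minimal sketch of the import described here, assuming that eaef21.xls sits in the current working directory. The clear option and the describe call are additions for illustration.

```stata
* Assumption: eaef21.xls is in the current working directory.
* firstrow    : use the first spreadsheet row as variable names
* case(lower) : convert those names to lower case
* clear       : replace whatever data are currently in memory
import excel using eaef21.xls, firstrow case(lower) clear
* Check that the variable names arrived in lower case.
describe
```

Replacing case(lower) with case(preserve) or case(upper) changes only how the variable names are capitalized, not the data.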
II. Input from .csv files
A .csv file differs from an .xls file in that its values are separated by commas. Using the same
data set as an example, save it as a .csv file and load it with the following command:
import delimited using eaef21.csv
Here you don’t need to specify firstrow or case(lower), because the first row of the .csv file is used for the
variable names and they are converted to lower case automatically. Since the data in a .csv file are already
delimited, Stata can pin down the data structure more easily, and you benefit from a simpler
command. Another way to load a .csv file is the older command insheet.
The two commands yield the same result. Starting with Stata 13, insheet was superseded by the new
command import delimited, so if you are using an older version, use insheet. It still works in up-to-date
versions of Stata, but its help file is no longer updated.
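The two ways of loading the .csv file can be compared side by side. The sketch assumes that eaef21.csv is in the current working directory; the clear option is added so that each command can replace the data loaded by the previous one.

```stata
* Assumption: eaef21.csv is in the current working directory.
* Current command:
import delimited using eaef21.csv, clear
describe
* Older command, still accepted by recent versions of Stata:
insheet using eaef21.csv, clear
describe
```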
III. Input from .txt files
The data set "earnings" is taken from R. Davidson and J.G. MacKinnon, Econometric Theory and Methods,
New York, Oxford University Press, 2004. The first column is the observation number; columns 2 to 4 are
dummy variables for individuals in groups 1, 2 and 3 respectively. The last column is average annual
earnings in 1988 and 1989, measured in 1982 US dollars. Notice that the file has no variable names
in its first row, so you have to supply the variable names yourself. The command for
reading such .txt files is infile, followed by the list of variable names and then using and the file name.
Here obs is the variable name for the observation numbers, and d1, d2, d3 and earnings name the remaining columns.
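A sketch of the infile call described above. The file name earnings.txt is an assumption (the handout does not state the exact name), as are the clear option and the two inspection commands.

```stata
* Assumption: the raw file is called earnings.txt and is in the
* current working directory; adjust the name to match your copy.
* The five names are assigned to the five columns in order.
infile obs d1 d2 d3 earnings using earnings.txt, clear
* Inspect what was read.
describe
list in 1/5
```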
IV. Miscellaneous
It is also easy to generate the observation number ourselves:
gen n = _n
gen is short for generate, n is the name of the new variable, and _n is the system variable that holds each observation’s number.
As an example of its use, let’s first regress earnings on the two dummies d1 and d2:
reg earnings d1 d2
If you want to run the regression without the first 500 observations, just add if _n>500 to the
command:
reg earnings d1 d2 if _n>500
Since _n already lets us refer to a specific observation, we don’t really need the variable obs in our data
set. To delete it, use drop:
drop obs
You can drop variables, and you can also drop some of the observations. Before doing that, let’s preserve the
data so that we can easily restore it after this destructive experiment.
preserve
drop if _n <=1000
restore
After the second command, Stata reports that 1000 observations have been deleted. Once
you have preserved the data, you can restore it, but only once per preserve. The counterpart
of drop is keep, which retains what you list and discards the rest.
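The preserve, drop, keep and restore cycle can be tried safely on one of the data sets shipped with Stata. The sketch below uses auto.dta (74 observations) instead of the earnings data so that it runs without any external file. The choice of data set and of the variables make, price and foreign is purely illustrative.

```stata
* Load a shipped example data set with 74 observations.
sysuse auto, clear
gen n = _n
* Take a snapshot before any destructive step.
preserve
* Drop the first 10 observations, leaving 64.
drop if _n <= 10
count
* keep is the counterpart of drop: retain only the listed variables
* and only the observations that satisfy the condition.
keep make price foreign
keep if foreign == 1
count
* Bring back the snapshot: all 74 observations and all variables.
restore
count
```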
To avoid forgetting what a particular variable measures, label it with label var, followed by the variable name and the label text in quotation marks.
var stands for variable, and whatever is placed inside the quotation marks becomes the label.
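A small sketch of labeling. To keep it runnable without the earnings file, it builds a three-row toy variable called earnings. The toy values and the exact label wording are assumptions made for illustration only.

```stata
* Toy data so the example runs on its own (not the real earnings data).
clear
set obs 3
gen earnings = 20000 + 1000*_n
* Attach a descriptive label to the variable.
label var earnings "Average annual earnings 1988-89, 1982 US dollars"
* The label now appears next to the variable name.
describe earnings
```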
Stata saves its own data sets on disk as .dta files. Whenever you want to open an existing data set,
use the following command:
use earnings
Again, as in every case above, earnings.dta has to be in the current working directory. Stata
also ships with 27 data sets of its own (in version 14). They remain available as long as
your Stata installation is intact, and they serve repeatedly as example data in Stata’s
User Reference Manual, which I highly recommend to anyone who wants to learn more. Please type
sysuse dir
to get a first impression of these data sets. To load any of them, type sysuse followed by the data set name.
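For instance, listing the shipped data sets and loading one of them might look like this. The choice of auto.dta is only an example.

```stata
* List the example data sets installed with Stata.
sysuse dir
* Load one of them, replacing any data in memory.
sysuse auto, clear
* Get an overview of its variables.
describe
```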
Stata has several commands that help us explore and understand the data better. Type the following
command to use the NLSW88 data set (National Longitudinal Survey of Women in 1988):
webuse nlsw88
or webuse nlsw88, clear if data are already loaded in memory.
Now, try the following commands and see the differences between them:
describe
summarize wage
sum wage
sum wage, de
codebook wage
inspect wage
Note that when we add if followed by a condition (e.g. wage>16.5), the command is executed only
for those observations in the data set that meet this condition.
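A short sketch of the if qualifier in action. It loads nlsw88 with sysuse, which reads the copy shipped with Stata (webuse nlsw88 fetches the same file online). The collgrad condition is an extra illustration. One point worth knowing is that Stata treats missing values as larger than any number, so a condition such as wage>16.5 would also select observations with a missing wage, hence the explicit !missing() guard below.

```stata
* Load the data (shipped copy; webuse nlsw88 downloads the same file).
sysuse nlsw88, clear
* All observations.
summarize wage
* Only observations with an hourly wage above 16.5.
summarize wage if wage > 16.5
* Count them, excluding missing values explicitly.
count if wage > 16.5 & !missing(wage)
* Conditions can be combined with & (and) and | (or).
summarize wage if wage > 16.5 & collgrad == 1
```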
5) Visualizations
A. Histograms
For example, type histogram wage in the Command Window, or hist wage, normal if you would like to overlay a normal density curve.
You should see the following picture.
[Figure: histogram of hourly wage; vertical axis: Density, horizontal axis: hourly wage]
B. Scatter Graphs
Note that in the context of graphs, by is used as an option (after a comma) rather than as a prefix.
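A possible scatter graph on the nlsw88 data, shown with and without the by() option. The pairing of wage with tenure and the grouping by race are assumptions made for illustration, since the handout’s own command is not shown here.

```stata
* Load the data (shipped copy of nlsw88).
sysuse nlsw88, clear
* Simple scatter plot: first variable on the vertical axis.
scatter wage tenure
* Same plot drawn separately for each group: by() goes after the comma.
scatter wage tenure, by(race)
```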
C. Matrix Graphs
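A matrix graph draws every pairwise scatter plot of a list of variables in one picture. A minimal sketch on the nlsw88 data follows; the three variables chosen are assumptions for illustration.

```stata
* Load the data (shipped copy of nlsw88).
sysuse nlsw88, clear
* Pairwise scatter plots of wage, job tenure and total experience.
* The half option shows only the lower triangle of the matrix.
graph matrix wage tenure ttl_exp, half
```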
D. Box Graphs
[Figure: box plots of wage for the groups white, black and other]
From the picture, the median wage does not seem to differ much across the three ethnic groups,
although the white group has more high-wage outliers.
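A box graph of this kind could be produced along the following lines. This is a sketch on the nlsw88 data, assuming that the grouping variable is race.

```stata
* Load the data (shipped copy of nlsw88).
sysuse nlsw88, clear
* One box per category of race; over() plays the role of a grouping option.
graph box wage, over(race)
* Group medians can be checked numerically as well.
tabstat wage, by(race) statistics(median)
```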
6) An OLS regression:
To run an OLS regression we can use the command regress or, in short, reg, followed by the dependent
variable (the one we want to explain) and the independent variable or variables (the ones that we
suspect explain the dependent variable). For example, reg wage tenure collgrad married runs a regression of wage on tenure, collgrad,
and married.
After running a regression, Stata temporarily stores (until another regression is run) some useful items.
For example, we can generate the residuals with the command predict, giving the new variable a name (here myresids) and adding the residuals option.
The residuals of the regression above are then saved in the variable myresids. Are the residuals
correlated with some other variable that is perhaps missing from the regression? Use the command
correlate or a scatter graph to check this.
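The residual check described above can be sketched as follows. The variable ttl_exp (total work experience) stands in for a possibly omitted variable; that choice is an assumption for illustration and not part of the handout.

```stata
* Load the data and run the regression from this section.
sysuse nlsw88, clear
regress wage tenure collgrad married
* Save the residuals in a new variable.
predict myresids, residuals
* Are the residuals related to a variable left out of the model?
correlate myresids ttl_exp
scatter myresids ttl_exp
```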
7) Hypothesis Testing
Hypothesis testing is straightforward in Stata. For instance, to test whether the coefficient on tenure
equals zero:
test tenure = 0
( 1) tenure = 0
F( 1, 2227) = 58.18
Prob > F = 0.0000
This is a test on a single coefficient. The joint significance test that the coefficients on collgrad and married both equal
zero gives:
( 1) collgrad - married = 0
( 2) collgrad = 0
F( 2, 2227) = 80.20
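Both tests can be reproduced after the regression of wage on tenure, collgrad and married. The joint test is written here as collgrad = married = 0, a form that matches the two restrictions listed in the output above. This is a sketch and not necessarily the exact command used in the handout.

```stata
* Load the data and run the regression.
sysuse nlsw88, clear
regress wage tenure collgrad married
* Single restriction: the coefficient on tenure is zero.
test tenure = 0
* Joint test: the coefficients on collgrad and married are both zero.
* The shorter form "test collgrad married" tests the same hypothesis.
test collgrad = married = 0
* test leaves its results in r(); show the F statistic and p-value.
display "F = " r(F) "   p-value = " r(p)
```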
The following commands give you the fitted values 𝑦̂ and the residuals 𝑢̂:
predict yhat, xb
predict u, res
The command that extracts them from the regression is predict; yhat and u are the names of the new variables, the option xb tells Stata
that you want the fitted values, and res is short for residuals. You’ll find that two more variables have appeared in
your variable list. Finally, all the useful information from the estimation is stored in the e-class (e stands for
estimation) returns. Take a look at them by using the following command after the regression:
ereturn list
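As an illustration of how the stored results can be used, the sketch below reruns the regression of wage on tenure, collgrad and married, and then reads a few items back. e(N), e(r2) and _b[tenure] are standard stored results after regress. The two checks at the end are additions for illustration.

```stata
* Load the data and run the regression.
sysuse nlsw88, clear
regress wage tenure collgrad married
* Fitted values and residuals.
predict yhat, xb
predict u, res
* Everything stored by the estimation command.
ereturn list
* Individual items can be used in later commands.
display "N = " e(N) "   R-squared = " e(r2)
display "coefficient on tenure = " _b[tenure]
* Check 1: wage equals fitted value plus residual in the estimation sample.
assert abs(wage - yhat - u) < 1e-4 if e(sample)
* Check 2: with a constant in the model, OLS residuals average to zero.
summarize u
```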
8) Extra Resources
http://www.stata.com/links/resources-for-learning-stata/
http://www.stata.com/links/video-tutorials/