Course website www.ntu.edu.sg/home/cslennox/teaching.htm • You use this website for getting the teaching notes and syllabus • This site also contains a link to MySite which is where I keep the datasets that are used in the course.
Objectives of this class • By the end of the class you should be able to: • Create and save programs, data and output • Manage and manipulate your data files • Construct new variables • Perform simple univariate tests • Analyze your data graphically
1. Basic data analysis and introduction to STATA • 1.1 Starting up • 1.2 Creating and saving programs and output • 1.3 Manipulating multiple datasets • 1.4 Describing the data • 1.5 Constructing variables • 1.6 Repeated observations for the same company • 1.7 Tests of means, medians, contingency tables and correlation matrices • 1.8 Graphs
1.1 Starting up (set mem) • Start up your STATA software • Four main windows • Review (list of prior commands) • Variables (list of variables) • Results (output) • Command (enter commands interactively) • You need to have sufficient memory allocated to STATA in order to import your data files • set mem 500m, perm
1.1 Starting up (display) • Type display "1" and display "clive" in the command window • We can also execute the command display 1 but not display clive because clive is a “string” variable rather than a numerical variable. • The " " tells STATA that the bit within quotes is a string • e.g., display "1+1" and display 1+1
1.1 Starting up (describe, edit, rename) • Type describe in the command window • You get the same result by typing d or using the pull down menu (Data) • At the moment there is no data in the memory • We can input data interactively (type edit in the command line or use the Data pull down menu) • Now type describe or d in the command window • You can rename the variable • rename var1 clive • Page up to edit a previous command that you wish to change
1.1 Starting up (list, in) • Type list to see all the observations • We can also look at the data by typing browse • You can use the in command to see a subset of observations • list in 1 • list in 2/5
1.1 Starting up (insheet) • Usually, you want to import a data file rather than input the data manually • The data file may not be in STATA format, e.g., a spreadsheet • To see how we import data, go to File, Import. • Many file types cannot be directly imported into STATA (e.g., Excel files, SAS files) without special software. • You can first save the data in a file that has text format (e.g., .csv). • If the data file is small enough I do this using Excel • If I am downloading data from Compustat or CRSP, I usually save it in text format straight away • Once your source data is stored in a text file you can import the data into STATA using the insheet command • For exampleinsheet using "J:\phd\Fees1.csv"
1.1 Starting up (drop, _all, clear) • You can only load a new dataset into STATA if there is no data currently in STATA’s memory • Otherwise you will get the error message, “you must start with an empty dataset” • Three solutions • drop all the variables in memory • drop _all • use clear at the end of the insheet command • insheet using "J:\phd\Fees1.csv", clear
Class exercise 1a • Go to My Site • Download “Fees1.xls” to your hard drive • Open the file in Excel. • Import the data into STATA • Play around with the commands we learned earlier (list, describe, drop, rename)
1.1 Starting up (browse, clear) • When using the list command, you can scroll through the data, page by page, by clicking on “more” or by entering a space in the command window • You can scroll right to the bottom of the dataset by entering multiple spaces in the command window • With such a large dataset, you probably don’t want to list every observation. How do we break the output? • Simultaneously press Ctrl, Alt, Break
1.1 Starting up (save, replace) • Let’s tidy up the variable names: • rename companytype1ifprivatecompanywith companytype • rename salesthousands sales • rename auditfeesthousandsauditfees • rename nonauditfeesthousandsnonauditfees • We can save the data in memory: • save "J:\phd\Fees1.dta" • Try to save it again – we get an error message that a file with this name already exists • save "J:\phd\Fees1.dta", replace
1.1 Starting up (keep, use, exit) • Suppose we want to drop some variables or keep them • drop sales, keep companyid yearend • Since we have saved the previous data in STATA format we can easily load it back again. When importing a STATA format file the command is use (not insheet) • use "J:\phd\Fees1.dta", clear • To exit STATA, you drop the variables in memory and type exit: • drop _all • exit
1.2 Creating and saving programs and output (log, capture) • You can record the commands and output from a session using the drop down File, log menu or type: • log using "J:\phd\output.log", replace • When you want to finish recording the session, type log close
1.2 Creating and saving programs and output (log, capture) • Using the term capture before a command is very useful because it ensures that STATA will ignore a command that is impossible to execute. Without capture, STATA would return an error code which would interrupt our program. • For example: type log using "J:\phd\output.log", replace twice • The second time does not work because the log file was already open and STATA cannot overwrite an open file. The best way around this is to always type capture log close before you try to open a log file.
Class exercise 1b • Start a log file in STATA • Load up a dataset (insheet the .csv file or use the .dta file) • Perform some commands on the data (e.g., describe, list) • Close the log file in STATA • Inspect the log file in a text editor (e.g., Word)
1.2 Creating and saving programs and output (doedit, do) • We can write a program that contains all the commands that we wish to perform using a “do” file • A do file can be written in any text editor (e.g., Word) including STATA’s own text editor • Type doeditin the command line • Type use "J:\phd\Fees1.dta", clear in the text editor and write some command (e.g., describe). Make sure you press Enter after the final command line. • Save this as a do file in the STATA editor (e.g., using the name program.do) • Run the do file in STATA using the drop down menu, File, Do or type do "J:\phd\program.do"
1.2 Creating and saving programs and output • Some suggestions • It is very important that you can reproduce whatever results you found previously • There could be a delay of several months between submitting to a journal and returning to the project for revision (i.e., you should not rely on your memory) • To do this you need to document all your commands in a program
1.2 Creating and saving programs and output • Work interactively only when you are exploring what you will want to include in the program • Write a “master program” which captures everything you want to keep: • from the very first step in your analysis where you import the data, right through to • the final steps where you have the results you are going to report in your paper • Within the “master” program you can embed “mini” programs (e.g., analyses performed on different subsets of the data)
1.2 Creating and saving programs and output • If you stick to the rule of having just ONE master program for each project it is very easy to keep track of what you have done (even years later) • This rule also means that you clearly see which data files you used to obtain your results
Class exercise 1c • Open up the text editor in which you are going to write your program (doedit is convenient for a small program). • In this program I want you to: • Open a log file that will record all the commands and output • Import the “Fees1.csv” data file using the insheetcommand • Use some commands (e.g., describe, rename, label) that help to make the data more easily understandable • Close the log file • Save your program and then run this program in STATA (using the do command)
1.3 Manipulating multiple datasets • So far we have been working with a single file that contains data for one year (2000) on seven variables • I have another file that contains the same variables for the year 2001 (Fees2.csv) • How do we combine these two datasets?
1.3 Manipulating multiple datasets • The data are organized such that each row corresponds to a company-year • So we need to add rows for the additional observations in 2001 • The variable names need to be the same in each dataset in order for the observations to appear in the correct columns
1.3 Manipulating multiple datasets (append) • We import the data for 2000 and then add the data for 2001 using the append command: • insheet using "J:\phd\Fees1.csv", clear • save "J:\phd\Fees1.dta", replace • insheet using "J:\phd\Fees2.csv", clear • append using "J:\phd\Fees1.dta" • save "J:\phd\Fees1.dta", replace
1.3 Manipulating multiple datasets (merge) • I have another file that contains data on different variables for the same companies in 2000 • current assets, current liabilities, total assets, auditor type • The data are saved in Fees6.csv • We need to add columns for the additional variables • How do we combine the two sets of variables for each unique company-year?
1.3 Manipulating multiple datasets (sort, merge) • We can add the columns using the sort and merge commands • insheet using "J:\phd\Fees1.csv", clear • sort companyid yearend • save "J:\phd\Fees1.dta", replace • describe • insheet using "J:\phd\Fees6.csv", clear • sort companyid yearend • merge companyid yearend using "J:\phd\Fees1.dta" • save "J:\phd\Fees1.dta", replace • describe • I include the describe commands to check that the number of observations is the same in the two datasets
1.3 Manipulating multiple datasets (merge, help) • Notice that after merging the two datasets, a new variable has been created called _merge. Why? • Whenever there is something about a command that we don’t understand we can easily find out the answer using the help command • We can use the help command to check out any of the commands in STATA, e.g., help describe, help drop, help help.
1.3 Manipulating multiple datasets (merge, help) help merge By default, merge creates a new variable, _merge, containing numeric codes concerning the source and the contents of each observation in the merged dataset. These codes are explained below in the match results table. numeric equivalent code word (results) description ------------------------------------------------------------------- 1 master observation appeared in master only 2 using observation appeared in using only 3 match observation appeared in both 4 match_update observation appeared in both, missing values updated 5 match_conflict observation appeared in both, conflicting nonmissing values
1.3 Manipulating multiple datasets (summarize, help) • I have asserted that the Fees6 and Fees1 data files contain identical company-years • If this is true, all the values of _merge should equal 3 • To check, we can inspect the values of _merge • list companyid yearend _merge • It would be too time-consuming to manually check every observation. A quicker way is to summarize this variable • summarize _merge • After checking that the merge has worked correctly, you should drop the _merge variable because otherwise any later merge commands cannot be executed. • drop _merge
1.4 Describing the data • You must carefully investigate your data before performing any tests or running regressions • The variable values should “make sense” • E.g., audit fees cannot be negative • Some databases assign numerical values to “missing data” (e.g., missing stock returns may be coded -99) • Missing values created by STATA are automatically assigned a non-numerical character (.) • The validity of statistical estimation often depends on the distributional characteristics of the data • are there problems of outliers? • is the distribution highly skewed?
1.4 Describing the data • There are many commands in STATA that are useful for inspecting the variables • The ones I most often use are: • summarize, centile, tabulate, and count • Graphs are also very useful • you will see how to use graphs later
1.4 Describing the data (detail) • use "J:\phd\Fees.dta", clear • We have already performed commands on one variable • e.g., summarize incorporationyear • We can also perform commands on multiple variables • e.g., summarize incorporationyear big6 • e.g., summarize incorporationyear-big6 • e.g., summarize _all or simply summarize • We can obtain more detailed descriptive statistics • e.g., summarize incorporationyear, detail
1.4 Describing the data (centile) • We may want to see more detail about the distribution than is provided by detail • We can get additional detail using centile • summarize auditfees, detail • centileauditfees, centile(0(10)100) • centileauditfees, centile(0(5)100)
1.4 Describing the data (tabulate) • For discrete variables, it is often more informative to use tabulate rather than summarize • For example, we want to know how many observations there are for each type of company • tabulate companytype
1.4 Describing the data (count) • We can find out how many observations take a particular value using count. For example, suppose we want to know how many companies are audited by Big Six and Non-Big Six audit firms • count if big6==1 • count if big6==0 • count if big6==0 & big6==1 • count if big6==0 | big6==1 • Missing values are coded “.” in STATA (unless your raw data uses a different coding) • Be careful because “.” is equivalent to infinity when you use the if qualifier • count if big6==. (indicates how many observations have missing values) • count if big6>=0 (the count includes missing observations) • count if big6<. (the count excludes missing observations) • count if big6!=. (the count excludes missing observations)
1.4 Describing the data (tabulate) • Often you will be interested in the joint distribution of two variables • If the variables are “discrete”, tabulate is a useful command • tabulate companytype big6 • tabulate is not useful if one or more variables are continuous • tabulate companytypeauditfees • Later I will show you how to group continuous variables into discrete intervals, which will enable you to use tabulate on the intervals of the continuous variable
1.4 Describing the data (tabulate) • Suppose you want to know what % of the Big 6 observations lie in each company type category • tabulate companytype big6, column • If you want to know what % of each company type lies in the Big 6 / Non-Big 6 categories • tabulate companytype big6, row • Not surprisingly, publicly traded companies tend to hire Big 6 auditors while private companies tend to hire Non-Big 6 auditors. • Later I will show you how to test whether there is a significant univariate association • You can report both % simultaneously • tab companytype big6, row col • or combine with the if qualifier • tab companytype big6 if companytype<=3, row col
1.4 Describing the data (by) • If one variable is continuous while the other is “discrete”, by is a useful tool • For example, suppose that you want to know the mean level of audit fees for each of the six types of company • You need to sort the discrete variable and then summarize the continuous variable for each value of the discrete variable: • sort companytype • by companytype: summarize auditfees • An even quicker way to do this: • by companytype, sort: summarize auditfees • You can even combine these commands with detail • by companytype, sort: summarize auditfees, detail
1.5 Constructing new variables • Often your raw data need to be transformed into new variables. • For example: • most audit fee studies use the log of audit fees rather than the raw value of fees. • we have a variable indicating six types of company, but maybe we are interested in whether companies are traded on a public equity market • How do we combine company types 2, 3, 5 (publicly traded) and types 1, 4, 6 into a binary variable?
1.5 Constructing new variables (gen, replace) • We will create a new variable (“listed”) that equals one if the company is traded on a public market, zero otherwise. • Be careful to ensure that the new variable is coded missing if the underlying variable is missing • gen listed=0 • replace listed=1 if companytype==2 • replace listed=1 if companytype==3 • replace listed=1 if companytype==5 • replace listed=. if companytype==.
1.5 Constructing new variables (gen, replace) • As you become more experienced, you will realize there are quicker ways to write the code: • drop listed • gen listed=0 if companytype<. • replace listed=1 if companytype==2 | companytype==3 | companytype==5 • It is always worth double-checking what you have done because it is easy to make programming errors: • tab listed companytype
1.5 Constructing new variables (gen) • It is very important to closely examine the variable that you are planning to transform. Example, suppose you want to generate a variable equal to the (natural) log of audit fees: • sum auditfees, detail • the mean is more than three times as large as the median • the standard deviation is very large (> six times the mean) • There are severe outliers at the top of the distribution
1.5 Constructing new variables (gen) • How will taking logs affect these distributional properties? • gen lnaf=ln(auditfees) • sum lnaf, detail • the mean is close to the median • the standard deviation is small (< the mean) • Any outliers are likely to be much less “influential”
Class exercise 1d • Simunic (1984) finds a significant positive association between audit fees and non-audit fees. He attributes this positive association to “knowledge spillovers” • You wish to examine whether the same association holds in your data. • Most prior studies use the log of fees rather than actual fees, so you have decided to generate new variables for the log of audit fees and the log of non-audit fees. • REQUIRED: Generate these two logged fee variables • Hint: you should first examine the distributions of the audit fee and non-audit fee variables
Datatypes • STATA distinguishes between “string” variables that contain letters (e.g., the company’s name) and “real” numbers (e.g., audit fees) • STATA stores the real number variables and the string variables differently
Datatypes • Strings are stored as str#, for instance, str1, str2, str3, ... str80. The number after the str indicates the maximum length of the string. A str5 could hold the word "male", but not the word "female" because "female" has six characters. • gen x="male" • gen y=female (why does this not work?)
Datatypes (compress) • You can provide a keyword with gen to specify the storage type of the new variable • gen str3 gender= "male" • list gender in 1/10 • There may be times when variables are using up more memory than they really need • drop gender • gen str30 gender= "male" • browse • The gender variable is using unnecessary memory • describe gender • compress gender
Datatypes • Similar keywords can be used for generating real numbers • drop x • gen int x=1 • describe x • Without a keyword the default storage type is float • drop x • gen x=1 • describe x • Usually you can allow STATA to choose the storage type for you • If you need to allocate less memory to data storage, you can use the compress command