
Discrete and Categorical Data

William N. Evans, Department of Economics, University of Maryland


Presentation Transcript


  1. Discrete and Categorical Data William N. Evans Department of Economics University of Maryland

  2. Part I Introduction

  3. Introduction • The workhorse statistical model in the social sciences is the multivariate regression model • Ordinary least squares (OLS) • yi = β0 + x1iβ1 + x2iβ2 + … + xkiβk + εi • yi = xiβ + εi

  4. Linear model yi = α + βxi + εi • α and β are “population” values: they represent the true relationship between x and y • Unfortunately, these values are unknown • The job of the researcher is to estimate these values • Notice that if we differentiate y with respect to x, we obtain • dy/dx = β

  5. β represents how much y will change for a fixed change in x • Increase in income for more education • Change in crime or bankruptcy when slots are legalized • Increase in test score if you study more

  6. Put some concreteness on the problem • State of Maryland budget problems • Drop in revenues • Expensive K-12 school spending initiatives • Short-term solution: raise the tax on cigarettes by 34 cents/pack • Problem: a tax hike will reduce consumption of the taxable product • Question for the state: as taxes are raised, how much will cigarette consumption fall?

  7. Simple model: yi = α + βxi + εi • Suppose y is a state’s per capita consumption of cigarettes • x represents taxes on cigarettes • Question: how much will y fall if x is increased by 34 cents/pack? • Problem: there are many reasons why people smoke, and cost is but one of them

  8. Data • (Y) State per capita cigarette consumption for the years 1980-1997 • (X) tax (State + Federal) in real cents per pack • “Scatter plot” of the data • Negative covariance between variables • When x > x̄, more likely that y < ȳ • When x < x̄, more likely that y > ȳ • Goal: pick values of α and β that “best fit” the data • Define best fit in a moment
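
The sign of the covariance described on this slide can be checked directly in STATA. The sketch below uses six made-up observations, not the 1980-1997 state data, and the variable names packs and tax are hypothetical. It is written with STATA's default end-of-line delimiter, so no "# delimit ;" line is needed.

```stata
* hypothetical data: packs per capita and tax in cents per pack
clear
input packs tax
125 20
118 25
110 32
104 40
95 48
90 55
end

* scatter plot of consumption against the tax
scatter packs tax

* the covariance option reports variances and the covariance
* instead of the correlation coefficient
correlate packs tax, covariance
```

Because packs falls every time tax rises in these illustrative numbers, the off-diagonal entry of the reported matrix is negative.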

  9. Notation • True model • yi = α + βxi + εi • We observe data points (yi, xi) • The parameters α and β are unknown • The actual error (εi) is unknown • Estimated model • (a, b) are estimates for the parameters (α, β) • ei is an estimate of εi where • ei = yi − a − bxi • How do you estimate a and b?

  10. Objective: Minimize sum of squared errors • Min Σi ei² = Σi (yi − a − bxi)² • Minimize the sum of squared errors (SSE) • Treat positive and negative errors equally • Over- or under-predicting by “5” is the same magnitude of error • “Quadratic form” • The optimal values for a and b are those that make the first derivatives equal zero • Functions reach min or max values when derivatives are zero
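
The first-derivative conditions mentioned on this slide can be written out as a short derivation. This is the standard textbook result for the one-regressor case, offered as a sketch; the sample means \bar{x} and \bar{y} are notation introduced here and are not defined on the slide.

```latex
\begin{aligned}
SSE(a,b) &= \sum_i (y_i - a - b x_i)^2 \\
\frac{\partial SSE}{\partial a} &= -2\sum_i (y_i - a - b x_i) = 0 \\
\frac{\partial SSE}{\partial b} &= -2\sum_i x_i\,(y_i - a - b x_i) = 0 \\
\Rightarrow\quad
b &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad a = \bar{y} - b\,\bar{x}
\end{aligned}
```

The slope is the sample covariance of x and y divided by the sample variance of x, so a negative covariance implies a negative estimated slope.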

  11. The model has a lot of nice features • Statistical properties easy to establish • Optimal estimates easy to obtain • Parameter estimates are easy to interpret • Model maximizes prediction • If you minimize SSE you maximize R2 • The model does well as a first order approximation to lots of problems
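
The claims that optimal estimates are easy to obtain and that minimizing SSE maximizes R2 can be illustrated with a small STATA sketch. The data are made up for illustration, the variable names packs and tax are hypothetical, and the code uses the default end-of-line delimiter. It computes a and b from the closed-form formulas, compares them with the output of regress, and rebuilds R2 as 1 − SSE/SST.

```stata
* hypothetical data: packs per capita and tax in cents per pack
clear
input packs tax
125 20
118 25
110 32
104 40
95 48
90 55
end

* slope = cov(y,x)/var(x), intercept = ybar - b*xbar
quietly correlate packs tax, covariance
scalar b = r(cov_12)/r(Var_2)
quietly summarize packs
scalar ybar = r(mean)
quietly summarize tax
scalar xbar = r(mean)
scalar a = ybar - b*xbar
display "by hand: a = " scalar(a) "  b = " scalar(b)

* the same estimates from the built-in OLS command
regress packs tax
display "regress: a = " _b[_cons] "  b = " _b[tax]

* R-squared = 1 - SSE/SST, where SST = model SS + residual SS
display "1 - SSE/SST = " 1 - e(rss)/(e(mss) + e(rss))
display "e(r2)       = " e(r2)
```

The two pairs of estimates should agree, and the hand-built R2 should match the value regress stores in e(r2).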

  12. Discrete and Qualitative Data • The OLS model works well when y is a continuous variable • Income, wages, test scores, weight, GDP • Does not have as many nice properties when y is not continuous • Example: doctor visits • Integer values • Low counts for most people • Mass of observations at zero

  13. Downside of forcing non-standard outcomes into OLS world? • Can predict outside the allowable range • e.g., negative MD visits • Does not describe the data generating process well • e.g., mass of observations at zero • Violates many properties of OLS • e.g. heteroskedasticity
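
A minimal sketch of the first downside, predictions outside the allowable range. The ten observations are invented for illustration, the variable names visits, age, and visits_hat are hypothetical, and the code uses the default end-of-line delimiter.

```stata
* hypothetical data: doctor visits with a mass of observations at zero
clear
input visits age
0 25
0 30
0 35
0 40
0 45
0 50
1 55
0 60
2 65
6 70
end

* fit OLS and store the fitted values
regress visits age
predict visits_hat, xb

* fitted values below zero are impossible for a count of visits
list age visits visits_hat if visits_hat < 0
```

With these made-up numbers the fitted line is negative for the three youngest ages, so OLS predicts a negative number of doctor visits for them.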

  14. This talk • Look at situations when the data generating process does not lend itself well to OLS models • Mathematically describe the data generating process • Show how we use a different optimization procedure to obtain estimates • Describe the statistical properties

  15. Show how to interpret parameters • Illustrate how to estimate the models with popular program STATA

  16. Types of data generating processes we will consider • Dichotomous events (yes or no) • 1=yes, 0=no • Graduate high school? Work? Obese? Smoke? • Ordinal data • Self-reported health (poor, fair, good, excellent) • Strongly disagree, disagree, agree, strongly agree

  17. Count data • Doctor visits, lost workdays, fatality counts • Duration data • Time to failure, time to death, time to re-employment

  18. Econometric Resources • Recommended textbook • Jeffrey Wooldridge (undergraduate and graduate texts) • Lots of insight and mathematical/statistical detail • Very good examples • Helpful web sites • My graduate class • Jeff Smith’s class

  19. Part II A quick introduction to STATA

  20. STATA • Very fast, convenient, well-documented, cheap and flexible statistical package • Excellent for cross-section/panel data projects, not as great for time series but getting better • Not as easy to manipulate large data sets from flat files as SAS • I usually clean data in SAS, estimate models in STATA

  21. Key characteristic of STATA • All data must be loaded into RAM • Computations are very fast • But the size of the project is limited by available memory • Results can be generated two different ways • Command line • Write a program (*.do), then submit it from the command line

  22. Sample program to get you started • cps87_or.do • Program gets you to the point where you can • Load data into memory • Construct new variables • Get simple statistics • Run a basic regression • Store the results on disk

  23. Data (cps87_or.dta) • Random sample of data from the 1987 Current Population Survey outgoing rotation group • Sample selection • Males • Ages 21-64 • Working 30+ hours/week • 19,906 observations

  24. Major caveat • Hardest thing to learn/do: get data from some other source into a STATA data set • We skip over that part • All the data sets are already stored as STATA data files that can be loaded by typing: use data file name

  25. Housekeeping at the top of the program
      * this line defines the semicolon as the
      * end of line delimiter;
      # delimit ;
      * set memory for 10 meg;
      set memory 10m;
      * write results to a log file;
      * the replace option writes over old;
      * log files;
      log using cps87_or.log,replace;
      * open stata data set;
      use c:\bill\stata\cps87_or;
      * list variables and labels in data set;
      desc;

  26. Output of desc
      ------------------------------------------------------------------------------
                    storage  display     value
      variable name   type   format      label      variable label
      ------------------------------------------------------------------------------
      age             float  %9.0g                  age in years
      race            float  %9.0g                  1=white, non-hisp, 2=place,
                                                    n.h, 3=hisp
      educ            float  %9.0g                  years of education
      unionm          float  %9.0g                  1=union member, 2=otherwise
      smsa            float  %9.0g                  1=live in 19 largest smsa,
                                                    2=other smsa, 3=non smsa
      region          float  %9.0g                  1=east, 2=midwest, 3=south,
                                                    4=west
      earnwke         float  %9.0g                  usual weekly earnings
      ------------------------------------------------------------------------------

  27. Constructing new variables • Use the ‘gen’ command to generate new variables • Syntax • gen new variable name=math statement • Easily construct new variables via • Algebraic operations • Math/trig functions (ln, exp, etc.) • Logical operators (when true, =1, when false, =0)

  28. From program
      * generate new variables;
      * lines 1-2 illustrate basic math functions;
      * lines 3-4 illustrate logical operators;
      * line 5 illustrates the OR statement;
      * line 6 illustrates the AND statement;
      * after you construct new variables, compress the data again;
      gen age2=age*age;
      gen earnwkl=ln(earnwke);
      gen union=unionm==1;
      gen topcode=earnwke==999;
      gen nonwhite=((race==2)|(race==3));
      gen big_ne=((region==1)&(smsa==1));

  29. Getting basic statistics • desc -- describes variables in the data set • sum – gets summary statistics • tab – produces frequencies (tables) of discrete variables

  30. * get descriptive statistics;
      sum;
      * get detailed descriptive statistics for continuous variables;
      sum earnwke, detail;
      * get frequencies of discrete variables;
      tabulate unionm;
      tabulate race;
      * get two-way table of frequencies;
      tabulate region smsa, row column cell;

  31. STATA Resources - Specific • “Regression Models for Categorical Dependent Variables Using STATA” • J. Scott Long and Jeremy Freese • Available for sale from STATA website for $52 (www.stata.com) • Post-estimation subroutines that translate results • Do not need to buy the book to use the subroutines

  32. In STATA command line type • net search spost • Will give you a list of available programs to download • One is Spostado from http://www.indiana.edu/~jslsoc/stata • Click on the link and install the files

  33. Continuous Distributions • Random variables with infinite number of possible values • Examples -- units of measure (time, weight, distance) • Many discrete outcomes can be treated as continuous, e.g., SAT scores

  34. How to describe a continuous random variable • The Probability Density Function (PDF) • The PDF for a random variable x is defined as f(x), where f(x) ≥ 0 and ∫f(x)dx = 1 • Calculus review: The integral of a function gives the “area under the curve”

  35. Cumulative Distribution Function (CDF) • Suppose x is a “measure” like distance or time • 0 ≤ x < ∞ • We may be interested in Pr(x ≤ a)

  36. CDF • What if we consider all values?

  37. Properties of CDF • Note that Pr(x ≤ b) + Pr(x > b) = 1 • Pr(x > b) = 1 − Pr(x ≤ b) • Many times, it is easier to work with complements
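
A worked example may help tie the PDF, the CDF, and the complement rule together. The exponential distribution used below is an assumption chosen for illustration because it lives on 0 ≤ x < ∞ like the “measure” on the earlier slide; it does not appear in the slides themselves, and λ > 0 is a hypothetical rate parameter.

```latex
\begin{aligned}
f(x) &= \lambda e^{-\lambda x}, \qquad x \ge 0,\ \lambda > 0 \\
\int_0^{\infty} f(x)\,dx &= \left[-e^{-\lambda x}\right]_0^{\infty} = 1 \\
F(a) = \Pr(x \le a) &= \int_0^{a} \lambda e^{-\lambda x}\,dx = 1 - e^{-\lambda a} \\
\Pr(x > b) &= 1 - F(b) = e^{-\lambda b}
\end{aligned}
```

Here the complement Pr(x > b) has the simpler form, which is one reason it is often easier to work with.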

  38. General notation for continuous distributions • The PDF is described by lower case such as f(x) • The CDF is defined as upper case such as F(a)

  39. Standard Normal Distribution • Most frequently used continuous distribution • Symmetric “bell-shaped” distribution • As we will show, the normal has useful properties • Many variables we observe in the real world look normally distributed. • Can translate normal into ‘standard normal’

  40. Examples of variables that look normally distributed • IQ scores • SAT scores • Heights of females • Log income • Average gestation (weeks of pregnancy) • As we will show in a few weeks, sample means are approximately normally distributed in large samples

  41. Standard Normal Distribution • PDF: φ(z) = (1/√(2π)) exp(−z²/2) • For −∞ < z < ∞

  42. Notation • φ(z) is the standard normal PDF evaluated at z • Φ[a] = Pr(z ≤ a)

  43. Standard Normal • Notice that: • Normal is symmetric: φ(a) = φ(−a) • Normal is “unimodal” • Median=mean • Area under curve=1 • Almost all area is between −3 and 3 • Evaluations of the CDF are done with • Statistical functions (Excel, SAS, etc.) • Tables

  44. Standard Normal CDF • Pr(z ≤ −0.98) = Φ[−0.98] = 0.1635

  45. Pr(z ≤ 1.41) = Φ[1.41] = 0.9207
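
The two table lookups above can also be reproduced in STATA itself. This sketch relies on the built-in functions normal() for the standard normal CDF and normalden() for the PDF; the %6.4f display format is a choice made here to match the four decimals on the slides.

```stata
* standard normal CDF at the two values used on the slides
display %6.4f normal(-0.98)
display %6.4f normal(1.41)

* complement rule: Pr(z > 1.41) = 1 - Pr(z <= 1.41)
display %6.4f 1 - normal(1.41)

* symmetry of the PDF: the next two lines print the same value
display %6.4f normalden(0.98)
display %6.4f normalden(-0.98)
```

The first two lines print 0.1635 and 0.9207, the values reported on the slides.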
