STATS & RELATED

STATS & RELATED GEOG 111/211A

Types of Data • Qualitative data*: non-numeric; also called categorical. Examples: colors, types of material, activities, car types • Quantitative data, discrete: count data (1, 2, 3, 4, 5…); ordered (worst to best) • Quantitative data, continuous: infinitely many possible values (length, volume, time) • *Do not confuse with qualitative research methods

Additional Concepts • Nominal = categories that are given a name but have no order • Ordinal = categories have an order, but the interval between categories has no meaning • Interval = the interval between measurements has meaning, but there is no true zero point • Ratio = order and interval both have meaning and there is a true zero, so ratios of values are meaningful

Population and sample • Population = the totality of units exhibiting a property to study • Example: Santa Barbara travel behavior – the population is all residents and visitors • Sample = a portion of the population • We want the sample to represent the population • We use sample statistics to INFER population characteristics • Inference from a sample always involves some error

Terms • Parameter = a characteristic of the population • Statistic = a characteristic of the sample • Estimate = the value of a statistic used to approximate the value of a parameter • Example: the sample average is used to estimate the population mean

Data Collection Settings • Cross Section: Observations on "many" observational units, all drawn at the same point in time. • Example: Typical home interviews such as the National Household Travel Survey (http://www.bts.gov/programs/national_household_travel_survey/) • Time Series: Many observations on one observational unit drawn at a number of points in time, usually equally spaced over time. • Example: National yearly averages such as energy consumption by vehicle type • Panel: Time-series cross sections; large numbers of cross-sectional units observed at a few points in time. • Example: The Dutch National Mobility Panel Data Set ---> every six months, diaries and sociodemographic interviews of 7,000 persons

Discussion • Which survey/database is good for what? • Examples? • What would you do to perfectly measure travel behavior of a population?

Data Analysis • Begin with summary descriptive statistics. • 1. Measures of location ---> CENTER of the data • 2. Measures of scale ---> DISPERSION of the data • 3. Measures of association ---> RELATION within data

Measures of Central Tendency • Average • Also called arithmetic mean • Median = middle ranked observation (50% left and 50% right) • Sample midrange = (Max+Min)/2
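The three measures of center can be compared on a small data set. This is a minimal sketch; the travel times are hypothetical values chosen for illustration, and the midrange is taken as the midpoint between the smallest and largest observation.

```python
import statistics

# Hypothetical daily travel times in minutes (illustrative values only).
travel_times = [12, 15, 20, 22, 25, 30, 95]

mean_time = statistics.mean(travel_times)      # arithmetic mean
median_time = statistics.median(travel_times)  # middle ranked observation
# Midrange: midpoint between the smallest and the largest observation.
midrange_time = (max(travel_times) + min(travel_times)) / 2

print("mean:", round(mean_time, 2))
print("median:", median_time)
print("midrange:", midrange_time)
```

The single large value (95) pulls the mean and especially the midrange upward, while the median stays with the bulk of the data.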

Measures of Dispersion • Sample Standard Deviation
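A minimal sketch of the sample standard deviation, assuming the usual definition that divides the sum of squared deviations by n - 1. The commute times are hypothetical.

```python
import math
import statistics

# Hypothetical commute times in minutes (illustrative values only).
commute_minutes = [10, 20, 30, 40, 50]

n = len(commute_minutes)
mean = sum(commute_minutes) / n
# Sum of squared deviations from the sample mean.
sum_sq_dev = sum((x - mean) ** 2 for x in commute_minutes)
# Sample variance divides by n - 1, not n.
sample_variance = sum_sq_dev / (n - 1)
sample_std = math.sqrt(sample_variance)

# The standard library gives the same n - 1 based result.
library_std = statistics.stdev(commute_minutes)

print(round(sample_std, 4), round(library_std, 4))
```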

Preferable Statistics • Prefer statistics that use all the data at hand • The best choice depends on the situation • Make sure you do not convey unwanted messages • Be wary of typical statements such as "the average American drives 15,000 miles a year" (really?)

Box Plot • A box plot = a basic graphing tool that displays the centering, spread, and distribution of a continuous data set (a variable) • Provides a five-number summary of the data (minimum, first quartile, median, third quartile, maximum): • The box represents the middle 50% of the data. • The median is the point where 50% of the data is above it and 50% below it. • The 25th percentile (first quartile) is where, at most, 25% of the data fall below it. • The 75th percentile (third quartile) is where, at most, 25% of the data is above it. • The whiskers cannot extend any further than 1.5 times the interquartile range (the length of the box) beyond the box edges. • Data points outside the whiskers show up as outliers.
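The quantities behind a box plot can be computed directly. The sketch below uses hypothetical data and the "inclusive" quartile method of Python's `statistics.quantiles`; other software may use slightly different quartile conventions, so the box edges can differ a little between tools.

```python
import statistics

# Hypothetical trip distances in miles (illustrative values only).
distances = [2, 4, 5, 7, 8, 9, 11, 12, 14, 40]

# Quartiles: the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(distances, n=4, method="inclusive")
iqr = q3 - q1  # length of the box

# Whiskers may reach at most 1.5 box lengths beyond the box edges.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [x for x in distances if lower_fence <= x <= upper_fence]
outliers = [x for x in distances if x < lower_fence or x > upper_fence]

# Whiskers end at the most extreme data points still inside the fences.
whisker_low, whisker_high = min(inside), max(inside)

print("quartiles:", q1, q2, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```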

Sample Statistics • Always report sample size = a measure of credibility, because the more units we have the more credible our study becomes (unless there are other limitations)

Box Plot Example

• Skewness = measure of the asymmetry of a distribution (based on the expectation of the cubed deviation from the mean). Perfectly symmetric distribution ---> skewness = 0. • Kurtosis = measure of the thickness of the tails of a distribution (based on the expectation of the deviation to the fourth power). [3 for normal distribution] • (Kurtosis - 3) = excess kurtosis.
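A minimal sketch of moment-based skewness and kurtosis, assuming the plain standardized-moment definitions (third and fourth central moments divided by the appropriate power of the variance). Statistical packages often apply small-sample corrections, so their output may differ slightly. The data are hypothetical.

```python
def central_moment(values, order):
    """Average of (x - mean)**order over the data."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** order for x in values) / len(values)

def skewness(values):
    """Third central moment scaled by variance**1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth central moment scaled by variance**2 (normal distribution: 3)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1, 2, 3, 4, 5]          # hypothetical, perfectly symmetric
right_skewed = [1, 1, 2, 2, 3, 10]   # hypothetical, long right tail

print("symmetric skewness:", skewness(symmetric))
print("right-skewed skewness:", round(skewness(right_skewed), 3))
print("excess kurtosis (symmetric):", round(kurtosis(symmetric) - 3, 3))
```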

All this was for one variable. What happens when we consider two variables at the same time and their relationship?

Measures of Association Covariance

Measures of Association Standardized Covariance = Correlation
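A minimal sketch of both measures, assuming the sample (n - 1) definition of covariance and the Pearson correlation as covariance divided by the product of the two standard deviations. The household data are hypothetical.

```python
import math

# Hypothetical households: number of workers and number of work trips.
workers = [0, 1, 1, 2, 3]
work_trips = [0, 2, 3, 4, 6]

def sample_covariance(x, y):
    """Average cross-product of deviations, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Covariance standardized by both standard deviations."""
    std_x = math.sqrt(sample_covariance(x, x))
    std_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (std_x * std_y)

cov = sample_covariance(workers, work_trips)
r = correlation(workers, work_trips)
print("covariance:", cov)
print("correlation:", round(r, 4))
```

Covariance depends on the units of both variables; the correlation is unit-free and always lies between -1 and 1.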

Correlation Types

Example • WORK1 = number of trips to work in the two days • NFEM = number of females in household • NWRKR = number of employed people in household • LOW, MED = yearly gross income level dummies

Descriptive Statistics
Variable   Mean      Std. Dev.   Skew.    Kurt.   Minimum   Maximum   Cases
WORK1      2.7050    2.5247       1.227   5.884   0.0000    17.00     400
NFEM       0.89750   0.48226      0.401   8.455   0.0000    4.000     400
NWRKR      1.0625    0.79698      0.185   2.384   0.0000    4.000     400
LOW        0.18750   0.39080      1.599   3.555   0.0000    1.000     400
MED        0.66750   0.47170     -0.710   1.502   0.0000    1.000     400

Covariance Matrix
            1-WORK1        2-NFEM         3-NWRKR        4-LOW      5-MED
1-WORK1      6.3739
2-NFEM       0.21781        0.23258
3-NWRKR      1.4521         0.99154E-01    0.63518
4-LOW       -0.15257       -0.32895E-02   -0.61873E-01    0.15273
5-MED       -0.28158E-01   -0.65977E-02   -0.42293E-02   -0.12547   0.22250
• 1. DIAGONAL CONTAINS THE VARIANCES • 2. THIS IS A SYMMETRIC MATRIX (only the lower triangle is shown) • Example: The covariance between NFEM and WORK1 is positive (0.21781), so as the number of females in the household increases the number of work trips tends to increase
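Because covariances depend on the units of the variables, it can help to standardize them. The sketch below converts one entry of the covariance matrix (WORK1 with NFEM) into a correlation using the two variances on the diagonal. The variable names in the code are illustrative.

```python
import math

# Entries copied from the covariance matrix above.
var_work1 = 6.3739        # variance of WORK1 (diagonal)
var_nfem = 0.23258        # variance of NFEM (diagonal)
cov_work1_nfem = 0.21781  # covariance of WORK1 and NFEM

# Correlation = covariance / (std. dev. of WORK1 * std. dev. of NFEM).
corr_work1_nfem = cov_work1_nfem / math.sqrt(var_work1 * var_nfem)

print(round(corr_work1_nfem, 3))
```

The result is positive but well below 1, which suggests the relationship between the number of females and work trips is fairly weak in this sample.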

Web resources • General information (http://davidmlane.com/hyperstat/A37797.html) • Box Plot for Excel (http://www.analyse-it.com/boxplot_y.htm ---- see also http://course1.winona.edu/cmalone/excel/templates/boxplot.xls) • http://www.statsoft.com/textbook/stathome.html

Digression on Accuracy and Precision • Accuracy = measure of rightness • Precision = measure of exactness • We want our sample estimates to be both accurate and precise with respect to the population values

[Figure: four dartboard targets arranged by PRECISE (yes/no) and ACCURATE (yes/no). Source: Most likely a Chemistry book(?) – see also http://www.flatsurv.com/accuprec.htm]

Precision vs Accuracy • Neither precise nor accurate = The darts are not clustered together and are not near the bull's eye. • Precise but not accurate = The darts are clustered together but did not hit the intended mark. • Accurate but not precise = The darts are not clustered, but their 'average' position is the center of the bull's eye. • Accurate and precise = The darts are tightly clustered and their average position is the center of the bull's eye.

Estimation of Parameters • Objective: From the sample data infer the value of a population parameter Θ • Point Estimate: A statistic computed from a sample that provides a single value for the population parameter Θ • Standard Error: The standard deviation of the sampling distribution of the statistic • Interval Estimate: A range of values expected to contain the true parameter Θ with a stated level of confidence

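More on estimation • Relationship between interval estimate and point estimate: the interval is typically built around the point estimate, as point estimate ± a multiple of its standard error • Estimator: A strategy for using the data to estimate the parameter Θ • Discussion: What is a "good" estimator of the population mean?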
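As a sketch of how a point estimate, its standard error, and an interval estimate fit together, the code below uses the WORK1 figures from the descriptive statistics example (mean 2.7050, standard deviation 2.5247, 400 cases). It assumes a simple random sample and a large-sample normal approximation with the multiplier 1.96 for roughly 95% confidence; both are assumptions, not statements from the slides.

```python
import math

# WORK1 summary values from the descriptive statistics example.
sample_mean = 2.7050   # point estimate of the population mean
sample_std = 2.5247
sample_size = 400

# Standard error of the sample mean.
standard_error = sample_std / math.sqrt(sample_size)

# Assumed multiplier for an approximate 95% interval (normal approximation).
z_multiplier = 1.96
margin = z_multiplier * standard_error

lower, upper = sample_mean - margin, sample_mean + margin
print(f"standard error: {standard_error:.4f}")
print(f"approx. 95% interval: ({lower:.3f}, {upper:.3f})")
```

A larger sample shrinks the standard error and therefore narrows the interval around the same point estimate.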

What is a good estimator? The "goodness" of an estimator is based on many properties: • Finite sample properties -- valid regardless of sample size • Some Criteria: • Unbiasedness = The estimator is on the target Θ - accurate • Efficiency = Highly concentrated around the target, min variance - precise • Large sample properties -- valid as sample size increases (asymptotic properties) • Some Criteria: • Consistency = Eventually (as the sample size increases) the estimator will be on target – accurate for large samples
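Unbiasedness can be illustrated by listing every possible sample from a tiny population and averaging the estimates. This is a sketch under stated assumptions: a hypothetical population of three values and samples of size 2 drawn with replacement, each equally likely.

```python
from itertools import product

population = [2, 4, 9]  # hypothetical population values
N = len(population)
pop_mean = sum(population) / N
pop_var = sum((x - pop_mean) ** 2 for x in population) / N

# Every equally likely sample of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

def sample_mean(s):
    return sum(s) / len(s)

def var_n_minus_1(s):
    m = sample_mean(s)
    return sum((x - m) ** 2 for x in s) / (len(s) - 1)

def var_n(s):
    m = sample_mean(s)
    return sum((x - m) ** 2 for x in s) / len(s)

# Average each estimator over all possible samples (its expectation).
avg_of_means = sum(sample_mean(s) for s in samples) / len(samples)
avg_var_unbiased = sum(var_n_minus_1(s) for s in samples) / len(samples)
avg_var_biased = sum(var_n(s) for s in samples) / len(samples)

print("population mean:", pop_mean, "average sample mean:", avg_of_means)
print("population variance:", round(pop_var, 4))
print("average of n-1 variance:", round(avg_var_unbiased, 4))
print("average of n variance:", round(avg_var_biased, 4))
```

The sample mean and the n - 1 variance land on their targets on average, while the divide-by-n variance falls short, which is what "biased" means here.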

Hypothesis Testing • See: http://davidmlane.com/hyperstat/logic_hypothesis.html

Regression Objective • Explain the variation of a variable as a function of other variables and chance • Why do we do that? • Understand how people behave • Use a model to predict behavior

Regression • Multiple cross-sectional regression model • K independent variables (explanatory) • Y is the dependent (to be explained)
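One common way to write this model is sketched below. The symbols are assumed notation rather than taken from the slide: i indexes the observational unit, the β terms are the coefficients to estimate, and ε is the chance (error) term.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_K x_{iK} + \varepsilon_i
```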

Example 1

Regression • Explain travel time as a function of age and cars in household • What do you expect?

Excel Regression Output • Regression Statistics • ANOVA table • Model estimates • Tools – Data Analysis – Regression – follow options • If Data Analysis is not in Tools you need to add the Analysis ToolPak • Tools – Add-Ins – Analysis ToolPak

ANOVA OUTPUT • SS Regression = explained variation of the Y variable by the Xs (the larger the better) • SS Total = total variation of the Y variable • SS Residual = variation of the Y variable not explained (the smaller the better)

OUTPUT Example 1 • R Square = SS Regression / SS Total
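The ratio SS Regression / SS Total is the share of the variation in Y explained by the model (R squared). A minimal sketch with one explanatory variable and hypothetical data shows how the sums of squares split:

```python
# Hypothetical data: x = cars in household, y = daily trips.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares slope and intercept for a single explanatory variable.
slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
         / sum((a - mean_x) ** 2 for a in x))
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * a for a in x]

# Variation decomposition: total = explained + unexplained.
ss_total = sum((b - mean_y) ** 2 for b in y)
ss_regression = sum((f - mean_y) ** 2 for f in fitted)
ss_residual = sum((b - f) ** 2 for b, f in zip(y, fitted))

r_squared = ss_regression / ss_total
print("SS total:", ss_total)
print("SS regression:", round(ss_regression, 4))
print("SS residual:", round(ss_residual, 4))
print("R squared:", round(r_squared, 4))
```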

Estimates Is this a good model for the data? SPSS output

Example 2

Equation I Used • Travel time = 10 + 0.5*age + 10* cars • If we run regression with this known equation we get:
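The experiment can be sketched in code: generate travel times exactly from the known equation, then fit a regression and check that the coefficients come back. The ages and car counts are hypothetical, and the small least-squares solver (normal equations with Gaussian elimination) is only an illustration, not the routine Excel or SPSS uses.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical persons: age and cars in household.
ages = [20, 35, 50, 65, 30, 45]
cars = [0, 1, 2, 1, 2, 0]

# Travel time generated with no chance component at all.
travel_time = [10 + 0.5 * a + 10 * c for a, c in zip(ages, cars)]

# Each row: constant term, age, cars.
design = [[1.0, a, c] for a, c in zip(ages, cars)]
intercept, age_coef, cars_coef = ols(design, travel_time)
print(round(intercept, 6), round(age_coef, 6), round(cars_coef, 6))
```

With no chance component the fit is perfect, so the residual variation is zero and the test statistics in regression output become enormous.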

ANOVA and Coefficients • Huge values – this never happens in the real world

Hypothesis testing in regression • When is a coefficient significantly different from zero? • Rule of thumb? • Specification = definition of variables • Strategies for specification?

Example 3 Data

Regression Model

Estimates • Predicted travel time for a zero-year-old with no cars in the household is the intercept, 64.096, on average per day

Estimates (10 year old) • Predicted travel time for a 10-year-old with 1 car in household is 64.096 + 10 * 0.015 + 10.241 * 1 = 74.487
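The same prediction can be written as a small function. The coefficients are the estimates quoted in this example; the function and constant names are illustrative.

```python
# Estimates from Example 3.
INTERCEPT = 64.096
AGE_COEF = 0.015    # change in travel time per year of age
CARS_COEF = 10.241  # change in travel time per car in household

def predicted_travel_time(age, cars):
    """Predicted average daily travel time from the estimated model."""
    return INTERCEPT + AGE_COEF * age + CARS_COEF * cars

print(round(predicted_travel_time(0, 0), 3))   # zero-year-old, no cars
print(round(predicted_travel_time(10, 1), 3))  # 10-year-old, one car
```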

Regression summary • Try to explain as much variation as possible with Xs • Detect which Xs offer good explanation with hypothesis testing, goodness of fit, and telling us a convincing story • Use the regression model to predict a variable of interest (example: Number of trips in trip generation) • There are many types of regression models depending on the variable we want to explain (more on this in Mode choice and Geog 211B)
