Data Analysis and Statistics
PERPI Training, Hotel Puri Denpasar, March 30, 2017 (Version 2)
by T.S. Lim, Quantitative Senior Research Director and Partner, Leap Research
Agenda
1 What is Statistics?
2 Types of Variables and Levels of Measurement
3 Descriptive Statistics
4 Inferential Statistics
5 Independent and Dependent Samples
References
• Carr, Rodney. Practical Statistics. XLent Works. http://www.deakin.edu.au/~rodneyc/PracticalStatistics/, 2013
• Gonick, Larry, and Woollcott Smith. The Cartoon Guide to Statistics. New York: HarperPerennial, 2015. Kindle edition
• Lind, Douglas A., William G. Marchal, and Samuel A. Wathen. Statistical Techniques in Business & Economics. 15th ed. New York: McGraw-Hill/Irwin, 2012
• Malhotra, Naresh K. Marketing Research: An Applied Orientation. Global Edition, 6th ed. Upper Saddle River: Pearson Education, 2010
• Rumsey, Deborah. Statistics Essentials For Dummies. Hoboken: Wiley, 2010
Statistics
• The science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making more effective decisions
• 2 categories: descriptive statistics and inferential statistics
• DESCRIPTIVE STATISTICS: Methods of organizing, summarizing, and presenting data in an informative way
  • E.g., via various charts, tables, infographics
• INFERENTIAL STATISTICS: The methods used to estimate a property of a population on the basis of a sample
  • E.g., T-Test, Z-Test, ANOVA, Regression Analysis, Factor Analysis, Cluster Analysis
Source: Lind, Marchal, and Wathen (2012)
Ethics and Statistics
• A guideline can be found in the paper “Statistics and Ethics: Some Advice for Young Statisticians,” in The American Statistician 57, no. 1 (2003)
• The authors advise us to practice statistics with integrity and honesty, and urge us to “do the right thing” when collecting, organizing, summarizing, analyzing, and interpreting numerical information
• The real contribution of statistics to society is a moral one. Financial analysts need to provide information that truly reflects a company’s performance so as not to mislead individual investors.
• Information regarding product defects that may be harmful to people must be analyzed and reported with integrity and honesty
• The authors of The American Statistician article further indicate that when we practice statistics, we need to maintain “an independent and principled point-of-view”
In Marketing Research, we change the data values only when it’s clearly justifiable; e.g., a data entry or coding error. We must never change the values just to increase / decrease the mean score.
Source: Lind, Marchal, and Wathen (2012), page 14
Types of Variables
Source: Lind, Marchal, and Wathen (2012)
Four Levels of Measurement
Data can be classified according to levels of measurement. The level of measurement of the data dictates the calculations that can be done to summarize and present the data. It will also determine the statistical tests that should be performed.
• Nominal Level: Observations of a qualitative variable can only be classified and counted
• Ordinal Level: Data are represented by sets of labels or names; they have relative values and hence they can be ranked or ordered
• Interval Level: It includes all the characteristics of the ordinal level, and additionally the difference between values is a constant size
• Ratio Level: It has all the characteristics of the interval level, and additionally the 0 point is meaningful and the ratio between two numbers is meaningful
Source: Lind, Marchal, and Wathen (2012)
Four Levels of Measurement Summary
Source: Lind, Marchal, and Wathen (2012)
In Marketing Research, we usually assume that variables that are not Nominal level are at least Interval level
Measures of Location
• The measures of location that we discuss are measures of central tendency because they tend to describe the center of the distribution
• If the entire sample is changed by adding a fixed constant to each observation, then the mean, mode and median change by the same fixed amount
• Mean: The mean, or average value, is the most commonly used measure of central tendency
  • The measure is used to estimate the unknown population mean when the data have been collected using an interval or ratio scale
  • The data should display some central tendency, with most of the responses distributed around the mean
  • Note: The sample mean is sensitive to outliers (very big or very small numbers) in the data
Source: Malhotra (2010)
Measures of Location (Cont.)
• Mode: The mode is the value that occurs most frequently
  • It represents the highest peak of the distribution
  • The mode is a good measure of location when the variable is inherently categorical or has otherwise been grouped into categories
• Median: The median of a sample is the middle value when the data are arranged in ascending or descending order
  • If the number of data points is even, the median is usually estimated as the midpoint between the two middle values by adding the two middle values and dividing their sum by 2
  • The median is the 50th percentile
  • The median is an appropriate measure of central tendency for ordinal data
  • Note: The sample median is robust to the presence of outliers in the data. However, the mathematics involved in dealing with the median, and with ordinal level data in general, is difficult.
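The three measures of location, the effect of adding a constant, and the outlier behavior noted above can be checked with a minimal sketch using Python's standard `statistics` module. All data values here are hypothetical and chosen only for illustration.

```python
from statistics import mean, median, multimode

# Hypothetical 5-point Overall Liking scores (illustrative data only)
scores = [4, 5, 3, 4, 4, 2, 5, 4, 1, 4]

location = {
    "mean": mean(scores),
    "median": median(scores),
    "mode": multimode(scores),  # list of the most frequent value(s)
}
print(location)

# Adding a fixed constant shifts the mean, median, and mode by that constant
shifted = [x + 10 for x in scores]
print(mean(shifted) - mean(scores), median(shifted) - median(scores))

# One extreme value moves the mean a lot but barely moves the median
typical = [3, 4, 4, 5, 6]
with_outlier = typical + [100]
print(mean(typical), mean(with_outlier))
print(median(typical), median(with_outlier))
```

`multimode` is used rather than `mode` so that ties between equally frequent values are reported instead of hidden.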
The Relative Positions of the Mean, Median, and Mode
Source: Lind, Marchal, and Wathen (2012)
Measures of Variability
• The measures of variability, which are calculated on interval or ratio data, include the range, interquartile range, variance or standard deviation, and coefficient of variation
• Range: The range measures the spread of the data
  • It is simply the difference between the largest and smallest values in the sample
• Interquartile Range (IQR): The interquartile range is the difference between the 75th and 25th percentiles
  • For a set of data points arranged in order of magnitude, the pth percentile is the value that has p% of the data points below it and (100 - p)% above it
  • If all the data points are multiplied by a constant, the interquartile range is multiplied by the same constant
Source: Malhotra (2010)
Measures of Variability (Cont.)
• Variance: The difference between the mean and an observed value is called the deviation from the mean. The variance is the mean squared deviation from the mean.
  • The variance can never be negative
  • When the data points are clustered around the mean, the variance is small. When the data points are scattered, the variance is large.
  • If all the data values are multiplied by a constant, the variance is multiplied by the square of the constant
• Standard Deviation: The standard deviation is the square root of the variance
  • Thus, the standard deviation is expressed in the same units as the data, rather than in squared units (as the variance is)
• Coefficient of Variation: The coefficient of variation is the ratio of the standard deviation to the mean expressed as a percentage, and it is a unitless measure of relative variability
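The measures of variability above can be sketched with Python's standard `statistics` module. The data are hypothetical. Two assumptions are worth flagging: the variance is computed as the mean squared deviation (dividing by n, as the slide words it), whereas sample variance is commonly computed by dividing by n - 1; and percentiles follow the "inclusive" convention of `statistics.quantiles`, which is only one of several conventions in use.

```python
from statistics import mean, pstdev, pvariance, quantiles

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

data_range = max(data) - min(data)
data_iqr = iqr(data)
data_var = pvariance(data)             # mean squared deviation (divides by n)
data_sd = pstdev(data)                 # square root of the variance
data_cv = data_sd / mean(data) * 100   # coefficient of variation, in percent

print(data_range, data_iqr, data_var, data_sd, data_cv)

# Multiplying every value by a constant c scales the IQR by c
# and the variance by c squared
c = 3
scaled = [c * x for x in data]
print(iqr(scaled) / data_iqr, pvariance(scaled) / data_var)
```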
Examples of Charts (1): Column, Line, Bar, Radar, Combo, Funnel
Examples of Charts (2): Waterfall, Histogram, Pareto, Box & Whisker, Treemap, Sunburst
Estimating a Population Parameter: Making Your Best Guesstimate
• We want to estimate a population parameter (a single number that describes a population) by using statistics (numbers that describe a sample of data)
• Examples:
  • Estimating the Overall Liking score of a new product
  • Estimating the Customer Satisfaction Index
  • Estimating the average units purchased per purchase occasion
  • Estimating % agreement to a statement
• Types of estimates:
  • Point Estimate: one single number only
  • Interval Estimate: an interval containing a range of numbers (called a Confidence Interval)
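The difference between a point estimate and an interval estimate can be illustrated with a minimal sketch. The function name `mean_confidence_interval` and the data are hypothetical. The interval uses a normal (z) critical value, which is a large-sample approximation; for small samples a t critical value would normally be used and would give a somewhat wider interval.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Return (point estimate, (lower, upper)) for the population mean.

    Assumption: normal approximation, i.e. a z critical value.
    """
    n = len(sample)
    point = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return point, (point - z * standard_error, point + z * standard_error)

# Hypothetical Overall Liking scores on a 9-point scale
sample = [7, 8, 6, 9, 7, 8, 7, 6, 8, 9, 7, 8]

point, interval_95 = mean_confidence_interval(sample, 0.95)
_, interval_99 = mean_confidence_interval(sample, 0.99)
print(point, interval_95, interval_99)
```

Asking for more confidence (99% instead of 95%) widens the interval around the same point estimate.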
Simulation: One Proportion Inference
http://www.rossmanchance.com/applets/OneProp/OneProp.htm
[Chart: Standard Error (y-axis) vs. Proportion (x-axis, 0 to 1)]
• The highest Standard Error for a Proportion is achieved at p = 0.5
• When the Proportions are small or big, the Standard Errors are small
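The shape of the standard error curve follows from the usual formula SE = sqrt(p(1 - p) / n). A minimal sketch, assuming a hypothetical sample size of n = 200 (the sample size behind the slide's chart is not stated):

```python
from math import sqrt

def proportion_standard_error(p, n):
    """Standard error of a sample proportion p based on n observations."""
    return sqrt(p * (1 - p) / n)

n = 200  # hypothetical sample size, chosen only for illustration
grid = [i / 10 for i in range(11)]  # proportions 0.0, 0.1, ..., 1.0
ses = {p: proportion_standard_error(p, n) for p in grid}

for p, se in ses.items():
    print(f"p = {p:.1f}  SE = {se:.4f}")

# The proportion with the largest standard error
worst_case = max(ses, key=ses.get)
print(worst_case)
```

Because p = 0.5 is the worst case, it is a conservative choice when planning a sample size before any proportion has been observed.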
Simulation: Confidence Intervals for Means
http://www.rossmanchance.com/applets/ConfSim.html
A General Procedure for Hypothesis Testing
• HYPOTHESIS TESTING: A procedure based on sample evidence and probability theory to determine whether the hypothesis is a reasonable statement
• Examples:
  • The heavy and light users of a brand differ in terms of psychographic characteristics
  • One hotel has a more upscale image than its close competitor
  • Concept A is rated higher than Concept B on Overall Liking
Source: Malhotra (2010)
Type I and Type II Errors in Hypothesis Testing
• Alpha (α) is the probability of making a Type I error (rejecting a null hypothesis that is actually true). We want α to be as low as possible!
• Beta (β) is the probability of making a Type II error (failing to reject a null hypothesis that is actually false).
• The power of a test is the probability (1 - β) of rejecting the null hypothesis when it is indeed false and hence should be rejected. We want power to be as high as possible!
• Unfortunately, α and β are interrelated. So, it’s necessary to balance the two types of errors.
• The level of α along with the sample size will determine the level of β for a particular research design.
• The risk of both α and β can be controlled by increasing the sample size. For a given level of α, increasing the sample size will decrease β, and hence increase the power of the test (1 - β).
• In practice, we usually set α at 1%, 5%, or 10%.
• Think of sample size as a magnifying glass.
Sources: Lind, Marchal, and Wathen (2012). Malhotra (2010).
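The interplay between α, β, and sample size can be made concrete with a minimal sketch. It assumes the simplest textbook setting, a one-sided z-test for a mean with known standard deviation; the function name `power_one_sided_z` and the numbers are hypothetical, and real studies would typically rely on a t-test or dedicated power software.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Power (1 - beta) of a one-sided z-test for a mean.

    effect: true difference from the null value (assumed)
    sigma:  known population standard deviation (assumed)
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)
    return 1 - z.cdf(z_crit - effect * sqrt(n) / sigma)

# Hypothetical scenario: true difference of 0.3 scale points, sigma = 1.5
sample_sizes = [50, 100, 200, 400]
powers = [power_one_sided_z(0.3, 1.5, n, alpha=0.05) for n in sample_sizes]
for n, pw in zip(sample_sizes, powers):
    print(f"n = {n:3d}  power = {pw:.2f}  beta = {1 - pw:.2f}")

# For a fixed sample size, a stricter alpha lowers the power (raises beta)
power_alpha_05 = power_one_sided_z(0.3, 1.5, 100, alpha=0.05)
power_alpha_01 = power_one_sided_z(0.3, 1.5, 100, alpha=0.01)
print(power_alpha_05, power_alpha_01)
```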
Hypothesis Tests Related to Differences
[Diagram panels: Interval or Ratio Level; Nominal or Ordinal Level]
Source: Malhotra (2010)
Two Independent Samples: Evaluating the Difference between Two Mean Scores
• The data come from 2 unrelated samples, drawn randomly from different populations
• The 2 samples are not experimentally related. The measurement of one sample has no effect on the values of the second sample.
• Note: In a monadic design, the samples are independent
• Examples:
  • Comparing the Purchase Intent mean scores of Concept X vs. Concept Y
  • Comparing the responses of Females vs. Males
  • Comparing the reaction towards TVC A vs. TVC B
• Online tools:
  • http://www.evanmiller.org/ab-testing/t-test.html
  • http://www.quantitativeskills.com/sisa/statistics/t-test.htm
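A comparison of two independent mean scores can be sketched as follows. The slide does not say which t-test variant is intended; this sketch uses Welch's version, which does not assume equal variances. The function name `welch_t` and the scores are hypothetical. Only the test statistic and degrees of freedom are computed, since the standard library has no t distribution; the p-value would come from a t table or a statistics package.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom
    for two independent samples."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical Purchase Intent scores from a monadic test (5-point scale)
concept_x = [4, 5, 4, 3, 5, 4, 4, 5]
concept_y = [3, 4, 3, 2, 4, 3, 3, 4]

t_stat, df = welch_t(concept_x, concept_y)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```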
Two Independent Samples: Evaluating the Difference between Two Proportions
• The data also come from 2 unrelated samples, but we focus on evaluating the proportions
• Examples: comparing Top Box, Top 2 Boxes, Bottom Box, Bottom 2 Boxes, Brand Association
• Caution: be careful about declaring 2 proportions statistically significantly different when the actual difference is small
  • T2B Differences: Proto 1 (a) - Proto 2 (b) = 5%; Proto 1 (a) - Proto 4 (d) = 4%
• An online tool: http://www.evanmiller.org/ab-testing/chi-squared.html
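The caution about small differences can be illustrated with a minimal sketch of a pooled two-proportion z-test. The function name and all counts are hypothetical. For a 2x2 table, the square of this z statistic equals the Pearson chi-squared statistic without continuity correction, so the result is comparable to a chi-squared test.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# The same 5-point T2B gap (60% vs. 55%) at two hypothetical sample sizes
z_small, p_small = two_proportion_z_test(120, 200, 110, 200)
z_large, p_large = two_proportion_z_test(1200, 2000, 1100, 2000)

print(f"n = 200 per cell:  z = {z_small:.2f}, p = {p_small:.3f}")
print(f"n = 2000 per cell: z = {z_large:.2f}, p = {p_large:.4f}")
```

With a large enough sample, the same small gap becomes statistically significant, so significance alone does not show that the difference matters in practice.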
Some Basic Formulas
Source: Lind, Marchal, and Wathen (2012)
The Case of More Than Two Independent Samples
• Method: One-way ANOVA for a quantitative (numerical) variable
  • E.g., Overall Liking, Purchase Intention, Product Attribute, Imagery attribute
• Examples:
  • In a blind product test, comparing the performances of 3 different facial moisturizers
  • In a concept test, comparing the acceptance of 5 new powdered milk concepts
  • In a U&A study, comparing the responses from SES Upper vs. Middle vs. Lower
  • In a TVC pre-test, comparing the performances of 3 different new ads
Simulation: One Way Analysis of Variance
http://www.rossmanchance.com/applets/AnovaSim.html
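The F statistic behind a one-way ANOVA compares the variability between group means with the variability within groups. A minimal sketch with hypothetical scores for three products; the function name `one_way_anova_f` is illustrative, and the p-value (which requires the F distribution) is left to a statistics package.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical Overall Liking scores for 3 products in a blind test
products = [[6, 7, 8], [5, 6, 7], [8, 9, 10]]
f_stat, df_b, df_w = one_way_anova_f(products)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```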
Two Dependent Samples
• Paired data are formed from measurements of essentially the same quantitative variable (ordinal, interval, or ratio level) done on the same individuals
• Examples:
  • Concept score vs. Product score of a new mix (in a concept-product test project)
  • Perceptions ‘Before’ and ‘After’ an exposure (e.g., a TVC)
  • Perceptions ‘Before’ and ‘After’ attending a brand sponsored event
• Statistical test for a quantitative (numerical) variable: Paired T-Test for Means
• Online tools:
  • http://scistatcalc.blogspot.co.id/2013/10/paired-students-t-test.html
  • http://vassarstats.net/tu.html
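For paired data, the test works on the per-person differences rather than on the two sets of scores separately. A minimal sketch with hypothetical 'Before' and 'After' scores; the function name `paired_t` is illustrative, and the p-value would again come from a t table or a statistics package.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """Paired t statistic and degrees of freedom for matched measurements."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Hypothetical perception scores from the same 6 respondents
before = [5, 6, 4, 7, 5, 6]
after = [6, 7, 5, 7, 7, 7]

t_paired, df_paired = paired_t(before, after)
print(f"t = {t_paired:.3f}, df = {df_paired}")
```

Treating the same data as two independent samples would ignore the pairing and usually give a weaker test, because person-to-person variation would no longer cancel out.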
The Case of More Than Two Dependent Samples
• The statistical method employed in this project was Repeated Measures ANOVA (in SPSS)
• Please consult with your in-house Statistician if you face this kind of project
[Chart: Deodorant Usage in 3-Week Period; y-axis: Usage (grams); x-axis: Week 1, Week 2, Week 3; series: Females, Males]
Total Usage Females: 21.63 g / person
Total Usage Males: 21.68 g / person
Relationship Among Techniques: T-Test, ANOVA, ANCOVA, Regression
[Diagram panel: Interval or Ratio level]
Source: Malhotra (2010)
Some Practical Tips
• Always focus on the research and business objectives when analyzing your data
• Plan the analysis early, even at the proposal stage. Envision the end results as early as possible. Consult with your in-house Statistician.
• Always prepare a DP Specs. Take your time to prepare a proper one. Get feedback from your DP if you’re not sure.
• Once the data are ready, always check & recheck for errors. Compare the Excel tables to the SPSS raw data.
• Before jumping to creating charts, do review the Excel tables from your DP. Look for patterns, interesting findings, anomalies. Try extracting and creating your preliminary story.
Phone: +62 818 906875
Email: ts.lim@leap-research.com
Leap Research, SOHO Podomoro City, Unit 18-05, Jl. Letjen S. Parman Kav. 28, Jakarta 11470
ANY QUESTIONS?