Summarizing One or Two Categorical Variables & Relationships Between Categorical Variables

Presentation 2 Summarizing One or Two Categorical Variables & Relationships Between Categorical Variables

Types of Variables
• Categorical – Possible values define groups or categories, not necessarily in any apparent ordering. Ex. Color of M&M’s; Gender; Stat 200 Section
• Ordinal – Categorical variable where the values or categories have a natural ordering. Ex. Rate the roller coaster on a scale of 1-5 (1 is terrible and 5 is excellent); Age groups (child, teen, adult, senior citizen); Shirt sizes (S, M, L, XL)
• Quantitative – Measurements or counts, recorded as numerical values. Ex. Height; Temperature; # of Red M&M’s

Possible Roles Played by Variables:
• Response Variables – the variables whose outcome we want to determine. These are the variables of main interest.
• Explanatory Variables – variables that partially explain the value of the response variable for an individual.

For each of the following, identify the response and explanatory variables as well as the variable type:
• Is there a relationship between a person’s gender and their favorite kind of music?
  Response:          Explanatory:
• Do men and women listen to the same number of hours of music?
  Response:          Explanatory:
• Does a person’s hometown influence the amount they would pay for a single CD?
  Response:          Explanatory:
• Do people who play musical instruments rate the types of music the same?
  Response:          Explanatory:
• Do people who have a CD burner prefer to buy or burn their CDs?
  Response:          Explanatory:

Summarizing Categorical Variables:
• For one variable:
  • Numerical Summaries: counts and percents
  • Graphical Summaries: Pie Chart or Bar Graph
• For two variables:
  • Numerical Summaries: 2-way tables with counts and row percents. The explanatory variable should be the row variable (first variable entered in Minitab) and the response variable should be the column variable (second variable entered in Minitab).
  • Graphical Summaries: Bar Graph
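As a minimal sketch of the one-variable numerical summary (counts and percents), the snippet below tallies a small list of M&M colors. The list is hypothetical, invented only for illustration; it is not data from the presentation, and the presentation itself uses Minitab rather than Python.

```python
from collections import Counter

# Hypothetical sample, invented for illustration only (not from the presentation).
colors = ["red", "blue", "red", "green", "blue",
          "red", "yellow", "red", "green", "blue"]

counts = Counter(colors)   # count of each category
n = len(colors)            # sample size

# Percent of the sample falling in each category.
percents = {color: 100 * count / n for color, count in counts.items()}

# Print categories from most to least frequent.
for color, count in counts.most_common():
    print(f"{color:<8}{count:>4}{percents[color]:>8.1f}%")
```

The same counts are what a bar graph or pie chart would display; the percents always sum to 100.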

Example for One Categorical Variable:
• Where do Penn State alumni live? The PSU Alumni Association would like to obtain the answer to this question from all PSU alumni. They can’t ask all alumni, so they take a random sample of 50 alumni from the directory. They determined the state of residence from the address. Here are the results:
What do these descriptive statistics tell us?

Example for Two Categorical Variables:
• Do most college students have a credit card? A study would like to determine if the percentage of students who have at least one credit card differs based on year in school. Four different samples (Fr, So, Jr, Sr), each of 100 PSU students, were obtained. Each student was asked one question: “Do you currently have at least one credit card?”
Identify the response and the explanatory variable in this case:
  Response:          Explanatory:
What do these descriptive statistics tell us?

Assessing the Statistical Significance of the Relationship between Two Categorical Variables
Suppose we ask 15 randomly picked students 2 questions:
1. Do you smoke?
2. Did you have a beer last night?
We summarize the results using the Cross Tabulation function in Minitab:

Tabulated Statistics: smoke, beer
Rows: smoke   Columns: beer

            n        y      All
n           9        2       11
        81.82    18.18   100.00
y           1        3        4
        25.00    75.00   100.00
All        10        5       15
        66.67    33.33   100.00

Cell Contents --
   Count
   % of Row
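The row percents in the Minitab output can be reproduced by hand. The sketch below rebuilds the 15 (smoke, beer) answers from the cell counts in the table and recomputes each row percent as cell count divided by row total. The order of the students is an assumption and does not affect the result.

```python
from collections import Counter

# (smoke, beer) answers rebuilt from the cell counts in the table;
# the order of the 15 students is assumed and does not matter.
responses = ([("n", "n")] * 9 + [("n", "y")] * 2 +
             [("y", "n")] * 1 + [("y", "y")] * 3)

cell_counts = Counter(responses)
levels = ["n", "y"]

# Row percent = cell count / total of that row (explanatory variable = smoke).
row_percents = {}
for smoke in levels:
    row_total = sum(cell_counts[(smoke, beer)] for beer in levels)
    for beer in levels:
        row_percents[(smoke, beer)] = 100 * cell_counts[(smoke, beer)] / row_total

# Print counts with row percents underneath, as in the cross tabulation.
for smoke in levels:
    counts_row = "  ".join(f"{cell_counts[(smoke, beer)]:>6}" for beer in levels)
    pct_row = "  ".join(f"{row_percents[(smoke, beer)]:>6.2f}" for beer in levels)
    print(f"{smoke:<4}{counts_row}")
    print(f"{'':<4}{pct_row}")
```

Reading across rows, 18.18% of non-smokers had a beer last night compared with 75.00% of smokers, which is the comparison the row percents are designed to make.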

Inference about the Population!
• How can we tell if there’s a relationship between being a smoker and drinking beer last night?
• Does the relationship observed in the sample data hold in the population represented by this sample?
• Techniques used to make generalizations about the population using a sample are known as inferential statistics.
• A statistically significant relationship is one that is large enough to be unlikely to have occurred in the observed sample if there is no relationship in the population.

Null and Alternative Hypotheses
• Another way to express our objective is that we are deciding between two possible hypotheses about the population:
  Null Hypothesis: The two variables are not related.
  Alternative Hypothesis: The two variables are related.
• In our example we have:
  Null Hypothesis: Being a smoker and drinking beer last night are not related.
  Alternative Hypothesis: Being a smoker and drinking beer last night are related.

Chi-square Statistic
• We usually use the Chi-square Statistic to handle this type of question.
• The Chi-square Statistic is used to assess the statistical significance of the association between 2 categorical variables. A large Chi-square Statistic indicates there is a statistically significant relationship between the 2 variables.
• How does the Chi-square Statistic work? It measures the difference between the observed counts and the counts that would be expected if there were no relationship (under the null hypothesis).
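To make "observed versus expected" concrete, here is a rough sketch using the smoke/beer counts from the cross tabulation. The slides do not spell out the formula; the sketch assumes the standard one, where each expected count is (row total × column total) / n and the statistic is the sum over cells of (observed − expected)² / expected.

```python
# Observed counts from the smoke/beer cross tabulation.
observed = [[9, 2],   # smoke = n: beer = n, beer = y
            [1, 3]]   # smoke = y: beer = n, beer = y

row_totals = [sum(row) for row in observed]        # [11, 4]
col_totals = [sum(col) for col in zip(*observed)]  # [10, 5]
n = sum(row_totals)                                # 15

# Counts expected if smoking and beer drinking were unrelated.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Sum over all four cells of (observed - expected)^2 / expected.
chi_square = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2)
    for j in range(2)
)

print(f"chi-square = {chi_square:.3f}")
```

For these counts the expected table is roughly 7.33, 3.67, 2.67, 1.33, and the statistic comes out to about 4.26.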

Chi-Square Statistic and p-value
• A large Chi-square Statistic indicates there is a statistically significant relationship between the 2 variables. However, how large is large?
• This is why we need to use the “p-value” as an indicator to tell us if the Chi-square Statistic is “large enough”.
• We can obtain the p-value in our Minitab output.
• How to use the p-value?
  • The bigger the Chi-square Statistic is, the smaller the p-value will be.
  • Generally, when the p-value is less than 0.05 (5%), we conclude that the observed relationship is unlikely to have occurred by chance alone, and we call it statistically significant.
  • Generally, when the p-value is larger than 0.05 (5%), we say the observed relationship could have occurred just by chance. Therefore, we cannot reject the null hypothesis that there is no relationship.
• Example: Part 3 of the activity….
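The decision rule can be sketched for the smoke/beer table. This is an illustration under stated assumptions, not Minitab's output: it uses the 2×2 shortcut for the chi-square statistic, n(ad − bc)² / (product of the row and column totals), and the fact that for a 2×2 table (1 degree of freedom) the p-value equals erfc(√(χ²/2)). The 0.05 cutoff is the one given in the slides.

```python
import math

# Cell counts from the smoke/beer table: a, b in the first row; c, d in the second.
a, b, c, d = 9, 2, 1, 3
n = a + b + c + d

# 2x2 shortcut for the chi-square statistic.
chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

significant = p_value < 0.05  # cutoff from the slides

print(f"chi-square = {chi_square:.3f}, p-value = {p_value:.3f}")
print("statistically significant" if significant
      else "could have occurred just by chance")
```

Here the p-value is about 0.039, which falls below 0.05. Several expected counts in this table are below 5, so the chi-square approximation is rough for a sample this small; treat the number as an illustration of the rule rather than a firm conclusion.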
