Lecture 5

Testing for Associations: Chi-Square, Correlation and Simple Regression Analysis

Objectives:
1. Recognise when a chi-square test of independence is appropriate.
2. Check the assumptions and corresponding conditions for a chi-square test of independence.
3. Run and interpret a chi-square test of independence.
4. Produce and explain a scatter plot to display the relationship between two quantitative variables.
5. Interpret the association between two quantitative variables using Pearson’s correlation coefficient.
6. Model a linear relationship with a least squares regression model.
7. Explain and check the assumptions and conditions for inference about regression models.
8. Examine the residuals from a linear model to assess the quality of the model.

Chi Square
Test of Independence

Cross tabulations
Frequencies: answer questions related to a single variable; describe one variable at a time.
Cross tabulations: answer questions related to two or more variables; describe two or more variables simultaneously.

Chi-Square Test of Independence
The table below shows the importance of personal appearance for several age groups. Are Age and Appearance independent, or is there an association?

Chi-Square Test of Independence
A stacked bar chart suggests an association. Test for independence using a chi-square test of independence.

The Chi-Square Test for Independence
A chi-square statistic measures the discrepancy between the ideal sample (the expected frequencies under H0, fe) and the actual sample data (the observed frequencies, fo).
H0: There is no association (the variables are independent).
H1: There is an association (the variables are NOT independent).
A large discrepancy produces a large value of chi-square, which indicates that the data do not fit the null hypothesis and that H0 should be rejected.

Independence Demonstrated
Suppose we are interested in the relationship between gender and attending university.
If there is no relationship and 40% of our total sample attend uni, we would expect 40% of males and 40% of females to attend.
If there is a relationship, we would expect a higher proportion of one group to attend uni than the other group, e.g. 60% compared with 20%.

Expected Frequencies
Expected frequencies are computed as if there is no difference between the groups, i.e. both groups have the same proportion as the total sample in each category of the test variable. Since the proportion of subjects in each category of the group variable can differ, we take group category into account in computing expected frequencies as well. To summarize, the expected frequencies for each cell are computed to be proportional to both the breakdown for the test variable and the breakdown for the group variable.

H0: There is no association between the two variables. Gender and attending university are statistically independent.

University   Men     Women   Total
No           71.1    71.1    71.1
Yes          28.9    28.9    28.9
Total        100     100     100

H1: The two variables are associated in the population. Gender and attending university are statistically dependent.

University   Men     Women   Total
No           83.3    57.2    71.1
Yes          16.7    42.8    28.9
Total        100     100     100

Calculating Expected Frequencies
$f_e = \frac{(\text{column marginal})(\text{row marginal})}{N}$

University   Men     Women   Total
No           83.3    57.2    71.1
Yes          16.7    42.8    28.9
Total        100     100     100

For example, for men who do not attend university: $f_e = \frac{100 \times 71.1}{100} = 71.1$

Chi-Square (obtained)
The test statistic summarizes the differences between the observed (fo) and the expected (fe) frequencies:
$\chi^2(\text{obtained}) = \sum \frac{(f_o - f_e)^2}{f_e}$
where fe = expected frequencies and fo = observed frequencies.

The Sampling Distribution of Chi-Square
The sampling distribution of chi-square gives the probability of getting a given value of chi-square, assuming no relationship exists in the population. The chi-square sampling distributions depend on the degrees of freedom:
df = (r – 1)(c – 1)
where r = the number of rows and c = the number of columns.

Assumptions and Conditions
Counted Data Condition: the data must be counts.
Independence Assumption: the counts need to be independent of each other. Check for randomization.
Randomization Condition: a random sample is needed.
Sample Size Assumption: there must be enough data, so check the following condition.
Expected Cell Frequency Condition: no more than 20% of expected counts are below 5, and no expected count is below 1.
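The Expected Cell Frequency Condition can be checked mechanically. The sketch below is illustrative only: the function names and the 2x3 table are hypothetical, not from the lecture, and the expected counts use the row total x column total / N rule given earlier.

```python
def expected_counts(observed):
    """Expected counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def expected_count_condition_ok(observed):
    """True if no expected count is below 1 and at most 20% are below 5."""
    cells = [e for row in expected_counts(observed) for e in row]
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return min(cells) >= 1 and share_below_5 <= 0.20


# Hypothetical 2x3 table of counts, for illustration only.
table = [[12, 30, 18],
         [8, 20, 12]]
print(expected_count_condition_ok(table))
```

If the check fails, the usual options are to collect more data or to combine sparse categories before running the test.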

Testing Significance
Chi Square Statistic: used for assessing whether an observed association in cross tabs is statistically significant.
Phi and Cramer’s V Correlation Coefficient: measures the strength (degree) of association.
Phi: measure of strength of association for a table with two rows and two columns (2x2).
Cramer’s V: modified version of Phi, for tables larger than 2x2.

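Strength (degree) of Association
Phi and Cramer’s V are correlation coefficients.
Phi ranges from -1 to 1: -1 = perfect negative relationship, 0 = no relationship, 1 = perfect positive relationship.
Cramer’s V has no sign and ranges from 0 (no relationship) to 1 (perfect relationship).
As a guide to the size of the coefficient:
± (0.1 – 0.3) = Weak
± (0.4 – 0.7) = Moderate
± (0.7+) = Strong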
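As a rough sketch of how these coefficients relate to chi-square, the standard formula is V = sqrt(chi-square / (N x min(r – 1, c – 1))), which for a 2x2 table reduces to the magnitude of Phi. The function name is hypothetical; the inputs (chi-square = 85.69, N = 447) are borrowed from the heating and cooking example later in this lecture. Note that a value computed this way from chi-square carries no sign.

```python
import math


def cramers_v(chi_square, n, n_rows, n_cols):
    """Cramer's V from a chi-square statistic; equals |Phi| for a 2x2 table."""
    return math.sqrt(chi_square / (n * min(n_rows - 1, n_cols - 1)))


# Values from the lecture's heating/cooking example (2x2 table).
v = cramers_v(chi_square=85.69, n=447, n_rows=2, n_cols=2)
print(round(v, 3))
```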

Example
A developer needs to pick heating systems and appliances for newly built homes. They want to configure homes that match the demand for gas and heat. If the home has electric heat, it’s cheaper to install electric appliances in the kitchen. If the home has gas heat, gas appliances make more sense in the kitchen. Does everyone who heats with gas prefer to cook with gas as well?
Variables: Type of heating (Gas/Electricity); Type of cooking (Gas/Electricity).

Example: Calculate Expected Values
Expected value = Row total x Column total / Overall total
272 * 149 / 447 = 90.67
272 * 298 / 447 = 181.33
175 * 149 / 447 = 58.33
175 * 298 / 447 = 116.67

Example: Calculate Chi Square
(Observed – Expected)² / Expected for each cell, with expected values rounded to whole numbers:
(136 – 91)² / 91 = 22.3
(136 – 181)² / 181 = 11.19
(13 – 58)² / 58 = 34.9
(162 – 117)² / 117 = 17.3
Total = 85.69
df = (r – 1)(c – 1) = 1 x 1 = 1
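The two calculation steps above can be reproduced in a few lines. This is a minimal sketch: the observed counts are read off the slide's arithmetic, and the row and column labels are not given in the transcript, so none are assumed here. Because the code keeps the expected counts unrounded, its total comes out at about 86.8 rather than the 85.69 obtained with whole-number expected counts.

```python
# Observed counts from the example (rows sum to 272 and 175,
# columns to 149 and 298, overall total 447).
observed = [
    [136, 136],
    [13, 162],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell: row total * column total / overall total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square: sum over cells of (observed - expected)^2 / expected.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

print([[round(e, 2) for e in row] for row in expected])
print(round(chi_square, 2), df)
```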

Example
χ²(1) = 85.69. Is there a significant association?
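One way to answer this without statistical software, sketched here for the df = 1 case only: a chi-square variable with 1 degree of freedom is the square of a standard Normal variable, so its upper-tail probability is erfc(sqrt(x / 2)). The function name is hypothetical; 3.841 is the standard table critical value for df = 1 at α = 0.05.

```python
import math


def chi_square_p_value_df1(chi_square):
    """Upper-tail probability for a chi-square statistic with df = 1."""
    return math.erfc(math.sqrt(chi_square / 2))


CRITICAL_VALUE_DF1_ALPHA_05 = 3.841  # table value for df = 1, alpha = 0.05

chi_square = 85.69  # statistic from the example
p_value = chi_square_p_value_df1(chi_square)

# Both checks lead to the same decision: reject H0 of independence.
print(chi_square > CRITICAL_VALUE_DF1_ALPHA_05, p_value < 0.05)
```

For tables larger than 2x2 the degrees of freedom exceed 1 and this shortcut no longer applies; a chi-square table or statistical package is needed instead.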

Are the %’s column or row percentages? What would you recommend based on the results? Homeowners contacted by the developer prefer natural gas to electric heat by about 2 to 1. These findings suggest building about two-thirds of the homes with gas heat and the rest with electric heat. In the homes with electric heat, install electric kitchen appliances; of the homes with gas heat, put an electric kitchen in one half and a gas kitchen in the other.

Chi Square test in SPSS: Words
Perform cross tabs: Analyze > Descriptive Statistics > Crosstabs (follow the cross tab instructions).
Click on Statistics.
Select Chi-Square and Phi and Cramer’s V.
Click OK.

Chi Square test in SPSS: Visuals (steps 1 to 3)

Chi Square test in SPSS: Visuals
4. Select the variable/s.
5. Use the > button to move into the Row(s) and Column(s).
6. Select Display Clustered Bar Charts.
7. Click on Cells…
8. Select Observed and Row Percentages (or Column).

Chi Square test in SPSS: Visuals
9. Click on Statistics…
10. Select Chi-square and Phi and Cramer’s V.

Example
Objective: A ski resort wants to determine if there is an association between the amount of snowfall and the number of lift passes sold during the ski season.

Cross tabs: Output
When snowfall is low, a higher percentage of days have low lift pass sales. When snowfall is high, a higher percentage of days have high lift pass sales.

Cross tabs: Interpretation
Based on the percentages (from the previous slide) and by looking at the clustered bar chart: there was more likely to be a high number of lift passes sold when snowfall was high. About three quarters (76.9%) of days with high lift pass sales also had a high level of snowfall. There appears to be an association between snowfall and lift pass sales; that is, lift pass sales are not independent of snowfall.

Chi Square: Is the relationship significant?
P-value = 0.028, and 0.028 < 0.05. Therefore, there is a SIGNIFICANT ASSOCIATION between Snowfall and Lift Pass Sales.

Strength (degree) of Association: Phi or Cramer’s V?
The table has 2 rows x 2 columns, so use Phi. Phi = 0.439, indicating a moderate positive association between Snowfall and Lift Pass Sales.

Chi Square: Interpretation
A Chi Square test was undertaken to determine whether there was a significant association between Snowfall and Lift Pass Sales. Results indicate that there is a moderate, positive, statistically significant association between the two variables (Phi = 0.439, p = 0.028). There is strong evidence to suggest an association between snowfall and lift pass sales. Lift pass sales are more likely to be high when snowfall is high and more likely to be low when snowfall is low.

Correlation

Scatterplots
Scatterplots are the ideal way to picture associations between two quantitative variables.

Looking at Scatterplots
The direction of the association is important. A pattern that runs from the upper left to the lower right is said to be negative. A pattern running from the lower left to the upper right is called positive.

Looking at Scatterplots
The second thing to look for in a scatterplot is its form. If there is a straight line relationship, it will appear as a cloud or swarm of points stretched out in a generally consistent, straight form. This is called linear form. Sometimes the relationship curves gently, while still increasing or decreasing steadily; sometimes it curves sharply up then down.

Looking at Scatterplots
The third feature to look for in a scatterplot is the strength of the relationship. Do the points appear tightly clustered in a single stream, or do the points seem to be so variable and spread out that we can barely discern any trend or pattern?

Looking at Scatterplots
Finally, always look for the unexpected. An outlier is an unusual observation, standing away from the overall pattern of the scatterplot.

Correlation
Correlation is used to explore the relationship between two numerical variables. It is used to determine:
whether a linear (straight-line) relationship exists;
the strength (degree) of the relationship;
the direction of the relationship (positive or negative).
Pearson’s Correlation (r): for two normally distributed numerical (ratio or interval) variables.
Spearman’s Rho (ρ): if one or both variables are on ordinal scales, or at least one numerical variable is not normally distributed.

Strength (degree) of Association
Correlation coefficients range from -1 to 1:
-1 = perfect negative relationship
0 = no relationship
1 = perfect positive relationship
As a guide:
± (0.1 – 0.3) = Weak
± (0.4 – 0.7) = Moderate
± (0.7+) = Strong
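To make the coefficient concrete, here is a minimal sketch of Pearson's r computed from its definition (the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). The function name and the five data points are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(round(r, 3), direction)
```

On the guide above, a value of this size would be read as a strong positive association, though with only five points a scatterplot should be checked first.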

Example
The RTA issues an annual report on traffic congestion and its cost to society and business. Describe the scatterplot of Congestion Cost against Freeway Speed.

Example
The scatterplot of Congestion Cost against Freeway Speed is roughly linear, negative, and strong. As the Peak Period Freeway Speed (mph) increases, the Congestion Cost per person tends to decrease.

Example: Bookstore
Data gathered from a bookstore show Number of Sales People Working and Sales (in $1000). Given the scatterplot, describe the direction, form, and strength of the relationship. Are there any outliers?
The relationship between Number of Sales People Working and Sales is positive, linear, and strong. As the Number of Sales People Working increases, Sales tends to increase also. There are no outliers.

Correlation Conditions
Before you use correlation, you must check three conditions:
Quantitative Variables Condition: Correlation applies only to quantitative variables.
Linearity Condition: Correlation measures the strength only of the linear association.
Outlier Condition: Unusual observations can distort the correlation.

Correlation Tables
Sometimes the correlations between each pair of variables in a data set are arranged in a table like the one below.

Lurking Variables and Causation
There is no way to conclude from a high correlation alone that one variable causes the other. There’s always the possibility that some third variable (a lurking variable) is simultaneously affecting both of the variables you have observed.

Lurking Variables and Causation
The scatterplot below shows Life Expectancy (average of men and women, in years) against Doctors per Person for 40 countries of the world. The correlation is strong, positive, and linear (r = 0.705). Should we send more doctors to developing countries to increase life expectancy?

Lurking Variables and Causation
Should we send more doctors to developing countries to increase life expectancy? No. Countries with higher standards of living have both longer life expectancies and more doctors. Higher standard of living is a lurking variable. Resist the temptation to conclude that x causes y from a correlation, no matter how obvious the conclusion may seem.

Correlation in SPSS: Words
Analyze > Correlate > Bivariate
Select “Variable 1” and “Variable 2” and use the > button to move them into the Variables: box.
Click Pearson if both are normally distributed numerical variables, or Spearman’s Rho if the data are ordinal or not normally distributed.
Click Flag significant correlations.

Correlation in SPSS: Visuals (steps 1 to 3)

Correlation in SPSS: Visuals
4. Select the variables.
5. Use the > button to move them into the Variables: box.
6. Select Pearson or Spearman.
7. Select Flag significant correlations.

Example
Objective: McDonald’s would like to know if there is a correlation between a consumer’s rating of McDonald’s (score out of 10) and whether consumers are concerned with the amount of fat they are eating at fast food restaurants.

Correlation: Output
P-value = 0.353, and 0.353 > 0.05, so the association is NOT significant.
Spearman’s Rho (ρ) = -0.030: a negative, extremely weak association.

Correlation: Interpretation
Spearman’s Rho correlation analysis was undertaken to determine if there was a relationship between the rating out of 10 for McDonald’s and whether respondents consider the amount of fat they eat at fast food restaurants (both numerical variables). Results indicated that there was NO statistically significant relationship between the two variables (ρ = -0.030, p = 0.353). This suggests that whether or not people consider the amount of fat they eat at fast food restaurants is not significantly related to the rating they give McDonald’s out of 10.

Regression

Regression
Regression is a technique used to analyse relationships between a metric dependent variable and one or more metric independent variables. Note: independent variables can also be dichotomous (categorical).
Bivariate Regression: a procedure for deriving a mathematical relationship, in the form of an equation, between a single metric dependent variable and a single independent variable.
Multiple Regression: simultaneously develops a mathematical relationship between a single dependent variable and two or more independent variables.

Regression
It can be used to:
Determine whether a relationship exists: whether the independent variables explain a significant variation in the dependent variable.
Determine the strength of the relationship: how much of the variation in the dependent variable can be explained by the independent variables.
Determine the structure or form of the relationship: the mathematical equation relating the independent and dependent variables.
Predict the values of the dependent variable: when values for the independent variables are known.
Control for other independent variables: when evaluating the contributions of a specific variable or set of variables.

The Linear Model
The scatterplot below shows Lowe’s sales and home improvement expenditures between 1985 and 2007. The relationship is strong, positive, and linear (r = 0.976).

The Linear Model
We see that the points don’t all line up, but that a straight line can summarize the general pattern. We call this line a linear model. A linear model describes the relationship between x (the independent or predictor variable) and y (the dependent variable).

Residuals
This linear model can be used to predict sales from an estimate of residential improvement expenditures for the next year. We know the model won’t be perfect, so we must consider how far the model’s values are from the observed values.
A linear model can be written in the form
$\hat{y} = b_0 + b_1 x$
where b0 and b1 are numbers estimated from the data and ŷ is the predicted value. The difference between the observed value, y, and the predicted value, ŷ, is called the residual and is denoted e:
$e = y - \hat{y}$

Residuals Example
In the computer usage model for 301 stores, the model predicts 262.2 MIPS (Millions of Instructions Per Second) and the actual value is 218.9 MIPS. The residual for 301 stores is e = y – ŷ = 218.9 – 262.2 = –43.3 MIPS, so the model overpredicts.

The Line of “Best Fit”
Some residuals will be positive and some negative, so adding up all the residuals is not a good assessment of how well the line fits the data. If we consider the sum of the squares of the residuals, then the smaller the sum, the better the fit. The line of best fit is the line for which the sum of the squared residuals is smallest; it is often called the least squares line. Which line is a better fit?
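The least squares idea can be sketched numerically. The code below uses the standard closed-form solution for a single predictor (slope = sum of cross-products of deviations / sum of squared x deviations; intercept = mean of y minus slope times mean of x). The function names, the data, and the competing "other" line are hypothetical and only there to show that the fitted line gives the smaller sum of squared residuals.

```python
def least_squares_line(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def sum_squared_residuals(x, y, b0, b1):
    """Sum of (observed - predicted)^2 for the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(x, y)
best_ssr = sum_squared_residuals(x, y, b0, b1)
other_ssr = sum_squared_residuals(x, y, 2.0, 0.7)  # an arbitrary competing line

print(round(b0, 2), round(b1, 2), round(best_ssr, 2), round(other_ssr, 2))
```

The residuals of the fitted line also sum to zero (up to rounding), which is why the raw sum of residuals cannot be used to compare lines.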

The Line
Straight lines can be written as
$y = mx + b$
The scatterplot of real data won’t fall exactly on a line, so we denote the model of predicted values by the equation
$\hat{y} = b_0 + b_1 x$
But if the model is a good one, the data values will scatter closely around it. The “hat” on the y will be used to represent an approximate, or predicted, value.

For the Lowe’s data, the line shown with the scatterplot has this equation:
$\widehat{\text{Sales}} = -19{,}679 + 0.346\,\text{Improvements}$
A slope of 0.346 says that each additional $1M in Improvements is associated with an additional average $346,000 in sales. An intercept of –19,679 is the value of the line when the x-variable (Improvements) is zero. This is only interpreted if it has a physical meaning.

Example: Pizza Sales and Price
A linear model to predict weekly Sales of frozen pizza from the average price ($/unit) charged by a sample of stores in Newcastle in 39 recent weeks is:
Sales = 141,865.53 – 24,369.49*Price
What is the predictor variable? Average Price.
What is the dependent variable? Sales.
What does the slope mean in this context? Predicted sales decrease by $24,369.49 per dollar increase in price.
Is the y-intercept meaningful in this context? No, because stores will not set their price to $0.

Example Continued: Pizza Sales and Price
A linear model to predict weekly Sales of pizza from the average price ($/unit) charged by a sample of stores in Newcastle in 39 recent weeks is:
Sales = 141,865.53 – 24,369.49*Price
What is the predicted Sales if the average price charged was $3.50 for a pizza?
Sales = 141,865.53 – 24,369.49*3.50 = 56,572.32
If the sales for a price of $3.50 turned out to be $60,000, what would the residual be?
Residual = 60,000 – 56,572.32 = 3,427.68
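The prediction and residual for the pizza model can be checked with a short script. The coefficients are the ones quoted in the example; the function names are hypothetical.

```python
INTERCEPT = 141_865.53   # predicted sales at a price of $0 (not meaningful in practice)
SLOPE = -24_369.49       # change in predicted sales per $1 increase in price


def predicted_sales(price):
    """Predicted weekly sales from the lecture's pizza model."""
    return INTERCEPT + SLOPE * price


def residual(observed_sales, price):
    """Residual = observed value minus predicted value."""
    return observed_sales - predicted_sales(price)


print(round(predicted_sales(3.50), 2))
print(round(residual(60_000, 3.50), 2))
```

A positive residual means the model underpredicted: actual sales were higher than the line suggests at that price.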

Conditions
Least squares lines are commonly called regression lines. We’ll need to check the same conditions for regression as we did for correlation:
Quantitative Variables Condition
Linearity Condition
Outlier Condition
Extra condition, the Equal Spread Condition: check a residual plot for equal scatter for all x-values.

Checking the Model
Residuals help us see whether the model makes sense. A scatterplot of residuals against predicted values should show nothing interesting: no patterns, no direction, no shape. If nonlinearities, outliers, or clusters in the residuals are seen, then we must try to determine what the regression model missed.

Checking the Model
A plot of the residuals is given below. It does not appear that there is anything interesting occurring.

Checking the Model
A plot of the residuals is given below. The residual plot reveals a curved pattern, which tells us the scatterplot is also nonlinear.

Checking the Model
A plot of the residuals is given below. It appears that the spread in the residuals is increasing.

Variation in the Model and R²
The variation in the residuals is the key to assessing how well a model fits. If the correlation were 1.0, the model would predict y perfectly; the residuals would all be zero and have no variation. If the correlation were 0, the model would predict the mean for all x-values, and the residuals would have the same variability as the original data.

Variation in the Model and R²
Lowe’s Sales has a standard deviation of $14,090M. The residuals have an SD of only $3,097M. The variation in the residuals is smaller than that of the data but larger than zero. How much of the variation is left in the residuals? If you had to put a number between 0% and 100% on the fraction of variation left in the residuals, what would you guess?

Variation in the Model and R²
All regression models fall somewhere between the two extremes of zero correlation and perfect correlation of plus or minus 1. We consider the square of the correlation coefficient r to get r², which is a value between 0 and 1. r² gives the fraction of the data’s variation accounted for by the model, and 1 – r² is the fraction of the original variation left in the residuals.

Variation in the Model and R²
r² by tradition is written R² and called “R squared”. The Lowe’s model had an R² of (0.976)² = 0.952. Thus 95.2% of the variation in Sales is accounted for by home improvement expenditures, and 1 – 0.952 = 0.048, or 4.8%, of the variability in Sales has been left in the residuals.
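As a rough cross-check using only the figures quoted for the Lowe's data, the fraction of variation left in the residuals can be estimated in two ways: as 1 – R², and as the squared ratio of the residual SD to the SD of Sales. The two should agree approximately (not exactly, because the quoted figures are rounded and the two SDs use slightly different divisors). Variable names are hypothetical.

```python
r = 0.976            # correlation quoted for the Lowe's data
sd_sales = 14_090    # SD of Sales, in $M
sd_residuals = 3_097 # SD of the residuals, in $M

r_squared = r ** 2
left_in_residuals = 1 - r_squared

# Variance ratio: (residual SD / data SD)^2.
left_from_sds = (sd_residuals / sd_sales) ** 2

print(round(r_squared, 3), round(left_in_residuals, 3), round(left_from_sds, 3))
```

Both routes put the leftover variation at a little under 5%, which answers the "what would you guess?" question posed earlier.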

Variation in the Model and R²
How Big Should R² Be? There is no value of R² that automatically determines that a regression is “good”. Data from scientific experiments often have R² in the 80% to 90% range. Data from observational studies may have an acceptable R² in the 30% to 50% range.

Variation in the Model and R²
Example: Bookstore
Recall data gathered from a bookstore that show Number of Sales People Working and Sales (in $1000). The correlation is 0.965 and the regression equation is: [equation not shown]
Determine and interpret R².
R² = (correlation)² = (0.965)² = 0.931. About 93.1% of the variability in Sales can be accounted for by the Number of Sales People Working.

The Population and the Sample
How useful is a model? We know observations vary from sample to sample. So we imagine a true line that summarizes the relationship between x and y for the entire population,
$\mu_y = \beta_0 + \beta_1 x$
where µy is the population mean of y at a given value of x. We write µy instead of y because the regression line assumes that the means of the y values for each value of x fall exactly on the line.

The Population and the Sample
For a given value x, most, if not all, of the y values obtained from a particular sample will not lie on the line. The sampled y values will be distributed about µy. We can account for the difference between y and µy by adding the error, ε:
$y = \beta_0 + \beta_1 x + \varepsilon$

The Population and the Sample
Regression Inference: collect a sample and estimate the population β’s by finding a regression line:
$\hat{y} = b_0 + b_1 x$
The residuals e = y – ŷ are the sample-based versions of ε. Account for the uncertainties in the estimates of β0 and β1 by making confidence intervals.

Assumptions and Conditions
The inference methods are based on these assumptions (check these assumptions):
1. Linearity Assumption: this condition is satisfied if the scatterplot of x and y looks straight.
2. Independence Assumption: look for randomization in the sample or the experiment. Also check the residual plot for lack of patterns.
3. Equal Variance Assumption: check the Equal Spread Condition, which means the variability of y should be about the same for all values of x.
4. Normal Population Assumption: assume the errors around the idealized regression line at each value of x follow a Normal model. Check if the residuals satisfy the Nearly Normal Condition.

Assumptions and Conditions
Summary of Assumptions and Conditions:
Make a scatterplot of the data to check for linearity. (Linearity Assumption)
Fit a regression and find the residuals, e, and predicted values ŷ.
Make a scatterplot of the residuals against x or the predicted values. This plot should not exhibit a “fan” or “cone” shape. (Equal Variance Assumption)
Make a histogram and Normal probability plot of the residuals. (Normal Population Assumption and Outliers)
Data from Nambé Mills (p. 321)

Standard Error of the Slope
For a sample, we expect b1 to be close, but not equal, to the model slope β1. For similar samples, the standard error of the slope is a measure of the variability of b1 about the true slope β1.

Standard Error of the Slope
Which of these scatterplots would give the more consistent regression slope estimate if we were to sample repeatedly from the underlying population? Hint: Compare se’s.

Standard Error of the Slope
Which of these scatterplots would give the more consistent regression slope estimate if we were to sample repeatedly from the underlying population? Hint: Compare sx’s.

Standard Error of the Slope
Which of these scatterplots would give the more consistent regression slope estimate if we were to sample repeatedly from the underlying population? Hint: Compare n’s.
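The three hints correspond to the three quantities in the usual formula for the standard error of the slope in simple regression, SE(b1) = se / (sx x sqrt(n – 1)), where se is the SD of the residuals, sx is the SD of x, and n is the sample size. The sketch below uses hypothetical numbers (not from the lecture) to show the direction of each effect.

```python
import math


def slope_standard_error(s_e, s_x, n):
    """SE(b1) = s_e / (s_x * sqrt(n - 1)) for simple linear regression."""
    return s_e / (s_x * math.sqrt(n - 1))


# Hypothetical baseline values, for illustration only.
baseline = slope_standard_error(s_e=2.0, s_x=3.0, n=26)

less_scatter = slope_standard_error(s_e=1.0, s_x=3.0, n=26)   # smaller residual SD
wider_x = slope_standard_error(s_e=2.0, s_x=6.0, n=26)        # more spread in x
more_data = slope_standard_error(s_e=2.0, s_x=3.0, n=101)     # larger sample

print(round(baseline, 4), round(less_scatter, 4),
      round(wider_x, 4), round(more_data, 4))
```

In each comparison the more consistent slope estimate comes from the plot with less scatter about the line, a wider range of x-values, or more observations.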

A test for the Regression Slope

A test for the Regression Slope
The usual null hypothesis about the slope is that it’s equal to 0. Why? A slope of zero says that y doesn’t tend to change linearly when x changes. In other words, if the slope equals zero, there is no linear association between the two variables.

Example: Soap
A soap manufacturer tested a standard bar of soap to see how long it would last. A test subject showered with the soap each day for 15 days and recorded the weight (in grams) remaining. Conditions were met, so a linear regression gave the following:

Dependent variable is: Weight
R squared = 99.5%   s = 2.949
Variable    Coefficient   SE(Coeff)   t-ratio   P-value
Intercept   123.141       1.382       89.1      <0.0001
Day         -5.57476      0.1068      -52.2     <0.0001

What is the standard deviation of the residuals? What is the standard error of b1? What are the hypotheses for the regression slope? At α = 0.05, what is the conclusion?

Example: Soap
What is the standard deviation of the residuals? se = 2.949
What is the standard error of b1? SE(b1) = 0.1068
What are the hypotheses for the regression slope? H0: β1 = 0; HA: β1 ≠ 0
At α = 0.05, what is the conclusion? Since the p-value is small (<0.0001), reject the null hypothesis. There is strong evidence of a linear relationship between Weight and Day.

A test for the Regression Slope

Confidence Interval Example
Example: Soap
A soap manufacturer tested a standard bar of soap to see how long it would last. A test subject showered with the soap each day for 15 days and recorded the weight (in grams) remaining. Conditions were met, so a linear regression gave the following:

Dependent variable is: Weight
R squared = 99.5%   s = 2.949
Variable    Coefficient   SE(Coeff)   t-ratio   P-value
Intercept   123.141       1.382       89.1      <0.0001
Day         -5.57476      0.1068      -52.2     <0.0001

Find a 95% confidence interval for the slope. Interpret the 95% confidence interval for the slope. At α = 0.05, is the confidence interval consistent with the hypothesis test conclusion?

Confidence Interval Example
Find a 95% confidence interval for the slope.
b1 ± t* SE(b1) = -5.57476 ± (2.160)(0.1068) = (-5.805, -5.344)
Interpret the 95% confidence interval for the slope. We can be 95% confident that the weight of the soap decreases by between 5.34 and 5.81 grams per day.
At α = 0.05, is the confidence interval consistent with the hypothesis test conclusion? Yes, the interval does not contain zero, so reject the null hypothesis.
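The t-ratio and the confidence interval in the soap example both follow from the slope and its standard error. The sketch below reproduces them from the regression output; the critical value t* = 2.160 is the one used in the example (it corresponds to n – 2 = 13 degrees of freedom for 15 observations), and the variable names are hypothetical.

```python
b1 = -5.57476    # estimated slope (grams per day), from the regression output
se_b1 = 0.1068   # standard error of the slope
t_star = 2.160   # critical t value used in the example (df = 15 - 2 = 13)

# Test statistic for H0: beta1 = 0.
t_ratio = (b1 - 0) / se_b1

# 95% confidence interval: b1 +/- t* x SE(b1).
margin = t_star * se_b1
ci_low, ci_high = b1 - margin, b1 + margin

# The interval and the test agree when zero lies outside the interval.
contains_zero = ci_low <= 0 <= ci_high

print(round(t_ratio, 1), (round(ci_low, 3), round(ci_high, 3)), contains_zero)
```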

Linear Regression in SPSS: Words
Analyze > Regression > Linear
Select the “Dependent Variable” and use the > button to move it into the Dependent: box.
Select the “Independent Variables” and use the > button to move them into the Independent(s): box.
Click Statistics and select Descriptives.

Linear Regression in SPSS: Visuals (steps 1 to 3)

Linear Regression in SPSS: Visuals
4. Select Variables.
5. Use the > button to move variable into the Dependent: box.
6. Use the > button to move variable into the Independent(s): box.
7. Click on Plots.

8. Plot standardised residuals against standardised predicted values. This gives us the information we need to assess the normality, linearity and homoscedasticity of residuals assumptions.
9. Click ‘Continue’.

Make the Scatterplot
1.
2. Choose ‘Scatterplot’.
3. Select the type of Scatterplot.
4. Drag Variables: Y = Dependent Variable, X = Predictor Variable.

Checking Assumptions
Check plots of the standardised residuals for normally distributed residuals. The scatterplot of standardised residuals against standardised predicted values can be used to assess the assumptions of normality, homoscedasticity and also linearity. The absence of a pattern indicates the assumptions have been met.

Linear Regression in SPSS: Output
R square: the proportion of variance in the dependent variable that can be accounted for by the predictor. This tells us that 76.8% (0.768 x 100) of the variation in Cost can be explained by Weight.
Is the relationship significant? Does the model have predictive utility? That is, is it strong enough to indicate there is also a relationship in the population?
P value = 0.000 < 0.05 (a displayed value of 0.000 means p < 0.001). Therefore, the relationship is significant.

Linear Regression: Output
Coefficients table: details the role of the predictor variable in the regression model.
Unstandardised Coefficients: indicate the predicted change in the dependent variable with a 1-unit change in the predictor variable. E.g. a 1-unit increase in weight is associated with a predicted increase in cost of 1.101 units.
Standardised Coefficients: coefficients computed after the variables have been standardised (mean 0, S.D. 1), so that the y intercept (constant) is zero. E.g. a 1 S.D. increase in weight is associated with a 0.876 S.D. increase in cost.
Cost = 4.5 + 1.101*Weight

Linear Regression: Interpretation
Linear Regression analysis was undertaken to determine the amount of variation in the cost of a customer’s shipment that can be explained by differences in the weight of the package being shipped. Results indicate that Weight (t = 7.718, p < 0.001) accounted for 76.8% of the variation in shipping costs, and this was significant. The regression model is:
Shipping Cost = 4.5 + 1.101*Weight
For every one-unit increase in the package weight, the predicted shipping cost increases by 1.101 units.

Summary
A test for association examines whether there is a relationship between two or more variables.
When testing for association using statistics, the following process should be undertaken: determine the appropriate test, calculate the test statistic, convert it to a p-value, and assess the p-value in light of the hypotheses.
When testing for association between nominal categorical variables, use a chi-squared test.
When testing for association between quantitative variables (or ordinal variables), use correlation.
When testing for association between several variables, or trying to make predictions, use regression.
