Explore how regression analysis is used to analyze the relationship between reading and numeracy scores in NAPLAN data. Learn about the method, its application, complexities, and interpretation of results.
UC Stats Networking Day: Linear Regression • Shuang Liu • Shuangzhe.Liu@canberra.edu.au
Outline • How I use the method in my work; • How I learned about the method; • Complexities I have experienced in applying the method.
How I use the method in my work • I’ve been using the method in my research: Regression & Time Series Modelling, Estimation, Diagnostics & Data Analysis; • I’ve delivered lectures for the units Regression Modelling/G 6546/6558, among others (Introduction to Statistics, SADMG, Business Statistics, Econometrics, Survey Design, Experimental Design, Nonparametric Statistics, Multivariate Statistics, etc.).
Application to NAPLAN data • We have a data set in Excel with the NAPLAN scores for 90 ACT schools.
Application to the NAPLAN data… • We may be interested in the relationship between Numeracy and Reading scores. • Numeracy can be the response, while Reading can be the predictor. • How do Numeracy scores respond to changes in Reading scores? • First, look at a scatter plot to see the overall relationship; then fit a model, assess the goodness of fit, test whether the relationship is statistically significant, and conduct diagnostics.
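The fit, goodness-of-fit, and significance steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the analysis from the talk: the data are made up (they are not the NAPLAN scores), and the function name `fit_simple_regression` and the variable names are hypothetical.

```python
import math

def fit_simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x, with basic summaries."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained variation
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
    r_squared = 1.0 - sse / sst
    # Standard error of the slope and the t-statistic for H0: slope = 0.
    # Assumes the points do not lie exactly on a line (sse > 0).
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    t_stat = slope / se_slope
    return {"intercept": intercept, "slope": slope,
            "r_squared": r_squared, "t_stat": t_stat}

# Made-up (reading, numeracy) pairs for illustration only; NOT the NAPLAN data.
reading = [380.0, 410.0, 425.0, 440.0, 460.0, 475.0, 490.0, 510.0]
numeracy = [370.0, 395.0, 400.0, 418.0, 425.0, 440.0, 452.0, 463.0]
summary = fit_simple_regression(reading, numeracy)
print(summary)
```

The t-statistic would be compared with a t distribution on n - 2 degrees of freedom to judge significance. In practice a statistics package would normally report the p-value and produce the scatter plot directly.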
How I learned about the method…my research • Develop a fitted equation which represents the relationship between the independent and dependent variables. • The fitted equation is an estimate of the population model, which is assumed to exist; the population model drops the hats and includes an error term.
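The two equations described above can be written out as follows. The symbols are assumptions chosen for illustration (the slide's own notation is not preserved in the text): $Y$ is the response, $x$ the predictor, $\beta_0$ and $\beta_1$ the population intercept and slope, and $\varepsilon$ the error term.

```latex
% Fitted equation (estimated from the sample; hats denote estimates)
\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x

% Population model (no hats, plus an error term)
Y = \beta_0 + \beta_1 x + \varepsilon
```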
Errors • The error is the component of the response that is not explained by the regression line. • It is present in the population model. • Why is it there? • It captures the discrepancies present in the relationship, such as an extremely high Numeracy score corresponding to a low Reading score.
Residuals • Residuals are estimates of the errors. • A residual is the vertical distance from a data point to the fitted regression line.
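As a small sketch of the definition above, each residual is the observed value minus the fitted value at the same x. The data are made up, and the helper names `least_squares_line` and `residuals` are hypothetical.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residuals(x, y, intercept, slope):
    """Observed minus fitted value for every data point."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Made-up (reading, numeracy) pairs for illustration only; NOT the NAPLAN data.
reading = [380.0, 410.0, 425.0, 440.0, 460.0, 475.0, 490.0, 510.0]
numeracy = [370.0, 395.0, 400.0, 418.0, 425.0, 440.0, 452.0, 463.0]

b0, b1 = least_squares_line(reading, numeracy)
res = residuals(reading, numeracy, b0, b1)
print([round(r, 2) for r in res])
```

When the line is fitted by least squares with an intercept, the residuals sum to zero (up to rounding), which is a quick sanity check on the computation.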
Assumptions • Assumptions about the errors, checked using the residuals: • Independence: look at the residuals vs fitted values plot; points should look randomly scattered with no pattern • Constant variance (homoscedasticity): look at the residuals vs fitted values plot; points should be scattered with an even spread • Normality: look at the Q-Q plot; points should follow a straight line
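For the normality check, the coordinates of a Q-Q plot can be computed with the standard library alone. This is a rough sketch: the residual values are made up, the function name `qq_points` is hypothetical, and the plotting positions (i + 0.5) / n are one common convention among several.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Pair standard-normal quantiles with sorted, standardised residuals."""
    n = len(residuals)
    m, s = mean(residuals), stdev(residuals)
    sample = sorted((r - m) / s for r in residuals)
    # Plotting positions (i + 0.5) / n are an assumed convention.
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sample))

# Made-up residuals for illustration only.
example_residuals = [0.1, 0.2, -0.7, 0.4, -0.3, 0.5, -0.2, 0.0]
for theo, samp in qq_points(example_residuals):
    print(f"{theo:7.3f}  {samp:7.3f}")
```

If the errors are roughly normal, the printed pairs lie close to the line sample = theoretical. A plotting library would normally be used to draw the points rather than print them.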
Pros of regression • Graphical means of depicting a relationship between two variables • Allows us to determine if the relationship is ‘significant’ • Allows us to predict Y for a given X value
Cons of regression • Never 100% correct. • The regression line is not going to be a 100% fit to the data • Dangerous to forecast beyond the range of our data
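One way to guard against forecasting beyond the range of the data is to refuse predictions for x values outside the observed range. The sketch below is an illustration only; the function name `predict_in_range` and its arguments are hypothetical.

```python
def predict_in_range(x_new, x_observed, intercept, slope):
    """Predict from a fitted line, but only inside the observed x range."""
    lo, hi = min(x_observed), max(x_observed)
    if not lo <= x_new <= hi:
        raise ValueError(
            f"x = {x_new} is outside the observed range [{lo}, {hi}]; "
            "the fitted line may not hold there"
        )
    return intercept + slope * x_new

# Made-up observed x values and coefficients, for illustration only.
observed_x = [1.0, 2.0, 3.0]
print(predict_in_range(2.5, observed_x, intercept=1.0, slope=2.0))
```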
Complexities I have experienced in applying the method…my research • Variable or model selection • Outlier or influential observation identification • Application (to complex data etc.) • Extension (to generalised, multilevel, longitudinal, PLS, semi- or non-parametric methods etc.)
Application to the NAPLAN data… • We may be interested in the relationship between reading and numeracy scores. • Reading can be the predictor while numeracy can be the response. • How do numeracy scores change with a change in reading scores? • Look at a scatter plot to see the overall relationship first.
Back to the NAPLAN data… • We see a positive relationship between Reading and Numeracy! • As Reading scores increase, Numeracy scores also tend to increase. • But by how much? • This can be found by fitting a regression line.
Back to the NAPLAN data… • The regression equation is $\hat{Y} = 80.533 + 0.7543\,x$, where $\hat{Y}$ = predicted numeracy score and x = reading score. • Interpretation? • As the reading score increases by 1 unit, the predicted numeracy score increases by 0.7543 units. • When the reading score is 0, the predicted numeracy score is 80.533.
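The interpretation above can be checked with a few lines of Python using the intercept (80.533) and slope (0.7543) quoted on the slide. The function name `predict_numeracy` and the reading scores of 500 and 510 are hypothetical values picked for illustration; they are not taken from the data set.

```python
INTERCEPT = 80.533  # predicted numeracy score when the reading score is 0
SLOPE = 0.7543      # change in predicted numeracy per 1-unit change in reading

def predict_numeracy(reading_score):
    """Predicted numeracy score from the fitted line quoted on the slide."""
    return INTERCEPT + SLOPE * reading_score

# Hypothetical reading scores, chosen only to illustrate the arithmetic.
at_500 = predict_numeracy(500)
at_510 = predict_numeracy(510)
print(round(at_500, 3), round(at_510 - at_500, 3))
```

A reading score of 500 gives a predicted numeracy score of about 457.7, and a 10-point rise in reading corresponds to a rise of about 7.5 points in predicted numeracy, which is ten times the slope.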
Back to the NAPLAN data… • For the NAPLAN data, the coefficient of determination (R-squared) is 0.70527. • This means that 70.527% of the variation in the numeracy scores is explained by the reading scores. • This is a fairly high R-squared value, meaning the model is a moderately good fit to the data and is reasonably useful for prediction purposes.
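For reference, the coefficient of determination can be written in terms of sums of squares, and in simple linear regression it equals the squared correlation between the two variables. The symbols below ($\mathrm{SSE}$, $\mathrm{SST}$, $r$) are standard notation assumed for illustration, not taken from the slides.

```latex
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
    = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2}

% Simple linear regression: R^2 = r^2, and r takes the sign of the slope
r = +\sqrt{0.70527} \approx 0.84
```

The positive root is taken because the fitted slope is positive, so the correlation between reading and numeracy scores is about 0.84.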