
Predictive Statistics: Introduction to Econometrics and Linear Regression

This lecture introduces the concept of econometrics and predictive modeling using linear regression. It discusses how to analyze relationships between variables and predict outcomes. The lecture also covers simple linear regression, multiple regression, and provides examples using real-world datasets.


Presentation Transcript


  1. IS6146 Databases for Management Information Systems. Lecture 8: Predictive statistics: an intro to econometrics and linear regression. Rob Gleasure, R.Gleasure@ucc.ie, robgleasure.com

  2. IS6146 • Today’s session • Simple linear regression • Multiple regression

  3. Econometrics and predictive modelling • Sometimes we want to go beyond analysing general patterns and within-variable trends and look at whether one variable predicts another • Does the salary spent on salespeople predict sales revenue? • Does the amount of coffee consumed on a particular premises predict overtime? • Do secondary school results predict university grades? • Do changes in the euro, yen, or dollar predict bitcoin prices? • Econometrics basically refers to the use of statistics to test theoretical (abstract) relationships with real-world estimators • Cause is represented by predictor/independent variables • Effect is represented by outcome/dependent variables

  4. Econometrics and predictive modelling • Imagine we are designing a catering strategy for UCC and one of us presents the argument that catering should focus on students who live far away, as other students can eat at home • Hypothesis: students living further from UCC are more likely to eat meals there. • Imagine we go around the class with 2 questions • How many minutes does it take (roughly) for you to get to UCC? • How many meals/week do you eat in UCC? • Now imagine we sketch this out on a 2-dimensional graph
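  As a rough illustration of that sketch (a minimal R example with made-up answers, not data from the lecture), the two questions map onto a scatterplot like this:

  # hypothetical answers from 7 students
  commute = c(5, 10, 20, 30, 45, 60, 75)   # minutes to get to UCC
  meals = c(1, 2, 3, 5, 6, 8, 9)           # meals eaten in UCC per week
  plot(commute, meals, xlab = "Minutes to UCC", ylab = "Meals per week in UCC")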

  5. Econometrics and predictive modelling • Let's go back to our dataset at http://corvus2vm.ucc.ie/phd/rgleasure//rgleasure/databases/tomslee_airbnb_boston_1429_2017-07-10.csv • Is there a relationship between bedrooms and the number that can be accommodated, i.e. does one predict the other? • Select both columns of data and go to INSERT → Charts → Scatter

  6. Econometrics and predictive modelling • Are more southerly lets more expensive? • Let's select latitude and price and go to INSERT → Charts → Scatter

  7. Simple linear regression • Problems such as whether age (a continuous variable) is a reliable predictor of income (also a continuous variable) lend themselves to a simple linear regression • Arguably the most fundamental statistical test • Basically graphs the variables against each other and checks whether a line can be drawn that fits the distribution of the data

  8. Simple linear regression • A simple linear regression basically graphs the variables against each other and checks whether a line can be drawn that fits the scatterplot [Scatterplot: Income (25–65) against Age (20–80)]

  9. Simple linear regression • Most tests adopt an 'ordinary least squares' approach to judge how well a line fits • The associated correlation coefficient is formally referred to as Pearson's r [Scatterplot: Income against Age, as before]

  10. Simple linear regression • The basic formula for a simple linear regression is • Predicted value = slope of line × value of predicting variable + intercept + error, i.e. Y' = bX + A + E • Overall, we tend to prioritise two things • Does the predictor variable reliably predict the outcome variable? • We often refer to this as the 'p value', which states how likely it is that the relationship could have occurred by natural variation • How strong is the relationship (does a change in the predictor cause a big or small change in the outcome)? • We often refer to this as the 'beta value', which states how much the outcome variable changes for a given change in the predictor
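  As a minimal sketch of where those two numbers live in R (reusing the hypothetical commute/meals values from above):

  commute = c(5, 10, 20, 30, 45, 60, 75)   # hypothetical predictor (X)
  meals = c(1, 2, 3, 5, 6, 8, 9)           # hypothetical outcome (Y)
  sketch_model = lm(meals ~ commute)       # fits Y' = bX + A by least squares
  coef(sketch_model)                       # A (intercept) and b (slope)
  summary(sketch_model)$coefficients       # adds standard errors, t and p values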

  11. Simple linear regression • Go to Data → Data Analysis → Regression and select the outcome variable cells under Y and the predictor variable cells under X

  12. Simple linear regression • We should get the following results • R Square shows the amount of variance explained (0 being none, 1 being 100%) • The p value shows the probability the pattern occurred by fluke

  13. Multiple regression • Sometimes we want to run multiple predictors simultaneously • i.e. Y' = b1X1 + b2X2 + … + bnXn + A + E • Line these up in adjacent cells and select them all for X Range (an R version is sketched below)
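  For reference, the same idea in R (a sketch with simulated variables x1, x2 and outcome y; the Airbnb version appears on slide 20):

  set.seed(1)
  x1 = rnorm(50)
  x2 = rnorm(50)
  y = 2 * x1 - x2 + rnorm(50)        # simulated data where y depends on both predictors
  multi_model = lm(y ~ x1 + x2)      # fits Y' = b1X1 + b2X2 + A
  summary(multi_model)               # one coefficient and p value per predictor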

  14. Multiple regression • We should get the following results • The output now shows a coefficient for each individual predictor • It also shows a probability (p value) for each individual predictor

  15. OK… Break for 5?

  16. Regression in R • Let’s have a look at a scatterplot of price and latitude plot(my_data$latitude, my_data$price)

  17. Regression in R • Let’s have a look at a scatterplot of price and longitude plot(my_data$longitude, my_data$price)

  18. Regression in R • And a similar plot of room type and latitude plot(my_data$room_type,my_data$latitude)

  19. Regression in R • And a similar plot of room type and longitude plot(my_data$room_type,my_data$longitude)

  20. Running the model • To do this, we create a new variable to store the results of the model our_model = lm(my_data$price ~ my_data$latitude + my_data$longitude) • Then show a summary of that variable summary(our_model)
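  The summary object can also be queried directly for the headline numbers (a small sketch, assuming my_data holds the Airbnb CSV loaded earlier):

  our_model = lm(my_data$price ~ my_data$latitude + my_data$longitude)
  summary(our_model)$r.squared      # proportion of variance explained
  summary(our_model)$coefficients   # slope, standard error, t and p value per predictor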

  21. Regression in R • We can also add the line using the abline() function if we wish to see it on the scatterplot plot(my_data$latitude,my_data$price) simple_model = lm(my_data$price ~ my_data$latitude) abline(simple_model, col="red")

  22. Assumptions made by regression • A regression will break down if • The relationship isn't linear, e.g. it plateaus or changes direction after some point • The variance around the regression line is uneven • The prediction errors are not normally distributed, e.g. the 'residuals' fit well at one end of the line and not the other • The predictor variables are closely correlated with each other (only for multiple regression)

  23. Checking assumptions in R (cont) • R has built-in diagnostics, which we can view by entering the following code and pressing Enter to cycle through the plots plot(our_model) • The first one looks for abnormally distributed errors in the residuals (we want a flat line)
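  If cycling plot-by-plot is awkward, a common idiom (not from the slides) is to show all four diagnostic plots at once:

  par(mfrow = c(2, 2))   # arrange plots in a 2x2 grid
  plot(our_model)        # all four regression diagnostics together
  par(mfrow = c(1, 1))   # restore the default single-plot layout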

  24. Checking assumptions in R (cont) • The second one also looks for abnormally distributed errors in the standardised residuals (we want them to fit on the diagonal) • It can also help to spot weird patterns in variance, e.g. if one side falls away more than the other

  25. Checking assumptions in R (cont) • The third and fourth ones also look for abnormally distributed standardised residuals (again, we want a neat line) • Where data don't meet the assumptions of a regression, alternatives include • Transformed tests, e.g. logarithmic transformations • Ranked tests, e.g. Spearman's rho (sketched below)
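  As an example of the ranked option, Spearman's rho is built into R's cor.test() (a sketch, assuming my_data is loaded; R will warn about tied ranks on real data):

  # rank-based correlation, robust to monotonic but non-linear relationships
  cor.test(my_data$latitude, my_data$price, method = "spearman")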

  26. What if the assumptions don't hold? • A logistic/logit regression may be an option, which runs a similar regression formula against a binomial outcome, e.g. True or False, 0 or 1, etc. • Imagine we want to check whether a listing has any reviews - we could create a new binomial variable (either 0 or 1) my_data$has_reviews = ifelse(my_data$reviews > 0, 1, 0) • Then rerun the test using glm() (note: lm() would ignore the family argument) with family set to 'binomial' logit_model = glm(my_data$has_reviews ~ my_data$price, family = "binomial") summary(logit_model)
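  glm() reports its coefficients on the log-odds scale; a minimal sketch for turning them back into something more readable:

  predict(logit_model, type = "response")   # fitted probability of having reviews, per listing
  exp(coef(logit_model))                    # odds ratios per unit change in price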

  27. What if the assumptions don't hold? • Not very interesting • Nor do the longitude and latitude seem very informative. Oh well. logit_model2 = glm(my_data$has_reviews ~ my_data$latitude + my_data$longitude, family = "binomial") summary(logit_model2)

  28. What if the assumptions don't hold? • Back to pricing - let's create a new variable to see if a price is above the median price median_price = median(my_data$price) my_data$isExpensive = ifelse(my_data$price > median_price, 1, 0) • A little bit more encouraging logit_model3 = glm(my_data$isExpensive ~ my_data$latitude + my_data$longitude, family = "binomial") summary(logit_model3)
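  A quick sanity check (a sketch, not from the slides): classify each listing as expensive when its fitted probability passes 0.5 and compare against the actual labels:

  predicted = ifelse(predict(logit_model3, type = "response") > 0.5, 1, 0)
  table(predicted, my_data$isExpensive)   # rough confusion matrix of hits and misses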

  29. What if the assumptions don’t hold? • Let’s get a little more detailed • It looks like the problem stems from the shape of our data hist(my_data$price, breaks = 200) plot(my_data$price,my_data$latitude) plot(my_data$price,my_data$longitude)

  30. What if the assumptions don't hold? • We can transform our data to stretch it back into a more easily analysed shape • Note – this won't create any new false correlations, though it may make them harder to interpret • A log10 transform may fix a positive skew (a square root transform is a gentler option for positive skew) my_data$logPrice = log10(my_data$price) • We'll also need to fix any instances where the original value was zero, as log10(0) is -Inf and will break the model my_data$logPrice = ifelse(is.infinite(my_data$logPrice), 0, my_data$logPrice)
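  An alternative sketch that avoids the -Inf problem up front (a common idiom, not from the slides) is to shift all values by one before taking logs:

  my_data$logPrice = log10(my_data$price + 1)   # log10(0 + 1) = 0, so zero prices stay at zero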

  31. What if the assumptions don’t hold? • Let’s have a look at the transformed value hist(my_data$logPrice, breaks = 200) plot(my_data$logPrice,my_data$latitude) plot(my_data$logPrice,my_data$longitude)

  32. What if the assumptions don't hold? • Now let's run our tests and check the assumptions log_transformed_model = lm(my_data$logPrice ~ my_data$latitude + my_data$longitude) summary(log_transformed_model) plot(log_transformed_model)

  33. Want to read more? • http://www.r-tutor.com • https://www.r-bloggers.com/ • https://www.ablebits.com/office-addins-blog/2016/05/11/make-histogram-excel/ • Nice discussion of regressions • http://www.biddle.com/documents/bcg_comp_chapter4.pdf

  34. Regression by hand Image from https://www.pinterest.ie/pin/497295983830246961/?lp=true

  35. Regression by hand • Don’t worry if this feels intimidating – remember, you’ll never be asked to do it – it’s purely for your own sense of deep understanding • To keep this simple, let’s walk through 7 rows of fictional data

  36. Regression by hand • As mentioned already, the basic formula for a regression is • Predicted value = slope of line × value of predicting variable + intercept + error, i.e. Y' = bX + A + E • To work out the slope (b), we need • The mean of X and Y (Mx and My) • The standard deviation of X and Y (Sx and Sy) • The correlation between X and Y (r)

  37. Regression by hand • So how do we work out which line is the best fit? • First, we compute the mean for X (45) and subtract this from all values of X. The new variable is called x (the deviation scores for X). We do the same for Y (mean = 43) to create a new variable y.

  38. Regression by hand • We then calculate the sum of the squared deviations for x and y individually (Σx² and Σy²), along with the sum of the products of each xy pair (Σxy)

  39. Regression by hand • The correlation (r) is then calculated as r = Σxy / √(Σx² × Σy²) • In other words: the total sum of the xy products divided by the square root of (the total sum of squared x deviations × the total sum of squared y deviations) • In our case 804 / √(2485 × 561) • Which equals 804 / √1394085 • Which equals 804 / 1180.7 • Hence, r = 0.68

  40. Regression by hand • Now that we have the correlation, we can calculate the slope of the line (b) from the standard deviations b = r × Sy / Sx • In other words b = correlation × standard deviation of Y / standard deviation of X • In our case b = 0.68 × 6.9 / 22.18 = 0.21

  41. Regression by hand • The intercept (A) can be calculated from the means as A = My − b × Mx • In our case A = 92.9 − 0.68 × 44.4 = 62.7 • We can now test whether this is significant using the formula t = r × √(N − 2) / √(1 − r²), where N is the number of value pairs • In our case t = 0.68 × √(7 − 2) / √(1 − 0.68²) • Equals 0.68 × √5 / √0.54 • Equals 0.68 × 2.24 / 0.73 • Equals 1.52 / 0.73 = 2.08 • This t value has to be mapped to a separate table to be translated into % confidence
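  R can reproduce the whole hand calculation, including the p value the lookup table would give (a sketch with hypothetical x/y vectors, since the slide's 7 rows of data aren't reproduced here):

  x = c(20, 30, 40, 45, 55, 60, 65)             # hypothetical predictor values
  y = c(35, 40, 42, 44, 45, 50, 48)             # hypothetical outcome values
  r = cor(x, y)                                 # Pearson's r
  b = r * sd(y) / sd(x)                         # slope from r and the standard deviations
  A = mean(y) - b * mean(x)                     # intercept from the means
  t_stat = r * sqrt(length(x) - 2) / sqrt(1 - r^2)
  p_val = 2 * pt(-abs(t_stat), df = length(x) - 2)   # two-tailed p value, no lookup table needed
  c(r = r, b = b, A = A, t = t_stat, p = p_val)
  coef(lm(y ~ x))                               # cross-check: intercept and slope should match A and b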
