This lecture explores the relationship between two connected variables through observation and analysis of paired data. It presents concepts such as covariance, Pearson's correlation coefficient, and the method of least squares for curve fitting. The step-by-step procedure for determining the line of best fit using regression is discussed, along with practical examples using data points. The lecture also illustrates how to handle non-linear relationships by transforming them into linear forms, and it highlights the application in multiple regression scenarios.
Lecture (14,15) More than one Variable, Curve Fitting, and Method of Least Squares
Two Variables
Often two variables are in some way connected. Observation of the pairs:
X    Y
X1   Y1
X2   Y2
.    .
.    .
Xn   Yn
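Covariance The covariance gives some information about the extent to which two random variables vary together: it is positive when they tend to increase together and negative when one tends to decrease as the other increases.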
Example Covariance
x    y    x − x̄    y − ȳ    (x − x̄)(y − ȳ)
0    3     −3        0          0
2    2     −1       −1          1
3    4      0        1          0
4    0      1       −3         −3
6    6      3        3          9
x̄ = 3, ȳ = 3, Σ (x − x̄)(y − ȳ) = 7
What does this number tell us?
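The arithmetic for the five (x, y) pairs in this example can be checked with a short Python sketch. The slide's own covariance formula did not survive, so the helper name `covariance` and the choice to show both common divisors (n and n − 1) are assumptions made here for illustration.

```python
def covariance(xs, ys, sample=False):
    """Covariance of paired data: divide by n, or by n - 1 if sample=True."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of deviations from the two means
    sum_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sum_products / (n - 1 if sample else n)

# The five pairs from the example
x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]

cov_population = covariance(x, y)               # 7 / 5
cov_sample = covariance(x, y, sample=True)      # 7 / 4
```

Either divisor leaves a number whose size depends on the units of x and y, which is why it is hard to interpret by itself.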
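Pearson’s R • On its own, the covariance is hard to interpret because its size depends on the units of measurement • Solution: standardise this measure • Pearson’s R: standardise by dividing the covariance by the product of the two standard deviations: $r = \dfrac{\mathrm{cov}(x, y)}{s_x \, s_y}$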
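As a rough illustration of the standardisation, the sketch below computes Pearson's r for the five pairs of the covariance example. The function name `pearson_r` is hypothetical; the divisor (n or n − 1) cancels between numerator and denominator, so only sums of deviations are needed.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the std devs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Same data as the covariance example: sxy = 7, sxx = 20, syy = 20
r = pearson_r([0, 2, 3, 4, 6], [3, 2, 4, 0, 6])   # 7 / 20 = 0.35
```

Unlike the covariance, this value is unit-free and always lies between -1 and +1.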
Procedure of Best Fitting (Step 1)
How to find out the relation between the two variables?
1. Make observation of the pairs:
X    Y
X1   Y1
X2   Y2
.    .
.    .
Xn   Yn
Procedure of Best Fitting (Step 2) 2. Make plot of the observations. It is always difficult to decide whether a curved line fits nicely to a set of data. Straight lines are preferable. We change the scale to obtain straight lines.
Method of Least Squares (Step 3) 3. Specify a straight-line relation, Y = a + bX. We need to find the a and b that minimise the sum of the squared differences between the line and the observed data.
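A sketch of the standard derivation behind this step, in the Y = a + bX notation used here. The symbols $S$, $\bar{X}$ and $\bar{Y}$ are introduced for illustration and do not appear on the slide.

```latex
% Sum of squared differences between the observations and the line
S(a, b) = \sum_{i=1}^{n} \left( Y_i - a - b X_i \right)^2

% Set both partial derivatives to zero
\frac{\partial S}{\partial a} = -2 \sum_{i=1}^{n} \left( Y_i - a - b X_i \right) = 0
\qquad
\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} X_i \left( Y_i - a - b X_i \right) = 0

% Solving the two equations gives
b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
a = \bar{Y} - b \bar{X}
```

The second result shows that the fitted line always passes through the point of averages $(\bar{X}, \bar{Y})$.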
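Step 3 (cont.) Find the best fit of a line in a cloud of observations: the principle of least squares. The line is drawn as y = ax + b (on this slide a is the slope and b the intercept, the reverse of the Y = a + bX notation of Step 3). ŷ = predicted value, y = true value, ε = y − ŷ = residual error.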
Example We have the following eight pairs of observations:
Example (Cont.) Construct the least-squares line for the n = 8 observations.
Example (Cont.) Equation: Y = 0.545 + 0.636 * X. Number of data points used = 8. Average X = 7. Average Y = 5.
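The data table for this example did not survive in these notes, so the sketch below uses eight assumed (X, Y) pairs, chosen so that the averages are 7 and 5. They are a stand-in, not necessarily the lecture's data, and the function name `fit_line` is hypothetical. With these pairs the closed-form least-squares formulas reproduce the coefficients quoted above after rounding.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line Y = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept: line passes through the means
    return a, b

# ASSUMED data (not taken from the lecture): mean X = 7, mean Y = 5
X = [1, 3, 4, 6, 8, 9, 11, 14]
Y = [1, 2, 4, 4, 5, 7, 8, 9]

a, b = fit_line(X, Y)
print(f"Y = {a:.3f} + {b:.3f} * X")   # prints "Y = 0.545 + 0.636 * X"
```

A quick sanity check on any such fit is that a + b * (average X) equals the average Y; here 0.545 + 0.636 * 7 ≈ 5.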
Excel Application • See Excel
Covariance and the Correlation Coefficient • Use COVAR to calculate the covariance Cell =COVAR(array1, array2) • Average of products of deviations for each data point pair • Depends on units of measurement • Use CORREL to return the correlation coefficient Cell =CORREL(array1, array2) • Returns value between -1 and +1 • Also available in Analysis ToolPak
Analysis ToolPak • Descriptive Statistics • Correlation • Linear Regression • t-Tests • z-Tests • ANOVA • Covariance
Descriptive Statistics • Mean, Median, Mode • Standard Error • Standard Deviation • Sample Variance • Kurtosis • Skewness • Confidence Level for Mean • Range • Minimum • Maximum • Sum • Count • kth Largest • kth Smallest
Correlation and Regression • Correlation is a measure of the strength of linear association between two variables • Values between -1 and +1 • Values close to -1 indicate strong negative relationship • Values close to +1 indicate strong positive relationship • Values close to 0 indicate weak relationship • Linear Regression is the process of finding a line of best fit through a series of data points • Can also use the SLOPE, INTERCEPT, CORREL and RSQ functions
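To illustrate the ranges listed above, here is a small sketch with three made-up data sets (all values are hypothetical, as is the function name `correlation`). The third case is a reminder that correlation measures linear association only: y depends exactly on x, yet the coefficient is zero.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]

r_positive = correlation(x, [2 * v + 1 for v in x])   # exact rising line
r_negative = correlation(x, [-v for v in x])          # exact falling line
r_curved = correlation(x, [v ** 2 for v in x])        # exact parabola, no linear trend
```

Here `r_positive` is +1, `r_negative` is -1, and `r_curved` is 0, so a value near 0 rules out a linear relationship but not a curved one.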
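Polynomial Regression • Linear: $y = a_0 + a_1 x$ • Quadratic: $y = a_0 + a_1 x + a_2 x^2$ • Cubic: $y = a_0 + a_1 x + a_2 x^2 + a_3 x^3$ • General: $y = a_0 + a_1 x + a_2 x^2 + \dots + a_m x^m$ • Minimize the residual between the data points and the curve (least-squares regression) • Must find values of a0, a1, a2, …, am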
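Polynomial Regression • Residual: $e_i = y_i - (a_0 + a_1 x_i + a_2 x_i^2 + \dots + a_m x_i^m)$ • Sum of squared residuals: $S_r = \sum_{i=1}^{n} e_i^2$ • Minimize by taking derivatives: $\partial S_r / \partial a_k = 0$ for $k = 0, 1, \dots, m$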
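Polynomial Regression • Normal Equations: $\sum_{k=0}^{m} a_k \sum_{i=1}^{n} x_i^{j+k} = \sum_{i=1}^{n} x_i^{j} y_i$ for $j = 0, 1, \dots, m$, a system of m + 1 linear equations in the m + 1 unknown coefficients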
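One way to see the normal equations at work is to build and solve them directly. The following is a minimal pure-Python sketch under stated assumptions: the function names are hypothetical, the data are made up to lie exactly on a known quadratic, and the solver is plain Gaussian elimination with partial pivoting rather than anything the lecture prescribes.

```python
def solve_linear_system(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]   # augmented copy
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def polyfit_normal_equations(xs, ys, degree):
    """Return [a0, a1, ..., am] minimising the sum of squared residuals."""
    m = degree
    # Matrix entry (j, k) is the sum over the data of x^(j+k)
    A = [[sum(x ** (j + k) for x in xs) for k in range(m + 1)]
         for j in range(m + 1)]
    # Right-hand side entry j is the sum over the data of x^j * y
    rhs = [sum((x ** j) * y for x, y in zip(xs, ys)) for j in range(m + 1)]
    return solve_linear_system(A, rhs)

# Hypothetical data lying exactly on y = 1 + 2x + 3x^2
xs = [0, 1, 2, 3, 4]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
coeffs = polyfit_normal_equations(xs, ys, degree=2)   # close to [1, 2, 3]
```

For high degrees the normal-equation matrix becomes badly conditioned, so library routines usually solve the same least-squares problem by other means; the sketch is meant only to mirror the equations on the slide.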
Example Regression Equation: $y = -0.359 + 2.305x - 0.353x^2 + 0.012x^3$
Nonlinear Relationships • If the relationship is an exponential function, $y = a e^{bx}$: to make it linear, take the logarithm of both sides, $\ln y = \ln a + b x$. Now it’s a linear relation between ln(y) and x • If the relationship is a power function, $y = a x^{b}$: to make it linear, take the logarithm of both sides, $\ln y = \ln a + b \ln x$. Now it’s a linear relation between ln(y) and ln(x)
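The two transformations can be sketched in Python as follows. The model forms y = a·exp(b·x) and y = a·x^b, the symbols a and b, the function names, and the test data are all assumptions made for illustration; each fit is an ordinary straight-line regression on the transformed values, followed by a back-transform of the intercept.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope of a straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by regressing ln(y) on x. Requires y > 0."""
    intercept, slope = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(intercept), slope      # a = exp(intercept), b = slope

def fit_power(xs, ys):
    """Fit y = a * x**b by regressing ln(y) on ln(x). Requires x > 0, y > 0."""
    intercept, slope = fit_line([math.log(x) for x in xs],
                                [math.log(y) for y in ys])
    return math.exp(intercept), slope      # a = exp(intercept), b = slope

# Hypothetical noise-free data generated from known parameters
xs = [1, 2, 3, 4, 5]
a_exp, b_exp = fit_exponential(xs, [2 * math.exp(0.5 * x) for x in xs])
a_pow, b_pow = fit_power(xs, [3 * x ** 2 for x in xs])
```

Note that least squares on the logarithms minimises relative rather than absolute errors in y, so with noisy data the result can differ slightly from a direct nonlinear fit.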
Examples
• Quadratic curve
  • Flow rating curve: q = measured discharge, H = stage (height) of water behind outlet
• Power curve
  • Sediment transport: c = concentration of suspended sediment, q = river discharge
  • Carbon adsorption: q = mass of pollutant sorbed per unit mass of carbon, C = concentration of pollutant in solution
Example – Log-Log [Figure: two plots of the same data, x vs y on the original axes and X = log(x) vs Y = log(y)]
Example – Log-Log The line is fitted using the transformed X’s and Y’s, not the original x’s and y’s
Example – Carbon Adsorption $q = K C^{n}$, where q = pollutant mass sorbed per carbon mass, C = concentration of pollutant in solution, K = coefficient, n = measure of the energy of the reaction
Example – Carbon Adsorption Linear axes: K = 74.702, and n = 0.2289
Example – Carbon Adsorption Logarithmic axes: log K = 1.8733, so $K = 10^{1.8733} = 74.696$, and n = 0.2289
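The back-transform from the fitted intercept to K is a one-line calculation. The sketch below assumes base-10 logarithms and the power form q = K·C^n, both inferred from the values on the slides rather than stated there; the function name `sorbed_mass` is hypothetical.

```python
# Values reported for the fit on logarithmic axes
log10_K = 1.8733   # intercept of the fitted line (base-10 logs ASSUMED)
n = 0.2289         # slope of the fitted line

K = 10 ** log10_K  # back-transform the intercept; about 74.696

def sorbed_mass(C):
    """Predicted q at concentration C, ASSUMING the power form q = K * C**n."""
    return K * C ** n
```

The small difference from the linear-axes result (K = 74.702) reflects rounding and the fact that the two fits weight the data differently.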
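Multiple Regression
• Regression model:
Y1 = x11 b1 + x12 b2 + … + x1n bn + e1
Y2 = x21 b1 + x22 b2 + … + x2n bn + e2
:
Ym = xm1 b1 + xm2 b2 + … + xmn bn + em
• Multiple regression model in matrix notation:
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_m \end{bmatrix}$$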
Multiple Regression (cont.) Observed data = design matrix * parameters + residuals, or compactly $y = X b + e$, with y the m observed values, X the m × n design matrix, b the n parameters and e the m residuals.
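The slides stop at the model statement. As a sketch of the standard next step (not taken from the lecture), the least-squares estimate follows from minimising the squared length of the residual vector. Here $y$ is the vector of observed data, $X$ the design matrix, $b$ the parameter vector, and $\hat{b}$ and $S$ are symbols introduced for illustration.

```latex
% Sum of squared residuals in matrix form
S(b) = e^{\mathsf{T}} e = (y - X b)^{\mathsf{T}} (y - X b)

% Setting the gradient with respect to b to zero gives the normal equations
\frac{\partial S}{\partial b} = -2 X^{\mathsf{T}} (y - X b) = 0
\quad \Longrightarrow \quad
X^{\mathsf{T}} X \, b = X^{\mathsf{T}} y

% If X^T X is invertible, the least-squares estimate is
\hat{b} = \left( X^{\mathsf{T}} X \right)^{-1} X^{\mathsf{T}} y
```

Polynomial regression is the special case in which the columns of the design matrix are the powers $1, x, x^2, \dots, x^m$ of a single variable.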