Lecture 3 Summer Semester 2009

Lecture 3, Summer Semester 2009. BEA 140, by Leon Jiang. Some more points on univariate data. Central tendency: mean, median, mode. Variance. Population: $\sigma^2 = \left(\sum X_i^2 - (\sum X_i)^2 / N\right) / N$. Sample: $s^2 = \left(\sum X_i^2 - (\sum X_i)^2 / n\right) / (n-1)$. Standard deviation.

Presentation Transcript


  1. Lecture 3, Summer Semester 2009 • BEA 140 • By Leon Jiang • Leon Jiang, University of Tasmania

  2. Some more points on univariate data

  3. Central tendency • Mean • Median • Mode

  4. Variance • Population: $\sigma^2 = \left(\sum X_i^2 - (\sum X_i)^2 / N\right) / N$ • Sample: $s^2 = \left(\sum X_i^2 - (\sum X_i)^2 / n\right) / (n-1)$

  5. Standard deviation • The standard deviation is the square root of the variance: $s = \sqrt{s^2}$, where $s^2 = \left(\sum X_i^2 - (\sum X_i)^2 / n\right) / (n-1)$
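As a minimal sketch, the computational formulas for the variance and standard deviation can be checked in Python. The data batch and the function name `variance` are hypothetical, chosen only for illustration.

```python
import math

def variance(values, sample=True):
    """Variance via the computational formula: (sum of X^2 - (sum of X)^2 / n) / divisor."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    numerator = sum_x2 - sum_x ** 2 / n
    divisor = n - 1 if sample else n  # n - 1 for a sample, N for a population
    return numerator / divisor

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical batch, not from the lecture

pop_var = variance(data, sample=False)  # 4.0
samp_var = variance(data, sample=True)  # 32 / 7, about 4.571
pop_sd = math.sqrt(pop_var)             # 2.0
samp_sd = math.sqrt(samp_var)
```

The only difference between the two versions is the divisor, which is why the sample variance is always slightly larger than the population variance for the same batch.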

  6. The meaning of the standard deviation • “For most data batches, around two thirds (or 68%) of the data will fall within one standard deviation of the mean, and around 95% within two standard deviations of the mean.” • This is known as the empirical rule, or rule of thumb.

  7. MEASURING FROM GROUPED DATA

  8. Measuring For Grouped Data • When no raw data are available and we have only a secondary source, we have to analyze that secondary data set, which has already been grouped for reporting purposes. • Grouped data differ from raw data in that the observations have already been placed into classes chosen by whoever reported them. • Because the individual values within each class are no longer known, measures computed from grouped data are approximations, so small errors arise.

  9. Generally we use a frequency distribution table to show the grouping of data.

  10. Class mark for a frequency distribution of grouped data • The class mark, $X_j$, is a representative value for all observations located in class j. • It is determined by the largest value and the smallest value in the class. • $X_j$ = (largest value + smallest value) / 2 • $X_j$ = (RUCL + RLCL) / 2 • where RUCL is the largest value in the class and RLCL is the smallest value.

  11. Central tendency for grouped data • The mean of grouped data (g.d.) is defined as the weighted sum of the class marks, with the class frequencies as weights, divided by the number of observations, i.e. • $\bar{X} = (\sum f_j X_j) / n$ • $\bar{X}$ = 294 / 54 = 5.44
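A short sketch of the grouped mean in Python. The class marks and frequencies below are hypothetical: they are not the lecture's table, and were chosen only so that the totals n = 54 and Σ f_j X_j = 294 agree with the slide.

```python
# Hypothetical frequency table (class mark, frequency); not the lecture's own table.
class_marks = [2, 4, 6, 8, 10, 12]
frequencies = [11, 19, 7, 10, 5, 2]

n = sum(frequencies)                                                # 54
weighted_sum = sum(f * x for f, x in zip(frequencies, class_marks)) # 294
grouped_mean = weighted_sum / n                                     # about 5.44
```

Each class mark stands in for every observation in its class, so the result is an approximation of the mean that the raw data would give.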

  12. Median for g.d. • Locating the median class: the class containing the median. • But how and where? • The total number of calls in the frequency distribution is 54 (an even number). • Therefore, according to the formula for the position of the median, (n + 1) / 2, the median ought to be the 27.5th value. • The class containing the 27.5th value is the median class.

  13. FORMULA FOR MD: • MD = LCL + class width * (how far into class) / (how many in class) • MD = 3.0 + 2 * (27.5 - 11) / 19

  15. Small errors are likely most of the time • Median from raw data = 4.4 • Median from grouped data = 3.0 + 2 * (27.5 - 11) / 19 = 4.74

  16. An example: MD = LCL + class width * (how far into class) / (how many in class)

  17. MD = LCL + class width * (how far into class) / (how many in class) • MD = 100 + 10 * (8.5 - 3) / (9 - 3) • Median = 109.17
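The interpolation formula for the grouped median can be sketched as a small Python function. The function and parameter names are hypothetical; the numbers passed in are the ones quoted on the slides (lower class limit, class width, position of the median, cumulative frequency below the median class, and frequency of the median class).

```python
def grouped_median(lcl, width, position, cum_below, class_freq):
    """MD = LCL + class width * (how far into class) / (how many in class)."""
    how_far_into_class = position - cum_below
    return lcl + width * how_far_into_class / class_freq

# Call-centre example: n = 54, so the median position is (54 + 1) / 2 = 27.5.
calls_median = grouped_median(lcl=3.0, width=2, position=27.5,
                              cum_below=11, class_freq=19)     # about 4.74

# Second example: 100 + 10 * (8.5 - 3) / (9 - 3).
second_median = grouped_median(lcl=100, width=10, position=8.5,
                               cum_below=3, class_freq=9 - 3)  # about 109.17
```

The formula assumes the observations are spread evenly across the median class, which is why the grouped median only approximates the median of the raw data.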

  18. Mode for g.d. • With grouped data, we tend to talk of a modal class, the class (or classes) with the highest frequency, rather than of the mode. • If asked for a mode with grouped data, the best we can do is to give the class mark of the modal class, as follows: • Modal class: 3 and under 5 (19 observations) • Mode: 4 (class mark of the modal class)

  19. Dispersion (variance) for grouped data • The sample variance formula is: $s^2 = \left\{\sum f_j X_j^2 - (\sum f_j X_j)^2 / n\right\} / (n-1)$ • The population variance formula is: $\sigma^2 = \left\{\sum f_j X_j^2 - (\sum f_j X_j)^2 / N\right\} / N$ • Standard deviation = $s = \sqrt{s^2}$ or $\sigma = \sqrt{\sigma^2}$

  20. Preparing a table to help work out the standard deviation

  21. Working out the standard deviation for the example~! • $s^2 = \left\{\sum f_j X_j^2 - (\sum f_j X_j)^2 / n\right\} / (n-1)$ • Standard deviation: $s = \sqrt{s^2} = 14.14$ • Mean = 1770 / 16 = 110.625
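A sketch of the working table in Python. The frequency table is hypothetical: it is not the lecture's table, and was chosen only so that n = 16 and Σ f_j X_j = 1770 agree with the slide. The standard deviation it yields is therefore close to, but not the same as, the 14.14 quoted above.

```python
import math

# Hypothetical frequency table (class mark, frequency); not the lecture's own table.
class_marks = [85, 95, 105, 115, 125, 135]
frequencies = [1, 2, 6, 3, 2, 2]

n = sum(frequencies)                                                  # 16
sum_fx = sum(f * x for f, x in zip(frequencies, class_marks))         # 1770
sum_fx2 = sum(f * x * x for f, x in zip(frequencies, class_marks))    # 198800

grouped_mean = sum_fx / n                                 # 110.625
sample_variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)   # about 199.58
sample_sd = math.sqrt(sample_variance)                    # about 14.13
```

The three running totals (n, Σ f X, Σ f X²) are exactly the column sums a hand-built working table would produce.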

  22. Shape • Skewness relates to the symmetry of a distribution. • Positively skewed or right-skewed: the tail extends to the right; mean > median > mode • Negatively skewed or left-skewed: the tail extends to the left; mean < median < mode

  23. Standard scores • The standard score expresses any observation in terms of the number of standard deviations it is from the mean. • t score (for a sample): $t = (X - \bar{X}) / s$ • z score (for a population): $z = (X - \mu) / \sigma$

  24. Interpretation of standard score • Mean 5, standard deviation 2, for a sample • t score for 8 = (8 - 5) / 2 = 1.5 • Interpretation: the observation is 1.5 standard deviations above the sample mean.
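The standard score calculation is small enough to sketch directly. The function name `standard_score` is hypothetical; the inputs are the slide's own numbers, plus one extra observation (3) added here to show a negative score.

```python
def standard_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

score_above = standard_score(8, mean=5, sd=2)  # 1.5
score_below = standard_score(3, mean=5, sd=2)  # -1.0 (illustrative extra value)
```

A positive score means the observation lies above the mean, a negative score below it.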

  25. Bivariate Variables: Summary measures

  26. Bivariate variables • In the previous parts, we dealt throughout with a single numerical variable, such as the rate of return of mutual funds. • From this lecture on, we shall study the relationship between two variables.

  27. Two numerical variables • A case: • In a call center, operators are trained to handle phone calls. However, call durations differ considerably from one call to another. The shorter the duration of a call, the more efficient the operator is taken to be. • Suppose the call center manager wants to know whether the training hours the operators received are correlated with the duration of the calls they handled. • The data collected are as follows: • X: training hours • Y: call duration in minutes

  28. The data look like this:

      X (training hours):   6.5  7.5  6    8.5  5.5  3.5   8.5  8    8    7    8.5  9.5  …
      Y (duration, mins):   6.2  2.9  9.2  3.2  8.9  13.6  2.5  4.2  4.3  3.1  3.4  2.7  …

      In total there are 54 phone calls in the data set being studied. • What we want to find out is whether these two variables (X, training hours of operators; Y, duration of calls in minutes) show any real correlation. Put simply, the call center manager wants to know whether operators who receive more training hours handle calls in less time.

  29. Setting up a scatter diagram for the data here ~! • A scatter diagram (scattergram) of two variables indicates the form, type and strength of the relationship. • Form: whether linear or non-linear • Type: direct (positive) or inverse (negative) • Strength: how closely the data points follow the relationship, e.g. if linear, how close the ordered pairs are to the line describing it. This is indicated by a correlation measure.

  30. (Pearson’s) Coefficient of Correlation • This is a summary measure, r, that describes the type and strength of the linear relationship shown in a scattergram. • r ranges from -1 to 1. • -1: perfect negative relationship; all points lie exactly on a negatively sloping line • 0: no linear correlation • 1: perfect positive relationship; all points lie exactly on a positively sloping line

  31. Back to the case study • r (Pearson’s coefficient of correlation) = -0.9209 • This means X and Y have a very strong negative linear relationship. • In other words, the training hours the operators received show a strong negative relationship with the duration of the calls they handled.
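A minimal sketch of Pearson's r in Python, using the usual computational form r = Sxy / sqrt(Sxx * Syy). It uses only the 12 (X, Y) pairs listed on slide 28, not the full data set, so the result is close to but not the same as the lecture's -0.9209. The function name `pearson_r` is hypothetical.

```python
import math

# The 12 pairs shown on slide 28 (the full data set is not listed there).
x = [6.5, 7.5, 6, 8.5, 5.5, 3.5, 8.5, 8, 8, 7, 8.5, 9.5]      # training hours
y = [6.2, 2.9, 9.2, 3.2, 8.9, 13.6, 2.5, 4.2, 4.3, 3.1, 3.4, 2.7]  # minutes

def pearson_r(xs, ys):
    """Pearson's coefficient of correlation via sums of squares and cross-products."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxy = sum(a * b for a, b in zip(xs, ys)) - sum_x * sum_y / n
    sxx = sum(a * a for a in xs) - sum_x ** 2 / n
    syy = sum(b * b for b in ys) - sum_y ** 2 / n
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(x, y)  # about -0.93 for these 12 pairs
```

The negative sign reflects the inverse relationship: operators with more training hours tend to have shorter calls.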

  32. In-depth analysis of this linear relationship: linear regression • Determining the coefficient of correlation is concerned with summarizing the form, type and strength of the relationship between two variables. • The motivation for regression is the desire to quantify the relationship, often in order to use knowledge of one variable to predict the other. • That is, using one variable (X) to predict the other variable (Y).

  33. The regression line is mathematically expressed by this equation • $Y_c = a + bX$ • $Y_c$ is the computed value of Y. • a is the sample regression constant, or Y-intercept. • b is the sample regression coefficient, or slope of the line.

  34. Least squares method • This is a mathematical technique that determines the values of a and b that minimize the sum of squared differences between the actual value of Y and the predicted value of Y. Any other values for a and b result in a greater sum of squared differences. • Simply put, the least-squares method is used to find a line of best fit for two correlated variables.

  35. Working out the linear regression ~! • A residual is defined as the vertical distance between the actual value and the predicted value (the point on the line of best fit). • In least-squares regression, we find the values of a and b such that the sum of squares of the residuals is a minimum. • Actual pairs: (X1, Y1), (X2, Y2), … • Predicted (calculated) pairs: (X1, Yc1), (X2, Yc2), …

  36. Back to the case study~! • Since we know that training hours correlate with call duration, we can say that if we know the training hours an operator received, we can predict how many minutes, on average, he or she should take to handle a phone call. • In linear regression terms: given X, the line fitted by the least-squares method lets us calculate Yc.

  37. Solutions for a & b • Two formulae, one each for a and b.
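For reference, the standard least-squares solutions in computational form are given below. This is an assumption about which form the slide showed; other algebraically equivalent forms exist.

```latex
b = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{n}}{\sum X^2 - \dfrac{(\sum X)^2}{n}},
\qquad
a = \bar{Y} - b\,\bar{X}
```

The slope b is found first, since the intercept a depends on it; the second formula also shows that the fitted line always passes through the point of means.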

  38. Establishing a table to work out linear regression

  39. Outcomes ~! • b = -1.79595 • a = 18.40399 • Then Yc = 18.404 - 1.796X • This is the linear regression equation. • Interpretation: for each extra hour of training, there is an associated decrease of 1.796 minutes in call duration.
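A sketch of the least-squares calculation in Python, assuming the standard solutions b = Sxy / Sxx and a = mean(Y) - b * mean(X). It uses only the 12 (X, Y) pairs listed on slide 28, so the fitted line differs somewhat from the lecture's Yc = 18.404 - 1.796X, which was computed from the full data set. The function name `least_squares` is hypothetical.

```python
# The 12 pairs shown on slide 28 (the full data set is not listed there).
x = [6.5, 7.5, 6, 8.5, 5.5, 3.5, 8.5, 8, 8, 7, 8.5, 9.5]      # training hours
y = [6.2, 2.9, 9.2, 3.2, 8.9, 13.6, 2.5, 4.2, 4.3, 3.1, 3.4, 2.7]  # minutes

def least_squares(xs, ys):
    """Return (a, b) for the line Yc = a + b * X that minimizes squared residuals."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxy = sum(p * q for p, q in zip(xs, ys)) - sum_x * sum_y / n
    sxx = sum(p * p for p in xs) - sum_x ** 2 / n
    b = sxy / sxx                      # slope
    a = sum_y / n - b * sum_x / n      # intercept: mean(Y) - b * mean(X)
    return a, b

a, b = least_squares(x, y)       # roughly a = 19.48, b = -1.95 for these 12 pairs
predicted_7_hours = a + b * 7.0  # computed duration for 7 training hours
```

The slope has the same sign and a similar size to the lecture's value, which is what one would expect from a subset of the same data.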

  40. One consideration~! • Note: regression says nothing about causation, only about association~! • This means a change in X does not necessarily cause a change in Y. • That is, training hours do not necessarily change the duration of calls; the two are merely correlated. • Think about it: does smoking cigarettes cause a shorter life expectancy? • Not really~!?

  41. The standard error of the estimate • The standard error of the estimate, Se, measures how well the actual Y and the computed Y match: the smaller Se, the better the match and the predictive accuracy.

  42. Note! • The standard error of the estimate is very similar to the standard deviation. • The standard error is for bivariate data (scatter around the regression line), whilst the standard deviation is for univariate data (scatter around the mean).

  43. Computational form for Se. • You can use this computational form to find Se.
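A sketch of the standard error of the estimate in Python, assuming the usual textbook definition Se = sqrt(Σ(Y - Yc)² / (n - 2)) and the computational form Se = sqrt((ΣY² - aΣY - bΣXY) / (n - 2)); the slide's own formula may be written differently. The data are only the 12 (X, Y) pairs listed on slide 28.

```python
import math

# The 12 pairs shown on slide 28 (the full data set is not listed there).
x = [6.5, 7.5, 6, 8.5, 5.5, 3.5, 8.5, 8, 8, 7, 8.5, 9.5]
y = [6.2, 2.9, 9.2, 3.2, 8.9, 13.6, 2.5, 4.2, 4.3, 3.1, 3.4, 2.7]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(p * q for p, q in zip(x, y))
sum_x2 = sum(p * p for p in x)
sum_y2 = sum(q * q for q in y)

# Least-squares slope and intercept (standard formulas, assumed here).
b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
a = sum_y / n - b * sum_x / n

# Definitional form: square root of the mean squared residual, with n - 2 in the divisor.
sse = sum((q - (a + b * p)) ** 2 for p, q in zip(x, y))
se_definition = math.sqrt(sse / (n - 2))

# Computational form: needs only the column totals of a working table.
se_computational = math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))
```

Both forms give the same value (about 1.36 minutes for these 12 pairs); the computational form is simply easier to evaluate by hand from a table of sums.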

  44. Coefficient of determination • Total variation = SST = $\sum (Y - \bar{Y})^2$ • Explained variation = SSR = $\sum (Y_c - \bar{Y})^2$ • Unexplained variation = SSE = $\sum (Y - Y_c)^2$ • Coefficient of determination = SSR / SST = $r^2$

  45. Coefficient of determination • The coefficient of determination, by calculation, turned out to be 0.848. • This means about 85% of the total variation in call duration (around the average duration) is explained by the linear relationship between duration and training hours.
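A sketch of the variation breakdown in Python, assuming the standard definitions of SST, SSR and SSE. It uses only the 12 (X, Y) pairs listed on slide 28, so its r² differs slightly from the lecture's 0.848; the last line checks that the lecture's figure is simply the square of its r = -0.9209.

```python
# The 12 pairs shown on slide 28 (the full data set is not listed there).
x = [6.5, 7.5, 6, 8.5, 5.5, 3.5, 8.5, 8, 8, 7, 8.5, 9.5]
y = [6.2, 2.9, 9.2, 3.2, 8.9, 13.6, 2.5, 4.2, 4.3, 3.1, 3.4, 2.7]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares slope and intercept (standard formulas, assumed here).
b = (sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
     / sum((p - mean_x) ** 2 for p in x))
a = mean_y - b * mean_x
computed = [a + b * p for p in x]  # Yc for each observed X

sst = sum((q - mean_y) ** 2 for q in y)               # total variation
ssr = sum((yc - mean_y) ** 2 for yc in computed)      # explained variation
sse = sum((q - yc) ** 2 for q, yc in zip(y, computed))  # unexplained variation

r_squared = ssr / sst                 # about 0.862 for these 12 pairs
lecture_r_squared = (-0.9209) ** 2    # about 0.848, the value quoted in the lecture
```

Because SST = SSR + SSE for a least-squares line, the same ratio can also be computed as 1 - SSE / SST.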

  46. We just saw summary measures for dealing with two numerical variables. What about ordinal data?

  47. Two ordinal variables • A scattergram can also be used to illustrate a possible relationship between two ordinal variables. • We often have ordinal variables in fields such as Marketing and Management, where people have been asked to rank some attribute. • An example could be a series of taste trials carried out during product development, such as the example below, where a panel was asked to rank soft drinks by “Refreshingness” and “Sweetness”.

  48. Understanding this example • This example shows which of the drinks is the most refreshing, which is the second most refreshing, and so on. • Likewise, it shows which is the sweetest, which is the second sweetest, and so on.
