DESCRIPTIVE STATISTICS
UNIT 1: Collecting and Presenting Statistical Data
TouchText
• Inferential vs. descriptive statistics.
• Statistics vs. parameters.
• Frequency distributions.
• Histograms.
• Using MS Excel.
• Problems and Exercises.
(Descriptive) STATISTICS: Definition
As an academic subject, Statistics covers the following topics:
• Descriptive Statistics (this course!): the data itself is …
  • Collected and organized
  • Displayed
  • Summarized
• Inferential Statistics: from the data …
  • Relationships are established
  • Hypotheses are tested
  • Predictions are made
Statistics and Parameters
When data is collected from every possible subject, one is gathering data on an entire population, and the values that characterize that data are known as the population’s parameters. It is not always possible (or necessary!) to get data from every single member of a population. Usually, one gets data from a representative sample (subset) of the population. The values that characterize this data are known as the sample statistics.
[Diagram: Population (parameters) and Sample (statistics)]
Descriptive and Inferential Statistics (cont.)
Descriptive Statistics involve (a) collecting and (b) describing data from a sample of the population. Inferential Statistics use sample statistics to make inferences (i.e. reasoned conclusions) about the population as a whole.
[Diagram: Describe → Sample Statistics → Infer → Population (parameters)]
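The distinction between a parameter and a statistic can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the population of ages is hypothetical, and the mean is used as the characterizing value only because it is familiar.

```python
import random
import statistics

# Hypothetical population: the ages of all 12 members of a small club.
population = [19, 22, 22, 25, 31, 34, 38, 41, 47, 52, 58, 63]

# Parameter: computed from every member of the population.
population_mean = statistics.mean(population)

# Statistic: computed from a sample (subset) of the population.
random.seed(1)  # fixed seed so the same sample is drawn on every run
sample = random.sample(population, 5)
sample_mean = statistics.mean(sample)

print("population mean (parameter):", population_mean)
print("sample:", sample)
print("sample mean (statistic):", sample_mean)
```

The sample mean will usually differ somewhat from the population mean; inferential statistics deals with how far such a statistic can be trusted as an estimate of the parameter.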
TouchText Format: Key Words and Phrases
• Infer, Inferential Statistics
• Deduce
DATA: Observations and Attributes
In statistics, data are simply bits of information collected about people or things.
• Each one of those people or things from which data is collected is referred to interchangeably as a(n):
  • Observation
  • Subject
  • Record (in databases)
• Each unit of data collected about those people or things is referred to interchangeably as a(n):
  • Variable
  • Attribute
  • Field (in databases)
Two Examples: the difference between observations and variables
Example 1: Information about university students
Example 2: Information about financial stock prices
In these examples: 4 observations and 5 variables!
Describing Data
A data set is a collection of (values of) variables on a sample (or population) of observations.
• Data is sometimes ordered in a sequence depending on its values.
• Trends, if any, are a critical aspect of data collected over time.
• The data could be even (symmetrical), or it could be heavily weighted toward one side of the middle.
• The center of the data is always of interest.
• Sometimes all of the values form a nice collection. Other times there are misfits, or “outliers”.
• The data’s variability (its highs and lows) is also important.
Charts and graphs often bring out these various aspects of data.
Data Types
• Data will be either
  • Quantitative (i.e. numeric)
  • Qualitative (i.e. words)
• Quantitative data will be either
  • Continuous (i.e. able to take any value within an interval)
  • Discrete (i.e. limited to specific, separate values)
Univariate Data
To understand (and graph!) statistical distributions of variables, it is easiest to start with a reasonably large sample (i.e. a large number of observations), but to gather information on only one variable. When there is only one variable, the data set is described as univariate. Examples of univariate data sets would be: GPAs of a group of students; ages of an airport’s airline passengers; finish times of marathon runners in a race; unemployment rates for various groups of citizens; and so on.
Univariate Data: Listing Data (Quantitative, Discrete)
In the following example, we record the age (variable) from a sample of twenty (n = 20) bus passengers (observations).
At present, the sample of passengers and ages is in no particular order. The data is written as a vertical column, but it could have been written as a horizontal row, or a matrix with both columns and rows, or even listed in text form in a paragraph. In this text, where room permits, we usually write sample values as columns so that we can build upon them in spreadsheets.
Univariate Data: Ordering Data (Quantitative, Discrete)
We next re-order passengers by their ages, from lowest to highest. Re-ordering data is not necessary to compute certain statistics, but it is generally necessary to graph or otherwise portray the sample values. In MS Excel, select the data in a vertical column and choose DATA > SORT (A to Z, i.e. ascending).
Univariate Data: Frequency Distributions (Quantitative, Discrete)
After re-ordering passengers by age, we group all those passengers with the same ages. A frequency distribution lists each unique variable value (in this case age) along with the number of times it occurs in the sample.
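For readers who prefer code to spreadsheets, the re-ordering and grouping steps can be sketched in Python. The twenty ages below are hypothetical stand-ins, not the values from the bus passenger table.

```python
from collections import Counter

# Hypothetical ages of n = 20 bus passengers (not the slide's data).
ages = [34, 19, 45, 19, 27, 62, 34, 19, 27, 45,
        34, 71, 27, 19, 34, 62, 45, 34, 27, 19]

ordered_ages = sorted(ages)        # re-order from lowest to highest
frequency = Counter(ordered_ages)  # count how often each age occurs

# Print the frequency distribution: one row per unique age.
for age in sorted(frequency):
    print(age, frequency[age])
```

Each printed row pairs a unique age with its frequency, and the frequencies add up to the sample size n = 20.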
In the table below are exam scores of 30 students. The exam had 20 questions.
1. Who/what/how many are the observations?
2. Who/what/how many are the variables?
3. Re-order the test scores from lowest to highest.
4. Collect similar scores to create a frequency distribution.
If possible, do this task on the pre-formatted MS Excel spreadsheet (link below).
(Answer)
1. There are n = 30 observations of students.
2. There is one variable, the test score.
[Table: Frequency Distribution]
Discussion: How did you get (a) the ordered set, and (b) the frequency distribution, on MS Excel?
Graphing Frequency Distributions (Quantitative, Discrete)
Most frequency distributions are graphed as a histogram, like the one seen below for the bus passenger age example.
* In this histogram, note that the ages (on the horizontal X-axis) are equally separated, even though the actual age differences from one to the next can be very different. This is because ages with 0 frequency are excluded from the graph.
Frequency Distributions (alternate) (Quantitative, Discrete)
This graph uses the same data, but includes all ages, even those with 0 frequencies.
* In this graph, ages with 0 frequency are included. This makes the horizontal spacing along the X-axis proportional to the actual age differences, but it takes up a lot of space. Therefore, not all ages can be written explicitly on the X-axis. Instead, the axis is labeled at intervals of 10 here.
Compare the two graphs (of the same data) presented on the two preceding pages. Which do you prefer, and why? Create similar histograms on MS Excel with the exam score database used earlier.
Relative Frequency Distributions (Quantitative, Discrete)
A relative frequency distribution lists each frequency as a proportion (or percentage) of the total of all frequencies. The relative frequencies must sum to 1, or 100%.
• In a frequency distribution, the total of all frequencies (graphically, the sum total of all bar heights) always equals the sample size n.
• In a relative frequency distribution, the total of all relative frequencies (graphically, the sum total of all bar heights) always equals 100%, or 1.
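The two totals described above can be checked with a few lines of Python. This is a minimal sketch using a hypothetical frequency distribution of ages, not the slide's table.

```python
# Hypothetical frequency distribution: age -> number of passengers.
frequency = {19: 5, 27: 4, 34: 5, 45: 3, 62: 2, 71: 1}

n = sum(frequency.values())  # total of all frequencies = sample size

# Each relative frequency is the frequency divided by n.
relative_frequency = {age: count / n for age, count in frequency.items()}

for age, share in relative_frequency.items():
    print(f"{age}: {share:.0%}")

# The relative frequencies sum to 1 (up to floating-point rounding).
print("total:", sum(relative_frequency.values()))
```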
Relative Frequency Distributions: Graphically (Quantitative, Discrete)
Relative frequency distribution graphs have the same shapes as their associated frequency distributions. The only difference is what’s indicated (frequency or relative frequency) on the vertical Y-axis.
At this point, one might ask: “What’s the advantage, if any, of using relative frequency distributions instead of frequency distributions?” One answer is that relative frequencies are easier to read than large raw counts in a table (as opposed to a graph). Also, relative frequencies (and probabilities) are used for statistical analyses and tests. The main benefit of relative frequencies calculated from samples is that they are not scaled by the sample size, so distributions from samples of different sizes can be compared directly. Create a relative frequency distribution (with zeros) on MS Excel from the test score database used above.
Cumulative (Relative) Frequency Distributions (Quantitative, Discrete)
Frequency distributions can also be cumulative, showing the frequency of observations at or below a certain value. Cumulative frequency distributions go from 0 to n, left to right, and cumulative relative frequency distributions go from 0% to 100% (not shown).
Cumulative (Relative) Frequency Distributions
Ages with 0 frequency are included here.
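A cumulative distribution is just a running total of the frequencies, taken in order of increasing value. The sketch below assumes a hypothetical frequency distribution of ages and uses `itertools.accumulate` from the Python standard library for the running total.

```python
from itertools import accumulate

# Hypothetical frequency distribution: age -> number of passengers.
frequency = {19: 5, 27: 4, 34: 5, 45: 3, 62: 2, 71: 1}
ages = sorted(frequency)
n = sum(frequency.values())

# Running total of frequencies, from the lowest age to the highest.
cumulative = list(accumulate(frequency[age] for age in ages))
cumulative_relative = [c / n for c in cumulative]

for age, c, r in zip(ages, cumulative, cumulative_relative):
    print(f"age <= {age}: {c} passengers ({r:.0%})")
```

The last cumulative frequency equals n, and the last cumulative relative frequency equals 100%.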
It should be emphasized that we are using the same raw data in all of these graphs. The data is just being presented graphically in various ways. In practice, readers should be careful in how they interpret graphs, because the creators of graphs sometimes choose a graph design that appears to convey a result or message that they want to present to the reader.
Continuous Quantitative Data (Quantitative, Continuous)
Some quantitative data is continuous, e.g. the time it takes to drive from Los Angeles to San Francisco. With continuous data, the frequency of any one exact value (i.e. an infinitely small range) is effectively zero! Frequencies must therefore be defined over ranges of values, e.g. between 7.00 hours and 7.25 hours.
Grouping Quantitative Continuous Data (Quantitative, Continuous)
There is no “right way” to group continuous quantitative data. But there are good ways and bad ways to do so. These groupings are sometimes called “intervals”, “categories” (qualitative data), “classes”, “bins” (MS Excel), and other synonyms. It is typically recommended that a sample be grouped into somewhere between 5 and 20 classes, with larger samples being grouped into a larger number of classes. But the exact number of classes is up to the discretion of the person doing the sampling.
Even Grouping of Data (Quantitative, Continuous)
Often it is best to create same-sized intervals. For example, the time to drive from Los Angeles to San Francisco took …
• Less than 5 hours (interval 1)
• 5 to 5-1/2 hours (interval 2)
• 5-1/2 to 6 hours (interval 3)
• 6 to 6-1/2 hours (interval 4)
• 6-1/2 to 7 hours (interval 5)
• 7 to 7-1/2 hours (interval 6)
• 7-1/2 to 8 hours (interval 7)
• Greater than 8 hours (interval 8)
In this example, the intervals are 30 minutes each over the relevant range of time.
Example: Even Grouping of Data (Quantitative, Continuous)
Example: Driving times of 34 cars, Los Angeles to San Francisco. Even intervals of 30 minutes were selected.
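Grouping into even intervals can be sketched in Python as follows. The driving times are hypothetical (not the 34 observations from the example), and the sketch assumes that a time falling exactly on a boundary is placed in the higher interval, a choice the slides leave open.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical driving times in hours (not the slide's 34 observations).
times = [4.8, 5.2, 5.6, 5.9, 6.0, 6.1, 6.3, 6.4, 6.7, 6.9, 7.2, 7.4, 7.6, 8.3]

# Boundaries between the eight intervals, spaced 0.5 hours apart.
edges = [5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0]
labels = ["< 5", "5 to 5.5", "5.5 to 6", "6 to 6.5",
          "6.5 to 7", "7 to 7.5", "7.5 to 8", ">= 8"]

# bisect_right counts how many edges are <= t, which is the interval index.
# Assumption: a time exactly on a boundary goes into the higher interval.
counts = Counter(labels[bisect_right(edges, t)] for t in times)

for label in labels:
    print(f"{label:>9}: {counts[label]}")
```

Using `bisect_left` instead would place boundary values in the lower interval; either convention works as long as it is applied consistently.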
Notice that continuous data can be made to appear discrete simply by rounding the number. For example, 7.45456654333… hours is equivalent to approximately 7 hours, 27 minutes and 16.44 seconds. Rounding the number to 7h27 doesn’t change the fact that the data is continuous in nature. This raises an important point. Just as continuous quantitative data must be grouped into intervals, discrete quantitative data can be grouped into intervals, and often is.
Natural Grouping of Data (Quantitative, Continuous)
Sometimes it is better to create uneven intervals, consistent with the objectives of the sampler and/or the nature of the sample. For example, a sample of cinema goers’ ages might be broken down into groupings related to ticket prices …
• (Interval 1) Infants under 4 years old (free entry)
• (Interval 2) Children aged 4 to under 16 (discounted ticket)
• (Interval 3) Adults aged 16 to under 60 (full-priced ticket)
• (Interval 4) Senior citizens aged 60 and over (discounted ticket)
In this example, the intervals are determined by the various ticket categories and prices.
Example: Natural Grouping of Data (Quantitative, Discrete)
Example: Ages of 36 cinema goers. Intervals were selected to match various ticket prices.
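Uneven intervals can be handled the same way as even ones; only the list of boundaries changes. In this sketch the ages are hypothetical (not the 36 observations from the example), and it assumes that someone exactly 4, 16 or 60 years old belongs to the older category.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical ages of cinema goers (not the slide's 36 observations).
ages = [2, 3, 5, 9, 12, 15, 16, 21, 28, 35, 35, 44, 59, 60, 67, 73]

# Unevenly spaced boundaries taken from the ticket categories.
edges = [4, 16, 60]
categories = ["infant (free)", "child (discounted)",
              "adult (full price)", "senior (discounted)"]

# Assumption: an age exactly on a boundary goes into the older category.
counts = Counter(categories[bisect_right(edges, age)] for age in ages)

for category in categories:
    print(f"{category}: {counts[category]}")
```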
Qualitative Data (Qualitative)
Qualitative data (expressed with words) can be categorized and placed in a frequency distribution, much like discrete or grouped data.
* For categorical data, the graph is referred to as a bar chart, rather than a histogram.
Qualitative Data (cont.) (Qualitative)
However, because there is often no natural ranking of categories (from smallest to largest, say), qualitative data is often stated in relative frequencies and placed in a pie chart, rather than a bar chart.
Group Task: Conduct Your Own Survey
If practicable, work with other students in a group to:
• Design a questionnaire asking fellow students about one qualitative variable and one quantitative variable.
• Administer the survey in class to collect answers to your two questions.
• Create appropriate tables on an MS Excel spreadsheet that display the answers to your questions.
• Present those results on a graph (or graphs) of your choosing.
• Share and discuss your results with others in your class.
Time Series
A time series data set records quantitative values as they change over time. Examples would include the monthly unemployment rate, the stock market’s daily closing value, and the daily high temperature. The time periods in the data set (seconds, hours, days, weeks, months, years) are determined by the problem at hand. In any case, data is typically recorded from left (earlier) to right (later), and often presented with a line graph instead of a bar chart.
Time Series Line Graph
Time series graphs are often embellished with trend lines, upper and lower bounds, etc. to help further analysis of the data. Create a similar time series graph on the pre-loaded MS Excel spreadsheet (link below).
Creating Frequency Distributions and Histograms on MS Excel
Students have been encouraged to learn how to enter, format and present data using spreadsheets rather than hand calculators. It is likely that most students have done most of their data organization and collection themselves (i.e. the “hard way”), rather than relying on MS Excel’s data analysis features. The next several slides show how to use MS Excel features to create a frequency distribution and a histogram on MS Excel 2007 or 2010. To see this demonstration, click on the spreadsheet icon below in the navigation pane.
Grouping Data and Creating Frequency Distributions in MS Excel
1. Activate MS Excel’s Data Analysis add-in. You need to do this only once, and Excel will retain the add-in until it is removed. MS Office Button (2007) or File (2010) > Options > Add-Ins > Analysis ToolPak
2. (Recommended but not necessary) Create a vertical list of your data. Data can actually be stored in any cells. But it’s easiest to keep it in a list, if it’s not too long, or in a matrix.
3. Create a list of values, ordered from smallest to largest, at which you want MS Excel to group the data. (MS Excel calls these “bins”.) You don’t need to enter the lowest possible value (here, it is zero) or the highest possible value, because MS Excel makes room for these automatically.
4. Select Data > Data Analysis > Histogram > OK.
5. Select the relevant fields for your histogram.
• Select the input range (your raw data).
• Select the bin range (your data classification range limits).
• Make sure that “New Worksheet Ply” is selected; you can even rename the sheet if you like.
• Check the “Chart Output” box and click “OK”.
6. Your output will appear on your new MS Excel worksheet.
7. Make any cosmetic changes to the histogram that you see fit. Here, we changed the chart type, the title and the range where the horizontal axis labels come from. We also changed the histogram to relative frequencies (by computing them from the frequency list) and showed the frequencies above each bar.
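To see what the bin list in step 3 does, the counting can be imitated in Python. This is only a sketch under a stated assumption: that each value is counted in the first bin whose value is greater than or equal to it, with one extra overflow row (labelled “More” here) for values above the last bin. Check that against the output of your own version of Excel. The function name `histogram_counts`, the scores and the bin values are all hypothetical.

```python
from bisect import bisect_left

def histogram_counts(data, bins):
    """Count values per bin; v goes into the first bin b with v <= b."""
    bins = sorted(bins)
    counts = [0] * (len(bins) + 1)  # last slot is the overflow ("More") row
    for v in data:
        # bisect_left gives the index of the first bin value >= v,
        # or len(bins) when v is larger than every bin value.
        counts[bisect_left(bins, v)] += 1
    return counts

# Hypothetical exam scores out of 20 and hypothetical bin values.
scores = [4, 7, 9, 10, 10, 12, 13, 15, 15, 16, 18, 20]
bins = [5, 10, 15]

counts = histogram_counts(scores, bins)
for label, count in zip([f"<= {b}" for b in bins] + ["More"], counts):
    print(label, count)
```

Under this convention a score of exactly 10 is counted in the “<= 10” row, not in the next one.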
Symmetric and Asymmetric Distributions
Until we derive more precise statistical values for various distributions (in the units to follow), we can only comment vaguely about their shapes. Recall that the only restriction on frequency distributions is that no observed frequency can be negative in value. And the only additional restriction on relative frequency distributions is that the sum total of relative frequencies adds to 100%, or 1. With so few mathematical restrictions, frequency distributions and their associated histograms can have almost any form or shape. However, we can make general distinctions about the forms and shapes of distributions commonly seen in practice.
Symmetrical Distributions
Many distributions are symmetrical around their centers, such as the distribution and histogram shown below. In this example, 25 students are given a score out of a maximum of 10. Clearly, the scores are centered around a value of 7 out of 10, and are symmetrically distributed around that value.
Positively Skewed Distributions
Some distributions are positively skewed, with extremely high observations above their centers, such as the distribution and histogram shown below. In this example, 25 students are given a score out of a maximum of 10. Here, the value of the “center” of the distribution is ambiguous, because of the long positive tail of the distribution.
Negatively Skewed Distributions
Some distributions are negatively skewed, with extremely low observations below their centers, such as the distribution and histogram shown below. In this example, 25 students are given a score out of a maximum of 10. Here, the value of the “center” of the distribution is ambiguous, because of the long negative tail of the distribution.
Other Data Presentation Methods: Stem-and-Leaf Displays
A Stem-and-Leaf Plot, or Display, is a way of presenting quantitative data that economizes on space (and is thus often used in textbooks). Moreover, unlike histograms that group data, stem-and-leaf displays retain the original data points.
How to Read a Stem-and-Leaf Display
• Stem: in this example, how many 10’s are in the observed value.
• Leaf: in this example, how many 1’s are in the observed value.
Consider the 6 stem at the bottom, indicating six 10’s, or 60. There are three 60-something numbers in the data set, indicated by the three numbers to the right of the 6 stem. The first number is 60 + 1, or 61, the next number is 60 + 6, or 66, and the final number is 60 + 7, or 67. These values can be verified on the frequency distribution (far left).
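A stem-and-leaf display is easy to build in code, which may help make the stem and leaf definitions concrete. The data set below is hypothetical, apart from 61, 66 and 67, which are the three values named for the 6 stem. The sketch assumes two-digit whole numbers, so that the stem is the tens digit and the leaf is the ones digit.

```python
from collections import defaultdict

# Hypothetical two-digit data; 61, 66 and 67 match the slide's 6 stem.
data = [23, 27, 31, 35, 35, 42, 48, 49, 53, 61, 66, 67]

stems = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)  # stem = number of 10s, leaf = number of 1s
    stems[stem].append(leaf)

# One row per stem, with its leaves listed in increasing order.
for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
```

The last row printed is `6 | 1 6 7`, which reads back as 61, 66 and 67, so no original data point is lost.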