Probability and Statistics Lecture notes 03

Lesson Overview • Types of Data • Qualitative (Categorical) • Quantitative (Numerical): • Discrete vs. Continuous • Levels of Measurement: • Nominal, Ordinal, Interval, Ratio • Data Summary and Presentation • The Stem-and-leaf Diagram • The Frequency Distribution Tables • Histogram • The Box Plot • Time Sequence Plots

Types of Data Data can be classified as either numeric or nonnumeric. Specific terms are used as follows: • Qualitative data are nonnumeric. {Poor, Fair, Good, Better, Best}, colors (ignoring any physical causes), and types of material {straw, sticks, bricks} are examples of qualitative data. • Qualitative data are often termed categorical data. Some books use the terms individual and variable to refer to the objects and characteristics described by a set of data. They also stress the importance of exact definitions of these variables, including the units in which they are recorded. The reason the data were collected is also important.

Types of Data • Quantitative data are numeric. • Quantitative data are further classified as either discrete or continuous. • Discrete data are numeric data that have a finite (or countably infinite) number of possible values. A classic example of discrete data is a finite subset of the counting numbers, {1,2,3,4,5}, perhaps corresponding to {Strongly Disagree... Strongly Agree}. • When data represent counts, they are discrete. An example might be how many students were absent on a given day. Counts are usually considered exact and integer. Consider, however, if three tardies make an absence, then aren't two tardies equal to 0.67 absences?

Quantitative data / Types of Data • Continuous data have infinitely many possible values: 1.4, 1.41, 1.414, 1.4142, 1.41421... The real numbers are continuous with no gaps or interruptions. • Physically measurable quantities such as length, volume, time, and mass are generally considered continuous. At the physical level (microscopically), especially for mass, this may not be true, but for normal life situations it is a valid assumption. The structure and nature of data will greatly affect our choice of analysis method. By structure we are referring to the fact that, for example, the data might be pairs of measurements.

Levels of Measurement • The experimental (scientific) method depends on physically measuring things. • The concept of measurement has been developed in conjunction with the concepts of numbers and units of measurement. • Statisticians categorize measurements according to levels. • Each level corresponds to how this measurement can be treated mathematically.

Levels of Measurement (Measurement Scales) – Four common types • Nominal: Nominal data have no order and thus only give names or labels to various categories. • Ordinal: Ordinal data have order, but the interval between measurements is not meaningful. • Interval: Interval data have meaningful intervals between measurements, but there is no true starting point (zero). • Ratio: Ratio data have the highest level of measurement. Ratios between measurements as well as intervals are meaningful because there is a starting point (zero). (Gender is something you are born with, whereas sex is something you should get a license for.)

Levels of Measurement (Measurement Scales) – Four common types • Nominal level: categories are mutually exclusive (non-overlapping), but there is no order or ranking. For example, professors are divided into departments by subject, but no subject is ranked as better than another. • Ordinal level: categories can be ordered, but the differences between them are not precise. For example, letter grades, movie quality (excellent, good, adequate, bad, terrible). • Interval level: data can be ranked and the differences between values are precise, but there is no meaningful zero. For example: IQ scores and temperature (in Celsius or Fahrenheit). Neither has a meaningful zero. • Ratio level: data can be ranked, there are precise differences between the ranks, and there is a meaningful zero. For example: height, weight, salary, and age.

Types of Data / Levels of Measurement • Example 1: Colors. To most people, the colors black, brown, red, orange, yellow, green, blue, violet, gray, and white are just names of colors. • To an electronics student familiar with color-coded resistors, these data are in ascending order and thus represent at least ordinal data. • To a physicist, the colors red, orange, yellow, green, blue, and violet correspond to specific wavelengths of light and would be an example of ratio data.

Types of Data / Levels of Measurement • Example 2: Temperatures. What level of measurement a temperature is depends on which temperature scale is used. Specific values: 0°C = 32°F = 273.15 K = 491.67°R; 100°C = 212°F = 373.15 K = 671.67°R; -17.8°C = 0°F = 255.4 K = 459.67°R, where C refers to Celsius, F to Fahrenheit, K to Kelvin, and R to Rankine. • Only Kelvin and Rankine have true zeroes (starting points), so only on these scales are ratios meaningful. Celsius and Fahrenheit are interval data; certainly order is important and intervals are meaningful. However, a 180° dashboard is not twice as hot as the 90° outside temperature (Fahrenheit assumed)! Although ordinal data should not be used for calculations, it is not uncommon to find averages formed from data collected which represented Strongly Disagree, ..., Strongly Agree! Also, averages of nominal data (zip codes, social security numbers) are rather meaningless!
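
The dashboard remark can be checked numerically. The following is a minimal sketch, not part of the original notes: the helper name fahrenheit_to_kelvin is an assumption made for illustration. It converts both Fahrenheit readings to Kelvin, a ratio scale, before forming the ratio.

```python
def fahrenheit_to_kelvin(temp_f):
    """Convert a Fahrenheit temperature to kelvins (hypothetical helper)."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15


dashboard_f = 180.0  # dashboard temperature from the slide, in °F
outside_f = 90.0     # outside temperature from the slide, in °F

# Ratio of the raw Fahrenheit numbers: not physically meaningful,
# because 0°F is not a true zero.
ratio_fahrenheit = dashboard_f / outside_f

# Ratio on the Kelvin scale, which does have a true zero.
ratio_kelvin = fahrenheit_to_kelvin(dashboard_f) / fahrenheit_to_kelvin(outside_f)

print(f"Fahrenheit ratio: {ratio_fahrenheit:.2f}")
print(f"Kelvin ratio:     {ratio_kelvin:.2f}")
```

On the Kelvin scale the dashboard is only about 16% hotter than the outside air, even though the Fahrenheit numbers differ by a factor of two.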

Data Sources • Published source • Designed experiment • Survey • Observational study

Data (Variables) • Quantitative (Numerical): Discrete, Continuous • Qualitative (Nonnumerical): Nominal (Categorical), Ordinal (Rank-ordered categories)

DATA SUMMARY AND PRESENTATION • The Stem-and-leaf Diagram • The Frequency Tables: • Standard, Relative, and Cumulative • Histograms • The Box Plot • Time Sequence Plots

Graphical Displays • The distribution of a variable describes what values the variable takes and how often each value occurs. • The frequency of any value of a variable is the number of times that value occurs in the data. • The relative frequency of any value is the proportion (fraction or percent) of all observations that have that value.
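
As a small illustration of these definitions, the sketch below counts frequencies and relative frequencies for a made-up list of survey responses. The data are hypothetical and chosen only for the example.

```python
from collections import Counter

# Hypothetical qualitative observations (not from the lecture).
observations = ["Good", "Fair", "Good", "Best", "Poor", "Good", "Fair", "Best"]

# Frequency: number of times each value occurs.
frequency = Counter(observations)

# Relative frequency: proportion of all observations with that value.
n = len(observations)
relative_frequency = {value: count / n for value, count in frequency.items()}

for value, count in frequency.items():
    print(f"{value:<5} frequency={count}  relative={relative_frequency[value]:.3f}")
```

The relative frequencies always sum to 1 (or 100%), which is why they can be used directly for a pie chart.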

DATA SUMMARY AND PRESENTATION • Frequency Tables: Standard, Relative, and Cumulative • Histograms, Ogives, Pareto Diagrams • Pie Charts • Exploratory Data Analysis • Stem-and-Leaf Diagram • Boxplots

Types of Variables • Categorical variable: Places an individual into one of several categories. • Examples: Gender, race, political party, zip code • Quantitative variable: Takes numerical values for which arithmetic operations make sense. • Examples: OYS score, number of votes, cost of textbooks

Graphs for categorical variables • Pie charts require relative frequencies since they display percentages and not raw data. The relative frequency of each category corresponds to the percent of the pie that is occupied by that category. • Bar graphs display data where the categories are on the horizontal axis and the frequencies (or relative frequencies) are on the vertical axis.

Graphs for quantitative variables Histograms: • The data are divided into classes of equal width and the number (or percentage) of observations in each class is counted. • The data scale is on the horizontal axis. • The frequency (or relative frequency) scale is on the vertical axis. • Bars are drawn so that the base of each bar covers the class and the height of each bar equals the frequency (or relative frequency) of that class.
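
The counting step behind a histogram can be sketched as follows. The function name histogram_counts, the scores, and the choice of five classes of width 10 starting at 50 are all assumptions made for illustration. Each class here contains its lower end and excludes its upper end.

```python
def histogram_counts(data, lower, width, num_classes):
    """Count observations in equal-width classes [lower + i*width, lower + (i+1)*width)."""
    counts = [0] * num_classes
    for x in data:
        index = int((x - lower) // width)
        if not 0 <= index < num_classes:
            raise ValueError(f"{x} falls outside the classes")
        counts[index] += 1
    return counts


# Hypothetical integer scores (not from the lecture).
scores = [52, 57, 61, 64, 68, 70, 73, 75, 77, 81, 84, 88, 93]
counts = histogram_counts(scores, lower=50, width=10, num_classes=5)

# Text histogram: one row per class, one star per observation.
for i, count in enumerate(counts):
    start = 50 + 10 * i
    print(f"{start:>3}-{start + 9:<3} {'*' * count}")
```

A plotting library would draw the same counts as bars; the text version is used here only to keep the sketch free of dependencies.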

Stemplots or Stem-and-Leaf Displays: • Separate each observation into a stem unit (all but the final, rightmost digit of the (rounded) data) and a leaf unit (the final digit of the (rounded) data). • Write the stems in a vertical column, smallest to largest from top to bottom. • Write each leaf in the row to the right of its stem, in increasing order.

Histograms vs. Stemplots • Both are used to describe the distribution of data. • Stemplots display actual data values. • Stemplots are used for small data sets (fewer than 100 values). • Histograms can be constructed for larger data sets.

Common Distributional Shapes: • A symmetric distribution is one where both sides about the center line are approximately mirror images of each other. • A skewed distribution is one where one side of the center line contains more data than the other. • Skewed to the right: The right side of the histogram extends much farther than the left side. • Skewed to the left: The left side of the histogram extends much farther than the right side.

Common Distributional Shapes: • A bimodal distribution has two humps where much of the data lies. • All classes occur with approximately the same frequency in a uniform distribution. • An outlier in any graph of data is an individual observation that falls outside the overall pattern of the graph.

DATA SUMMARY AND PRESENTATION • THE STEM-AND-LEAF DIAGRAM • A stem-and-leaf diagram is a good way to obtain an informative visual display of a data set x1, x2, ..., xn, where each number xi consists of at least two digits. • To construct a stem-and-leaf diagram, we divide each number xi into two parts: • a stem, consisting of one or more of the leading digits, and • a leaf, consisting of the remaining digits.

Write the stems in a vertical column, smallest to largest from top to bottom. • Write each leaf in the row to the right of its stem, in increasing order.
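
The construction steps can be sketched in a few lines. This is a minimal illustration, assuming non-negative integer data with a one-digit leaf. The function name stem_and_leaf and the sample values are hypothetical and are not the data set used in the lecture example.

```python
from collections import defaultdict


def stem_and_leaf(data):
    """Return the rows of a stem-and-leaf display for non-negative integers."""
    stems = defaultdict(list)
    # Sorting first guarantees the leaves of each stem are in increasing order.
    for x in sorted(data):
        stem, leaf = divmod(x, 10)  # leaf = final digit, stem = leading digits
        stems[stem].append(leaf)

    rows = []
    # List every stem from smallest to largest, including empty ones.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        rows.append(f"{stem:>3} | {leaves}")
    return rows


# Hypothetical observations.
values = [118, 76, 105, 87, 123, 101, 115, 110, 121]
rows = stem_and_leaf(values)
print("\n".join(rows))
```

Empty stems are kept in the display so that gaps in the data remain visible.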

THE STEM-AND-LEAF DIAGRAM • EXAMPLE • Construct a stem-and-leaf display for the following data:

THE STEM-AND-LEAF DIAGRAM SOLUTION • We will select as stem values the numbers 7, 8, 9, 10, 11, …, 24. • The resulting stem-and-leaf diagram is presented in the following figure.

THE STEM-AND-LEAF DIAGRAM

THE STEM-AND-LEAF DIAGRAM • Stems are sorted in decreasing order; leaves are ordered in increasing order.

THE STEM-AND-LEAF DIAGRAM • Inspection of this display immediately reveals that most of the data lie between 110 and 200 and that a central value is somewhere between 150 and 160. Furthermore, the data are distributed approximately symmetrically about the central value. • The stem-and-leaf diagram enables us to determine quickly some important features of the data that were not immediately obvious in the original table.

THE FREQUENCY DISTRIBUTION TABLES • Frequency Tables • Frequency refers to the number of times each category occurs in the original data. • A frequency table lists in one column the data categories or classes and in another column the corresponding frequencies. A common way to summarize or present data is with a standard frequency table.

Frequency Tables • Often, the category column will have continuous data and hence be presented via a range of values. In such a case, terms used to identify the class limits, class boundaries, class widths, and class marks must be well understood. • Class limits are the largest or smallest numbers which can actually belong to each class. Each class has a lower class limit and an upper class limit. • Class boundaries are the numbers which separate classes. They are equally spaced halfway between neighboring class limits.

Frequency Tables • Class marks are the midpoints of the classes. It may be necessary to use class marks to find the mean, standard deviation, etc. of data summarized in a frequency table. • Class width is the difference between two consecutive class boundaries (or between the corresponding class limits of adjacent classes).
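
To make the terms concrete, the sketch below computes class marks, class boundaries, class width, and the approximate mean of grouped data as the frequency-weighted average of the class marks. The class limits and frequencies are hypothetical values chosen for illustration.

```python
# Hypothetical frequency table: (lower class limit, upper class limit) per class.
class_limits = [(50, 59), (60, 69), (70, 79), (80, 89), (90, 99)]
frequencies = [2, 3, 4, 3, 1]

# Class mark: midpoint of each class.
class_marks = [(lo + hi) / 2 for lo, hi in class_limits]

# Class boundary: halfway between the upper limit of one class
# and the lower limit of the next.
inner_boundaries = [
    (class_limits[i][1] + class_limits[i + 1][0]) / 2
    for i in range(len(class_limits) - 1)
]

# Class width: difference between corresponding limits of adjacent classes.
class_width = class_limits[1][0] - class_limits[0][0]

# Approximate mean of the grouped data: sum(f * mark) / sum(f).
grouped_mean = sum(f * m for f, m in zip(frequencies, class_marks)) / sum(frequencies)

print("class marks:     ", class_marks)
print("inner boundaries:", inner_boundaries)
print("class width:     ", class_width)
print(f"grouped mean:     {grouped_mean:.2f}")
```

The grouped mean is only an approximation, since every observation in a class is treated as if it sat at the class mark.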

Frequency Tables Following are guidelines for constructing frequency tables. • The classes must be "mutually exclusive"—no element can belong to more than one class. • Even if the frequency is zero, include each and every class. • Make all classes the same width. (However, open ended classes may be inevitable.) • Target between 5 and 20 classes, depending on the range and number of data points. • Keep the limits as simple and as convenient as possible.

Frequency Tables • Relative frequency tables contain the relative frequency instead of the absolute frequency. Relative frequencies can be expressed either as percentages or as their decimal fraction equivalents. • Cumulative frequency tables contain frequencies which are cumulative for subsequent classes. In a cumulative frequency table, the words "less than" usually also appear in the left column.
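
A relative and a cumulative frequency table can be derived from a standard one as sketched below. The upper class boundaries and frequencies are hypothetical values chosen for illustration.

```python
from itertools import accumulate

# Hypothetical standard frequency table, given by upper class boundaries.
upper_boundaries = [59.5, 69.5, 79.5, 89.5, 99.5]
frequencies = [2, 3, 4, 3, 1]

total = sum(frequencies)

# Relative frequency: each class frequency divided by the total count.
relative = [f / total for f in frequencies]

# Cumulative frequency: running total of the class frequencies.
cumulative = list(accumulate(frequencies))

print("Relative frequency table")
for upper, r in zip(upper_boundaries, relative):
    print(f"  class ending at {upper}: {100 * r:.1f}%")

print("Cumulative frequency table")
for upper, c in zip(upper_boundaries, cumulative):
    print(f"  less than {upper}: {c}")
```

The last cumulative frequency always equals the total number of observations, which is a quick check on the table.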

Frequency Tables The frequency distribution • A frequency distribution is a more compact summary of data than a stem-and-leaf diagram. • To construct a frequency distribution, we must divide the range of the data into intervals, which are usually called class intervals, cells, or bins.

Frequency Distribution Tables • EXAMPLE • Construct the frequency distribution table for the following data:

THE FREQUENCY DISTRIBUTION TABLES • SOLUTION • Class relative frequency • Cumulative frequency

Frequency Distribution Tables • Another example, containing student distributions, is as follows:
