BMS 617

BMS 617 Lecture 2: Types of Variable, measures of central tendency, and scatter Marshall University Genomics Core Facility

Types of Variable • Understanding the type of variable with which you are working is important. • Type of variable determines which arithmetic operations make sense • Helps determine which tests are appropriate for hypothesis testing Marshall University School of Medicine

Determining the type of variable • To determine the type of variable we are using, we ask the following questions: • Is there an ordering for values of the variable? • If there is an ordering, is there a scale? • i.e. Does an increase in one unit always mean the same thing? • If there is an ordering and a scale, does the value zero have a specific meaning? • Additionally, we ask if the variable is continuous or discrete. • Continuous means there’s always a value lying strictly between any two distinct values • So it must be able to take on fractional values • Discrete means it takes on only specific, disjoint, values.
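The three questions above can be read as a small decision procedure. The Python sketch below is one way to write it down; the function name `classify_variable` and its boolean parameters are illustrative assumptions, not part of the lecture.

```python
def classify_variable(has_ordering: bool, has_scale: bool, has_meaningful_zero: bool) -> str:
    """Answer the three questions in order and stop at the first 'no'."""
    if not has_ordering:
        return "nominal"
    if not has_scale:
        return "ordinal"
    if not has_meaningful_zero:
        return "interval"
    return "ratio"


# Examples taken from the lecture's own slides.
print(classify_variable(False, False, False))  # race: categories only
print(classify_variable(True, False, False))   # economic status: ordered, no scale
print(classify_variable(True, True, False))    # temperature in Celsius: arbitrary zero
print(classify_variable(True, True, True))     # elapsed time: zero means "no time"
```

Whether the variable is continuous or discrete is a separate question and is deliberately left out of this sketch.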

Nominal variables • Nominal variables are those whose values have no ordering. • Just qualitative categories. • Cannot be continuous. • Examples: • Gender • Values are "Male", "Female" • Race • Values are "Black", "White", "Asian", "Native American", etc.

Ordinal variables • Ordinal variables are variables with qualitative categories which have an ordering, but no scale. • Example: Economic status • Values are typically stated as "Low", "Medium", or "High", which are computed using a number of factors (income, education level, occupation, wealth). • These are ordered because there is a natural ordering low → medium → high. • They have no scale because the difference between low and medium is not necessarily the same as the difference between medium and high.

Interval Variables • Interval variables are variables with ordering and scale, but with no meaningful zero • Examples: Temperature in Celsius or Fahrenheit • There is a scale, because a difference of one degree means the same thing, no matter what the starting temperature is. • However, the choice of a zero value is essentially arbitrary.

Operations on interval variables • Computing differences of values of interval variables makes sense. • For example, computing a change in temperature (difference between two temperatures) makes sense, since a change of one unit (one degree) always means the same thing. • Computing ratios of values of interval variables does not make sense, because there is no meaningful zero value. • Ratios of values are dimensionless • Have no units • Should be the same no matter what units we start in. • 100°C is not double 50°C • These values are equal to 212°F and 122°F respectively.
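The temperature argument can be checked numerically. This is a minimal sketch using the standard Celsius-to-Fahrenheit conversion; the function and variable names are chosen here for illustration.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Standard conversion: F = C * 9/5 + 32."""
    return celsius * 9 / 5 + 32


hot_c, warm_c = 100.0, 50.0
hot_f = celsius_to_fahrenheit(hot_c)    # 212.0
warm_f = celsius_to_fahrenheit(warm_c)  # 122.0

# The ratio depends on the unit, so it carries no real meaning.
ratio_c = hot_c / warm_c   # 2.0
ratio_f = hot_f / warm_f   # 212 / 122, about 1.738

# Differences behave consistently: every Celsius difference maps to a
# Fahrenheit difference by the same fixed factor of 9/5.
diff_c = hot_c - warm_c    # 50.0
diff_f = hot_f - warm_f    # 90.0

print(ratio_c, round(ratio_f, 3))
print(diff_f / diff_c)
```

The ratio changes with the unit while the difference only rescales, which is why differences are meaningful for interval variables and ratios are not.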

Operations on Ratio Variables • Ratio variables have an ordering, a scale, and a meaningful zero. • It makes sense to compute differences and ratios of ratio variables. • A blood pressure of 120 is double the blood pressure of 60. • Note that the difference of values of an interval variable is always a ratio variable • For example, elapsed time (essentially the difference between two dates) is a ratio variable

Examples For each of the following, determine the type of the variable (Nominal, Ordinal, Interval, Ratio). Also determine whether it is continuous or discrete.

Ambiguity in variable types • Determining the type of variable can depend on context, and/or on the measurement techniques used. • In a psychological experiment, patients are exposed to flash cards of various colors and activity in specific parts of the brain is measured. • Color here is (most likely) a nominal variable. • In a cosmological experiment, the colors of stars are observed and used (along with other data) to determine their relative speeds. • Color here is measured by wavelength of light, and is a ratio variable. • Is Age a continuous or discrete variable? • Age is really a continuous (ratio) variable: it's the amount of time elapsed since birth. However, it is often collected as a discrete variable, by rounding down to a whole number of years. The imprecision in this rounding is usually insignificant, since effects of age tend to be noisier than this loss of precision anyway. However, it is usually better to collect data on a subject's date of birth and subsequent dates of important events in the study: this way ages can be calculated to the number of days if required. • In statistical analysis, it is usually fine to treat age as a continuous variable, even when the measurement is rounded to a whole number of years. All continuous data is measured to a finite degree of precision, and the loss of precision becomes part of the noise. This is no different with age.
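The advice to record dates rather than rounded ages can be illustrated with Python's standard `datetime` module. The helper names and the dates below are hypothetical and chosen only for the example.

```python
from datetime import date


def age_in_days(birth: date, event: date) -> int:
    """Exact elapsed time between birth and a study event, in days."""
    return (event - birth).days


def age_in_whole_years(birth: date, event: date) -> int:
    """Age rounded down to whole years, as it is often collected."""
    years = event.year - birth.year
    # Subtract one year if the birthday has not yet occurred in the event year.
    if (event.month, event.day) < (birth.month, birth.day):
        years -= 1
    return years


# Hypothetical subject and two hypothetical study dates one day apart.
birth = date(1990, 6, 15)
day_before_birthday = date(2020, 6, 14)
birthday = date(2020, 6, 15)

print(age_in_whole_years(birth, day_before_birthday))  # 29
print(age_in_whole_years(birth, birthday))             # 30
print(age_in_days(birth, birthday) - age_in_days(birth, day_before_birthday))  # 1
```

The two events differ by a single day, yet the rounded ages differ by a full year. With both dates stored, either form of age can be recovered later.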

Summarizing Data • The next sections of the course will focus on continuous data. • Or data that may be treated as continuous • Often, experiments will collect more data than can reasonably be presented in a poster, presentation, or manuscript. • If this is not the case, then present all the data! • Typically, we collect data points in the range of dozens upwards (to trillions, in the case of sequencing experiments) • Data must be summarized for presentation and interpretation.

Aims of Summarizing Data • Summarized data may be presented textually (in a table) or graphically • A good summary should: • Demonstrate what a "typical" value looks like. • Demonstrate the extent to which values deviate from the "typical" value. • Provide as much detail as is realistically possible. • Clearly state how the summary was made.

Measures of Central Tendency • "Typical" values in a data set are identified by a measure of central tendency • Choosing the right measure is important • Mean • Median • Mode • All these are kinds of "Average"

Mean • The mean is the measure of central tendency most commonly understood by the word "average". • Sum of all the values divided by the number of values. • Since values are summed, mean only makes sense for interval and ratio data. • The mean can be dramatically affected by extreme outliers.
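A short sketch of the outlier effect, using Python's standard `statistics` module. The readings are hypothetical values invented for this example, not data from the lecture.

```python
from statistics import mean

# Hypothetical readings clustered around 120.
readings = [118, 120, 122, 119, 121]

# The same readings plus one extreme value.
with_outlier = readings + [220]

mean_clean = mean(readings)         # 600 / 5 = 120
mean_outlier = mean(with_outlier)   # 820 / 6, about 136.7

print(mean_clean, round(mean_outlier, 1))
```

A single extreme reading moves the mean well above every one of the five typical values.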

Median • The median is the "middle" value. • Computed by ordering all values and taking the middle one. • Mean of the middle two if there is an even number of values. • Not affected by a small number of outliers, no matter how extreme. • A good measure for ordinal data.
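The procedure described above can be sketched directly. The function `median_of` is written here for illustration and is compared against the standard library's `statistics.median`; the readings are hypothetical.

```python
from statistics import median


def median_of(values):
    """Order the values and take the middle one (or the mean of the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical readings, then the same readings with one extreme value added.
readings = [118, 120, 122, 119, 121]
with_outlier = readings + [220]

print(median_of(readings))      # 120 (odd count: the middle value)
print(median_of(with_outlier))  # 120.5 (even count: mean of 120 and 121)
print(median_of(with_outlier) == median(with_outlier))
```

Adding the extreme value shifts the median by only half a unit, in contrast to its effect on the mean.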

Mode • The mode is the most common value. • The French word mode means fashion. • Value that occurs most often. • Makes no sense for continuous data • If measured with enough precision, no value could occur more than once. • The best measure of central tendency for nominal data • Does not always measure the "center" of the data
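A brief sketch of both points, using `statistics.mode` and `statistics.multimode` (the latter requires Python 3.8 or later). The color labels and measurements are hypothetical.

```python
from statistics import mode, multimode

# Hypothetical nominal data: flash card colors shown in an experiment.
colors = ["red", "blue", "red", "green", "red", "blue"]
most_common_color = mode(colors)

# Hypothetical continuous measurements recorded with high precision:
# no value repeats, so every value ties as "most common".
measurements = [1.01, 2.37, 3.14]
tied_modes = multimode(measurements)

print(most_common_color)  # red
print(tied_modes)         # [1.01, 2.37, 3.14]
```

For the nominal data the mode is informative. For the continuous data it simply returns every value, which is the sense in which the mode "makes no sense" there.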

Averages do not tell the story • Merely stating an average can be extremely misleading. • The average human being has one breast and one testicle. • Example (simulated). Two patients have blood pressure measured every two hours from 6 a.m. to 10 p.m. • Both patients appear healthy…

Example • However, examine all the data: • Patient A no longer appears healthy…

Measures of Variability • Range • Just the minimum and maximum values in the data • Interquartile range • The range of the "middle half" of the data • Variance and/or standard deviation • A measure of the average deviation from the mean • Coefficient of variation • The standard deviation relative to the mean.

Range • Range is the simplest measure of variability. • Just the minimum and maximum values. • Very susceptible to outliers • One bad reading can completely change the range • For our simulated blood pressure data, it already gives a good clue as to what is happening.

Interquartile Range • Similar philosophy to the median • Order the values in the data set • Find the 25th percentile and the 75th percentile • The values ¼ and ¾ of the way along the ordering • The difference between them is the interquartile range • The interquartile ranges for the patients in our blood pressure example are 34 and 7 • Verify this!
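A minimal sketch of the calculation using `statistics.quantiles` from the Python standard library (Python 3.8 or later). The nine readings are hypothetical, not the lecture's simulated data. Note that several conventions exist for locating the 25th and 75th percentiles; the two offered by the standard library are shown, and which one the lecture used is not stated, so a hand check against the slide's values may depend on that choice.

```python
from statistics import quantiles


def interquartile_range(values, method="inclusive"):
    """Difference between the 75th and 25th percentiles."""
    q1, _median, q3 = quantiles(values, n=4, method=method)
    return q3 - q1


# Hypothetical readings, one every two hours from 6 a.m. to 10 p.m.
readings = [110, 112, 115, 118, 120, 122, 125, 128, 130]

iqr_inclusive = interquartile_range(readings, method="inclusive")
iqr_exclusive = interquartile_range(readings, method="exclusive")

print(iqr_inclusive)  # 10.0 (quartiles 115 and 125)
print(iqr_exclusive)  # 13.0 (quartiles 113.5 and 126.5)
```

With small data sets the two conventions can disagree noticeably, so it is worth stating which one was used when reporting an interquartile range.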

Standard Deviation • Standard Deviation is the most commonly used measure of variability • Intuitively, it measures the average difference between each data point and the mean. • Gives a sense of the average spread of the data

Computing the Standard Deviation • The formula for the standard deviation is given by
$$s = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}$$
• $Y_i$ represents each data point • $\bar{Y}$ is the mean • $n$ is the number of data points. • Motulsky (p 73) has a good discussion of why n-1 is used instead of n.
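The calculation with the n-1 denominator can be written out step by step. The function `sample_standard_deviation` is an illustrative name, the data values are hypothetical, and the result is compared with `statistics.stdev`, which also uses n-1.

```python
import math
from statistics import stdev


def sample_standard_deviation(values):
    """Square root of the summed squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean_value = sum(values) / n
    squared_deviations = [(y - mean_value) ** 2 for y in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))


# Hypothetical data: mean is 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sd = sample_standard_deviation(data)
print(round(sd, 3))                   # sqrt(32 / 7), about 2.138
print(math.isclose(sd, stdev(data)))  # True
```

Dividing by n instead of n-1 would give exactly 2 for this data, so the choice of denominator matters most when n is small.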

Variance • Variance is just the square of the standard deviation • Useful quantity for performing some statistical tests we’ll see later • Interpretation less intuitive than standard deviation • Units of standard deviation are the same as the units of the measurement • Units of variance are the square of the units of the measurement

Coefficient of Variation • The coefficient of variation (CV) is simply the standard deviation divided by the mean • Only makes sense for ratio variables (why?) • CV has no units • Often presented as a percentage • Occasionally useful for comparing scatter in variables in unrelated units
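One way to explore the "why?" is to compute the CV of the same measurements in two different units. This sketch uses hypothetical heights (a ratio variable) and hypothetical temperatures (an interval variable); the function name is illustrative.

```python
import math
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


# Ratio variable: hypothetical heights in centimeters and in meters.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
heights_m = [h / 100 for h in heights_cm]

# Interval variable: hypothetical temperatures in Celsius and Fahrenheit.
temps_c = [20.0, 25.0, 30.0]
temps_f = [t * 9 / 5 + 32 for t in temps_c]

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)
cv_c = coefficient_of_variation(temps_c)
cv_f = coefficient_of_variation(temps_f)

print(math.isclose(cv_cm, cv_m))         # True: unit change leaves the CV alone
print(round(cv_c, 3), round(cv_f, 3))    # 0.2 0.117: the CV depends on the unit
```

Changing units for a ratio variable rescales the mean and standard deviation together, so the CV is unchanged. For an interval variable the zero point shifts as well, so the CV changes with the unit and carries no real meaning.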

Graphing Data • We’ll look at four ways of graphing our blood pressure data: • Column Scatter Plot • Box and Whisker Plot • Column or Bar Chart • Line Chart • In all these, it’s important to show both a measure of central tendency (average) and a measure of variability

Column Scatter Plot • A column scatter plot plots all the data as individual points in a column • Rarely used • But very useful, for up to 100 data points • Not much software support • GraphPad Prism, for which Marshall SOM has a license, can do this

Column Scatter Plot Example

Box and Whisker Plot • A box and whisker plot shows the range, interquartile range, and median of the data set • A good choice when the median and interquartile range are good measures of central tendency and variation for your data • The median is marked with a horizontal line • The interquartile range is marked with a box • "Whiskers" extend to the full range of the data • A variation is for the whiskers to extend to most of the range, and outliers to be marked individually as points
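The numbers behind a box and whisker plot can be computed without any plotting library. The sketch below is built on two assumptions that the lecture does not specify: quartiles use the standard library's "inclusive" method, and the whisker variation uses the common 1.5 × IQR rule to decide which points count as outliers. The function name and readings are hypothetical.

```python
from statistics import median, quantiles


def box_plot_summary(values, whisker_factor=1.5):
    """Median, box edges, whisker ends, and outliers for one data set."""
    ordered = sorted(values)
    q1, _mid, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the box are outliers.
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {
        "median": median(ordered),
        "q1": q1,
        "q3": q3,
        "full_range": (ordered[0], ordered[-1]),     # whiskers in the basic form
        "whiskers": (inside[0], inside[-1]),         # whiskers in the variation
        "outliers": outliers,
    }


# Hypothetical readings with one extreme value.
summary = box_plot_summary([110, 112, 115, 118, 120, 122, 125, 128, 190])
print(summary)
```

For these readings the basic form draws the upper whisker out to 190, while the variation stops it at 128 and marks 190 as an individual point.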

Box and Whisker Plot Example

Bar Chart • Bar charts use horizontal or vertical bars to demonstrate the mean of the data set • "Error bars" are used to show a measure of variability • Some important considerations for bar charts: • It is natural to look at the relative size of the bars in order to compare the relative values of the means. • Therefore, bar charts should only be used with ratio data and should have the base of the bar at zero • There are various ways the error bars can be drawn (we will see later), so always clearly state what the error bars represent

Bar Chart Example

Line Chart • A line chart is useful if the data points are ordered, and the ordering is important • For example, if we want to track the data over time • Like a column scatter plot, a line chart plots all the data

Line Chart Example
