Chapter 2 Examining Relationships in Data
Examining Relationships? • No, this is not a new soap opera. When we say examine relationships, we mean relationships in data. • For example, plot ACT scores (explanatory) against College GPA (response) • Many statistical studies seek relationships among two or more variables • My own research concerns CPU clock speeds, RAM timings, storage sub-systems and more as performance predictors in statistical analysis programs….
More Relationships • For now, we examine Bivariate Data • Study of relationships between two variables • For control, we often measure both variables from the same individual(s) • In general, we want to explain why the response variable changes in relation to the explanatory variable • The experimenter/statistician often has some control over the explanatory variable
Definitions • A response (dependent) variable “measures an outcome of a study” (page 86) • An explanatory (independent) variable “explains or influences changes in a response variable” • Since we can usually exercise control over the explanatory variable, we often set values of the explanatory variable to see how it affects the response variable
Variables with relationships? • When we don’t set the values of either variable, we can only observe outcomes. Hence, we may have no explanatory and response variables. • Calling one variable explanatory and the other response does not necessarily mean that changes in one variable cause changes in the other variable
Summing up the relationship • Fortunately, statistical analysis of multi-variable data uses the same styles of analysis we already know (I told you Stat 226 is cumulative!) • Step 1: Plot the data and generate numerical summaries • Step 2: Examine the overall patterns • Look for deviations from the patterns • Step 3: When the overall pattern seems “regular”, use a simple mathematical model to describe it • Regular? No outliers, or at least we know why we have outliers. Perhaps the data follow a known pattern (e.g., Normal)
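The "numerical summaries" part of the first step can be sketched in a few lines of Python. This is only an illustration: the course uses JMP, and the ACT and GPA values below are made up for the example, not taken from the class data set.

```python
import statistics

# Hypothetical (made-up) ACT scores and college GPAs for six students.
act = [21, 24, 26, 28, 30, 33]
gpa = [2.6, 2.9, 3.1, 3.5, 3.4, 3.9]


def summarize(values):
    """Return a few numerical summaries for one quantitative variable."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


act_summary = summarize(act)
gpa_summary = summarize(gpa)
print("ACT:", act_summary)
print("GPA:", gpa_summary)
```

Each variable is summarized on its own here. The summaries say nothing yet about how the two variables move together, which is why the plot comes first.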
Section 2.1: The Scatterplot • A scatterplot shows the relationship between two quantitative variables measured on the same individual(s) • The values of one variable appear on the horizontal axis (x), and the values of the other variable appear on the vertical axis (y) • Each individual appears in the plot as an (x, y) point. For example, X is the ACT score while Y is the College GPA. Perhaps one individual has X = 28 and Y = 3.5.
Tips to Remember • The explanatory variable is customarily on the x-axis (horizontal) • ACT Score • The response variable is customarily on the y-axis (vertical) • College GPA • Hence, statisticians usually call the explanatory variable x and the response variable y • If you cannot make the explanatory-response distinction, then either variable can be on either axis
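To make the axis convention concrete, here is a minimal text-only scatterplot sketch in Python. The function name `text_scatter` and all data points except (28, 3.5) are assumptions made for illustration. In practice you would use JMP or a plotting library instead of a character grid.

```python
def text_scatter(points, width=40, height=12):
    """Draw (x, y) points on a character grid.

    x (explanatory) runs left to right; y (response) runs bottom to top.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # Avoid dividing by zero when every value on an axis is identical.
    x_span = (x_max - x_min) or 1
    y_span = (y_max - y_min) or 1

    grid = [[" "] * width for _ in range(height)]
    for x, y in points:
        col = round((x - x_min) / x_span * (width - 1))
        row = round((y - y_min) / y_span * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid[0] is the top line
    return "\n".join("".join(line).rstrip() for line in grid)


# (ACT score, college GPA) pairs; hypothetical except for (28, 3.5).
students = [(21, 2.6), (24, 2.9), (26, 3.1), (28, 3.5), (30, 3.4), (33, 3.9)]
plot = text_scatter(students)
print(plot)
```

Each student contributes exactly one `*`, placed by the ACT score horizontally and the GPA vertically.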
Our Magic Red Triangle • The Correlation
Interpreting a Scatterplot • In any graph of data, look for • The overall pattern and • Striking deviations from that pattern • You can describe the overall pattern of a scatterplot by the • Form (Linear pattern?) • Direction (Positive, Negative or Flat) • Strength of the relationship (Correlation in 2.2) • Watch for outliers • An outlier is an individual value that falls outside the overall pattern of the relationship
Interpreting Scatterplots • Form • Linear pattern? • Distinct outliers or clusters of points? • Direction • Positively associated: The points trend upward. As one value increases, so does the associated value. • Negatively associated: The points trend downward. As one value increases, the other value decreases. • Strength • How closely the points follow a clear form
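Direction can also be checked numerically. The sketch below is one possible approach, not the method used in these slides: it sums the products of each point's deviations from the two means and reads off the sign. The function name `direction` and the tolerance `tol` are assumptions. The check only reflects a straight-line trend, so it does not replace looking at the plot.

```python
def direction(xs, ys, tol=1e-12):
    """Classify the association as 'positive', 'negative', or 'flat'.

    Uses the sign of the summed products of deviations from the means.
    """
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if total > tol:
        return "positive"   # high x tends to occur with high y
    if total < -tol:
        return "negative"   # high x tends to occur with low y
    return "flat"


print(direction([1, 2, 3], [2, 4, 6]))
print(direction([1, 2, 3], [6, 4, 2]))
print(direction([1, 2, 3], [5, 5, 5]))
```

A curved pattern can come out as "flat" by this check even when the relationship is strong, which is another reason to describe form before direction.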
Analyzing the Descriptions • Form: • Fairly Linear • Direction: • Positive • Strength: • Moderately strong • Outliers • Yes – one point and a second questionable
Spotting the Outlier • The outlier becomes apparent when we use the density ellipse
Identify the Attributes • Form • Not linear. • Direction • None? • Strength • Hmm…. • Outliers • You tell me!
Identify the Attributes • Form • Linear • Direction • Negative • Strength • Fairly strong • Outliers • None apparent
Identify the Attributes • As we have just seen, not all data will follow a linear pattern • Moreover, not all relationships have a clear direction • Sometimes you will encounter a cloud of points that has no “strength” of shape
Adding Categorical Variables • Use different colors or symbols to plot points when you want to add a categorical variable to a scatterplot • Sometimes several individuals have identical data • Use a different plotting symbol to call attention to such points
Categorical Variables • For our ACT vs. GPA data set, we also know the ages of the individuals • A = Age 18-22 • B = Age 22+ • To change the symbols in JMP, we first sort by the categorical variable:
Categorical Variables • Our Sorted Data by Age Group:
Changing the Symbol/Color • Select all the Individuals of the “A” category. Right Click in the region shown to bring up the menu.
The Graph with Age Groups • A = Red ‘+’ • B = Black Dot ‘.’
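The same idea, sorting by the categorical variable and giving each group its own symbol and color, can be sketched outside JMP. The records below are hypothetical. The group labels "A" and "B" and the red plus and black dot styling follow the slides, while the names `records`, `style`, and `styled_points` are made up for the example.

```python
# Hypothetical records: (ACT score, college GPA, age group).
records = [
    (28, 3.5, "B"),
    (21, 2.6, "A"),
    (33, 3.9, "A"),
    (24, 2.9, "B"),
    (30, 3.4, "A"),
]

# Sort by the categorical variable so each group's rows sit together,
# mirroring the JMP step. Python's sort is stable, so rows keep their
# original order within each group.
records_sorted = sorted(records, key=lambda rec: rec[2])

# Plotting symbol and color for each category.
style = {"A": ("+", "red"), "B": (".", "black")}

styled_points = [
    {"x": act, "y": gpa, "symbol": style[group][0], "color": style[group][1]}
    for act, gpa, group in records_sorted
]

for point in styled_points:
    print(point)
```

Most plotting libraries can style points from the group label directly. The sort is kept here only because it matches the JMP workflow of selecting all rows in one category at once.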
Section 2.1 Summary • To study relationships between variables, we must measure the variables on the same group of individuals. • If we think that a variable (x) may explain or even cause changes in another variable (y), we call x an explanatory variable and y a response variable.
Section 2.1 Summary • A scatterplot displays the relationship between two quantitative variables measured on the same individuals. Mark values of one variable on the horizontal axis and values of the other variable on the vertical axis. Plot each individual’s data as a point on the graph.
Section 2.1 Summary • Always plot the explanatory variable, if there is one, on the x axis of a scatterplot. Plot the response variable on the y axis. • Plot points with different colors or symbols to see the effect of a categorical variable in a scatterplot.
Section 2.1 Summary • To examine a scatterplot, look for an overall pattern showing the form, direction, and strength of the relationship. Also, look for outliers or other deviations from this pattern.
Section 2.1 Summary • Form: Linear relationships, where the points show a straight-line pattern, are an important form of relationship between two variables. Curved relationships and clusters are other forms to watch for.
Section 2.1 Summary • Direction: If the relationship has a clear direction, we speak of either positive association (high values of the two variables tend to occur together) or negative association (high values of one variable tend to occur with low values of the other variable).
Section 2.1 Summary • Strength: The strength of a relationship is determined by how close the points in the scatterplot lie to a simple form such as a line or ellipse.