Statistical Methods in Computer Science

Data 3: Correlations and Dependencies
Ido Dagan

Connecting Variables
• So far: we talked about the data reflected by a single variable
• Common scientific goal: relate variables to one another
  • Find out whether a relation exists between values of variables
  • Find out the strength of this relation
  • Find out the nature of this relation
• Our focus here: the relation between two variables
  • e.g., the relation between input size and run-time
  • e.g., the relation between time spent coordinating and productivity
  • e.g., the relation between shoe size and reading skills

Paired Samples
• The starting point for our discussion: bi-variate data
• Paired samples: for each X, give its corresponding Y
  • <input-size, run-time>
  • <time spent coordinating, productivity>
  • <shoe-size, reading skills>
• These paired samples come from the experiment
  • The experiment should record the data so as to allow the desired pairing
  • Pairing can be implicit, through fields/variables
  • E.g., test at beginning of year, test at end of year: pair by student

Tools for identifying bi-variate relations
• Visualize: Scatter Diagram (Scatter Plot)
• Ordered (numerical or ordinal) variables:
  • Pearson's correlation coefficient, r_XY (interval and ratio data)
  • Spearman's rank-correlation coefficient, rho (ρ) (ordinal data)
• Categorical variables:
  • Dependency tests (Chi-Square, covered in recitation)

Visualization: the X-Y Scatter Plot
• One variable is declared X, the other Y
• Axes of equal length (makes the relation easier to see)
• Plot values of X and Y together
  • For each X, plot the matching Y (or Ys)

Is there a relation?
• We see that in general, there is some relation here:
  • Lower X => lower Y
  • Higher X => higher Y
• But how can we recognize this systematically?
(Figure from "Statistical Reasoning", Minium, King, and Bear 1993)

Reminder: Variance
• Sum of squares: $SS_X = \sum (X - \bar{X})^2$
  • Shorthand for: sum of squared deviations from the mean
• And normalizing for the size of the sample: $S_X^2 = SS_X / n$
  • This is called the variance of the sample
• Distribution/population variance is denoted by $\sigma^2$, defined relative to μ
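The two steps above (sum of squares, then normalizing by the sample size) can be sketched in a few lines of Python. This is a minimal illustration, not the course's code: the function names and the scores are made up, and it assumes division by n (some texts divide by n - 1 when estimating the population variance).

```python
def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def sample_variance(values):
    """Sum of squares normalized by the sample size n (assumption: n, not n - 1)."""
    return sum_of_squares(values) / len(values)


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up scores, mean = 5
print(sum_of_squares(scores))   # 32.0
print(sample_variance(scores))  # 4.0
```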

Covariance
• Positive correlation: lower X <=> lower Y
• Negative correlation: lower X <=> higher Y
• How do we transform this into a measure?
• Intuition: multiply each pair of deviations from the means, and sum the results
$$\mathrm{Cov}(X,Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n}$$
  • positive × positive = positive; negative × negative = positive; positive × negative = negative
  • The sign of the covariance is determined by the accumulated values from points in the 1st and 3rd quadrants (relative to the means) vs. the 2nd and 4th
  • big × small = small, big × big = big
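The "multiply pairs and sum" intuition can be sketched as follows. The function name and the data are illustrative assumptions, and the sum is divided by n to match the sample-variance convention; pairs on the same side of both means add positive products, pairs on opposite sides add negative ones.

```python
def covariance(xs, ys):
    """Mean product of paired deviations from the two means (divides by n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


input_size = [1, 2, 3, 4, 5]       # made-up paired samples
run_time_up = [2, 4, 6, 8, 10]     # grows with input size
run_time_down = [10, 8, 6, 4, 2]   # shrinks with input size

print(covariance(input_size, run_time_up))    # 4.0
print(covariance(input_size, run_time_down))  # -4.0
```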

From Covariance to Correlation
• Big positive Cov(X,Y) means that X and Y grow together
• Big negative Cov(X,Y) means that as one grows, the other shrinks
• Problem: how big is big?
  • This depends on the scale of the values of X and Y
  • For instance: a large x (100000) multiplied by a small y (0.00001) gives a small product, even where both x and y are the largest values in their samples
• Solution: Pearson's correlation coefficient r_XY (or simply, r):
  • 1.0: perfect positive correlation
  • -1.0: perfect negative correlation
  • 0: no linear correlation

Reminder: z Scores
• Key idea: express all values in units of standard deviation
• This allows comparison of values from different distributions
  • But only if the shapes of the distributions are similar
• Example usage: sequence mining
  • We find the most frequent sequences of any length k
  • What are the most frequent sequences of the entire DB?
  • This is difficult to answer:
    • There are more short sequences than long ones
  • This can be solved by transforming frequency counts into their z scores
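As a reminder of the mechanics, a z score is the deviation from the mean divided by the standard deviation. The sketch below assumes the standard deviation is computed with n in the denominator; the function name and the counts are hypothetical.

```python
import math


def z_scores(values):
    """Express each value in units of standard deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


counts = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up frequency counts (mean 5, std dev 2)
print(z_scores(counts))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```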

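Formulas for r
• z-score based formula:
$$r_{XY} = \frac{\sum z_X z_Y}{n}$$
• Deviation-score based formula (equivalent):
$$r_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n \, S_X S_Y}$$
where $S_k$ denotes the standard deviation of variable k.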
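The equivalence of the z-score form and the deviation-score form can be checked numerically. The sketch below assumes standard deviations computed with n in the denominator; the function names and the data are illustrative only.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std(values):
    """Standard deviation with n in the denominator (assumption)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def pearson_r_z(xs, ys):
    """z-score form: the mean product of paired z scores."""
    mx, my, sx, sy = mean(xs), mean(ys), std(xs), std(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / len(xs)


def pearson_r_dev(xs, ys):
    """Deviation-score form: sum of deviation products over n * S_X * S_Y."""
    mx, my = mean(xs), mean(ys)
    products = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return products / (len(xs) * std(xs) * std(ys))


xs = [1, 2, 3, 4, 5]  # made-up paired samples
ys = [2, 4, 5, 4, 5]
print(round(pearson_r_z(xs, ys), 4))    # 0.7746
print(round(pearson_r_dev(xs, ys), 4))  # 0.7746
```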

Warning about misleading curves
• Using r is no substitute for visualization. Always visualize!
• r is good for linear relationships
(Figure: data sets with r = +0.82, from Anscombe, 1973)
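To see why r alone can mislead, the sketch below computes r for the first two data sets of Anscombe's quartet as they are commonly reproduced (values typed in here, so worth double-checking against the 1973 paper). The first set looks roughly linear on a scatter plot and the second is a smooth curve, yet r comes out almost identical. The helper name pearson_r is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from deviation scores (the n factors cancel out)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Anscombe's data sets I and II share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_linear = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_curved = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

r_linear = pearson_r(x, y_linear)
r_curved = pearson_r(x, y_curved)
print(round(r_linear, 2), round(r_curved, 2))  # both round to 0.82
```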

Correlation and Transformations
• Mean changes with additions, std dev does not
  • Raise all scores by 10 ==> raise mean by 10, no change to std dev
• Mean changes with multiplications, std dev does too
  • Multiply all scores by 10 ==> multiply mean and std dev by 10
• Pearson's r is not affected by a linear transformation with a positive scale factor, on X and/or Y
  • Adding = translating points
  • Multiplying = scaling
  • Neither affects the relation between the variables (multiplying one variable by a negative constant only flips the sign of r)
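A quick numerical check of these claims, as a sketch on made-up data (pearson_r is a hypothetical helper): translating or positively scaling X leaves r unchanged, while negating X flips its sign.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from deviation scores (the n factors cancel out)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]  # made-up paired samples
ys = [2, 4, 5, 4, 5]

r_original = pearson_r(xs, ys)
r_shifted = pearson_r([x + 10 for x in xs], ys)      # adding = translating
r_scaled = pearson_r([10 * x + 3 for x in xs], ys)   # positive linear transform
r_negated = pearson_r([-x for x in xs], ys)          # negative scale factor

print(round(r_original, 4), round(r_shifted, 4), round(r_scaled, 4), round(r_negated, 4))
```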

Interpreting Correlation
• Always visualize!
• Pearson's coefficient is only appropriate for linear relationships
  • r measures how closely points "hug" a straight line
  • Other measures exist for non-linear relations (Spearman's, eta)
• r is sensitive to value ranges within the target population
  • Smaller range => smaller r: differences in values are less meaningful
  • E.g., correlation between age and math skills for a small age range
• Large absolute r is not necessarily indicative of significance
  • r is subject to sampling variation: it may change from sample to sample, and significance depends on sample size
  • We will address the significance test of r later
• r is affected by the way some phenomenon is measured (e.g., grades on different types of scales: grades A, B, ... vs. 1-100)
• => Need to report the specific conditions of correlation measurements, and test again under different conditions to see if the variables are still correlated

Correlation and Causation
• IMPORTANT: correlation is not causation!
• Examples of positive correlations:
  • Grip strength and mathematical skills
  • Shoe size and reading level
  • ...
• But shoe size does not cause reading level!
  • The results are from kids aged 6-13: age affects both

Possible Explanations
• Two correlated variables may be:
  • Causally related (one causes the other)
  • Affected by the same third variable (that causes both; a control variable)
• Two uncorrelated variables (according to r) may be:
  • Correlated in a highly non-linear fashion (always visualize!)
  • E.g., a circle around 0 (balanced in all quadrants)
• There are specific ways to address these cases
  • Example: partial correlation
    • Correlation of a, b, given c
  • Example: manipulation controls (experiment design)
    • E.g., measure grip strength vs. math skill separately in different age groups
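The circle example can be verified directly: points spread evenly on a circle around 0 are perfectly dependent (each satisfies x² + y² = 1), yet their deviation products cancel across the four quadrants and r is essentially 0. This is a small sketch; pearson_r and the choice of eight points are assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from deviation scores (the n factors cancel out)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Eight points evenly spaced on the unit circle around (0, 0).
angles = [k * math.pi / 4 for k in range(8)]
circle_x = [math.cos(a) for a in angles]
circle_y = [math.sin(a) for a in angles]

r_circle = pearson_r(circle_x, circle_y)
print(abs(r_circle) < 1e-9)  # True: no linear correlation despite full dependence
```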

Partial Correlation
• A test for correlation between a and b, given c
  • Intuitively: the correlation between a and b that remains after neutralizing their correlation with c
• For an example, see "Empirical Methods in AI", Cohen 1995
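One common way to compute this is the standard first-order partial correlation formula, r_ab.c = (r_ab - r_ac r_bc) / sqrt((1 - r_ac²)(1 - r_bc²)). The sketch below uses it with made-up correlation values (they are not taken from Cohen's example), and the function name is an assumption.

```python
import math


def partial_correlation(r_ab, r_ac, r_bc):
    """First-order partial correlation of a and b, given c."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


# Hypothetical values: a and b look strongly correlated (0.7),
# but both are also strongly correlated with c (0.8 each).
r_ab_given_c = partial_correlation(r_ab=0.7, r_ac=0.8, r_bc=0.8)
print(round(r_ab_given_c, 3))  # 0.167: little correlation remains once c is accounted for
```

With these illustrative numbers, most of the apparent relation between a and b is explained by their shared relation with c.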

Visualize as well
(Figure from "Empirical Methods in AI", Cohen 1995)

Correlation for ordinal variables
• Pearson's coefficient is intended for ratio and interval data
• Ordinal data cannot be used as is
  • Here, the size of the difference between consecutive values is meaningless
  • Only direction matters (above or below)
• Examples:
  • Correlation between the military rank of career soldiers and the time they have been in the army
  • Correlation between user and system ranking of search results
• Spearman's rank-correlation (rho, ρ) addresses this

Spearman's rho: Step 1
• First step: transform all scores to ranks
  • First = 1, second = 2, ...
  • Ties: replace with the average of the intended ranks
• For instance, for ordinal data:

X      Private  Sgt.  Sgt.  Lt.  Capt.  Capt.  Capt.  Maj.  Col.  Col.  General
Xrank  1        2.5   2.5   4    6      6      6      8     9.5   9.5   11

  • Sgt.: (2+3)/2 = 2.5; Capt.: (5+6+7)/3 = 6; Col.: (9+10)/2 = 9.5
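The ranking step with averaged ties can be sketched as below. The function name and the numeric encoding of the military ranks (0 for Private up to 6 for General) are assumptions made only for this illustration; any encoding that preserves the order gives the same ranks.

```python
def average_ranks(values):
    """Rank values from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of values tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = ((pos + 1) + (end + 1)) / 2  # mean of the intended ranks
        for k in range(pos, end + 1):
            ranks[order[k]] = average
        pos = end + 1
    return ranks


# Private, Sgt., Sgt., Lt., Capt., Capt., Capt., Maj., Col., Col., General
levels = [0, 1, 1, 2, 3, 3, 3, 4, 5, 5, 6]
x_rank = average_ranks(levels)
print(x_rank)  # [1.0, 2.5, 2.5, 4.0, 6.0, 6.0, 6.0, 8.0, 9.5, 9.5, 11.0]
```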

Calculating rho: Step 2
• Generally:
$$\rho = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}$$
  where D is the difference between the two ranks of each pair
• Ranges in [-1, 1]
• With no ties, can simply use Pearson's r on the ranks, with identical results
• May be useful (in addition to r) also for data of numerical scores, when we don't trust the scale properties of the scores and rank really matters
  • E.g., correlation between user and system relevance scores for the ranked pages in search results
• "Debugging" note:
  • $\sum D = 0$ is maintained even for averaged ties, as the sum of all ranks (for X and for Y) = n(n+1)/2
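The sketch below applies the usual shortcut formula, rho = 1 - 6 ΣD² / (n(n² - 1)), to two made-up rankings without ties, and checks it against Pearson's r computed on the same ranks. The function names and the user/system rankings are illustrative assumptions.

```python
import math


def spearman_rho(rank_x, rank_y):
    """Shortcut formula on paired ranks (exact only when there are no ties)."""
    n = len(rank_x)
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


def pearson_r(xs, ys):
    """Pearson's r from deviation scores (the n factors cancel out)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


user_rank = [1, 2, 3, 4, 5]    # hypothetical user ranking of five results
system_rank = [2, 1, 4, 3, 5]  # hypothetical system ranking of the same results

rho = spearman_rho(user_rank, system_rank)
r_on_ranks = pearson_r(user_rank, system_rank)
sum_d = sum(u - s for u, s in zip(user_rank, system_rank))

print(round(rho, 4), round(r_on_ranks, 4))  # 0.8 0.8
print(sum_d)  # 0, the "debugging" check on the rank differences
```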
