
Statistical Methods


Presentation Transcript


  1. Statistical Methods Chichang Jou Tamkang University

  2. Chapter Objectives • Explain methods of statistical inference in data mining • Identify different statistical parameters for assessing differences in data sets • Describe the Naïve Bayesian Classifier and the logistic regression method • Introduce log-linear models using correspondence analysis of contingency tables • Discuss ANOVA analysis and linear discriminant analysis of multidimensional samples

  3. Background • Statistics is the science of collecting and organizing data and drawing conclusions from data sets • Descriptive Statistics: • Organization and description of the general characteristics of data sets • Statistical Inference: • Drawing conclusions from data • Main focus of this chapter

  4. 5.1 Statistical Inference • We are interested in arriving at conclusions concerning a population when it is impossible or impractical to observe the entire set of observations that make up the population • Sample in Statistics • Describes a finite data set of n-dimensional vectors • Will be called data set • Biased • Any sampling procedure that produces inferences that consistently overestimate or underestimate some characteristics of the population

  5. 5.1 Statistical Inference • Statistical Inference is the main form of reasoning relevant to data analysis • Statistical Inference methods are categorized as • Estimation: • Goal: make the expected prediction error close to 0 • Regression vs. classification • Tests of hypothesis • Null hypothesis H0: any hypothesis we wish to test • The rejection of H0 leads to the acceptance of an alternative hypothesis

  6. 5.2 Assessing Differences in Data Sets • Mean • Median: better for skewed data • Mode: the value that occurs most frequently • For unimodal frequency curves that are moderately asymmetrical, the following empirical relation is useful: mean − mode ≈ 3 × (mean − median) • Standard deviation σ (variance: σ²)
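The descriptive parameters on this slide can be computed with Python's standard `statistics` module. A small sketch with a made-up, right-skewed sample, showing how the outlier pulls the mean away from the median:

```python
import statistics

data = [2, 3, 3, 4, 5, 6, 7, 8, 30]  # made-up skewed sample with one outlier

mean = statistics.mean(data)      # pulled upward by the outlier (~7.56)
median = statistics.median(data)  # robust to the outlier: 5
mode = statistics.mode(data)      # most frequent value: 3
sigma = statistics.pstdev(data)   # population standard deviation

print(mean, median, mode, sigma)
```

Here mean > median > mode, the typical ordering for a right-skewed distribution; the empirical relation mean − mode ≈ 3 × (mean − median) holds only roughly, since it is intended for moderately asymmetrical curves.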

  7. 5.3 Bayesian Inference • Prior distribution: a given probability distribution for the analyzed data set • Let X be a data sample whose class label is unknown, and hypothesis H: X belongs to a specific class C. Then P(H|X) = [P(X|H) · P(H)] / P(X) • See p. 97 for an example of the Naïve Bayesian Classifier: P(Ci|X) = [P(X|Ci) · P(Ci)] / P(X), where, assuming class-conditional independence of the attributes, P(X|Ci) = ∏k P(xk|Ci) • The Bayesian classifier has the minimum error rate in theory. In practice this is not always true, because of inaccuracies in the assumption of class-conditional independence of the attributes.
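The Naïve Bayesian Classifier described above can be sketched in a few lines. The toy training set below is made up for illustration (it is not the book's p. 97 example); priors and likelihoods are estimated by relative frequency, and P(X) is omitted since it is the same for every class:

```python
from collections import Counter

# Made-up training set: (attribute tuple, class label)
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("sunny", "cool"), "yes"),
]

classes = Counter(label for _, label in train)
n = len(train)

def likelihood(value, k, c):
    # P(x_k = value | C = c), estimated by relative frequency
    match = sum(1 for x, label in train if label == c and x[k] == value)
    return match / classes[c]

def posterior_score(x, c):
    # P(C|X) is proportional to P(C) * prod_k P(x_k|C)
    # (the class-conditional independence assumption)
    score = classes[c] / n
    for k, value in enumerate(x):
        score *= likelihood(value, k, c)
    return score

sample = ("sunny", "mild")
scores = {c: posterior_score(sample, c) for c in classes}
prediction = max(scores, key=scores.get)
```

For this sample, P(no) · P(sunny|no) · P(mild|no) = 0.4 · 1 · 0.5 = 0.2 beats the "yes" score, so the classifier predicts "no".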

  8. 5.4 Predictive Regression • Common reasons for performing regression analysis • The output is expensive to measure • The values of the inputs are known before the output is known, and a working prediction of the output is required • Controlling the input values to predict the behavior of corresponding outputs • To identify the causal link between some of the inputs and the output

  9. Linear Regression • Y = α + β1X1 + β2X2 + … + βnXn • Applied to each sample j: yj = α + β1x1j + β2x2j + … + βnxnj + εj, where εj is the error term • Example with one input variable (p. 99 – p. 100): Y = α + βX • Minimize the sum of squares of errors (SSE) • Differentiate SSE w.r.t. α and β, and set the derivatives to 0 • Solving gives closed-form equations for α and β
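The closed-form solution for the one-input case is standard: β = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², and α = ȳ − βx̄. A minimal sketch on made-up data (the sample values are for illustration only):

```python
# Least-squares fit of Y = alpha + beta * X from the normal equations
# obtained by differentiating SSE w.r.t. alpha and beta and setting to 0.
xs = [1, 2, 3, 4, 5]             # made-up inputs
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up outputs, roughly y = 2x

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

beta = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
       / sum((x - x_mean) ** 2 for x in xs)
alpha = y_mean - beta * x_mean
```

For these numbers the fit comes out to β ≈ 1.99 and α ≈ 0.05, close to the underlying y = 2x trend.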

  10. General Linear Model • For real-world data mining, the number of samples may be several million. Because of the increased computational complexity of linear regression on such data, it is necessary to find modifications/approximations in the regression, or to use entirely different regression methods. • Example: polynomial regression can be modeled by adding polynomial terms to the basic linear model (p. 102) • The major effort of a user is in identifying the relevant independent variables and in selecting the regression model • Sequential search approach • Combinatorial approach

  11. Quality of linear regression • Correlation coefficient r
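The correlation coefficient r measures how well a linear model explains the data: r = ±1 for a perfect linear relationship, r ≈ 0 when there is none. A small self-contained sketch (the sample lists are made up):

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation coefficient: covariance divided by the
    # product of the standard deviations (up to common factors of n).
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - x_mean) ** 2 for x in xs))
    sy = math.sqrt(sum((y - y_mean) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear data
```

For the perfectly linear sample above r = 1; reversing the trend, e.g. pearson_r([1, 2, 3], [3, 2, 1]), gives r = −1.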

  12. 5.5 Analysis of Variance (ANOVA) • ANOVA is a method of identifying which of the β's in a linear regression model are non-zero • Residuals: Ri = yi − f(xi) • The variance is estimated by S² = Σ Ri² / (n − m − 1), where m is the number of inputs • S² allows us to compare different linear models • S² will tend to be significantly larger than σ² only if the fitted model omits inputs that it ought to include

  13. ANOVA algorithm • Start with all inputs and compute S² • Omit inputs from the model one by one (this means forcing the corresponding βi to 0) • If we omit a useful input, the new estimate of S² will increase significantly • If we omit a redundant input, the new estimate of S² will not change much • F-ratio (example on p. 105) • Multivariate analysis: the output is a vector, and correlation between the outputs is allowed (MANOVA)
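The omit-and-compare step can be illustrated with a one-input model: forcing β to 0 reduces the fit to the mean of y, and if x was a useful input the variance estimate S² jumps. A sketch on made-up data:

```python
# Sketch of the ANOVA idea: force beta to 0 and compare the variance
# estimates S^2 = SSE / (n - p) of the full and reduced models.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]  # made-up data, close to y = 2x

def s_squared(residuals, n_params):
    # S^2 = sum of squared residuals over (n - number of fitted parameters)
    return sum(r * r for r in residuals) / (len(residuals) - n_params)

# Full model: y = alpha + beta * x, fitted by least squares
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
beta = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
       / sum((x - x_mean) ** 2 for x in xs)
alpha = y_mean - beta * x_mean
s2_full = s_squared([y - (alpha + beta * x) for x, y in zip(xs, ys)], 2)

# Reduced model: beta forced to 0, so the best fit is just the mean of y
s2_reduced = s_squared([y - y_mean for y in ys], 1)
```

Here x is clearly useful, so s2_reduced comes out orders of magnitude larger than s2_full; for a redundant input the two estimates would stay close, which is exactly the criterion the algorithm above applies.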

  14. 5.6 Logistic Regression • Linear regression is used to model continuous-valued functions. Generalized regression models try to apply linear regression to categorical response variables. • Logistic regression models the probability of a binary (YES/NO) event occurring as a linear function of a set of predictor (input) variables. It tries to estimate the probability p that the dependent (output) variable will have a given value. • If p is greater than 0.5, then the prediction is closer to YES • It supports a more general input data set by allowing both categorical and quantitative inputs

  15. Logistic Regression • P(yj = 1) = pj, P(yj = 0) = 1 − pj • The linear logistic model: logit(pj) = log(pj / (1 − pj)) = α + β1x1j + … + βnxnj • Modeling the logit rather than pj directly prevents pj from going out of the range [0, 1] • Example (p. 107): suppose logit(p) = 1.5 − 0.6x1 + 0.4x2 − 0.3x3 • With (x1, x2, x3) = (1, 0, 1): p = 0.35 • Y = 1 is less probable than Y = 0
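The example can be checked numerically. For (1, 0, 1) the linear part evaluates to 1.5 − 0.6 + 0 − 0.3 = 0.6. One hedged observation: under the conventional definition logit(p) = log(p / (1 − p)), a logit of 0.6 inverts to p ≈ 0.65, while the slide's p = 0.35 equals 1 / (1 + e^0.6), i.e. the probability of the complementary outcome; the book's sign convention may differ. The sketch below computes both values:

```python
import math

def p_from_logit(logit):
    # Inverse of logit(p) = log(p / (1 - p)): the logistic (sigmoid) function
    return 1 / (1 + math.exp(-logit))

# Slide's model, evaluated at the sample (x1, x2, x3) = (1, 0, 1)
logit = 1.5 - 0.6 * 1 + 0.4 * 0 - 0.3 * 1  # = 0.6

p = p_from_logit(logit)        # ~0.65 under the conventional logit
p_complement = 1 - p           # ~0.35, matching the slide's figure
```

Whichever convention the book uses, the sigmoid guarantees 0 < p < 1, which is the point of modeling the logit rather than p itself.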

  16. 5.7 Log-Linear Models • Log-linear modeling is a generalized linear model where the output Yj is assumed to have a Poisson distribution with expected value μj • It is used to analyze the relationship between categorical (or quantitative) variables • It approximates discrete, multi-dimensional probability distributions

  17. Log-Linear Models • log(μj) is assumed to be a linear function of the inputs • We need to find which β's are 0 • If βi is 0, then Xi is not related to the other input variables • Correspondence analysis: log-linear modeling when no output variable is defined • Uses contingency tables to answer the question: is there any relationship between the attributes?

  18. Correspondence Analysis • Transform a given contingency table into a table of expected values, computed under the assumption that the input variables are independent • Compare the two tables using the squared-distance measure and the chi-square test • Examples: p. 108, p. 111
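The two steps above, building the expected table under independence and computing the chi-square statistic, can be sketched directly. The 2×2 counts below are made up (they are not the book's p. 108 or p. 111 tables):

```python
# Expected cell count under independence: (row total * column total) / grand total
observed = [
    [30, 10],   # made-up contingency table: rows = levels of attribute A
    [20, 40],   #                            cols = levels of attribute B
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

expected = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi_square = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(observed))
    for j in range(len(observed[0]))
)
```

Here every expected cell is 20 or 30 and χ² ≈ 16.67, far above the 3.84 critical value for 1 degree of freedom at the 5% level, so the two attributes would be judged related.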

  19. 5.8 Linear Discriminant Analysis • Linear Discriminant Analysis (LDA) is for classification problems where the dependent variable is categorical (nominal or ordinal) and the independent variables are metric • LDA constructs a discriminant function that yields different scores when computed with data from different output classes (Fig. 5.3)

  20. Linear Discriminant Analysis • LDA tries to find a set of weight values wi that maximizes the ratio of the between-class to the within-class variance of the discriminant score for a pre-classified set of samples. The resulting function is then used to predict the class of new samples. • Cutting scores serve as the criteria against which each individual discriminant score is judged. Their choice depends on the distribution of samples in the classes. • Let zA and zB be the mean discriminant scores of pre-classified samples from classes A and B • If the two classes of samples are of equal size and are uniformly distributed: zcut = (zA + zB) / 2 • If the two classes of samples are not of equal size: zcut = (nA · zB + nB · zA) / (nA + nB) • Multiple discriminant analysis (p. 113)
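The scoring side of LDA can be sketched as follows. The weights and samples below are made up for illustration; a real LDA would derive w by maximizing the between-class to within-class variance ratio, which this sketch does not do. It only shows how mean scores and the equal-size cutting score are used:

```python
# Hypothetical discriminant function z = w1*x1 + w2*x2 with fixed weights
w = (0.8, 0.6)

def score(x):
    return w[0] * x[0] + w[1] * x[1]

# Made-up pre-classified samples, two equal-sized classes
class_a = [(1.0, 1.2), (1.2, 0.8), (0.9, 1.0)]
class_b = [(3.0, 2.8), (2.7, 3.1), (3.2, 3.0)]

# Mean discriminant scores zA and zB of the pre-classified samples
z_a = sum(score(x) for x in class_a) / len(class_a)
z_b = sum(score(x) for x in class_b) / len(class_b)

# Equal-sized, uniformly distributed classes: cut halfway between the means
z_cut = (z_a + z_b) / 2

def classify(x):
    return "A" if score(x) < z_cut else "B"
```

A new sample is assigned to whichever side of the cutting score its discriminant score falls on, e.g. classify((1.0, 1.0)) lands with class A and classify((3.0, 3.0)) with class B.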
