Chapter 10: Statistical Techniques
10.1 Linear Regression Analysis
Equation 10.1 (general linear regression): f(x1, x2, …, xn) = a1x1 + a2x2 + … + anxn + c
10.1 Linear Regression Analysis
• A supervised technique that generalizes a set of numeric data by creating a mathematical equation relating one or more input variables to a single output variable.
• With linear regression we attempt to model the variation in a dependent variable as a linear combination of one or more independent variables.
• Linear regression is appropriate when the relationship between the dependent and the independent variables is nearly linear.
Simple Linear Regression (slope-intercept form)
Equation 10.2: y = ax + c
Simple Linear Regression (least squares criterion)
Equation 10.3: choose a and c to minimize the sum of squared differences Σ (yi − (axi + c))²
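The least squares criterion above can be solved in closed form. The sketch below fits y = ax + c with NumPy; the x/y values are made up for illustration, not from the chapter's dataset.

```python
# Minimal least-squares fit of y = ax + c (illustrative data only).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y = 2x

# Closed-form least squares: the a and c that minimize
# sum((y_i - (a*x_i + c))**2).
a = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
c = y.mean() - a * x.mean()

print(round(a, 3), round(c, 3))   # slope and intercept
```

Plotting the fitted line over a scatterplot of the data (as the next slide suggests) is the quickest visual check of the fit.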
10.1 Linear Regression Analysis
• How accurate are the results? Use a scatterplot diagram together with the line given by the regression formula.
• Which independent variables are linearly related to the dependent variable? Use the statistics below.
• Coefficient of determination: a value of 1 means no difference between the actual (tabled) and computed values of the dependent variable. (It represents the correlation between actual and computed values.)
• Standard error for the estimate of the dependent variable.
F statistic for the regression analysis
• Used to establish whether the coefficient of determination is significant.
• Look up the F critical value (4.59) from one-tailed F tables in statistics books, using v1 = the number of independent variables (4) and v2 = the number of instances minus the number of variables (11 − 5 = 6).
• The regression equation is able to correctly determine assessed values of office buildings that are part of the training data.
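The F statistic for a regression can be computed from the coefficient of determination and the two degrees of freedom. A minimal sketch, assuming a hypothetical r² of 0.90; v1 and v2 follow the slide's example (4 independent variables, 11 instances, 5 variables in total):

```python
# F statistic for testing whether the coefficient of determination is
# significant: F = (r^2 / v1) / ((1 - r^2) / v2).
r_squared = 0.90          # hypothetical value, for illustration only
v1 = 4                    # number of independent variables
v2 = 11 - 5               # instances minus total number of variables

f_stat = (r_squared / v1) / ((1 - r_squared) / v2)
print(round(f_stat, 2))   # compare against the tabled one-tailed F critical value
```

If the computed F exceeds the tabled critical value for (v1, v2), the coefficient of determination is judged significant.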
Regression Trees
• Essentially a decision tree whose leaf nodes hold numeric values.
• The value at an individual leaf node is the numeric average of the output attribute for all instances passing through the tree to that leaf node position.
• Regression trees are more accurate than linear regression when the data is nonlinear, but they are more difficult to interpret.
• Sometimes regression trees are combined with linear regression to form model trees.
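The leaf-node averaging described above can be sketched with a one-split tree; the data and the split threshold below are made up for illustration.

```python
# Minimal regression-tree sketch: one split on a single numeric input,
# each leaf storing the average output of the training instances it receives.
train = [(1.0, 10.0), (2.0, 12.0), (3.0, 11.0),   # (input, output)
         (7.0, 30.0), (8.0, 34.0), (9.0, 32.0)]

threshold = 5.0                                   # split: input < 5 vs >= 5
left  = [y for x, y in train if x < threshold]
right = [y for x, y in train if x >= threshold]

def predict(x):
    """Route the instance down the one-split tree to a leaf average."""
    leaf = left if x < threshold else right
    return sum(leaf) / len(leaf)

print(predict(2.5), predict(7.5))
```

A real regression tree chooses splits automatically (e.g., to minimize the variance of the outputs within each leaf); this sketch fixes the split to show only the leaf-averaging idea.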
Model Trees
• Regression tree + linear regression.
• Each leaf node represents a linear regression equation instead of an average value.
• Model trees simplify regression trees by reducing the number of nodes in the tree.
• A more complex tree means a less linear relationship between the dependent and independent variables.
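Replacing each leaf average with a fitted line gives the model tree described above. The training data and split threshold in this sketch are illustrative only.

```python
# Model-tree sketch: same one-split tree as a regression tree, but each
# leaf holds a fitted line y = a*x + c instead of an average.
def fit_line(points):
    """Least-squares fit of y = a*x + c to a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = (sum((x - mx) * (y - my) for x, y in points)
         / sum((x - mx) ** 2 for x, _ in points))
    return a, my - a * mx

train = [(1.0, 2.0), (2.0, 4.1), (3.0, 6.0),      # roughly y = 2x
         (7.0, 1.0), (8.0, 0.4), (9.0, -0.1)]     # a different trend

threshold = 5.0
left_model  = fit_line([p for p in train if p[0] < threshold])
right_model = fit_line([p for p in train if p[0] >= threshold])

def predict(x):
    a, c = left_model if x < threshold else right_model
    return a * x + c

print(round(predict(2.0), 2))
```

Each leaf now captures a local linear trend, which is why model trees usually need fewer nodes than plain regression trees on piecewise-linear data.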
Figure 10.3 A model tree for the deer hunter dataset (output attribute yes)
Logistic Regression
• Using linear regression to model problems whose observed outcome is restricted to two values (e.g., yes/no) is seriously flawed: the value restriction placed on the output variable is not observed by the regression equation, since linear regression produces a straight line unbounded on both ends.
• Therefore the linear equation must be transformed to restrict the output to [0, 1]. The regression equation can then be thought of as producing a probability of occurrence or nonoccurrence of a measured event.
• Logistic regression applies a logarithmic (log-odds) transformation.
Transforming the Linear Regression Model
Logistic regression is a nonlinear regression technique that associates a conditional probability with each data instance. y = 1 denotes observation of one class (yes); y = 0 denotes observation of the other class (no). Thus p(y = 1 | x) is the conditional probability of seeing the class associated with y = 1 (yes), given the values in the feature vector x.
The Logistic Regression Model
Determine the coefficients a and c in ax + c using an iterative method that minimizes the sum of the negative logarithms of the predicted probabilities (equivalently, maximizes the likelihood of the training data). Convergence occurs when this logarithmic summation is close to 0 or when it no longer changes from iteration to iteration.
Equation 10.7: p(y = 1 | x) = e^(ax + c) / (1 + e^(ax + c))
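The iterative fitting described above can be sketched with plain gradient descent on the negative log-likelihood. The one-feature dataset below is made up, and the book's iterative method may differ in detail; this only illustrates the idea.

```python
# Logistic-regression sketch: fit a and c in p(y=1|x) = 1/(1 + e^-(a*x + c))
# by gradient descent on the negative log-likelihood (illustrative data).
import math

data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (5.0, 1), (6.0, 1)]

a, c = 0.0, 0.0
lr = 0.1                              # learning rate
for _ in range(5000):
    grad_a = grad_c = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(a * x + c)))
        grad_a += (p - y) * x         # d(neg log-likelihood)/da
        grad_c += (p - y)             # d(neg log-likelihood)/dc
    a -= lr * grad_a
    c -= lr * grad_c

p_low  = 1.0 / (1.0 + math.exp(-(a * 1.0 + c)))   # should be near 0
p_high = 1.0 / (1.0 + math.exp(-(a * 6.0 + c)))   # should be near 1
print(round(p_low, 2), round(p_high, 2))
```

Notice that the fitted model pushes the predicted probabilities toward 0 for the y = 0 instances and toward 1 for the y = 1 instances, which is exactly what minimizing the sum of negative log-probabilities rewards.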
Logistic Regression: An Example
Credit card example: the CreditCardPromotionNet file. Life insurance promotion is the output attribute; credit card insurance and sex are the most influential input attributes.
Logistic Regression
Classify a new instance using logistic regression:
• Income = 35K
• Credit card insurance = 1
• Sex = 0
• Age = 39
• P(y = 1 | x) = 0.999
10.3 Bayes Classifier
• A supervised classification technique with a categorical output attribute.
• Assumes all input variables are independent and of equal importance.
• P(H | E): likelihood of H (the dependent variable representing a predicted class) given evidence E.
• P(E | H): conditional probability of evidence E given that H is true (computed from the training data).
• P(H): a priori probability, denoting the probability of H before the presentation of evidence E (computed from the training data).
Equation 10.9: P(H | E) = P(E | H) · P(H) / P(E)
Bayes Classifier: An Example
Credit card promotion data set; sex is the output attribute.
The Instance to be Classified
Magazine Promotion = Yes
Watch Promotion = Yes
Life Insurance Promotion = No
Credit Card Insurance = No
Sex = ?
Two hypotheses: sex = female, sex = male.
Computing the Probability for Sex = Male
Equation 10.10: P(sex = male | E) = P(E | sex = male) · P(sex = male) / P(E)
Conditional Probabilities for Sex = Male P(magazine promotion = yes | sex = male) = 4/6 P(watch promotion = yes | sex = male) = 2/6 P(life insurance promotion = no | sex = male) = 4/6 P(credit card insurance = no | sex = male) = 4/6 P(E | sex =male) = (4/6) (2/6) (4/6) (4/6) = 8/81
The Probability for Sex = Male Given Evidence E
P(sex = male | E) = 0.0593 / P(E)
The Probability for Sex = Female Given Evidence E
P(sex = female | E) = 0.0281 / P(E)
P(sex = male | E) > P(sex = female | E)
The instance is most likely a male credit card customer.
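The male side of the computation above can be reproduced directly from the slides' numbers. The conditional probabilities are as stated; the prior P(sex = male) = 6/10 is an assumption inferred from the /6 denominators in a 10-instance dataset.

```python
# Naive Bayes numerator for sex = male, using the slides' probabilities.
cond_male = [4/6, 2/6, 4/6, 4/6]      # P(each evidence item | sex = male)
prior_male = 6/10                     # assumed a priori probability

p_e_given_male = 1.0
for p in cond_male:
    p_e_given_male *= p               # naive independence assumption

numerator = p_e_given_male * prior_male   # = P(sex = male | E) * P(E)
print(round(numerator, 4))                # matches the slide's 0.0593
```

P(E) cancels when comparing the two hypotheses, so the larger numerator (male, 0.0593 vs. female, 0.0281) decides the classification.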
Zero-Valued Attribute Counts
A problem with the Bayes classifier arises when one of the counts is 0. To solve this, add a small constant to the numerator and denominator: each ratio n/d becomes (n + kp)/(d + k), where p is the equal prior probability of the attribute value (p = 0.5 for an attribute with two possible values) and k is a chosen constant (k = 1 below).
Example:
P(E | sex = male) = (3/4)(2/4)(1/4)(3/4) = 9/128
P(E | sex = male) = (3.5/5)(2.5/5)(1.5/5)(3.5/5) ≈ 0.0735
Equation 10.12
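The (n + kp)/(d + k) adjustment is a one-liner; the sketch below reproduces the smoothed ratios from the example, assuming k = 1 and p = 0.5 for a two-valued attribute.

```python
# Zero-count smoothing: replace each ratio n/d with (n + k*p)/(d + k).
def smoothed(n, d, k=1.0, p=0.5):
    """p = equal prior for the attribute value, k = smoothing constant."""
    return (n + k * p) / (d + k)

raw = [(3, 4), (2, 4), (1, 4), (3, 4)]        # the example's raw counts
probs = [smoothed(n, d) for n, d in raw]
print([round(q, 2) for q in probs])           # 3.5/5, 2.5/5, 1.5/5, 3.5/5
```

Because every smoothed ratio is strictly positive, a single zero count can no longer force the whole product P(E | H) to zero.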
Missing Data
With the Bayes classifier, missing data items are simply ignored.
Missing Data • Example
Numeric Data
Probability density function (attribute values are assumed to be normally distributed):
f(x) = (1 / (sqrt(2π) s)) · e^(−(x − m)² / (2s²))
where
e = the exponential function
m = the class mean for the given numeric attribute
s = the class standard deviation for the attribute
x = the attribute value
Equation 10.13
Numeric Data
• Magazine promotion = yes
• Watch promotion = yes
• Life insurance promotion = no
• Credit card insurance = no
• Age = 45
• Sex = ?
• P(E | sex = male) = … × P(age = 45 | sex = male)
• s = 7.69, m = 37, x = 45
• P(age = 45 | sex = male) ≈ 0.03
• P(sex = male | E) = 0.0018 / P(E)
• P(sex = female | E) = 0.0016 / P(E)
• The instance is classified as male.
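The density value used above follows directly from the normal probability density function, with the class mean 37, standard deviation 7.69, and attribute value 45 from the example.

```python
# Normal probability density used for numeric attributes in the Bayes
# classifier: f(x) = e^(-(x - m)^2 / (2 s^2)) / (sqrt(2*pi) * s).
import math

def normal_density(x, mean, std):
    return (math.exp(-((x - mean) ** 2) / (2 * std ** 2))
            / (math.sqrt(2 * math.pi) * std))

p = normal_density(45, 37, 7.69)   # example's values: m = 37, s = 7.69, x = 45
print(round(p, 2))                 # ~0.03, matching the slide
```

Note that this is a density, not a probability; it is used in place of the conditional probability term in the naive Bayes product.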
Agglomerative Clustering
1. Place each instance into a separate partition.
2. Until all instances are part of a single cluster:
   a. Determine the two most similar clusters.
   b. Merge the clusters chosen into a single cluster.
3. Choose a clustering formed by one of the step 2 iterations as the final result.
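The three steps above can be sketched in a few lines. This sketch assumes single-link Euclidean distance on 1-D points as the similarity measure; the data is made up for illustration.

```python
# Agglomerative clustering: start with one cluster per instance, then
# repeatedly merge the two closest clusters until one cluster remains.
def dist(c1, c2):
    """Single-link distance: closest pair of points across two clusters."""
    return min(abs(a - b) for a in c1 for b in c2)

points = [1.0, 1.5, 5.0, 5.2, 9.0]
clusters = [[p] for p in points]               # step 1: one cluster each
history = [list(map(list, clusters))]

while len(clusters) > 1:                       # step 2: merge until one cluster
    i, j = min(((i, j) for i in range(len(clusters))
                       for j in range(i + 1, len(clusters))),
               key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]
    history.append(list(map(list, clusters)))

# Step 3 chooses one clustering from history via a heuristic; here we
# just display the 3-cluster level of the hierarchy.
three = next(cs for cs in history if len(cs) == 3)
print(sorted(sorted(c) for c in three))
```

Keeping the whole merge history is what makes step 3 possible: any intermediate level of the hierarchy can be selected as the final clustering.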
Agglomerative Clustering
The final step of the algorithm is to choose the final clustering from among all candidates (this requires heuristics).
One technique: using the similarity measure that created the clusters, compare the average within-cluster similarity with the overall similarity of all instances in the dataset (the domain similarity).
This technique is best used to eliminate clusterings rather than to choose a final result.
Agglomerative Clustering
A second technique: compare each cluster's within-cluster similarity with the within-cluster similarities of pairwise-combined clusters in the cluster set, looking for the highest similarity.
This technique, too, is best used to eliminate clusterings rather than to choose a final result.
Agglomerative Clustering
A third technique: use the previous two techniques to eliminate some of the clusterings, feed each remaining clustering to a rule generator, and choose the clustering with the best defining rules.
A fourth technique: the Bayesian Information Criterion.