Special Topics in Learning. Reinaldo Bianchi, Centro Universitário da FEI, 2012. Lecture 2, Part B.
Objectives of this lecture • Present the concepts of Statistical Machine Learning • Continuation of Regression. • Validation and Selection Methods. • Today's lecture: • Chapters 3 and 7 of Hastie. • Wikipedia and Matlab Help
Validation and Selection Methods How can we be sure that the chosen method is good? Chapter 7 of Hastie and Wikipedia
A Discussion about Linear Regression • LMS can be used to determine the least squares regression equation. • The least squares regression method provides the equation that gives the best linear relationship, in the squared-error sense, between the dependent and independent variables.
A Discussion about Linear Regression • Sometimes, however, the “best” relationship is not sufficient or reliable enough for estimation. • If you are estimating inventory or other major business decisions, it is very costly to be inaccurate. • Statistics gives various measures for determining whether a regression line is “good” or reliable.
Validation and Selection Models: Why? • “The generalisation performance of a learning method relates to its prediction capability on independent test data.” • “Assessment of this performance is extremely important in practice, since it guides the choice of learning model, and gives us a measure of the quality of the chosen model.” (Hastie et al.)
We might have in mind 2 goals… • Model Selection: • Estimating the performance of different models in order to choose the (approximate) best one. • Model Assessment: • Having chosen a final model, estimating its prediction error (generalisation error) on new data.
In a data-rich (or paradise) situation… • The best approach would be to randomly divide the dataset into three parts: a training set, a validation set, and a test set. • (1) Train: fit the model (~50%) • (2) Validation: estimate prediction error (model selection, ~25%) • (3) Test: assess the final model (generalization error, ~25%)
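The three-way split above could be sketched in Matlab roughly as follows. This is only an illustration: the dataset size `n` and the index variable names are assumptions, not part of the lecture.

```matlab
% Sketch: random 50/25/25 split of n examples into train/validation/test.
n = 100;                           % hypothetical number of examples
idx = randperm(n);                 % random permutation of 1..n

nTrain = round(0.50 * n);          % ~50% to fit the model
nVal   = round(0.25 * n);          % ~25% for model selection

trainIdx = idx(1:nTrain);
valIdx   = idx(nTrain+1 : nTrain+nVal);
testIdx  = idx(nTrain+nVal+1 : end);   % remaining ~25% for final assessment
```

The index vectors can then be used to select rows of the data matrix, so each example lands in exactly one of the three sets.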
Measures for Evaluating a Regression Line • Statistics provides several measures for determining how reliable a line is for estimation: • The Coefficient of Determination, r². • The Correlation Coefficient, r. • Variance, standard deviation and Z score. • Hypothesis Tests for a Significant Relationship: • t Test and F Test for the Simple Linear Regression Model.
The Coefficient of Determination, r² • The coefficient of determination gives a value between 0 and 1. • r² provides the proportion of the total variation in y explained by the simple linear regression model. • The closer this value is to 1, the more reliable the regression line is for estimating y.
SST, SSR and SSE • SST = total sum of squares. • SSR = sum of squares due to regression. • SSE = sum of squares due to error. • Relationship Among SST, SSR, SSE: SST = SSR + SSE
The Coefficient of Determination • The coefficient of determination is: r² = SSR/SST where: • SST = total sum of squares • SSR = sum of squares due to regression
Relationship Among SST, SSR, SSE [Figure: diagram marking an observed value, the estimated value, and the mean, with SSE, SSR, and SST labelling the gaps between them] where: • SST = total sum of squares • SSR = sum of squares due to regression • SSE = sum of squares due to error
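For reference, the three sums of squares can be written out as follows. The notation is an assumption chosen for this note: $y_i$ is an observed value, $\hat{y}_i$ the value estimated by the regression line, and $\bar{y}$ the mean of the observed values.

```latex
\begin{align*}
\mathrm{SST} &= \sum_{i=1}^{n} (y_i - \bar{y})^2 \\
\mathrm{SSR} &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \\
\mathrm{SSE} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
\mathrm{SST} &= \mathrm{SSR} + \mathrm{SSE}
\end{align*}
```

The last identity holds for a least squares fit that includes an intercept term.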
Example: Reed Auto sales Reed Auto periodically has a special week-long sale. As part of the advertising campaign Reed runs one or more television commercials during the weekend preceding the sale. Data from a sample of 5 previous sales are shown on the next slide.
Number of cars sold x TV advertisements
Number of TV Ads    Number of Cars Sold
1                   14
3                   24
2                   18
1                   17
3                   27
Number of cars sold x TV advertisements • x = [1,3,2,1,3]' • y = [14,24,18,17,27]' • one = ones(5,1) • X = [one, x] • v = (X'*X)\(X'*y) • Result: v = [10; 5], i.e. the fitted line is ŷ = 10 + 5x
Example: Reed Auto Sales • Coefficient of Determination r² = SSR/SST = 100/114 = 0.877 • The regression relationship is very strong because about 88% of the variation in the number of cars sold can be explained by the linear relationship between the number of TV ads and the number of cars sold.
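The sums of squares behind this value can be checked with a short Matlab sketch on the Reed Auto data. The variable names `yhat`, `SST`, `SSR`, `SSE`, and `r2` are chosen here for illustration.

```matlab
% Sketch: coefficient of determination for the Reed Auto data.
x = [1,3,2,1,3]';                  % number of TV ads
y = [14,24,18,17,27]';             % number of cars sold
X = [ones(5,1), x];                % design matrix with intercept column
v = (X'*X)\(X'*y);                 % least squares coefficients [intercept; slope]

yhat = X*v;                        % estimated values on the regression line
SST = sum((y - mean(y)).^2);       % total sum of squares
SSR = sum((yhat - mean(y)).^2);    % sum of squares due to regression
SSE = sum((y - yhat).^2);          % sum of squares due to error
r2  = SSR/SST;                     % coefficient of determination
```

For this data the sketch gives SST = 114, SSR = 100, and SSE = 14, so SST = SSR + SSE holds and r2 is about 0.877.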
Height x Foot size [Figure: two fits with different weights wi. With wi = ones(10,1), r² = 0.6133; with selective wi, r² = 0.8196.]
The Correlation Coefficient, r • The correlation coefficient gives a value between -1 and +1. • The closer r is to -1, the stronger the negative linear relationship between the independent and dependent variables. • The closer r is to +1, the stronger the positive linear relationship between the independent and dependent variables. • The closer r is to 0, the weaker the linear relationship between the independent and dependent variables.
The Correlation Coefficient, r • Sample Correlation Coefficient: $r_{xy} = (\text{sign of } b_1)\sqrt{r^2}$ where: • b1 = the slope of the estimated regression equation • r² = the coefficient of determination
Example: Reed Auto Sales • Sample Correlation Coefficient • The sign of b1 in the equation ŷ = 10 + 5x is “+”. • $r_{xy} = +\sqrt{0.8772} = +0.9366$
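As a cross-check, the sample correlation coefficient for the Reed Auto data can be computed in two ways: from the sign of the slope and r², and directly from the deviations of x and y. This is a sketch; the names `rFromR2` and `rDirect` are illustrative.

```matlab
% Sketch: sample correlation coefficient for the Reed Auto data, two ways.
x = [1,3,2,1,3]';
y = [14,24,18,17,27]';
X = [ones(5,1), x];
v = (X'*X)\(X'*y);                 % v(2) is the slope b1

yhat = X*v;
r2 = sum((yhat - mean(y)).^2) / sum((y - mean(y)).^2);

% (a) sign of the slope times the square root of r^2
rFromR2 = sign(v(2)) * sqrt(r2);

% (b) direct definition from centered x and y
dx = x - mean(x);
dy = y - mean(y);
rDirect = sum(dx.*dy) / sqrt(sum(dx.^2) * sum(dy.^2));
```

Both values come out at about 0.9366 for this data.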
Example: Prostate Cancer • Study by Stamey et al. (1989) that examined the correlation between the level of prostate specific antigen (PSA) and a number of clinical measures. • The goal is to predict the log of PSA (lpsa) from a number of measurements.
Example: Prostate Cancer [Table not recovered: results marked as Significant or Not Significant]
Variance and Standard Deviation • Variance, σ²: • is a measure of the amount of variation within the values of a variable, taking account of all possible values and their probabilities or weightings. • Standard deviation, σ: • is the square root of the variance. • Widely used measure of variability.
An Estimate of the Variance • The mean square error (MSE) provides the estimate of σ², called s²: s² = MSE = SSE/(n − (p + 1)) where: • n = number of examples, p = dimensions • n − (p + 1) is the degrees of freedom. The estimator s² is called the sample variance, since it is the variance of the sample (x1, …, xn).
An Estimate of the Variance • The mean square error (MSE) provides the estimate of σ², called s²: s² = MSE = SSE/(n − (p + 1)) • where: • n = number of examples • p = dimensions
Degrees of Freedom • In statistics, the degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. • For the variance, we use n − (p + 1), where: • n = number of examples • p = dimension • In linear regression, p + 1 is the number of parameters to fit.
Estimate of Standard Deviation • To estimate σ we take the square root of s². • The resulting s is called the standard error of the estimate.
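Applied to the Reed Auto example (n = 5 examples, p = 1 dimension), the estimate could be computed as in the sketch below. The variable names `s2` and `s` follow the slides; the rest are illustrative.

```matlab
% Sketch: MSE and standard error of the estimate for the Reed Auto data.
x = [1,3,2,1,3]';
y = [14,24,18,17,27]';
n = numel(y);                      % number of examples
p = 1;                             % dimensions (one predictor)
X = [ones(n,1), x];
v = (X'*X)\(X'*y);

SSE = sum((y - X*v).^2);           % sum of squares due to error
s2  = SSE / (n - (p + 1));         % MSE, the estimate of the variance
s   = sqrt(s2);                    % standard error of the estimate
```

With SSE = 14 and 3 degrees of freedom, s2 is about 4.667 and s is about 2.160.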
Z Score • In statistics, a standard score indicates how many standard deviations an observation or datum is above or below the mean. • The standard deviation is the unit of measurement of the z-score. • The letter "Z" is used because the normal distribution is also known as the "Z distribution".
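Z Score • To test the hypothesis that a particular coefficient βj = 0, we form the standardized coefficient or Z-score: $z_j = \dfrac{\hat{\beta}_j}{s\sqrt{v_j}}$ where vj is the jth diagonal element of $(X^T X)^{-1}$ and s is the standard error of the estimate.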
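A minimal sketch of this Z-score computation for the two coefficients of the Reed Auto fit is shown below. It assumes s is the standard error of the estimate, sqrt(SSE/(n − (p + 1))); the names `vdiag` and `z` are illustrative.

```matlab
% Sketch: Z-scores of the regression coefficients for the Reed Auto data.
x = [1,3,2,1,3]';
y = [14,24,18,17,27]';
n = numel(y);
p = 1;
X = [ones(n,1), x];
v = (X'*X)\(X'*y);                 % coefficients [intercept; slope]

s = sqrt(sum((y - X*v).^2) / (n - (p + 1)));   % standard error of the estimate
vdiag = diag(inv(X'*X));           % diagonal elements of (X'X)^-1
z = v ./ (s * sqrt(vdiag));        % one Z-score per coefficient
```

For this data z is roughly [4.23; 4.63], the first entry for the intercept and the second for the slope.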
Z Score x Normal curve [Figure: z-scores shown against the normal curve] http://en.wikipedia.org/wiki/Standard_score
Testing for Significance • To test for a significant regression relationship, we must conduct a hypothesis test to determine whether the value of β1 is zero. • Two tests are commonly used: • t Test • F Test • Both require an estimate of σ², the variance of ε in the regression model.
Testing for Significance: Student t-Test • A t-test is a statistical hypothesis test in which the test statistic follows a Student's t distribution if the null hypothesis is supported. • The null hypothesis H0 proposes a general or default position, such as that there is no relationship between two measured phenomena, or that a potential treatment has no effect.
Testing for Significance: Student t-Test • The t-test assesses whether the means of two groups are statistically different from each other. • If they are, a second hypothesis is valid: • The alternative hypothesis Ha, which asserts a particular relationship between the phenomena.
History • The t-test was introduced in 1908 by William Sealy Gosset, a chemist working for the Guinness brewery ("Student" was his pen name). • Gosset had been hired due to Claude Guinness's innovative policy of recruiting the best graduates from Oxford and Cambridge to apply biochemistry and statistics to Guinness' industrial processes.
Computing t • Usually, t = Z/s where: • Z is designed to be sensitive to the alternative hypothesis: • Its magnitude tends to be larger when the alternative hypothesis is true. • s is a scaling parameter that allows the distribution of t to be determined.
Defining the result • Once a t value is determined, a p-value can be found using a table of values from Student's t-distribution. • If the p-value is below the threshold chosen for statistical significance, then the null hypothesis is rejected in favor of the alternative hypothesis. • If t > p, where p here denotes the critical value read from the table, the t-value is large enough to be significant.
Statistical significance • A result is statistically significant if it is unlikely to have occurred by chance. • Usual thresholds for significance: • 0.95: corresponding to a 5% chance of rejecting the null hypothesis when it is true. • 0.99: corresponding to a 1% chance of rejecting the null hypothesis when it is true. • 0.995: corresponding to a 0.5% chance of rejecting the null hypothesis when it is true.
Student's t-distribution table [Table: critical values indexed by statistical significance and by degrees of freedom, N − (p + 1)]
Example: Independent one-sample t-test • In testing the null hypothesis that the population mean is equal to a specified value μ0, one uses the statistic: $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$ where: • $\bar{x}$ is the mean value of the xi. • μ0 is the value to test. • s is the sample standard deviation. • n is the sample size.
Example: Independent one-sample t-test • Suppose that a teacher claims that an average student of his school studies 8 hours per day. • We wish to test the truth of this claim. • In this case: • H0: μ = 8, which essentially states that the mean hours of study per day is no different from 8 hours. • Ha: μ ≠ 8, which is the negation of the claim.
Example: Independent one-sample t-test • Data from 10 students: • X = [5,6,7,3,5,9,9,2,3,10] • Computing t: • Mean = 5.90 • Sample standard deviation = 2.807 • t = -2.3660 • Using Matlab: • t = (mean(X)-8)/(std(X)/sqrt(10))
Example • How to verify which hypothesis is true? • Find a p-value using a table of values from Student's t-distribution, for a desired level of significance. • The degrees of freedom used in this test is n − 1 = 9 • The desired level of significance is 95% • p95 = 2.262
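The comparison against the table value could be sketched as follows. The table value 2.262 is taken from the slide; comparing the absolute value of t is an assumption made here because Ha: μ ≠ 8 is two-sided, and the names `tcrit` and `rejectH0` are illustrative.

```matlab
% Sketch: one-sample t-test decision for the study-hours data.
X = [5,6,7,3,5,9,9,2,3,10];        % hours of study per day, 10 students
mu0 = 8;                           % value claimed by the teacher (H0: mu = 8)
n = numel(X);

t = (mean(X) - mu0) / (std(X)/sqrt(n));   % test statistic, n - 1 = 9 degrees of freedom

tcrit = 2.262;                     % table value for 9 degrees of freedom at 95%
rejectH0 = abs(t) > tcrit;         % two-sided comparison (assumption, since Ha: mu ~= 8)
```

Here t is about -2.366, whose magnitude exceeds 2.262, so under these assumptions `rejectH0` is true and the claim of 8 hours per day is rejected at the 95% level.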