
Lab exercises: working with real datasets, plotting, more regression, kNN and K-means…

This lab exercise covers working with real datasets, plotting, regression, kNN, and K-means clustering using R. It includes various scripts for analyzing and visualizing data, as well as working with different datasets.


Presentation Transcript


  1. Lab exercises: working with real datasets, plotting, more regression, kNN and K-means…
     Peter Fox
     Data Analytics – ITWS-4600/ITWS-6600
     Week 5b, February 26, 2016

  2. Plot tools/tips
     http://statmethods.net/advgraphs/layout.html
     http://flowingdata.com/2014/02/27/how-to-read-histograms-and-use-them-in-r/
     pairs, gpairs, scatterplot.matrix, clustergram, etc.
     data() # precip, presidents, iris, swiss, sunspot.month (!), environmental, ethanol, ionosphere
     More script fragments in R will be available on the web site (http://aquarius.tw.rpi.edu/html/DA)

  3. Scripts
     Lab5b_pairs1_2016.R
     Lab5b_splom_2016.R
     Lab5b_gpairs1_2016.R
     Lab5b_mosaic_2016.R
     Lab5b_spm_2016.R
     Lab5b_wknn_2016.R
     Lab5b_kknn1_2016.R
     Lab5b_kknn2_2016.R
     Lab5b_kknn3_2016.R
     Lab5b_kmeans1_2016.R
     Lab5b_ctree2_2016.R
     Lab5b_nyt_2016.R
     Lab5b_bronx1_2016.R
     Lab5b_bronx2_2016.R
     Lab5b_nbayes1_2016.R
     Lab5b_nbayes2_2016.R
     Lab5b_nbayes3_2016.R

  4. K Nearest Neighbors (classification)
     Script – Lab5b_nyt_2016.R
     > nyt1<-read.csv("nyt1.csv")
     … from week 3b slides or script
     > classif<-knn(train,test,cg,k=5) #
     > head(true.labels)
     [1] 1 0 0 1 1 0
     > head(classif)
     [1] 1 1 1 1 0 0
     Levels: 0 1
     > ncorrect<-true.labels==classif
     > table(ncorrect)["TRUE"] # or
     > length(which(ncorrect))
     > What do you conclude?
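The head() output on this slide covers only the first six test rows (three of the six agree), so a conclusion needs the full accuracy and, ideally, a confusion table. The following is a minimal sketch of that scoring step, not the lab script itself: it assumes the built-in iris data as a stand-in for nyt1.csv, and the split size and seed are arbitrary choices.

```r
library(class)  # provides knn(); ships with standard R distributions

set.seed(42)                          # reproducible train/test split (arbitrary seed)
idx   <- sample(nrow(iris), 100)      # 100 rows for training, the other 50 for testing
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
cg    <- iris$Species[idx]            # training class labels
true.labels <- iris$Species[-idx]     # held-out labels used only for scoring

classif  <- knn(train, test, cg, k = 5)
ncorrect <- true.labels == classif

accuracy  <- mean(ncorrect)           # fraction of correct predictions
confusion <- table(predicted = classif, actual = true.labels)
print(accuracy)
print(confusion)
```

Comparing the accuracy with the share of the most common class in true.labels shows whether the classifier does better than always guessing the majority label.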

  5. Bronx 1 = Regression
     > plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))
     > m1<-lm(log(SALE.PRICE)~log(GROSS.SQUARE.FEET),data=bronx)
     You were reminded that log(0) is … not fun.
     THINK through what you are doing…
     Filtering is somewhat inevitable:
     > bronx<-bronx[which(bronx$GROSS.SQUARE.FEET>0 & bronx$LAND.SQUARE.FEET>0 & bronx$SALE.PRICE>0),]
     > m1<-lm(log(SALE.PRICE)~log(GROSS.SQUARE.FEET),data=bronx)
     • Lab5b_bronx1_2016.R
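To see why the filter is needed, a small sketch on made-up data may help. The data frame toy below is hypothetical (it only borrows the Bronx column names), and the values are invented for illustration.

```r
# Hypothetical toy data standing in for the Bronx sales table
toy <- data.frame(
  GROSS.SQUARE.FEET = c(1200, 0, 2500, 1800, 3000),
  LAND.SQUARE.FEET  = c(2000, 1500, 0, 2500, 4000),
  SALE.PRICE        = c(350000, 0, 420000, 390000, 610000)
)

log(0)  # -Inf; lm() stops with an error when the response or a predictor is infinite

# Keep only rows where every variable that will be logged is strictly positive.
# which() also drops rows where the condition evaluates to NA.
keep  <- with(toy, GROSS.SQUARE.FEET > 0 & LAND.SQUARE.FEET > 0 & SALE.PRICE > 0)
clean <- toy[which(keep), ]

m <- lm(log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = clean)
nrow(toy) - nrow(clean)  # number of rows dropped by the filter
```

It is worth recording how many rows the filter removes, since a large share of zero-valued records says something about the dataset itself.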

  6. Interpreting this!
     Call:
     lm(formula = log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = bronx)

     Residuals:
          Min       1Q   Median       3Q      Max
     -14.4529   0.0377   0.4160   0.6572   3.8159

     Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
     (Intercept)              7.0271     0.3088   22.75   <2e-16 ***
     log(GROSS.SQUARE.FEET)   0.7013     0.0379   18.50   <2e-16 ***
     ---
     Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

     Residual standard error: 1.95 on 2435 degrees of freedom
     Multiple R-squared: 0.1233, Adjusted R-squared: 0.1229
     F-statistic: 342.4 on 1 and 2435 DF, p-value: < 2.2e-16
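One way to read the coefficients: in a log-log model the slope acts as an elasticity. The sketch below only does arithmetic on the numbers printed in the summary. The 2,000 sq ft property is a hypothetical example, and the back-transformed prediction ignores retransformation bias.

```r
# Values copied from the summary output on this slide
b0 <- 7.0271   # intercept, on the log(SALE.PRICE) scale
b1 <- 0.7013   # slope on log(GROSS.SQUARE.FEET)
r2 <- 0.1233   # multiple R-squared

# Elasticity reading: a 1% increase in floor area goes with roughly b1 % higher price
pct_per_1pct <- (1.01^b1 - 1) * 100

# Doubling the floor area multiplies the predicted price by 2^b1
doubling_factor <- 2^b1

# Predicted price for a hypothetical 2,000 sq ft property, back on the dollar scale
pred_2000 <- exp(b0 + b1 * log(2000))

# The most negative residual, expressed as actual price / predicted price
min_resid_ratio <- exp(-14.4529)

round(c(pct_per_1pct = pct_per_1pct, doubling_factor = doubling_factor), 3)
```

Both coefficients are highly significant, yet R-squared is only about 0.12, so floor area alone leaves most of the variation in log price unexplained. The residual range is also lopsided: a minimum near -14.5 on the log scale means at least one sale sits many orders of magnitude below its predicted price, which is worth investigating before trusting the fit.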

  7. Plots – tell me what they tell you!

  8. Solution model 2
     > m2<-lm(log(SALE.PRICE)~log(GROSS.SQUARE.FEET)+log(LAND.SQUARE.FEET)+factor(NEIGHBORHOOD),data=bronx)
     > summary(m2)
     > plot(resid(m2))
     #
     > m2a<-lm(log(SALE.PRICE)~0+log(GROSS.SQUARE.FEET)+log(LAND.SQUARE.FEET)+factor(NEIGHBORHOOD),data=bronx)
     > summary(m2a)
     > plot(resid(m2a))
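The only difference between m2 and m2a is the `0+` term, which removes the intercept. A small sketch on simulated data (all names and values below are hypothetical) suggests what to expect: when a factor is present, the two fits give the same residuals, but summary() reports a different R-squared.

```r
set.seed(1)
# Hypothetical data: three groups with different baseline levels
g <- factor(rep(c("A", "B", "C"), each = 30))
x <- runif(90, 1, 10)
y <- 10 + 0.5 * x + c(0, 2, 4)[as.integer(g)] + rnorm(90)

with_int    <- lm(y ~ x + g)       # intercept is the baseline for group A
without_int <- lm(y ~ 0 + x + g)   # one coefficient per group instead

# Same fitted values and residuals: only the parameterisation differs
same_fit <- all.equal(unname(fitted(with_int)), unname(fitted(without_int)))
print(same_fit)

# Without an intercept, summary() measures R-squared against zero rather than
# against the mean of y, so the two values are not comparable.
print(summary(with_int)$r.squared)
print(summary(without_int)$r.squared)
```

So a jump in R-squared from m2 to m2a does not by itself mean the model improved. The residual plots are the fairer comparison.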

  9. How do you interpret this residual plot?

  10. Solution model 3 and 4
      > m3<-lm(log(SALE.PRICE)~0+log(GROSS.SQUARE.FEET)+log(LAND.SQUARE.FEET)+factor(NEIGHBORHOOD)+factor(BUILDING.CLASS.CATEGORY),data=bronx)
      > summary(m3)
      > plot(resid(m3))
      #
      > m4<-lm(log(SALE.PRICE)~0+log(GROSS.SQUARE.FEET)+log(LAND.SQUARE.FEET)+factor(NEIGHBORHOOD)*factor(BUILDING.CLASS.CATEGORY),data=bronx)
      > summary(m4)
      > plot(resid(m4))
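Models 3 and 4 differ only in `+` versus `*` between the two factors: `*` adds an interaction term for every combination of levels. The sketch below uses small made-up factors (nb and cls are hypothetical stand-ins for NEIGHBORHOOD and BUILDING.CLASS.CATEGORY) to show how the number of coefficients grows.

```r
# Hypothetical balanced factors: 3 neighbourhoods x 2 building classes
nb  <- factor(rep(c("N1", "N2", "N3"), length.out = 120))
cls <- factor(rep(c("C1", "C2"), each = 60))
set.seed(2)
y   <- rnorm(120)

additive    <- lm(y ~ 0 + nb + cls)   # main effects only
interaction <- lm(y ~ 0 + nb * cls)   # main effects plus nb:cls terms

n_add <- length(coef(additive))       # 3 + (2 - 1) = 4 coefficients
n_int <- length(coef(interaction))    # one per cell: 3 * 2 = 6 coefficients
print(c(additive = n_add, interaction = n_int))

# The additive model is nested in the interaction model, so an F test applies
print(anova(additive, interaction))
```

With many neighbourhoods and building classes the interaction model can have a very large number of coefficients, and combinations with no sales at all show up as NA estimates in summary().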

  11. And this one?

  12. Bronx 2 = complex example
      See Lab5b_bronx2_2016.R
      Manipulation
      Mapping
      knn
      kmeans

  13. KNN! Did you loop over k?
      {
        knnpred<-knn(mapcoord[trainid,3:4],mapcoord[testid,3:4],cl=mapcoord[trainid,2],k=5)
        knntesterr<-sum(knnpred!=mappred$class)/length(testid)
      }
      knntesterr
      [1] 0.1028037 0.1308411 0.1308411 0.1588785 0.1401869 0.1495327 0.1682243 0.1962617 0.1962617 0.1869159
      What do you think?
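The snippet on this slide fixes k=5 yet prints ten error rates, so the loop itself is not shown. A minimal sketch of one way to write it follows. It assumes the built-in iris data in place of mapcoord, and the range 1:10, the split and the seed are arbitrary choices rather than the lab's settings.

```r
library(class)  # provides knn()

set.seed(7)
trainid <- sample(nrow(iris), 100)
testid  <- setdiff(seq_len(nrow(iris)), trainid)

ks <- 1:10
knntesterr <- numeric(length(ks))      # one slot per value of k
for (i in seq_along(ks)) {
  knnpred <- knn(iris[trainid, 1:4], iris[testid, 1:4],
                 cl = iris$Species[trainid], k = ks[i])
  # Test error: share of held-out points whose predicted class is wrong
  knntesterr[i] <- mean(knnpred != iris$Species[testid])
}

best_k <- ks[which.min(knntesterr)]
print(data.frame(k = ks, test_error = knntesterr))
print(best_k)
```

If the ten values on the slide correspond to k = 1 through 10, the error is lowest at k = 1 (about 0.10) and drifts upward as k grows, which suggests that averaging over more neighbours blurs local structure in that dataset. A single train/test split is noisy, so repeating the split or using cross-validation would give a steadier picture.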

  14. Return object of kmeans()
      cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
      centers: A matrix of cluster centres.
      totss: The total sum of squares.
      withinss: Vector of within-cluster sum of squares, one component per cluster.
      tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
      betweenss: The between-cluster sum of squares, i.e. totss-tot.withinss.
      size: The number of points in each cluster.
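These components can be inspected directly on a fitted object. The sketch below assumes the built-in iris measurements and three clusters, both arbitrary choices for illustration, and checks the two sum-of-squares identities listed on this slide.

```r
set.seed(123)                               # kmeans uses random starting centres
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

print(km$size)                              # number of points in each cluster
print(km$centers)                           # 3 x 4 matrix of cluster centres
print(table(cluster = km$cluster, species = iris$Species))

# The sum-of-squares components satisfy the identities from the slide
ok_within  <- all.equal(km$tot.withinss, sum(km$withinss))
ok_between <- all.equal(km$betweenss, km$totss - km$tot.withinss)

# Share of the total sum of squares accounted for by the clustering
explained <- km$betweenss / km$totss
print(explained)
```

Comparing tot.withinss across several values of centers is a common way to choose the number of clusters.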
