
BIO503: Lecture 5

BIO503: Lecture 5. Jess Mar Department of Biostatistics jmar@hsph.harvard.edu. Harvard School of Public Health Wintersession 2009. Roadmap for Today. Some More Advanced Statistical Models Multiple Linear Regression Generalized linear models Logistic Regression Poisson Regression


Presentation Transcript


  1. BIO503: Lecture 5 Jess Mar Department of Biostatistics jmar@hsph.harvard.edu Harvard School of Public Health Wintersession 2009

  2. Roadmap for Today Some More Advanced Statistical Models • Multiple Linear Regression • Generalized linear models • Logistic Regression • Poisson Regression • Survival Analysis Multivariate Data Analysis Programming Tutorials Bits & Pieces

  3. Tutorial 4

  4. Multiple Linear Regression Some handy functions to know about: new.model <- update(old.model, new.formula). Model selection functions: drop1, add1 and step (stats package); dropterm, addterm and stepAIC (MASS package). Similarly, anova(modObj, test="Chisq").
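
The functions on this slide can be tried out with a minimal sketch like the one below. The built-in mtcars data and the choice of covariates are assumptions made for illustration; they are not the tutorial's data set.

```r
library(MASS)  # provides dropterm, addterm and stepAIC

# Assumption: mtcars stands in for the tutorial data.
full.model <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# update() refits with a modified formula; "." means "keep what was there".
small.model <- update(full.model, . ~ . - qsec)

# Single-term deletions from the full model, each with an F test.
drop1(full.model, test = "F")

# Stepwise selection by AIC, starting from the full model.
step.model <- stepAIC(full.model, trace = FALSE)
formula(step.model)

# Compare the two nested fits.
anova(small.model, full.model)
```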

  5. Generalized Linear Models Linear regression models hinge on the assumption that the response variable, given the covariates, follows a Normal distribution. Generalized linear models are able to handle non-Normal response variables, using a link function to transform the mean response to linearity.

  6. Logistic Regression When faced with a binary response Y ∈ {0, 1}, we use logistic regression: $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$, where $p = P(Y = 1 \mid x)$.
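
As a small illustration of the logit link behind logistic regression, the sketch below converts probabilities to log odds and back using base R's qlogis and plogis. The probabilities are arbitrary example values.

```r
# Arbitrary example probabilities of the event Y = 1.
p <- c(0.1, 0.5, 0.9)

# The logit (log odds) transform: log(p / (1 - p)).
log.odds <- qlogis(p)
all.equal(log.odds, log(p / (1 - p)))

# The inverse logit maps log odds back to probabilities.
p.back <- plogis(log.odds)
p.back
```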

  7. Problem 2 – Logistic Regression Read in the anaesthetic data set, data file: anaesthetic.txt. Covariates: move, a binary numeric vector for patient movement (1 = movement, 0 = no movement); conc, the anaesthetic concentration. Goal: estimate how the probability of movement varies with increasing concentration of the anaesthetic agent.

  8. Fit the Logistic Regression Model
     > anes.logit <- glm(nomove ~ conc, family=binomial(link=logit), data=anesthetic)
     The output summary looks like this:
     > summary(anes.logit)
     Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
     (Intercept)   -6.469      2.418  -2.675  0.00748 **
     conc           5.567      2.044   2.724  0.00645 **
     Estimates of P(Y=1) are given by:
     > fitted.values(anes.logit)

  9. Estimating Log Odds To get back the log odds (the linear predictor): > anes.logit$linear.predictors > plot(anesthetic$conc, anes.logit$linear.predictors) > abline(coefficients(anes.logit)) The conc coefficient is the log odds ratio per unit increase in concentration. Looks like the odds of not moving increase significantly when you increase the concentration of the anesthetic agent beyond 0.8.
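
Because anaesthetic.txt is not reproduced here, the following sketch fits the same kind of model to simulated data. The data frame sim and the coefficients used to generate it are made up for illustration; only the glm, predict and coef calls mirror the slides.

```r
set.seed(1)

# Hypothetical data: 60 concentrations and a simulated "no movement" outcome.
conc <- seq(0.8, 2.5, length.out = 60)
nomove <- rbinom(length(conc), size = 1, prob = plogis(-3 + 2.5 * conc))
sim <- data.frame(conc, nomove)

fit <- glm(nomove ~ conc, family = binomial(link = "logit"), data = sim)

# Log odds of not moving (same values as fit$linear.predictors).
log.odds <- predict(fit, type = "link")

# Estimated P(Y = 1) (same values as fitted.values(fit)).
prob <- predict(fit, type = "response")

# exp(slope) is the odds ratio for a one-unit increase in conc.
exp(coef(fit)["conc"])
```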

  10. Problem 3 – Multiple Logistic Regression Read in data set birthwt.txt. We fit a logistic regression using the glm function and using the binomial family.
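
A possible version of this fit is sketched below. It assumes that birthwt.txt matches the birthwt data frame shipped with the MASS package, and the covariates chosen are illustrative rather than the tutorial's intended model.

```r
library(MASS)  # ships the birthwt data frame (assumed to match birthwt.txt)

# low = 1 if the birth weight is below 2.5 kg.
bw.logit <- glm(low ~ age + lwt + smoke + ht,
                family = binomial, data = birthwt)
summary(bw.logit)

# Odds ratios for each covariate.
exp(coef(bw.logit))

# Sequential likelihood-ratio (chi-squared) tests for each term.
anova(bw.logit, test = "Chisq")
```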

  11. Problem 4 - Poisson Regression Poisson regression is often used for the analysis of count data or the calculation of rates associated with a rare event or disease. Example: schooldata.csv. We can fit the Poisson regression model using the glm function and the poisson family.
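
Since schooldata.csv is not shown, here is a sketch of a Poisson rate model on simulated counts. The variables exposure, group and events are hypothetical; the offset term turns the count model into a model for rates.

```r
set.seed(2)
n <- 200

# Hypothetical data: exposure (e.g. time at risk) and a binary group indicator.
exposure <- runif(n, 50, 500)
group <- rbinom(n, 1, 0.5)

# Simulated event counts: rate 0.02 per unit exposure, doubled in group 1.
events <- rpois(n, lambda = exposure * 0.02 * 2^group)

# log(exposure) enters as an offset, so coefficients describe log rates.
pois.fit <- glm(events ~ group + offset(log(exposure)), family = poisson)

# exp(coefficients): baseline rate and the rate ratio for group 1 vs group 0.
exp(coef(pois.fit))
```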

  12. Survival Analysis library(survival) Example: aml leukemia data. Kaplan-Meier curve: fit1 <- survfit(Surv(time, status) ~ 1, data=aml[1:11, ]) summary(fit1) plot(fit1) Log-rank test: survdiff(Surv(time, status)~x, data=aml)

  13. Survival Analysis Fit a Cox proportional hazards model: coxfit1 <- coxph(Surv(time, status)~x, data=aml) summary(coxfit1) Cumulative baseline hazard estimator: basehaz(coxfit1) Survival function for one group: plot(survfit(coxfit1, newdata=data.frame(x="Maintained")))
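
The survival steps on these two slides can be strung together roughly as follows. This sketch uses the aml data from the survival package and the formula interface to survfit; the object names are illustrative.

```r
library(survival)

# Kaplan-Meier curves, one per maintenance group.
km <- survfit(Surv(time, status) ~ x, data = aml)
summary(km)

# Log-rank test comparing the two groups.
lr <- survdiff(Surv(time, status) ~ x, data = aml)
lr

# Cox proportional hazards model; exp(coef) is the hazard ratio
# for the second level of x relative to the first.
cox <- coxph(Surv(time, status) ~ x, data = aml)
levels(aml$x)
exp(coef(cox))

# Cumulative baseline hazard from the fitted Cox model.
bh <- basehaz(cox)
head(bh)
```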

  14. Tutorial 5

  15. Cluster Analysis Clustering observations on the basis of experiments or across a time series. Clustering experiments together on the basis of observations. A clustering problem is generally much harder than a classification problem because we don’t know the number of classes. Hierarchical Methods: (Agglomerative, Divisive) + (Single, Average, Complete) Linkage… Model-based Methods: Mixed models. Plaid models. Mixture models…

  16. Examples of Clustering Algorithms Available in R Hierarchical Methods: hclust, agnes. Partitioning Methods: som, kmeans, pam. Packages: cluster. [Figure labels: Different Samples; Observations]

  17. Hierarchical Clustering Agglomerative: start from n genes in n clusters and join. Divisive: start from n genes in 1 cluster and break. (Figure source: J-Express Manual) We join (or break) nodes based on the notion of maximum (or minimum) ‘similarity’. Similarity measures: Euclidean distance, (Pearson) correlation.

  18. Different Ways to Determine Distances Between Clusters Single linkage Complete linkage Average linkage
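
The three linkage rules can be compared with hclust, as in the sketch below. The built-in USArrests data is an assumption standing in for a gene expression matrix, and cutting the tree into 4 groups is an arbitrary choice.

```r
# Assumption: USArrests stands in for an observations-by-samples matrix.
x <- scale(USArrests)

# Euclidean distance between rows.
d <- dist(x)

# Same distances, three different between-cluster distance rules.
hc.single <- hclust(d, method = "single")
hc.complete <- hclust(d, method = "complete")
hc.average <- hclust(d, method = "average")

# Alternative dissimilarity: 1 - Pearson correlation between rows.
d.cor <- as.dist(1 - cor(t(x)))
hc.cor <- hclust(d.cor, method = "average")

# Cut the complete-linkage tree into 4 groups (arbitrary choice).
groups <- cutree(hc.complete, k = 4)
table(groups)
```

Calling plot() on any of the hclust objects draws the corresponding dendrogram.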

  19. Partitioning Methods Examples of partitioning methods are k-means and partitioning around medoids (pam). Gap statistic: source("http://www.bioconductor.org/biocLite.R") biocLite("SAGx") ?gap The goal is to choose the number of clusters for which the gap statistic is largest.
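
As an alternative to the SAGx route on the slide, the cluster package also provides a gap statistic through clusGap. The sketch below is an illustration on simulated two-cluster data, not the tutorial's method; the data, K.max and B are arbitrary choices.

```r
library(cluster)  # clusGap, maxSE

set.seed(3)

# Simulated data: two well-separated groups of 50 points in 2 dimensions.
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))

# Gap statistic for k = 1..5 using k-means, with 50 reference data sets.
gap <- clusGap(x, FUNcluster = kmeans, K.max = 5, B = 50,
               verbose = FALSE, nstart = 10)
gap$Tab

# Pick k from the gap curve using the rule of Tibshirani et al. (2001).
k.hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"],
               method = "Tibs2001SEmax")
k.hat
```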

  20. K-means Clustering W – within-cluster variance; B – between-cluster variance. Reference: J-Express manual
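
The within and between quantities are reported directly by kmeans. In the sketch below, USArrests and the choice of 4 clusters are assumptions for illustration.

```r
set.seed(4)

# Assumption: USArrests as example data, 4 clusters chosen arbitrarily.
km <- kmeans(scale(USArrests), centers = 4, nstart = 25)

km$tot.withinss  # W: total within-cluster sum of squares
km$betweenss     # B: between-cluster sum of squares
km$totss         # total sum of squares, equal to W + B

# Cluster sizes.
km$size
```

K-means looks for the partition that makes W small, which for a fixed data set is the same as making B large.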

  21. 241 genes from 19 cell samples into 6 clusters.

  22. Classification (Machine Learning) Machine learning algorithms predict the classes of new objects based on patterns discerned from existing data. Goal: derive a rule (classifier) that assigns a new object (e.g. patient microarray profile) to a pre-specified group (e.g. aggressive vs non-aggressive prostate cancer). • Classification algorithms are a form of supervised learning. • Clustering algorithms are a form of unsupervised learning. • R packages: • class – contains knn, SOM • nnet • MLInterfaces (Bioconductor) – a simplified way to construct machine learning algorithms from microarray data.

  23. Classification Linear Discriminant Analysis: lda. Support Vector Machines: library(e1071), svm. K-nearest neighbors: knn. Tree-based methods: rpart, randomForest.
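
Two of the classifiers listed here, lda (MASS) and knn (class), can be compared on a held-out set as sketched below. The iris data, the 100/50 split and k = 5 are assumptions for illustration.

```r
library(MASS)   # lda
library(class)  # knn

set.seed(5)

# Assumption: iris as example data, 100 training rows and 50 test rows.
train.idx <- sample(nrow(iris), 100)
train <- iris[train.idx, ]
test <- iris[-train.idx, ]

# Linear discriminant analysis.
lda.fit <- lda(Species ~ ., data = train)
lda.pred <- predict(lda.fit, newdata = test)$class

# k-nearest neighbours on the four numeric columns (k = 5 is arbitrary).
knn.pred <- knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 5)

# Test-set accuracy and confusion table for each classifier.
lda.acc <- mean(lda.pred == test$Species)
knn.acc <- mean(knn.pred == test$Species)
table(predicted = lda.pred, actual = test$Species)
```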

  24. Scaling Methods Principal Component Analysis: prcomp. Multi-dimensional Scaling: MDS. Self Organizing Maps: SOM. Independent Component Analysis: fastICA.
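
A brief sketch of the first two methods follows, using prcomp for PCA and base R's cmdscale for classical multi-dimensional scaling. USArrests is an assumed example data set.

```r
# Assumption: USArrests as example data.
# PCA on standardized variables.
pca <- prcomp(USArrests, scale. = TRUE)
summary(pca)$importance  # variance explained by each component

# Classical MDS on Euclidean distances between the standardized rows.
mds <- cmdscale(dist(scale(USArrests)), k = 2)

# With Euclidean distances, classical MDS coordinates agree with the
# first principal component scores up to the sign of each axis.
same <- all.equal(abs(unname(mds)), abs(unname(pca$x[, 1:2])),
                  tolerance = 1e-6)
same
```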

  25. R Shortcuts Ctrl + A: Ctrl + E: Ctrl + K Esc {Up, Down} Arrow

  26. Laundry List .Rprofile file Outline of R packages Graphics – lattice, Rwiki Homework R/SAS/Stata Comparison Exercises
