Lab 3: DAVID, Clustering and Classification. Yang Li, Lin Liu. Feb 10 & Feb 11, 2014
DAVID (gene set analysis) http://david.abcc.ncifcrf.gov/summary.jsp • Biological processes • Molecular function • Cellular component
Other gene set analysis tools • GSEA http://www.broadinstitute.org/gsea/index.jsp • GSA http://statweb.stanford.edu/~tibs/GSA/ • GOrilla http://cbl-gorilla.cs.technion.ac.il/ • Panther http://www.pantherdb.org/pathway/
Clustering • Goal: finding structure in unlabeled data • Key choice: the distance metric/divergence
Hierarchical Clustering • Repeatedly: • Merge the two closest nodes (each node is a gene or a cluster) • Re-calculate the distance from the newly formed node to all other nodes • Branch length represents distance • Linkage: the rule for computing the distance from a newly formed node to all other nodes (next slide)
Hierarchical Clustering Linkage • Single: minimum pairwise distance between members of the two clusters • Complete: maximum pairwise distance • Average: mean of all pairwise distances (UPGMA). An R sketch follows.
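A minimal sketch of the three linkage rules via hclust's method argument (it assumes a distance matrix D like the one computed on the "Hierarchical clustering" code slide below):
hclust(D, method = "single")    # distance between the closest pair of members
hclust(D, method = "complete")  # distance between the farthest pair of members
hclust(D, method = "average")   # mean of all pairwise distances (UPGMA)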
Partitional Clustering • Disjoint groups • From hierarchical clustering: cut the dendrogram at a chosen height • By varying the cut height we can produce any number of clusters (see the sketch below)
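A minimal sketch of cutting the tree with base R's cutree, assuming an hclust object hc like the one built on the "Hierarchical clustering" code slide:
clusters = cutree(hc, k = 2)   # cut into k = 2 groups; cutree(hc, h = ...) cuts at a height instead
table(clusters)                # size of each resulting cluster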
K-means Algorithm • Choose K centroids at random • Assign each object i to its closest centroid • Recalculate each centroid from the current cluster assignment • Repeat until the assignments stabilize
[Figure: scatter plots of Expression in Sample1 vs. Expression in Sample2, showing iterations 0–3]
Hierarchical clustering
D = dist(t(ld), method = "euclidean")  # pairwise distances between samples (columns of ld)
hc = hclust(D, method = "average")     # average-linkage hierarchical clustering
K-means clustering (a hard-assignment relative of the Gaussian mixture model)
set.seed(1)                        # kmeans starts from random centroids
kmean.cluster <- kmeans(t(ld), 2)  # K = 2 clusters
kmean.cluster$cluster              # cluster assignment for each sample
Distance metrics • Euclidean distance • Hamming distance (binary data) • Correlation: lies in [−1, 1]; 1 − correlation is commonly used as a distance • Mahalanobis distance. A short R sketch follows.
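A minimal sketch of computing these distances directly in R, using two toy binary vectors:
x = c(1, 0, 1, 1); y = c(1, 1, 0, 1)
sqrt(sum((x - y)^2))  # Euclidean distance
sum(x != y)           # Hamming distance (number of disagreeing positions)
1 - cor(x, y)         # correlation-based distance
# Mahalanobis distance: see the later slide and ?mahalanobis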
How to choose a distance: context specific • RNA-Seq example: (1, 0, 0) -> (0, q1, q2) • Jensen-Shannon divergence • JSD(P, Q) = ½ (D(P||M) + D(Q||M)) • D(A||B) is the Kullback-Leibler divergence • M = ½ (P + Q) • Used in RNA-Seq analysis • Problems with JSD? Highly abundant rows will dominate the analysis, and JSD itself is not a metric (its square root is)
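A minimal sketch of JSD in base R for two discrete distributions (assumed to sum to 1); the square root of the result is a metric:
kl = function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))                 # Kullback-Leibler divergence
jsd = function(p, q) { m = (p + q) / 2; 0.5 * kl(p, m) + 0.5 * kl(q, m) }  # Jensen-Shannon divergence
sqrt(jsd(c(1, 0, 0), c(0, 0.4, 0.6)))  # the slide's (1, 0, 0) vs (0, q1, q2) example; equals 1 (disjoint support)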
Mahalanobis distance • Rectifies the dominance problem by normalizing with the covariance matrix S: d(x, y) = sqrt((x − y)ᵀ S⁻¹ (x − y)) • When S is diagonal with variances si², this reduces to d(x, y) = sqrt(Σi (xi − yi)² / si²)
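A minimal sketch using base R's mahalanobis(), which returns squared distances (the toy data here are an assumption for illustration):
x = matrix(rnorm(200), ncol = 2)                     # 100 observations, 2 variables
S = cov(x)                                           # estimated covariance matrix
sqrt(mahalanobis(x[1, ], center = x[2, ], cov = S))  # Mahalanobis distance between rows 1 and 2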
Nonparametric correlation MIC, the Maximal Information Coefficient (Reshef et al., Science, 2011)
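A minimal sketch, assuming the CRAN package minerva (one implementation of MIC; the package choice is an assumption, not part of the original slides):
library('minerva')
x = rnorm(100); y = x^2 + rnorm(100, sd = 0.1)  # a nonlinear association Pearson correlation misses
mine(x, y)$MIC                                  # MIC close to 1 indicates strong dependence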
Dimension reduction • Principal Component Analysis (PCA) • Kernel PCA • LDA • Isomap • Laplacian eigenmap • Manifold learning • …
Principal Component Analysis
pc.cr = prcomp(t(d[genes.set_a,]))  # PCA of the samples, restricted to genes in set A
summary(pc.cr)                      # proportion of variance explained per component
biplot(pc.cr)                       # samples and gene loadings in the first two PCs
Fisher’s LDA Key difference between LDA and PCA? LDA is supervised: it uses class labels to find directions that best separate the classes (maximizing between-class variance relative to within-class variance). PCA is unsupervised and finds directions of maximal overall variance.
Fisher’s LDA • R code:
library(MASS)
lda.fit = lda(t(d[genes.set_a,]), grouping = c(rep('Normal',4), rep('Cancer',8)), subset = 1:12)  # train on samples 1-12
predict(lda.fit, t(d[genes.set_a, 13:14]))  # classify the held-out samples 13-14
Multidimensional scaling
D = dist(t(d), method = "euclidean")
mds = cmdscale(D, k = 2)  # classical MDS into 2 dimensions
plot(mds[,1], mds[,2], type = "p", main = "Clustering using MDS", xlab = 'mds1', ylab = 'mds2')
text(mds, row.names(mds))  # label each point with its sample name
Classification • Classification is prediction with categorical (e.g., binary) outcomes • Machine learning can be viewed as statistics with an emphasis on prediction, scalability, and high-dimensional problems • Clustering (unsupervised) and classification (supervised) are closely interconnected
SVM
library('e1071')
# linear SVM trained on samples 1-12, labels supplied as a factor
model1 = svm(t(d[,1:12]), factor(c(rep('Normal',4), rep('Cancer',8))), type = 'C-classification', kernel = 'linear')
predict(model1, t(d[,13:14]))  # predict the held-out samples 13-14
K-Nearest Neighbor
# KNN with k = 1: train on samples 1-12, classify samples 13-14
class::knn(t(ld[,1:12]), t(ld[,13:14]), c(rep('Normal',4), rep('Cancer',8)), k = 1)
# KNN with k = 3
class::knn(t(ld[,1:12]), t(ld[,13:14]), c(rep('Normal',4), rep('Cancer',8)), k = 3)
Naïve Bayes (slides borrowed from Manolis Kellis’s course)
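The slide content itself is not reproduced here; as a minimal sketch in the style of the SVM example above, e1071 also provides a naive Bayes classifier (d and the Normal/Cancer labels are assumed from the earlier slides):
library('e1071')
nb.model = naiveBayes(t(d[,1:12]), factor(c(rep('Normal',4), rep('Cancer',8))))  # fits per-class Gaussians for each gene
predict(nb.model, t(d[,13:14]))  # classify the held-out samples 13-14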
Hint for hw1 problem 2 • For the graduate-level question, try thinking about how to remove batch effects using PCA • For the ComBat software, try searching Google for "sva bioconductor" (ComBat is distributed in the Bioconductor sva package)