
Introduction to Bioinformatics


Presentation Transcript


  1. Introduction to Bioinformatics Biological Networks Department of Computing Imperial College London March 4, 2010 Lecture hours 14-15 Nataša Pržulj natasha@imperial.ac.uk

  2. Data Clustering • Find relationships and patterns in the data to gain insight into the underlying biology • Clustering algorithms can be applied to the data to find groups of similar genes/proteins, or groups of similar samples

  3. What is data clustering? • Clustering of data is a method by which large sets of data are grouped into clusters (groups) of smaller sets of similar data. • Example: There are 10 balls of three different colours, and we are interested in clustering the balls into three different groups. • An intuitive solution is to group balls of the same colour together. • Identifying similarity by colour was easy; however, we want to extend this to numerical values so that we can deal with biological data, and also to cases where there are more features (not just colour).

  4. Clustering • Partition a set of elements into subsets called clusters such that • elements of the same cluster are similar to each other (homogeneity property, H) • elements from different clusters are different from each other (separation property, S)

  5. Clustering Algorithms • A clustering algorithm attempts to find natural groups of components (or data) based on some notion of similarity over the features describing them. • Many clustering algorithms also compute the centroid of each group of data points. • To determine cluster membership, many algorithms evaluate the distance between a point and the cluster centroids. • The output of a clustering algorithm is essentially a statistical description of the cluster centroids, together with the number of components in each cluster.

  6. Clustering Algorithms Cluster centroid: • The centroid of a cluster is a point whose parameter values are the mean of the parameter values of all the points in the cluster. Distance: • Generally, the distance between two points is taken as a common metric to assess the similarity among the components of a population. The most commonly used distance measure is the Euclidean distance, which defines the distance between two points p = (p1, p2, ...) and q = (q1, q2, ...) as: d(p, q) = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + ... )
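A minimal Python sketch (NumPy assumed available) of the two quantities just defined: the centroid as the per-feature mean of a cluster's points, and the Euclidean distance between two points p and q:

```python
import numpy as np

def centroid(points):
    """Cluster centroid: the per-feature mean of all points in the cluster."""
    return np.mean(np.asarray(points, dtype=float), axis=0)

def euclidean(p, q):
    """Euclidean distance between p = (p1, p2, ...) and q = (q1, q2, ...)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

print(euclidean([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # 5.0
print(centroid([[1, 2], [3, 4], [5, 6]]))           # [3. 4.]
```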

  7. Clustering Algorithms • There are many possible distance metrics. • Some theoretical (and intuitive) properties of distance metrics: • The distance between two items (elements) must be greater than or equal to zero; distances cannot be negative. • The distance between an item and itself must be zero; conversely, if the distance between two items is zero, then the items must be identical. • The distance between item A and item B must be the same as the distance between item B and item A (symmetry). • The distance between item A and item C must be less than or equal to the sum of the distances between items A and B and between items B and C (triangle inequality).

  8. Clustering Algorithms Example distances: • Euclidean (L2) distance • Manhattan (L1) distance • Lm: (|x1 - x2|^m + |y1 - y2|^m)^(1/m) • L∞: max(|x1 - x2|, |y1 - y2|) • Inner product: x1·x2 + y1·y2 • Correlation coefficient • For simplicity we will concentrate on the Euclidean and Manhattan distances

  9. Clustering Algorithms Distance Measures: Minkowski Metric • Suppose two objects x and y both have n features: x = (x1, x2, ..., xn), y = (y1, y2, ..., yn). • The Minkowski metric is defined as: d(x, y) = ( |x1 - y1|^r + |x2 - y2|^r + ... + |xn - yn|^r )^(1/r)

  10. Clustering Algorithms Commonly used Minkowski metrics: • r = 1: Manhattan (L1, city-block) distance • r = 2: Euclidean (L2) distance • r → ∞: the maximum (L∞, "sup") distance, max over |xi - yi|

  11. Clustering Algorithms Examples of Minkowski metrics:
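The worked examples from this slide are not preserved in the transcript; as a stand-in, here is a small Python sketch (the function name `minkowski` is illustrative) evaluating the metric for the commonly used choices of r:

```python
import numpy as np

def minkowski(x, y, r=2):
    """Minkowski metric: ( sum_i |x_i - y_i|^r )^(1/r)."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(r):
        return diff.max()          # L-infinity: the maximum coordinate difference
    return (diff ** r).sum() ** (1.0 / r)

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, r=1))        # 7.0  Manhattan (L1)
print(minkowski(x, y, r=2))        # 5.0  Euclidean (L2)
print(minkowski(x, y, r=np.inf))   # 4.0  L-infinity (max)
```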

  12. Clustering Algorithms Distance/Similarity matrices: • Clustering is based on distances – distance/similarity matrix: • Represents the distance between objects • Only need half the matrix, since it is symmetric
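A short sketch of how such a matrix might be computed in practice; `scipy.spatial.distance.pdist` returns only the upper triangle (as a flat vector), reflecting the symmetry noted above, and `squareform` expands it to the full n×n matrix (SciPy assumed available):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy data: 4 objects (e.g. genes), each described by 3 features
data = np.array([[1.0, 0.5, 2.0],
                 [1.1, 0.4, 1.9],
                 [5.0, 4.5, 6.0],
                 [5.2, 4.4, 5.8]])

condensed = pdist(data, metric='euclidean')  # upper triangle only
D = squareform(condensed)                    # full symmetric matrix, zero diagonal
print(np.round(D, 2))
```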

  13. Clustering Algorithms Hierarchical vs Non-hierarchical: • Hierarchical clustering is the most commonly used method for identifying groups of closely related genes or tissues. It successively links genes or samples with similar profiles to form a tree structure. • K-means clustering is a non-hierarchical (flat) clustering method that requires the analyst to supply the number of clusters in advance and then allocates genes and samples to clusters appropriately.

  14. Clustering Algorithms Hierarchical Clustering: • Given a set of N items to be clustered, and an N×N distance (or similarity) matrix, the basic process of hierarchical clustering is as follows (see the sketch below): • Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. • Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer. • Compute the distances (similarities) between the new cluster and each of the old clusters. • Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
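A hedged sketch of this procedure using SciPy's agglomerative implementation (SciPy and matplotlib assumed available); `linkage` starts from N singleton clusters and repeatedly merges the closest pair, and `dendrogram` draws the resulting tree:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

data = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [9.0, 1.0]])

# Steps 1-4: start from singleton clusters, repeatedly merge the closest pair
Z = linkage(pdist(data), method='single')   # also: 'complete', 'average', ...
dendrogram(Z, labels=['a', 'b', 'c', 'd', 'e'])
plt.show()
```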

  15. Clustering Algorithms Hierarchical Clustering: • Scan the matrix for the minimum • Join items into one node • Update matrix and repeat from step 1

  16. Clustering Algorithms Hierarchical Clustering: • Distance between two points – easy to compute • Distance between two clusters – harder to compute: • Single-Link Method / Nearest Neighbor • Complete-Link / Furthest Neighbor • Average of all cross-cluster pairs

  17. Clustering Algorithms Hierarchical Clustering (see the sketch below): • Single-Link Method / Nearest Neighbor (also called the connectedness, or minimum, method): • the distance between one cluster and another is equal to the shortest distance from any member of one cluster to any member of the other cluster • Complete-Link / Furthest Neighbor (also called the diameter, or maximum, method): • the distance between one cluster and another is equal to the longest distance from any member of one cluster to any member of the other cluster • Average-link clustering: • the distance between one cluster and another is equal to the average distance from any member of one cluster to any member of the other cluster
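The three linkage rules can be written directly in terms of all cross-cluster pairwise distances; a minimal sketch (NumPy assumed, function names illustrative):

```python
import numpy as np

def cross_distances(A, B):
    """All pairwise Euclidean distances between points of cluster A and cluster B."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single_link(A, B):     # nearest neighbour / minimum method
    return cross_distances(A, B).min()

def complete_link(A, B):   # furthest neighbour / maximum method
    return cross_distances(A, B).max()

def average_link(A, B):    # average of all cross-cluster pairs
    return cross_distances(A, B).mean()
```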

  18. Clustering Algorithms Hierarchical Clustering: • Example: Single-Link (Minimum) Method: Resulting Tree, or Dendrogram:

  19. Clustering Algorithms Hierarchical Clustering: • Example: Complete-Link (Maximum) Method: Resulting Tree, or Dendrogram:

  20. Clustering Algorithms Hierarchical Clustering: In a dendrogram, the length of each tree branch represents the distance between the clusters it joins. Different dendrograms may arise when different linkage methods are used.

  21. Clustering Algorithms K-Means Clustering: • Basic idea: use cluster centroids (means) to represent clusters. • Assign data elements to the closest cluster (centroid). • Goal: minimize intra-cluster dissimilarity.

  22. Clustering Algorithms K-Means Clustering (a minimal sketch follows): • Pick (usually at random) k points as the centers of k clusters. • Compute the distances between a non-center point v and each of the k center points; • find the minimum distance, say it is to center point Ci, and assign v to the cluster defined by Ci. • Do this for all non-center points to obtain k non-overlapping clusters containing all the points. • For each cluster, compute its new center: the point with the minimum sum of distances to all other points in the cluster (in standard k-means this is the cluster centroid, i.e., the mean). • Repeat until the algorithm converges, i.e., the same set of centers is chosen as in the previous iteration. This results in non-overlapping clusters of potentially different sizes.
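A minimal NumPy sketch of the procedure above (Lloyd's algorithm); following slide 21, each new center is taken to be the cluster centroid (mean), and the initial centers are picked at random:

```python
import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    """Minimal k-means; returns (centers, labels). Assumes no cluster becomes empty."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # 1. Pick k points at random as the initial cluster centers.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # 2.-4. Assign every point to its closest center.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # 5. Recompute each center as the centroid (mean) of its cluster.
        new_centers = np.array([data[labels == i].mean(axis=0) for i in range(k)])
        # 6. Stop when the centers no longer change (convergence).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```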

  23. Clustering Algorithms K-Means Clustering Example:
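The example from this slide is not preserved in the transcript; as a stand-in, a toy run of the `k_means` sketch above on two well-separated 2-D groups (illustrative data):

```python
import numpy as np

points = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],    # group near (1, 1)
                   [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])   # group near (8, 8)

centers, labels = k_means(points, k=2)
print(np.round(centers, 2))   # roughly [1, 1] and [8, 8]
print(labels)                 # e.g. [0 0 0 1 1 1] (cluster numbering is arbitrary)
```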

  24. Clustering Algorithms K-means vs. Hierarchical clustering: • Computation time: – Hierarchical clustering: O( m n^2 log(n) ) – K-means clustering: O( k t m n ) • t: number of iterations • n: number of objects • m: number of dimensions (features) per object • k: number of clusters • Memory requirements: – Hierarchical clustering: O( mn + n^2 ) – K-means clustering: O( mn + kn ) • Other: • Hierarchical clustering: • need to select a linkage method • to perform any analysis, it is necessary to partition the dendrogram into k disjoint clusters by cutting it at some point; a limitation is that it is not clear how to choose this k (see the sketch below) • K-means: need to select k • In both cases: need to select a distance/similarity measure
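For the dendrogram-cutting step mentioned above, SciPy provides `fcluster`; a hedged sketch of partitioning a hierarchical clustering into k disjoint clusters (the choice of k is still left to the analyst):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

data = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [9.0, 1.0]])
Z = linkage(pdist(data), method='average')

# Cut the dendrogram into (at most) k = 3 disjoint clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)   # e.g. [1 1 2 2 3]
```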
