
Clustering


Presentation Transcript


1. Clustering
   • Relevance to Bioinformatics
   • Array(s) analysis
   • Examples
   • Principal Component Analysis
   • Clustering Algorithms
   Lecture 13, CS567

2. Bioinformatic arrays
   • Array = view of biological (sub)system(s) in terms of a vector of attributes
   • Arrays at different resolutions
     • Ecosystem: spectrum of life in a pond
     • Population: characteristics of a population of a species
     • Organism:
       • "FBI profile" (multiple attribute types)
       • Single Nucleotide Polymorphism (SNP) profile (single attribute type)

3. Bioinformatic arrays
   • Arrays at different resolutions (continued)
     • Cell: mRNA expression array in a cancer cell
     • Cell subcompartment: cytosolic protein array
   • Multiple arrays
     • Trivially, replicates: increase the signal-to-noise ratio
     • Under different conditions: useful for analyzing
       • relationships between conditions
       • relationships between attributes

4. Bioinformatic clusters
   • Genes {g} that are turned on together in condition x (e.g., a disease, a phase of life)
   • Conditions {x} under which gene g is turned on
   • Sequence/subsequence families
   • Structure/substructure families
   • Hierarchical classification of molecules/systems
   • Groups of people sharing a disease
   • Groups of diseases common to a people/species

5. Principal Component Analysis
   • Not really clustering, but useful in the analysis of multidimensional data
   • Grounds for PCA
     • Humans can view only 3 dimensions at a time; PCA renders a lower-dimensional approximation of a high-dimensional space
     • PCA highlights the attributes showing maximum variation and helps in ignoring attributes showing minimal variation

6. Principal Component Analysis
   • Project data from an N-dimensional space to an M-dimensional space such that M << N (see the sketch below)
   • The first principal component is aligned with the "direction" of maximal variability
   • Data remaining after removal of the projection along a principal component is "residual" data, with "residual" variance
   • Each subsequent component is derived in ranked order of residual variance captured
   • All PCs are orthogonal
   • Limitations
     • Different clusterings may look alike under PCA
     • The combination of attributes that constitutes a principal component may not be biologically intuitive
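
A minimal sketch of the projection described above, using NumPy's SVD; the function name pca and the toy data are my illustration, not part of the lecture:

```python
import numpy as np

def pca(X, m):
    """Project N-dimensional rows of X onto their top-m principal components."""
    Xc = X - X.mean(axis=0)                  # center each attribute
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Rows of Vt are the (orthogonal) principal axes, ranked by the
    # variance they capture; S**2 is proportional to that variance.
    var_frac = S**2 / np.sum(S**2)
    return Xc @ Vt[:m].T, var_frac[:m]

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                  # e.g., 6 samples x 4 attributes
Y, var = pca(X, 2)
print(Y.shape, var)                          # (6, 2), variance fractions of PC1 and PC2
```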

7. Clustering
   • Ultimate goal: "Birds of a feather stick together; we be brothers thicker than blood"
     • Minimum similarity within a cluster should exceed maximum similarity between clusters
     • Maximum distance within a cluster should be less than minimum distance between clusters
   • Clustering: classifying data without class labels
     • Supervised: classes are known ("classifying Olympic athletes who are missing their country tags")
     • Unsupervised: classes are not known ("classifying forms of life on an alien planet")
     • Hybrid: combination of the above ("find classes based on a sample of the data, then classify the remaining data into the newly defined classes")

8. The notion of Distance
   • Definition: a distance must satisfy
     • Positivity
     • Symmetry (why relative entropy is not a distance: "one-way traffic is not really distance!")
     • Triangle inequality ("the 'Beam me up, Scotty' route is not a distance"): getting from a to b via some c is never shorter than getting from a to b directly
     • Self distance = 0 ("If you are trying to find yourself, just open your eyes – you are right here!")
   • Examples of true distances (spot-checked in the sketch below)
     • Minkowski distances (déjà vu)
     • Distances derived from the Pearson correlation coefficient (e.g., 1 − r)
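
The four axioms can be checked numerically. A small sketch of my own (not from the lecture) for the Minkowski family; one random triple illustrates, though of course does not prove, the properties:

```python
import numpy as np

def minkowski(a, b, p=2):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

rng = np.random.default_rng(1)
a, b, c = rng.normal(size=(3, 5))
assert minkowski(a, a) == 0                                   # self distance = 0
assert minkowski(a, b) > 0                                    # positivity
assert np.isclose(minkowski(a, b), minkowski(b, a))           # symmetry
assert minkowski(a, b) <= minkowski(a, c) + minkowski(c, b)   # triangle inequality
```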

9. Representing Distances
   • Absolute measures
     • Can be defined without reference to other data points
     • Examples: height, weight
   • Pairwise distances
     • Defined with reference to a pair of data points
     • May or may not be derivable from absolute measures
     • Inverse of the similarity between points
     • Example: difference in height (see the sketch below)
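
For instance, a pairwise distance matrix derived from an absolute measure (the values are illustrative):

```python
import numpy as np

heights = np.array([150.0, 165.0, 180.0, 172.0])   # absolute measures (cm)
D = np.abs(heights[:, None] - heights[None, :])    # pairwise: |h_i - h_j|
print(D)                                           # symmetric, zero diagonal
```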

10. Distances in bioinformatics
   • Absolute
     • (Sub)sequence length / molecular weight
   • Pairwise
     • Between primary structures
       • Based on probabilistic similarity scores obtained from pairwise sequence alignment
       • Based on expectations of similarity scores
       • Example: clustering based on BLAST searches
     • Between secondary/tertiary structures
       • Root-mean-square (RMS) difference of pairs of 3D coordinate arrays (see the sketch below)
       • Example: structural classes in the CATH database
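
A sketch of the RMS difference between two 3D coordinate arrays. Real structural comparison first superposes the structures (e.g., via the Kabsch algorithm); this assumes the arrays are already aligned:

```python
import numpy as np

def rmsd(A, B):
    """Root-mean-square deviation between two pre-superposed N x 3 arrays."""
    return np.sqrt(np.mean(np.sum((A - B) ** 2, axis=1)))

A = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
B = A + 0.1                     # a slightly perturbed copy of the same structure
print(rmsd(A, A), rmsd(A, B))   # 0.0 and ~0.173
```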

11. Data Types
   • Real numbers
     • Interval-scaled: linear scale
     • Ratio-scaled: non-linear scale
     • Fine-grained resolution
   • Nominal
     • Named categories
     • Distance typically Boolean ("You are either a Muggle or a Wizard!")
     • Coarse distance
   • Ordinal
     • Named categories with an implicit ranking ("distance from janitor to CEO > distance from President to CEO, not counting the .com boom years")
     • Distance depends on relative ranks (intermediate granularity; see the sketch below)
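
A sketch of how distance granularity differs by data type; the category names and rank values are illustrative, not from the lecture:

```python
# Nominal: distance is Boolean (match / mismatch), i.e., coarse.
def d_nominal(a, b):
    return 0 if a == b else 1

# Ordinal: map categories to ranks and compare rank differences
# (intermediate granularity).
RANKS = {"janitor": 0, "manager": 1, "president": 2, "CEO": 3}

def d_ordinal(a, b):
    return abs(RANKS[a] - RANKS[b]) / (len(RANKS) - 1)

print(d_nominal("Muggle", "Wizard"))   # 1
print(d_ordinal("janitor", "CEO"))     # 1.0
print(d_ordinal("president", "CEO"))   # ~0.33
```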

12. Clustering algorithms
   • K-means (centroid-based)
   • Algorithm (a sketch follows this slide)
     1. Pick k points
     2. For each point i, find Dmin(i, k) and assign i to the nearest center k
     3. Replace the k points by the centroids (means) of each cluster
     4. Calculate the mean squared distance from the cluster centers
     5. Iterate steps 2–4 until the membership of points does not change OR the quantity in step 4 is minimized
   • Scales well
   • Limitations
     • Number of clusters is predetermined
     • Clusters are always spherical in shape
     • Adversely affected by outliers ('coz centroids are virtual data points, à la "Simone")
     • A mean needs to be definable
     • Only a local minimum is found
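
A minimal NumPy sketch of the steps above. The function name and defaults are mine; it assumes Euclidean distance and that no cluster empties out during iteration:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # step 1: pick k points
    labels = None
    for it in range(n_iter):
        # step 2: Dmin(i, k) -- assign each point to its nearest center
        D = np.linalg.norm(X[:, None] - centers[None], axis=2)
        new_labels = D.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                       # step 5: membership frozen
        labels = new_labels
        # step 3: replace the k points by the cluster centroids (means)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    mse = np.mean(D.min(axis=1) ** 2)                   # step 4: mean squared distance
    return labels, centers, mse

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
labels, centers, mse = kmeans(X, 2)                     # two well-separated blobs
```

Because step 1 is random, different seeds can converge to different answers, which is the "only local minimum found" limitation above.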

13. Clustering algorithms
   • K-means (centroid-based) – EM variation
   • Algorithm (a sketch follows this slide)
     1. Pick k points
     2. For each point i, find all D(i, k) and assign i to each center k probabilistically
     3. Replace the k points by the (weighted) centroids of each cluster
     4. Calculate the mean squared distance from the cluster centers
     5. Iterate steps 2–4 until the expectation of the error function is minimized (the likelihood is maximized)
   • A better minimum is found (Why? Soft assignments smooth the error surface, so early hard commitments do not lock the search into a poor local minimum)
   • Limitations
     • Number of clusters is predetermined
     • Clusters are always spherical in shape
     • Adversely affected by outliers ('coz centroids are virtual data points)
     • A mean needs to be definable
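
A sketch of the soft-assignment variant: every point contributes to every centroid with a weight that decays with distance. The name soft_kmeans and the stiffness parameter beta are my choices, not the lecture's:

```python
import numpy as np

def soft_kmeans(X, k, beta=2.0, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None] - centers[None]) ** 2).sum(axis=2)
        W = np.exp(-beta * d2)             # E-step: probabilistic assignment
        W /= W.sum(axis=1, keepdims=True)  # each point's weights sum to 1
        # M-step: centers become weighted means of all points
        centers = (W.T @ X) / W.T.sum(axis=1, keepdims=True)
    return W, centers
```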

14. Clustering algorithms
   • K-medoids (use actual data points – medoids – as cluster centers, rather than means)
   • Algorithm (a sketch follows this slide)
     1. Pick k points
     2. For each point i, find Dmin(i, k) and assign i to the nearest medoid k
     3. Calculate the cost C_old of distances from the cluster centers
     4. Replace one of the k medoids by a random point from a cluster, calculating C_new every time the cluster structure is changed
     5. If C_new < C_old, then accept the new clustering
     6. Iterate until the membership of points does not change OR C is minimized
   • Handles outliers well
   • Limitations
     • Number of clusters is predetermined
     • Clusters are always spherical in shape
     • Doesn't scale as well as k-means
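
A naive PAM-style sketch of the swap loop above, using the total distance to the nearest medoid as the cost C; the function name and the random-swap scheme are my simplifications:

```python
import numpy as np

def kmedoids(X, k, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None] - X[None], axis=2)   # full pairwise distances
    medoids = list(rng.choice(len(X), k, replace=False))
    cost = D[:, medoids].min(axis=1).sum()             # C_old
    for _ in range(n_iter):
        i = rng.integers(len(X))                       # candidate non-medoid point
        if i in medoids:
            continue
        j = rng.integers(k)                            # medoid slot to replace
        trial = medoids.copy()
        trial[j] = i
        new_cost = D[:, trial].min(axis=1).sum()       # C_new
        if new_cost < cost:                            # accept improving swaps only
            medoids, cost = trial, new_cost
    return medoids, D[:, medoids].argmin(axis=1)
```

Because medoids are actual data points, a single outlier cannot drag a center away from the data, which is why outliers are handled well.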

15. Clustering algorithms
   • Hierarchical clustering (tree building – average-linkage clustering)
   • Algorithm (a sketch follows this slide)
     1. Find the points i and j that are closest and unite them under a node
     2. Treat i and j as a single cluster represented by their average
     3. Iterate steps 1–2 until all points are included in the tree
   • Biologically relevant results
   • Limitations
     • Solution is frequently suboptimal
     • Cluster boundaries are not clear
     • Linear orderings of the tree's leaves may not correspond to similarity
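
The lecture describes the algorithm abstractly; in practice SciPy's scipy.cluster.hierarchy implements it directly. A short sketch with toy data of my own:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(3, 0.3, (5, 2))])

# Average linkage: inter-cluster distance = mean of all pairwise distances.
Z = linkage(X, method="average")                 # (n-1) x 4 merge table (the tree)
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
```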

16. Clustering algorithms
   • Single-linkage clustering
   • Algorithm (a sketch follows this slide)
     1. If D(i, j) < cut-off, then include i and j in the same cluster
     2. If i in cluster x is within the cut-off distance of j in cluster y, then merge x and y into a single cluster
   • Handles non-spherical clusters well
   • Limitations
     • Not the best way to go
     • Doesn't scale well
     • Essentially assumes transitivity of similarity ("Saudi princely clan, Indian joint family, the Godfather's clan")
     • Clusters are not compact, but scraggly
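
A sketch of cut-off-based single linkage using union-find, which makes the transitivity assumption explicit: any chain of below-cut-off links merges everything along the chain into one cluster (the function name is mine). The chaining is also why the resulting clusters can be scraggly rather than compact:

```python
import numpy as np

def single_linkage(X, cutoff):
    n = len(X)
    parent = list(range(n))                # union-find forest over the points

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    D = np.linalg.norm(X[:, None] - X[None], axis=2)
    for i in range(n):
        for j in range(i + 1, n):
            if D[i, j] < cutoff:                # any sufficiently close pair ...
                parent[find(i)] = find(j)       # ... merges their whole clusters
    return [find(i) for i in range(n)]          # cluster label per point
```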
