
Bioinformatics

Bioinformatics. Microarray Analysis: Clustering 27/11/2006. Preprocessing. Array by array approach. ANOVA based. Background corr. Background corr. Log transformation. Log transformation. Filtering. Filtering. normalization. Linearisation. Ratio. Test statistic (T-test).

saul

Presentation Transcript


  1. Bioinformatics Microarray Analysis: Clustering 27/11/2006

  2. Preprocessing • Array by array approach: Background corr, Log transformation, Filtering, normalization, Ratio, Test statistic (T-test) • ANOVA based: Background corr, Log transformation, Filtering, Linearisation, Bootstrapping

  3. Overview • MICROARRAY ANALYSIS • Gene expression • Omics era • Transcript profiling • Experiment design • Preprocessing • Further analysis • Identification of differentially expressed genes • Clustering

  4. Overview: further analysis • Raw data → Preprocessing → Preprocessed data • Test statistic → Differentially expressed genes • Clustering → Clusters of coexpressed genes

  5. Differentially expressed genes • Type 1: comparison of 2 samples (2-sample design) • Control sample vs. induced sample • Statistical testing • Retrieve statistically over- or under-expressed genes

  6. Differentially expressed genes: test statistic • Comparison of 2 experiments: • Fold test • T-test • SAM • … • A plethora of different methods is available. Which one performs best? • The methods rest on different underlying statistical assumptions, which has implications for the final result • It is difficult to define the best method
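
A minimal sketch of two of the tests named on slide 6, assuming log2 expression values from replicate arrays. The function names and the toy numbers are illustrative only, and the Welch (unequal variance) form of the t-statistic is an assumption, since the slide does not say which variant is meant.

```python
import math
import statistics

def log2_fold_change(control, induced):
    """Fold test on log2 data: difference of the group means."""
    return statistics.mean(induced) - statistics.mean(control)

def welch_t(control, induced):
    """Welch t-statistic for two groups of replicate measurements."""
    m1, m2 = statistics.mean(control), statistics.mean(induced)
    v1, v2 = statistics.variance(control), statistics.variance(induced)
    return (m2 - m1) / math.sqrt(v1 / len(control) + v2 / len(induced))

# Hypothetical log2 intensities of one gene on three replicate arrays.
control = [1.0, 1.2, 0.8]
induced = [3.0, 3.2, 2.8]
print(log2_fold_change(control, induced))
print(welch_t(control, induced))
```

The fold test looks only at the size of the change, while the t-statistic also scales it by the replicate variance, which is one reason the two can rank genes differently.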

  7. Clustering • Measure expression of all genes • Over time (dynamic profile) • In different conditions • Clustering: identify coexpressed genes • Motif finding: identify mechanism of coregulation

  8. Clustering: multiple array design • Study of the mitotic cell cycle of Saccharomyces cerevisiae with oligonucleotide arrays (Cho et al. 1999) - 15 time points (E=18) • time points 90 & 100 min deleted (Zhang et al. 1999, Tavazoie et al. 1999) • Original dataset: 6178 genes • Preprocessing: • select 4634 most variable (25 % most variable) • variance normalized • adaptive quality based clustering (32 clusters) (95%)

  9. Preprocessing Log2 Ratio clustering

  10. Clustering: principle 1) gene = vector with expression values, usually log2(R/G), e.g. Gene 1 = (2, 4, 6), …, Gene n = (6, 6, 4)
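
As an illustration of step 1, a gene's expression vector can be built from the two channel intensities of each array. This is only a sketch: the function name and the intensities are made up, and R and G are assumed to be background-corrected, strictly positive red and green signals.

```python
import math

def log_ratio_profile(red, green):
    """Return log2(R/G) for each array (time point or condition)."""
    return [math.log2(r / g) for r, g in zip(red, green)]

# Hypothetical intensities of one gene on three arrays.
red = [200.0, 400.0, 100.0]
green = [100.0, 100.0, 100.0]
profile = log_ratio_profile(red, green)
print(profile)  # [1.0, 2.0, 0.0]
```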

  11. Clustering: principle 2) measure distances between expression vectors 3) group genes with minimal distance • Different distance metrics • Different algorithms (figure: data from microarrays, axes Gene 1 and Gene 2)

  12. Rescaling • mean centering: for each gene, the average expression value (row average) is subtracted from its expression values, so that the average expression value of the gene becomes 0 • mean centering and dividing each gene by its variance
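
The two rescaling options can be sketched as follows. Note the assumption: the slide says "dividing by its variance", while this sketch divides by the standard deviation, which is the usual way to obtain a profile with unit variance. Function names are illustrative.

```python
import statistics

def mean_center(profile):
    """Subtract the gene's own average so its mean becomes 0."""
    mu = statistics.mean(profile)
    return [x - mu for x in profile]

def standardize(profile):
    """Mean-center, then scale to unit variance (assumption: divide by SD)."""
    centered = mean_center(profile)
    sd = statistics.pstdev(profile)
    if sd == 0:
        return centered  # flat profile: nothing to scale
    return [x / sd for x in centered]

# Same shape, tenfold different amplitude: identical after rescaling.
weak = standardize([1.0, 2.0, 3.0])
strong = standardize([10.0, 20.0, 30.0])
print(weak)
print(strong)
```

The same scaling also inflates a low-amplitude noisy profile to unit variance, which is the drawback discussed on the following slides.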

  13. Rescaling • Effects of rescaling • genes with a similar expression profile but strongly different absolute expression values end up closer together in the M-dimensional space and are more easily grouped together • the noise in the dataset is amplified

  14. Rescaling: influence of normalisation on the clustering • most algorithms require normalisation. Without normalisation the algorithms tend to cluster the noisy genes, since these lie closer to each other than the few genes that alter their expression level (figure legend: cyan = clustering without rescaling; red = clustering with rescaling)

  15. Rescaling (figure: noisy profiles and significant profiles, before and after normalization) • Noisy profiles are rescaled as well and might deteriorate the quality of the cluster

  16. Clustering methods • Hierarchical clustering algorithms: Agglomerative, Divisive • Non-hierarchical clustering algorithms: K-means, A.Q.B.C., SOM • Algorithms require • Specific preprocessing • Specific metric • Specific parameter settings • Specific properties • Design an algorithm that combines biologically relevant characteristics

  17. Clustering Distance • Minkowski distance
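
The formula itself did not survive in the transcript. For reference, the standard definition of the Minkowski distance between two expression vectors x and y with M measurements is given below; the symbols x, y and p are chosen here, and M follows the "M-dimensional space" of slide 13.

```latex
d_p(x, y) = \left( \sum_{i=1}^{M} \lvert x_i - y_i \rvert^{p} \right)^{1/p},
\qquad p \ge 1
```

With p = 1 this is the Manhattan distance and with p = 2 the Euclidean distance.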

  18. Clustering Distance • Similarity measures • Pearson correlation coefficient • Mutual information • Variance weighted distance measures
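
A short sketch of the first similarity measure on this slide, the Pearson correlation coefficient, together with one common way of turning it into a distance (1 - r). That conversion and the function names are assumptions, not something the slide states.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pearson_distance(x, y):
    """0 for perfectly correlated profiles, 2 for anti-correlated ones."""
    return 1.0 - pearson(x, y)

print(pearson([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
print(pearson_distance([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

Because the correlation ignores offset and amplitude, it compares the shape of two profiles, much as a Euclidean distance does after rescaling.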

  19. Algorithms: hierarchical clustering • Agglomerative method (phylogenetic classification) • Calculate pairwise distances between genes (distance matrix) • Metrics • Pearson correlation • Mutual information • Euclidean distance • distance matrix is searched for the two most similar genes (clusters) • Rules • Single linkage • Average linkage • complete linkage

  20. Algorithms: hierarchical clustering • The two selected clusters (genes) are merged to produce a new object (e.g. average of two merged objects) • Distance is recalculated (between genes, between merged objects, between genes & merged objects) • Process is repeated until all genes are clustered
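
The agglomerative procedure of slides 19 and 20 can be sketched in plain Python. Assumptions in this sketch: Euclidean distance as the metric, average linkage as the rule (distances are recomputed from the cluster members rather than from an averaged merged object), and merging stops once the closest pair is further apart than a user-chosen cut-off. All names and the toy data are illustrative.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_linkage(c1, c2, data):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(data[i], data[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(data, cutoff):
    """Repeatedly merge the two closest clusters until none is within cutoff."""
    clusters = [[i] for i in range(len(data))]  # every gene starts alone
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], data)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > cutoff:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Four hypothetical genes measured in two conditions.
genes = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
result = agglomerate(genes, cutoff=1.0)
print(result)  # [[0, 1], [2, 3]]
```

Recording the distance at each merge would give the branch lengths of the dendrogram.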

  21. Algorithms: hierarchical clustering • Properties • deterministic • user-defined parameters: • Cut-off value • Metric definition • Rule • Advantages • visualisation possible: dendrogram • Length of the branches is indicative of the distance between the clusters • Disadvantages • the number of clusters is user-defined

  22. Algorithms: K-means

  23. Algorithms: K-means Predefined number of clusters = 5 Initialisation : randomly choose cluster centers (red points)

  24. Algorithms: K-means Attribute each point (gene) to cluster with closest center

  25. Algorithms: K-means Attribute each point (gene) to cluster with closest center

  26. Algorithms: K-means Recalculate cluster centers = mean expression profile of genes in cluster

  27. Algorithms: K-means Repeat the whole process
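
The loop shown on slides 23 to 27 can be written as a small sketch. Assumptions: squared Euclidean distance, initial centers drawn at random from the data points, and a stop as soon as the centers no longer move. The function names, the seed parameter and the toy data are illustrative.

```python
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean_profile(rows):
    """Column-wise mean: the average expression profile of a set of genes."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def kmeans(data, k, iterations=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(data, k)  # initialisation: random cluster centers
    labels = [0] * len(data)
    for _ in range(iterations):
        # Attribute each gene to the cluster with the closest center.
        labels = [min(range(k), key=lambda c: sq_dist(x, centers[c]))
                  for x in data]
        # Recalculate each center as the mean profile of its genes.
        new_centers = []
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            new_centers.append(mean_profile(members) if members else centers[c])
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers

data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = kmeans(data, k=2)
print(labels)
```

Every gene receives a label, including noisy ones, and a different seed can give a different partition on less clear-cut data.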

  28. Algorithms: K-means • Properties • User-defined parameters • number of clusters • number of iterations • Nondeterministic: dependent on the initialisation • Advantages • Easy to understand • Fast • Disadvantages • number of clusters has to be user-specified • outcome is sensitive to the parameters (elaborate parameter fine-tuning is essential) • all genes in the dataset will be clustered: the presence of noisy genes will disturb the average profile and the quality of the cluster of interest

  29. Algorithms: K-means • Sensitivity of K-means to the parameter setting • K-means, nr. of clusters: 10; nr. of iterations: 100 • low number of clusters → big clusters containing noise

  30. Cluster algorithms: analysis • K-means, nr. of clusters: 60; nr. of iterations: 100 • high number of clusters → smaller clusters → higher resolution

  31. literature/knowledge dataset • small clusters • contain genes with highly similar profile (+) • some information given up in first step (-) • big clusters • contain all real positives (+) • increasing number of false positives (-) validate “core” clusters extend clusters Motif finding DNA level

  32. Algorithms: Quality based clustering (figure: normalized expression data from microarrays, axes Gene 1 and Gene 2)

  33. Algorithms: Quality based clustering • http://www.esat.kuleuven.ac.be/~thijs/Work/Clustering.html • Quality = cluster radius computed by fitting a model to the data by EM • Initialisation: • * cluster center = mean expression profile of entire dataset • * cluster radius = radius of hypersphere enclosing entire dataset • * qual = radius of hypersphere / 2 • Recalculate cluster center based on qual

  34. Algorithms: Quality based clustering • Recalculate cluster center • * Find genes with distance < qual from the center • * Recalculate center = mean expression profile of these genes • Recalculate qual • * For every gene calculate distance to new center & plot distribution • * Randomize dataset, calculate distances to new center & plot distribution • * Compare the distributions • * Derive new qual • Iterate until the stop criterion (actual cluster radius < qual) is reached • Genes in the cluster are discarded from the dataset; repeat for the next cluster
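
A deliberately simplified sketch of the "recalculate cluster center" step only. It uses the initialisation of slide 33 (center = dataset mean, qual = half the enclosing radius) and then keeps qual fixed; the statistical re-estimation of qual from the real and randomized distance distributions is not implemented here. The function names and toy data are illustrative and this is not the authors' code.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_profile(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def initial_qual(data):
    """Half the radius of the hypersphere around the dataset mean."""
    center = mean_profile(data)
    return max(euclidean(x, center) for x in data) / 2

def refine_center(data, qual, max_iter=100):
    """Re-center on the genes within qual until the center stops moving."""
    center = mean_profile(data)
    members = []
    for _ in range(max_iter):
        members = [i for i, x in enumerate(data) if euclidean(x, center) < qual]
        if not members:
            break
        new_center = mean_profile([data[i] for i in members])
        if new_center == center:
            break
        center = new_center
    return center, members

# Three similar hypothetical profiles and one outlier.
data = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [3.0, 3.0]]
center, members = refine_center(data, initial_qual(data))
print(members)  # [0, 1, 2]
```

In the full method the genes found this way are removed from the dataset and the search restarts for the next cluster, so the outlier is never forced into a cluster.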

  35. Algorithms: Quality based clustering • Adaptive quality based clustering Initialise cluster center = mean expression profile Find genes with distance < qual from the center

  36. Algorithms: Quality based clustering • Adaptive quality based clustering Recalculate center = mean of blue genes Recalculate qual : actual radius of the cluster > qual

  37. Algorithms: Quality based clustering • Recalculate qual

  38. Algorithms: Quality based clustering • A.Q.B.C.: QP: 95%; min nr genes 15 • determines the number of clusters automatically • determines the number of iterations automatically • determines for each cluster an optimal radius (statistically determines whether clusters should be merged or separated) • 1 important user-defined parameter, i.e. the confidence level (default 95 %): a gene assigned to a cluster has a probability of 95 % or more of belonging to that cluster

  39. Algorithms: Quality based clustering Comparison with K-means K-means, nr. of clusters: 32; nr. of iterations: 100 finding optimal parameter setting requires a lot of parameter finetuning

  40. Comparison with K-means K-means(nr. of clusters = 32) A.Q.B.C. NOG=200 NOG=188 MCB replication & DNA synthesis Common = 159 NOG=153 NOG=118 NOG=87 M14 organisation of the centrosome Common = 42 Common = 44 NOG=147 NOG=18 ECB budding & cell polarity Common = 10

  41. Comparison with K-means NOG=147 • K-means forces every gene into a cluster, so its clusters contain more noise

  42. Comparison with K-means NOG=147 NOG=18 ECB budding & cell polarity Genes with noisy profile rejected from A.Q.B.C. cluster are retained by K-means Small groups of genes with highly similar profiles

  43. INTRODUCTION MICROARRAY ANALYSIS VALIDATION OF THE RESULTS • Statistical validation • Biological validation

  44. literature/knowledge Cluster validation dataset • small clusters • contain genes with highly similar profile (+) • some information given up in first step (-) • big clusters • contain all real positives (+) • increasing number of false positives (-) validate “core” clusters Motif finding DNA level

  45. Comparison with K-means K-means(nr. of clusters = 32) A.Q.B.C. NOG=200 NOG=188 MCB replication & DNA synthesis Common = 159 NOG=153 NOG=118 NOG=87 M14 organisation of the centrosome Common = 42 Common = 44 NOG=147 NOG=18 ECB budding & cell polarity Common = 10

  46. Cluster Validation: Functional Enrichment • Rationale: • Clustering → accession nrs (AC0020D11428) • Manual query of literature/knowledge data (SRS, Medline, GeneCards, MIPS, Gene Ontology): huge task • Text mining and gene ontologies

  47. Cluster Validation: Functional Enrichment • MIPS functional category (http://mips.gsf.de/proj/yeast/CYGD/db/index.html) • METABOLISM (1066 ORFs) • amino acid metabolism (204 ORFs) • amino acid biosynthesis (118 ORFs) • biosynthesis of the cysteine-aromatic group (1 ORF) • biosynthesis of the pyruvate family (alanine, isoleucine, leucine, valine) and D-alanine (1 ORF) • regulation of amino acid metabolism (33 ORFs) • amino acid transport (23 ORFs) • amino acid degradation (catabolism) (35 ORFs) • degradation of amino acids of the glutamate group (1 ORF) • degradation of glutamate (1 ORF) • degradation of amino acids of the cysteine-aromatic group (1 ORF) • degradation of glycine (1 ORF) • other amino acid metabolism activities (5 ORFs) • nitrogen and sulfur metabolism (74 ORFs) • nitrogen and sulfur utilization (38 ORFs) • regulation of nitrogen and sulphur utilization (29 ORFs) • nitrogen and sulfur transport (7 ORFs) • nucleotide metabolism (144 ORFs) • purine ribonucleotide metabolism (45 ORFs) • pyrimidine ribonucleotide metabolism (29 ORFs) • deoxyribonucleotide metabolism (11 ORFs) • metabolism of cyclic and unusual nucleotides (8 ORFs) • regulation of nucleotide metabolism (13 ORFs) • polynucleotide degradation (23 ORFs) • nucleotide transport (14 ORFs)
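
The slides do not say how enrichment of a functional category in a cluster is scored. One common choice, assumed here purely for illustration, is the hypergeometric tail probability: the chance of seeing at least the observed number of category members in a cluster of that size drawn at random from the genome. The function name and the example counts are hypothetical.

```python
from math import comb

def hypergeom_pvalue(total, annotated, cluster_size, hits):
    """P(X >= hits) when cluster_size genes are drawn without replacement
    from total genes, of which annotated belong to the category."""
    upper = min(annotated, cluster_size)
    tail = sum(comb(annotated, k) * comb(total - annotated, cluster_size - k)
               for k in range(hits, upper + 1))
    return tail / comb(total, cluster_size)

# Hypothetical example: 10 genes, 5 in the category, cluster of 2, both hit.
p = hypergeom_pvalue(total=10, annotated=5, cluster_size=2, hits=2)
print(p)
```

A small value suggests that the category is over-represented in the cluster. When many categories are tested, a multiple-testing correction would normally be applied as well.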

  48. Cluster Validation: Motif Detection (flow diagram) • cDNA arrays • Preprocessing of the data • Clustering • Upstream regions • Gibbs sampling • EMBL • BLAST • Motif finding
