
Bioinformatics : Gene Expression Data Analysis

University at Buffalo, The State University of New York. 05.12.03. Bioinformatics: Gene Expression Data Analysis. Aidong Zhang, Professor, Computer Science and Engineering, University at Buffalo.


Presentation Transcript


  1. University at Buffalo The State University of New York 05.12.03 Bioinformatics : Gene Expression Data Analysis Aidong Zhang Professor Computer Science and Engineering University at Buffalo

  2. What is Bioinformatics • Broad Definition • The study of how information technologies are used to solve problems in biology • Narrow Definition • The creation and management of biological databases in support of genomic sequences • Oxford English Dictionary (proposed) • Conceptualizing biology in terms of molecules and applying information techniques to understand and organize the information associated with these molecules, on a large scale

  3. Aims of Bioinformatics • Simplest • Organize data in a way that allows researchers to access information and submit new entries as they are produced • Higher • Develop tools and resources that aid in the analysis of data • Advanced • Use these tools to analyze the data and interpret the results in a biologically meaningful manner

  4. Subjects of Bioinformatics

  5. Figure taken from http://www.oml.gov/hgmis

  6. DNA Microarray Experiments http://www.ipam.ucla.edu/programs/fg2000/fgt_speed7.ppt

  7. Gene Expression Data • Gene Expression Data Matrix • Each row represents a gene Gi; • Each column represents an experimental condition Sj; • Each cell Xij is a real value representing the expression level of gene Gi under condition Sj; • Xij > 0: overexpressed • Xij < 0: underexpressed • A time-series gene expression data matrix typically contains O(10^3) genes and O(10) time points.
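The matrix layout on slide 7 can be illustrated with a minimal sketch. The gene names, condition names, and values below are invented for illustration, and the "unchanged" label for a value of exactly zero is an assumption the slide does not state.

```python
# Hypothetical 3-gene x 4-condition expression matrix (values invented).
# Rows are genes Gi, columns are conditions Sj, cells are Xij.
genes = ["G1", "G2", "G3"]
conditions = ["S1", "S2", "S3", "S4"]
X = [
    [1.2, 0.8, -0.3, -1.1],
    [-0.5, -0.9, 0.4, 1.3],
    [0.0, 0.1, -0.1, 0.0],
]


def expression_state(value):
    """Label a cell by its sign: Xij > 0 is over, Xij < 0 is under."""
    if value > 0:
        return "over"
    if value < 0:
        return "under"
    return "unchanged"  # assumption: the slide does not define Xij == 0


def label_matrix(matrix):
    """Apply expression_state to every cell, keeping the row/column layout."""
    return [[expression_state(v) for v in row] for row in matrix]


labels = label_matrix(X)
for gene, row in zip(genes, labels):
    print(gene, row)
```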

  8. Gene Expression Data (figure: data matrix with genes as rows and samples 1 to 3 as columns, cells X11 … X33) • asymmetric dimensionality • 10 ~ 100 samples/conditions • 1000 ~ 10000 genes • two-way analysis • sample space • gene space

  9. Microarray Data Analysis • Analysis from two angles • sample as object, gene as attribute • gene as object, sample/condition as attribute

  10. Challenges of Gene Data Analysis (1) • Gene space: Automatically identify clusters of genes that express similar patterns in the data set • Robust to a large amount of noise • Effective at handling highly intersected clusters • Able to visualize the clustering results

  11. Co-expressed Genes (figure: gene expression data matrix, gene expression patterns, co-expressed genes) • Why look for co-expressed genes? • Co-expression indicates co-function; • Co-expression also indicates co-regulation.

  12. Challenges of Gene Data Analysis (2) • Sample space: unsupervised sample clustering presents interesting but also very challenging problems • The sample space and gene space are of very different dimensionality (10^1 ~ 10^2 samples versus 10^3 ~ 10^4 genes). • High percentage of irrelevant or redundant genes. • People usually have little knowledge about how to construct an informative gene space.

  13. Sample Clustering • Gene expression data clustering

  14. Microarray Data Analysis (flow diagram with the labels: Microarray Images, Microarray Data, Gene Expression Matrices, Gene Expression Patterns, Sample Clusters, Gene Expression Data Analysis, Visualization, Important patterns)

  15. Our Approaches • Density-based approach: recognizes a dense area as a cluster, and organizes the cluster structure of a data set into a hierarchical tree. • Calculate the density of each data object based on its neighboring data distribution. • Construct the "attraction" relationship between data objects according to object density. • Organize the attraction relationship into the "attraction tree". • Summarize the attraction tree by a hierarchical "density tree". • Derive clusters from the density tree.
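The first three steps on slide 15 can be sketched as follows. This is a simplified illustration, not the authors' algorithm: the density definition (neighbor count within a radius), the rule that each object is attracted to its nearest denser neighbor inside that radius, the index-based tie-break, and all function names are assumptions. The density-tree summarization step is omitted; clusters are read directly off the roots of the attraction tree.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def densities(points, radius):
    """Assumed density: number of other objects within `radius`."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and euclidean(p, q) <= radius)
        for i, p in enumerate(points)
    ]


def attraction_tree(points, radius):
    """parent[i] is the nearest denser neighbor within `radius`, or None for a root."""
    dens = densities(points, radius)
    parent = []
    for i, p in enumerate(points):
        candidates = [
            (euclidean(p, q), j)
            for j, q in enumerate(points)
            # Density ties are broken by index so the relation cannot form a cycle.
            if (dens[j], -j) > (dens[i], -i) and euclidean(p, q) <= radius
        ]
        parent.append(min(candidates)[1] if candidates else None)
    return parent


def clusters_from_tree(parent):
    """Label each object by the root of the attraction tree it belongs to."""
    def root(i):
        while parent[i] is not None:
            i = parent[i]
        return i
    return [root(i) for i in range(len(parent))]


# Two well-separated toy groups, each with a dense center point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
labels = clusters_from_tree(attraction_tree(points, radius=1.0))
print(labels)
```

Each object ends up labeled with the index of the local density maximum it is attracted to, so the two toy groups receive two distinct labels.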

  16. Our Approaches (2) • Interrelated dimensional clustering -- automatically performs two tasks: • detection of meaningful sample patterns • selection of the significant genes that manifest the empirical pattern

  17. Our Approaches (3) (figure: TreeView) • Visualization tool: offers insightful information • Detects the structure of a dataset • Three Aspects • Explorative • Confirmative • Representative • Microarray Analysis Status • Numerical methods are dominant • Visualization serves as graphical presentation for the major clustering methods • Visualization applied • Global visualization (TreeView) • Sammon’s mapping

  18. VizStruct Architecture • Explorative Visualization – Sample space • Confirmative Visualization – Gene space

  19. VizStruct - Dimension Tour • Interactively adjust dimension parameters • Manually or automatically • May cause false clusters to break • Create dynamic visualization

  20. Visualized Results for a Time Series Data Set

  21. Elements of Clustering • Feature Selection. Properly select the features on which clustering is to be performed. • Clustering Algorithm. • Criteria (e.g. objective function) • Proximity Measure (e.g. Euclidean distance, Pearson correlation coefficient) • Cluster Validation. The assessment of clustering results. • Interpretation of the results.
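The two proximity measures named on slide 21 behave quite differently on expression profiles. The sketch below, using invented profiles and hypothetical function names, shows two genes with the same shape but different magnitude: their Euclidean distance is large while their Pearson correlation is 1.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared differences between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Invented profiles: g2 has the same shape as g1 at ten times the magnitude.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [10.0, 20.0, 30.0, 40.0]
distance = euclidean_distance(g1, g2)
correlation = pearson_correlation(g1, g2)
print(distance, correlation)
```

A correlation-based measure groups genes by pattern shape, which is often what is wanted when looking for co-expressed genes, whereas Euclidean distance is also sensitive to absolute expression level.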

  22. Supervised Analysis • Select training samples (hold out…) • Sort genes (t-test, ranking…) • Select informative genes (top 50 ~ 200) • Cluster or classify based on informative genes (figure: genes g1 … g4132 as rows, with binary on/off patterns across the Class 1 and Class 2 samples)
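The "sort genes, then keep the top ones" steps on slide 22 can be sketched as below. The slide only says "t-test"; the choice of Welch's unequal-variance statistic, the function names, and the toy data are assumptions made for illustration.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic (assumes both groups have nonzero spread)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def rank_genes(matrix, labels, top_k):
    """Return the row indices of the top_k genes with the largest |t| between classes 1 and 2."""
    scored = []
    for idx, row in enumerate(matrix):
        class1 = [v for v, lab in zip(row, labels) if lab == 1]
        class2 = [v for v, lab in zip(row, labels) if lab == 2]
        scored.append((abs(welch_t(class1, class2)), idx))
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:top_k]]


# Toy data: genes 0 and 2 separate the classes, gene 1 does not.
matrix = [
    [5.0, 5.2, 4.9, 1.0, 1.1, 0.9],
    [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    [0.1, 0.3, 0.2, 3.0, 3.2, 2.9],
]
labels = [1, 1, 1, 2, 2, 2]
informative = rank_genes(matrix, labels, top_k=2)
print(informative)
```

This ranking needs the class labels of the training samples, which is exactly the information that is unavailable in the unsupervised setting discussed next.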

  23. Unsupervised Analysis • Microarray data analysis methods can be divided into two categories: supervised and unsupervised analysis. • We will focus on unsupervised sample classification, which assumes that no membership information is assigned to any sample. • Since the initial biological identification of sample classes has been slow, typically evolving through years of hypothesis-driven research, automatically discovering sample patterns is a significant contribution to microarray data analysis. • Unsupervised sample classification is much more complex than the supervised kind. Many mature statistical methods such as the t-test, Z-score, and Markov filter cannot be applied unless the phenotypes of the samples are known in advance.

  24. Problem Statement • Given a data matrix M in which the number of samples and the number of genes differ by orders of magnitude (|G| >> |S|), and the number of sample categories K. • The goal is to find K mutually exclusive groups of samples matching their empirical types, thereby discovering the meaningful pattern and the set of genes that manifests it.

  25. Problem Statement (figure: samples 1 to 7 as columns, gene1 to gene8 as rows, split into informative and non-informative genes)

  26. Problem Statement (2) (figure: samples 1 to 10 as columns, gene1 to gene7 as rows, split into informative and non-informative genes)

  27. Problem Statement (3) (figure: two panels showing genea to genef across Class 1, Class 2, and Class 3)

  28. Related Work • New tools using traditional methods: • SOM • K-means • Hierarchical clustering • Graph-based clustering • PCA • Their similarity measures, computed on the full gene space, are disturbed by the high percentage of noise.

  29. Related Work (2) • Clustering with feature selection: (CLIFF, leaf ordering, two-way ordering) • Filtering the invariant genes • Bayes model • Rank variance • PCA • Partition the samples • Ncut • Min-Max Cut • Pruning genes based on the partition • Markov blanket filter • T-test • Leaf ordering

  30. Related Work (3) • Subspace clustering : • Bi-clustering • δ-clustering

  31. Intra-pattern-steadiness • We require each gene to be either all “on” or all “off” within each sample class. • Variance of a single gene: • Average row variance:
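The formulas for slide 31 did not survive in the transcript. One plausible way to write the two quantities, assuming the unbiased sample variance and using the hypothetical names Var and Con, a gene subset G' and a sample class S_k, is:

```latex
% Assumed form, not recovered from the slide.
% \bar{X}_{i,S_k} is the mean of gene i over the samples in class S_k.
\[
\mathrm{Var}(i, S_k) = \frac{1}{|S_k| - 1} \sum_{j \in S_k} \left( X_{ij} - \bar{X}_{i,S_k} \right)^2 ,
\qquad
\bar{X}_{i,S_k} = \frac{1}{|S_k|} \sum_{j \in S_k} X_{ij}
\]
\[
\mathrm{Con}(G', S_k) = \frac{1}{|G'|} \sum_{i \in G'} \mathrm{Var}(i, S_k)
\]
```

Under this reading, a small average row variance means the selected genes are steadily "on" or "off" within the class.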

  32. Intra-pattern-consistency(2)

  33. Inter-pattern-divergence • In our model, both “intra-pattern-steadiness” and “inter-pattern-dissimilarity” on the same gene are reflected. • Average block distance:
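The formula for the average block distance is also missing from the transcript. A plausible form, assuming the distance between two sample classes S_a and S_b is the mean absolute difference of per-gene class means over a gene subset G' (the name Div is hypothetical), is:

```latex
% Assumed form, not recovered from the slide.
% \bar{X}_{i,S} is the mean of gene i over the samples in class S.
\[
\mathrm{Div}(G', S_a, S_b) = \frac{1}{|G'|} \sum_{i \in G'} \left| \bar{X}_{i,S_a} - \bar{X}_{i,S_b} \right|
\]
```

A large value indicates that the selected genes take clearly different levels in the two classes.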

  34. Pattern Quality • The purpose of pattern discovery is to identify the empirical pattern where the patterns inside each class are steady and the divergence between each pair of classes is large.

  35. Pattern Quality (2)

  36. The Problem • Input • m samples, each measured on n genes • the number of sample categories K • Output • A K-partition of the samples (empirical pattern) and a subset of genes (informative space) such that the pattern quality of the partition projected onto the gene subset is the highest.

  37. Strategy • Start with a random K-partition of samples and a subset of genes as the candidate informative space. • Iteratively adjust the partition and the gene set toward the optimal solution. • Basic elements: • A state: • A partition of samples {S1, S2, …, Sk} • A set of genes G’ ⊆ G • The corresponding pattern quality • An adjustment: • For a gene not in G’, insert it into G’ • For a gene in G’, remove it from G’ • For a sample in group S’, move it to another group

  38. Strategy (2) • Iteratively adjust the partition and the gene set toward the optimal pattern. • for each gene, try possible insert/remove • for each sample, try best movement.
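The iterative adjustment on slides 37 and 38 can be sketched as a greedy local search. The pattern-quality formula is not preserved in the transcript, so the `quality` function below is an assumption: it sums, over every pair of sample groups, the mean absolute difference of per-gene group means divided by the summed average within-group variances. All names and the toy data are hypothetical, and only improving adjustments are accepted.

```python
from itertools import combinations
from statistics import mean, variance

EPS = 1e-9  # keeps the ratio finite when a block has zero variance


def quality(matrix, genes, groups):
    """Assumed quality: sum over group pairs of divergence / summed variance."""
    if not genes or any(len(g) < 2 for g in groups):
        return 0.0
    rows = [matrix[i] for i in sorted(genes)]

    def avg_var(group):
        return mean([variance([row[j] for j in group]) for row in rows])

    def divergence(a, b):
        return mean([abs(mean([row[j] for j in a]) - mean([row[j] for j in b]))
                     for row in rows])

    return sum(divergence(a, b) / (avg_var(a) + avg_var(b) + EPS)
               for a, b in combinations(groups, 2))


def improve(matrix, genes, groups, max_rounds=20):
    """Greedy search: accept an adjustment only if it raises the quality."""
    genes = set(genes)
    groups = [list(g) for g in groups]
    best = quality(matrix, genes, groups)
    for _ in range(max_rounds):
        changed = False
        # Gene adjustments: insert a gene that is absent, remove one that is present.
        for i in range(len(matrix)):
            trial = genes ^ {i}
            q = quality(matrix, trial, groups)
            if q > best:
                genes, best, changed = trial, q, True
        # Sample adjustments: move each sample to the group that helps most.
        for s in range(len(matrix[0])):
            src = next(k for k, g in enumerate(groups) if s in g)
            best_move = None
            for dst in range(len(groups)):
                if dst == src:
                    continue
                trial = [[x for x in g if x != s] for g in groups]
                trial[dst].append(s)
                q = quality(matrix, genes, trial)
                if q > best:
                    best, best_move = q, trial
            if best_move is not None:
                groups, changed = best_move, True
        if not changed:
            break
    return genes, groups, best


# Toy data: genes 0 and 1 follow an on/off pattern, genes 2 and 3 are noise.
matrix = [
    [5.0, 5.2, 4.8, 0.1, 0.0, 0.2],
    [0.1, 0.0, 0.3, 4.0, 4.2, 3.9],
    [1.0, 3.0, 2.0, 3.0, 1.0, 2.0],
    [2.0, 0.5, 3.5, 1.0, 3.0, 2.5],
]
start_genes = {0, 2}
start_groups = [[0, 1, 3], [2, 4, 5]]
initial = quality(matrix, start_genes, start_groups)
genes, groups, final = improve(matrix, start_genes, start_groups)
print(sorted(genes), groups, initial, final)
```

Because every accepted adjustment strictly increases the quality, the search cannot cycle, but it can stop at a local optimum, which is the motivation for the randomized refinements on the next slide.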

  39. Improvement • Data Standardization • the original gene intensity values are converted to relative values • Random order • Conduct a negative action with a probability • Simulated annealing
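Two of the refinements on slide 39 can be sketched in a few lines. The standardization formula is missing from the transcript, so the per-gene z-score below is an assumption, as is the exponential acceptance rule for "negative actions" (adjustments that lower the quality); the function names are hypothetical.

```python
import math
import random
from statistics import mean, pstdev


def standardize_row(row):
    """Assumed standardization: subtract the gene's mean, divide by its standard deviation."""
    mu = mean(row)
    sigma = pstdev(row)
    if sigma == 0:
        return [0.0 for _ in row]  # a constant gene carries no pattern
    return [(v - mu) / sigma for v in row]


def accept(delta, temperature, rng=random):
    """Always accept an improvement; accept a quality loss with probability exp(delta / T)."""
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)


z = standardize_row([1.0, 2.0, 3.0])
print(z)
print(accept(0.5, temperature=1.0))
```

With this rule, lowering `temperature` over the iterations makes negative actions progressively rarer, which is the usual simulated-annealing schedule.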

  40. Experimental Results • Data Sets: • Multiple-sclerosis data • MS-IFN : 4132 * 28 (14 MS vs. 14 IFN) • MS-CON : 4132 * 30 (15 MS vs. 15 Control) • Leukemia data • 7129 * 38 (27 ALL vs. 11 AML) • 7129 * 34 (20 ALL vs. 14 AML) • Colon Cancer data • 2000 * 62 (22 normal vs. 40 tumor colon tissue) • Hereditary breast cancer data • 3226 * 22 ( 7 BRCA1, 8 BRCA2, 7 Sporadics)

  41. Experimental Results (2)

  42. Interrelated Dimensional Clustering • The approach is applied to classifying multiple-sclerosis patients and IFN-drug-treated patients. • (A) Shows the distribution of the original 28 samples. Each point represents a sample, mapped from the intensity vector of the sample's 4132 genes. • (B) Shows the distribution of the 28 samples on 2015 genes. • (C) Shows the distribution of the 28 samples on 312 genes. • (D) Shows the distribution of the same 28 samples after applying our approach, which reduces the 4132 genes to 96 genes.

  43. Experimental Results (3)

  44. Experimental Results (4)

  45. Applications • Gene Function • Co-expressed genes in the same cluster tend to share common roles in cellular processes, and genes of unrelated sequence but similar function cluster tightly together. • A similar tendency was observed in both yeast data and human data. • Gene Regulation • By searching for common DNA sequences in the promoter regions of genes within the same cluster, regulatory motifs specific to each gene cluster are identified. • Cancer Prediction • Normal vs. Tumor Tissue Classification • Drug Treatment Evaluation • …

  46. Summary • We have developed advanced approaches for gene expression data analysis which work more effectively than traditional analysis approaches • This research area is exciting and challenging. There are a lot of interesting research issues.
