
Report on ISMB 2003 (Brisbane, Australia)

This report provides highlights from the ISMB 2003 conference in Brisbane, Australia, including keynote speaker presentations and session topics. Topics covered include phylogeny and genome rearrangements, expression arrays and networks, predicting clinical outcomes, protein clustering and alignment, transcription motifs and modules, structure and HMMs, and more.


Presentation Transcript


  1. Report on ISMB 2003 (Brisbane, Australia) G. Grant and E. Manduchi CBIL Lab meeting July 24, 2003

  2. Keynote Speakers • David Haussler • John Mattick • Ron Shamir • William J. Kent (Overton prize) • David Sankoff • Michael Waterman • Yoshihide Hayashizaki • Sydney Brenner

  3. Sessions • Phylogeny and Genome Rearrangements • Expression Arrays and Networks • Predicting Clinical Outcomes • Protein Clustering, Alignment and Patterns • Transcription Motifs and Modules • Structure and HMMs • Misc. Short Papers

  4. Poster Categories • Data Mining • Data Visualization • Databases • Functional Genomics • Genome Annotation • Microarrays • New Frontiers • Phylogeny and Evolution • Predictive Methods • Sequence Comparison • Structural Biology • Systems Biology

  5. Highlights from Mattick’s talk • The relative amount of non-coding RNA (i.e. the non-coding/coding ratio) scales with organismal complexity • Phenotypic variation in eukaryotes is largely associated not with the proteome but with patterns of expression (proteins = the components, patterns of expression = their assembly) • Many ncRNAs are differentially expressed in different cells and tissues (RIKEN)

  6. Highlights from Shamir’s talk • CLICK algorithm for clustering and EXPANDER (contains CLICK and other clustering algorithms) • SAMBA algorithm for biclustering • PRIMA algorithm for finding transcription modules in the human cell cycle (combining promoter analysis with gene expression data)

  7. William J. Kent The Overton Prize

  8. Highlights from Kent’s talk • Discussed the process that led from the assembly of the human genome to its annotation using mapping of mRNAs and ESTs via BLAT • Genome Browser, which now displays annotations from a dozen different groups on 3 mammalian genomes • Current research focuses on exploiting comparative genomics and whole genome microarray data

  9. Highlights from Hayashizaki’s talk • RIKEN mouse genome encyclopedia: comprehensive mouse full-length cDNA collection and sequence db • FANTOM (higher level annotation): homology search based, expression data profiles, protein-protein db • Almost 2 million clones, prepared from 267 tissues, clustered into 171,144 groups; 60,770 representative clones were fully sequenced

  10. 33,409 unique sequences, with more than 18,415 clear protein-encoding genes, of which 4,258 are new • 11,665 new non-coding messages • 41% of transcripts alternatively spliced • cDNA microarray system to print all of these cDNA clones has been developed, as well as protein-protein and protein-DNA interaction screening systems http://genome.gsc.riken.go.jp/

  11. 1. Glocal alignment: finding rearrangements during alignment M. Brudno, S. Malde, A. Poliakov, C.B. Do, O. Couronne, I. Dubchack, S. Batzoglou (Stanford Univ., LBNL) • Comparison of entire genomes requires alignment methods that are: • Efficient • Accurate • In particular, rearrangements need to be taken into account: • Inversions • Translocations • Duplications • Their combinations

  12. Explore local rearrangements: between 100 and 100,000 bp • Glocal alignment: a combination of global and local methods to transform one sequence into another, allowing for rearrangement events • Shuffle-LAGAN algorithm: a combination of LAGAN (global aligner) and CHAOS (local aligner): http://lagan.stanford.edu

  13. 3 stages: • Find local alignments (on both strands) using CHAOS • Pick the maximal-scoring subset of the local alignments under certain gap penalties to form a 1-monotonic conservation map (non-decreasing in only 1 sequence) • Chain them into maximal consistent subsegments, which are aligned using LAGAN
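
A minimal sketch of stage 2 (picking a maximal-scoring 1-monotonic chain), assuming local alignments come as (start1, end1, start2, end2, score) tuples; the function name and the simple linear gap penalty are illustrative, not Shuffle-LAGAN's actual scoring scheme.

```python
# Sketch: pick a maximal-scoring 1-monotonic chain of local alignments.
# "1-monotonic" = coordinates must be non-decreasing in sequence 1 only,
# so the chain may jump around in sequence 2 (inversions/translocations).
def chain_1_monotonic(hits, gap_cost=0.01):
    # hits: list of (s1, e1, s2, e2, score) local alignments
    hits = sorted(hits, key=lambda h: h[0])          # sort by start in seq 1
    best = [h[4] for h in hits]                      # best chain score ending at i
    prev = [-1] * len(hits)
    for i, (s1, e1, s2, e2, sc) in enumerate(hits):
        for j in range(i):
            if hits[j][1] <= s1:                     # monotone in sequence 1 only
                cand = best[j] + sc - gap_cost * (s1 - hits[j][1])
                if cand > best[i]:
                    best[i], prev[i] = cand, j
    # trace back the best chain
    i = max(range(len(hits)), key=lambda k: best[k])
    chain = []
    while i != -1:
        chain.append(hits[i])
        i = prev[i]
    return list(reversed(chain)), max(best)

# Example: two co-linear hits plus one that is inverted in sequence 2.
hits = [(0, 100, 0, 100, 50), (150, 250, 500, 400, 40), (300, 400, 120, 220, 45)]
print(chain_1_monotonic(hits))
```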

  14. Testing results of aligning human and mouse genomes • Split mouse genome into 250 Kbp contigs • Find potential human orthologs using BLAT • Extend the human sequence around the BLAT anchor • Align to the mouse contig using the tested aligner

  15. Results • SLAGAN not quite as sensitive as BLASTZ on the whole genome scale, but higher specificity. SLAGAN more sensitive than LAGAN and slightly more specific. • Sensitivity: % bp in the alignment that meet a particular scoring threshold • Specificity: coverage of human chromosome 20 (very little difference should be seen between its coverage by the whole mouse genome and that by mouse chromosome 2) • Results suggest that as much as 2% of the gene coding regions in the human genome may have evolved by local translocation or duplication since the human/mouse divergence.

  16. See paper for table summarizing different rearrangement events, their proportion in the human genome and the level to which they are conserved. • Duplications as a whole score lower per bp than sequences that have undergone other rearrangements (maybe more freedom to mutate to evolve new function) • Of the non-duplicated sequences, simple inversions tend to be the shortest and to be more conserved.

  17. 2. Combining multiple microarray studies and modeling interstudy variation J.K. Choi, U. Yu, S. Kim, O.J. Yoo (Korea) • Method to systematically integrate multiple microarray datasets • Goal: draw a consensus among datasets, taking into account interstudy variation, gaining power from the increased sample size • Application to two different sets of cancer profiling studies, designed to compare tumor and non-tumor tissues: • Liver Cancer (LC): 4 independent cDNA studies • Prostate Cancer (PC): 2 cDNA and 2 oligo-based studies

  18. Utilize “effect size”, defined as a standardized index measuring the magnitude of a treatment or covariate effect • For differential expression, the effect size for a gene is taken to be the standardized mean difference • For each gene, model its effect size in each study i: yi = θi + εi, θi = μ + δi, where i denotes the study (θi = study-specific effect, μ = overall effect, δi = between-study deviation, εi = within-study error) • Fixed or Random Effect Model, to be determined on the basis of a homogeneity test: e.g. FEM for LC, REM for PC
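
A sketch of the per-study effect size and its fixed-effect combination for a single gene, using standard meta-analysis formulas (Cohen's d with its approximate variance, inverse-variance weighting, Cochran's Q for the homogeneity test); function names are illustrative and the paper's exact estimators may differ.

```python
# Per-study standardized mean difference and fixed-effect combination (sketch).
import numpy as np
from scipy import stats

def study_effect(tumor, normal):
    """Standardized mean difference (Cohen's d) and its approximate variance."""
    n1, n2 = len(tumor), len(normal)
    sp = np.sqrt(((n1 - 1) * np.var(tumor, ddof=1) +
                  (n2 - 1) * np.var(normal, ddof=1)) / (n1 + n2 - 2))
    d = (np.mean(tumor) - np.mean(normal)) / sp
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    return d, var_d

def combine_fixed_effect(effects, variances):
    """Inverse-variance weighted estimate of mu, its z-statistic, and Cochran's Q."""
    w = 1.0 / np.asarray(variances)
    y = np.asarray(effects)
    mu = np.sum(w * y) / np.sum(w)
    z = mu / np.sqrt(1.0 / np.sum(w))
    Q = np.sum(w * (y - mu) ** 2)                  # homogeneity statistic
    p_hom = 1 - stats.chi2.cdf(Q, df=len(y) - 1)   # small p -> prefer random effects
    return mu, z, Q, p_hom
```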

  19. After estimating μ, a z-statistic is computed from it and assigned to that gene; this is compared to a given threshold zth • Assess its significance via permutation and FDR multiple-testing correction (they use Benjamini and Hochberg) • Compute the Integration-Driven Discoveries (IDDs): genes identified as differentially expressed in the multiple-study analysis, but not as such by any individual study. IDDs occur when combining “small but consistent” effect sizes
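
A minimal sketch of the Benjamini-Hochberg step, applied to per-gene p-values (e.g. obtained from the permutation null of the z-statistics); the function name is illustrative.

```python
# Benjamini-Hochberg FDR control over a vector of per-gene p-values (sketch).
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of p-values declared significant at FDR level alpha."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()        # largest i with p_(i) <= alpha*i/m
        significant[order[: k + 1]] = True
    return significant
```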

  20. Results • Took significant genes from the PC datasets and did a KEGG pathway query: • Non-IDD genes mapped into 70 pathways • IDD genes mapped into 51 pathways, in 70% of which they appeared together with at least one non-IDD gene • For the LC results, a paper is in the works

  21. 3. Discovering molecular pathways from protein interaction and gene expression data E. Segal, H. Wang, D. Koller (CS Dept., Stanford Univ.) • Approach for identification of “pathways” integrating gene expression and protein interaction data • Assumptions on pathway properties: • Their genes have similar expression profiles • The protein products of the genes often interact • Steps: • Build a probabilistic model • Learn the model • Applications and validation

  22. Probabilistic Model • Framework of relational Markov networks: • Set of genes G={g1, g2, …, gn} • Assume that each gene g belongs to exactly one of k pathways, denoted g.C • The variables gi.C are hidden, and one of the goals of the algorithm is to determine their values • Prob. Model has 2 components: gene expression model, protein interaction model • These components are combined into a unified model

  23. Learning the Model • Given a dataset D of gene expression profiles and a set of binary interactions between pairs of genes, want to learn the model parameters together with the hidden variables • Use EM approach: • start with initial guess of the parameters • compute the probability distribution of the hidden variables • maximize the likelihood of the data with respect to the expected sufficient statistics • continue till convergence
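
The EM loop above can be sketched generically. The actual model couples an expression component and an interaction component in a relational Markov network, so the per-gene scoring function (`log_lik`) and the parameter update (`update_params`) below are hypothetical placeholders, not the paper's implementation.

```python
# Generic EM skeleton: E-step over hidden pathway assignments, M-step on parameters.
import numpy as np
from scipy.special import logsumexp

def em_cluster(genes, k, log_lik, update_params, init_params, n_iter=50, tol=1e-4):
    params = init_params(genes, k)
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: posterior over the hidden assignment g.C of each gene
        ll = np.array([[log_lik(g, c, params) for c in range(k)] for g in genes])
        post = np.exp(ll - logsumexp(ll, axis=1, keepdims=True))
        # M-step: re-estimate parameters from the expected assignments
        params = update_params(genes, post, params)
        total = logsumexp(ll, axis=1).sum()        # data log-likelihood
        if total - prev < tol:                     # stop at convergence
            break
        prev = total
    return post, params
```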

  24. Applications and Validation • Used 2 S. cerevisiae gene expression datasets and the DIP dataset • Applied method to each of the expression datasets separately • Evaluated the model (using GeneXPress) using various criteria: • Prediction of held-out interactions • Coherence of pathways according to functional annotations • Coverage of protein complexes

  25. 4. Genome-wide discovery of transcriptional modules from DNA sequence and gene expression E. Segal, R. Yelensky, D. Koller (CS Dept., Stanford Univ.) • Approach for identification of “transcriptional modules” (sets of genes, co-regulated in a set of experiments through a common motif profile) integrating gene expression and DNA sequence data • Assumptions: • Transcriptional elements should explain the observed expression patterns as much as possible • Genes are partitioned into modules, which determine their expression profile • Each module is characterized by a motif profile • Steps: • Build a probabilistic model • Learn the model • Applications and validation

  26. Probabilistic Model • Framework of relational Markov networks: • Set of genes G={g1, g2, …, gn} • Assume that each gene g is associated with exactly one of k modules, denoted g.M • Each module has a motif profile, specifying the extent to which motif Ri plays a role in that module • The variables gi.M and gi.R are hidden, and one of the goals of the algorithm is to determine their values • Prob. Model has 3 components: gene expression model, motif model, regulation model • These components are combined into a unified model

  27. Learning the Model • Given a dataset D of gene expression profiles and DNA sequences in the upstream region of the TSS for each gene, want to learn the model parameters together with the hidden variables • Use EM approach: • start with initial guess of the parameters • compute the probability distribution of the hidden variables • maximize the likelihood of the data with respect to the expected sufficient statistics • continue till convergence • during this process sequence motifs are dynamically added or removed

  28. Applications and Validation • Used 2 S. cerevisiae gene expression datasets and the SGD database • Applied method to each of the expression datasets separately • Evaluated the model using various criteria: • Predicting expression from sequence • Gene expression coherence • Coherence of pathways according to functional annotations • Coverage of protein complexes • Comparison with binding localization data from the literature

  29. 5. GENIA corpus - a semantically annotated corpus for bio-text mining J.-D. Kim, T. Ohta, Y. Tateisi, J. Tsujii (CREST and Univ. of Tokyo) • Lack of extensively annotated corpora is a major bottleneck for applying NLP techniques to bioinformatics • Latest release (v. 3.0) of GENIA (www-tsujii.is.s.u-tokyo.ac.jp/GENIA): • 2000 abstracts from MEDLINE (limited to human, blood cell, transcription factor) • 400,000 words and 100,000 hand-coded annotations

  30. Articles encoded in an XML-based scheme: Medline ID, title, abstract (whose text is segmented into sentences) • Titles and abstracts marked up for biologically meaningful terms by two domain experts • These terms have been annotated with descriptors from the GENIA ontology (biological source, biological substance, other) • Term structure: <term> := <qualifier>* <head noun>

  31. Annotation is complicated when there are coordinate clauses in which not all terms are fully spelled out in the surface text, e.g. “CD2 and CD25 receptors” • If simple annotation is desired, a software tool removes the higher-level annotation

  32. How to turn misfortune into fortune Elisabetta Manduchi • Wait until the airlines lose your luggage • Exaggerate your woes • Make them pay as much as possible • Total take: $400 AUD (about $350 US)

  33. 7. Gene Structure-based splice variant deconvolution using a microarray platform H. Wang, E. Hubbell, J. Hu, G. Mei, M. Cline, G. Lu, T. Clark, M.A. Siani-Rose, M. Ares, D. Kulp, D. Haussler (Affymetrix) • Current microarray methods generally ignore splice variants and measure changes at the gene level. • Profiling at the splice-form level is important for more accurate and biologically meaningful results. • Distinction: • Some microarray studies have aimed at finding splice variants. • The purpose here is not to find new splice variants, but to measure known ones accurately.

  34. Methods • Use splice variant specific features: • exon • partial exon (in the case of partial overlap) • exon-exon junction • intron (?) • Probe selection • Junction probes • exon probes • Overlapping exons • Cassette exons

  35. Matrix equation relating observed probe intensities to unobserved transcript concentrations.

  36. Model Fitting • Minimize ||Y − AFGT||^2 (more or less) over A and T • T is what we are really interested in. • Introduce a few more natural constraints, in order to be able to solve. • Use iterative maximum likelihood estimation approach.
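
A rough sketch of the alternating fitting idea, under the simplifying assumptions that the structural matrices F and G are collapsed into a single known probe-to-transcript indicator matrix S and that A reduces to per-probe affinities; the actual Affymetrix model and its constraints are more elaborate, and all names here are illustrative.

```python
# Alternating least squares for Y ~ diag(a) * S * T (sketch).
import numpy as np

def fit_splice_variants(Y, S, n_iter=100):
    """Y: probes x samples intensities, S: probes x transcripts 0/1 structure matrix."""
    n_probes, n_transcripts = S.shape
    a = np.ones(n_probes)                       # probe affinities (diagonal of A)
    T = np.ones((n_transcripts, Y.shape[1]))    # transcript concentrations
    for _ in range(n_iter):
        # update concentrations T given affinities a (clipped least squares)
        M = a[:, None] * S
        T, *_ = np.linalg.lstsq(M, Y, rcond=None)
        T = np.clip(T, 0, None)
        # update affinities a given T, one probe at a time
        P = S @ T                               # predicted unscaled probe signal
        denom = np.sum(P * P, axis=1) + 1e-12
        a = np.clip(np.sum(Y * P, axis=1) / denom, 1e-6, None)
    return a, T
```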

  37. Results • Two splice forms of CD44 (three actually). • Ten sample titration study: • One transcript becoming more dilute as the other becomes less dilute, with constant total concentration. • Repeated for total concentrations of 64 pM, 16 pM, and 4 pM. • Also did something similar for three splice variants of CD44.

  38. Predicted relative concentration in two-variant titration experiments

  39. 8. Deriving phylogenetic trees from the similarity analysis of metabolic pathways M. Heymans, A.K. Singh (Univ. CA, Santa Barbara) • Measure the evolution of complete processes, not just individual elements. • Comparing higher-level functional components between species might give a better understanding of evolutionary relationships. • Pathways: metabolites, gene regulation, protein-protein interactions… • Similarity comparison of pathways: • Map pathway to graph structure • Define distance measure between graphs • Apply Phylip distance method.

  40. Obtain pathways with similar function across species (focus on glycolysis, citric acid cycle, carbohydrate and lipid metabolism) from the KEGG db • Represent pathways as enzyme graphs, where edges represent enzyme-enzyme relationships, • i.e. one of the substrates or products of one enzyme is the same as one of the substrates or products of the other enzyme • Pairwise comparison of pathways combining structural similarity with similarity between enzymes • Get a similarity matrix • Construct a tree from this matrix • Assess the quality

  41. Graph comparison: four steps • Compute similarity between all pairs of nodes. • This might include sequence similarity, structure similarity, similarity of EC number. • Bipartite graph matching to find the best correspondence between graph structures. • Use matched graphs to recompute node similarities taking graph structure and matching edges into account. • Combine structural similarity with node similarity into a score. • For QC, compare to benchmark trees: • NCBI taxonomy • 16S rRNA tree
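
A small sketch of the bipartite matching step (finding the best one-to-one correspondence between the enzymes of two pathway graphs from a node-similarity matrix), using the Hungarian algorithm from SciPy as a stand-in for the paper's matching procedure; names are illustrative.

```python
# Best one-to-one enzyme correspondence from a node-similarity matrix (sketch).
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_enzymes(similarity):
    """similarity[i, j]: similarity of enzyme i (pathway A) to enzyme j (pathway B)."""
    rows, cols = linear_sum_assignment(-similarity)   # negate to maximize similarity
    total = similarity[rows, cols].sum()
    return list(zip(rows, cols)), total

sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.8, 0.4],
                [0.1, 0.3, 0.7]])
print(match_enzymes(sim))
```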

  42. Examples • Glycolysis pathway: two trees, 72 organisms and 48 organisms • Carbohydrate pathway: two trees, 16 organisms and 8 organisms.

  43. 9. Using HMMs to analyze gene expression time course data A. Schliep, A. Schonhuth, C. Steinhoff (Max Planck Institute and Univ. of Cologne) • Experimental setting: • Time course data: gene expression measured over time • cell cycle • response to external factors: changes in environment, nutrition, etc. • Goal: • identify interesting groups of similarly expressed time courses • Two types of methods, depending on whether they assume horizontal “dependencies” • Independent: • k-means, hierarchical, SOM • Depend on a distance measure between profiles • Permute the time points, get the same clusters • Dependent: • Model-based methods: define models, maximize likelihoods of membership • Can model qualitative behavior and cyclic behavior • Can incorporate prior knowledge into the models • More robust

  44. Methods: • Each cluster modeled by an HMM (a statistical model for a time course), e.g. a prototype HMM for down-regulation • Start with a heterogeneous set of seed HMMs • Assign profiles to HMMs • Use the profiles assigned to the HMMs to re-estimate the HMM parameters, giving a new set of HMMs • Iterate • If at any step an HMM has too few members, eliminate it; if too many, split it into two (in this way the number of clusters is not fixed, but learned from the data) • Add one “noise” HMM for a noise cluster (generates all possible profiles with uniform probability). This picks up the noise and makes for cleaner clusters. • Can also do some supervision…
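
A sketch of the assign/re-estimate loop, using hmmlearn's GaussianHMM as a stand-in for the per-cluster time-course HMMs; the noise model and the split heuristic described above are omitted, and all names and parameters are illustrative.

```python
# Iterative HMM-based clustering of expression time courses (sketch).
import numpy as np
from hmmlearn.hmm import GaussianHMM

def hmm_cluster(profiles, n_clusters=5, n_states=3, n_iter=10, min_members=3):
    """profiles: list of (time_points, 1) arrays of expression values."""
    rng = np.random.default_rng(0)
    labels = rng.integers(n_clusters, size=len(profiles))   # random seed assignment
    models = []
    for _ in range(n_iter):
        models = []
        for k in range(n_clusters):
            member = [p for p, l in zip(profiles, labels) if l == k]
            if len(member) < min_members:                    # too few members: drop
                models.append(None)
                continue
            X = np.vstack(member)
            m = GaussianHMM(n_components=n_states, n_iter=20)
            m.fit(X, lengths=[len(p) for p in member])
            models.append(m)
        # re-assign each profile to the model under which it is most likely
        for i, p in enumerate(profiles):
            scores = [m.score(p) if m is not None else -np.inf for m in models]
            labels[i] = int(np.argmax(scores))
    return labels, models
```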

  45. GQL interface • This statistical approach is behind the Graphical Query Language (GQL) interface • GQL clustering: • Input: data and collection of prototype HMMs • Use model-based clustering algorithm • Can set model by hand and pull back all profiles satisfying a user-defined likelihood cutoff.

  46. Cluster decomposition • Once a cluster is determined, for each member use the Viterbi algorithm to compute the most probable path through the model. • Sort profiles according to their labels, e.g. by time and duration of a designated state.
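
For reference, a compact log-space Viterbi decoder implementing the standard dynamic-programming recursion; it is independent of any particular HMM library, and the argument layout is an assumption of this sketch.

```python
# Viterbi decoding in log space: most probable state path given per-step scores.
import numpy as np

def viterbi(log_start, log_trans, log_emit):
    """log_start: (S,), log_trans: (S, S), log_emit: (T, S) per-observation log-likelihoods."""
    T, S = log_emit.shape
    delta = np.empty((T, S))
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_start + log_emit[0]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + log_trans          # (from_state, to_state)
        back[t] = np.argmax(cand, axis=0)                 # best predecessor per state
        delta[t] = cand[back[t], np.arange(S)] + log_emit[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):                        # trace back
        path[t] = back[t + 1, path[t + 1]]
    return path, float(delta[-1, path[-1]])
```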
