Explore automated ways to identify common themes among gene lists from literature to aid biologists in understanding gene pathways. Utilizes comparative text mining and hypothesis testing approaches to extract insights from gene-associated articles.
Annotating Gene List From Literature • Xin He • Department of Computer Science, UIUC
Motivation • Biologists often need to understand what a list of genes has in common (e.g. whether they are involved in the same pathway). • These gene lists typically come from clustering microarray expression data. • Given a list of gene names, is there an automatic way to find the common themes in the literature?
Related Work • The most popular approach is based on the analysis of GO terms associated with genes. • Method: each gene is associated with a set of GO terms. Find the GO terms that are overrepresented in the input list. • Hypergeometric test: the p-value of a GO term is the probability of seeing at least k annotated genes in the list by chance, $p = \sum_{i=k}^{\min(n,M)} \binom{M}{i}\binom{N-M}{n-i} / \binom{N}{n}$, where N: total number of genes; M: total number of genes annotated with this term; n: number of genes in the list; k: number of genes in the list annotated with this term
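The hypergeometric test on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual upper-tail form of the test (the probability of at least k annotated genes in the list); the function name `hypergeom_pvalue` and the numbers in the usage line are hypothetical, not taken from the talk.

```python
from math import comb


def hypergeom_pvalue(N: int, M: int, n: int, k: int) -> float:
    """Upper-tail hypergeometric p-value P(X >= k).

    N: total number of genes
    M: total number of genes annotated with the GO term
    n: number of genes in the input list
    k: number of genes in the list annotated with the term
    """
    if not (0 <= M <= N and 0 <= n <= N and 0 <= k <= min(n, M)):
        raise ValueError("inconsistent counts")
    # Sum the probability of drawing exactly i annotated genes, for i = k .. min(n, M).
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, min(n, M) + 1))
    return tail / comb(N, n)


# Hypothetical usage: 6000 genes, 40 annotated with the term,
# a list of 6 genes of which 4 carry the annotation.
p = hypergeom_pvalue(N=6000, M=40, n=6, k=4)
```

A small p-value means the term appears in the list more often than random sampling of n genes would explain. In practice one p-value is computed per GO term, so some multiple-testing correction is usually needed.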
Problems with GO-based Approach • GO cannot cover all the important concepts in the literature. E.g. GO has relatively low coverage of behavior terms (compared with a specialized behavior ontology). • The associations between genes and concepts change very rapidly. E.g. new functions of known genes are constantly being found.
Text-based Gene List Annotation • Hypothesis testing approach: • find terms that are overrepresented for each gene: Poisson distribution • find common terms across the gene list: hypergeometric distribution • Comparative text mining approach: find the common themes in multiple collections (one for each gene)
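The two-step hypothesis testing approach could look roughly like the sketch below. It is an illustration under stated assumptions, not the method as implemented in the talk: the per-gene test compares a term's count in that gene's articles with a Poisson expectation derived from an assumed background rate, and the list-level test reuses the hypergeometric upper tail over genes. All function names, thresholds, and data layouts here are hypothetical.

```python
from math import comb, exp


def poisson_upper_tail(count: int, lam: float) -> float:
    """P(X >= count) for X ~ Poisson(lam)."""
    if count <= 0:
        return 1.0
    term = exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(count):  # accumulate P(X = 0) .. P(X = count - 1)
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)


def hypergeom_upper_tail(N: int, M: int, n: int, k: int) -> float:
    """P(X >= k) when drawing n genes from N, of which M carry the term."""
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, min(n, M) + 1))
    return tail / comb(N, n)


def associated_terms(term_counts, total_tokens, background_rate, alpha=0.01):
    """Step 1: terms overrepresented in one gene's articles.

    term_counts: term -> count in the gene's articles
    total_tokens: number of tokens in the gene's articles
    background_rate: term -> per-token rate in a background corpus (assumed given)
    """
    return {
        t for t, c in term_counts.items()
        if poisson_upper_tail(c, background_rate[t] * total_tokens) < alpha
    }


def common_terms(gene_terms, gene_list, alpha=0.05):
    """Step 2: terms shared by the list more often than chance.

    gene_terms: gene -> set of associated terms, for every gene with literature
    gene_list: the input genes (a subset of gene_terms)
    """
    N, n = len(gene_terms), len(gene_list)
    candidates = set().union(*(gene_terms[g] for g in gene_list))
    result = {}
    for t in candidates:
        M = sum(1 for terms in gene_terms.values() if t in terms)
        k = sum(1 for g in gene_list if t in gene_terms[g])
        p = hypergeom_upper_tail(N, M, n, k)
        if p < alpha:
            result[t] = p
    return result
```

One consequence of this design is that step 2 needs term associations for genes outside the input list as well, since M is counted over the whole gene universe.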
Comparative Text Mining • For each gene, find a collection of articles that discuss this gene • Each article in a collection is a mixture of two distributions: a theme common to all collections; and a collection-specific theme • Parameter estimation in the mixture model: the standard EM algorithm
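The mixture model and its EM updates can be sketched as follows. This is a simplified illustration rather than the model used in the talk: it assumes a single common theme, a fixed mixing weight `lam`, no background model, and it pools the articles of each gene into one bag of words. The names `fit_common_theme` and `normalize` are hypothetical.

```python
from collections import Counter


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


def fit_common_theme(collections, lam=0.5, iters=100):
    """EM for a two-component mixture per collection.

    collections: list of token lists, one per gene (articles pooled)
    lam: assumed fixed probability that a word comes from the common theme
    Returns (common, specific): a word distribution shared by all
    collections and one word distribution per collection.
    """
    counts = [Counter(tokens) for tokens in collections]
    pooled = Counter()
    for c in counts:
        pooled.update(c)
    # Initialise the common theme from pooled counts and each specific
    # theme from its own collection.
    common = normalize(pooled)
    specific = [normalize(c) for c in counts]

    for _ in range(iters):
        common_mass = Counter()
        new_specific = []
        for c, spec in zip(counts, specific):
            spec_mass = {}
            for w, n in c.items():
                # E-step: posterior that word w was drawn from the common theme.
                pc = lam * common[w]
                ps = (1.0 - lam) * spec[w]
                z = pc / (pc + ps)
                # M-step accumulators: split the count between the two themes.
                common_mass[w] += n * z
                spec_mass[w] = n * (1.0 - z)
            new_specific.append(normalize(spec_mass))
        common = normalize(common_mass)
        specific = new_specific
    return common, specific


# Hypothetical toy input: three gene collections sharing the word "pathway".
docs = [
    ["pathway"] * 5 + ["toll"] * 5,
    ["pathway"] * 5 + ["pelle"] * 5,
    ["pathway"] * 5 + ["tube"] * 5,
]
common, specific = fit_common_theme(docs)
top_common = max(common, key=common.get)
```

Sorting `common` by probability gives the ranked word list for the shared theme, while each entry of `specific` absorbs the words that are characteristic of a single gene.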
Results: Pelle System • Pelle system in Drosophila: Spatzle, Toll, Pelle, Tube, Cactus, Dorsal • Among the top-50 words: signaling, pathway, receptor, embryo, ventral, dorsoventral, patterning, embryonic
Results: MET cluster • MET cluster from yeast cell-cycle data: MET28, MET14, MET16, MET10, MET2, MUP1 • Among the top-50 words: amino, met25, sulphite
Problems and Plan • Problem: many common words (such as stop words) appear in the top list because the word scores are not properly normalized • Using the entire Medline corpus as background did not work • Plan: use the hypothesis testing approach as an alternative • Problem: single words are not very suggestive • Plan: add phrase extraction as a postprocessing step