Automatic Gene Summary Generation from Biomedical Literature

Automatically Generating Gene Summaries From Biomedical Literature Jing Jiang, Xu Ling, ChengXiang Zhai University of Illinois at Urbana-Champaign Document Retrieval Sentence Extraction Document Clustering IE IR Problem Definition System Overview Document Similarity and Clustering • Motivation: manual annotation of genes can hardly catch up with the fast growth of biomedical literature • Goal: to automatically extract relevant statements from biomedical literature to summarize the already-discovered knowledge about a target gene • Methodology: two phases • IR phase: retrieve articles that contain relevant information about the gene of interest • IE phase: cluster articles into different information aspects, then extract the facts about the target gene at the sentence level • Goal: group similar documents together to help summarize the target gene in several aspects • Difficulties: • Documents on different genes touch different aspects • No clear boundary between different aspects • False positives may dominate due to gene name ambiguity • Partial solution: • Predefine a number of categories based on observation • Generate prior language models using FlyBase annotations • Use priors on PLSA to guide the clustering • Relevant document retrieval • Query expansion using gene synonym list • Stopword removal from synonym list to reduce false positives • Exact phrase matching for multi-token synonyms • Document clustering • PLSA w/ and w/o prior language models • K-means • Sentence extraction • Rank sentences by cosine similarity to the cluster centroid using TF-IDF vectors • Rank sentences based on probabilities of being generated from corresponding language model • Rank by other heuristics: title, location, etc. Document Clustering Results Sentence Extraction Results Conclusion • Document Clustering • Can help recognize false positives • K-means is more effective than PLSA for this problem • Sentence Extraction • Can represent the document cluster to some extent • Future Work • Automatically remove false positives using document clustering • Derive better scoring function for sentence extraction • Develop more systematic way to evaluate and tune the system • Used PubMed abstracts on Drosophila • Evaluated results on two genes: EAG and SS • Two humans with domain knowledge grouped documents into 6 categories (Std 1 and Std 2) • Another grouping based on false positives (Std 3) • Measured results by class entropy, cluster entropy, and their average • Extracted sentences from standard clusters of documents to isolate the evaluation of sentence extraction from document clustering • Combined top K sentences extracted by different methods into a single judgment pool, and judged by a human with domain knowledge • Measured results by precision at K sentences

Automatic Gene Summary Generation from Biomedical Literature

Automatic Gene Summary Generation from Biomedical Literature

Presentation Transcript

Sentence & Sentence Fragments

Mentor Sentence Compound Sentence

Sentence Structure: Sentence Types

Sentence

Sentence or Sentence Fragment?

Sentence Structure: Sentence Types

Topic Sentence Sentence #1 Sentence #2 Sentence #3 Sentence#4

Extraction

Cognitively Based Parsing and Extraction of Potential Knowledge from Sentence Structures

Multiturn extraction for PS2 First results

2. Wall Boundary Extraction Results ( for Liaoshang Jing )

Sentence Types: Simple Sentence Compound Sentence Complex Sentence Complex-Compound Sentence

Comments-Oriented Blog Summarization by Sentence Extraction

Extraction

Sentence

SENTENCE

Sentence Structure: Sentence Types

DNA Extraction /RNA Extraction

Web-Scale Information Extraction in KNOWITALL (Preliminary Results)

Extraction

Sentence

Data Extraction of Online Quiz Results

Automatic Gene Summary Generation from Biomedical Literature