
ISMB 2003 presentation Extracting Synonymous Gene and Protein Terms from Biological Literature

Hong Yu and Eugene Agichtein. Dept. of Computer Science, Columbia University, New York, USA. {hongyu, eugene}@cs.columbia.edu, 212-939-7028.


Presentation Transcript


  1. ISMB 2003 presentation: Extracting Synonymous Gene and Protein Terms from Biological Literature • Hong Yu and Eugene Agichtein • Dept. of Computer Science, Columbia University, New York, USA • {hongyu, eugene}@cs.columbia.edu • 212-939-7028

  2. Significance and Introduction • Genes and proteins are often associated with multiple names • e.g., Apo3, DR3, TRAMP, LARD, and lymphocyte associated receptor of death • Authors often use different synonyms • Information extraction benefits from identifying those synonyms • Synonym knowledge sources are not complete • Goal: develop automated approaches for identifying gene/protein synonyms from the literature

  3. Background-synonym identification • Semantically related words • Distributional similarity [Lin 98][Li and Abe 98][Dagan et al 95] • “beer” and “wine” • “drink”, “people”, “bottle” and “make” • Mapping abbreviations to full forms • Map LARD to lymphocyte associated receptor of death • [Bowden et al. 98] [Hisamitsu and Niwa 98] [Liu and Friedman 03] [Pakhomov 02] [Park and Byrd 01] [Schwartz and Hearst 03] [Yoshida et al. 00] [Yu et al. 02] • Methods for detecting biomedical multiword synonyms • Sharing a word(s) [Hole 00] • “cerebrospinal fluid” and “cerebrospinal fluid protein assay” • Information retrieval approach • Trigram matching algorithm [Wilbur and Kim 01] • Vector space model • cerebrospinal fluid → cer, ere, …, uid • cerebrospinal fluid protein assay → cer, ere, …, say
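
The trigram and vector-space bullets above can be illustrated with a small sketch. This is not the algorithm of [Wilbur and Kim 01], whose details are not given on the slide; it is a generic example that assumes terms are split into overlapping character trigrams and compared by cosine similarity of the raw trigram counts. The function names `trigrams` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def trigrams(term):
    """Count the overlapping character trigrams of a term (lower-cased)."""
    text = term.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


short = trigrams("cerebrospinal fluid")                 # cer, ere, ..., uid
longer = trigrams("cerebrospinal fluid protein assay")  # cer, ere, ..., say
score = cosine(short, longer)
```

Because every trigram of the shorter term also occurs in the longer one, the two terms score well above zero but below 1, which is the behavior a trigram-based matcher relies on.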

  4. Background-synonym identification • GPE [Yu et al 02] • A rule-based approach for detecting synonymous gene/protein terms • Manually identify the patterns authors use to list synonyms • e.g., Apo3/TRAMP/WSL/DR3/LARD • Extract synonym candidates and apply heuristics to filter out unrelated terms • e.g., ng/kg/min • Advantages and disadvantages • High precision (90%) • Recall might be low; expensive to build

  5. Background—Machine-learning • Machine-learning reduces manual effort by automatically acquiring rules from data • Unsupervised and supervised • Semi-supervised • Bootstrapping [Hearst 92, Yarowsky 95] [Agichtein and Gravano 00] • Hyponym detection [Hearst 92] • The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string. • A Bambara ndang is a kind of bow lute • Co-training [Blum and Mitchell 98]

  6. Method-Outline • Machine-learning • Unsupervised • Similarity [Dagan et al 95] • Semi-supervised • Bootstrapping • SNOWBALL [Agichtein and Gravano 02] • Supervised • Support Vector Machine • Comparison between machine-learning and GPE • Combined approach

  7. Method--Unsupervised • Contextual similarity [Dagan et al 95] • Hypothesis: synonyms have similar surrounding words • Mutual information • Similarity
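
The formulas for mutual information and similarity did not survive in this transcript. As a rough illustration only, the sketch below assumes pointwise mutual information over co-occurrence counts in a fixed token window, and a min/max ratio over positive mutual-information values as the similarity. The exact definitions in [Dagan et al 95] and on the original slide may differ; the window size and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_counts(sentences, window=2):
    """Count how often each word co-occurs with each neighbor within a window."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def pmi_vectors(counts):
    """Map each word to {context word: positive pointwise mutual information}."""
    total = sum(sum(ctx.values()) for ctx in counts.values())
    word_total = {w: sum(ctx.values()) for w, ctx in counts.items()}
    vectors = {}
    for word, ctx in counts.items():
        vec = {}
        for other, n in ctx.items():
            pmi = math.log2(n * total / (word_total[word] * word_total[other]))
            if pmi > 0:
                vec[other] = pmi
        vectors[word] = vec
    return vectors


def similarity(vectors, w1, w2):
    """Ratio of summed minima to summed maxima over all context words."""
    v1, v2 = vectors.get(w1, {}), vectors.get(w2, {})
    contexts = set(v1) | set(v2)
    num = sum(min(v1.get(c, 0.0), v2.get(c, 0.0)) for c in contexts)
    den = sum(max(v1.get(c, 0.0), v2.get(c, 0.0)) for c in contexts)
    return num / den if den else 0.0


corpus = [["apo3", "binds", "tradd"],
          ["dr3", "binds", "tradd"],
          ["actin", "forms", "filaments"]]
vecs = pmi_vectors(context_counts(corpus))
```

In this toy corpus, `similarity(vecs, "apo3", "dr3")` is high because both terms appear with the same surrounding words, while `similarity(vecs, "apo3", "actin")` is zero because their contexts do not overlap.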

  8. Methods--semi-supervised • SNOWBALL [Agichtein and Gravano 02] • Bootstrapping • Starts with a small set of user-provided seed tuples for the relation, then automatically generates and evaluates patterns for extracting new tuples • Example seed tuple: {Apo3, DR3} • Text: “Apo3, also known as DR3…”; “DR3, also called LARD…” • Patterns: “<GENE>, also known as <GENE>”; “<GENE>, also called <GENE>” • Extracted tuples: {DR3, LARD}; {LARD, Apo3}
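
The bootstrapping loop described above can be sketched as follows. This is a heavily simplified illustration, not the SNOWBALL implementation: it assumes gene names are already marked with a hypothetical `<GENE>…</GENE>` tag, treats the literal text between two tagged names as the pattern, and omits the pattern and tuple confidence scoring that SNOWBALL performs. All names in the code are made up for the example.

```python
import re

# Hypothetical tagging format: gene names wrapped in <GENE>...</GENE>.
GENE = r"<GENE>(.+?)</GENE>"
PAIR = re.compile(GENE + r"(.{1,30}?)" + GENE)


def occurrences(sentences):
    """Yield (unordered gene pair, text between the two names) for each match."""
    for sentence in sentences:
        for m in PAIR.finditer(sentence):
            yield frozenset((m.group(1), m.group(3))), m.group(2).strip()


def bootstrap(sentences, seeds, rounds=5):
    """Alternate between learning patterns from known pairs and applying them."""
    known = {frozenset(pair) for pair in seeds}
    patterns = set()
    occ = list(occurrences(sentences))
    for _ in range(rounds):
        # Patterns are the contexts in which already-known pairs occur.
        learned = {mid for pair, mid in occ if pair in known}
        # Any pair occurring in a learned context becomes a new candidate.
        extracted = {pair for pair, mid in occ if mid in learned}
        if learned <= patterns and extracted <= known:
            break  # nothing new in this round
        patterns |= learned
        known |= extracted
    return known, patterns


sentences = [
    "<GENE>Apo3</GENE>, also known as <GENE>DR3</GENE>, is a receptor.",
    "<GENE>TRAMP</GENE>, also known as <GENE>LARD</GENE>.",
    "<GENE>TRAMP</GENE>, also called <GENE>LARD</GENE>.",
    "<GENE>DR3</GENE>, also called <GENE>LARD</GENE>, induces apoptosis.",
]
pairs, patterns = bootstrap(sentences, [("Apo3", "DR3")])
```

Starting from the single seed {Apo3, DR3}, the first round learns ", also known as" and picks up {TRAMP, LARD}; that pair then teaches ", also called", which in turn extracts {DR3, LARD}. Without confidence scoring, a loop like this accepts every match, so a real system needs the evaluation step the slide mentions.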

  9. Method--Supervised • Support Vector Machine • State-of-the-art text classification method • SVMlight • Training sets: • The same sets of positive and negative tuples as used for SNOWBALL • Features: the same terms and term weights used by SNOWBALL • Kernel function • Radial basis function (RBF) kernel
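
For reference, the RBF kernel named above has the standard form K(x, y) = exp(-gamma * ||x - y||^2). The sketch below evaluates it on sparse term-weight dictionaries. The value of `gamma` and the dictionary representation are assumptions made for the example; the slide does not state the settings used with SVMlight.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel between two sparse vectors given as {term: weight} dicts.

    gamma is an assumed value; the original parameter setting is not stated.
    """
    terms = set(x) | set(y)
    squared_distance = sum((x.get(t, 0.0) - y.get(t, 0.0)) ** 2 for t in terms)
    return math.exp(-gamma * squared_distance)


a = {"also": 0.5, "known": 0.5, "as": 0.5}
b = {"also": 0.5, "called": 0.5}
k_same = rbf_kernel(a, a)
k_diff = rbf_kernel(a, b)
```

Identical contexts give a kernel value of 1, and the value decays toward 0 as the weighted term vectors diverge.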

  10. Methods--Combined • Rationale • Machine-learning approaches increase recall • The manual rule-based approach GPE has high precision but lower recall • Combining them should boost both recall and precision • Method • Assume each system is an independent predictor • Prob(pair is correct) = 1 - Prob(all systems extracted it incorrectly) = 1 - (1 - p_1)(1 - p_2)…(1 - p_n), where p_i is the confidence system i assigns to the pair
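
The combination rule above follows directly from the independence assumption and can be written in a few lines. The function name and the example confidence values are hypothetical; only the formula comes from the slide.

```python
def combined_confidence(confidences):
    """Probability that at least one independent system is right about a pair.

    `confidences` holds each system's probability that the pair is a correct
    synonym; a system that did not extract the pair is simply left out.
    """
    prob_all_wrong = 1.0
    for p in confidences:
        prob_all_wrong *= (1.0 - p)
    return 1.0 - prob_all_wrong


# Example with made-up scores from two systems for the same pair.
example = combined_confidence([0.9, 0.5])  # 1 - (0.1 * 0.5)
```

Under this rule a pair found by several systems always scores at least as high as its best single score, which is why the combination can raise recall without discarding the high-confidence rule-based extractions.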

  11. Evaluation-data • Data • GeneWays corpora [Friedman et al 01] • 52,000 full-text journal articles • Science, Nature, Cell, EMBO, Cell Biology, PNAS, Journal of Biochemistry • Preprocessing • Gene/protein named-entity tagging • Abgene [Tanabe and Wilbur 02] • Segmentation • SentenceSplitter • Training and testing • 20,000 articles for training • Tuning SNOWBALL parameters such as context window, etc. • 32,000 articles for testing

  12. Evaluation-metrics • Estimating precision • Randomly select 20 synonyms with confidence scores (0.0-0.1, 0.1-0.2, …,0.9-1.0) • Biological experts judged the correctness of synonym pairs • Estimating recall • SWISSPROT as the gold standard • 989 pairs of SWISSPROT synonyms co-appear in at least one sentence in the test set • Biological experts judged that 588 of these pairs were indeed used as synonyms • Counter-example: “…and cdc47, cdc21, and mis5 form another complex, which relatively weakly associates with mcm2…”
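
The two estimates above can be computed as sketched below. This is an illustrative sketch under stated assumptions: pairs are treated as unordered and compared case-insensitively, and expert judgments are represented as `(confidence, is_correct)` tuples. Neither convention is specified on the slide, and the function names are hypothetical.

```python
from collections import defaultdict


def normalize(pairs):
    """Treat synonym pairs as unordered and case-insensitive (an assumption)."""
    return {frozenset(term.lower() for term in pair) for pair in pairs}


def recall(extracted_pairs, gold_pairs):
    """Fraction of gold-standard pairs that the system extracted."""
    gold = normalize(gold_pairs)
    if not gold:
        return 0.0
    return len(gold & normalize(extracted_pairs)) / len(gold)


def precision_by_bin(judgments):
    """Precision per confidence bin (0: 0.0-0.1, ..., 9: 0.9-1.0).

    `judgments` is a list of (confidence, is_correct) tuples from the experts.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for confidence, is_correct in judgments:
        bin_index = min(int(confidence * 10), 9)  # a score of 1.0 goes in the top bin
        total[bin_index] += 1
        correct[bin_index] += int(is_correct)
    return {b: correct[b] / total[b] for b in total}


r = recall([("apo3", "DR3"), ("A", "B")], [("DR3", "Apo3"), ("LARD", "DR3")])
p = precision_by_bin([(0.95, True), (0.91, True), (1.0, True),
                      (0.15, False), (0.12, True)])
```

With the slide's numbers, the denominator for recall would be the 588 expert-confirmed SWISSPROT pairs rather than all 989 co-occurring pairs.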

  13. Results • Patterns SNOWBALL found:
      Conf | Left | Middle | Right
      0.75 | - | <( 0.55> <ALSO 0.53> <CALLED 0.53> | -
      0.54 | - | <ALSO 0.47> <KNOWN 0.47> <AS 0.47> | -
      0.47 | - | <( 0.54> <ALSO 0.54> <TERMED 0.54> | -
      • Of 148 evaluated synonym pairs, 62 (42%) were not listed as synonyms in SWISSPROT

  14. Results

  15. Results

  16. Results • System performance (running time):
      System | Time
      Tagging | 7 h
      Similarity | 40 min
      Snowball | 2 h
      SVM | 1.5 h
      GPE | 35 min

  17. Conclusions • Extraction techniques can be a valuable supplement to resources such as SWISSPROT • Extraction of synonym relations can be automated through machine-learning approaches • SNOWBALL can be applied successfully to recognize the patterns authors use to introduce synonyms
