
Biomedical Information Extraction


osborn

Presentation Transcript


  1. Biomedical Information Extraction

  2. Outline • Intro to biomedical information extraction • PASTA [Demetriou and Gaizauskas] • Biomedical named entities • Name variability [Cohen, Dolbey, Acquaah-Mensah, and Hunter] • Name tagging [Tanabe and Wilbur]

  3. PASTA • [Demetriou and Gaizauskas] • Protein Active Site Template Acquisition

  4. Extraction Tasks • Terminological Tagging • “entities” • Template Filling • “relationships”

  5. Terminology Tagging • protein • species • residue • site • region • secondary structure • supersecondary structure • quaternary structure • base • atom • non-protein compound • interaction

  6. Template Filling
     protein    := NAME: string
     species    := NAME: string
     in_species := PROTEIN: protein  SPECIES: species
     residue    := NAME: string  SITE/FUN: string  SEC_STRUCT: string  QUAT_STRUCT: string  REGION: string  INTERACTION: string
     in_protein := RESIDUE: residue  PROTEIN: protein
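The template definitions on this slide map naturally onto record types. The following is a minimal sketch, assuming Python dataclasses as the representation; the class names, the optional defaults, and the example values are illustrative assumptions, not PASTA's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Protein:
    name: str


@dataclass
class Species:
    name: str


@dataclass
class InSpecies:
    """Relation template: a protein found in a species."""
    protein: Protein
    species: Species


@dataclass
class Residue:
    name: str
    # Assumption: slots other than NAME may be left unfilled.
    site_fun: Optional[str] = None
    sec_struct: Optional[str] = None
    quat_struct: Optional[str] = None
    region: Optional[str] = None
    interaction: Optional[str] = None


@dataclass
class InProtein:
    """Relation template: a residue located in a protein."""
    residue: Residue
    protein: Protein


# Hypothetical filled templates, for illustration only.
protein = Protein(name="example protein")
residue = Residue(name="SER-87", site_fun="active site")
link = InProtein(residue=residue, protein=protein)
```

Keeping the relation templates (`InSpecies`, `InProtein`) separate from the entity templates mirrors the slide's split between "entities" and "relationships".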

  7. PASTA Architecture • Text Preprocessing • Title, author, abstract • Tokenization, sentence boundaries

  8. PASTA Architecture • Terminological Processing • Morphological analysis • biochemical morphemes “-ase” • Lexical lookup • token lookup in databases • token grammatical class tagging • Terminology parsing • create multi-token terms, rule-based parsing using grammatical tags
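The first two terminological steps (morphological analysis and lexical lookup) can be sketched as follows. This is only an illustration under stated assumptions: the lexicon entries are made up, only the "-ase" morpheme comes from the slide, and PASTA's real databases and rules are not reproduced here.

```python
from typing import Optional

# Assumption: a tiny stand-in for the databases PASTA consults.
LEXICON = {"lysozyme": "protein", "serine": "residue"}

# Only "-ase" is taken from the slide; it is assumed here to suggest a protein (enzyme) name.
SUFFIX_CLASSES = {"ase": "protein"}


def classify_token(token: str) -> Optional[str]:
    """Return a candidate term class for a token, or None if nothing matches."""
    lower = token.lower()
    # Lexical lookup takes priority over the morphological guess.
    if lower in LEXICON:
        return LEXICON[lower]
    # Morphological analysis: check for known biochemical suffixes.
    for suffix, term_class in SUFFIX_CLASSES.items():
        if lower.endswith(suffix) and len(lower) > len(suffix):
            return term_class
    return None


print(classify_token("kinase"))
print(classify_token("the"))
```

A bare suffix test over-generates: ordinary words such as "database" also end in "-ase", which is one reason the later rule-based parsing step over grammatical tags is needed.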

  9. PASTA Architecture • Syntactic and Semantic Processing • Part-of-speech tags • Phrase structure • Compositional semantics • Discourse Processing • Semantic representations incorporated into discourse model of concept hierarchy and inference rules

  10. PASTA Architecture • Template Extraction • Scan discourse model for template instances, check slots, build template

  11. Performance

  12. PASTAWeb • Index • document -> terminology, template • terms -> templates from multiple documents • IE tools need to be incorporated into effective interfaces for biology researchers

  13. Indexing Problem • Variations in expression of same protein name

  14. Contrast and Variability • [Cohen, Dolbey, Acquaah-Mensah, and Hunter] • Named Entities • location vs. identification • Variability • somatotropin • rat somatotropin • growth hormone

  15. Variability • Non-contrast (synonyms) • tumor protein homolog vs tumour protein homologue • Contrast (diffonyms?) • ACE1 vs ACE2

  16. Transformations • Remove first character • Remove first word • Remove last character • Remove last word • Replace sequence of vowels with one letter • Replace hyphen with space • Remove parenthesized material • Convert to lowercase
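The eight transformations listed on this slide can be sketched as small string functions. This is an illustrative sketch, not the authors' code: the function names, the placeholder letter "V" for vowel runs, and the rule that an empty result does not count as a match are all assumptions.

```python
import re


def remove_first_character(name):
    return name[1:]


def remove_first_word(name):
    return " ".join(name.split()[1:])


def remove_last_character(name):
    return name[:-1]


def remove_last_word(name):
    return " ".join(name.split()[:-1])


def collapse_vowel_sequences(name):
    # Assumption: every run of vowels becomes the single placeholder letter "V".
    return re.sub(r"[aeiou]+", "V", name, flags=re.IGNORECASE)


def hyphen_to_space(name):
    return name.replace("-", " ")


def remove_parenthesized(name):
    return re.sub(r"\s*\([^)]*\)", "", name).strip()


def lowercase(name):
    return name.lower()


TRANSFORMATIONS = {
    "remove_first_character": remove_first_character,
    "remove_first_word": remove_first_word,
    "remove_last_character": remove_last_character,
    "remove_last_word": remove_last_word,
    "collapse_vowel_sequences": collapse_vowel_sequences,
    "hyphen_to_space": hyphen_to_space,
    "remove_parenthesized": remove_parenthesized,
    "lowercase": lowercase,
}


def collapsing_transformations(a, b):
    """Names of the transformations under which two distinct names become identical."""
    matches = []
    for label, transform in TRANSFORMATIONS.items():
        ta, tb = transform(a), transform(b)
        # Assumption: an empty result (e.g. a one-word name with its word removed) is not a match.
        if ta and ta == tb:
            matches.append(label)
    return matches


print(collapsing_transformations("ACE1", "ACE2"))
print(collapsing_transformations("tumor", "tumour"))
```

Applying each transformation to a pair of names shows which kind of difference separates them: a transformation that merges known synonyms marks non-contrastive variation, while one that merges names of different genes marks a contrast.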

  17. Experiment • Collect groups of synonym gene names • Get mouse, rat, and human genes from LocusLink • Group OFFICIAL GENE NAME, PREFERRED GENE NAME, OFFICIAL SYMBOL, PREFERRED SYMBOL, PRODUCT, PREFERRED PRODUCT, ALIAS SYMBOL, ALIAS PROT entries together as synonyms

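  18. Results • LMW (remove first word), RMC (remove last character), and RMW (remove last word) identify contrastive variability • Contrasts are likely marked at name boundaries • VS (vowel sequences), HYPH (hyphens), CASE (lowercasing), and PM (parenthesized material) identify non-contrastive variability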

  19. Pattern Heuristics • Equivalence of vowel sequences • Optionality of hyphens • Optionality of parenthesized material • Case insensitivity
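For the indexing problem, the four heuristics can be combined into one normalization key so that non-contrastive variants land in the same bucket. The sketch below is one possible reading under stated assumptions: the function names, the order of the steps, and the sample names are illustrative and not taken from the paper.

```python
import re
from collections import defaultdict


def normalize(name):
    """Build a matching key from the four non-contrastive heuristics."""
    key = name.lower()                        # case insensitivity
    key = re.sub(r"\s*\([^)]*\)", "", key)    # parenthesized material is optional
    key = key.replace("-", " ")               # hyphens are optional
    key = re.sub(r"[aeiou]+", "V", key)       # vowel sequences are equivalent
    return " ".join(key.split())              # tidy up whitespace


def group_variants(names):
    """Group names that share a normalization key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return dict(groups)


groups = group_variants(["Tumour protein", "tumor protein", "ACE1", "ACE2"])
for key, members in groups.items():
    print(key, members)
```

The key deliberately leaves the ends of the name alone, so contrasting names such as ACE1 and ACE2 stay in separate groups.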

  20. Tagging Genes and Proteins • [Tanabe and Wilbur] • ABGene • Trained on MEDLINE abstracts • Tested on PUBMED full texts

  21. ABGene • Transformation-based tagger • False-positive and false-negative filters • Compound term recovery • Document ranking

  22. Transformation-Based Tagging • Learns sequence of transformation rules of the form • A -> B / C • greedily, based on number of errors corrected in training data tags • Applies rules sequentially to tag new text
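The learn-then-apply loop described on this slide can be sketched in a few lines. This is a simplified illustration, not ABGene or Brill's tagger: the `Rule` class, the candidate rules, the tiny training sentence, and the choice to test each rule's condition against the tags as they stood before that rule fired are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """A -> B / C: change tag A to B where condition C holds ("*" matches any tag)."""
    from_tag: str
    to_tag: str
    condition: Callable[[List[str], List[str], int], bool]
    label: str = ""


def apply_rule(rule, words, tags):
    new_tags = list(tags)
    for i, tag in enumerate(tags):
        if rule.from_tag in ("*", tag) and rule.condition(words, tags, i):
            new_tags[i] = rule.to_tag
    return new_tags


def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))


def learn(words, initial_tags, gold_tags, candidates):
    """Greedily pick the rule that corrects the most errors until none helps."""
    tags, learned = list(initial_tags), []
    while True:
        best, best_gain, best_tags = None, 0, tags
        for rule in candidates:
            new_tags = apply_rule(rule, words, tags)
            gain = errors(tags, gold_tags) - errors(new_tags, gold_tags)
            if gain > best_gain:
                best, best_gain, best_tags = rule, gain, new_tags
        if best is None:
            return learned
        learned.append(best)
        tags = best_tags


def tag(words, initial_tags, rules):
    """Apply the learned rules in the order they were learned."""
    tags = list(initial_tags)
    for rule in rules:
        tags = apply_rule(rule, words, tags)
    return tags


# Hypothetical training sentence and candidate rules, for illustration only.
words = ["the", "p53", "gene", "is", "expressed"]
initial = ["DT", "NNP", "NN", "VBZ", "VBN"]
gold = ["DT", "GENE", "NN", "VBZ", "VBN"]
candidates = [
    Rule("NNP", "GENE", lambda w, t, i: i + 1 < len(w) and w[i + 1] == "gene", "next word is 'gene'"),
    Rule("NN", "GENE", lambda w, t, i: i > 0 and t[i - 1] == "NNP", "previous tag is NNP"),
]
learned = learn(words, initial, gold, candidates)
print([rule.label for rule in learned])
print(tag(words, initial, learned))
```

Because every accepted rule must strictly reduce the error count on the training data, the learning loop always terminates, and the second candidate (which would introduce an error here) is never selected.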

  23. Gene Transformations • GENE added as an additional POS tag • NNP -> GENE / gene fgoodleft • * -> GENE / hassuf -A • * -> GENE / haspref c- • NNP -> GENE / prev1or2wd genes • NNP -> GENE / nextbigram ( GENE • VBG -> JJ / nexttag GENE

  24. Results • Precision up to 0.74 • Recall up to 0.64 • depending on score threshold

  25. Problems in Full Text • Terms that do not appear in abstracts • restriction enzyme site, lab protocol kits, primers, vectors, supply companies, chemical reagents • Figures and tables

  26. Summary • Common thread in biomedical information extraction: normalization is hard!
