Biomedical Information Extraction

Outline • Intro to biomedical information extraction • PASTA [Demetriou and Gaizauskas] • Biomedical named entities • Name variability [Cohen, Dolbey, Acquaah-Mensah, and Hunter] • Name tagging [Tanabe and Wilbur]

PASTA • [Demetriou and Gaizauskas] • Protein Active Site Template Acquisition

Extraction Tasks • Terminological Tagging • “entities” • Template Filling • “relationships”

Terminology Tagging • protein • species • residue • site • region • secondary structure • supersecondary structure • quaternary structure • base • atom • non-protein compound • interaction

Template Filling • protein := NAME: string • species := NAME: string • in_species := PROTEIN: protein, SPECIES: species • residue := NAME: string, SITE/FUN: string, SEC_STRUCT: string, QUAT_STRUCT: string, REGION: string, INTERACTION: string • in_protein := RESIDUE: residue, PROTEIN: protein
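
The template definitions above map naturally onto record types. The following is a minimal sketch in Python, assuming that every slot other than NAME may be left unfilled; the class names, field names, and the example values (lysozyme, hen, Glu35) are illustrative assumptions and do not come from the PASTA system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Protein:
    name: str


@dataclass(frozen=True)
class Species:
    name: str


@dataclass(frozen=True)
class Residue:
    name: str
    # Assumption: slots other than NAME are optional and default to unfilled.
    site_fun: Optional[str] = None
    sec_struct: Optional[str] = None
    quat_struct: Optional[str] = None
    region: Optional[str] = None
    interaction: Optional[str] = None


@dataclass(frozen=True)
class InSpecies:
    """Relation template: a protein found in a species."""
    protein: Protein
    species: Species


@dataclass(frozen=True)
class InProtein:
    """Relation template: a residue that belongs to a protein."""
    residue: Residue
    protein: Protein


# Illustrative values only; not output of the PASTA system.
lysozyme = Protein("lysozyme")
hen = Species("hen")
glu35 = Residue("Glu35", site_fun="active site")

filled = [InSpecies(lysozyme, hen), InProtein(glu35, lysozyme)]
```

Keeping entities and relations as separate types mirrors the split between terminology tagging ("entities") and template filling ("relationships").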

PASTA Architecture • Text Preprocessing • Title, author, abstract • Tokenization, sentence boundaries

PASTA Architecture • Terminological Processing • Morphological analysis • biochemical morphemes “-ase” • Lexical lookup • token lookup in databases • token grammatical class tagging • Terminology parsing • create multi-token terms, rule-based parsing using grammatical tags
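
A rough sketch of the first two terminological steps (lexical lookup and morphological analysis) is given below. The lexicon contents, the stoplist, and the function name `classify_token` are assumptions made for illustration; the slides name only the "-ase" morpheme and do not describe PASTA's actual databases or rules.

```python
# Hypothetical stand-in for the external term databases (assumption).
LEXICON = {
    "lysozyme": "protein",
    "escherichia": "species",
    "glutamate": "residue",
}

# Biochemical suffix -> class. "-ase" is the one morpheme named in the slides.
SUFFIX_CLASSES = {"ase": "protein"}

# Common English words ending in "-ase" that should not be tagged (assumption).
STOPLIST = {"base", "case", "phase", "disease", "increase",
            "decrease", "release", "database"}


def classify_token(token):
    """Return a terminology class for a token, or None if nothing matches."""
    lower = token.lower()
    # 1. Lexical lookup: an exact database hit wins.
    if lower in LEXICON:
        return LEXICON[lower]
    # 2. Morphological analysis: fall back to biochemical suffixes.
    if lower in STOPLIST:
        return None
    for suffix, term_class in SUFFIX_CLASSES.items():
        # Require a stem of at least three characters before the suffix.
        if lower.endswith(suffix) and len(lower) >= len(suffix) + 3:
            return term_class
    return None


tokens = ["The", "dehydrogenase", "from", "Escherichia", "causes", "disease"]
tagged = [(tok, classify_token(tok)) for tok in tokens]
```

The stoplist shows why suffix rules alone are not enough: words such as "disease" end in "-ase" but are not enzymes. Multi-token term parsing is left out of this sketch.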

PASTA Architecture • Syntactic and Semantic Processing • Part-of-speech tags • Phrase structure • Compositional semantics • Discourse Processing • Semantic representations incorporated into discourse model of concept hierarchy and inference rules

PASTA Architecture • Template Extraction • Scan discourse model for template instances, check slots, build template

Performance

PASTAWeb • Index • document -> terminology, template • terms -> templates from multiple documents • IE tools need to be incorporated into effective interfaces for biology researchers

Indexing Problem • Variations in expression of same protein name

Contrast and Variability • [Cohen, Dolbey, Acquaah-Mensah, and Hunter] • Named Entities • location vs. identification • Variability • somatotropin • rat somatotropin • growth hormone

Variability • Non-contrast (synonyms) • tumor protein homolog vs tumour protein homologue • Contrast (diffonyms?) • ACE1 vs ACE2

Transformations • Remove first character • Remove first word • Remove last character • Remove last word • Replace sequence of vowels with one letter • Replace hyphen with space • Remove parenthesized material • Convert to lowercase
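
The eight transformations listed above are simple string operations. Here is one possible Python rendering, offered as a sketch: the function names are made up for this example, and the choice of "V" as the single replacement letter for vowel sequences is an assumption, since the slide does not say which letter is used.

```python
import re


def remove_first_char(name):
    return name[1:]


def remove_first_word(name):
    return " ".join(name.split()[1:])


def remove_last_char(name):
    return name[:-1]


def remove_last_word(name):
    return " ".join(name.split()[:-1])


def collapse_vowels(name):
    # Every run of vowels becomes one placeholder letter (assumed to be "V").
    return re.sub(r"[aeiou]+", "V", name, flags=re.IGNORECASE)


def hyphen_to_space(name):
    return name.replace("-", " ")


def remove_parenthesized(name):
    # Drop "(...)" together with any whitespace directly before it.
    return re.sub(r"\s*\([^)]*\)", "", name).strip()


def to_lowercase(name):
    return name.lower()


TRANSFORMS = [
    remove_first_char, remove_first_word, remove_last_char, remove_last_word,
    collapse_vowels, hyphen_to_space, remove_parenthesized, to_lowercase,
]

# Apply every transformation to one name to see its effect.
variants = {fn.__name__: fn("rat growth-hormone (GH)") for fn in TRANSFORMS}
```

For example, removing the last character maps both ACE1 and ACE2 to ACE, while collapsing vowel sequences maps both "tumor" and "tumour" to the same string.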

Experiment • Collect groups of synonym gene names • Get mouse, rat, and human genes from LocusLink • Group OFFICIAL GENE NAME, PREFERRED GENE NAME, OFFICIAL SYMBOL, PREFERRED SYMBOL, PRODUCT, PREFERRED PRODUCT, ALIAS SYMBOL, ALIAS PROT entries together as synonyms

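Results • Removing the first word (LMW), last character (RMC), or last word (RMW) identifies contrastive variability • Contrasts likely marked at name boundaries • Vowel sequences (VS), hyphens (HYPH), case (CASE), and parenthesized material (PM) identify non-contrastive variability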

Pattern Heuristics • Equivalence of vowel sequences • Optionality of hyphens • Optionality of parenthesized material • Case insensitivity
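
One way to use these four heuristics for the indexing problem is to fold them into a single matching key, so that names differing only in non-contrastive ways land in the same bucket. The sketch below assumes that hyphens are treated as spaces and that vowel runs collapse to the letter "a"; `match_key` and `group_variants` are hypothetical names, not part of any published tool.

```python
import re
from collections import defaultdict


def match_key(name):
    """Reduce a gene or protein name to a key that ignores non-contrastive variation."""
    key = re.sub(r"\([^)]*\)", " ", name)   # parenthesized material is optional
    key = key.lower()                        # case insensitivity
    key = key.replace("-", " ")              # hyphens are optional (treated as spaces)
    key = re.sub(r"[aeiou]+", "a", key)      # vowel sequences are equivalent
    return " ".join(key.split())             # normalize whitespace


def group_variants(names):
    """Group names that share the same matching key."""
    groups = defaultdict(list)
    for name in names:
        groups[match_key(name)].append(name)
    return dict(groups)


groups = group_variants(["Tumour-protein (TP)", "tumor protein", "ACE1", "ACE2"])
```

Because the key leaves the ends of the name intact, contrastive pairs such as ACE1 and ACE2 stay apart. The key is deliberately coarse, so unrelated names that differ only in their vowels would also collide; a real index would need a check against that.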

Tagging Genes and Proteins • [Tanabe and Wilbur] • ABGene • Trained on MEDLINE abstracts • Tested on PUBMED full texts

ABGene • Transformation-based tagger • False-positive and false-negative filters • Compound term recovery • Document ranking

Transformation-Based Tagging • Learns a sequence of transformation rules of the form A -> B / C (change tag A to tag B in context C) • Rules are chosen greedily, based on the number of errors each corrects in the training data tags • Applies the rules sequentially to tag new text
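
The learn-then-apply loop can be sketched in a few lines of Python. This is a toy illustration under several assumptions: rules are applied left to right with immediate effect, training uses a single sentence, and the context predicates `next_word_is` and `next_tag_is` are simplified stand-ins rather than the rule templates of any real tagger.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Tagged = List[Tuple[str, str]]  # (word, tag) pairs


@dataclass
class Rule:
    from_tag: Optional[str]                  # None plays the role of "*"
    to_tag: str
    context: Callable[[Tagged, int], bool]   # the "C" in A -> B / C


def next_word_is(word):
    return lambda sent, i: i + 1 < len(sent) and sent[i + 1][0].lower() == word


def next_tag_is(tag):
    return lambda sent, i: i + 1 < len(sent) and sent[i + 1][1] == tag


def apply_rules(sent, rules):
    """Apply each rule in order, scanning the sentence left to right."""
    out = list(sent)
    for rule in rules:
        for i, (word, tag) in enumerate(out):
            if rule.from_tag in (None, tag) and rule.context(out, i):
                out[i] = (word, rule.to_tag)
    return out


def count_errors(sent, gold_tags):
    return sum(tag != gold for (_, tag), gold in zip(sent, gold_tags))


def learn(sent, gold_tags, candidates, max_rules=10):
    """Greedily pick the candidate rule that corrects the most errors, then repeat."""
    learned, current = [], list(sent)
    while len(learned) < max_rules:
        base = count_errors(current, gold_tags)
        best, best_gain = None, 0
        for rule in candidates:
            gain = base - count_errors(apply_rules(current, [rule]), gold_tags)
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None:
            break
        learned.append(best)
        current = apply_rules(current, [best])
    return learned


# Hypothetical candidate rules and a one-sentence training set.
nnp_rule = Rule("NNP", "GENE", next_word_is("gene"))
vbg_rule = Rule("VBG", "JJ", next_tag_is("GENE"))
initial = [("activating", "VBG"), ("BRCA1", "NNP"), ("gene", "NN")]
gold = ["JJ", "GENE", "NN"]

learned = learn(initial, gold, [vbg_rule, nnp_rule])
retagged = apply_rules(initial, learned)
```

The example shows why order matters: the VBG rule corrects nothing until the NNP rule has introduced a GENE tag, so the greedy learner places the NNP rule first.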

Gene Transformations • GENE added as additional POS tag • NNP -> GENE / gene fgoodleft • * -> GENE / hassuf -A • * -> GENE / haspref c- • NNP -> GENE / prev1or2wd genes • NNP -> GENE / nextbigram ( GENE • VBG -> JJ / nexttag GENE

Results • Precision up to 0.74 • Recall up to 0.64 • depending on score threshold
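
To make the dependence on the score threshold concrete, the sketch below computes precision and recall for a set of scored candidate names at two thresholds. The scores and gold labels are invented for illustration and are not ABGene results; `precision_recall` is a hypothetical helper.

```python
def precision_recall(scored, gold, threshold):
    """Keep candidates scoring at or above the threshold, then compare with gold."""
    predicted = {name for name, score in scored.items() if score >= threshold}
    true_positives = len(predicted & gold)
    # Convention (assumption): report 0.0 when there is nothing to divide by.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up candidate scores and gold-standard gene names.
scored = {"BRCA1": 0.9, "p53": 0.8, "PCR": 0.6, "TNF": 0.4, "DNA": 0.2}
gold = {"BRCA1", "p53", "TNF"}

strict = precision_recall(scored, gold, threshold=0.7)
lenient = precision_recall(scored, gold, threshold=0.3)
```

Raising the threshold keeps only high-scoring candidates, which tends to raise precision and lower recall; lowering it has the opposite effect.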

Problems in Full Text • Terms that do not appear in abstracts • restriction enzyme site, lab protocol kits, primers, vectors, supply companies, chemical reagents • Figures and tables

Summary • Common thread in biomedical information extraction: normalization is hard!
