Information Extraction from the Cancer Literature The Pediatric Hematology/Oncology Seminar Series

Children’s Hospital of Philadelphia
March 8, 2005
Philadelphia, PA

A Global Challenge
[Diagram: two data domains connected through text. Clinic: phenotype, patient records, test results, clinical reports, procedures, phone calls. Cell: DNA sequence, genomic variation, microarrays, RNAi, protein interactions. Example terms in the text: leukemia, MDS1. Connecting technology: natural language understanding.]

Too Much Text
• Biomedical text:
  • 15 million articles
  • 1.5 billion words
• Solution 1: Approximate
  • What you can find
  • What finds you
• Solution 2: Read everything
  • Leukemia: 181,394 articles
  • 20/day = 25 years
  • 385,034 new articles by then
• Solution 3: Impose structure on the descriptions
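The "read everything" estimate is plain arithmetic. A minimal sketch that reproduces it, assuming a 365-day reading year with no days off (the slide does not say how the 25 years was rounded):

```python
# Figures taken from the slide.
LEUKEMIA_ARTICLES = 181_394
ARTICLES_PER_DAY = 20
NEW_ARTICLES_BY_THEN = 385_034

# Assumption: reading happens every day of a 365-day year.
DAYS_PER_YEAR = 365


def years_to_read(n_articles, per_day, days_per_year=DAYS_PER_YEAR):
    """Years needed to read n_articles at a fixed daily rate."""
    return n_articles / per_day / days_per_year


years = years_to_read(LEUKEMIA_ARTICLES, ARTICLES_PER_DAY)
# Average number of new leukemia articles appearing per day over that period.
new_per_day = NEW_ARTICLES_BY_THEN / (years * DAYS_PER_YEAR)

print(f"{years:.1f} years to read the backlog")
print(f"{new_per_day:.0f} new articles per day in the meantime")
```

Under these assumptions the backlog grows about twice as fast as it can be read, which is the motivation for Solution 3.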

IE Process
• Phase 1: Domain selection and definition
• Phase 2: Manual annotation
• Phase 3: Create and train machine-learning algorithms
• Phase 4: “Active Annotation”
• Phase 5: Utilization of annotations

Domain
• Biological Domains
  • Genomic variations in malignancy
  • Neuroblastoma
• Entity Classes
  • Genes (genes, transcripts, proteins)
  • Genomic variations (type, location, state)
  • Malignant type
  • Malignancy attributes
    • Developmental state
    • Clinical stage
    • Histology
    • Malignancy site
    • Differentiation status
    • Heredity status

Document Sets
• MEDLINE: Abstracts --> Full Text
• Annotation training set: 4,000 MEDLINE abstracts
  • Genes commonly mutated in various malignancies
  • Genes implicated in neuroblastoma
• Abstracts are manually annotated (dual pass)
• Results are used to train automated taggers

Workflow Management

Extraction Process
MDS1 gene alterations often cause leukemia

Parsing
MDS1 gene alterations often cause leukemia
Separate: MDS1 | gene | alterations | often | cause | leukemia
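The separation step can be illustrated with a small tokenizer. This is only a sketch: the regular expression is an assumption chosen to keep hyphenated gene names such as VEGFR-2 in one piece, and it is not the pretagger described later in the talk.

```python
import re

# Assumption: a token is either a run of letters/digits (optionally joined by
# internal hyphens, so "VEGFR-2" stays whole) or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]")


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)


tokens = tokenize("MDS1 gene alterations often cause leukemia")
print(tokens)
```

Real biomedical text needs more care than this (for example, mutation notations such as "A->G"), which is one reason tokenization is treated as its own pipeline stage.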

Part-of-speech Tagging
Separate: MDS1 | gene | alterations | often | cause | leukemia
Grammar: MDS1 (noun) | gene (noun) | alterations (plural noun) | often (adverb) | cause (verb) | leukemia (noun)

Part-of-speech Tagging


Named Entity Recognition
Grammar: MDS1 (noun) | gene (noun) | alterations (plural noun) | often (adverb) | cause (verb) | leukemia (noun)
Label: MDS1 gene (Gene) | alterations (Process) | leukemia (Disease)

Definitions: Process
• Initial Definitions: Domain Experts
  • Analyze representative subset of text mentions
  • Input of specific knowledge
• Manual Annotation
  • Tag text with initial definitions
  • Iterative re-definition process
  • More text: Tighter and more robust definitions
• Widen Domain Expertise
• Publication and Utilization

Definitions
[Diagram: gene entities comprise genes, transcripts, proteins, and other gene entities; levels shown are individual gene, gene family, and gene superfamily.]

Definitions
Gene

The Gene-Entity category includes genes as well as their downstream products, such as transcripts and proteins, in addition to the more general groups of gene and protein families, superfamilies, and so forth. Note that the category name 'Gene-Entity' is not a completely accurate description of the members of this class, since the category includes things other than genes. However, most things in this class are genes, and everything is either a gene or gene-derived (transcripts and proteins). The diagram that follows attempts to illustrate this point and provides some examples.

What is and What is Not Included?

There are two ways to think about genes.

1. Genes as conceptual entities. (This is what we want to capture.) Genes refer to segments of the genome that have been identified with a specific function or product (for example, the gene for eye color in a fly or a membrane receptor in humans). Although they are "things", they really represent abstract concepts. We can talk about the gene "K-Ras", but we are really referring to an abstract concept: an "ideal form" of the K-Ras gene, which has known attributes. We can’t point to K-Ras; we can only point to instances of K-Ras. Each of these instances (a specific manifestation of the gene as described in #2 below) has the attributes and characteristics of the abstract concept of K-Ras, but the different instances of K-Ras may vary slightly from one another. (This parallels the concept of "species". We all have an intuitive grasp of the species concept and can tell most species apart: a grizzly bear from a polar bear. However, when we visit the zoo we encounter instances of a species, individual bears, and not the concept itself.) Although this may seem pedantic, there is an important reason for making this distinction, which we’ll describe below. Let’s consider some examples based upon this logic:

a. For genes: c-kit, CD117, and alpha-smooth muscle actin.

b. A non-biology example: a 2003 Ferrari Modena. This is an abstract concept for a specific type of car. However, you can’t point to an abstract 2003 Ferrari Modena; you can only point to specific instances, which may vary, even if slightly, from one another.

c. K-Ras as investigated in Bob. This can be a tricky example, since it would appear as though we are talking about a specific instance of K-Ras. But remember, in nearly all cases, genes are paired in humans (sometimes there are even more

Definitions
• Confounding Issues:
  • Levels of specificity
    • Protein/enzyme/kinase/tyrosine kinase/NTRK1
    • TRK antibody
    • Colon cancer vs. cancer of the colon
  • Boundary issues
    • Retinoblastoma
    • Head and neck cancer
    • MEN type 2B syndrome

Entity Annotation


Syntactic Analysis
Label: MDS1 gene (Gene) | alterations (Process) | leukemia (Disease)
Syntax: [Parse tree diagram of "MDS1 gene alterations often cause leukemia", with constituents labeled noun phrase, verb phrase, and adverb phrase; spans shown include "alterations often cause leukemia", "often cause leukemia", "cause leukemia", and "leukemia".]

Treebanking


Relation Tagging
Relationships in "MDS1 gene alterations often cause leukemia":
• Object: MDS1 gene
• Event: alterations
• Frequency: often
• Action: cause
• Result: leukemia

Relation Tagging

Annotation Viewer

Annotations

Automated Algorithms
• Pretagger
  • Assigns token, sentence, paragraph, section boundaries
  • Nearly 100% accuracy
  • Pipeline implementation: Finished
• Bio Part-of-speech tagger
  • Assigns part-of-speech tags to tokens
  • Uses pretagging annotations
  • Accuracy of 97.3%
  • Pipeline implementation: Finished

Entity Taggers
• Entity Taggers: Automated, machine-learning algorithms for named entity recognition in text
• Goals
  • Highly accurate, precision > recall
  • Rapid deployment
  • Flexible design
• Technique
  • Conditional random fields
  • Text feature-based
  • Uses pretagging, POS annotations
  • Probabilistic maximization of feature weights
  • Corrects for overfitting

Entity Taggers
• GeneTaggerCRF
  • Tags gene symbols, names, and descriptions
    • KDR, VEGFR-2, VEGF receptor-2
    • vascular endothelial growth factor receptor type 2
  • 86% precision/79% recall
  • Pipeline implementation: Imminent
• VTag
  • Simultaneously tags variation types, locations, states
    • point mutation, loss of heterozygosity
    • codon 12, 11q23, base pair 17, Ki-ras
    • GGT, glycine, Asp
  • 85% precision/79% recall
  • Pipeline implementation: Imminent

Entity Taggers
• Mtag
  • Tags malignant type labels
    • acute myeloid leukemias (AMLs)
    • translocation t(9;11)-positive leukemia
    • NB
    • transitional cell carcinoma of the bladder
    • Hypoplastic myelodysplastic syndrome
    • predominantly cystic bilateral neuroblastomas
  • 85% precision/82% recall
  • Pipeline implementation: Imminent

Entity Taggers

Relation Tagger
• Relation Taggers: Identifying relationships between entities
• Given this text:
  • Missense mutation at codon 45 (TCT to TTT)
• Can we automatically identify:
  • 1. Pairwise associations [(codon 45 and TCT); (TCT and TTT); etc.]
  • 2. The entire mutation event:
    • VARIATION EVENT #60609
    • Variation type: missense mutation
    • Variation location: codon 45
    • Variation state 1: TCT
    • Variation state 2: TTT
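The two targets, pairwise associations and the full event, can be pictured as one record and the entity pairs it implies. A minimal sketch: the class name `VariationEvent` and its field names are hypothetical, and treating every pair of entities within an event as an association is an assumption (the slide lists only two pairs and "etc.").

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class VariationEvent:
    """Hypothetical record for one genomic variation event."""
    variation_type: str
    location: str
    state_1: str
    state_2: str

    def entities(self):
        return [self.variation_type, self.location, self.state_1, self.state_2]

    def pairwise_associations(self):
        # Assumption: every pair of entities in the same event is associated.
        return list(combinations(self.entities(), 2))


event = VariationEvent("missense mutation", "codon 45", "TCT", "TTT")
pairs = event.pairwise_associations()
for pair in pairs:
    print(pair)
```

A four-entity event yields six pairs under this assumption, which shows why an event is more than a single binary relation.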

Relation Tagger
• Goals: Accurate, rapid, flexible
• Technique
  • Maximum entropy
  • Feature-based probabilistic model
  • Events built upon binary associations
  • Uses pretagging, POS, and entity annotations
• Domain
  • Genomic variation events
  • Tested on 447 abstracts: 1218 relations, 4773 entities
  • 38% of relations were non-binary
• Baseline: Two entities within 5 words = related
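The proximity baseline is simple enough to sketch directly. The distance definition below is an assumption, since the slide says only "within 5 words": entities are given as hypothetical `(text, start_token, end_token)` tuples with an exclusive end, and the gap is the number of tokens between the end of the earlier entity and the start of the later one.

```python
from itertools import combinations


def token_gap(a, b):
    """Tokens between two entity spans; 0 if they touch or overlap."""
    first, second = sorted([a, b], key=lambda entity: entity[1])
    return max(0, second[1] - first[2])


def baseline_relations(entities, max_gap=5):
    """Call two entities related when they lie within max_gap tokens."""
    return [
        (a[0], b[0])
        for a, b in combinations(entities, 2)
        if token_gap(a, b) <= max_gap
    ]


# "Missense mutation at codon 45 (TCT to TTT)" split on whitespace:
# 0 Missense, 1 mutation, 2 at, 3 codon, 4 45, 5 (TCT, 6 to, 7 TTT)
entities = [
    ("missense mutation", 0, 2),
    ("codon 45", 3, 5),
    ("TCT", 5, 6),
    ("TTT", 7, 8),
]
related = baseline_relations(entities)
print(related)
```

In a short phrase like this one the baseline links every pair, which hints at why it trades precision for recall.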

Relation Tagger
• Results
  • Binary
    • Tagger: 77% precision/82% recall
    • Baseline: 66% precision/77% recall
  • Event-wide
    • Tagger: 63% precision/77% recall
    • Baseline: 43% precision/66% recall
• Example
  • “most common base change was a A->G transition at codon 12 or 13”
  • Manual annotation:
    • (transition, codon 12, A, G)
    • (transition, codon 13, A, G)
  • Automated annotation:
    • (transition, codon 12, A, G)
    • (transition, codon 13, A, G)
    • (base change, codon 12, A, G)
    • (base change, codon 13, A, G)
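Scoring the single example shows how the extra "base change" events cost precision but not recall. This sketch assumes exact-match scoring of whole event tuples; the slide does not state how the event-wide figures were computed.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over sets of event tuples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


manual = {
    ("transition", "codon 12", "A", "G"),
    ("transition", "codon 13", "A", "G"),
}
automated = manual | {
    ("base change", "codon 12", "A", "G"),
    ("base change", "codon 13", "A", "G"),
}

precision, recall = precision_recall(automated, manual)
print(f"precision={precision:.0%} recall={recall:.0%}")
```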

Data Management

Annotation Pipeline
[Diagram stages: Document, Pretagging, POS tagging, Entity tagging, Relation tagging, Treebanking, Propbanking, Database, Normalization, Integration, Interface]

Annotation Pipeline
Carolyn Felix

Annotation Retrieval
Biomedical Annotation Database

Applications: Entity Lists
• What is this all good for, anyway?
• Objective: To align the literature with genomic objects
• Goal: Can we replicate a manually curated list of genes implicated in a biological process?
• Domain: Angiogenesis
• Rationale: To focus on the subset of genes implicated in the process of angiogenesis from whole-genome expression profiling

Applications: Entity Lists
• The manual list
  • Genes represented on the Affy U133 chips
  • 340 genes, identified through:
    • Prior knowledge
    • Literature reviews
    • PubMed searches
    • Gene Ontology codes
    • Gene family-based inference

Applications: Entity Lists
• The automated list
  • Twelve partially specific angiogenic terms
  • Concordancy searching of MEDLINE: 41,276 abstracts
  • Trained GeneTaggerCRF with ~100 hand-annotated angiogenesis abstracts
  • Tagged the document set
    • 104,118 mentions
    • 22,662 non-redundant mentions

Applications: Entity Lists
• Normalization
  • Human gene/alias/identifier list
    • Compiled identifiers from 19 public databases
    • 302,976 entries
    • 156,860 non-redundant entries
    • All entries mapped to 25,096 “official” gene symbols
  • Aligned normalized gene and tagged gene lists
    • 50.01% of entries matched a known gene term
    • 2,389 identified genes
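Normalization amounts to looking up each tagged mention in an alias table and counting hits per official symbol. A minimal sketch: the two-gene `ALIASES` table is a hypothetical stand-in for the compiled identifier list, and the matching rule (lowercase, then drop everything except letters and digits) is an assumption, not the method used in the talk.

```python
import re
from collections import Counter

# Hypothetical miniature alias table: official symbol -> known aliases.
ALIASES = {
    "KDR": ["KDR", "VEGFR-2", "VEGF receptor-2", "kinase insert domain receptor"],
    "VEGF": ["VEGF", "vascular endothelial growth factor"],
}


def match_key(text):
    """Assumed matching rule: case-insensitive, ignoring spaces and punctuation."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


LOOKUP = {
    match_key(alias): symbol
    for symbol, aliases in ALIASES.items()
    for alias in aliases
}


def normalize(mentions):
    """Count mentions per official symbol; collect mentions with no match."""
    counts = Counter()
    unmatched = []
    for mention in mentions:
        symbol = LOOKUP.get(match_key(mention))
        if symbol is None:
            unmatched.append(mention)
        else:
            counts[symbol] += 1
    return counts, unmatched


mentions = ["VEGFR-2", "vegf receptor 2", "KDR", "VEGF", "angiogenic factor"]
counts, unmatched = normalize(mentions)
print(counts.most_common())
print(unmatched)
```

The per-symbol counts are what a frequency-ranked gene list is built from, and the unmatched mentions correspond to tagged strings that align with no known gene term.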

Applications: Entity Lists

Gene     Description                                          Frequency
VEGF     Vascular endothelial growth factor                   9688
NUDT6    Antisense basic fibroblast growth factor             1887
FGF2     Fibroblast growth factor 2 (basic)                   1463
KDR      Kinase insert domain receptor                        1287
TGFB1    Transforming growth factor, beta 1                   909
TNF      Tumor necrosis factor                                908
FLT1     Fms-related tyrosine kinase 1 (VEGF/VPF receptor)    880
MMP2     Matrix metalloproteinase 2                           598
IL8      Interleukin 8                                        571
IL28B    Interleukin 28B                                      559
PECAM1   Platelet/endothelial cell adhesion molecule          558
ECGF1    Endothelial cell growth factor 1                     545
EGF      Epidermal growth factor                              524
TP53     Tumor protein p53                                    524
THBS1    Thrombospondin 1                                     501
PTGS2    Prostaglandin-endoperoxide synthase 2                427
FN1      Fibronectin 1                                        407
IL6      Interleukin 6                                        407

Applications: Entity Lists
• Accuracy:
  • 247 (72.6%) of manual genes on the automated list
  • 91 (26.8%) of manual genes had no literature support
  • 2 (0.6%) of manual genes were missed for technical reasons
  • Overall, 99.2% recall
• Prediction:
  • Relevance ranked auto-tagged genes by number of mentions
  • Evaluated the top 40 NOT on the manual list
  • All 40 appear to be legitimate angiogenesis-related genes
  • Gene Ontology (GO): 42 human genes associated with “angiogenesis” or related terms
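The accuracy figures can be checked with a few lines of arithmetic. The recall denominator is an assumption: excluding the 91 genes with no literature support, so that only genes actually findable in the text count, is one reading that reproduces the reported 99.2%.

```python
# Counts from the slide; the manual list has 340 genes.
MANUAL_TOTAL = 340
ON_AUTOMATED_LIST = 247
NO_LITERATURE_SUPPORT = 91
MISSED_TECHNICAL = 2


def percent(part, whole):
    return round(100 * part / whole, 1)


shares = {
    "on automated list": percent(ON_AUTOMATED_LIST, MANUAL_TOTAL),
    "no literature support": percent(NO_LITERATURE_SUPPORT, MANUAL_TOTAL),
    "missed (technical)": percent(MISSED_TECHNICAL, MANUAL_TOTAL),
}

# Assumption: recall is measured only over genes that have literature support.
findable = MANUAL_TOTAL - NO_LITERATURE_SUPPORT
recall = percent(ON_AUTOMATED_LIST, findable)

for label, value in shares.items():
    print(f"{label}: {value}%")
print(f"recall over findable genes: {recall}%")
```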

Applications: Entity Lists

Gene     Description                                  Frequency
NUDT6    Antisense basic fibroblast growth factor     1887
TNF      Tumor necrosis factor                        908
IL28B    Interleukin 28B                              559
EGF      Epidermal growth factor                      524
TP53     Tumor protein p53                            524
FN1      Fibronectin 1                                407
IL6      Interleukin 6                                407
CD34     CD34 antigen                                 384
EGFR     Epidermal growth factor receptor             373
IL1B     Interleukin 1, beta                          323
PCNA     Proliferating cell nuclear antigen           277
SOS1     Son of sevenless homolog 1                   243
FGF1     Fibroblast growth factor 1 (acidic)          239
TM7SF2   Transmembrane 7 superfamily member 2         230
GALGT2   4-GalNAc transferase                         229
PRAP1    Proline-rich acidic protein 1                219
BMP6     Bone morphogenetic protein 6                 202
BCL2     B-cell CLL/lymphoma 2                        201

Applications: Directed Retrieval
• Locus-specific Databases: Repositories of recorded mutation information
  • > 300 human genes
  • > 100 databases
  • Highly curated
  • Limited resources
• CDKN2A database: Somatic and germline p16 mutations
  • Over 1400 mutation instances
  • Primarily identified through manual literature perusal
  • Large and inefficient effort
  • < 20% of identified articles contain mutation instances

Applications: Directed Retrieval
• Experiment: Identify mutation instance-containing articles from “relevant” articles
• Literature search of PubMed using p16 key words:
  • 418 articles (1/2000 to 6/2002)
  • 78 articles contained mutation data (experts)
• Training
  • 218 articles
  • Logistic regression classifier
  • Features: words and word pairs
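A logistic regression classifier over word and word-pair features can be sketched in a few lines. Several things here are assumptions: "word pairs" is read as adjacent word pairs, features are binary (present or absent), and the weights are made-up illustrative values rather than anything learned from the 218 training articles.

```python
import math


def extract_features(text):
    """Binary features: each word and each adjacent word pair (assumed)."""
    words = text.lower().split()
    pairs = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return set(words) | set(pairs)


def predict_probability(text, weights, bias=0.0):
    """Logistic model: sigmoid of the summed weights of active features."""
    score = bias + sum(weights.get(f, 0.0) for f in extract_features(text))
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical weights for illustration only.
weights = {"germline_mutation": 2.0, "codon": 1.0, "review": -1.5}

with_mutation = predict_probability("p16 germline mutation at codon 45", weights, bias=-1.0)
without_mutation = predict_probability("a review of p16 biology", weights, bias=-1.0)
print(f"{with_mutation:.2f} vs {without_mutation:.2f}")
```

Because the output is a probability, the same model also gives a relevance ranking: articles can be sorted by score and read from the top.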

Applications: Directed Retrieval
• Evaluation
  • Experts
    • Identified 200 candidate articles
    • 32 articles contained mutation information
    • 16% precision; ~100%(?) recall; F-measure 0.28
  • Algorithm
    • Predicted that 88 of the 200 articles contained relevant info
    • 29 of 32 with relevant info identified
    • 44% precision; 91% recall; F-measure 0.59
  • Second random trial: comparable results
• Relevance ranking: Associated with value
• In progress: refinement of relevance with text annotations
• Conclusion: automation significantly reduces workload
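The F-measures follow from the reported precision and recall. A minimal sketch, assuming the balanced F-measure (harmonic mean of precision and recall); the algorithm's 44% precision and 91% recall are taken as reported rather than re-derived from counts.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Experts: 32 of their 200 candidate articles contained mutation information,
# and recall is taken as 100% as on the slide.
expert_precision = 32 / 200
expert_f = f_measure(expert_precision, 1.0)

# Algorithm: precision and recall as reported on the slide.
algorithm_f = f_measure(0.44, 0.91)

print(f"experts:   F = {expert_f:.2f}")
print(f"algorithm: F = {algorithm_f:.2f}")
```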

The Global Challenge
• What is MYCN?
• What is MYCN related to?
• How?
Genes, Proteins, Pathways, Cells, Tissues, Phenotypes, Traits, Diseases, Behaviors, Environment

Integration
[Diagram: MYCN linked to Genome, Cell, Literature, and Disease. Attributes shown: cellular location, genomic position, genomic context, protein function, known alteration, cell type, disease association, symptom, environmental factor, clinical observation.]
