
Supporting Annotation Layers for Natural Language Processing


Presentation Transcript


1. Supporting Annotation Layers for Natural Language Processing
Marti Hearst, Preslav Nakov, Ariel Schwartz, Brian Wolf, Rowena Luk
UC Berkeley
Stanford InfoSeminar, March 17, 2006
Supported by NSF DBI-0317510 and a gift from Genentech

2. Outline
• Motivation: NLP tasks
• System Description
  • Annotation architecture
  • Sample queries
• Database Design and Evaluation
• Related Work
• Future Work

3. Double Exponential Growth in Bioscience Journal Articles (from Hunter & Cohen, Molecular Cell 21, 2006)

4. BioText Project Goals
• Provide flexible, intelligent access to information for use in biosciences applications.
• Focus on:
  • Textual information from journal articles
  • Tightly integrated with other resources:
    • Ontologies
    • Record-based databases

5. Project Team
• Project Leaders:
  • PI: Marti Hearst
  • Co-PI: Adam Arkin
• Computational Linguistics and Databases:
  • Preslav Nakov
  • Ariel Schwartz
  • Brian Wolf
  • Barbara Rosario (alum)
  • Gaurav Bhalotia (alum)
• User Interface / IR:
  • Rowena Luk
  • Dr. Emilia Stoica
• Bioscience:
  • Janice Hamerja
  • Dr. TingTing Zhang (alum)

6. BioText Architecture
Sophisticated Text Analysis → Annotations in Database → Improved Search Interface

  7. Sample Sentence “Recent research, in proliferating cells, has demonstrated that interaction of E2F1 with the p53 pathway could involve transcriptional up-regulation of E2F1 target genes such as p14/p19ARF, which affect p53 accumulation [67,68], E2F1-induced phosphorylation of p53 [69], or direct E2F1-p53 complex formation [70].”

8. Motivation
• Most natural language processing (NLP) algorithms make use of the results of previous processing steps:
  • Tokenizer
  • Part-of-speech tagger
  • Phrase boundary recognizer
  • Syntactic parser
  • Semantic tagger
• There is no standard way to represent, store, and retrieve text annotations efficiently.
• MEDLINE has close to 13 million abstracts. Full text has started to become available as well.

9. System Overview
• A system for flexible querying of text that has been annotated with the results of NLP processing.
• Supports:
  • self-overlapping and parallel layers,
  • integration of syntactic and ontological hierarchies,
  • and tight integration with SQL.
• Designed to scale to very large corpora.
  • Most NLP annotation systems assume in-memory usage.
  • We have evaluated indexing architectures.

10. Text Annotation Framework
• Annotations are stored independently of the text in an RDBMS.
• Declarative query language for annotation retrieval.
• Indexing structure designed for efficient query processing.

11. Key Contributions
• Support for hierarchical and overlapping layers of annotation.
• Querying multiple levels of annotations simultaneously.
• First to evaluate different physical database designs for an NLP annotation architecture.

12. Layers of Annotations
• Each annotation represents an interval spanning a sequence of characters (absolute start and end positions).
• Each layer corresponds to a conceptually different kind of annotation (protein, MeSH label, noun phrase).
• Layers can be:
  • Sequential
  • Overlapping: two multiple-word concepts sharing a word (see the sketch after this slide)
  • Hierarchical, in two different ways:
    • spanning, when the intervals are nested as in a parse tree, or
    • ontological, when the token itself is derived from a hierarchical ontology
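To make the stand-off interval idea concrete, here is a minimal sketch in SQL. The table, column names, offsets, and example phrase are our own illustration (the system's actual schema appears in the Database Design slides); it shows two overlapping concepts that share the word "immunodeficiency":

  -- Hypothetical stand-off storage for "human immunodeficiency virus":
  -- the two 'concept' rows overlap on characters 6..22 (the shared word),
  -- which is unproblematic because annotations live apart from the text.
  CREATE TABLE demo_annotation (
    docid          INTEGER,
    layer          VARCHAR(16),   -- e.g. 'word' or 'concept'
    start_char_pos INTEGER,       -- absolute start position within the section
    end_char_pos   INTEGER,       -- absolute end position within the section
    content        VARCHAR(64)
  );

  INSERT INTO demo_annotation VALUES
    (1, 'word',     0,  5, 'human'),
    (1, 'word',     6, 22, 'immunodeficiency'),
    (1, 'word',    23, 28, 'virus'),
    (1, 'concept',  0, 22, 'human immunodeficiency'),
    (1, 'concept',  6, 28, 'immunodeficiency virus');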

  13. Layers of Annotations

  14. Layers of Annotations

  15. Layers of Annotations

  16. Layers of Annotations Full parse, sentence and section layers are not shown.

17. Example: Query for Noun Compound Extraction
Goal: find noun phrases consisting ONLY of 3 nouns
• plastic water bottle
• blue water bottle
• big plastic water bottle

  FROM
    [layer='shallow_parse' && tag_name='NP'
      ^ [layer='pos' && tag_name="noun"]
        [layer='pos' && tag_name="noun"]
        [layer='pos' && tag_name="noun"] $
    ] AS compound
  SELECT compound.content

18. Query for Noun Compound Extraction (SQL wrapping)

  SELECT LOWER(compound.content) AS compound_text, COUNT(*) AS freq
  FROM (
    BEGIN_LQL
      [layer='shallow_parse' && tag_name='NP'
        ^ [layer='pos' && tag_name="noun"]
          [layer='pos' && tag_name="noun"]
          [layer='pos' && tag_name="noun"] $
      ] AS compound
      SELECT compound.content
    END_LQL
  ) AS lql
  GROUP BY LOWER(compound.content)
  ORDER BY freq DESC

19. Query for Noun Compound Extraction (using artificial layers)
Goal: find noun phrases which have EXACTLY two nouns at the end, but no nouns before those two.
• "big blue water bottle"
• "plastic water bottle"

  FROM
    [layer='shallow_parse' && tag_name='NP'
      ^ ( { ALLOW GAPS }
          ![layer='pos' && tag_name="noun"]
          ( [layer='pos' && tag_name="noun"]
            [layer='pos' && tag_name="noun"] ) $
        ) $
    ] AS compound
  SELECT compound.content

20. Example: Paraphrases
• Want to find phrases with certain variations: immunodeficiency virus(?es) in ?the human(?s)
  • immunodeficiency virus in humans
  • immunodeficiency viruses in humans
  • immunodeficiency virus in the human
  • immunodeficiency virus in a human

21. Query for Paraphrases (optional layers and disjunction)

  [layer='sentence'
    [layer='pos' && tag_name="noun" && content = "immunodeficiency"]
    [layer='pos' && tag_name="noun" && content IN ("virus","viruses")]
    [layer='pos' && tag_name='IN'] AS prep
    ?[layer='pos' && tag_name='DT' && content IN ("the","a","an")]
    [layer='pos' && tag_name="noun" && content IN ("human","humans")]
  ]
  SELECT prep.content

22. Example: Protein-Protein Interactions
• Find all sentences that consist of:
  • an NP containing a gene, followed by
  • a morphological variant of the verb "activate", "inhibit", or "bind", followed by
  • another NP containing a gene.
[Diagram: Sentence: protein → activate(d,ing) / inhibit(ed,ing) / bind(s,ing) → protein]

23. Query for Protein-Protein Interactions

  SELECT p1_text, verb_content, p2_text, COUNT(*) AS cnt
  FROM (
    BEGIN_LQL
      [layer='sentence' { ALLOW GAPS }
        [layer='shallow_parse' && tag_name='NP'
          [layer='gene'] $
        ] AS p1
        [layer='pos' && tag_name="verb" &&
          (content ~ "activate%" || content ~ "inhibit%" || content ~ "bind%")
        ] AS verb
        [layer='shallow_parse' && tag_name='NP'
          [layer='gene'] $
        ] AS p2
      ]
      SELECT p1.text AS p1_text, verb.content AS verb_content, p2.text AS p2_text
    END_LQL
  ) lql
  GROUP BY p1_text, verb_content, p2_text
  ORDER BY COUNT(*) DESC

24. Protein-Protein Interactions: Sample Output

25. Example: Chemical-Disease Interactions
• "A new approach to the respiratory problems of cystic fibrosis is dornase alpha, a mucolytic enzyme given by inhalation."
• Goal: extract the relation that dornase alpha (potentially) prevents cystic fibrosis.
• The MeSH C06.689 subtree contains pancreatic diseases.
• MeSH supplementary concepts represent chemicals.

26. Query on Disease-Chemical Interactions

27. Query on Disease-Chemical Interactions

  [layer='sentence' { NO ORDER, ALLOW GAPS }
    [layer='shallow_parse' && tag_name='NP'
      [layer='chemicals'] AS chemical $
    ]
    [layer='shallow_parse' && tag_name='NP'
      [layer='mesh' && tree_number BELOW 'C06.689%'] AS disease $
    ]
  ] AS sent
  SELECT chemical.text, disease.text, sent.text

  28. Results: Chemical-Disease

  29. Query Translation

  30. Database Design & Evaluation

31. Database Design
• Evaluated 5 different logical and physical database designs.
• The basic model is similar to that of TIPSTER (Grishman, 1996): each annotation is stored as a record in a relation.
• Architecture 1 contains the following columns:
  • docid: document ID;
  • section: title, abstract, or body text;
  • layer_id: a unique identifier of the annotation layer;
  • start_char_pos: starting character position, relative to the given section and docid;
  • end_char_pos: ending character position, relative to the given section and docid;
  • tag_type: a layer-specific unique token identifier.
• A separate table maps token IDs to entities (the string in the case of a word, the MeSH label(s) in the case of a MeSH term, etc.). A DDL sketch follows this slide.
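A minimal DDL sketch of Architecture 1 as just described. The column names follow the slide; the data types, constraints, and the name of the token-mapping table are our assumptions, not the deployed DB2 schema:

  -- One record per annotation, stand-off from the text.
  CREATE TABLE annotation (
    docid          INTEGER  NOT NULL,  -- document ID (e.g., PMID)
    section        CHAR(1)  NOT NULL,  -- 't'itle, 'a'bstract, or 'b'ody
    layer_id       SMALLINT NOT NULL,  -- annotation layer (word, POS, gene, ...)
    start_char_pos INTEGER  NOT NULL,  -- start, relative to (docid, section)
    end_char_pos   INTEGER  NOT NULL,  -- end, relative to (docid, section)
    tag_type       INTEGER  NOT NULL   -- layer-specific token identifier
  );

  -- Separate mapping from layer-specific token IDs to entities
  -- (the word string, the MeSH label, etc.).
  CREATE TABLE tag (
    layer_id SMALLINT     NOT NULL,
    tag_type INTEGER      NOT NULL,
    content  VARCHAR(255) NOT NULL,
    PRIMARY KEY (layer_id, tag_type)
  );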

32. Database Design (cont.)
• Architecture 2 introduces one additional column, sequence_pos, thus defining an ordering within each layer.
  • Simplifies some SQL queries, since there is no need for the "NOT EXISTS" self-joins that Architecture 1 requires when tokens from the same layer must follow each other immediately (see the comparison below).
• Architecture 3 adds sentence_id, the number of the current sentence, and redefines sequence_pos as relative to both layer_id and sentence_id.
  • Simplifies most queries, since they are often limited to a single sentence.
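An illustrative version of the adjacency test this slide alludes to, written against the sketch schema above (with sequence_pos assumed to have been added for Architecture 2); the layer and tag IDs are placeholders, not the system's real values. Both queries find a noun immediately followed by another noun:

  -- Architecture 1: "immediately follows" means no token of the same layer
  -- starts between the two, which forces a NOT EXISTS self-join.
  SELECT a.docid, a.start_char_pos, b.end_char_pos
  FROM annotation a, annotation b
  WHERE a.docid = b.docid AND a.section = b.section
    AND a.layer_id = 1 AND b.layer_id = 1        -- POS layer (placeholder ID)
    AND a.tag_type = 27 AND b.tag_type = 27      -- NN (placeholder ID)
    AND a.end_char_pos < b.start_char_pos
    AND NOT EXISTS (
      SELECT 1 FROM annotation c
      WHERE c.docid = a.docid AND c.section = a.section
        AND c.layer_id = a.layer_id
        AND c.start_char_pos > a.start_char_pos
        AND c.start_char_pos < b.start_char_pos);

  -- Architecture 2: the same adjacency becomes a simple arithmetic join.
  SELECT a.docid, a.start_char_pos, b.end_char_pos
  FROM annotation a, annotation b
  WHERE a.docid = b.docid AND a.section = b.section
    AND a.layer_id = 1 AND b.layer_id = 1
    AND a.tag_type = 27 AND b.tag_type = 27
    AND b.sequence_pos = a.sequence_pos + 1;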

33. Database Design (cont.)
• Architecture 4 merges the word and POS layers and adds word_id, assuming a one-to-one correspondence between them.
  • Reduces the number of stored annotations and the number of joins in queries with both word and POS constraints.
• Architecture 5 replaces sequence_pos with first_word_pos and last_word_pos, which correspond to the sequence_pos of the first/last word covered by the annotation.
  • Requires all annotation boundaries to coincide with word boundaries.
  • Copes naturally with adjacency constraints between different layers (see the sketch below).
  • Allows for a simpler indexing structure.
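A sketch of the cross-layer adjacency just mentioned, assuming Architecture 5's columns (sentence_id, first_word_pos, last_word_pos) on the same hypothetical table; layer and tag IDs remain placeholders. A verb and an immediately following NP live in different layers, yet join directly on word positions:

  -- "A verb immediately followed by an NP": no per-layer sequence numbers
  -- to reconcile, because both layers share the word-position coordinate.
  SELECT v.docid, v.sentence_id, v.first_word_pos AS verb_pos
  FROM annotation v, annotation np
  WHERE v.docid = np.docid
    AND v.section = np.section
    AND v.sentence_id = np.sentence_id
    AND v.layer_id = 1  AND v.tag_type = 53      -- merged word/POS layer, VB
    AND np.layer_id = 3 AND np.tag_type = 31     -- shallow-parse layer, NP
    AND np.first_word_pos = v.last_word_pos + 1; -- adjacency via word positions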

34. Data Layout for all 5 Architectures
Example: "Kinase inhibits RAG-1."

  PMID  SECTION   LAYER         START  END   TAG TYPE   SENT.  SEQ.  FIRST/LAST  WORD
                                CHAR   CHAR             ID     POS   WORD POS    ID
  3345  b (body)  0 (word)      34     40    59571      2      1     1 / 1       59571
  3345  b         0             41     49    55608      2      2     2 / 2       55608
  3345  b         0             50     55    89985      2      3     3 / 3       89985
  3345  b         1 (POS)       34     40    27 (NN)    2      1     1 / 1       59571
  3345  b         1             41     49    53 (VB)    2      2     2 / 2       55608
  3345  b         1             50     55    27 (NN)    2      3     3 / 3       89985
  3345  b         3 (s. parse)  34     40    31 (NP)    2      1     1 / 1
  3345  b         3             41     49    59 (VP)    2      2     2 / 2
  3345  b         3             50     55    31 (NP)    2      3     3 / 3
  3345  b         5 (gene)      34     40    39 (prt)   2      1     1 / 1
  3345  b         5             50     55    39         2      2     3 / 3
  3345  b         6 (mesh)      34     40    10770      2      1     1 / 1
  3345  b         6             50     55    16654      2      2     3 / 3

PMID, SECTION, LAYER, START/END CHAR POS, and TAG TYPE form the basic architecture. SEQUENCE POS is added in Architecture 2; SENTENCE ID in Architecture 3; WORD ID in Architecture 4; FIRST/LAST WORD POS (replacing SEQUENCE POS) in Architecture 5.

35. Indexing Structure
• Two types of composite indexes: forward and inverted.
  • An index lookup can be performed on any column combination that corresponds to an index prefix.
  • The forward indexes support lookup based on position in a given document.
  • The inverted indexes support lookup based on annotation values (i.e., tag type and word ID).
• Most query plans involve both forward and inverted indexes.
• Join statistics would have been useful.
• Detailed statistics are essential; the standard statistics in DB2 are insufficient.
• Records are clustered on their primary key.
(An illustrative index sketch follows.)
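The slides do not give the exact index definitions, so the following is our illustrative guess at one forward and one inverted composite index over the Architecture 5 columns; any prefix of each column list can serve a lookup:

  -- Forward: position-first, for scanning annotations of a given document.
  CREATE INDEX fwd_idx ON annotation
    (docid, section, layer_id, sentence_id, first_word_pos, tag_type);

  -- Inverted: value-first, for finding where a given tag or word occurs.
  CREATE INDEX inv_idx ON annotation
    (layer_id, tag_type, docid, section, sentence_id, first_word_pos);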

  36. Indexing Structure (cont.)

37. Experimental Setup
• Annotated 13,504 MEDLINE abstracts.
• Stanford Lexicalized Parser (Klein and Manning, 2003) for sentence splitting, word tokenization, POS tagging, and parsing.
• We wrote a shallow parser and tools for gene and MeSH term recognition.
• This resulted in 10,910,243 records stored in an IBM DB2 Universal Database server.
• Defined 4 workloads based on variants of queries.

38. Experimental Setup: 4 Workloads

(a) Protein-Protein Interaction (Blaschke et al., 1999):

  [layer='sentence' { ALLOW GAPS }
    [layer='gene'] AS gene1
    [layer='pos' && tag_name="verb" && content="binds"] AS verb
    [layer='gene'] AS gene2
  ]
  SELECT gene1.content, verb.content, gene2.content

(b) Protein-Protein Interaction (Thomas et al., 2000):

  [layer='sentence'
    [layer='shallow_parse' && tag_name="NP"] AS np1
    [layer='pos' && tag_name="verb" && content='binds'] AS verb
    [layer='pos' && tag_name="prep" && content='to']
    [layer='shallow_parse' && tag_name="NP"] AS np2
  ]
  SELECT np1.content, verb.content, np2.content

(c) Descent of Hierarchy (Rosario et al., 2002); e.g., A01 : A07 pairs such as limb:vein, shoulder:artery:

  [layer='shallow_parse' && tag_name="NP"
    [layer='pos' && tag_name="noun"
      ^ [layer='mesh' && tree_number BELOW "G07.553"] AS m1 $
    ]
    [layer='pos' && tag_name="noun"
      ^ [layer='mesh' && tree_number BELOW "D"] AS m2 $
    ]
  ]
  SELECT m1.content, m2.content

(d) Acronym-Meaning Extraction (Pustejovsky et al., 2001):

  [layer='shallow_parse' && tag_name="NP"] AS np1
  [layer='pos' && content='(']
  [layer='shallow_parse' && tag_name="NP"] AS np2
  [layer='pos' && content=')']

  39. Results

40. Results
• Architecture 5 performs well (if not best) on all query types, while the other architectures perform poorly on at least one query type.
• The storage requirement of Architecture 5 is comparable to that of Architecture 1.
• Architecture 5 results in much simpler queries.
• Conclusion: we recommend Architecture 5 in most cases, and Architecture 1 if an atomic annotation layer cannot be defined.

41. Scalability Analysis
• Combined workload of 3 query types
• Varying buffer pool sizes

42. Scalability Analysis
• Suggests that query execution time grows sub-linearly as the available memory (buffer pool) shrinks.
• We believe a similar ratio will be observed when increasing the database size while keeping the memory size fixed.
• Parallel query execution can be enabled after partitioning the annotations on the document ID; an illustrative sketch follows.
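A hedged sketch of that partitioning step. DISTRIBUTE BY HASH is the distribution clause DB2 uses in later releases (earlier versions spell it differently), and the column list is abbreviated, so treat this as illustrative rather than the project's actual configuration:

  -- Hash-partition annotations on the document ID: each document's annotations
  -- land in one partition, so independent documents can be queried in parallel.
  CREATE TABLE annotation_partitioned (
    docid          INTEGER  NOT NULL,
    section        CHAR(1)  NOT NULL,
    layer_id       SMALLINT NOT NULL,
    sentence_id    INTEGER,
    first_word_pos INTEGER,
    last_word_pos  INTEGER,
    tag_type       INTEGER  NOT NULL
  )
  DISTRIBUTE BY HASH (docid);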

43. Study on a Larger Dataset
• Annotated 1.4 million MEDLINE abstracts
• 10 million sentences
• 320 million annotations
• 70 GB total database size

44. Related Work
• Annotation graphs (AG): a directed acyclic graph; nodes can have time stamps or are constrained via paths to labeled parents and children (Bird and Liberman, 2001).
  Find arcs labeled as words whose phonetic transcription starts with "hv":
    SELECT I
    WHERE X.[id:I].Y <- db/wrd
          X.[:hv].[]*.Y <- db/phn;
• Emu system: sequential levels of annotations; hierarchical relations may exist between different levels, but must be explicitly defined for each pair (Cassidy & Harrington, 2001).
  Find sequences of phonetic "A" followed by "p", both dominated by an "S" syllable:
    [[Phonetic=A -> Phonetic=p] ^ Syllable=S]
• The Q4M query language for MATE: a directed graph; constraints and ordering of the annotated components; stored in XML (McKelvie et al., 2001).
  Find nouns followed by the word "lesser":
    ($a word) ($b word);
    ($a pos ~ "NN") && ($a <> $b) && ($b # ~ "lesser")
• TIQL (TIMS system): queries manipulate intervals of text, indicated by XML tags; supports set operations (Nenadic et al., 2002).
  Find sentences containing the noun phrase "COUP-TF II" and the verb "inhibit":
    (<SENTENCE> ∩ <TERM nf='COUP TF II'>) ∩ <V lemma='inhibit'>

  45. What about XQuery/XPath?

46. Main Advantages of the LQL System
• Stand-off annotation
  • Flexible and modular
  • Multi-layered, including overlaps
• LQL: simple yet powerful
  • Support for hierarchies
  • Optimized for cross-layer queries
  • Much more expressive than standard text search engines
• Seamless integration with SQL and RDBMSs
  • Easy integration with additional data sources
  • Simple parallelism
• Full-text support
  • Caption search
  • Formatting-aware queries
  • Flexible support for document structure

47. On the Horizon
• Full-text document support
  • Really complex in bioscience text
  • Caption search
  • Formatting-aware annotation layers
  • Flexible support for document structure
• Query simplification
  • Shorthand syntax
  • GUI helper

48. Syntax-Helper Interface

  49. Thank you! biotext.berkeley.edu/lql

  50. Overlap Example
