
Artificial Intelligence in Medicine HCA 590 (Topics in Health Sciences)


Presentation Transcript


  1. Artificial Intelligence in Medicine HCA 590 (Topics in Health Sciences) Rohit Kate 11. Biomedical Natural Language Processing

  2. Reading • Chapter 8, Biomedical Informatics: Computer Applications in Health Care and Biomedicine by Edward H. Shortliffe (Editor) and James J. Cimino (Editor), Springer, 2006.

  3. Outline • Introduction to NLP • Linguistic Essentials • Challenges of Clinical Language Processing • Challenges of Biological Language Processing

  4. What is Natural Language Processing (NLP)? • Processing of natural languages like English, Chinese, etc. by computers to: • Interact with people, e.g. • Follow natural language commands • Answer natural language questions • Provide information in natural language • Perform useful tasks, e.g. • Find required information from several documents • Summarize large or many documents • Translate from one natural language to another • Word processing is NOT Natural Language Processing!

  5. Example NLP Task: Information Extraction • Extract some specific type of information from texts • Entity extraction: • Find all the protein names mentioned in a document • Find all the person, organization and location names mentioned in a document • Relation extraction: • Find all pairs of proteins mentioned in a document that interact • Find where a person lives, where an organization is located etc.
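A minimal sketch of what entity extraction looks like in code, assuming a purely illustrative regular-expression pattern rather than a real biomedical named-entity recognizer (production systems typically learn such extractors from annotated corpora):

    import re

    # Toy entity extraction: find protein-like names in a sentence from the
    # abstract on the next slide. The pattern is hand-written for illustration
    # only, not a real biomedical NER model.
    text = ("Immunoprecipitation experiments demonstrated that cyclin D1 is "
            "associated with both p34cdc2 and p33cdk2.")

    # match 'cyclin' plus an identifier, or p + digits + trailing letters/digits
    pattern = r"cyclin\s+[A-Z]\d?|p\d+\w*"
    print(re.findall(pattern, text))   # ['cyclin D1', 'p34cdc2', 'p33cdk2']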

  6. Sample Medline Abstract TI - Two potentially oncogenic cyclins, cyclin A and cyclin D1, share common properties of subunit configuration, tyrosine phosphorylation and physical association with the Rb protein AB - Originally identified as a ‘mitotic cyclin’, cyclin A exhibits properties of growth factor sensitivity, susceptibility to viral subversion and association with a tumor-suppressor protein, properties which are indicative of an S-phase-promoting factor (SPF) as well as a candidate proto-oncogene. Other recent studies have identified human cyclin D1 (PRAD1) as a putative G1 cyclin and candidate proto-oncogene. However, the specific enzymatic activities and, hence, the precise biochemical mechanisms through which cyclins function to govern cell cycle progression remain unresolved. In the present study we have investigated the coordinate interactions between these two potentially oncogenic cyclins, cyclin-dependent protein kinase subunits (cdks) and the Rb tumor-suppressor protein. The distribution of cyclin D isoforms was modulated by serum factors in primary fetal rat lung epithelial cells. Moreover, cyclin D1 was found to be phosphorylated on tyrosine residues in vivo and, like cyclin A, was readily phosphorylated by pp60c-src in vitro. In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit. Immunoprecipitation experiments with human osteosarcoma cells and Ewing’s sarcoma cells demonstrated that cyclin D1 is associated with both p34cdc2 and p33cdk2, and that cyclin D1 immune complexes exhibit appreciable histone H1 kinase activity. Immobilized, recombinant cyclins A and D1 were found to associate with cellular proteins in complexes that contain the p105Rb protein. This study identifies several common aspects of cyclin biochemistry, including tyrosine phosphorylation and the potential to interact directly or indirectly with the Rb protein, that may ultimately relate membrane-mediated signaling events to the regulation of gene expression.


  9. Example NLP Task: Semantic Parsing • Convert a natural language sentence into an executable meaning representation for a domain • Example: query application for a U.S. geography database • Question: Which rivers run through the states bordering Texas? • Semantic parsing produces the query: answer(traverse(next_to(stateid(‘texas’)))) • Executing the query gives the answer: Arkansas, Canadian, Cimarron, Gila, Mississippi, Rio Grande, …
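To make "executable meaning representation" concrete, here is a toy backend for the slide's query; the BORDERS and RIVERS tables below are a made-up fragment of U.S. geography, just enough for this one query to run (a real Geoquery backend is a full database):

    # Toy executable backend for answer(traverse(next_to(stateid('texas')))).
    # BORDERS and RIVERS are made-up fragments, for illustration only.
    BORDERS = {"texas": ["oklahoma", "new mexico", "arkansas", "louisiana"]}
    RIVERS = {
        "oklahoma": ["arkansas", "canadian", "cimarron"],
        "new mexico": ["rio grande", "gila", "canadian"],
        "arkansas": ["arkansas", "mississippi"],
        "louisiana": ["mississippi", "red"],
    }

    def stateid(name):      # constant: a state identifier
        return name

    def next_to(state):     # all states bordering the given state
        return BORDERS.get(state, [])

    def traverse(states):   # all rivers running through the given states
        return sorted({river for s in states for river in RIVERS.get(s, [])})

    def answer(x):          # top-level wrapper that returns the result
        return x

    print(answer(traverse(next_to(stateid("texas")))))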

  11. Example NLP Task: Textual Entailment • Given two sentences, decide whether the second sentence is implied (entailed) by the first Sentence 1: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year. Sentence 2: Yahoo bought Overture. TRUE Sentence 1: The market value of U.S. overseas assets exceeds their book value. Sentence 2: The market value of U.S. overseas assets equals their book value. FALSE

  12. Example NLP Task: Summarization • Generate a short summary of a long document or a collection of documents • Article: With a split decision in the final two primaries and a flurry of superdelegate endorsements, Sen. Barack Obama sealed the Democratic presidential nomination last night after a grueling and history-making campaign against Sen. Hillary Rodham Clinton that will make him the first African American to head a major-party ticket. Before a chanting and cheering audience in St. Paul, Minn., the first-term senator from Illinois savored what once seemed an unlikely outcome to the Democratic race with a nod to the marathon that was ending and to what will be another hard-fought battle, against Sen. John McCain, the presumptive Republican nominee…. • Summary: Senator Barack Obama was declared the presumptive Democratic presidential nominee.

  13. Example NLP Task: Question Answering • Directly answer natural language questions based on information presented in a corpus of textual documents (e.g. the web). • When was Barack Obama born? (factoid) • August 4, 1961 • Who was president when Barack Obama was born? • John F. Kennedy • How many presidents have there been since Barack Obama was born? • 9

  14. Why is NLP Important? • Natural language is the preferred medium of communication for humans • Humans communicate with each other in natural languages • Scientific articles, magazines, clinical reports, etc. are all in natural languages • Billions of web pages are also in natural languages • Computers can do useful things for us if: • Data is in structured form, e.g. databases, knowledge bases • Specifications are in formal language, e.g. programming languages • NLP bridges the communication gap between humans and computers • Can lead to better and more natural communication with computers • Can process the ever-increasing amount of natural language data generated by people, e.g. extract required information from the web

  15. Biomedical Applications of NLP • Extract relevant information from large volumes of text, e.g. patient reports, journal articles • Identify diagnoses and procedures in patient documents for billing purposes • Process an enormous number of patient reports to detect medical errors • Extract genomic information from the literature to curate databases

  16. NLP is Hard • People generally don’t appreciate how intelligent they are as natural language processors! • To them natural language processing seems deceptively simple because it requires no conscious effort • Since computers are orders of magnitude faster, many find it hard to believe that computers are not good at processing natural languages • An Artificial Intelligence (AI) problem: • Problems at which people are currently better than computers • Three-year-old kids are better than current computing systems at NLP

  17. What Makes NLP Hard? • Ambiguity: A word, term, phrase or sentence could mean several possible things: • cad represents 11 different biomolecular entities in flies and mice, as well as the clinical concept coronary artery disease • The doctor injected the patient with digitalis. • The doctor injected the patient with malaria. • Time flies like an arrow. • I saw a man on the hill with a telescope. • In contrast, computer languages are designed to be unambiguous

  18. I saw a man on the hill with a telescope.

  19. What Makes NLP Hard? • Variability: Lots of ways to express the same thing: • The doctor injected the patient with malaria. • The physician gave the patient suffering from malaria an injection. • The patient with malaria was injected by the doctor. • The doctor injected the patient who has malaria. • Computer languages have variability but the equivalence of expressions can be automatically detected

  20. Why is there Ambiguity and Variability in Natural Languages? • A unique and unambiguous way to express everything would make natural languages unnecessarily complex • Language is always used in a physical and/or conceptual context; humans possess a lot of world knowledge and are good at inferring, and these are used to simplify language at the cost of making it potentially ambiguous • I am out of money. I am going to the bank. • The river is calm today. I am going to the bank. • Variability increases the expressivity of a language and allows humans to express themselves in creative ways • I am headed to the bank. • I am off to the bank. • I am hitting the road to bank.

  21. How to Make Computers Process Natural Languages? • Should we model how humans acquire or process language? • A good idea, but difficult to model • The human brain is different from a computer processor • Humans are good at remembering and recognizing patterns; computers are good at crunching numbers • A compromise approach: model human language processing as much as possible, but also utilize the computer’s ability to crunch numbers • Airplanes have wings like birds, but they don’t flap them; instead they use engine technology

  22. How Do Humans Acquire Language? • Children pick up language through experience, by associating the language they hear with their perceptual context; language is not taught • It is believed that the language experience children get is too little to learn all the complexities of a language; this is known as the “poverty of the stimulus” argument • It has been postulated (controversially) that key language capabilities are innate and hard-wired into the human brain at birth by genetic inheritance, known as “universal grammar” (Chomsky, 1957) • Children only set the right parameters of this grammar based on the particular language they are exposed to • An interesting and fun book: “The Language Instinct” by Steven Pinker

  23. How to Make Computers Process Natural Languages? • If language capability requires built-in prior knowledge, then we should manually encode our knowledge of language into computers; this is the traditional “rationalist” approach (dominant from the 1960s to the mid-1980s)

  24. Rationalist Approach (diagram): manually encoded Linguistic Knowledge → NLP System; Raw Text → NLP System → Automatically Annotated Text

  25. How to Make Computers Process Natural Languages? • The rationalist approach proved very difficult and time-consuming • It led to brittle systems that fail on slightly different input

  26. How to Make Computers Process Natural Languages? • An alternative “empiricist” approach is to let the computer acquire its knowledge of language from annotated natural language corpora using machine learning techniques; this is also known as the corpus-based or statistical approach

  27. Empiricist Approach (diagram): Manually Annotated Training Corpora → Machine Learning → Linguistic Knowledge → NLP System; Raw Text → NLP System → Automatically Annotated Text

  28. How to Make Computers Process Natural Languages? • The empiricist approach has come to dominate NLP since the 1990s because of: • Advances in machine learning techniques and in computing power • The availability of large corpora to learn from • It typically leads to robust systems • It is easier to annotate a corpus than to directly encode language knowledge • Such systems may in turn lead to insights into how humans acquire language
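A minimal sketch of this empiricist setup, assuming scikit-learn and a made-up six-sentence "annotated corpus"; the point is only that the linguistic knowledge is learned from labeled examples rather than hand-coded:

    # Learn from a (tiny, made-up) manually annotated corpus instead of
    # hand-coding rules; real systems train on large annotated corpora.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts = ["patient denies chest pain", "no fever reported",
                   "severe chest pain on exertion", "patient reports high fever",
                   "no shortness of breath", "persistent cough and fever"]
    train_labels = ["absent", "absent", "present", "present", "absent", "present"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)                  # acquire knowledge from data
    print(model.predict(["patient reports chest pain"]))  # apply to raw text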

  29. How to Make Computers Process Natural Languages? • Key steps for all NLP tasks: • Formulate the linguistic task as a mathematical/computer problem using appropriate models and data structures • Solve using the appropriate techniques for those models • Essential components: • Linguistics • Mathematical/computer models and techniques

  30. NLP: State of the Art • Several intermediate linguistic analyses of general text can be done with good accuracy: POS tagging, syntactic parsing, dependency parsing, coreference resolution, semantic role labeling • Freely available systems can perform these analyses • Many tasks that only require finding specific things in text (i.e. that do not require understanding complete sentences) can be done with reasonable success: information extraction, sentiment analysis; these are also commercially important • For a particular domain, sentences can be completely understood with good accuracy: semantic parsing

  31. NLP: State of the Art • Tasks like summarization, machine translation, textual entailment and question answering, which are currently done without fully understanding the sentences, can be done with some success but not satisfactorily • Fully understanding open-domain sentences is currently not possible • It is likely to require encoding or acquiring a lot of common world knowledge • There has been very little work on understanding beyond a sentence, i.e. understanding a whole paragraph or an entire document together

  32. Linguistics Essentials Partially based on Chapter 3 of Manning & Schütze, Foundations of Statistical Natural Language Processing.

  33. Basic Steps of Natural Language Processing: Sound waves → (Phonetics) → Words → (Syntactic processing) → Parses → (Semantic processing) → Meaning → (Pragmatic processing) → Meaning in context. We will skip phonetics and phonology.


  35. Words: Morphology • Study of the internal structure of words • carried → carry + ed (past tense) • independently → in + (depend + ent) + ly • English has relatively simple morphology; some other languages like German or Finnish have complex word structures • Very accurate morphological analyzers are available for most languages; morphology is considered a solved problem • Biomedical domains have rich morphology: • hydroxynitrodihydrothymine => hydroxy-nitro-di-hydro-thym-ine • hepaticocholangiojejunostomy => hepatico-cholangio-jejuno-stom-y • Identifying morphological structure also helps in dealing with new words
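A toy sketch of how such segmentation might work; the affix inventory is a hand-picked fragment covering only the two slide examples, not a real morphological analyzer (real analyzers use large lexicons and finite-state machinery):

    # Greedy prefix stripping with a tiny, hand-picked affix list (illustration only).
    PREFIXES = ["hydroxy", "nitro", "di", "hydro", "hepatico", "cholangio", "jejuno"]

    def segment(word):
        parts = []
        stripped = True
        while stripped:
            stripped = False
            # try longer prefixes first so 'hydroxy' wins over 'hydro'
            for p in sorted(PREFIXES, key=len, reverse=True):
                if word.startswith(p) and len(word) > len(p):
                    parts.append(p)
                    word = word[len(p):]
                    stripped = True
                    break
        return parts + [word]

    print(segment("hydroxynitrodihydrothymine"))    # ['hydroxy', 'nitro', 'di', 'hydro', 'thymine']
    print(segment("hepaticocholangiojejunostomy"))  # ['hepatico', 'cholangio', 'jejuno', 'stomy']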

  36. Words: Parts of Speech • Linguists group the words of a language into categories which occur in similar places in a sentence and have a similar type of meaning, e.g. nouns, verbs, adjectives; these are called parts of speech (POS) • A basic test of whether words belong to the same category is the substitution test • This is a good [dog/chair/pencil]. • This is a [good/bad/green/tall] chair.

  37. Parts of Speech • Nouns: Typically refer to entities and their names like people, animals, things • John, Mary, boy, girl, dog, cats, mug, table, idea • Can be further divided as proper, singular, plural • Pronouns: Variables or place-holders for nouns • Nominative: I, you, he, she, we, they, it • Accusative: me, you, him, her, us, them, it • Possessive: my, your, his, her, our, their, its • 2nd Possessive: mine, yours, his, hers, ours, theirs, its • Reflexive: myself, yourself, himself, herself, ourselves, themselves, itself

  38. Parts of Speech • Determiners: Describe particular reference of a noun • Articles: a, an, the • Demonstratives: this, that, these, those • Adjectives: Describe properties of nouns • good, bad, green, tall • Verbs: Describe actions • talk, sleep, eat, throw • Categorized based on tense, person, singular/plural

  39. Parts of Speech • Adverbs: Modify verbs by specifying space, time, manner or degree • often, slowly, very • Prepositions: Small words that express spatial relations and other attributes • in, on, over, of, about, to, with • They introduce prepositional phrases, which often introduce ambiguity into a sentence • I saw a man on the hill with a telescope. • Prepositional phrase attachment: another important NLP problem • Particles: Subclass of prepositions that bond with verbs to form phrasal verbs • take off, air out, run up

  40. POS Tagging • POS tagging is often the first step in analyzing a sentence • Why is this a non-trivial task? The same word can have different POS tags in different sentences: • His position was near the tree. (position: Noun) • Position him near the tree. (position: Verb) • John/NOUN saw/VERB the/DT saw/NOUN and/CONJ decided/VERB to/TO take/VERB it/PRP to/PREP the/DT table/NOUN
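Off-the-shelf taggers make this step easy to try; a sketch using NLTK's default tagger, assuming the NLTK package is installed and its tokenizer/tagger models have been downloaded:

    import nltk

    # one-time model downloads (resource names as in classic NLTK releases)
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    # 'position' is a noun in the first sentence and a verb in the second;
    # a good tagger should distinguish them (though taggers can still err)
    print(nltk.pos_tag(nltk.word_tokenize("His position was near the tree.")))
    print(nltk.pos_tag(nltk.word_tokenize("Position him near the tree.")))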

  41. Basic Steps of Natural Language Processing: Sound waves → (Phonetics) → Words → (Syntactic processing) → Parses → (Semantic processing) → Meaning → (Pragmatic processing) → Meaning in context

  42. Phrase Structure • Most languages have a word order • Words are organized into phrases, groups of words that act as a single unit or a constituent • [The dog] [chased] [the cat]. • [The fat dog] [chased] [the thin cat]. • [The fat dog with red collar] [chased] [the thin old cat]. • [The fat dog with red collar named Tom] [suddenly chased] [the thin old white cat].

  43. Phrases • Noun phrase: A syntactic unit of a sentence which acts like a noun and in which a noun, called its head, is usually embedded • An optional determiner followed by zero or more adjectives, a noun head and zero or more prepositional phrases • Prepositional phrase: Headed by a preposition; expresses spatial, temporal or other attributes • Verb phrase: The part of the sentence that depends on the verb. Headed by the verb. • Adjective phrase: Acts like an adjective.

  44. An Important NLP Task: Phrase Chunking • Find all non-recursive noun phrases (NPs) and verb phrases (VPs) in a sentence. • [NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs]. • [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September] • Some applications need all the noun phrases in a sentence
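A sketch of NP chunking with NLTK's regular-expression chunker over POS tags, assuming NLTK and its models are installed; the chunk pattern below is the textbook determiner-adjectives-noun pattern, not a trained chunker:

    import nltk

    # NP chunk = optional determiner, any number of adjectives, one or more nouns
    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

    tagged = nltk.pos_tag(nltk.word_tokenize("I ate the spaghetti with meatballs."))
    print(chunker.parse(tagged))   # words outside NP chunks are left unattached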

  45. Phrase Structure Grammars • Syntax is the study of word orders and phrase structures • Syntactic analysis tells how to determine the meaning of a sentence from the meanings of its words • The dog bit the man. • The man bit the dog. • A basic question in Linguistics: What forms a legal sentence in a language? • Syntax helps to answer that question • *Bit the the man dog. (syntactically ill-formed) • Colorless green ideas sleep furiously. (syntactically well-formed but meaningless)

  46. Phrase Structure Grammars • Linguists have come up with many grammar formalisms to capture the syntax of languages; phrase structure grammar is one of them and is very commonly used • A context-free grammar generates the sentences of a language; a small grammar with productions:
  S -> NP VP
  VP -> Verb
  VP -> Verb NP
  NP -> Article Noun
  Verb -> [slept|ate|made|bit]
  Noun -> [girl|cake|dog|man]
  Article -> [A|The]

  47. Phrase Structure Grammars • The parse of a sentence is typically shown as a tree, called a syntactic derivation or a parse tree • For “The girl ate the cake.”:
  (S (NP (Article The) (NN girl)) (VP (Verb ate) (NP (Article the) (NN cake))))
  • S, NP, VP, Article, NN and Verb are non-terminals; the words are terminals
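The slide's grammar can be run directly; a sketch using NLTK's chart parser (terminals lowercased, since the toy grammar would otherwise need separate rules for 'The' and 'the'):

    import nltk

    # the small grammar from slide 46, in NLTK's CFG notation
    grammar = nltk.CFG.fromstring("""
    S -> NP VP
    VP -> Verb | Verb NP
    NP -> Article Noun
    Verb -> 'slept' | 'ate' | 'made' | 'bit'
    Noun -> 'girl' | 'cake' | 'dog' | 'man'
    Article -> 'a' | 'the'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("the girl ate the cake".split()):
        tree.pretty_print()   # draws the same parse tree as on the slide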

  48. Phrase Structure Grammars • Some of the productions can be recursive (like NP -> NP PP), which can then expand several times • [S [NP I] [VP saw [NP the man [PP on [NP the hill [PP with [NP the telescope [PP in [NP Texas]]]]]]]]] • Because of recursion in the grammar, there are potentially an infinite number of sentences in a language

  49. Syntactic Parsing: A Very Important NLP Task • Typically a grammar can lead to several parses of a sentence: syntactic ambiguity • [S [NP I] [VP saw [NP the man [PP on [NP the hill [PP with [NP the telescope [PP in [NP Texas]]]]]]]]] • [S [NP I] [VP saw [NP the man [PP on [NP the hill]]] [PP with [NP the telescope [PP in [NP Texas]]]]]] • [S [NP I] [VP saw [NP the man [PP on [NP the hill]] [PP with [NP the telescope [PP in [NP Texas]]]]]]] • … • It is not uncommon to have hundreds of parses for a sentence
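The explosion of parses is easy to reproduce; a sketch with a small recursive grammar, written just for this one sentence (a toy fragment, not broad-coverage):

    import nltk

    # NP -> NP PP and VP -> VP PP make prepositional-phrase attachment ambiguous
    grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Pronoun | Det Noun | NP PP
    VP -> Verb NP | VP PP
    PP -> Prep NP
    Pronoun -> 'I'
    Det -> 'the'
    Noun -> 'man' | 'hill' | 'telescope'
    Verb -> 'saw'
    Prep -> 'on' | 'with'
    """)

    sentence = "I saw the man on the hill with the telescope".split()
    trees = list(nltk.ChartParser(grammar).parse(sentence))
    print(len(trees))   # 5 distinct parses, one per attachment choice
    for tree in trees:
        print(tree)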

  50. Simple PCFG for ATIS English • Grammar (the rule probabilities for each left-hand-side symbol sum to 1.0):
  S → NP VP [0.8]   S → Aux NP VP [0.1]   S → VP [0.1]
  NP → Pronoun [0.2]   NP → Proper-Noun [0.2]   NP → Det Nominal [0.6]
  Nominal → Noun [0.3]   Nominal → Nominal Noun [0.2]   Nominal → Nominal PP [0.5]
  VP → Verb [0.2]   VP → Verb NP [0.5]   VP → VP PP [0.3]
  PP → Prep NP [1.0]
  • Lexicon (again, probabilities sum to 1.0 per part of speech):
  Det → the [0.6] | a [0.2] | that [0.1] | this [0.1]
  Noun → book [0.1] | flight [0.5] | meal [0.2] | money [0.2]
  Verb → book [0.5] | include [0.2] | prefer [0.3]
  Pronoun → I [0.5] | he [0.1] | she [0.1] | me [0.3]
  Proper-Noun → Houston [0.8] | NWA [0.2]
  Aux → does [1.0]
  Prep → from [0.25] | to [0.25] | on [0.1] | near [0.2] | through [0.2]
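A sketch of the same PCFG in NLTK, with a Viterbi parser returning the single most probable parse; Proper-Noun is renamed ProperNoun to keep the symbol simple, otherwise the rules and probabilities are as on the slide:

    import nltk

    grammar = nltk.PCFG.fromstring("""
    S -> NP VP [0.8] | Aux NP VP [0.1] | VP [0.1]
    NP -> Pronoun [0.2] | ProperNoun [0.2] | Det Nominal [0.6]
    Nominal -> Noun [0.3] | Nominal Noun [0.2] | Nominal PP [0.5]
    VP -> Verb [0.2] | Verb NP [0.5] | VP PP [0.3]
    PP -> Prep NP [1.0]
    Det -> 'the' [0.6] | 'a' [0.2] | 'that' [0.1] | 'this' [0.1]
    Noun -> 'book' [0.1] | 'flight' [0.5] | 'meal' [0.2] | 'money' [0.2]
    Verb -> 'book' [0.5] | 'include' [0.2] | 'prefer' [0.3]
    Pronoun -> 'I' [0.5] | 'he' [0.1] | 'she' [0.1] | 'me' [0.3]
    ProperNoun -> 'Houston' [0.8] | 'NWA' [0.2]
    Aux -> 'does' [1.0]
    Prep -> 'from' [0.25] | 'to' [0.25] | 'on' [0.1] | 'near' [0.2] | 'through' [0.2]
    """)

    parser = nltk.ViterbiParser(grammar)
    for tree in parser.parse("book the flight through Houston".split()):
        print(tree)   # the most probable parse, annotated with its probability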
