Rohit Kate

Computational Intelligence in Biomedical and Health Care Informatics HCA 590 (Topics in Health Sciences). Rohit Kate. Natural Language Processing: Words and Parses. Reading.


Presentation Transcript


  1. Computational Intelligence in Biomedical and Health Care Informatics. HCA 590 (Topics in Health Sciences). Rohit Kate. Natural Language Processing: Words and Parses

  2. Reading • Chapter 8, Biomedical Informatics: Computer Applications in Health Care and Biomedicine by Edward H. Shortliffe (Editor) and James J. Cimino (Editor), Springer, 2006.

  3. Linguistics Essentials

  4. Basic Steps of Natural Language Processing • Sound waves → [Phonetics] → Words → [Syntactic processing] → Parses → [Semantic processing] → Meaning → [Pragmatic processing] → Meaning in context • We will skip phonetics and phonology.

  5. Basic Steps of Natural Language Processing • Sound waves → [Phonetics] → Words → [Syntactic processing] → Parses → [Semantic processing] → Meaning → [Pragmatic processing] → Meaning in context • We will skip phonetics and phonology.

  6. Words: Morphology • Study of the internal structure of words • carried → carry + ed (past tense) • independently → in + (depend + ent) + ly • English has relatively simple morphology; some other languages, like German or Finnish, have complex word structures • Very accurate morphological analyzers are available for most languages; considered a solved problem • Biomedical domains have rich morphology: • hydroxynitrodihydrothymine → hydroxy-nitro-di-hydro-thym-ine • hepaticocholangiojejunostomy → hepatico-cholangio-jejuno-stom-y • Identifying morphological structure also helps in dealing with new words
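The biomedical examples on this slide can be reproduced with a very small dictionary-based segmenter. This is only a sketch: the morpheme inventory `MORPHEMES` and the function `segment` are hypothetical names made up for illustration, and a real morphological analyzer would use a much larger lexicon and handle spelling changes (such as carried → carry + ed), which this one does not.

```python
from typing import Optional

# Hypothetical morpheme inventory, just large enough for the two slide examples.
MORPHEMES = {
    "hydroxy", "hydro", "nitro", "di", "thym", "ine",
    "hepatico", "cholangio", "jejuno", "stom", "y",
}


def segment(word: str, lexicon: set[str] = MORPHEMES) -> Optional[list[str]]:
    """Split `word` into known morphemes, preferring longer morphemes.

    Returns None when the word cannot be fully covered by the lexicon.
    """
    if not word:
        return []
    # Try the longest candidate prefix first and backtrack on failure.
    for end in range(len(word), 0, -1):
        prefix = word[:end]
        if prefix in lexicon:
            rest = segment(word[end:], lexicon)
            if rest is not None:
                return [prefix] + rest
    return None


print("-".join(segment("hydroxynitrodihydrothymine")))
print("-".join(segment("hepaticocholangiojejunostomy")))
```

Trying longer prefixes first is what lets the sketch pick "hydroxy" at the start of the word but fall back to "hydro" later, where "hydroxy" no longer matches.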

  7. Words: Parts of Speech • Linguists group the words of a language into categories that occur in similar places in a sentence and have similar types of meaning, e.g. nouns, verbs, adjectives; these categories are called parts of speech (POS) • A basic test of whether words belong to the same category is the substitution test • This is a good [dog/chair/pencil]. • This is a [good/bad/green/tall] chair.

  8. Parts of Speech • Nouns: Typically refer to entities and their names like people, animals, things • John, Mary, boy, girl, dog, cats, mug, table, idea • Can be further divided as proper, singular, plural • Pronouns: Variables or place-holders for nouns • Nominative: I, you, he, she, we, they, it • Accusative: me, you, him, her, us, them, it • Possessive: my, your, his, her, our, their, its • 2nd Possessive: mine, yours, his, hers, ours, theirs, its • Reflexive: myself, yourself, himself, herself, ourselves, themselves, itself

  9. Parts of Speech • Determiners: Describe particular reference of a noun • Articles: a, an, the • Demonstratives: this, that, these, those • Adjectives: Describe properties of nouns • good, bad, green, tall • Verbs: Describe actions • talk, sleep, eat, throw • Categorized based on tense, person, singular/plural

  10. Parts of Speech • Adverbs: Modify verbs by specifying space, time, manner or degree • often, slowly, very • Prepositions: Small words that express spatial relations and other attributes • in, on, over, of, about, to, with • They introduce prepositional phrases that typically introduce ambiguity in a sentence. • I saw a man on the hill with a telescope. • Prepositional phrase attachment: Another important NLP problem • Particles: Subclass of prepositions that bond with verbs to form phrasal verbs • take off, air out, ran up

  11. POS Tagging • Automatic POS tagging is often the first step in analyzing a sentence • Why is this a non-trivial task? • The same word can have different POS tags in different sentences: • His position was near the tree. (position: noun) • Position him near the tree. (position: verb) • John/NOUN saw/VERB the/DT saw/NOUN and/CONJ decided/VERB to/TO take/VERB it/PRP to/PREP the/DT table/NOUN.
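To make the ambiguity concrete, here is a toy tagger for the "John saw the saw" sentence. It is a hand-written sketch, not how practical taggers work (those are usually statistical): the lexicon `LEXICON`, the function `tag`, and the two context rules are assumptions invented for this one example.

```python
# Toy lexicon: each word lists the tags it can take, default tag first.
# The tag names follow the slide; the entries are made up for illustration.
LEXICON = {
    "john": ["NOUN"], "saw": ["VERB", "NOUN"], "the": ["DT"],
    "and": ["CONJ"], "decided": ["VERB"], "to": ["TO", "PREP"],
    "take": ["VERB"], "it": ["PRP"], "table": ["NOUN"],
}


def tag(words):
    """Tag each word using its default tag plus two simple context rules."""
    tags = []
    for i, word in enumerate(words):
        options = LEXICON[word.lower()]
        prev_tag = tags[-1] if tags else None
        next_word = words[i + 1].lower() if i + 1 < len(words) else None
        choice = options[0]
        if "NOUN" in options and prev_tag == "DT":
            # Rule 1: right after a determiner, prefer the noun reading.
            choice = "NOUN"
        elif options == ["TO", "PREP"] and LEXICON.get(next_word, [""])[0] != "VERB":
            # Rule 2: "to" is a preposition unless a verb follows it.
            choice = "PREP"
        tags.append(choice)
    return tags


sentence = "John saw the saw and decided to take it to the table".split()
for word, pos in zip(sentence, tag(sentence)):
    print(f"{word}/{pos}", end=" ")
print()
```

A lookup that always returned the default tag would label both occurrences of "saw" as VERB and both occurrences of "to" as TO; the context rules are what separate the two readings.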

  12. Basic Steps of Natural Language Processing • Sound waves → [Phonetics] → Words → [Syntactic processing] → Parses → [Semantic processing] → Meaning → [Pragmatic processing] → Meaning in context

  13. Phrase Structure • Most languages have a word order • Words are organized into phrases: groups of words that act as a single unit, or constituent • [The dog] [chased] [the cat]. • [The fat dog] [chased] [the thin cat]. • [The fat dog with red collar] [chased] [the thin old cat]. • [The fat dog with red collar named Tom] [suddenly chased] [the thin old white cat].

  14. Phrases • Noun phrase: A syntactic unit of a sentence that acts like a noun and usually contains a noun, called its head • Typically an optional determiner followed by zero or more adjectives, a noun head, and zero or more prepositional phrases • Prepositional phrase: Headed by a preposition; expresses spatial, temporal, or other attributes • Verb phrase: The part of the sentence that depends on the verb; headed by the verb • Adjective phrase: Acts like an adjective

  15. An Important NLP Task: Phrase Chunking • Find all non-recursive noun phrases (NPs) and verb phrases (VPs) in a sentence. • [NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs]. • [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September] • Some applications need all the noun phrases in a sentence
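A rough sketch of chunking, assuming the sentence has already been POS-tagged with Penn Treebank-style tags. The functions `chunk_type` and `chunk` are hypothetical helpers written for this example; practical chunkers are normally learned from annotated data rather than built from rules like these.

```python
def chunk_type(tag):
    """Map a Penn Treebank-style POS tag to a coarse chunk label."""
    if tag in ("DT", "PRP") or tag.startswith(("JJ", "NN")):
        return "NP"
    if tag.startswith("VB") or tag == "MD":
        return "VP"
    if tag == "IN":
        return "PP"
    return "O"  # outside any chunk


def chunk(tagged_words):
    """Group consecutive words with the same chunk label into flat chunks."""
    chunks = []
    for word, tag in tagged_words:
        kind = chunk_type(tag)
        # A determiner or pronoun always opens a new noun phrase.
        starts_new = not chunks or chunks[-1][0] != kind or tag in ("DT", "PRP")
        if starts_new:
            chunks.append((kind, [word]))
        else:
            chunks[-1][1].append(word)
    return chunks


tagged = [("I", "PRP"), ("ate", "VBD"), ("the", "DT"),
          ("spaghetti", "NN"), ("with", "IN"), ("meatballs", "NNS")]
print(" ".join(f"[{kind} {' '.join(words)}]" for kind, words in chunk(tagged)))
```

The chunks are flat (non-recursive): "with meatballs" is not attached to either "ate" or "the spaghetti", which is exactly the decision that chunking leaves to full parsing.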

  16. Phrase Structure Grammars • Syntax is the study of word order and phrase structure • Syntactic analysis helps determine the meaning of a sentence from the meanings of its words • The dog bit the man. • The man bit the dog. • A basic question in linguistics: What forms a legal sentence in a language? • Syntax helps to answer that question • *Bit the the man dog. • Conventionally, ‘*’ indicates an ungrammatical sentence • Colorless green ideas sleep furiously. • Meaningless but grammatical

  17. Phrase Structure Grammars • Linguists have proposed many grammar formalisms to capture the syntax of languages; phrase structure grammar is one of them and is very commonly used • It is a context-free grammar (CFG) that generates sentences • Context-free: each production has only one symbol on its left side • Productions of a small example grammar:
      S → NP VP
      VP → Verb
      VP → Verb NP
      NP → Article Noun
      Verb → [slept|ate|made|bit]
      Noun → [girl|cake|dog|man]
      Article → [A|The]
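Because this example grammar has no recursive productions, every sentence it generates can be enumerated. The sketch below encodes the productions as a Python dictionary; the names `GRAMMAR` and `expand` are illustrative choices, and the articles are lowercased for simplicity.

```python
from itertools import product

# The example grammar: each non-terminal maps to its alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["Verb"], ["Verb", "NP"]],
    "NP": [["Article", "Noun"]],
    "Verb": [["slept"], ["ate"], ["made"], ["bit"]],
    "Noun": [["girl"], ["cake"], ["dog"], ["man"]],
    "Article": [["a"], ["the"]],
}


def expand(symbol):
    """Yield every word sequence (as a tuple) derivable from `symbol`."""
    if symbol not in GRAMMAR:  # a terminal, i.e. an actual word
        yield (symbol,)
        return
    for rhs in GRAMMAR[symbol]:
        # Combine every expansion of each symbol on the right-hand side.
        for parts in product(*(list(expand(s)) for s in rhs)):
            yield tuple(word for part in parts for word in part)


sentences = [" ".join(words) for words in expand("S")]
print(len(sentences))                          # 8 NPs x (4 + 4 * 8) VPs = 288
print("the girl ate the cake" in sentences)
print("the cake ate the girl" in sentences)    # grammatical, if not sensible
```

The enumeration only terminates because nothing in this grammar can contain itself; a recursive production such as NP → NP PP would make the set of sentences infinite.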

  18. Phrase Structure Grammars • The parse of a sentence is typically shown as a tree (a syntactic derivation, or parse tree) • Non-terminals: S, NP, VP, Article, Noun, Verb • Terminals: the words (The, girl, ate, the, cake) • Parse tree of "The girl ate the cake.":
      (S
        (NP (Article The) (Noun girl))
        (VP
          (Verb ate)
          (NP (Article the) (Noun cake))))

  19. Phrase Structure Grammars • Some productions can be recursive (a phrase inside another phrase of the same type, like NP → NP PP) and can therefore be applied several times:
      (S
        (NP (PRP I))
        (VP
          (VBD saw)
          (NP (NP (DT the) (NN man))
              (PP (IN on)
                  (NP (NP (DT the) (NN hill))
                      (PP (IN with)
                          (NP (DT the) (NN telescope))))))))
      • Because of recursion in the grammar, a language has a potentially infinite number of sentences

  20. Syntactic Ambiguity • Typically a grammar allows several parses of the same sentence; this is called syntactic ambiguity • In this parse, "with the telescope" attaches to "the hill":
      (S
        (NP (PRP I))
        (VP
          (VBD saw)
          (NP (NP (DT the) (NN man))
              (PP (IN on)
                  (NP (NP (DT the) (NN hill))
                      (PP (IN with)
                          (NP (DT the) (NN telescope))))))))

  21. Syntactic Ambiguity • Typically a grammar allows several parses of the same sentence; this is called syntactic ambiguity • In this parse, both prepositional phrases attach to "the man":
      (S
        (NP (PRP I))
        (VP
          (VBD saw)
          (NP (NP (DT the) (NN man))
              (PP (IN on) (NP (DT the) (NN hill)))
              (PP (IN with) (NP (DT the) (NN telescope))))))

  22. Syntactic Ambiguity • Typically a grammar allows several parses of the same sentence; this is called syntactic ambiguity • In this parse, both prepositional phrases attach to the verb phrase:
      (S
        (NP (PRP I))
        (VP
          (VBD saw)
          (NP (DT the) (NN man))
          (PP (IN on) (NP (DT the) (NN hill)))
          (PP (IN with) (NP (DT the) (NN telescope)))))

  23. Syntactic Parsing: A Very Important NLP Task • It is not uncommon for a sentence to have hundreds of parses • Syntactic parsing is the task of finding the best parse for a sentence • Earlier rule-based approaches were brittle and did not work well • Statistical methods for syntactic parsing have been more successful and are currently in use
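To see how quickly parses multiply, the sketch below counts the parses of the telescope sentence from the ambiguity slides under a small binary grammar. The grammar in `BINARY` and `LEXICAL` is an assumption made for this example (it treats "I" directly as an NP and allows only two-way branching, so it does not contain the flat three-child VP), and `count_parses` is a hypothetical helper that counts trees with a CKY-style recursion over spans.

```python
from functools import lru_cache

# Assumed toy grammar: (parent, left child, right child).
BINARY = [
    ("S", "NP", "VP"),
    ("VP", "VBD", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("NP", "DT", "NN"),
    ("NP", "NP", "PP"),   # PP attaches to a noun phrase
    ("PP", "IN", "NP"),
]
# Word -> categories it can carry ("I" is listed as NP to avoid a unary rule).
LEXICAL = {
    "I": {"NP"}, "saw": {"VBD"}, "the": {"DT"}, "on": {"IN"}, "with": {"IN"},
    "man": {"NN"}, "hill": {"NN"}, "telescope": {"NN"},
}


def count_parses(words, root="S"):
    """Count the distinct parse trees of `words` rooted in `root`."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        # Number of ways `symbol` can derive words[i:j].
        if j - i == 1:
            return int(symbol in LEXICAL.get(words[i], set()))
        total = 0
        for parent, left, right in BINARY:
            if parent != symbol:
                continue
            for k in range(i + 1, j):  # every split point of the span
                total += count(left, i, k) * count(right, k, j)
        return total

    return count(root, 0, len(words))


print(count_parses("I saw the man with the telescope".split()))              # 2
print(count_parses("I saw the man on the hill with the telescope".split()))  # 5
```

One prepositional phrase gives 2 parses and two give 5, so each additional phrase multiplies the options; with realistic grammars and longer sentences this is how the count reaches the hundreds.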

  24. Statistical Syntactic Parsing • Statistical syntactic parsing uses a probabilistic model of syntax to assign a probability to each parse tree • Provides a principled approach to resolving syntactic ambiguity • The more likely parse will have the higher probability • Includes POS tagging • Probabilities are typically learned from annotated parses of thousands of sentences, called a treebank • Penn Treebank (http://www.cis.upenn.edu/~treebank/) • The best-known treebank • Contains annotated parse trees of a few thousand Wall Street Journal articles • Sparked progress in automated syntactic parsing methods

  25. Probabilistic Context-Free Grammar (PCFG) • A PCFG is a probabilistic version of a CFG where each production has a probability. • Probabilities of all productions rewriting a given non-terminal must add to 1, defining a distribution for each non-terminal.

  26. Simple PCFG for Air Travel Domain (the probabilities for each left-hand side sum to 1.0)
      Grammar                      Prob
      S → NP VP                    0.8
      S → Aux NP VP                0.1
      S → VP                       0.1
      NP → Pronoun                 0.2
      NP → Proper-Noun             0.2
      NP → Det Nominal             0.6
      Nominal → Noun               0.3
      Nominal → Nominal Noun       0.2
      Nominal → Nominal PP         0.5
      VP → Verb                    0.2
      VP → Verb NP                 0.5
      VP → VP PP                   0.3
      PP → Prep NP                 1.0
      Lexicon
      Det → the (0.6) | a (0.2) | that (0.1) | this (0.1)
      Noun → book (0.1) | flight (0.5) | meal (0.2) | money (0.2)
      Verb → book (0.5) | include (0.2) | prefer (0.3)
      Pronoun → I (0.5) | he (0.1) | she (0.1) | me (0.3)
      Proper-Noun → Houston (0.8) | NWA (0.2)
      Aux → does (1.0)
      Prep → from (0.25) | to (0.25) | on (0.1) | near (0.2) | through (0.2)
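The requirement that the productions for each non-terminal sum to 1 can be checked mechanically. The sketch below transcribes the air travel grammar into a list of tuples; the names `RULES` and `unnormalized` are illustrative choices, not part of any library.

```python
import math
from collections import defaultdict

# (left-hand side, right-hand side, probability), transcribed from the slide.
RULES = [
    ("S", "NP VP", 0.8), ("S", "Aux NP VP", 0.1), ("S", "VP", 0.1),
    ("NP", "Pronoun", 0.2), ("NP", "Proper-Noun", 0.2), ("NP", "Det Nominal", 0.6),
    ("Nominal", "Noun", 0.3), ("Nominal", "Nominal Noun", 0.2),
    ("Nominal", "Nominal PP", 0.5),
    ("VP", "Verb", 0.2), ("VP", "Verb NP", 0.5), ("VP", "VP PP", 0.3),
    ("PP", "Prep NP", 1.0),
    ("Det", "the", 0.6), ("Det", "a", 0.2), ("Det", "that", 0.1), ("Det", "this", 0.1),
    ("Noun", "book", 0.1), ("Noun", "flight", 0.5), ("Noun", "meal", 0.2),
    ("Noun", "money", 0.2),
    ("Verb", "book", 0.5), ("Verb", "include", 0.2), ("Verb", "prefer", 0.3),
    ("Pronoun", "I", 0.5), ("Pronoun", "he", 0.1), ("Pronoun", "she", 0.1),
    ("Pronoun", "me", 0.3),
    ("Proper-Noun", "Houston", 0.8), ("Proper-Noun", "NWA", 0.2),
    ("Aux", "does", 1.0),
    ("Prep", "from", 0.25), ("Prep", "to", 0.25), ("Prep", "on", 0.1),
    ("Prep", "near", 0.2), ("Prep", "through", 0.2),
]


def unnormalized(rules):
    """Return {non-terminal: total} for every left-hand side not summing to 1."""
    totals = defaultdict(float)
    for lhs, _rhs, prob in rules:
        totals[lhs] += prob
    # isclose absorbs floating-point rounding in the sums.
    return {lhs: t for lhs, t in totals.items() if not math.isclose(t, 1.0)}


print(unnormalized(RULES))  # an empty dict means every distribution is valid
```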

  27. Sentence Probability • Assume productions for each node are chosen independently. • The probability of a derivation is the product of the probabilities of its productions. • Derivation D1 of "book the flight through Houston" (each node is annotated with the probability of the production that expands it):
      (S [0.1]
        (VP [0.5]
          (Verb [0.5] book)
          (NP [0.6]
            (Det [0.6] the)
            (Nominal [0.5]
              (Nominal [0.3] (Noun [0.5] flight))
              (PP [1.0]
                (Prep [0.2] through)
                (NP [0.2] (Proper-Noun [0.8] Houston)))))))
      P(D1) = 0.1 × 0.5 × 0.5 × 0.6 × 0.6 × 0.5 × 0.3 × 1.0 × 0.2 × 0.2 × 0.5 × 0.8 = 0.0000216

  28. Syntactic Disambiguation • Resolve ambiguity by picking the most probable parse tree. • Derivation D2 of "book the flight through Houston" (each node is annotated with the probability of the production that expands it):
      (S [0.1]
        (VP [0.3]
          (VP [0.5]
            (Verb [0.5] book)
            (NP [0.6]
              (Det [0.6] the)
              (Nominal [0.3] (Noun [0.5] flight))))
          (PP [1.0]
            (Prep [0.2] through)
            (NP [0.2] (Proper-Noun [0.8] Houston)))))
      P(D2) = 0.1 × 0.3 × 0.5 × 0.6 × 0.5 × 0.6 × 0.3 × 1.0 × 0.5 × 0.2 × 0.2 × 0.8 = 0.00001296
      • D1 has a higher probability, hence it is the more likely parse according to the PCFG.
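The two products can be checked with a few lines of code. In the sketch below, each derivation is written out as the list of productions it uses, with probabilities taken from the air travel PCFG; the names `D1`, `D2`, and `derivation_probability` are illustrative.

```python
import math

# Productions used by each derivation of "book the flight through Houston".
SHARED = [
    ("S -> VP", 0.1), ("VP -> Verb NP", 0.5), ("Verb -> book", 0.5),
    ("NP -> Det Nominal", 0.6), ("Det -> the", 0.6),
    ("Nominal -> Noun", 0.3), ("Noun -> flight", 0.5),
    ("PP -> Prep NP", 1.0), ("Prep -> through", 0.2),
    ("NP -> Proper-Noun", 0.2), ("Proper-Noun -> Houston", 0.8),
]
D1 = SHARED + [("Nominal -> Nominal PP", 0.5)]  # PP modifies "flight"
D2 = SHARED + [("VP -> VP PP", 0.3)]            # PP modifies "book"


def derivation_probability(derivation):
    """Multiply the probabilities of all productions in a derivation."""
    return math.prod(prob for _rule, prob in derivation)


p1 = derivation_probability(D1)
p2 = derivation_probability(D2)
print(f"P(D1) = {p1:.8f}")
print(f"P(D2) = {p2:.8f}")
print("more likely:", "D1" if p1 > p2 else "D2")
```

Writing the derivations this way shows that they share eleven productions and differ in exactly one, so the choice between the parses comes down to 0.5 for Nominal → Nominal PP against 0.3 for VP → VP PP.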

  29. Syntactic Parsing • State-of-the-art syntactic parsers also use words to influence the probabilities of productions • VP → Verb NP with “sneeze” as the verb will have a different probability than VP → Verb NP with “eat” as the verb • You don’t sneeze something but you eat something • Try the online version of the Stanford Parser: • http://nlp.stanford.edu:8080/parser/

  30. Syntax of Biomedical Languages • Clinical language often relaxes many syntactic constraints in order to be highly compact • The cough worsened • Cough worsened • Cough • Increased tenderness. • Because these forms are widely used, they are not considered ungrammatical but are treated as a sublanguage • There is a wide variety of sublanguages in the biomedical domain, each exhibiting specialized content and linguistic forms • Parsers trained in one domain typically do not work well in another domain; they require adaptation
