
Statistical Models of Text: From Bags of Words to Structure

Ralph Weischedel, 17 April 2000. Multi-dimensional Meta-data Extraction: Extraction Vision. Outline: statistical models that support feature extraction, including bags of words, topic extraction, and sequences (HMMs).


Presentation Transcript


  1. Statistical Models of Text: From Bags of Words to Structure. Ralph Weischedel, 17 April 2000

  2. Multi-dimensional Meta-data Extraction: Extraction Vision

  3. Outline: Statistical models that support feature extraction
     • Bags of words
       • Topic extraction
     • Sequences (HMMs)
       • Name extraction and classification
     • Lexicalized probabilistic context-free grammars
       • Parses
       • Facts/relationships
     • TBD
       • Propositions

  4. Topic Extraction via Bag of Words
     [Diagram: training sentences and answers → Training Program → Models; Speech → Speech Recognition → Text → Classifier → Topics]
     Example text: “President Clinton dumped his embattled Mexican bailout today. Instead, he announced another plan that doesn’t need congressional approval.”
     Topics:
     • Clinton, Bill
     • Mexico
     • Money
     • Economic assistance, American

  5. Generative Model of Story and Topics
     [Diagram: story start → P( Set ) → topic states T0 (General Language), T1, T2, ..., TM, each entered with P( Tj | Set ) and emitting words with P( Wn | Tj ) → loop → story end]
     • First, choose a Set of topics, T0...TM
     • For each word in the story:
       • Choose a topic according to P( Tj | Set )
       • Choose a word according to the output distribution P( Wn | Tj )
     • Loop
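The generative steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the system described in the talk: the topic names, the vocabulary, and every probability in `WORD_GIVEN_TOPIC` are made-up toy values, and the function names `generate_story` and `story_log_likelihood` are hypothetical. The second function scores a story against a candidate topic set by summing, for each word, the log of the mixture over topics.

```python
import math
import random

# Hypothetical toy output distributions P(Wn | Tj); all numbers are invented.
WORD_GIVEN_TOPIC = {
    "General Language": {"the": 0.5, "today": 0.3, "plan": 0.2},
    "Mexico": {"mexican": 0.6, "bailout": 0.3, "plan": 0.1},
    "Money": {"bailout": 0.5, "plan": 0.3, "today": 0.2},
}


def generate_story(topic_given_set, length, rng):
    """For each word: choose a topic from P(Tj | Set), then a word from P(Wn | Tj)."""
    topics = list(topic_given_set)
    topic_weights = [topic_given_set[t] for t in topics]
    story = []
    for _ in range(length):
        topic = rng.choices(topics, weights=topic_weights)[0]
        dist = WORD_GIVEN_TOPIC[topic]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        story.append(word)
    return story


def story_log_likelihood(words, topic_given_set, floor=1e-6):
    """log P(words | Set) = sum over n of log sum over j of P(Tj | Set) * P(Wn | Tj)."""
    total = 0.0
    for word in words:
        p = sum(
            p_topic * WORD_GIVEN_TOPIC[topic].get(word, 0.0)
            for topic, p_topic in topic_given_set.items()
        )
        # The floor is an assumption that keeps unseen words from giving log(0).
        total += math.log(max(p, floor))
    return total


# Two candidate topic sets, each mixing general language with one topic.
set_mexico = {"General Language": 0.5, "Mexico": 0.5}
set_money = {"General Language": 0.5, "Money": 0.5}
story = ["mexican", "bailout", "the"]
```

Comparing `story_log_likelihood(story, set_mexico)` with `story_log_likelihood(story, set_money)` shows how a classifier could rank candidate topic sets for a story; how the actual system searches over sets is not stated on the slide.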

  6. Topic Classification on Broadcast News
     • Trained on 1 year of stories from July ’95 to June ’96 (42,502 stories)
     • Tested on 989 stories from July ’96
     • Allowed 4,627 topics that occur at least twice
     • OOT (out-of-topic) rate was 2.45%
     • Results:
       • 75.8% of the first-choice topics are among the annotated labels
       • 63.6% for a simple likelihood-based method
       • 45% for the traditional tf-idf measure used in IR
     • On cursory examination of errors, the recognized topic was often correct and the annotator had failed to include it.

  7. Name Extraction via HMMs
     [Diagram: training sentences and answers → Training Program → NE Models; Speech → Speech Recognition → Text → Extractor → Entities]
     Example (entities are marked on the slide as Locations, Persons, or Organizations): “The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic.”
     • Prior to 1997: no learning approach was competitive with hand-built rule systems
     • Since 1997: statistical approaches (BBN, NYU, MITRE) achieve state-of-the-art performance

  8. A Hidden Markov Model: Structure of Model
     [Diagram label: Bi-gram transition probabilities]
     • One language model for each category, plus one for other (not-a-name)
     • The number of categories is learned from training
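The structure on this slide (one language model per category, plus one for not-a-name) can be illustrated with a small Viterbi decoder. This is a simplified sketch under stated assumptions, not the system described in the talk: the categories are fixed by hand rather than learned, every probability is an invented toy value, and the back-off from a within-category bigram to a unigram is a hypothetical simplification.

```python
import math

CATEGORIES = ["PERSON", "LOCATION", "OTHER"]  # OTHER plays the not-a-name role

# Hypothetical hand-set parameters; all numbers are invented for illustration.
TRANSITION = {  # P(category | previous category)
    "START": {"PERSON": 0.2, "LOCATION": 0.2, "OTHER": 0.6},
    "PERSON": {"PERSON": 0.5, "LOCATION": 0.1, "OTHER": 0.4},
    "LOCATION": {"PERSON": 0.1, "LOCATION": 0.3, "OTHER": 0.6},
    "OTHER": {"PERSON": 0.2, "LOCATION": 0.2, "OTHER": 0.6},
}
UNIGRAM = {  # P(word | category): one small language model per category
    "PERSON": {"michael": 0.5, "rose": 0.4},
    "LOCATION": {"pale": 0.5, "sarajevo": 0.5},
    "OTHER": {"went": 0.3, "to": 0.3, "near": 0.2, "rose": 0.05},
}
BIGRAM = {  # P(word | previous word, category), used only inside a category
    "PERSON": {("michael", "rose"): 0.9},
}
FLOOR = 1e-6  # assumed probability for unseen words


def emit(word, prev_word, category, same_category):
    """Bigram probability when staying in a category, else unigram or floor."""
    if same_category:
        p = BIGRAM.get(category, {}).get((prev_word, word))
        if p is not None:
            return p
    return UNIGRAM[category].get(word, FLOOR)


def viterbi(words):
    """Return the most probable category sequence for the words."""
    if not words:
        return []
    # best[c] holds (log probability, path) of the best path ending in c.
    best = {
        c: (math.log(TRANSITION["START"][c]) + math.log(emit(words[0], None, c, False)), [c])
        for c in CATEGORIES
    }
    for i in range(1, len(words)):
        new_best = {}
        for c in CATEGORIES:
            candidates = [
                (
                    score
                    + math.log(TRANSITION[prev][c])
                    + math.log(emit(words[i], words[i - 1], c, prev == c)),
                    path + [c],
                )
                for prev, (score, path) in best.items()
            ]
            new_best[c] = max(candidates, key=lambda item: item[0])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


tags = viterbi(["michael", "rose", "went", "to", "pale"])
# tags == ["PERSON", "PERSON", "OTHER", "OTHER", "LOCATION"]
```

Note that "rose" is ambiguous in the toy tables: the PERSON bigram ("michael", "rose") is what pulls it into the name rather than the not-a-name model.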

  9. Effect of Speech Recognition Error
     • BBN and NIST found that IdentiFinder performance degrades by 0.7 points of F-measure for each 1% of word error rate (WER)
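As a worked example of this rule of thumb, the sketch below applies the 0.7-point figure linearly. The function name `estimate_f` and the baseline value of 90.0 are hypothetical, and treating the relationship as linear across all WER values is an assumption; the slide only gives the rate.

```python
F_POINTS_LOST_PER_WER_POINT = 0.7  # rate reported on the slide


def estimate_f(clean_text_f, wer_percent):
    """Linear estimate of F on recognized speech.

    clean_text_f is a hypothetical F-measure on error-free text;
    wer_percent is the word error rate in percent.
    """
    return clean_text_f - F_POINTS_LOST_PER_WER_POINT * wer_percent


# With an assumed clean-text F of 90.0 and a WER of 20%,
# the estimate loses 0.7 * 20 = 14 points of F.
estimated = estimate_f(90.0, 20.0)
```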

  10. Parsing via Lexicalized Probabilistic CFGs
      • Prior to 1990: accuracy for non-statistical parsers was around 65%
      • Since 1995: statistical parsers (IBM, UPenn, Brown, and BBN) achieve 85-90% accuracy
      [Diagram: training sentences and answers → Training Program → NE Models; Speech → Speech Recognition → Text → Parser → Trees]
      Example: “Nawaz Sharif, who led Pakistan, was ousted October 12 by Pervez Musharraf, Pakistani Army General.”

  11. Example of Generating a Parse Tree
      [Figure: lexicalized parse tree for “Nawaz Sharif, who led Pakistan, was ousted October 12 by Pervez Musharraf, Pakistani Army General.”, with nodes labeled S, VP, NP, SBAR, PP, and WHNP and annotated with head words such as “was” and “ousted”]

  12. Extracting Facts via LPCFG
      [Diagram: training sentences and answers → Training Program → Models; Speech → Speech Recognition → Text → Extractor → Relationships/Events]
      Example: “Nance, who is also a paid consultant to ABC News, said ...”
      Extracted relationship: PositionHolder
      • Person: Nance
      • Post: a paid consultant
      • Org: ABC News
      • 1998: first state-of-the-art trainable system (70% accuracy)

  13. Type of Annotation Required
      [Figure: “Nance, who is also a paid consultant to ABC News, said ...” annotated with person, person-descriptor, and organization spans, a Coreference link, and an Employee relation]
      • Training data consists ONLY of:
        • Named entities (as in NE)
        • Descriptor phrases (for TE)
        • Descriptor references (for TE)
        • Relation/events to be extracted (for TR)

  14. The Sentential Model
      • Search criterion: find M such that p(M | W) is maximized
      • Since p(W) is constant, search for the M that maximizes p(M, W) = p(M | W) p(W)
      • Model the probability as the product of the probabilities of generating each element in the tree
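The last bullet can be made concrete with a short sketch that scores a tree as a product of per-element probabilities (computed as a sum of logs) and picks the highest-scoring candidate. This is a simplification under stated assumptions: here each element's probability depends only on a node's label and its children's labels, as in a plain PCFG, whereas the talk describes a lexicalized model. The names `tree_log_prob`, `best_tree`, and `toy_prob`, and all values in `TOY_TABLE`, are hypothetical.

```python
import math

# A tree node is (label, [children]); a word is a node with no children.
# Hypothetical toy probabilities for generating each element.
TOY_TABLE = {
    ("S", ("NP", "VP")): 0.9,
    ("S", ("VP", "NP")): 0.1,
    ("NP", ("Nance",)): 0.2,
    ("VP", ("said",)): 0.1,
    ("Nance", ()): 1.0,
    ("said", ()): 1.0,
}


def toy_prob(label, child_labels):
    """Probability of generating a node with these children (tiny if unseen)."""
    return TOY_TABLE.get((label, child_labels), 1e-6)


def tree_log_prob(node, element_prob):
    """Log of the product of the probabilities of each element in the tree."""
    label, children = node
    child_labels = tuple(child[0] for child in children)
    log_p = math.log(element_prob(label, child_labels))
    for child in children:
        log_p += tree_log_prob(child, element_prob)
    return log_p


def best_tree(candidates, element_prob):
    """Pick the candidate tree M with the highest joint score for the words W."""
    return max(candidates, key=lambda tree: tree_log_prob(tree, element_prob))


np_node = ("NP", [("Nance", [])])
vp_node = ("VP", [("said", [])])
tree_a = ("S", [np_node, vp_node])  # 0.9 * 0.2 * 0.1 = 0.018
tree_b = ("S", [vp_node, np_node])  # 0.1 * 0.1 * 0.2 = 0.002
```

Working in log space avoids numerical underflow when a tree has many elements, which is the usual reason to sum logs instead of multiplying probabilities directly.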

  15. Augmented Semantic Tree
      [Figure: parse tree for “Nance, who is also a paid consultant to ABC News, said ...” in which each node pairs a semantic label with a syntax label, for example per/np, per-desc-of/sbar-lnk, per-desc-ptr/sbar, per-desc-ptr/vp, per-desc-r/np, emp-of/pp-lnk, org-ptr/pp, per-r/np, per-desc/np, org-r/np, per/nnp, per-desc/nn, org-c/nnp, and org/nnp; purely syntactic nodes include s, vp, whnp, advp, wp, vbz, rb, det, vbn, to, and vbd]

  16. Propositions via TBD
      [Diagram: training sentences and answers → Training Program → Models; Speech → Speech Recognition → Text → Extractor → Propositions]
      Example: “Within the past two months, a bomb exploded in the offices of the El Espectador in Bogota, destroying a major part of its installations and equipment.”

  17. Towards a Proposition Bank
      • Add Predicate/Argument Markings
      • Add Co-reference
      • Add Verb Sense Markings
      [Figure: parse tree for “Nawaz Sharif, who led Pakistan, was ousted October 12 by Pervez Musharraf, Pakistani Army General.”, with verb senses marked (ousted-1, led-3) and one event frame per verb]
      Event frames: ousted-1 (Logical Object, Logical Subject, Time, Location: --) and led-3 (Logical Object, Logical Subject, Time: --, Location: --)

  18. Statistical Speech/Language Modeling
      [Diagram: Language Input and Answers → Trainer → Model; Language Input → Decoder → Answers]

      Technology                Input          Answers
      Speech recognition        audio          transcription
      OCR                       image          characters
      Speech understanding      audio          response
      Topic classification      document       topics
      Topic detection           text/speech    clusters
      Topic tracking            text/speech    relevant stories
      Story segmentation        speech         stories
      Information retrieval     query          text/speech
      Named entity extraction   text/speech    names & types

      Advantages
      • Mathematically rigorous approach
      • State-of-the-art performance
      • Highly robust in the face of degraded input
      • Language independent, requiring only annotated training data
      • Affordable annotation
        • Only domain knowledge is needed
        • Can be performed by students/interns
