
Information Extraction




  1. Information Extraction Sources: Sarawagi, S. (2008). Information extraction. Foundations and Trends in Databases, 1(3), 261–377. Hobbs, J. R., & Riloff, E. (2010). Information extraction. Handbook of Natural Language Processing, 2.

  2. Context

  3. History • Genesis = recognition of named entities (organization & people names) • Online access = pushes towards • personal desktops -> structured databases, • scientific publications -> structured records, • Internet -> structured fact-finding queries.

  4. Driving workshops / conferences • 1987-97: MUC (Message Understanding Conference) Filling slots, named entities & coreference (95-) • 1999-08: ACE (Automatic Content Extraction) « supporting various classification, filtering, and selection applications by extracting and representing language content » • 2008-now: TAC (Text Analysis Conference) • Knowledge Base Population (09-11) • Others: Textual entailment, Summarization, QA (until 2009)

  5. Example: MUC
     0. MESSAGE: ID TST1-MUC3-0001
     1. MESSAGE: TEMPLATE 1
     2. INCIDENT: DATE 02 FEB 90
     3. INCIDENT: LOCATION GUATEMALA: SANTO TOMAS (FARM)
     4. INCIDENT: TYPE ATTACK
     5. INCIDENT: STAGE OF EXECUTION ACCOMPLISHED
     6. INCIDENT: INSTRUMENT ID -
     7. INCIDENT: INSTRUMENT TYPE -
     8. PERP: INCIDENT CATEGORY TERRORIST ACT
     9. PERP: INDIVIDUAL ID "GUERRILLA COLUMN" / "GUERRILLAS"
     10. PERP: ORGANIZATION ID "GUATEMALAN NATIONAL REVOLUTIONARY UNITY" / "URNG"
     11. PERP: ORGANIZATION CONFIDENCE REPORTED AS FACT / CLAIMED OR ADMITTED: "GUATEMALAN NATIONAL REVOLUTIONARY UNITY" / "URNG"
     12. PHYS TGT: ID "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
     13. PHYS TGT: TYPE GOVERNMENT OFFICE OR RESIDENCE: "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
     14. PHYS TGT: NUMBER 1: "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
     15. PHYS TGT: FOREIGN NATION -
     16. PHYS TGT: EFFECT OF INCIDENT -
     17. PHYS TGT: TOTAL NUMBER -
     18. HUM TGT: NAME "CEREZO"
     19. HUM TGT: DESCRIPTION "PRESIDENT": "CEREZO" "CIVILIAN"
     20. HUM TGT: TYPE GOVERNMENT OFFICIAL: "CEREZO" CIVILIAN: "CIVILIAN"
     21. HUM TGT: NUMBER 1: "CEREZO" 1: "CIVILIAN"
     22. HUM TGT: FOREIGN NATION -
     23. HUM TGT: EFFECT OF INCIDENT NO INJURY: "CEREZO" DEATH: "CIVILIAN"
     24. HUM TGT: TOTAL NUMBER -

  6. Application • Enterprise Applications • News Tracking (terrorists, disease) • Customer care (linking mails to products, etc.) • Data Cleaning • Classified Ads • Personal Information Management (PIM) • Scientific Applications (e.g. bio-informatics) • Web Oriented • Citation databases • Opinion databases • Community websites (DBLife, Rexa - UMASS) • Comparison Shopping • Ad Placement on Webpages • Structured Web Searches

  7. IE - Taxonomy • Types of structures extracted • Entities, Records, Relationships • Open/Closed IE • Sources • Granularity of extraction • Heterogeneity: machine generated, (semi)structured, open • Input resources • Structured DB • Labelled Unstructured Text • Preprocessing (tokenizer, chunker, parser, …)

  8. Process (I) • Annotated documents • Rules hand-crafted by humans (1500 hours!)

  9. Process (I) • Annotated documents • Rules hand-crafted by humans (1500 hours!) • Rules generated by a system • Rules evaluated by humans

  10. Process (II) • Annotated documents • Rules hand-crafted by humans (1500 hours!) • Rules generated by a system • Rules learnt

  11. Process (III) • Annotated documents • Rules hand-crafted by humans (1500 hours!) • Rules generated by a system • Rules learnt • Models • Logic: First Order Logic • Sequence: e.g. HMM • Classifiers: e.g. MEM, CRF • Decomposition into a series of subproblems • Complex words, basic phrases, complex phrases, events and merging

  12. Process (IV) • Annotated documents • Relevant & non relevant documents • Rules hand-crafted by humans (1500 hours!) • Rules generated by a system • Rules learnt • Models • Logic: First Order Logic • Sequence: e.g. HMM • Classifiers: e.g. MEM, CRF

  13. Process (V) • Annotated documents • Relevant & non relevant documents • Seeds -> bootstrapping • Rules hand-crafted by humans (1500 hours!) • Rules generated by a system • Rules learnt • Models • Logic: First Order Logic • Sequence: e.g. HMM • Classifiers: e.g. MEM, CRF

  14. Recognizing entities / FILLING SLOTS

  15. Rule-based systems • Rules to mark an entity (or more) • Before the start of the entity • Tokens of the entity • After the end of the entity • Rules to mark the boundaries • Conflicts between rules • Larger span • Merge (if same action) • Order the rules
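A minimal sketch (not from the deck) of how such before/entity/after rules could be applied with regular expressions, including the "keep the larger span" conflict resolution mentioned above; the two patterns, labels, and example sentence are hypothetical.

```python
import re

# A rule marks entity tokens, optionally anchored on context before/after them.
# Both patterns below are made-up examples, not rules from any real system.
RULES = [
    # "Mr. <Capitalized> [<Capitalized>]" -> PERSON
    {"label": "PERSON", "pattern": r"\bMr\.\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)?)"},
    # "<Capitalized>+ Inc." -> ORGANIZATION
    {"label": "ORGANIZATION", "pattern": r"\b((?:[A-Z][a-z]+\s+)+Inc\.)"},
]

def extract(text):
    """Apply every rule; on overlapping matches keep the larger span."""
    matches = []
    for rule in RULES:
        for m in re.finditer(rule["pattern"], text):
            matches.append((m.start(1), m.end(1), rule["label"], m.group(1)))
    # Sort by start position, larger spans first, then keep non-overlapping ones.
    matches.sort(key=lambda x: (x[0], -(x[1] - x[0])))
    result, last_end = [], -1
    for start, end, label, span in matches:
        if start >= last_end:
            result.append((label, span))
            last_end = end
    return result

print(extract("Mr. John Smith joined Acme Widgets Inc. last year."))
# [('PERSON', 'John Smith'), ('ORGANIZATION', 'Acme Widgets Inc.')]
```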

  16. Entity Extraction – rule-based

  17. Learning rules • Algorithms are based on • Coverage [how many cases are covered by the rule] • Precision • Two approaches • Top-down (e.g. FOIL): start with coverage = 100% • Bottom-up: start with precision = 100%
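As a rough illustration of the two scoring quantities (this is not FOIL itself), a candidate rule can be evaluated on annotated examples; `score_rule`, the lambda rule, and the examples are invented for this sketch.

```python
def score_rule(rule, examples):
    """Score a candidate rule on annotated examples.

    rule     : predicate over a text; True means the rule fires
    examples : list of (text, is_positive) pairs
    Returns (coverage, precision): coverage = fraction of positive examples
    the rule fires on, precision = fraction of firings that are positive.
    """
    fired = [(text, pos) for text, pos in examples if rule(text)]
    positives = [e for e in examples if e[1]]
    hits = sum(1 for _, pos in fired if pos)
    coverage = hits / max(len(positives), 1)
    precision = hits / max(len(fired), 1)
    return coverage, precision

# Hypothetical rule: fires whenever the word "bomb" occurs in the text.
rule = lambda text: "bomb" in text.lower()
examples = [
    ("two bombs exploded downtown", True),
    ("a car bomb was defused", True),
    ("the stock market rose today", False),
]
print(score_rule(rule, examples))   # (1.0, 1.0)
```

A top-down learner would start from a maximally general rule (coverage 100%) and specialize it to raise precision; a bottom-up learner starts from a very specific rule (precision 100%) and generalizes it to raise coverage.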

  18. Rules – AutoSlog Riloff, E. (1993). Automatically constructing a dictionary for information extraction tasks, 811–811. • Rule Learning • Look at sentences containing targets • Heuristic: looking for a linguistic pattern

  19. Rules – LIEP Huffman, S. B. (2005). Learning information extraction patterns from examples. Learn (sets of meta-heuristics) by using syntactic paths that relate two role-filling constituents, e.g. [subject(Bob, named), object(named, CEO)]. Followed by generalization (matching + disjunction)

  20. Statistical models • How to label • IOB sequences (Inside, Outside, Beginning) • Sequences • Segmentation: Alleged/B guerrilla/I urban/I commandos/I launched/O two/B highpower/I bombs/I against/O a/B car/I dealership/I in/O downtown/O San/B Salvador/I this/B morning/I. • Grammar-based (longer dependencies) • Many ML models: • HMM • ME, CRF • SVM
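A small sketch of reading an IOB-tagged sequence back into segments, using the first part of the segmentation example above; the helper function is ours, not from the slides.

```python
def iob_to_segments(tagged_tokens):
    """Group IOB-tagged (token, tag) pairs into contiguous segments."""
    segments, current = [], []
    for token, tag in tagged_tokens:
        if tag == "B":                    # a new segment begins
            if current:
                segments.append(current)
            current = [token]
        elif tag == "I":                  # continue (or, if orphaned, open) a segment
            if current:
                current.append(token)
            else:
                current = [token]
        else:                             # "O": outside any segment
            if current:
                segments.append(current)
            current = []
    if current:
        segments.append(current)
    return [" ".join(seg) for seg in segments]

tagged = [("Alleged", "B"), ("guerrilla", "I"), ("urban", "I"),
          ("commandos", "I"), ("launched", "O"), ("two", "B"),
          ("highpower", "I"), ("bombs", "I")]
print(iob_to_segments(tagged))
# ['Alleged guerrilla urban commandos', 'two highpower bombs']
```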

  21. Statistical models (cont’d) • Features • Word • Orthographic • Dictionary • … • First order • Position: • Segment:
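A toy sketch of the word / orthographic / dictionary feature types listed above, computed for a single token; the gazetteer is a made-up stand-in for a real dictionary resource.

```python
def token_features(token, position,
                   gazetteer=frozenset({"salvador", "guatemala"})):
    """Illustrative per-token features: word identity, orthographic shape,
    dictionary (gazetteer) membership, and position in the sequence."""
    return {
        "word": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "contains_digit": any(c.isdigit() for c in token),
        "in_gazetteer": token.lower() in gazetteer,
        "position": position,
    }

print(token_features("Salvador", position=14))
```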

  22. Examples of features

  23. Statistical models (cont’d) • Learning: • Likelihood • Max-Margin

  24. Predicting relationships

  25. Overall • Goal: classify (E1, E2, x) • Features • Surface tokens (words, entities) [Entity label of E1 = Person, Entity label of E2 = Location] • Parse tree (syntactic, dependency graph) [POS = (noun, verb, noun), flag = “(1, none, 2)”, type = “dependency”]
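A toy sketch of assembling surface features for one (E1, E2, x) candidate pair; the sentence, entity labels, and feature names are illustrative only.

```python
def pair_features(sentence_tokens, e1, e2, e1_label, e2_label):
    """Surface features for relation classification: the two entity labels
    plus the tokens between the two mentions and their distance."""
    i, j = sentence_tokens.index(e1), sentence_tokens.index(e2)
    between = sentence_tokens[min(i, j) + 1:max(i, j)]
    return {
        "e1_label": e1_label,              # e.g. Person
        "e2_label": e2_label,              # e.g. Location
        "between_words": " ".join(between),
        "distance": abs(j - i),
    }

tokens = "Cerezo was attacked at the Santo-Tomas farm".split()
print(pair_features(tokens, "Cerezo", "Santo-Tomas", "Person", "Location"))
```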

  26. Models • Standard classifier (e.g. SVM) • Kernel-based methods • e.g. measure of common properties between two paths in the dependency tree • Convolution based kernels • Rule-based methods
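A deliberately trivial stand-in for the "common properties between two paths" idea: real dependency-path kernels compare richer per-node feature sets, but the basic intuition of counting shared material can be sketched as follows (the paths are invented).

```python
def path_kernel(path1, path2):
    """Toy kernel: similarity of two dependency paths = number of positions
    carrying the same label, and 0 if the paths have different lengths."""
    if len(path1) != len(path2):
        return 0
    return sum(1 for a, b in zip(path1, path2) if a == b)

p1 = ["subject", "named", "object"]
p2 = ["subject", "appointed", "object"]
print(path_kernel(p1, p2))   # 2
```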

  27. Extracting entities for a set of relationships • Three steps • Learn extraction patterns for the seeds • Find documents where entities appear close to each other • Filtering • Generate candidate triplets • Pattern or keyword-based • Validation • # of occurrences
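A deliberately simplified sketch of the three steps for a single "capital-of" relation: learn surface patterns from seed pairs, generate candidates by matching the patterns, and validate by occurrence count. The corpus, threshold, and pattern form are toy assumptions; real systems escape patterns and use far more robust generalization.

```python
import re

def bootstrap(corpus, seed_pairs, rounds=2):
    """Seed-based bootstrapping over a list of sentences (toy version)."""
    pairs = set(seed_pairs)
    for _ in range(rounds):
        # 1. Learn extraction patterns from sentences containing a known pair.
        patterns = set()
        for sent in corpus:
            for a, b in pairs:
                if a in sent and b in sent:
                    patterns.add(sent.replace(a, "(\\w+)").replace(b, "(\\w+)"))
        # 2. Generate candidate pairs by matching the patterns elsewhere.
        counts = {}
        for sent in corpus:
            for pat in patterns:
                m = re.fullmatch(pat, sent)
                if m and len(m.groups()) == 2:
                    counts[m.groups()] = counts.get(m.groups(), 0) + 1
        # 3. Validate: keep candidates seen often enough (toy threshold of 1).
        pairs |= {pair for pair, n in counts.items() if n >= 1}
    return pairs

corpus = ["Paris is the capital of France",
          "Rome is the capital of Italy"]
print(bootstrap(corpus, {("Paris", "France")}))
# contains ('Paris', 'France') and ('Rome', 'Italy')
```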

  28. MANAGEMENT

  29. Summary • Performance • Document selection: subset, crawling • Queries to DB: related entities (top-k retrieval) • Handling changes • Detecting when a page has changed • Integration • Detecting duplicate entities • Redundant extractions (open IE)
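For the integration step, a toy sketch of duplicate-entity detection using token-level Jaccard similarity over normalized names; the function and threshold are our own illustrative choices.

```python
def same_entity(name_a, name_b, threshold=0.5):
    """Decide whether two extracted names likely refer to the same entity,
    by Jaccard overlap of their lowercased tokens (arbitrary threshold)."""
    tokens_a = set(name_a.lower().replace(".", "").split())
    tokens_b = set(name_b.lower().replace(".", "").split())
    union = tokens_a | tokens_b
    overlap = len(tokens_a & tokens_b) / len(union) if union else 1.0
    return overlap >= threshold

print(same_entity("Guatemalan National Revolutionary Unity",
                  "National Revolutionary Unity"))   # True
print(same_entity("URNG", "Santo Tomas farm"))        # False
```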

  30. Evaluation

  31. Metrics • Metrics • Precision-Recall • F-measure (-> harmonic mean)
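A minimal sketch of the two metrics and their harmonic mean computed over sets of extracted items, as in MUC/ACE-style scoring; the example items are made up.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F-measure (harmonic mean of the two)
    for a set of extracted items against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(precision_recall_f1({"Cerezo", "URNG", "Salvador"}, {"Cerezo", "URNG"}))
# (0.666..., 1.0, 0.8)
```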

  32. The 60% barrier
