Information Extraction Introduction

Information Extraction Introduction. Sunita Sarawagi. Definition: “Information Extraction refers to the automatic extraction of structured information such as entities, relationships between entities, and attributes describing entities from unstructured sources.”

Presentation Transcript


  1. Information Extraction: Introduction. Sunita Sarawagi

  2. Definition “Information Extraction refers to the automatic extraction of structured information such as entities, relationships between entities, and attributes describing entities from unstructured sources.” • Enables richer forms of queries • Facilitates source integration and queries spanning sources

  3. IE: Multidisciplinary • Roots in NLP • Now many communities: machine learning, information retrieval, databases, Web (web science), document analysis • Sarawagi’s categorization of methods: rule-based, statistical, hybrid models

  4. Applications • News Tracking • Customer Care (e.g., unstructured data from insurance claim forms) • Data Cleaning (e.g., converting address strings into structured strings) • Classified Ads • Personal Information Management • Scientific (e.g., bioinformatics) • Citation Databases • Opinion Databases (e.g., enhanced if organized along structured fields) • Community Websites (e.g., conferences, projects, events) • Comparison Shopping • Ad Placement (e.g., product ads next to text mentioning the product) • Structured Web Search • Grand Challenge: allow structured search queries involving entities and their relationships over the WWW
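
As an illustration of the data cleaning application (turning an address string into structured fields), here is a minimal rule-based sketch. The pattern, the field names, and the sample address are all assumptions made up for this example; they cover only one simple layout, and real address data is far messier.

```python
import re

# Hypothetical rule for one simple layout:
# "<number> <street>, <city>, <STATE> <zip>".
ADDRESS_RE = re.compile(
    r"^\s*(?P<number>\d+)\s+(?P<street>[^,]+?)\s*,\s*"
    r"(?P<city>[^,]+?)\s*,\s*"
    r"(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})\s*$"
)


def parse_address(text):
    """Return a dict of address fields, or None if the rule does not match."""
    match = ADDRESS_RE.match(text)
    return match.groupdict() if match else None


# Made-up sample input, not taken from the slides.
record = parse_address("12 Main Street, Springfield, IL 62701")
unmatched = parse_address("no address here")
```

A single hand-written rule like this breaks as soon as the layout varies, which is one reason the slides list statistical and hybrid methods alongside rule-based ones.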

  5. Types of Structure Extracted • Entities • Relationships • Adjective Descriptors • Structures • Aggregates • Lists • Tables • Hierarchies

  6. Types of Unstructured Sources • Granularity: records or sentences, paragraphs, documents • Heterogeneity: machine-generated pages, partially structured domain-specific sources, open-ended sources

  7. Input Resources for Extraction • Structured Databases: “In many applications unstructured data needs to be integrated with structured databases.” • Labeled Unstructured Text: labeling for machine learning; labeling to establish ground truth • Preprocessor Libraries (NLP tools): sentence analyzer to identify sentence boundaries; part-of-speech tagger; parser to group tagged text into phrases; dependency analyzer (subject/object) • Formatted text (table & list structures) • Lexical Resources (e.g., WordNet)
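
To make the first preprocessing step concrete, the sketch below shows a deliberately naive sentence analyzer and tokenizer. It is an assumption-laden stand-in: the splitting rule fails on abbreviations such as "e.g.", and the later steps on the slide (part-of-speech tagging, parsing, dependency analysis) are not attempted here because they normally come from an NLP library.

```python
import re


def split_sentences(text):
    """Naive sentence analyzer: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


# Made-up sample text for illustration.
text = "Extraction needs clues. Taggers supply some of them!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
```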

  8. Output of Extraction • Identify all instances in the unstructured text • Populate a database • For both, the core extraction work remains the same
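
The two output forms can be sketched with one shared extraction step. Everything specific here is an assumption for illustration: the "year" entity type, the regular expression, the sample documents, and the `year_mention` table are hypothetical and do not come from the slides.

```python
import re
import sqlite3

# Hypothetical rule: a four-digit number from 1900 to 2099 counts as a "year" entity.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_years(doc_id, text):
    """Core extraction step: yield (doc_id, start, end, surface_text) per mention."""
    for m in YEAR_RE.finditer(text):
        yield (doc_id, m.start(), m.end(), m.group())


# Made-up sample documents.
docs = {1: "The survey appeared in 2008.", 2: "Work from 1998 and 2004 is cited."}

# Output form 1: every instance identified in the text, as character spans.
mentions = [m for doc_id, text in docs.items() for m in extract_years(doc_id, text)]

# Output form 2: the same mentions loaded into a database table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE year_mention "
    "(doc_id INTEGER, start_pos INTEGER, end_pos INTEGER, value TEXT)"
)
conn.executemany("INSERT INTO year_mention VALUES (?, ?, ?, ?)", mentions)
rows = conn.execute(
    "SELECT doc_id, value FROM year_mention ORDER BY doc_id, start_pos"
).fetchall()
```

Both outputs are fed by the same `extract_years` call; only the final packaging differs, which mirrors the slide's remark that the core extraction work stays the same.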

  9. Challenges • Accuracy (foremost challenge) • Diversity of Clues Required to be Successful: the inherent complexity demands combining evidence, and combining it optimally is non-trivial; the problem is far from solved • Difficulty of Detecting Missed Extractions • Recall: percentage of the actual entities that are extracted correctly; without ground truth, the actual entities cannot be known • Precision: percentage of the extracted entities that are correct; easier to tune, because each extraction can usually be judged correct or incorrect • Increased Complexity of Structures Extracted (e.g., parts of a blog that assert an opinion)
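
The two measures can be written out as a short sketch. It assumes a gold (ground-truth) set of entities is available, which is exactly what the slide says is hard to obtain for recall; the entity strings below are made up for illustration.

```python
def precision_recall(extracted, gold):
    """Compute (precision, recall) for a set of extracted entities.

    precision = correct extractions / all extractions
    recall    = correct extractions / all gold entities
    """
    extracted, gold = set(extracted), set(gold)
    correct = extracted & gold
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Made-up example: 3 extractions, 2 of them correct, 4 gold entities in total.
gold = {"Sunita Sarawagi", "WordNet", "WWW", "NLP"}
extracted = {"Sunita Sarawagi", "WordNet", "Grand Challenge"}
p, r = precision_recall(extracted, gold)
```

In this example precision is 2/3 and recall is 2/4. Precision needs only a judgment on each extracted item, while recall needs the full gold set, so missed extractions stay invisible without labeled data.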

  10. Challenges (continued) • Running Time: with lots of documents, just finding the set from which to extract is challenging; expensive processing steps must be applied to many documents • Other System Issues • Dynamically changing sources • Data integration (when extracting the same objects from different sites) • Extraction errors: attach a confidence to each extraction, although computing the confidence is non-trivial
