Web scale Information Extraction

Web scale Information Extraction

The Value of Text Data • “Unstructured” text data is the primary form of human-generated information • Blogs, web pages, news, scientific literature, online reviews, … • Semi-structured data (database generated): see Prof. Bing Liu’s KDD webinar: http://www.cs.uic.edu/~liub/WCM-Refs.html • The techniques discussed here are complimentary to structured object extraction methods • Need to extract structured information to effectively manage, search, and mine the data • Information Extraction: mature, but active research area • Intersection of Computational Linguistics, Machine Learning, Data mining, Databases, and Information Retrieval • Traditional focus on accuracy of extraction

Outline • Information Extraction Tasks • Entity tagging • Relation extraction • Event extraction • Scaling up Information Extraction • Focus on scaling up to large collections (where data mining can be most beneficial) • Other dimensions of scalability

Information Extraction Tasks • Extracting entities and relations: this talk • Entities: named (e.g., Person) and generic (e.g., disease name) • Relations: entities related in a predefined way (e.g., Location of a Disease outbreak, or a CEO of a Company) • Events: can be composed from multiple relation tuples • Common extraction subtasks: • Preprocess: sentence chunking, syntactic parsing, morphological analysis • Create rules or extraction patterns: hand-coded, machine learning, and hybrid • Apply extraction patterns or rules to extract new information • Postprocess and integrate information • Co-reference resolution, deduplication, disambiguation

Entity Tagging • Identifying mentions of entities (e.g., person names, locations, companies) in text • MUC (1997): Person, Location, Organization, Date/Time/Currency • ACE (2005): more than 100 more specific types • Hand-coded vs. Machine Learning approaches • Best approach depends on entity type and domain: • Closed class (e.g., geographical locations, disease names, gene & protein names): hand coded + dictionaries • Syntactic (e.g., phone numbers, zip codes): regular expressions • Semantic (e.g., person and company names): mixture of context, syntactic features, dictionaries, heuristics, etc. • “Almost solved” for common/typical entity types

Useful for data warehousing, data cleaning, web data integration House number Zip State Building Road City Example: Extracting Entities from Text Address 4089 Whispering Pines Nobel Drive San Diego CA 92122 1 Ronald Fagin, Combining Fuzzy Information from Multiple Systems, Proc. of ACM SIGMOD, 2002 Citation

ContactPattern  RegularExpression(Email.body,”can be reached at”) Hand-Coded Methods • Easy to construct in some cases • e.g., to recognize prices, phone numbers, zip codes, conference names, etc. • Intuitive to debug and maintain • Especially if written in a “high-level” language: • Can incorporate domain knowledge • Scalability issues: • Labor-intensive to create • Highly domain-specific • Often corpus-specific • Rule-matches can be expensive [IBM Avatar]

The human T cell leukemia lymphotropic virus type 1 Tax protein represses MyoD-dependent transcription by inhibiting MyoD-binding to the KIX domain of p300.“ [From AliBaba] Machine Learning Methods • Can work well when lots of training data easy to construct • Can capture complex patterns that are hard to encode with hand-crafted rules • e.g., determine whether a review is positive or negative • extract long complex gene names • Non-local dependencies

Popular Machine Learning Methods For details: [Feldman, 2006 and Cohen, 2004] • Naive Bayes • SRV [Freitag 1998], Inductive Logic Programming • Rapier [Califf and Mooney 1997] • Hidden Markov Models [Leek 1997] • Maximum Entropy Markov Models [McCallum et al. 2000] • Conditional Random Fields [Lafferty et al. 2001] • Scalability • Can be labor intensive to construct training data • At run time, complex features can be expensive to construct or process (batch algorithms can help: [Chandel et al. 2006] )

Some Available Entity Taggers • ABNER: • http://www.cs.wisc.edu/~bsettles/abner/ • Linear-chain conditional random fields (CRFs) with orthographic and contextual features. • Alias-I LingPipe • http://www.alias-i.com/lingpipe/ • MALLET: • http://mallet.cs.umass.edu/index.php/Main_Page • Collection of NLP and ML tools, can be trained for name entity tagging • MinorThird: • http://minorthird.sourceforge.net/ • Tools for learning to extract entities, categorization, and some visualization • Stanford Named Entity Recognizer: • http://nlp.stanford.edu/software/CRF-NER.shtml • CRF-based entity tagger with non-local features

Alias-I LingPipe ( http://www.alias-i.com/lingpipe/ ) • Statistical named entity tagger • Generative statistical model • Find most likely tags given lexical and linguistic features • Accuracy at (or near) state of the art on benchmark tasks • Explicitly targets scalability: • ~100K tokens/second runtime on single PC • Pipelined extraction of entities • User-defined mentions, pronouns and stop list • Specified in a dictionary, left-to-right, longest match • Can be trained/bootstrapped on annotated corpora

Outline • Overview of Information Extraction • Entity tagging • Relation extraction • Event extraction • Scaling up Information Extraction • Focus on scaling up to large collections (where data mining and ML techniques shine) • Other dimensions of scalability

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis… interact complex CBF-A CBF-C CBF-B CBF-A-CBF-C complex associates Relation Extraction Examples Disease Outbreaks relation • Extract tuples of entities that are related in predefined way Relation Extraction „We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex.“ [From AliBaba]

Relation Extraction Approaches Knowledge engineering • Experts develop rules, patterns: • Can be defined over lexical items: “<company> located in <location>” • Over syntactic structures: “((Obj <company>) (Verb located) (*) (Subj <location>))” • Sophisticated development/debugging environments: • Proteus, GATE Machine learning • Supervised: Train system over manually labeled data • Soderland et al. 1997, Muslea et al. 2000, Riloff et al. 1996, Roth et al 2005, Cardie et al 2006, Mooney et al. 2005, … • Partially-supervised: train system by bootstrapping from “seed” examples: • Agichtein & Gravano 2000, Etzioni et al., 2004, Yangarber & Grishman 2001, … • “Open” (no seeds): Sekine et al. 2006, Cafarella et al. 2007, Banko et al. 2007 • Hybrid or interactive systems: • Experts interact with machine learning algorithms (e.g., active learning family) to iteratively refine/extend rules and patterns • Interactions can involve annotating examples, modifying rules, or any combination

Open Information Extraction [Banko et al., IJCAI 2007] • Self-Supervised Learner: • All triples in a sample corpus (e1, r, e2) are considered potential “tuples” for relation r • Positive examples: candidate triplets generated by a dependency parser • Train classifier on lexical features for positive and negative examples • Single-Pass Extractor: • Classify all pairs of candidate entities for some (undetermined) relation • Heuristically generate a relation name from the words between entities • Redundancy-Based Assessor: • Estimate probability that entities are related from co-occurrence statistics • Scalability • Extraction/Indexing • No tuning or domain knowledge during extraction, relation inclusion determined at query time • 0.04 CPU seconds pre sentence, 9M web page corpus in 68 CPU hours • Every document retrieved, processed (parsed, indexed, classified) in a single pass • Query-time • Distributed index for tuples by hashing on the relation name text • Related efforts: [Cucerzan and Agichtein 2005], [Pasca et al. 2006], [Sekine et al. 2006], [Rozenfeld and Feldman 2006], …

Event Extraction • Similar to Relation Extraction, but: • Events can be nested • Significantly more complex (e.g., more slots) than relations/template elements • Often requires coreference resolution, disambiguation, deduplication, and inference • Example: an integrated disease outbreak event [Hatunnen et al. 2002]

Event Extraction: Integration Challenges • Information spans multiple documents • Missing or incorrect values • Combining simple tuples into complex events • No single key to order or cluster likely duplicates while separating them from similar but different entities. • Ambiguity: distinct physical entities with same name (e.g., Kennedy) • Duplicate entities, relation tuples extracted • Large lists with multiple noisy mentions of the same entity/tuple • Need to depend on fuzzy and expensive string similarity functions • Cannot afford to compare each mention with every other. • See Part II of KDD 2006 Tutorial “Scalable Information Extraction and Integration” -- scaling up integration: http://www.scalability-tutorial.net/

Summary: Accuracy of Extraction Tasks • Errors cascade (errors in entity tag cause errors in relation extraction) • This estimate is optimistic: • Primarily for well-established (tuned) tasks • Many specialized or novel IE tasks (e.g. bio- and medical- domains) exhibit lower accuracy • Accuracy for all tasks is significantly lower for non-English [Feldman, ICML 2006 tutorial]

Multilingual Information Extraction • Closely tied to machine translation and cross-language information retrieval efforts. • Language-independent named entity tagging and related tasks at CoNLL: • 2006: multi-lingual dependency parsing (http://nextens.uvt.nl/~conll/) • 2002, 2003 shared tasks: language independent Named Entity Tagging (http://www.cnts.ua.ac.be/conll2003/ner/) • Global Autonomous Language Exploitation program (GALE): • http://www.darpa.mil/ipto/Programs/gale/concept.htm • Interlingual Annotation of Multilingual Text Corpora (IAMTC) • Tools and data for building MT and IE systems for six languages • http://aitc.aitcnet.org/nsf/iamtc/index.html • REFLEX project: NER for 50 languages • Exploit for training temporal correlations in weekly aligned corpora • http://l2r.cs.uiuc.edu/~cogcomp/wpt.php?pr_key=REFLEX • Cross-Language Information Retrieval (CLEF) • http://www.clef-campaign.org/

Outline • Overview of Information Extraction • Entity tagging • Relation extraction • Event Extraction • Scaling up Information Extraction • Focus on scaling up to large collections (where data mining and ML techniques shine) • Other dimensions of scalability

Scaling Information Extraction to the Web • Dimensions of Scalability • Corpus size: • Applying rules/patterns is expensive • Need efficient ways to select/filter relevant documents • Document accessibility: • Deep web: documents only accessible via a search interface • Dynamic sources: documents disappear from top page • Source heterogeneity: • Coding/learning patterns for each source is expensive • Requires many rules (expensive to apply) • Domain diversity: • Extracting information for any domain, entities, relationships • Some recent progress (e.g., see slide 17) • Not the focus of this talk

Scaling Up Information Extraction • Scan-based extraction • Classification/filtering to avoid processing documents • Sharing common tags/annotations • General keyword index-based techniques • QXtract, KnowItAll • Specialized indexes • BE/KnowItNow, Linguist’s Search Engine • Parallelization/distributed processing • IBM WebFountain, UIMA, Google’s Map/Reduce

Classifier Filter documents Efficient Scanning for Information Extraction Text Database Extraction System • 80/20 rule: use few simple rules to capture majority of the instances [Pantel et al. 2004] • Train a classifier to discard irrelevant documents without processing [Grishman et al. 2002] • (e.g., the Sports section of NYT is unlikely to describe disease outbreaks) • Share base annotations (entity tags) for multiple extraction tasks filtered Retrieve docs from database Process documents Extract output tuples

Exploiting Keyword and Phrase Indexes • Generate queries to retrieve only relevant documents • Data mining problem! • Some methods in literature: • Traversing Query Graphs [Agichtein et al. 2003] • Iteratively refine queries [Agichtein and Gravano 2003] • Iteratively partition document space [Etzioni et al., 2004] • Case studies: QXtract, KnowItAll

Simple Strategy: Iterative Set Expansion Text Database Extraction System Query Generation Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q Process retrieved documents Extract tuplesfrom docs Augment seed tuples with new tuples Query database with seed tuples (e.g., <Malaria, Ethiopia>) (e.g., [Ebola AND Zaire]) Time for retrieving a document Time for processing a document Time for answering a query

Some IE tools Available • MALLET (UMass) • statistical natural language processing, • document classification, • clustering, • information extraction • other machine learning applications to text. • Sample Application:GeneTaggerCRF: a gene-entity tagger based on MALLET (MAchine Learning for LanguagE Toolkit). It uses conditional random fields to find genes in a text file.

MinorThird • http://minorthird.sourceforge.net/ • “a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text” • Stored documents can be annotated in independent files using TextLabels (denoting, say, part-of-speech and semantic information)

GATE • http://gate.ac.uk/ie/annie.html • leading toolkit for Text Mining • distributed with an Information Extraction component set called ANNIE (demo) • Used in many research projects • Long list can be found on its website • Under integration of IBM UIMA

Sunita Sarawagi's CRF package • http://crf.sourceforge.net/ • A Java implementation of conditional random fields for sequential labeling.

UIMA (IBM) • Unstructured Information Management Architecture. • A platform for unstructured information management solutions from combinations of semantic analysis (IE) and search components.

Some Interesting Website based on IE • ZoomInfo • CiteSeer.org(some of us using it everyday!) • Google Local, Google Scholar • and many more…

UIMA (IBM Research) • Unstructured Information Management Architecture (UIMA) • http://www.research.ibm.com/UIMA/ • Open component software architecture for development, composition, and deployment of text processing and analysis components. • Run-time framework allows to plug in components and applications and run them on different platforms. Supports distributed processing, failure recovery, … • Scales to millions of documents – incorporated into IBM OmniFind, grid computing-ready • The UIMA SDK (freely available) includes a run-time framework, APIs, and tools for composing and deploying UIMA components. • Framework source code also available on Sourceforge: • http://uima-framework.sourceforge.net/

UIMA – Quick Overview Architecture, Software Framework and Tooling

UIMA Analytics Bridge the Unstructured & Structured Worlds Text and Multi-Modal Analytics Unstructured Information Structured Information Text, Chat, Email, Audio, Video Indices Discover Relevant Semantics → Build into Structure • Docs, Emails, Phone Calls, Reports • Topics, Entities, Relationships • People, Places, Org, Times, Events • Customer Opinions, Products, Problems • Threats, Chemicals, Drugs, Drug Interactions.... DBs KBs • High-Value • Most Current • Fastest Growing • ...BUT ... • Buried in Huge Volumes (Noise) • Implicit Semantics • Inefficient Search • Explicit Semantics • Efficient Search • Focused Content • ...BUT... • Slow Growing • Narrow Coverage • Less Current/Relevant

Language, Speaker Identifiers Tokenizers Classifiers Part of Speech Detectors Document Structure Detectors Parsers, Translators Named-Entity Detectors Face Recognizers Relationship Detectors Modality Human Language Domain of Interest Source: Style and Format Input/Output Semantics Privacy/Security Precision/Recall Tradeoffs Performance/Precision Tradeoffs... The right analysis for the job will likely be a best-of-breed combination integrating capabilities across many dimensions. Analytics: The kinds of things they do • Different technologies & interfaces • Highly specialized & fine grained • Independently developed • From an increasing # of sources Capability Specializations Analysis Capabilities

UIMA CAS Representation now Aligned with XMI standard Common Analysis Structure (CAS) CeoOf Relationship Arg2:Org Arg1:Person Analysis Results (i.e., Artifact Metadata) Person Organization Named Entity PP NP VP Parser Fred Center is the CEO of Center Micros Artifact (e.g., Document) UIMA’s Basic Building Blocks are Annotators. They iterate over an artifact to discover new types based on existing ones and update the Common Analysis Structure(CAS) for upstream processing.

Analyzed by a collection of text analytics • Detected Semantic Entities and Relations Highlighted • Represented in UIMA Common Analysis Structure (CAS)

UIMA: Unstructured Information Management Architecture • Open Software Architecture and Emerging Standard • Platform independent standard for interoperable text and multi-modal analytics • Under Development: UIMA Standards Technical Committee Initiated under OASIS • http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=uima • Software Framework Implementation • SDK Available on IBM Alphaworks • http://www.alphaworks.ibm.com/tech/uima • Tools, Utilities, Runtime, Extensive Documentation • Creation, Integration, Discovery, Deployment of analytics • Java, C++, Perl, Python (others possible) • Supports co-located and service-oriented deployments (eg., SOAP) • x-Language High-Performances APIs to common data structure (CAS) • Embeddable on Systems Middleware (e.g., ActiveMQ, WebSphere, DB2) • Apache UIMA open-source project • http://incubator.apache.org/uima/

Text, Chat, Email, Speech, Video Any UIMA-Compliant Readers, Segmenters Any UIMA-Compliant CAS Consumer(s) Any UIMA-Compliant Analysis Engine(s) Transcription Engine Video Object Detector Web Crawler Index Tokens & Annotations in IR Engine Entity & Relation Detector(s) Deep Parser File System Reader Index Entities & Relations in RDB or OWL KB Arabic-English Translator (Web Service) Streaming Speech Segmenter Relational Database Connect, Read & Segment Sources Analyze Content Assign Task-Relevant Semantics Index or Process Results OWL Knowledge- Base Text IR Engine Index CAS CAS Video Search Index UIMA: Pluggable Framework, User-defined Workflows CAS: Common UIMA Data Representation & Interchange Aligned with OMG & W3C standards (i.e., XMI, SOAP, RDF) Query Interface(s) Query Services End-User Application Interfaces Relevant Knowledge

Ontologies Indices Text, Chat, Email, Audio, Video DBs Collection Reader Knowledge Bases Key Framework Construction Developer Codes UIMA Component Architecture Collection Processing Engine (CPE) Aggregate Analysis Engine CAS Consumer Analysis Engine CAS Consumer Annotator CAS Consumer Analysis Engine CAS CAS CAS Annotator Flow Controller Flow Controller

Web scale Information Extraction

Web scale Information Extraction

Presentation Transcript

Towards Web-Scale Information Extraction

Information Extraction from Web Documents

Information Extraction

Information Extraction

Automatic Wrappers for Large Scale Web Extraction

Open Information Extraction from the Web

information extraction

Information Extraction

Information Extraction

Automatic Wrappers for Large Scale Web Extraction

Information Extraction on the Web

Toward Semantic Web Information Extraction

LaSIE: The Large Scale Information Extraction System

Information Extraction

Information extraction from web pages using extraction ontologies

Automatic Wrappers for Large Scale Web Extraction

Information extraction from web pages using extraction ontologies

Web-Scale Information Extraction in KNOWITALL (Preliminary Results)