Domain Adaptation in Natural Language Processing • Jing Jiang • Department of Computer Science, University of Illinois at Urbana-Champaign
Textual Data in the Information Age • Contains much useful information • E.g. >85% of corporate data is stored as text • Hard to handle • Large amount: e.g. by 2002, 2.5 billion documents on the surface Web, +7.3 million / day • Diversity: emails, news, digital libraries, Web logs, etc. • Unstructured: vs. relational databases • How to manage textual data?
Information retrieval: to rank documents based on relevance to keyword queries • Not always satisfactory • More sophisticated services desired
Beyond Information Retrieval • Automatic text summarization • Question answering • Information extraction • Sentiment analysis • Machine translation • Etc. • All rely on Natural Language Processing (NLP) techniques to deeply understand and analyze text
Typical NLP Tasks • Example: “Larry Page was Google’s founding CEO” • Part-of-speech tagging: Larry/noun Page/noun was/verb Google/noun ’s/possessive-end founding/adjective CEO/noun • Chunking: [NP: Larry Page] [V: was] [NP: Google ’s founding CEO] • Named entity recognition: [person: Larry Page] was [organization: Google] ’s founding CEO • Relation extraction: Founder(Larry Page, Google) • Word sense disambiguation: “Larry Page” vs. “Page 81” • State-of-the-art solution: supervised machine learning
Supervised Learning for NLP • Pipeline: representative corpus (WSJ articles) → human annotation → POS-tagged WSJ articles → Standard Supervised Learning Algorithm (training) → trained POS tagger • Example annotation: Larry/NNP Page/NNP was/VBD Google/NNP ’s/POS founding/ADJ CEO/NN • Task: part-of-speech tagging on news articles
In Reality… • Human annotation is expensive, so the path representative corpus (MEDLINE articles) → human annotation → POS-tagged MEDLINE articles is crossed out (X) • Instead: POS-tagged WSJ articles → Standard Supervised Learning Algorithm (training) → trained POS tagger • Task: part-of-speech tagging on biomedical articles, e.g. We/PRP analyzed/VBD the/DT mutations/NNS of/IN the/DT H-ras/NN genes/NNS
Many Other Examples • Named entity recognition • News articles → personal blogs • Organism A → organism B • Spam filtering • Public email collection → personal inboxes • Sentiment analysis of product reviews (positive vs. negative) • Movies → books • Cell phones → digital cameras • What is the problem with this non-standard setting with domain difference?
Domain Difference → Performance Degradation • Ideal setting: POS tagger tested on MEDLINE, trained on MEDLINE: ~96% • Realistic setting: POS tagger tested on MEDLINE, trained on WSJ: ~86%
Another Example • Ideal setting: gene name recognizer, 54.1% • Realistic setting: gene name recognizer, 28.1%
Domain Adaptation • Source domain: labeled data • Target domain: labeled and unlabeled data • Goal: to design learning algorithms that are aware of domain difference and exploit all available data to adapt to the target domain (Domain Adaptive Learning Algorithm)
With Domain Adaptation Techniques… • Standard learning: gene name recognizer tested on Yeast, trained on Fly + Mouse: 63.3% • Domain adaptive learning: gene name recognizer tested on Yeast, trained on Fly + Mouse: 75.9%
Roadmap • What is domain adaptation in NLP? • Our work • Overview • Instance weighting • Feature selection • Summary and future work
Overview Source Domain Target Domain
Ideal Goal Source Domain Target Domain
Standard Supervised Learning Source Domain Target Domain
Standard Semi-Supervised Learning Source Domain Target Domain
Idea 1: Generalization Source Domain Target Domain
Idea 2: Adaptation Source Domain Target Domain
Source Domain Target Domain How to formally formulate the ideas?
Instance Weighting • Instance space (each point represents an observed instance): Source Domain vs. Target Domain • Goal: to find appropriate weights for different instances
Feature Selection • Feature space (each point represents a useful feature): Source Domain vs. Target Domain • Goal: to separate generalizable features from domain-specific features
Roadmap • What is domain adaptation in NLP? • Our work • Overview • Instance weighting • Feature selection • Summary and future work
Observation source domain target domain
Analysis of Domain Difference • x: observed instance; y: class label (to be predicted) • p(x, y) = p(x) p(y | x) • ps(y | x) ≠ pt(y | x): labeling difference → labeling adaptation • ps(x) ≠ pt(x): instance difference → instance adaptation
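The factorization on this slide can be written out as a worked equation. The second line is not on the slide; it is a direct consequence of the first, assuming both joint distributions factor the same way. Here p_s and p_t stand for the slide's ps and pt.

```latex
\begin{align*}
p(x, y) &= p(x)\, p(y \mid x) \\
\frac{p_t(x, y)}{p_s(x, y)}
  &= \underbrace{\frac{p_t(x)}{p_s(x)}}_{\text{instance difference}}
     \cdot
     \underbrace{\frac{p_t(y \mid x)}{p_s(y \mid x)}}_{\text{labeling difference}}
\end{align*}
```

When both ratios equal 1 for every instance, the two domains coincide and no adaptation is needed.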
Labeling Adaptation • Source domain vs. target domain, where pt(y | x) ≠ ps(y | x) • Remove/demote instances
Instance Adaptation (pt(x) < ps(x)) • Source domain vs. target domain • Remove/demote instances
Instance Adaptation (pt(x) > ps(x)) • Source domain vs. target domain • Promote instances
Instance Adaptation (pt(x) > ps(x)) source domain target domain pt(x) > ps(x) • Target domain instances are useful
Empirical Risk Minimization with Three Sets of Instances • Instance sets: Dt,l, Dt,u, Ds • Optimal classification model: minimizes the expected loss under a loss function • Use empirical loss to replace expected loss
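The slide's own formulas did not survive extraction. As a sketch of the standard formulation the labels describe, the two objectives can be written as follows. The symbols θ (model parameters), L (loss function), and N (sample size) are assumed names, and the samples (x_i, y_i) are assumed to be drawn from the target distribution p_t.

```latex
\begin{align*}
\theta^{*} &= \arg\min_{\theta} \sum_{(x, y)} p_t(x, y)\, L(x, y, \theta)
  && \text{(expected loss)} \\
\hat{\theta} &= \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} L(x_i, y_i, \theta)
  && \text{(empirical loss)}
\end{align*}
```

The difficulty in domain adaptation is that most labeled samples come from p_s rather than p_t, which is why the three instance sets have to be treated differently.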
Using Ds (of the three sets Dt,l, Dt,u, Ds) • Instance difference (hard for high-dimensional data) • Labeling difference (need labeled target data)
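A minimal sketch of how the instance difference could be corrected when using Ds, assuming the densities ps(x) and pt(x) are known exactly for a toy discrete x (the slide notes this is hard for high-dimensional data). The function names and the toy numbers are hypothetical. Only the instance-difference ratio pt(x) / ps(x) is covered; the labeling-difference ratio would need labeled target data.

```python
def importance_weights(p_source, p_target):
    """Weight pt(x) / ps(x) for each discrete value x (requires ps(x) > 0)."""
    return {x: p_target[x] / p_source[x] for x in p_source}


def weighted_source_loss(instances, loss, weights):
    """Average of weight(x) * loss(x) over the source instances."""
    total = sum(weights[x] * loss(x) for x in instances)
    return total / len(instances)


# Toy setup (hypothetical numbers): "a" is common in the source domain,
# "b" is common in the target domain, and the model is only wrong on "b".
p_s = {"a": 0.8, "b": 0.2}
p_t = {"a": 0.2, "b": 0.8}
toy_loss = {"a": 0.0, "b": 1.0}
source_sample = ["a"] * 8 + ["b"] * 2  # drawn in proportion to p_s

beta = importance_weights(p_s, p_t)
uniform = {x: 1.0 for x in p_s}

unweighted = weighted_source_loss(source_sample, toy_loss.get, uniform)
reweighted = weighted_source_loss(source_sample, toy_loss.get, beta)
target_expected = sum(p_t[x] * toy_loss[x] for x in p_t)
```

In this toy case the unweighted source loss underestimates the target loss, while the reweighted loss matches the target expectation.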
Using Dt,l (of the three sets Dt,l, Dt,u, Ds) • Small sample size: estimation not accurate
Using Dt,u (of the three sets Dt,l, Dt,u, Ds) • Use predicted labels (bootstrapping)
Combined Framework • A flexible setup covering both standard methods and new domain adaptive methods
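The combined formula itself is not recoverable from the slide. The following is one possible sketch of such a setup, assuming each instance set contributes a weighted average loss and the three sets are mixed with coefficients lam_s, lam_tl, and lam_tu. All names and the normalization are assumptions for illustration.

```python
def set_loss(pairs):
    """Weighted average loss of one instance set; pairs are (weight, loss)."""
    total_weight = sum(w for w, _ in pairs)
    if total_weight == 0:
        return 0.0
    return sum(w * l for w, l in pairs) / total_weight


def combined_loss(ds, dtl, dtu, lam_s, lam_tl, lam_tu):
    """Mix the losses on Ds, Dt,l, and Dt,u (with predicted labels)."""
    return lam_s * set_loss(ds) + lam_tl * set_loss(dtl) + lam_tu * set_loss(dtu)


# Hypothetical (weight, loss) pairs for each set.
ds = [(1.0, 0.5), (1.0, 1.5)]
dtl = [(1.0, 0.2)]
dtu = [(0.5, 0.4), (0.5, 0.8)]

source_only = combined_loss(ds, dtl, dtu, 1.0, 0.0, 0.0)   # standard supervised
target_only = combined_loss(ds, dtl, dtu, 0.0, 1.0, 0.0)   # labeled target only
mixed = combined_loss(ds, dtl, dtu, 0.2, 0.5, 0.3)         # domain adaptive mix
```

Setting a coefficient to zero switches a set off, so standard supervised learning on Ds and training on Dt,l alone are special cases of the same objective.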
Experiments • NLP tasks • POS tagging: WSJ (Penn TreeBank) → Oncology (biomedical) text (Penn BioIE) • NE type classification: newswire → conversational telephone speech (CTS) and web-log (WL) (ACE 2005) • Spam filtering: public email collection → personal inboxes (u01, u02, u03) (ECML/PKDD 2006) • Three heuristics to partially explore the parameter settings
Instance Pruning: removing “misleading” instances from Ds • Result panels: POS, NE Type, Spam • Useful in most cases; failed in some cases • When is it guaranteed to work? (future work)
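The slide does not say how “misleading” instances are identified. One plausible sketch, under the assumption that a model trained on Dt,l is available, is to drop the source instances whose gold label disagrees most confidently with that model's prediction. The predict callable and the cutoff k are hypothetical.

```python
def prune_source(source, predict, k):
    """Drop up to k source instances that disagree with the target model.

    source:  list of (x, y) pairs from Ds
    predict: callable x -> (predicted_label, confidence)
    """
    disagreements = []
    for idx, (x, y) in enumerate(source):
        label, confidence = predict(x)
        if label != y:
            disagreements.append((confidence, idx))
    # The most confident disagreements are treated as the most misleading.
    disagreements.sort(reverse=True)
    drop = {idx for _, idx in disagreements[:k]}
    return [inst for idx, inst in enumerate(source) if idx not in drop]


# Toy data: the hypothetical target model disagrees on "b" and "c".
toy_source = [("a", "N"), ("b", "V"), ("c", "N"), ("d", "V")]
toy_model = {"a": ("N", 0.9), "b": ("N", 0.8), "c": ("V", 0.6), "d": ("V", 0.7)}

pruned_one = prune_source(toy_source, toy_model.get, 1)
pruned_all = prune_source(toy_source, toy_model.get, 5)
```

Because the model trained on Dt,l is itself estimated from a small sample, aggressive pruning can remove useful instances, which is consistent with the mixed results noted on the slide.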
Dt,l with Larger Weights • Result panels: POS, NE Type, Spam • Dt,l is very useful • Promoting Dt,l is more useful
Bootstrapping with Larger Weights: until Ds and Dt,u are balanced • Result panels: POS, NE Type, Spam • Promoting target instances is useful, even with predicted labels
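One reading of "balanced" is that the pseudo-labeled target instances should carry the same total weight as Ds. The sketch below follows that reading, which is an assumption, and uses hypothetical helper names: it selects the most confident predictions on Dt,u and computes the per-instance weight that balances the two sets.

```python
def select_confident(predictions, k):
    """Keep the k most confident predictions as pseudo-labeled instances.

    predictions: list of (x, predicted_label, confidence)
    """
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    return [(x, label) for x, label, _ in ranked[:k]]


def balancing_weight(n_source, n_selected):
    """Per-instance weight so the selected set's total weight equals |Ds|."""
    if n_selected == 0:
        raise ValueError("no target instances selected")
    return n_source / n_selected


# Toy predictions on unlabeled target instances (hypothetical values).
toy_predictions = [("u1", "N", 0.55), ("u2", "V", 0.95), ("u3", "N", 0.80)]

selected = select_confident(toy_predictions, 2)
weight = balancing_weight(100, len(selected))
```

In a full bootstrapping loop the model would be retrained on Ds plus the weighted selected instances, and the selection repeated.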
Roadmap • What is domain adaptation in NLP? • Our work • Overview • Instance weighting • Feature selection • Summary and future work
Observation 1: Domain-specific features • wingless, daughterless, eyeless, apexless, … • describing phenotype in fly gene nomenclature • feature “-less” useful for this organism • CD38, PABPC5, … • feature still useful for other organisms? No!
Observation 2: Generalizable features • …decapentaplegic and wingless are expressed in analogous patterns in each… • …that CD38 is expressed by both neurons and glial cells… • …that PABPC5 is expressed in fetal brain and in a range of adult tissues. • feature “X be expressed”
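The two observations can be illustrated with a small sketch that extracts both kinds of feature from the slide's example sentences and checks which ones fire in every domain. The feature names, the domain labels "fly" and "other", and the all-domains criterion are assumptions for illustration, not the method from the talk.

```python
def extract_features(tokens, i):
    """Features for the candidate gene name at position i."""
    feats = set()
    if tokens[i].endswith("less"):
        feats.add("suffix=-less")
    # "X is expressed" / "X are expressed", normalized to "X be expressed".
    if tokens[i + 1:i + 3] in (["is", "expressed"], ["are", "expressed"]):
        feats.add("context=X be expressed")
    return feats


# (domain, tokens, index of the gene name); sentences come from the slides.
examples = [
    ("fly", "decapentaplegic and wingless are expressed in analogous patterns".split(), 2),
    ("other", "that CD38 is expressed by both neurons and glial cells".split(), 1),
    ("other", "that PABPC5 is expressed in fetal brain".split(), 1),
]

# Record the set of domains in which each feature fires.
domains_seen = {}
for domain, tokens, i in examples:
    for feat in extract_features(tokens, i):
        domains_seen.setdefault(feat, set()).add(domain)

all_domains = {domain for domain, _, _ in examples}
generalizable = {f for f, seen in domains_seen.items() if seen == all_domains}
domain_specific = set(domains_seen) - generalizable
```

On these three sentences the context feature fires in both domains, while the suffix feature fires only for the fly gene name.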