This project explores the use of advanced Hidden Markov Models (HMMs) to improve information extraction (IE) from text documents. HMM states correspond to semantic types such as person names and background text. We address limitations of existing approaches by developing more flexible extraction techniques capable of handling multiple fields simultaneously while considering contextual information. Our research focuses on optimizing transitions, emissions, and the handling of unknown words through conditional training. The aim is to achieve higher accuracy in extracting structured information tailored to varying contexts.
Hidden Markov Models for Information Extraction: Recent Results and Current Projects
Joseph Smarr & Huy Nguyen
Advisor: Chris Manning
HMM Approach to IE
• HMM states are associated with a semantic type
  • background-text, person-name, etc.
• Constrained EM learns transitions and emissions
• Viterbi alignment of a document marks tagged ranges of text with the same semantic type (a decoding sketch follows below)
• Extract the range with the highest probability
[Diagram: state sequence 2 3 4 5 6 2 aligned with "Speaker is Huy Nguyen this week"]
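To make the decoding step concrete, here is a minimal sketch (not from the slides) of Viterbi alignment over a toy HMM whose states are semantic types; the state names, vocabulary, and all probabilities are made up for illustration.

```python
import numpy as np

# Toy HMM: each state is a semantic type; every number below is illustrative only.
states = ["background", "prefix", "person-name", "suffix"]
vocab = ["Speaker", "is", "Huy", "Nguyen", "this", "week"]
word_idx = {w: i for i, w in enumerate(vocab)}

start = np.array([0.70, 0.30, 0.00, 0.00])            # P(first state)
trans = np.array([                                     # P(next state | state)
    [0.80, 0.20, 0.00, 0.00],
    [0.00, 0.20, 0.80, 0.00],
    [0.00, 0.00, 0.50, 0.50],
    [0.90, 0.10, 0.00, 0.00],
])
emit = np.array([                                      # P(word | state)
    [0.05, 0.25, 0.05, 0.05, 0.30, 0.30],
    [0.80, 0.15, 0.01, 0.01, 0.02, 0.01],
    [0.01, 0.01, 0.49, 0.47, 0.01, 0.01],
    [0.05, 0.45, 0.02, 0.02, 0.23, 0.23],
])

def viterbi(words):
    """Return the most likely semantic-state sequence for a word sequence."""
    obs = [word_idx[w] for w in words]
    n, k = len(obs), len(states)
    delta = np.zeros((n, k))            # best log-prob of any path ending in state j at time t
    back = np.zeros((n, k), dtype=int)  # best predecessor state
    delta[0] = np.log(start + 1e-12) + np.log(emit[:, obs[0]] + 1e-12)
    for t in range(1, n):
        scores = delta[t - 1][:, None] + np.log(trans + 1e-12)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(emit[:, obs[t]] + 1e-12)
    path = [int(delta[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(["Speaker", "is", "Huy", "Nguyen", "this", "week"]))
# Tokens tagged "person-name" form the extracted field.
```

With these toy numbers the "Huy Nguyen" tokens come out tagged person-name, mirroring the example alignment above.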
Existing Work
• Leek (1997; UCSD MS thesis)
  • Early results, fixed structures
• Freitag & McCallum (1999, 2000)
  • Grow complex structures
Limitations of Existing Work
• Only one field extracted at a time
• Relative position of fields is ignored
  • e.g. authors usually come before titles in citations
• Similar-looking fields aren't competed for
  • e.g. acquired company vs. purchasing company
• Simple model of unknown words
  • Use <UNK> for all words seen less than N times
• No separation of content and context
  • e.g. can't plug in generic date extractors, etc.
Current Research Goals
• Flexibly train and combine extractors for multiple fields of information
• Learn structures suited for individual fields
  • Can be recombined and reused with many HMMs
• Learn intelligent context structures to link targets
  • Canonical ordering of fields
  • Common prefixes and suffixes
• Construct merged HMM for actual extraction
  • Context/target split makes the search problem tractable
  • Transitions between models are compiled out in the merge
Current Research Goals
• Richer models for handling unknown words (see the sketch below)
  • Estimate likelihood of novel words in each state
  • Featural decomposition for finer-grained probabilities
    • e.g. Nguyen → UNK[Capitalized, No-numbers]
  • Character-level models for higher precision
    • e.g. phone numbers, room numbers, dates, etc.
• Conditional training to focus on the extraction task
  • Classical joint estimation often wastes states modeling patterns in English background text
  • Conditional training is slower, but only rewards structure that increases labeling accuracy
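A minimal sketch of the featural decomposition idea for unknown words; the particular feature set and the unk_token helper are our own, with only the Capitalized / No-numbers pairing taken from the slide's example.

```python
import re

def unk_token(word):
    """Map an out-of-vocabulary word to a feature-based UNK symbol.

    The feature inventory here (capitalization, digits, numeric patterns) is
    illustrative, not the authors' actual feature set.
    """
    feats = []
    feats.append("Capitalized" if word[:1].isupper() else "Lowercase")
    feats.append("Has-numbers" if re.search(r"\d", word) else "No-numbers")
    if re.fullmatch(r"[\d\-\.\(\)/ ]+", word):
        feats.append("Numeric-pattern")   # e.g. phone numbers, room numbers, dates
    return "UNK[" + ", ".join(feats) + "]"

print(unk_token("Nguyen"))   # UNK[Capitalized, No-numbers]
print(unk_token("650-723"))  # UNK[Lowercase, Has-numbers, Numeric-pattern]
```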
Learning Target Structures
• Goal: learn flexible structure tailored to the composition of a particular field
• Representation: disjunction of multi-state chains
• Learning method (a sketch of the greedy loop follows below):
  • Collect and isolate all examples of the target field
  • Initialization: single state
  • Search operators (greedy search):
    • Extend current chain(s)
    • Start a new chain
  • Stopping criteria: MDL score
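The greedy search could be organized as in the following sketch; learn_target_structure and its helpers (train, mdl_score, extend_chain, add_chain) are hypothetical placeholders for EM training, MDL scoring, and the two search operators named on the slide.

```python
def learn_target_structure(field_examples, train, mdl_score, extend_chain, add_chain):
    """Greedy structure search over disjunctions of multi-state chains.

    Assumed helpers: `train` runs EM on a candidate structure, `mdl_score`
    returns its MDL score (lower is better), and the two operators implement
    "extend current chain(s)" and "start a new chain".
    """
    model = {"chains": [1]}              # initialization: one chain with a single state
    best = mdl_score(train(model, field_examples), field_examples)
    while True:
        candidates = [extend_chain(model, i) for i in range(len(model["chains"]))]
        candidates.append(add_chain(model))
        scored = [(mdl_score(train(c, field_examples), field_examples), c)
                  for c in candidates]
        score, winner = min(scored, key=lambda pair: pair[0])
        if score >= best:                # stopping criterion: MDL no longer improves
            return model
        best, model = score, winner
```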
Example Target HMM: dlramt
[Diagram: learned target HMM for the dlramt (dollar amount) field; between START and END, one chain emits numbers (13.5, 240, 100) followed by currency and magnitude words (mln, billion, U.S., Canadian, dlrs, dollars, yen, pesos), and another chain covers phrases like "undisclosed" / "amount withheld".]
Learning Context Structures
• Goal: learn structure to connect multiple target HMMs
  • Captures canonical ordering of fields
  • Identifies prefix and suffix patterns around targets
• Initialization (see the sketch below):
  • Background state connected to each target
  • Find minimum # of words between each target type in the corpus
  • Connect targets directly if the distance is 0
  • Add a context state between targets if they're close
• Search operators (greedy search):
  • Add prefix/suffix between background and target
  • Lengthen an existing chain
  • Start a new chain (by splitting an existing one)
• Stopping criteria: MDL score
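A sketch of that initialization heuristic: measure the minimum word gap between consecutive target spans in the labeled corpus, then connect each pair of target types directly, through a new context state, or only via background. The close_threshold cutoff and the data layout are assumptions; the slide only says "if they're close".

```python
from collections import defaultdict

def init_context_links(docs, close_threshold=4):
    """Decide how target types are linked in the initial context HMM.

    `docs` is a list of token sequences given as ("word", field) pairs, with
    field None for background text. Returns, per ordered pair of fields, how
    they should be connected in the initial structure.
    """
    min_gap = defaultdict(lambda: float("inf"))
    for doc in docs:
        spans = []                       # [field, start, end] for each labeled span
        for i, (word, field) in enumerate(doc):
            if field and (not spans or spans[-1][0] != field or spans[-1][2] != i):
                spans.append([field, i, i + 1])
            elif field:
                spans[-1][2] = i + 1
        for (f1, _, e1), (f2, s2, _) in zip(spans, spans[1:]):
            min_gap[(f1, f2)] = min(min_gap[(f1, f2)], s2 - e1)

    links = {}
    for pair, gap in min_gap.items():
        if gap == 0:
            links[pair] = "direct"         # targets can be adjacent
        elif gap <= close_threshold:
            links[pair] = "context-state"  # add a context state between them
        else:
            links[pair] = "background"     # connect only through the background state
    return links
```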
Example of Context HMM
[Diagram: context HMM with START and END states, a Background state (emitting e.g. "The", "yesterday", "Reuters"), target states Purchaser and Acquired, and a Context state between the targets emitting words such as "purchased", "acquired", "bought".]
Merging Context and Targets
• In the context HMM, targets are collapsed into a single state that always emits "purchaser", etc.
• Target HMMs have a single START and END state
• Glue target HMMs into place by "compiling out" start/end transitions and creating one big HMM (see the sketch below)
• Challenge: create supportive structure without being overly restrictive
  • Too little structure: hard to find regularities
  • Too much structure: can't generate all docs
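One way to picture the compiling-out step: entering the collapsed target state in the context HMM becomes entering whatever the target's START pointed to, and leaving through the target's END becomes following the collapsed state's outgoing context transitions. The matrix layout and function below are our own sketch, not the authors' code.

```python
import numpy as np

def splice_target(ctx_trans, target_state, tgt_trans, tgt_start, tgt_end):
    """Compile a target HMM into a context HMM by removing its START/END states.

    ctx_trans:   (C, C) context transition matrix; `target_state` is the index of
                 the collapsed placeholder state for this target.
    tgt_trans:   (T, T) target transition matrix, including its START/END rows.
    tgt_start, tgt_end: indices of the target's START and END states.
    Returns one merged transition matrix (context states first, then target states).
    """
    keep = [i for i in range(tgt_trans.shape[0]) if i not in (tgt_start, tgt_end)]
    ctx_keep = [i for i in range(ctx_trans.shape[0]) if i != target_state]
    C, T = len(ctx_keep), len(keep)
    merged = np.zeros((C + T, C + T))

    merged[:C, :C] = ctx_trans[np.ix_(ctx_keep, ctx_keep)]
    # Context -> target: entering the placeholder now means entering the states
    # that the target's START state pointed to.
    merged[:C, C:] = np.outer(ctx_trans[ctx_keep, target_state],
                              tgt_trans[tgt_start, keep])
    # Inside the target: internal transitions are unchanged.
    merged[C:, C:] = tgt_trans[np.ix_(keep, keep)]
    # Target -> context: leaving through END is rerouted to wherever the
    # placeholder state transitioned in the context HMM.
    merged[C:, :C] = np.outer(tgt_trans[keep, tgt_end],
                              ctx_trans[target_state, ctx_keep])
    return merged
```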
Example of Merging HMMs
[Diagram: the context HMM (Background, Purchaser, Context, Acquired between START and END) and a target HMM with its own START and END are combined into a single merged HMM; the target's START/END transitions are compiled out so its internal states take the place of the collapsed Purchaser state.]
Tricks and Optimizations
• Mandatory end state
  • Allows explicit modeling of document end
• Structural enhancements
  • Add transitions from start directly to targets
  • Add transitions from target/suffix directly to end
  • Allow "skip-ahead" transitions
• Separation of core structure learning
  • Structure learning is performed on a "skeleton" structure
  • Enhancements are added during parameter estimation
  • Keeps search tractable while exploiting rich transitions
Conditional Training
• Observation: joint HMMs waste states modeling patterns in background text
  • Improves document likelihood (like n-grams)
  • Doesn't improve labeling accuracy (can hurt it!)
  • Ideally, focus on prefixes, suffixes, etc. only
• Idea: maximize the conditional probability of labels, P(labels | words), instead of P(labels, words) (written out below)
  • Should only reward modeling helpful patterns
• Can't use standard Baum-Welch training
  • Solution: use numerical optimization (conjugate gradient)
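Written out (our notation, not taken from the slides), the quantity being maximized is the conditional log-likelihood; both sums can be computed with the forward algorithm, once restricted to state paths that agree with the training labels and once unrestricted:

```latex
\[
  \mathcal{L}(\theta)
  \;=\; \log P_\theta(\text{labels} \mid \text{words})
  \;=\; \log \!\!\sum_{\substack{\text{paths consistent}\\ \text{with labels}}}\!\! P_\theta(\text{path}, \text{words})
  \;-\; \log \sum_{\text{all paths}} P_\theta(\text{path}, \text{words})
\]
```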
Potential of Conditional Training
• Don't waste states modeling background patterns
• Toy data model: ((abc)*(eTo))* [T is the target]
  • e.g. abcabcabcabceToabcabceToabcabcabc
• Modeling abc improves joint likelihood but provides no help for labeling targets
[Diagrams: "Optimal Joint Model" and "Optimal Labeling Model" toy HMMs; state emission labels include b, a|b|c, a|o, c|e, o, e, T]
Running Conditional Training
• Gradient descent requires a differentiable function
  • Value: the conditional log-likelihood, log P(labels, words) − log P(words), computed with the forward algorithm
  • Derivative: built from parameter expectations
• Likelihood and expectations are easily computed with existing HMM algorithms
• Compute values with and without type constraints (a sketch follows below)
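A compact sketch of computing that value with two forward passes, one constrained to label-consistent states and one unconstrained; the array layout and masking scheme are our own assumptions, not the authors' implementation.

```python
import numpy as np

def forward_logprob(obs, start, trans, emit, allowed=None):
    """Log-likelihood of an observation sequence via the forward algorithm.

    `allowed[t]` is an optional boolean mask of states consistent with the
    label at position t; passing it yields the label-constrained likelihood,
    omitting it yields the unconstrained likelihood.
    """
    alpha = np.log(start) + np.log(emit[:, obs[0]])
    if allowed is not None:
        alpha = np.where(allowed[0], alpha, -np.inf)
    for t in range(1, len(obs)):
        alpha = (np.logaddexp.reduce(alpha[:, None] + np.log(trans), axis=0)
                 + np.log(emit[:, obs[t]]))
        if allowed is not None:
            alpha = np.where(allowed[t], alpha, -np.inf)
    return np.logaddexp.reduce(alpha)

def conditional_log_likelihood(obs, labels_mask, start, trans, emit):
    # log P(labels | words) = log P(labels, words) - log P(words)
    return (forward_logprob(obs, start, trans, emit, allowed=labels_mask)
            - forward_logprob(obs, start, trans, emit))
```

The derivative can then be assembled from the parameter expectations produced by the same constrained and unconstrained runs, as the slide indicates.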
Challenges for Conditional Training
• Need an additional constraint to keep numbers small
  • Can't guarantee you'll get a probability distribution
  • But it's OK if you're just summing and multiplying!
  • Solution: the sum of all parameters must equal a constant
• Need to fix the parameter space ahead of time
  • Can't add states, new words, etc.
  • Solution: start with a large ergodic model in which all states emit the entire vocabulary (use UNK tokens); see the sketch below
• Need sensible initialization
  • Uniform structure has high variance
  • Fixed structure usually dictates training
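A small sketch of that starting point: a fully connected (ergodic) model over the UNK-mapped vocabulary, with positive weights rescaled so that all parameters sum to a fixed constant. The total, jitter, and seed defaults are made-up values for illustration.

```python
import numpy as np

def init_ergodic(n_states, vocab_size, total=1000.0, jitter=0.05, seed=0):
    """Ergodic initialization for conditional training (a sketch).

    Every state can transition to every state and emit every vocabulary item
    (unknown words should be mapped to UNK tokens beforehand). Parameters are
    positive weights, not probability distributions; their overall sum is
    fixed to `total` as the constraint that keeps the numbers small.
    """
    rng = np.random.default_rng(seed)
    trans = 1.0 + jitter * rng.random((n_states, n_states))   # fully connected
    emit = 1.0 + jitter * rng.random((n_states, vocab_size))  # full vocabulary
    scale = total / (trans.sum() + emit.sum())                # sum-to-constant constraint
    return trans * scale, emit * scale
```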
Results on Toy Data Set
• Results on (([ae][bt][co])*(eto))*
  • Contains spurious prefix/target/suffix-like symbols
• Joint training always labels every t
• Conditional training eventually gets it perfectly
Current and Future Work
• Richer search operators for structure learning
• Richer models of unknown words (character-level)
• Reduce variance of conditional training
• Build a reusable repository of target HMMs
• Integrate with larger IE framework(s)
  • Semantic Web / KAON
  • LTG
• Applications
  • Semi-automatic ontology markup for web pages
  • Smart email processing