Hidden Markov Models Applied to Information Extraction

1. Hidden Markov Models Applied to Information Extraction
• Part I: Concept (HMM tutorial)
• Part II: Sample application (AutoBib: web information extraction)
Larry Reeve, INFO629: Artificial Intelligence, Dr. Weber, Fall 2004

2. Part I: Concept. HMM Motivation
• The real world has structures and processes which have (or produce) observable outputs
• Usually sequential (the process unfolds over time)
• The event producing the output cannot be seen
• Example: speech signals
• Problem: how to construct a model of the structure or process given only observations

3. HMM Background
• Basic theory developed and published in the 1960s and 70s
• No widespread understanding and application until the late 80s
• Why?
• The theory was published in mathematics journals not widely read by practicing engineers
• There was insufficient tutorial material for readers to understand and apply the concepts

4. HMM Uses
• Speech recognition: recognizing spoken words and phrases
• Text processing: parsing raw records into structured records
• Bioinformatics: protein sequence prediction
• Financial: stock market forecasts (price pattern prediction)
• Comparison shopping services

5. HMM Overview
• A machine learning method
• Makes use of state machines
• Based on probabilistic models
• Useful in problems having sequential steps
• Only the output from states can be observed, never the states themselves
• Example: speech recognition
• Observed: acoustic signals
• Hidden states: phonemes (the distinctive sounds of a language)
[Figure: state machine]

6. Observable Markov Model Example
• Weather: once each day the weather is observed
• State 1: rain
• State 2: cloudy
• State 3: sunny
• Each state corresponds to a physically observable event
• Question: what is the probability that the weather for the next 7 days will be sun, sun, rain, rain, sun, cloudy, sun? (worked through in the code sketch after slide 7)
[Figure: state transition matrix]

7. Observable Markov Model
[Figure: weather state-transition diagram and matrix]
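
A worked version of the slide-6 question. The actual transition probabilities were in the figure and are not recoverable here, so the values below are assumptions in the style of Rabiner's tutorial example; under them the 7-day sequence has probability about 1.5 x 10^-4.

```python
import numpy as np

# States and a state transition matrix. These numbers are assumed
# (Rabiner-tutorial style), not taken from the slide's figure.
states = ["rain", "cloudy", "sunny"]
A = np.array([
    [0.4, 0.3, 0.3],   # P(tomorrow | today = rain)
    [0.2, 0.6, 0.2],   # P(tomorrow | today = cloudy)
    [0.1, 0.1, 0.8],   # P(tomorrow | today = sunny)
])

def sequence_probability(seq, start):
    """P(seq | model, today = start): a product of transition probabilities,
    because in an *observable* Markov model every state is directly seen."""
    idx = [states.index(s) for s in [start] + seq]
    p = 1.0
    for prev, cur in zip(idx, idx[1:]):
        p *= A[prev, cur]
    return p

obs = ["sunny", "sunny", "rain", "rain", "sunny", "cloudy", "sunny"]
print(sequence_probability(obs, start="sunny"))   # 1.536e-04 with these values
```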

8. Hidden Markov Model Example
• Coin toss: a heads/tails sequence produced with 2 coins
• You are in a room, separated by a wall from another person
• The person behind the wall flips a coin and announces the result
• Coin selection and the toss itself are hidden
• You cannot observe the events, only their output (heads, tails)
• The problem is then to build a model that explains the observed sequence of heads and tails
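
A small simulation of the coin-toss setup makes "hidden" concrete: the coin choice and the switching behavior below never leave the function; an observer sees only the stream of H/T symbols. All numeric values are made up for illustration.

```python
import random

COIN_BIAS = {"coin1": 0.5, "coin2": 0.8}   # P(heads) for each coin (assumed)
SWITCH = 0.3                               # P(switching coins between flips)

def flip_sequence(n, seed=0):
    rng = random.Random(seed)
    coin = rng.choice(list(COIN_BIAS))     # hidden initial state
    observed = []
    for _ in range(n):
        observed.append("H" if rng.random() < COIN_BIAS[coin] else "T")
        if rng.random() < SWITCH:          # hidden state transition
            coin = "coin2" if coin == "coin1" else "coin1"
    return observed                        # only the outputs are visible

print("".join(flip_sequence(10)))          # an H/T string; the coins stay hidden
```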

9. HMM Components
• A set of states (x's)
• A set of possible output symbols (y's)
• A state transition matrix (a's): the probability of making a transition from one state to the next
• An output emission matrix (b's): the probability of emitting/observing a symbol at a particular state
• An initial probability vector: the probability of starting in a particular state (not always shown; sometimes assumed to be 1)

10. HMM Components
[Figure: diagram of the components listed on slide 9]
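
A minimal sketch of those components as numpy arrays, using the slide's x/y/a/b/pi naming; the probability values are placeholders.

```python
import numpy as np

x = ["state1", "state2"]            # hidden states
y = ["y1", "y2", "y3"]              # observable output symbols

a = np.array([[0.7, 0.3],           # a[i, j] = P(state j at t+1 | state i at t)
              [0.4, 0.6]])
b = np.array([[0.5, 0.4, 0.1],      # b[i, k] = P(emit symbol k | state i)
              [0.1, 0.3, 0.6]])
pi = np.array([1.0, 0.0])           # initial vector; this is the "assumed to
                                    # be 1" case mentioned on slide 9

# Sanity check: each row is a probability distribution.
assert np.allclose(a.sum(axis=1), 1) and np.allclose(b.sum(axis=1), 1)
assert np.isclose(pi.sum(), 1)
```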

11. Common HMM Types
• Ergodic (fully connected): every state of the model can be reached in a single step from every other state
• Bakis (left-right): as time increases, states proceed from left to right
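
The difference between the two types shows up directly in the zero pattern of the transition matrix, as in this illustration (values are placeholders):

```python
import numpy as np

# Ergodic: every entry positive, so any state reaches any other in one step.
A_ergodic = np.array([[0.3, 0.3, 0.4],
                      [0.2, 0.5, 0.3],
                      [0.4, 0.3, 0.3]])

# Bakis (left-right): zeros below the diagonal forbid moving backwards,
# so state indices can only stay the same or increase over time.
A_left_right = np.array([[0.6, 0.3, 0.1],
                         [0.0, 0.7, 0.3],
                         [0.0, 0.0, 1.0]])
```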

12. HMM Core Problems
Three problems must be solved for HMMs to be useful in real-world applications:
1) Evaluation
2) Decoding
3) Learning

13. HMM Evaluation Problem
• Purpose: score how well a given model matches a given observation sequence
• Example (speech recognition): assume HMMs (models) have been built for the words 'home' and 'work'; given a speech signal, evaluation can determine the probability that each model represents the utterance
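
The standard solution to the evaluation problem is the forward algorithm from Rabiner's tutorial (cited on slide 34). A minimal numpy sketch; the 'home'/'work' parameters are made-up illustration values, not anything from the presentation.

```python
import numpy as np

def forward(obs, a, b, pi):
    """Evaluation: P(observation sequence | model), summed over all paths."""
    alpha = pi * b[:, obs[0]]                # initialize with the first symbol
    for t in range(1, len(obs)):
        alpha = (alpha @ a) * b[:, obs[t]]   # propagate one step, then emit
    return alpha.sum()

# Hypothetical 2-state word models for 'home' and 'work' over 3 quantized
# acoustic symbols; the utterance is scored under each, and the higher wins.
a_home = np.array([[0.6, 0.4], [0.0, 1.0]])
b_home = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])
a_work = np.array([[0.8, 0.2], [0.0, 1.0]])
b_work = np.array([[0.3, 0.6, 0.1], [0.2, 0.2, 0.6]])
pi = np.array([1.0, 0.0])
obs = [0, 1, 2]                              # a toy observation sequence
print("home:", forward(obs, a_home, b_home, pi))
print("work:", forward(obs, a_work, b_work, pi))
```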

14. HMM Decoding Problem
• Given a model and a set of observations, what are the hidden states most likely to have generated the observations?
• Useful for learning about the internal model structure, determining state statistics, and so forth
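
Decoding is conventionally solved with the Viterbi algorithm, also covered in Rabiner's tutorial. A log-space sketch (log-probabilities avoid numeric underflow on long sequences):

```python
import numpy as np

def viterbi(obs, a, b, pi):
    """Decoding: the most likely hidden state path for the observations."""
    T, N = len(obs), a.shape[0]
    with np.errstate(divide="ignore"):        # log(0) -> -inf is fine here
        la, lb, lpi = np.log(a), np.log(b), np.log(pi)
    delta = lpi + lb[:, obs[0]]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + la          # scores[i, j]: best path i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + lb[:, obs[t]]
    path = [int(delta.argmax())]              # best final state...
    for t in range(T - 1, 0, -1):             # ...then follow back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# With the made-up 'home' parameters from the evaluation sketch above:
# viterbi([0, 1, 2], a_home, b_home, pi) -> [0, 1, 1]
```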

15. HMM Learning Problem
• The goal is to learn the HMM parameters (training):
• State transition probabilities
• Observation probabilities at each state
• Training is crucial: it allows optimal adaptation of the model parameters to training data observed from real-world phenomena
• There is no known method for obtaining optimal parameters from data, only approximations
• Learning can be a bottleneck in HMM usage
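
The approximation used in practice is Baum-Welch (EM) re-estimation, the method Rabiner's tutorial presents; it converges only to a local optimum, which matches the slide's caveat. A minimal numpy sketch of one re-estimation step, in the a/b/pi notation of slide 9:

```python
import numpy as np

def baum_welch_step(obs, a, b, pi):
    """One EM re-estimation step; iterate until the parameters stabilize."""
    T, N = len(obs), a.shape[0]
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * b[:, obs[0]]                 # forward pass
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ a) * b[:, obs[t]]
    beta[-1] = 1.0                               # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = a @ (b[:, obs[t + 1]] * beta[t + 1])
    evidence = alpha[-1].sum()                   # P(obs | current model)
    gamma = alpha * beta / evidence              # expected state occupancy
    xi = (alpha[:-1, :, None] * a[None] *        # expected transition counts
          (b[:, obs[1:]].T * beta[1:])[:, None, :]) / evidence
    new_pi = gamma[0]
    new_a = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_b = np.zeros_like(b)
    for k in range(b.shape[1]):                  # emissions: mass per symbol
        new_b[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_a, new_b, new_pi
```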

16. HMM Concept Summary
• Build models representing the hidden states of a process or structure, using only observations
• Use the models to evaluate the probability that a model represents a particular observation sequence
• Use that evaluation in an application: recognizing speech, parsing addresses, and many others

17. Part II: Application. The AutoBib System
• Provides a uniform view of several computer science bibliographic web data sources
• An automated web information extraction system that requires little human input
• Web pages are designed differently from site to site
• Information extraction (IE) requires training samples
• HMMs are used to parse unstructured bibliographic records into a structured format (an NLP task)

18. Web Information Extraction
[Figure: converting raw records into structured records]

19. Approach
1) Provide a seed database of structured records
2) Extract raw records from relevant Web pages
3) Match structured records to raw records (to build training samples)
4) Train the HMM-based parser
5) Parse unmatched raw records into structured records
6) Merge the new structured records into the database

20. AutoBib Architecture
[Figure: system architecture]

21. Step 1 - Seeding
• Provide a seed database of structured records
• Take a small collection of BibTeX-format records and insert them into the database
• A cleaning step normalizes record fields, for example:
• "Proc." → "Proceedings"
• "Jan" → "January"
• A manual step, executed once only
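
A sketch of how such a normalization table might be applied; only the two mappings come from the slide, everything else is an assumption.

```python
# Lookup table of abbreviation expansions (a real table would be larger).
NORMALIZE = {"Proc.": "Proceedings", "Jan": "January"}

def clean_field(value):
    """Replace each known abbreviation token with its canonical form."""
    return " ".join(NORMALIZE.get(tok, tok) for tok in value.split())

print(clean_field("Proc. of the Conference, Jan 2004"))
# -> "Proceedings of the Conference, January 2004"
```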

22. Step 2 - Extract Raw Records
• Extract raw records from relevant Web pages
• The user specifies which Web pages to extract from, and how to follow 'next page' links across multiple pages
• Raw records are extracted using record-boundary discovery techniques:
• Subtree of interest = the largest subtree of HTML tags
• Record separators = frequent HTML tags

23. Tokenized Records
[Figure: sample records with all HTML tags replaced by ^]
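
The figure itself is not recoverable, but its caption describes the transformation; a sketch of one plausible implementation (the sample record and the exact splitting rules are assumptions):

```python
import re

TAG = re.compile(r"<[^>]+>")        # any HTML tag

def tokenize_record(html):
    """Collapse every HTML tag to '^' so the parser sees a flat token stream."""
    return TAG.sub("^", html).split()

raw = "<li><b>J. Smith</b>, <i>An Example Title</i>, 2004.</li>"
print(tokenize_record(raw))
# -> ['^^J.', 'Smith^,', '^An', 'Example', 'Title^,', '2004.^^']
```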

24. Step 3 - Matching
• Match raw records R to structured records S
• Apply 4 heuristic tests:
1) At least one author in R must match an author in S
2) S.year must appear in R
3) If S.pages exists, R must contain it
4) S.title must be 'approximately contained' in R (approximate string matching via Levenshtein edit distance)
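
A sketch of the four tests. The Levenshtein routine is the standard dynamic program; the record field names and the 10% edit threshold for 'approximately contained' are assumptions, since the slide specifies neither.

```python
def levenshtein(s, t):
    """Classic edit distance, two-row dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete
                           cur[j - 1] + 1,             # insert
                           prev[j - 1] + (cs != ct)))  # substitute
        prev = cur
    return prev[-1]

def approx_contained(needle, haystack, max_ratio=0.1):
    """Best edit distance of needle against any window of haystack."""
    n = len(needle)
    best = min(levenshtein(needle, haystack[i:i + n])
               for i in range(max(1, len(haystack) - n + 1)))
    return best <= max_ratio * n

def match(raw, s):
    """The four heuristic tests; s's field names are assumed."""
    return (any(author in raw for author in s["authors"])   # test 1
            and s["year"] in raw                            # test 2
            and (not s.get("pages") or s["pages"] in raw)   # test 3
            and approx_contained(s["title"], raw))          # test 4
```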

25. Step 4 - Parser Training
• Train the HMM-based parser
• For each pair of R and S that match, annotate the tokens in the raw record with field names
• The annotated raw records are fed into the HMM parser in order to learn:
• State transition probabilities
• Symbol probabilities at each state
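
Because this training is supervised (the matched records supply the hidden field labels), both probability tables can be estimated by counting. A sketch assuming each annotated record arrives as a list of (token, field-name) pairs:

```python
from collections import Counter

def train(annotated_records):
    """Maximum-likelihood HMM parameters from labeled token sequences."""
    trans, emit, starts = Counter(), Counter(), Counter()
    for record in annotated_records:
        states = [state for _, state in record]
        starts[states[0]] += 1
        trans.update(zip(states, states[1:]))          # state -> state counts
        emit.update((state, tok) for tok, state in record)

    def normalize(counts, group):
        totals = Counter()
        for key, n in counts.items():
            totals[group(key)] += n
        return {key: n / totals[group(key)] for key, n in counts.items()}

    return (normalize(trans, lambda k: k[0]),   # transition probabilities
            normalize(emit, lambda k: k[0]),    # per-state symbol probabilities
            {s: n / sum(starts.values()) for s, n in starts.items()})

recs = [[("J.", "author"), ("Smith", "author"), (",", "author-delimiter"),
         ("An", "title"), ("Example", "title"), ("2004", "year")]]
a, b, pi = train(recs)
```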

26. Parser Training, continued
• A key consideration is the HMM structure for navigating record fields (fields, delimiters)
• Special states: start, end
• Normal states: author, title, year, etc.
• The best structure found: multiple delimiter and tag states, one for each normal state
• Example: author-delimiter, author-tag

27. Sample HMM (Method 3)
[Figure; source: http://www.cs.duke.edu/~geng/autobib/web/hmm.jpg]

28. Step 5 - Conversion
• Parse unmatched raw records into structured records using the HMM parser
• Matched raw records can be converted directly, without parsing, because they were already annotated in the matching step

29. Step 6 - Merging
• Merge the new structured records into the database
• The initial seed database has now grown
• The new records will be used for improved matching on the next run

30. Evaluation
• Success rate = (# of tokens labeled by the HMM) / (# of tokens labeled by a person)
• DBLP (Computer Science Bibliography): 98.9%
• CSWD (CompuScience WWW-Database): 93.4%

31. HMM Advantages / Disadvantages
• Advantages:
• Effective
• Can handle variations in record structure (optional fields, varying field ordering)
• Disadvantages:
• Requires training on annotated data, so it is not completely automatic and may require manual markup
• The size of the training data set may be an issue

32. Other Methods
• Wrappers: specifications of the areas of interest on a Web page
• Either hand-crafted or learned through wrapper induction
• Wrapper induction requires manual training
• Not always accommodating to changing page structure
• Syntax-based; no semantic labeling

33. Application to Other Domains
• E-commerce: comparison shopping sites
• Extract product/pricing information from many sites
• Convert the information into a structured format and store it
• Provide an interface to look up a product and then display pricing information gathered from many sites
• Saves users time: rather than navigating to and searching many sites, users can consult a single site

34. References
• Concept: Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.
• Application: Geng, J., & Yang, J. (2004). Automatic extraction of bibliographic information on the Web. Proceedings of the 8th International Database Engineering and Applications Symposium (IDEAS '04), 193-204.
