
Information Extraction


Presentation Transcript


  1. Information Extraction Yunyao Li EECS /SI 767 03/29/2006

  2. The Problem Given text such as a seminar announcement, fill slots such as Date, Time (Start - End), Location, and Speaker (a Person).

  3. What is "Information Extraction" As a task: Filling slots in a database from sub-segments of text. October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… Target slots: NAME, TITLE, ORGANIZATION. Courtesy of William W. Cohen

  4. What is "Information Extraction" As a task: Filling slots in a database from sub-segments of text. [The same example text as above.] IE produces:

  NAME              TITLE    ORGANIZATION
  Bill Gates        CEO      Microsoft
  Bill Veghte       VP       Microsoft
  Richard Stallman  founder  Free Software Foundation

  Courtesy of William W. Cohen

  5. What is "Information Extraction" Information Extraction = segmentation + classification + association + clustering. [The same example text as above, with the following segments identified:] Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation. The segmentation step is aka "named entity extraction". Courtesy of William W. Cohen

  6. What is "Information Extraction" Information Extraction = segmentation + classification + association + clustering. [The same example text and segments as above; this slide highlights the classification step, which assigns each segment its entity type.] Courtesy of William W. Cohen

  7. What is "Information Extraction" Information Extraction = segmentation + classification + association + clustering. [The same example text and segments as above; this slide highlights the association step, which links related entities to each other.] Courtesy of William W. Cohen

  8. What is "Information Extraction" Information Extraction = segmentation + classification + association + clustering. [The same example text and segments as above; clustering groups coreferent mentions (e.g., "Bill Gates" and "Gates", "Microsoft Corporation" and "Microsoft"), yielding the final table:]

  NAME              TITLE    ORGANIZATION
  Bill Gates        CEO      Microsoft
  Bill Veghte       VP       Microsoft
  Richard Stallman  founder  Free Software Foundation

  Courtesy of William W. Cohen

  9. Live Example: Seminar

  10. Landscape of IE Techniques [diagram: six families of approaches, each illustrated on the sentence "Abraham Lincoln was born in Kentucky."]
  • Lexicons: test whether a candidate string is a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming).
  • Classify pre-segmented candidates: a classifier assigns a class to each given candidate segment.
  • Sliding window: run a classifier over windows of the text, trying alternate window sizes.
  • Boundary models: classifiers predict BEGIN and END boundaries of entities.
  • Finite state machines: find the most likely state sequence. Our focus today!
  • Context-free grammars: find the most likely parse (e.g., NP, VP, and PP nodes over POS tags NNP, V, P).
  Courtesy of William W. Cohen

  11. Markov Property S1: rain, S2: cloud, S3: sun. The state of a system at time t+1, q_{t+1}, is conditionally independent of {q_{t-1}, q_{t-2}, …, q_1, q_0} given q_t. In other words, the current state determines the probability distribution of the next state. [diagram: state-transition graph over S1, S2, S3 with edge probabilities 1/2, 1/2, 1/3, 2/3, 1]
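
  In symbols (a one-line restatement of the property above, with q_t the state at time t):

      P(q_{t+1} \mid q_t, q_{t-1}, \ldots, q_0) \;=\; P(q_{t+1} \mid q_t)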

  12. Markov Property S1: rain, S2: cloud, S3: sun. State-transition probabilities A = {a_ij}, where a_ij = P(q_{t+1} = j | q_t = i). [diagram: the same transition graph, with edge probabilities 1/2, 1/2, 1/3, 2/3, 1] Q: given today is sunny (i.e., q1 = 3), what is the probability of "sun-cloud" with the model?
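
  As a sketch of how such a question is answered: the probability of following a state sequence is the product of the corresponding transition entries. The matrix values below are illustrative assumptions, not the values from the slide:

      import numpy as np

      # States: 0 = rain, 1 = cloud, 2 = sun
      # Illustrative transition matrix (rows sum to 1); values are assumed, not from the slide.
      A = np.array([
          [0.4, 0.3, 0.3],   # rain  -> rain, cloud, sun
          [0.2, 0.6, 0.2],   # cloud -> rain, cloud, sun
          [0.1, 0.3, 0.6],   # sun   -> rain, cloud, sun
      ])

      def sequence_prob(states, A):
          """Probability of following the given state sequence, given its first state."""
          p = 1.0
          for prev, nxt in zip(states, states[1:]):
              p *= A[prev, nxt]
          return p

      # Given today is sunny (q1 = sun), probability of "sun-cloud" tomorrow:
      print(sequence_prob([2, 1], A))   # = A[sun, cloud] = 0.3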

  13. Hidden Markov Model S1: rain, S2: cloud, S3: sun. The state sequence is now hidden; each state emits one of the observations O1-O5. [diagram: the same state-transition graph (probabilities 1/2, 1/2, 1/3, 2/3, 1, 9/10, 1/10) plus observation probabilities (4/5, 1/5, 7/10, 3/10, 1) linking states to observations O1-O5]

  14. IE with Hidden Markov Model Given a sequence of observations: "SI/EECS 767 is held weekly at SIN2." and a trained HMM with states such as "course name", "location name", and "background", find the most likely state sequence (Viterbi) for "SI/EECS 767 is held weekly at SIN2". Any words generated by the designated "course name" state are extracted as a course name: Course name: SI/EECS 767
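
  Since the slide appeals to Viterbi decoding, here is a minimal sketch in Python/numpy for a generic HMM; the variable names, toy dimensions, and example numbers are illustrative assumptions, not the seminar model from the slide:

      import numpy as np

      def viterbi(obs, pi, A, B):
          """Most likely state sequence for an observation sequence under an HMM.
          pi: initial state probabilities, shape (S,)
          A:  transition probabilities A[i, j] = P(s_j | s_i), shape (S, S)
          B:  emission probabilities B[i, o] = P(o | s_i), shape (S, O)
          """
          S, T = len(pi), len(obs)
          delta = np.zeros((T, S))            # best path score ending in each state
          back = np.zeros((T, S), dtype=int)  # backpointers
          delta[0] = pi * B[:, obs[0]]
          for t in range(1, T):
              scores = delta[t - 1][:, None] * A * B[:, obs[t]][None, :]
              back[t] = scores.argmax(axis=0)
              delta[t] = scores.max(axis=0)
          # Trace back the best path from the highest-scoring final state.
          path = [int(delta[-1].argmax())]
          for t in range(T - 1, 0, -1):
              path.append(int(back[t][path[-1]]))
          return path[::-1]

      # Toy usage with assumed numbers: 2 states, 3 observation symbols.
      pi = np.array([0.6, 0.4])
      A = np.array([[0.7, 0.3], [0.4, 0.6]])
      B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
      print(viterbi([0, 1, 2], pi, A, B))   # -> [0, 0, 1]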

  15. Named Entity Extraction [Bikel et al., 1998] Hidden states: Person, Org, (five other name classes), Other, plus start-of-sentence and end-of-sentence.

  16. Named Entity Extraction Transition probabilities: P(s_t | s_{t-1}, o_{t-1}). Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), used for (1) generating the first word of a name class, (2) generating the rest of the words in the name class, and (3) generating "+end+" in a name class.

  17. Training: Estimating Probabilities
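
  As a sketch, the probabilities above can be estimated by standard maximum-likelihood counts over labeled training data; writing c(·) for a count, the distributions from the previous slide become (this count form is an assumption consistent with standard HMM training, not a transcription of the slide):

      P(s_t \mid s_{t-1}, o_{t-1}) = \frac{c(s_t, s_{t-1}, o_{t-1})}{c(s_{t-1}, o_{t-1})}
      \qquad
      P(o_t \mid s_t, s_{t-1}) = \frac{c(o_t, s_t, s_{t-1})}{c(s_t, s_{t-1})}
      \qquad
      P(o_t \mid s_t, o_{t-1}) = \frac{c(o_t, s_t, o_{t-1})}{c(s_t, o_{t-1})}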

  18. Back-Off To handle "unknown words" and insufficient training data, back off to less-conditioned distributions: Transition probabilities: P(s_t | s_{t-1}), then P(s_t). Observation probabilities: P(o_t | s_t), then P(o_t).
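
  A common way to use such a back-off chain (a sketch; the interpolation weight here is an assumption, and in practice is estimated from training counts) is to mix each distribution with its less-conditioned back-off:

      \hat P(o_t \mid s_t) = \lambda\, P(o_t \mid s_t) + (1 - \lambda)\, P(o_t), \qquad 0 \le \lambda \le 1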

  19. HMM: Experimental Results Train on ~500k words of newswire text. Results: [results table not preserved]

  20. Learning HMM for IE [Seymore, 1999] Consider labeled, unlabeled, and distantly-labeled data

  21. Some Issues with HMM
  • Need to enumerate all possible observation sequences
  • Not practical to represent multiple interacting features or long-range dependencies of the observations
  • Very strict independence assumptions on the observations

  22. Maximum Entropy Markov Models [Lafferty, 2001] Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations. [diagram: states S_{t-1}, S_t, S_{t+1} over observations O_{t-1}, O_t, O_{t+1}; observation features include: identity of word, ends in "-ski", is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor, …; e.g., the word "Wisniewski" is part of a noun phrase and ends in "-ski"] Courtesy of William W. Cohen

  23. MEMM Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history. [diagram: same as above, with each state S_t also conditioned on S_{t-1}] Courtesy of William W. Cohen
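
  Concretely, each MEMM transition distribution takes the standard maximum-entropy (multinomial logistic) form; a sketch, with f_k the binary features shown in the figure and \lambda_k their learned weights:

      P(s_t \mid s_{t-1}, o_t) = \frac{1}{Z(o_t, s_{t-1})} \exp\Big( \sum_k \lambda_k f_k(o_t, s_t) \Big)

  where Z(o_t, s_{t-1}) normalizes over the possible next states s_t.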

  24. HMM vs. MEMM [diagrams: in the HMM, each state S_t generates the observation O_t (arrows from states to observations); in the MEMM, each state S_t is conditioned on the observation O_t and on S_{t-1} (arrows from observations to states)]
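
  The two diagrams correspond to different factorizations (a compact restatement of the contrast):

      \text{HMM (generative):} \quad P(S, O) = \prod_t P(s_t \mid s_{t-1})\, P(o_t \mid s_t)

      \text{MEMM (conditional):} \quad P(S \mid O) = \prod_t P(s_t \mid s_{t-1}, o_t)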

  25. Label Bias Problem with MEMM Consider an MEMM in which state 1 has a single outgoing transition, to state 2. Then:
  Pr(1→2 | ro) = Pr(2 | 1, ro) Pr(1 | ro) = Pr(2 | 1, o) Pr(1 | r)
  Pr(1→2 | ri) = Pr(2 | 1, ri) Pr(1 | ri) = Pr(2 | 1, i) Pr(1 | r)
  Since state 1 must go to state 2 whatever the observation, Pr(2 | 1, o) = Pr(2 | 1, i) = 1, so Pr(1→2 | ro) = Pr(1→2 | ri). But it should be Pr(1→2 | ro) < Pr(1→2 | ri)!

  26. Solving the Label Bias Problem
  • Change the state-transition structure of the model; but it is not always practical to change the set of states.
  • Start with a fully-connected model and let the training procedure figure out a good structure; but this precludes the use of prior structural knowledge, which is very valuable (e.g., in information extraction).

  27. Random Field Courtesy of Rongkun Shen

  28. Conditional Random Field Courtesy of Rongkun Shen

  29. Conditional Distribution If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the fundamental theorem of random fields is:

      p_\theta(y \mid x) \propto \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big)

  • x is a data sequence; y is a label sequence
  • v is a vertex from the vertex set V = set of label random variables
  • e is an edge from the edge set E over V
  • f_k and g_k are given and fixed; g_k is a Boolean vertex feature, f_k is a Boolean edge feature
  • k indexes the features
  • \theta = (\lambda_1, \lambda_2, \ldots; \mu_1, \mu_2, \ldots) are the parameters to be estimated
  • y|_e is the set of components of y defined by edge e; y|_v is the set of components of y defined by vertex v

  30. Conditional Distribution CRFs use the observation-dependent normalization Z(x) for the conditional distributions:

      p_\theta(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big)

  where Z(x) is a normalization over the data sequence x.

  31. HMM-like CRF Use a single feature for each state-state pair (y', y) and each state-observation pair (y, x) in the data to train the CRF:

      f_{y',y}(<u,v>, y|_{<u,v>}, x) = 1 if y_u = y' and y_v = y, and 0 otherwise
      g_{y,x}(v, y|_v, x) = 1 if y_v = y and x_v = x, and 0 otherwise

  [diagram: chain-structured graph Y_{t-1}, Y_t, Y_{t+1} over X_{t-1}, X_t, X_{t+1}]
  The parameters \lambda_{y',y} and \mu_{y,x} are then equivalent to the logarithms of the HMM transition probability Pr(y'|y) and observation probability Pr(x|y).

  32. HMM-like CRF For a chain structure, the conditional probability of a label sequence can be expressed in matrix form. For each position i in the observed sequence x, define the matrix

      M_i(y', y \mid x) = \exp\Big( \sum_k \lambda_k f_k(e_i, (y', y), x) + \sum_k \mu_k g_k(v_i, y, x) \Big)

  where e_i is the edge with labels (y_{i-1}, y_i) and v_i is the vertex with label y_i.

  33. HMM-like CRF The normalization function is the (start, stop) entry of the product of these matrices:

      Z(x) = \big( M_1(x)\, M_2(x) \cdots M_{n+1}(x) \big)_{start,\, stop}

  The conditional probability of a label sequence y is:

      p_\theta(y \mid x) = \frac{\prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x)}{Z(x)}

  where y_0 = start and y_{n+1} = stop.
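
  A minimal numpy sketch of this matrix computation; the label set size, sequence length, and matrix entries are illustrative assumptions (random positive values standing in for exponentiated feature scores), not a learned model:

      import numpy as np
      from functools import reduce
      from itertools import product

      # Toy chain CRF in matrix form: L real labels plus 'start' and 'stop',
      # so each M_i is (L+2) x (L+2). Entries stand in for exp(weighted feature
      # sums); the random values here are illustrative, not a learned model.
      L, n = 2, 3                             # label alphabet size, sequence length
      start, stop = L, L + 1
      rng = np.random.default_rng(0)
      M = [np.exp(rng.normal(size=(L + 2, L + 2))) for _ in range(n + 1)]
      for i in range(n):                      # positions 1..n carry real labels only
          M[i][:, [start, stop]] = 0.0

      # Normalization: the (start, stop) entry of the product of the matrices.
      Z = reduce(np.matmul, M)[start, stop]

      def p_of(y):
          """p(y|x) for a label sequence y of length n, with y_0 = start, y_{n+1} = stop."""
          path = [start, *y, stop]
          num = 1.0
          for i in range(n + 1):
              num *= M[i][path[i], path[i + 1]]
          return num / Z

      # Sanity check: the probabilities of all L**n label sequences sum to 1.
      print(sum(p_of(y) for y in product(range(L), repeat=n)))   # -> 1.0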

  34. Parameter Estimation The problem: determine the parameters \theta = (\lambda_1, \lambda_2, \ldots; \mu_1, \mu_2, \ldots) from training data with empirical distribution \tilde p(x, y). The goal: maximize the log-likelihood objective function

      O(\theta) = \sum_{x,y} \tilde p(x, y) \log p_\theta(y \mid x)
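
  Setting the gradient of O(\theta) to zero explains the form of the updates on the next slide: at the optimum, the empirical expectation of each feature equals its expectation under the model (a standard maxent derivation, not spelled out on the slide):

      \frac{\partial O(\theta)}{\partial \lambda_k} = \tilde E[f_k] - E_{p_\theta}[f_k] = 0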

  35. Parameter Estimation: Iterative Scaling Algorithms Update the weights as \lambda_k \leftarrow \lambda_k + \delta\lambda_k and \mu_k \leftarrow \mu_k + \delta\mu_k. The appropriately chosen \delta\lambda_k for edge feature f_k is the solution of

      \tilde E[f_k] = \sum_{x,y} \tilde p(x)\, p_\theta(y \mid x) \sum_{i} f_k(e_i, y|_{e_i}, x)\, e^{\delta\lambda_k T(x,y)}

  where T(x, y) (the total feature count) is a global property of (x, y), and efficiently computing the right-hand side of the equation is a problem.

  36. Algorithm S Define the slack feature

      s(x, y) = S - \sum_i \sum_k f_k(e_i, y|_{e_i}, x) - \sum_i \sum_k g_k(v_i, y|_{v_i}, x)

  where S is a constant chosen so that s(x, y) \ge 0 for all (x, y), making T(x, y) = S constant. For each index i = 0, …, n+1 define the forward vectors \alpha_i(x), with \alpha_0(y \mid x) = 1 if y = start and 0 otherwise, and recurrence \alpha_i(x) = \alpha_{i-1}(x)\, M_i(x); and the backward vectors \beta_i(x), with \beta_{n+1}(y \mid x) = 1 if y = stop and 0 otherwise, and recurrence \beta_i(x) = M_{i+1}(x)\, \beta_{i+1}(x).
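
  A numpy sketch of these recurrences, reusing the toy matrices from the previous sketch (all values illustrative assumptions); the dot product \alpha_i \cdot \beta_i reproduces Z(x) at every position, which is a useful correctness check:

      import numpy as np

      # Same toy setup as the previous sketch: L real labels plus start/stop,
      # n positions, illustrative (not learned) matrices M_1..M_{n+1}.
      L, n = 2, 3
      start, stop = L, L + 1
      rng = np.random.default_rng(0)
      M = [np.exp(rng.normal(size=(L + 2, L + 2))) for _ in range(n + 1)]
      for i in range(n):                      # positions 1..n carry real labels only
          M[i][:, [start, stop]] = 0.0

      # Forward vectors: alpha_0 is an indicator on 'start'; alpha_i = alpha_{i-1} M_i.
      alpha = [np.zeros(L + 2)]
      alpha[0][start] = 1.0
      for i in range(n + 1):
          alpha.append(alpha[-1] @ M[i])

      # Backward vectors: beta_{n+1} is an indicator on 'stop'; beta_i = M_{i+1} beta_{i+1}.
      beta = [np.zeros(L + 2) for _ in range(n + 2)]
      beta[n + 1][stop] = 1.0
      for i in range(n, -1, -1):
          beta[i] = M[i] @ beta[i + 1]

      # Check: alpha_i . beta_i equals Z(x) at every position i = 0..n+1.
      Z = alpha[n + 1][stop]
      print(Z, [float(alpha[i] @ beta[i]) for i in range(n + 2)])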

  37. Algorithm S With the forward and backward vectors, the model feature expectations can be computed as

      E[f_k] = \sum_x \tilde p(x) \sum_{i} \sum_{y', y} f_k(e_i, (y', y), x)\, \frac{\alpha_{i-1}(y' \mid x)\, M_i(y', y \mid x)\, \beta_i(y \mid x)}{Z(x)}

      E[g_k] = \sum_x \tilde p(x) \sum_{i} \sum_{y} g_k(v_i, y, x)\, \frac{\alpha_i(y \mid x)\, \beta_i(y \mid x)}{Z(x)}

  and, because T(x, y) = S is constant, the updates have the closed form

      \delta\lambda_k = \frac{1}{S} \log \frac{\tilde E[f_k]}{E[f_k]}, \qquad \delta\mu_k = \frac{1}{S} \log \frac{\tilde E[g_k]}{E[g_k]}

  38. Algorithm S The rate of convergence is governed by the step size, which is inversely proportional to the constant S; but S is generally quite large, resulting in slow convergence.

  39. Algorithm T Keeps track of partial T totals: it accumulates feature expectations into counters indexed by T(x). Uses forward-backward recurrences to compute the expectations a_{k,t} of feature f_k and b_{k,t} of feature g_k given that T(x) = t.

  40. Experiments
  • Modeling the label bias problem
  • 2000 training and 500 test samples generated by an HMM
  • CRF error is 4.6%
  • MEMM error is 42%
  CRF solves the label bias problem.

  41. Experiments
  • Modeling mixed-order sources
  • CRF converges in 500 iterations
  • MEMM converges in 100 iterations

  42. MEMM vs. HMM The HMM outperforms the MEMM

  43. CRF vs. MEMM CRF usually outperforms the MEMM

  44. CRF vs. HMM Each open square represents a data set with α < 1/2, and a solid square indicates a data set with α ≥ 1/2. When the data is mostly second order (α ≥ 1/2), the discriminatively trained CRF usually outperforms the HMM.

  45. POS Tagging Experiments
  • First-order HMM, MEMM, and CRF models
  • Data set: Penn Treebank
  • 50%-50% train-test split
  • Uses the MEMM parameter vector as a starting point for training the corresponding CRF, to accelerate convergence.

  46. Interactive IE using CRF An interactive parser updates the IE results according to the user's changes. Color coding is used to alert the user to ambiguity in the IE results.

  47. Some IE Tools Available
  • MALLET (UMass): statistical natural language processing, document classification, clustering, information extraction, and other machine learning applications to text.
  • Sample application: GeneTaggerCRF, a gene-entity tagger based on MALLET (MAchine Learning for LanguagE Toolkit). It uses conditional random fields to find genes in a text file.

  48. MinorThird
  • http://minorthird.sourceforge.net/
  • "a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text"
  • Stored documents can be annotated in independent files using TextLabels (denoting, say, part-of-speech and semantic information)

  49. GATE
  • http://gate.ac.uk/ie/annie.html
  • a leading toolkit for text mining
  • distributed with an information extraction component set called ANNIE (demo)
  • used in many research projects (a long list can be found on its website)
  • being integrated with IBM UIMA

  50. Sunita Sarawagi's CRF package
  • http://crf.sourceforge.net/
  • A Java implementation of conditional random fields for sequential labeling.
