
Information Extraction



  1. Chris Brew The Ohio State University Information Extraction

  2. Algorithms have weaknesses • Decision trees embody a strong bias, searching only for models which can be expressed in the tree form. • This bias means that they need less training data than more general models to choose the one that they prefer. • But if the data does not fit comfortably into the class of models which DTs allow, their predictions become unstable and difficult to make sensible use of.

  3. No Free Lunch • In order to model data well, we have to choose between strong assumptions and flexibility • If we are unbiased and flexible, we're bound to need a lot of data to get the right model • If the assumptions are largely valid, we get to learn effectively from not too much data. So, in a certain sense, bias is a good thing to have. • But it is always possible to construct data that will make a biased algorithm look bad. So one can never unconditionally say “This is the best learning algorithm”

  4. Using classifiers in information extraction • Focus on the Merge part of information extraction • Create a corpus, including accurate annotations of co-reference • For each pair of phrases, generate an instance labelled coref or not-coref (see the sketch below) • At training time, learn a classifier to predict coref or not • At test time, create an instance every time a decision needs to be made about a pair of phrases.
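As a concrete illustration of the pairwise setup above, here is a minimal Python sketch of instance generation for the Merge step. The phrase representation, the gold chain ids, and the two features shown are illustrative assumptions, not taken from either system discussed below.

from itertools import combinations

def make_instances(phrases):
    """phrases: list of (text, gold_chain_id) pairs in document order."""
    instances = []
    for i, j in combinations(range(len(phrases)), 2):
        (text_a, chain_a), (text_b, chain_b) = phrases[i], phrases[j]
        features = {
            "distance": j - i,   # toy positional feature
            "same_head": text_a.split()[-1].lower() == text_b.split()[-1].lower(),
        }
        label = "coref" if chain_a == chain_b else "not-coref"
        instances.append((features, label))
    return instances

print(make_instances([("IBM", 1), ("the company", 1), ("Microsoft", 2)]))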

  5. Details matter • Two representative systems • Aone and Bennett's MLR • McCarthy and Lehnert's RESOLVE • Both work with MUC-5 Joint Venture data, Aone and Bennett on the Japanese corpus, McCarthy and Lehnert on the English one • Both formally evaluated, but in different ways

  6. MLR • Uses output of Sentence Analysis component from real IE system. Considers anaphors involving entities that the system thought were organizations. • Evaluated on 250 texts for which correct answers were created.

  7. Features used by MLR • Lexical features (e.g. whether one phrase contains a character sequence of the other) • The grammatical roles of the phrases • Semantic class information • Information about position • Total of 66 features
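The first bullet above (one phrase containing a character sequence of the other) is easy to make concrete; a hypothetical version of that single feature might look like this:

def char_subsequence_feature(phrase_a, phrase_b):
    """True if one phrase contains the other as a character sequence."""
    a, b = phrase_a.lower(), phrase_b.lower()
    return a in b or b in a

# useful for company names and their short forms
print(char_subsequence_feature("Sony Corporation", "Sony"))        # True
print(char_subsequence_feature("Sony Corporation", "Honeywell"))   # False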

  8. RESOLVE • Used hand-created noise-free evaluation corpus (in principle the features used are available from the IE system, but pattern of expected error not known) • Features were domain-specific • Evaluated on JV-ENTITIES, things that were organizations and that had been identified as a party in a joint venture by the extraction component.

  9. Features used by RESOLVE • Whether each phrase contains a proper name from a known list (2 features) • Which of the phrases are names of joint ventures (3 features) • Whether the two phrases are related by an entry in a known list of aliases (1 feature) • Whether phrases have same base noun phrase (1 feature) • Whether phrases are in same sentence (1 feature)

  10. Performance • Baseline: assume non-coreferential, ~74% accuracy on the RESOLVE test set • RESOLVE: Precision ~80%, Recall ~87% on noise-free data • MLR: Precision ~67%, Recall ~80%

  11. Are these comparable? • Data sets are not identical. • Feature repertoires are not identical. • Exact choice of which pairs the data are evaluated on is not identical. • Inter-annotator agreement is not that high: definite descriptions and bare nominals cause the most difficulty

  12. More evaluation • Aone and Bennett (working on Japanese) find that the character subsequence feature is all that is needed when the phrases are proper names of companies • Ideally one should evaluate these for their effects on the performance of the whole IE task, not just for the sub-task of getting the co-reference decision right. • But if you do that, it becomes even harder to compare work across systems, because there is so much going on.

  13. Advantages of classifiers • Decision trees and other learning methods may give insight into the importance of features. Decision trees are especially easy to interpret. • Once the instance set has been created, running multiple variations with different feature sets is easy.

  14. Disadvantages of classifiers • Because the classifiers answer each pair independently, some patterns of answers make no sense (they could say A = B and B = C, but A ≠ C); see the sketch below for one common repair. • Unlike language models, classifiers don't give a global measure of the quality of the whole interpretation.
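One common way to restore consistency, not described in the slides themselves, is to take the transitive closure of the positive pairwise decisions; a minimal union-find sketch:

def build_chains(n_phrases, coref_pairs):
    """Merge phrases into chains given pairwise 'coref' decisions."""
    parent = list(range(n_phrases))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for i, j in coref_pairs:                # pairs the classifier labelled coref
        parent[find(i)] = find(j)

    chains = {}
    for i in range(n_phrases):
        chains.setdefault(find(i), []).append(i)
    return list(chains.values())

# A=B and B=C were predicted but A=C was not; the closure merges all three anyway.
print(build_chains(3, [(0, 1), (1, 2)]))    # [[0, 1, 2]]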

  15. Classifiers for global structure • It is possible, and sometimes useful, to string together classifiers in ways which do better justice to global structure. • Documents have non-trivial structure. • One of the challenges is to find ways of incorporating knowledge about such structure, while still leaving the system able to learn.

  16. Background: HMMs • Standard HMMs have two stochastic components • First is a stochastic finite state network, given by P(s_n | s_{n-1}). In a part-of-speech tagger, the state labels would be parts of speech. • Second is a stochastic emitter at each state, given by P(observation | s_n). In a part-of-speech tagger the observation would (often) be a word. • HMM training is a scheme that adjusts the parameters from data.
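A toy sketch of those two components and of the joint probability they define; the tag set and the probability values here are invented purely for illustration.

# transition: P(s_n | s_{n-1});  emission: P(word | s_n)
transition = {("START", "DET"): 0.5, ("DET", "NOUN"): 0.7, ("DET", "VERB"): 0.1}
emission = {("DET", "the"): 0.6, ("NOUN", "dog"): 0.01}

def joint_prob(states, words):
    """P(states, words) = product over n of P(s_n | s_{n-1}) * P(w_n | s_n)."""
    p, prev = 1.0, "START"
    for s, w in zip(states, words):
        p *= transition.get((prev, s), 0.0) * emission.get((s, w), 0.0)
        prev = s
    return p

print(joint_prob(["DET", "NOUN"], ["the", "dog"]))   # 0.5 * 0.6 * 0.7 * 0.01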

  17. When can you use HMMs? • If the output alphabet is of manageable size and there are relatively few states, you can hope to see enough training data to reliably estimate the emission matrices. • But if the outputs are very complex, then there will be many numbers to estimate.

  18. Discriminative and generative • HMMs are generative probabilistic models for the state sequences and their outputs. • In application, we can see the sequence of observations, and only need to infer the best sequence of states. • Modelling the competition between the observations seems like wasted effort. • The plus is that we have good and efficient training algorithms.

  19. Twisting classifiers • Classifiers are not generative, but discriminative. • They model P(class|observations), not P(observations|class) • They represent observations as sets of attributes • No effort is expended on modelling the observations themselves. • But they don't capture sequence.

  20. FAQ classification • McCallum, Freitag, Pereira (citeseer.nj.nec.com/mccallum00maximum.html) • Task: given Usenet FAQ documents, identify the question answer structure. • Intuition: formatting will give cues to this • Details: label FAQ lines with (head|question|answer|tail) • Input: one labelled document • Test data: n-1 documents from the same FAQ (which presumably has the same unknown formatting conventions).

  21. Relevant features • begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-Subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-query, contains-question-word, ends-with-query, first-alpha-capitalized, indented, indented-1-4, indented-5-10, more-than-one-third-space, punctuation-only, previous-line-is-blank, previous-begins-with-ordinal, shorter-than-thirty-chars • Idea is to provide features which will allow modelling of the way in which different FAQs use formatting, not to hand-build the models for each set of conventions that exist
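A sketch of how a few of these line features might be computed; the exact definitions used by McCallum, Freitag and Pereira may well differ, so treat this purely as an illustration.

import re

def line_features(line, prev_line=""):
    return {
        "begins-with-number": bool(re.match(r"\s*\d", line)),
        "begins-with-question-word": bool(re.match(r"\s*(what|how|why|when|where|who)\b", line, re.I)),
        "blank": line.strip() == "",
        "contains-http": "http" in line,
        "ends-with-query": line.rstrip().endswith("?"),
        "indented": line.startswith((" ", "\t")),
        "shorter-than-thirty-chars": len(line.rstrip()) < 30,
        "previous-line-is-blank": prev_line.strip() == "",
    }

print(line_features("  How do I unsubscribe?"))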

  22. Competing approaches • Token HMM: Tokenize each line, treat each token as an output from the relevant state. • Feature HMM: Reduce each line to a set of features, treat each feature as an output of the relevant state. • MEMM: Learn a maximum entropy classifier for P(s_n | obs_n, s_{n-1}). This is a discriminative approach (the classifier-chaining idea referred to below as a TCMM), and it performs well on the FAQ task. In principle any classifier will do: they used maximum entropy because it deals well with overlapping features. A sketch of this style of model follows.
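A rough sketch of that third approach, with logistic regression standing in for maximum entropy (the two are closely related) and greedy left-to-right decoding standing in for a proper search; none of this is the authors' actual implementation.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def add_prev_state(feats, prev):
    """Condition on the previous state by adding it as an extra feature."""
    return {**feats, "prev=" + prev: 1}

def train(sequences):
    """sequences: list of [(feature_dict, label), ...], one list per document."""
    X, y = [], []
    for seq in sequences:
        prev = "START"
        for feats, label in seq:
            X.append(add_prev_state(feats, prev))
            y.append(label)
            prev = label
    vec = DictVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)
    return vec, clf

def decode(vec, clf, seq_feats):
    """Greedy decoding: pick the best label for each line given the previous one."""
    prev, out = "START", []
    for feats in seq_feats:
        label = clf.predict(vec.transform([add_prev_state(feats, prev)]))[0]
        out.append(label)
        prev = label
    return out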

  23. Similar tasks • NP Chunking: state labels are things like BEGIN-NP, IN-NP, END-NP, NOT-IN. (Dan Roth's group has explored this approach) Emissions are sets of features of context (perhaps also including parts-of-speech). • Sentence boundary detection. (see me) • Lexical analysis for Thai (Chotimongol and Black)

  24. Constraints • In both cases there are constraints on the sequence of labels (brackets must balance). One idea is to do as McCallum et al do, making models conditional on the previous state. • Another is to collect probabilities for each state at every position, then hunt for the best sequence that respects the constraints (a sketch follows). • Both approaches can work well; which is better depends on the task.
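A small sketch of the second idea: per-position scores for each label, plus a dynamic-programming search over only the transitions that a legality test allows. The label names and the legality test are made-up examples.

def best_legal_sequence(scores, labels, legal):
    """scores: one {label: prob} dict per position; legal(prev, cur) -> bool."""
    # Each entry maps a label to (best score so far, best path ending in that label).
    column = {lab: (scores[0].get(lab, 0.0), [lab]) for lab in labels}
    for pos in range(1, len(scores)):
        new_column = {}
        for cur in labels:
            candidates = [(p * scores[pos].get(cur, 0.0), path + [cur])
                          for prev, (p, path) in column.items() if legal(prev, cur)]
            if candidates:
                new_column[cur] = max(candidates)
        column = new_column
    return max(column.values())[1]

# toy chunking-style constraint: IN-NP may not follow NOT-IN
labels = ["BEGIN-NP", "IN-NP", "NOT-IN"]
legal = lambda p, c: not (c == "IN-NP" and p == "NOT-IN")
scores = [{"BEGIN-NP": 0.6, "NOT-IN": 0.4}, {"IN-NP": 0.7, "NOT-IN": 0.3}]
print(best_legal_sequence(scores, labels, legal))   # ['BEGIN-NP', 'IN-NP']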

  25. Label bias • In an HMM, different states compete to account for the observations that we see. • In a TCMM, we just use observations to choose between available next states. • If the next state is known or overwhelmingly probable, the observation has no possibility of influencing the outcome. • If there are two competing situations like this, one with a likely observation and one with a rare one, no useful competition is possible.

  26. Conditional Random Fields • Lafferty, McCallum, Pereira: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data (on CiteSeer) • Demonstrates that the method can handle synthetic data with label bias, and that MEMM does not. • Shows a small performance improvement over HMMs and MEMMs on POS tagging data. • Method is harder to understand than MEMMs, and more costly than HMMs, but more flexible.

  27. Summary • Using classifiers for text tasks • Using classifiers for text tasks where there is a significant sequential component • More experience with learning from data • Next time: some text tasks that don't use classifiers, but special techniques. • Reference clustering (McCallum, Nigam, Ungar) • Inductive learning of extraction rules (Thompson, Califf, Mooney)

  28. Non-classifier learning Mary-Elaine Califf and Ray Mooney (2002) Bottom-Up Relational Learning of Pattern Matching Rules for Information Extraction (submitted to JMLR, available from Mooney's web page) Also, if time permits, something on reference extraction and clustering, from the CMU group around McCallum.

  29. Sample job posting
Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID: 56nigp$mrs@bilbo.reference.com

SOFTWARE PROGRAMMER
Position available for Software Programmer experienced in generating software for PC-Based Voice Mail systems. Experienced in C Programming. Must be familiar with communicating with and controlling voice cards; preferable Dialogic, however, experience with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a Senior level person who can come on board and pick up code with very little training. Present Operating System is DOS. May go to OS-2 or UNIX in future.
Please reply to:
Kim Anderson
AdNET
(901) 458-2888 fax
kimander@memphisonline.com

  30. The filled template
computer_science_job
id: 56nigp$mrs@bilbo.reference.com
title: SOFTWARE PROGRAMMER
salary: Not specified
company: Not specified
recruiter: Not specified
state: TN
city: Not specified
country: US
language: C
platform: PC \ DOS \ OS-2 \ UNIX (multiple fillers)
application area: Voice Mail
req_years_experience: 2
desired_years_experience: 5
post_date: 17 Nov 1996

  31. How shallow(!|?) • Feature-based techniques require design of a reasonable set of features. • Might still exclude useful contextual information. • Not obvious what features one wants in the job posting domain. • Califf and Mooney's RAPIER learns a rich relational representation from structured examples. Related to Inductive Logic Programming.

  32. RAPIER's rules • Strongly resemble the patterns used by ELIZA to extract slot fillers.
Pre-Filler Pattern:        Filler Pattern:         Post-Filler Pattern:
1) syntactic: {nn,nnp}     1) word: undisclosed    1) semantic: price
2) list: length 2             syntactic: jj
• This one is highly specific to contexts like • “sold to the bank for an undisclosed amount” • “paid Honeywell an undisclosed price”

  33. Using rules • Given a set of rules, indexed by template and slot, RAPIER applies all of them. If the ruleset proposes multiple fillers for a slot, duplicates are eliminated and all remaining fillers are returned (a sketch of this step follows). • There are lots of possible rules, some highly specific, some highly general. How should the algorithm structure the search for rules at the right level of abstraction? • Each slot is learnt separately; no attempt is made to build generalizations across different slots.
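A minimal sketch of that application step. The rule representation here is just a callable returning candidate fillers, a stand-in for RAPIER's actual pattern language.

def extract_slot(rules, tokens):
    """Apply every rule for a slot and return all distinct proposed fillers."""
    fillers = []
    for rule in rules:
        fillers.extend(rule(tokens))
    return sorted(set(fillers))        # duplicates eliminated, everything else kept

# toy rule: propose any token tagged as a programming language
language_rule = lambda tokens: [w for w, tag in tokens if tag == "LANG"]
print(extract_slot([language_rule], [("C", "LANG"), ("programming", "NN"), ("C", "LANG")]))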

  34. Knowledge sources • Words • Parts of speech • Semantic classes from WordNet • (Could probably get such resources for Spanish/Italian/German/Japanese, but not Pushtun/Urdu/Tamil)

  35. Rule learning strategies • Rule learning systems use two basic strategies • Compression – begin with a set of highly specific rules (usually one per example) that cover all examples. Look for more general rules that can replace several specific rules. Learning ends when no further compression is possible. • Covering – start with a set containing all the examples. Find a rule to cover some of the examples. Repeat until all examples are covered.

  36. Rule generation • Whether doing covering or compression, need choices about how to generate rule candidates. • Could go top-down, starting with very general rules and specializing them as needed. • Could go bottom-up, using sets of existing rules to gradually generalize away from the data. • Could do a combination of top-down and bottom-up.

  37. Trade-offs • Covering systems are efficient, because they can ignore examples already covered. • But compression systems do a more thorough search at each iteration, so may find better rules in the end. • Bottom-up rule generation may not generalize well enough to cover unseen data. • Top down generation might inefficiently thrash around generating irrelevant rules.

  38. RAPIER's choices • Specific-to-general search (bottom-up) • Because: wanted to start with words actually represented in the examples, not search at random. • Because: specific rules tend to be high-precision at the possible expense of recall, and the designers favoured precision over recall. • Compression-based outer loop • Because: its efforts to generalize counteract the tendency to over-specificity.

  39. The RAPIER algorithm
for each slot S in the template being learned
    SlotRules = most specific rules for S from example documents
    while compression has failed fewer than CompressLim times
        initialize RuleList to be an empty priority queue of length k
        randomly select M pairs of rules from SlotRules
        find the set L of generalizations of the fillers of each rule pair
        for each pattern P in L
            create a rule NewRule with filler P and empty pre- and post-fillers
            evaluate NewRule and add NewRule to RuleList
        let n = 0
        loop
            increment n
            for each rule CurRule in RuleList
                NewRuleList = SpecializePreFiller(CurRule, n)
                evaluate each rule in NewRuleList and add it to RuleList
            for each rule CurRule in RuleList
                NewRuleList = SpecializePostFiller(CurRule, n)
                evaluate each rule in NewRuleList and add it to RuleList
        until best rule in RuleList produces only valid fillers,
              or the value of the best rule in RuleList has failed to improve
              over the last LimNoImprovements iterations
        if best rule in RuleList covers no more than an allowable percentage of spurious fillers
            add it to SlotRules and remove empirically subsumed rules

  40. Rule evaluation • Rules are evaluated by their ability to separate positive from negative examples, and by their size (the smaller the better); an illustrative scoring function follows. • The algorithm allows rules that produce some spurious slot fillers, primarily because it doesn't want to be too sensitive to human error in the annotation.
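An illustrative scoring function in the spirit of that description. This is not RAPIER's actual metric; the smoothing and the size penalty weight are invented here.

import math

def rule_value(pos, neg, rule_size, size_weight=0.01):
    """Higher is better: reward valid fillers, penalize spurious ones and big rules."""
    # smoothed log-odds that a proposed filler is valid, minus a size penalty
    return math.log((pos + 1) / (pos + neg + 2)) - size_weight * rule_size

print(rule_value(pos=10, neg=1, rule_size=5))   # a fairly precise, smallish rule
print(rule_value(pos=10, neg=10, rule_size=5))  # same size, much worse separation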

  41. Initial rulebase • Filler: word and part-of-speech. No semantic constraint (these are only introduced via generalization). Must be a word actually extracted for one of the templates. • Pre-filler: words and parts-of-speech going right back to beginning of document. • Post-filler: words and parts of speech going on to end of document.

  42. Filler Generalisation
Elements to be generalised:
  Element A: word: man     syntactic: nnp   semantic:
  Element B: word: woman   syntactic: nn    semantic:
Resulting Generalisations:
  1) word:                 syntactic: {nn, nnp}   semantic: person
  2) word: {man, woman}    syntactic: {nn, nnp}   semantic: person
  3) word:                 syntactic:             semantic: person
  4) word: {man, woman}    syntactic:             semantic: person

  43. How semantic constraints arise • If both items being generalized have semantic constraints, search for a common superclass • If both items being generalized have different words, see whether both belong in same synset • Don't guess semantic constraints for single words, since there are not enough clues to make this useful.
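A sketch of finding a common semantic superclass with WordNet, roughly the first operation above. It uses NLTK's WordNet interface rather than whatever interface RAPIER used, and simply takes the first noun sense of each word.

from nltk.corpus import wordnet as wn

def common_superclass(word_a, word_b):
    """Return the name of a lowest common hypernym of the two words, if any."""
    synsets_a = wn.synsets(word_a, pos=wn.NOUN)
    synsets_b = wn.synsets(word_b, pos=wn.NOUN)
    if not synsets_a or not synsets_b:
        return None
    common = synsets_a[0].lowest_common_hypernyms(synsets_b[0])
    return common[0].name() if common else None

print(common_superclass("man", "woman"))   # e.g. 'adult.n.01'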

  44. Generalizing contexts • If we know which elements of the pre/post fillers go with which, we can generalize in the same way that we did with fillers.

  45. Patterns of equal length

  46. Patterns of unequal length • Risk of combinatorial explosion • Assume every element of shorter must align with something in longer. Still many possibilities. • Align pairs of elements of length two or more assuming they must be correct • Several special cases

  47. Example of unequal length
Patterns to be Generalized:
  Pattern A:                 Pattern B:
  1) word: bank              1) list: length 3
     syntactic: nn              word:
  2) word: vault                 syntactic: nnp
     syntactic: nn
Resulting Generalizations:
  1) list: length 3          1) list: length 3
     word:                      word:
     syntactic: {nn, nnp}       syntactic:

  48. Specialization • We also need a mechanism for getting the rules learnt down to manageable length • Basic strategy is to calculate all correspondences in the pre/post parts of the parent rules, then work outwards, generating rules that use successively more and more context. • In cases like “6 years experience required” and “4 years experience is required”, we want to be able to work outwards different amounts in the two original rules.

  49. Example
Rule A:
  Pre-filler Pattern:      Filler Pattern:       Post-filler Pattern:
  1) word: located         1) word: atlanta      1) word: ,
     tag: vbn                 tag: nnp              tag: ,
  2) word: in                                    2) word: georgia
     tag: in                                        tag: nnp
                                                 3) word: .
                                                    tag: .
Rule B:
  Pre-filler Pattern:      Filler Pattern:       Post-filler Pattern:
  1) word: offices         1) word: kansas       1) word: ,
     tag: nns                 tag: nnp              tag: ,
  2) word: in              2) word: city         2) word: missouri
     tag: in                  tag: nnp              tag: nnp
                                                 3) word: .
                                                    tag: .

  50. Output of generalizations
  Pre-filler Pattern:      Filler Pattern:            Post-filler Pattern:
  1) word: in              1) list: max length: 2     1) word: ,
     tag: in                  tag: nnp                   tag: ,
                                                      2) tag: nnp
                                                         semantic: state
