This presentation by Tim Chartrand discusses a paper by Mary Elaine Califf and Raymond J. Mooney on Information Extraction (IE) using RAPIER (Robust Automated Production of Information Extraction Rules). The RAPIER method automates rule learning from a corpus of documents and filled templates, producing rules without prior parsing or heavy syntactic processing. It specializes in extracting relevant information from natural language text, offering comparable performance to existing systems while simplifying the rule creation process. The implications for various domains and results from job postings are also explored.
Relational Learning of Pattern-Match Rules for Information Extraction
Presentation by Tim Chartrand of a paper by Mary Elaine Califf and Raymond J. Mooney
Introduction
• Information Extraction (IE) is the task of locating specific pieces of information in NL text
• IE is an important subpart of text understanding
• IE systems are difficult and time-consuming to build, and they don't port well to different domains
• Researchers are combining learning methods with NLP methods to automate IE
Overview of RAPIER
• RAPIER – Robust Automated Production of Information Extraction Rules
• Learns IE rules automatically
• Uses a corpus of documents paired with filled templates (see the sketch below)
• Resulting rules do not require prior parsing or subsequent processing
• Uses limited syntactic information from a POS tagger
• Induced patterns incorporate semantic classes
• Rules characterize slot fillers and their context
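A minimal sketch of the kind of supervision this implies: raw documents paired with filled templates that map slot names to the strings to be extracted. This is illustrative only; the slot names and example posting below are assumptions, not the paper's actual data format.

# Hypothetical training pair: a job-posting document plus its filled template.
training_corpus = [
    {
        "document": (
            "Software engineer position. Located in Atlanta, Georgia. "
            "Salary up to $65K. Requires 3 years of Java experience."
        ),
        "template": {          # slot name -> list of filler strings
            "city": ["Atlanta"],
            "state": ["Georgia"],
            "salary": ["$65K"],
            "language": ["Java"],
        },
    },
    # ... one entry per annotated document
]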
RAPIER Rules
• Consist of three parts (see the sketch after this list):
  • Pre-filler pattern – matches text immediately preceding the extracted information
  • Filler pattern – matches the exact text to be extracted
  • Post-filler pattern – matches text immediately following the extracted information
• Each pattern is a sequence of pattern items or pattern lists
  • A pattern item specifies constraints for exactly one word or symbol
  • A pattern list specifies constraints for 0..n words or symbols
• Constraints include:
  • A list of words, one of which must match the item
  • A POS tag
  • A semantic class
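The representation can be pictured in code. The following Python sketch is my own rendering of the rule structure described above; the class and field names (Rule, PatternElement, sem_class, max_length) are assumptions, and the token-matching logic is much simpler than RAPIER's actual pattern matcher.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PatternElement:
    """A pattern item (max_length == 1) or a pattern list (matches 0..max_length tokens)."""
    words: Optional[List[str]] = None      # if given, the token's word must be one of these
    pos_tags: Optional[List[str]] = None   # if given, the token's POS tag must be one of these
    sem_class: Optional[str] = None        # if given, the token must belong to this semantic class
    max_length: int = 1                    # 1 = pattern item; >1 = pattern list

    def matches_token(self, word: str, pos: str, sem: Optional[str]) -> bool:
        if self.words is not None and word.lower() not in self.words:
            return False
        if self.pos_tags is not None and pos not in self.pos_tags:
            return False
        if self.sem_class is not None and sem != self.sem_class:
            return False
        return True

@dataclass
class Rule:
    slot: str
    pre_filler: List[PatternElement] = field(default_factory=list)
    filler: List[PatternElement] = field(default_factory=list)
    post_filler: List[PatternElement] = field(default_factory=list)

# Example: a rule extracting a "state" filler that is a single proper noun in a
# hypothetical "us_state" semantic class, preceded by a comma and followed by a period.
state_rule = Rule(
    slot="state",
    pre_filler=[PatternElement(words=[","])],
    filler=[PatternElement(pos_tags=["NNP"], sem_class="us_state")],
    post_filler=[PatternElement(words=["."])],
)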
Learning Algorithm

Example phrases: "located in Atlanta, Georgia." and "offices in Kansas City, Missouri."

For each slot S in the template being learned
    SlotRules = most specific rules from the documents for S
    while compression has failed fewer than lim times
        randomly select r pairs of rules from SlotRules
        find the set L of generalizations of the fillers of the rule pairs
        create rules from L, evaluate, and initialize RuleList
        let n = 0
        while the best rule in RuleList produces spurious fillers
              and the weighted information value of the best rule is improving
            increment n
            specialize each rule in RuleList with generalizations of the last n
                items of the pre-filler patterns of the rule pair and add the
                specializations to RuleList
            specialize each rule in RuleList with generalizations of the first n
                items of the post-filler patterns of the rule pair and add the
                specializations to RuleList
        if the best rule in RuleList produces only valid fillers
            add it to SlotRules
            remove empirically subsumed rules
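The core of the loop above is generalizing pairs of rules, starting from their fillers. The sketch below (reusing the PatternElement class from the rule sketch earlier) shows one simplified way two pattern elements could be merged; it is an assumption-laden illustration, since RAPIER actually produces several alternative generalizations (for example, both the union of two word lists and a version with the word constraint dropped), not a single one.

def generalize_elements(a: PatternElement, b: PatternElement) -> PatternElement:
    """Merge two elements into a single element that covers both (simplified)."""
    def merge(x, y):
        # Union of the allowed values; None (unconstrained) absorbs everything.
        if x is None or y is None:
            return None
        return sorted(set(x) | set(y))

    return PatternElement(
        words=merge(a.words, b.words),
        pos_tags=merge(a.pos_tags, b.pos_tags),
        sem_class=a.sem_class if a.sem_class == b.sem_class else None,
        max_length=max(a.max_length, b.max_length),
    )

# Generalizing two single-word fillers such as "Atlanta" and "Georgia" (both NNP)
# would give an element allowing either word, still constrained to proper nouns.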
Experimental Results
• The task: extract information from computer-related job postings
• 17 slots used, including employer, salary, etc.
• Results do not employ semantic categories
• 100-document dataset with filled templates, evaluated with 10-fold cross-validation
• Measured precision, recall, and F-measure (see the sketch below)
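For reference, a minimal sketch of how these metrics can be computed for a single slot, assuming extracted and reference fillers are compared as plain string sets; the paper's exact scoring protocol may differ in detail.

def precision_recall_f(predicted: set, gold: set):
    """Precision, recall, and F-measure for one slot's extracted fillers."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return precision, recall, f_measure

# e.g. two correct fillers out of two predicted, with one gold filler missed:
print(precision_recall_f({"$65K", "Java"}, {"$65K", "Java", "Atlanta"}))
# -> (1.0, 0.666..., 0.8)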
Experimental Results – continued
• Performance:
  • Is comparable to Crystal on a medical domain
  • Is better than AutoSlog and AutoSlog-TS on the MUC-4 terrorism task
  • Is hard to compare directly because the systems were tested on different domains
  • Is good, given that precision is the most important measure for this task
Related Work
• Resolve
  • Uses decision trees
  • Uses annotated coreference examples
• Crystal
  • Uses a clustering algorithm to build a dictionary of extraction patterns
  • Requires patterns identified by an expert
  • Requires prior syntax analysis to identify syntactic elements and their relationships
• AutoSlog
  • Specializes a set of general syntactic patterns
  • An expert must examine the patterns it produces
  • Requires prior syntax analysis
• LIEP
  • Requires prior syntax analysis
  • Makes no real use of semantic information
  • Has not been applied to complex domains
Related Work – BYU DEG
• RAPIER rules correspond closely to DEG data frames
  • Data frames are finer-grained, based on character patterns, whereas RAPIER rules are based on word patterns
• Pre-filler and post-filler patterns correspond closely to data frame contexts and keywords
• Semantic categories correspond closely with lexicons
• It is not mentioned how RAPIER handles multiple-record documents
• The RAPIER data structure is given by the template (slots) defined in the input data
• RAPIER is very similar in purpose to what Joe is trying to do – learn extraction rules based on a filled-in form
Conclusions
• Extracting desired pieces of information from NL text is important
• Manually constructing IE systems is too hard
• RAPIER uses relational learning to build a set of pattern-match rules given a database of texts and filled templates
• Learned patterns employ syntactic and semantic information to match slot fillers and their context
• Fairly accurate results can be obtained for a real-world problem with relatively small datasets
• RAPIER compares favorably with other IE learning systems