
Information Extraction: Analyzing and Extracting Information from Text

This article discusses the process of information extraction from unrestricted text, focusing on pre-specified types of entities, relationships, and events. It covers the use of linguistic analysis, different levels of structure, named entities, relationships between entities, and various evaluation methods.


Presentation Transcript


  1. CSA4050: Advanced Topics in NLP. Information Extraction I: What is Information Extraction?

  2. Sources • R. Gaizauskas and Y. Wilks, Information Extraction: Beyond Document Retrieval. Technical Report CS-97-10, Department of Computer Science, University of Sheffield, 1997.

  3. What is Information Extraction? • IE: the analysis of unrestricted text in order to extract information about pre-specified types of entity, relationship and event. • Typically, the text is newspaper text or a newswire feed. • Typically, the pre-specified structure is a class-like object with different data fields.

  4. An Example of Information Extraction • Source text: "19 March – A bomb went off near a power tower in San Salvador, leaving a large part of the city without energy, but no casualties have been reported. According to unofficial sources, the bomb, allegedly detonated by urban guerrilla commandos, blew up a power tower in the northwestern part of San Salvador." • Template structure: IncidentType: bombing; Date: March 19; Location: San Salvador; Perpetrator: urban guerrilla commandos; Target: power tower
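Such a filled template is essentially a typed record with one field per slot. A minimal sketch of how it could be represented in Python (the IncidentTemplate class and its field names are illustrative, not taken from any particular MUC system):

from dataclasses import dataclass
from typing import Optional

@dataclass
class IncidentTemplate:
    # One field per slot of the template structure above
    incident_type: str
    date: str
    location: str
    perpetrator: Optional[str] = None
    target: Optional[str] = None

# Filled template for the San Salvador bombing example
filled = IncidentTemplate(
    incident_type="bombing",
    date="March 19",
    location="San Salvador",
    perpetrator="urban guerrilla commandos",
    target="power tower",
)
print(filled)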

  5. Different levels of structure can be envisaged. • Named Entities • Relationships • Events • Scenarios

  6. Examples of Named Entities • People • John Smith, J. Smith, Smith, John, Mr. Smith • Locations • EU, The Hague, SLT, Piazza Tuta • Organisations • IBM, The Mizzi Group, University of Malta • Numerical Quantities • Lm 10, forty per cent, 40%, $10
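Named entity recognition of this kind is available off the shelf in modern NLP toolkits. A minimal sketch using the spaCy library (assuming spaCy and its small English model en_core_web_sm are installed; label inventories differ between tools):

import spacy  # assumes spaCy and the en_core_web_sm model are installed

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mr. Smith of IBM met the Mizzi Group in The Hague and paid $10.")

# Each recognised entity carries a surface string and a coarse type label
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. PERSON, ORG, GPE, MONEY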

  7. Examples of Relationships between Named Entities • George Bush(1) is [President(2) of the United States(3)](4) • nation(3) • president(1,3) • coref(1,4)
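One simple way to store such facts is as predicates over mention indices; a small illustrative sketch (the variable names are chosen here for illustration only):

# Mentions indexed as in the example above
mentions = {
    1: "George Bush",
    2: "President",
    3: "the United States",
    4: "President of the United States",
}

# Relations as (predicate, argument indices) pairs
relations = [
    ("nation", (3,)),
    ("president", (1, 3)),
    ("coref", (1, 4)),
]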

  8. Examples of Events • Financial Events • Takeover bids • Changes of management • Socio/Political Events • Terrorist attacks • Traffic accidents • Geographical Events • Natural Disasters

  9. Some Differences between IE and IR • IE extracts relevant information from documents; Information Retrieval (IR) retrieves relevant documents from a collection. • IE has emerged from research into rule-based systems in CL; IR is mostly influenced by information theory, probability, and statistics. • IE is typically based on some kind of linguistic analysis of the source text; IR typically uses a bag-of-words model of the source text.

  10. Why Linguistic Analysis is Necessary • Active/passive distinction • BNC Holdings named Ms G. Torretta to succeed Mr. N. Andrews as new chairperson • Nicholas Andrews was named by Gina Torretta as chairperson of BNC Holdings • Use of different phrases to mean the same thing • Ms. Gina Torretta took the helm at BNC Holdings. She succeeds Nick Andrews • G. Torretta succeeds N. Andrews as chairperson at BNC Holdings • Establishing coreferences

  11. Brief History • 1960–80 N. Sager, Linguistic String Project: automatically induced information formats for radiology reports • 1970s R. Schank: scripts • 1982 G. DeJong, FRUMP: "sketchy scripts" used to process UPI newswire stories in domains such as earthquakes and labour strikes; systematic evaluation • 1983 J-P. Zarri: analysis of historical texts by translating text into a semantic metalanguage • 1986 ATRANS (S. Lytinen et al.): script-based system for analysis of money-transfer messages between banks • 1992 Carnegie Group, JASPER: skims company press releases to fill in templates concerning earnings and dividends

  12. Message Understanding Conferences • Conferences aimed at comparing the performance of a number of systems working on IE from naval messages. • Sponsored by DARPA and organised by the US Naval Command centre, San Diego. • Progressively more difficult tasks. • Progressively more refined evaluation measures.

  13. MUC Tasks • MUC-1: tactical naval operations reports on ship sightings and engagements. No task definition; no evaluation criteria. • MUC-3: newswire stories about terrorist attacks. Templates with 18 slots to be filled. Formal evaluation criteria supplied. • MUC-6: specific subtasks including named entity recognition, coreference identification, and scenario template extraction.

  14. IE Subtasks • Named Entity recognition (NE) • Finds and classifies names, places etc. • Coreference Resolution (CO) • Identifies identity relations between entities in texts. • Template Element construction (TE) • Adds descriptive information to NE results (using CO). • Template Relation construction (TR) • Finds relations between TE entities. • Scenario Template production (ST) • Fits TE and TR results into specified event scenarios.

  15. Evaluation: the IR Starting Point • [Figure: two overlapping sets, target and selected; their overlap gives the true positives, items selected but not in the target set are false positives, and target items not selected are false negatives]

  16. Evaluation Metrics • Starting points are those used for IR, namely recall and precision.

  17. IR Measures: Precision and Recall • Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved); P = tp / (tp + fp) • Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant); R = tp / (tp + fn)
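These two measures translate directly into code; a minimal sketch (the counts are invented for illustration):

def precision(tp: int, fp: int) -> float:
    # Fraction of selected items that are actually relevant
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    # Fraction of relevant items that were actually selected
    return tp / (tp + fn) if (tp + fn) else 0.0

# Example: 40 true positives, 10 false positives, 60 false negatives
print(precision(40, 10))  # 0.8
print(recall(40, 60))     # 0.4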

  18. F-Measure • Whatever method is chosen to establish P and R, there is a trade-off between them. • For this reason researchers often use a measure which combines the two. • F = 1 / (α/P + (1 − α)/R) is commonly used, where α is a factor which determines the weighting between P and R. • When α = 0.5 the formula reduces to the harmonic mean 2PR/(P + R). • Clearly F is weighted towards P as α approaches 1.
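Continuing the sketch above, the weighted F-measure can be computed directly from P and R:

def f_measure(p: float, r: float, alpha: float = 0.5) -> float:
    # F = 1 / (alpha/P + (1 - alpha)/R); alpha = 0.5 gives the harmonic mean 2PR/(P + R)
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

print(f_measure(0.8, 0.4))             # 0.533..., the harmonic mean of 0.8 and 0.4
print(f_measure(0.8, 0.4, alpha=0.9))  # 0.727..., weighted towards precision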

  19. Harmonic Mean • [Figure: harmonic mean of P and R compared with the arithmetic mean]
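A quick worked example of why the harmonic mean is the more demanding average: with P = 0.9 and R = 0.1, the arithmetic mean is (0.9 + 0.1)/2 = 0.5, whereas the harmonic mean is 2 × 0.9 × 0.1 / (0.9 + 0.1) = 0.18, so a system cannot mask very poor recall behind very high precision.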

  20. Evaluation Metrics for IE • For IE, these measures need to be related to the activity of slot-filling: • Slot fills can be correct, partially correct, incorrect, missing, or spurious. • These differences permit the introduction of finer-grained measures of correctness that include overgeneration, undergeneration, and substitution.
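A minimal sketch of slot-fill scoring along these lines, roughly following the MUC convention of giving half credit to partially correct fills (the exact categories and weightings varied between evaluations):

def slot_scores(correct: int, partial: int, incorrect: int, spurious: int, missing: int):
    # Fills the system produced vs. fills required by the answer key
    actual = correct + partial + incorrect + spurious
    possible = correct + partial + incorrect + missing
    credit = correct + 0.5 * partial        # partially correct fills earn half credit
    precision = credit / actual if actual else 0.0
    recall = credit / possible if possible else 0.0
    return precision, recall

# Example: 6 correct, 2 partial, 1 incorrect, 2 spurious, 4 missing slot fills
print(slot_scores(6, 2, 1, 2, 4))  # (0.636..., 0.538...)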

  21. Recall • Recall is a measure of how much relevant information a system has extracted from text. • It is the ratio of how much correct information is actually extracted against how much information there is to be extracted, i.e. (count of correct facts extracted) / (count of possible facts).

  22. Precision • Precision is a measure of how accurate a system is in extracting information. • It is the ratio of how much correct information is actually extracted against how much information is extracted, i.e. (count of correct facts extracted) / (count of facts extracted).

  23. Bare Bones Architecture (from Appelt and Israel 1999) • Tokenisation (word segmentation) • Morphological and lexical processing (POS tagging, word sense tagging) • Syntactic analysis (preparsing, parsing) • Discourse analysis (coreference)

  24. Generic IE System (Hobbs 1993) • Modules: text zoner, preprocessor, filter, preparser, semantic interpreter, fragment combiner, parser, coreference resolution, template generator, lexical disambiguator
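Both architectures are pipelines in which each stage enriches a shared document representation. A schematic sketch of that idea (the stage functions here are hypothetical placeholders, not the modules of any real system):

from typing import Callable, Dict, List

def tokenise(doc: Dict) -> Dict:
    doc["tokens"] = doc["text"].split()              # placeholder word segmentation
    return doc

def pos_tag(doc: Dict) -> Dict:
    doc["tags"] = ["NNP" if t[:1].isupper() else "NN" for t in doc["tokens"]]  # toy tagger
    return doc

def parse(doc: Dict) -> Dict:
    doc["fragments"] = [doc["tokens"]]               # stand-in for syntactic analysis
    return doc

def interpret(doc: Dict) -> Dict:
    doc["template"] = {}                             # stand-in for discourse analysis / template filling
    return doc

PIPELINE: List[Callable[[Dict], Dict]] = [tokenise, pos_tag, parse, interpret]

def run(text: str) -> Dict:
    doc = {"text": text}
    for stage in PIPELINE:                           # each stage adds a new layer of annotation
        doc = stage(doc)
    return doc

print(run("A bomb went off near a power tower in San Salvador."))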

  25. Large Scale IE: LaSIE • General-purpose IE research system geared towards MUC-6 tasks. • Pipelined system with three principal processing stages: • Lexical preprocessing • Parsing and semantic interpretation • Discourse interpretation

  26. LaSIE: Processing Stages • Lexical preprocessing: reads, tokenises, and tags the raw input text. • Parsing and semantic interpretation: chart parser; best-parse selection; construction of predicate-argument structure. • Discourse interpretation: adds information from the predicate-argument representation to a world model in the form of a hierarchically structured semantic net.

  27. LaSIE Parse Forest • It is rare that the analysis contains a unique, spanning parse. • Selection of the best parse is carried out by choosing the sequence of non-overlapping, semantically interpretable categories that covers the most words and consists of the fewest constituents.
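A minimal sketch of that selection criterion (not LaSIE's actual code; semantic interpretability of a fragment is not modelled here): choose, over word positions, the non-overlapping fragment sequence that maximises coverage and, on ties, uses the fewest constituents.

from typing import List, Tuple

Fragment = Tuple[int, int, str]  # (start word index, end word index exclusive, category)

def best_sequence(fragments: List[Fragment], n_words: int) -> List[Fragment]:
    # best[i] = (words covered, -constituents used, chosen fragments) over words [0, i)
    best = [(0, 0, [])]
    for i in range(1, n_words + 1):
        candidates = [best[i - 1]]                      # option: leave word i-1 uncovered
        for (s, e, cat) in fragments:
            if e == i:                                  # option: end a fragment at position i
                covered, neg_count, seq = best[s]
                candidates.append((covered + (e - s), neg_count - 1, seq + [(s, e, cat)]))
        best.append(max(candidates, key=lambda c: (c[0], c[1])))
    return best[n_words][2]

# Toy parse forest over a 5-word sentence: the single spanning S beats NP + VP
frags = [(0, 2, "NP"), (2, 5, "VP"), (0, 5, "S"), (3, 5, "PP")]
print(best_sequence(frags, 5))  # [(0, 5, 'S')]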

  28. LaSIE Discourse Model

  29. Example Applications of IE • Finance • Medicine • Law • Police • Academic Research

  30. Future Trends • Better performance: higher precision and recall • User-defined (rather than expert-defined) IE: minimisation of the role of the expert • Integration with other technologies (e.g. IR) • Multilingual IE
