Information Retrieval

Information Retrieval March 3, 2003 Handout #5

Course Information • Instructor: Dragomir R. Radev (radev@si.umich.edu) • Office: 3080, West Hall Connector • Phone: (734) 615-5225 • Office hours: M&F 11-12 • Course page: http://tangra.si.umich.edu/~radev/650/ • Class meets on Mondays, 1-4 PM in 409 West Hall

The Weka package

Weka • A general environment for machine learning (e.g. for classification and clustering) • Book by Witten and Frank • www.cs.waikato.ac.nz/ml/weka

K-means (continued)

Demos • http://www.cs.mcgill.ca/~bonnef/project.html • http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster/ • http://www-2.cs.cmu.edu/~dellaert/software/ • java weka.clusterers.SimpleKMeans -t data/weather.arff

EM algorithm

EM algorithms • Needed: probabilistic model Θ • Given estimate Θ0 • Useful in the absence of certain data • Class of iterative algorithms for maximum likelihood estimation in problems with incomplete data. Given a model of data generation and data with some missing values, EM alternately uses the current model to estimate the missing values, and then uses the missing value estimates to improve the model. Using all the available data, EM will locally maximize the likelihood of the generative parameters giving estimates for the missing values. [Dempster et al. 77] [McCallum & Nigam 98]

E-M algorithms • Initialize probability model • Repeat • E-step: use the best available current classifier to classify some datapoints • M-step: modify the classifier based on the classes produced by the E-step. • Until convergence
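The loop above can be sketched for one concrete model. This is a minimal illustration, not the Weka implementation: it assumes a two-component 1-D Gaussian mixture and uses soft assignments in the E-step rather than hard classification. The function names, the initialization at the data extremes, and the tolerance are all assumptions made for the example.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(data, max_iter=200, tol=1e-6):
    """Fit a two-component 1-D Gaussian mixture by EM (illustrative sketch)."""
    n = len(data)
    mean_all = sum(data) / n
    var_all = max(sum((x - mean_all) ** 2 for x in data) / n, 1e-6)
    # Initial estimate: means at the data extremes, shared variance, equal weights.
    means = [min(data), max(data)]
    variances = [var_all, var_all]
    weights = [0.5, 0.5]
    prev_ll = None
    for _ in range(max_iter):
        # E-step: use the current model to softly assign each point to a component.
        resp = []
        ll = 0.0
        for x in data:
            joint = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([j / total for j in joint])
        # M-step: re-estimate the model from the soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(spread, 1e-6)  # floor keeps the density finite
        # Until convergence: stop when the log-likelihood stops changing.
        if prev_ll is not None and abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, means, variances


if __name__ == "__main__":
    random.seed(0)
    sample = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(5, 1) for _ in range(200)]
    print(em_two_gaussians(sample))
```

The component labels play the role of the missing data: the E-step estimates them from the current model, and the M-step uses those estimates to improve the model.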

Demos • java weka.clusterers.EM -t data/iris.arff • http://www.neurosci.aist.go.jp/~akaho/MixtureEM.html • http://www.cs.uic.edu/~liub/S-EM/S-EM-download.html

Question Answering

Question answering • Q: When did Nelson Mandela become president of South Africa? A: 10 May 1994 • Q: How tall is the Matterhorn? A: The institute revised the Matterhorn's height to 14,776 feet 9 inches • Q: How tall is the replica of the Matterhorn at Disneyland? A: In fact he has climbed the 147-foot Matterhorn at Disneyland every weekend for the last 3 1/2 years • Q: If Iraq attacks a neighboring country, what should the US do? A: ??

The TREC evaluation • Document retrieval • Eight years • Information retrieval? • Corpus: texts and questions

Prager et al. 2000 (SIGIR) • Radev et al. 2000 (ANLP/NAACL)

Features (1) • Number: position of the span among all spans returned. Example: “Lou Vasquez” was the first span returned by GuruQA on the sample question. • Rspanno: position of the span among all spans returned within the current passage. • Count: number of spans of any span class retrieved within the current passage. • Notinq: the number of words in the span that do not appear in the query. Example: Notinq (“Woodbridge high school”) = 1, because both “high” and “school” appear in the query while “Woodbridge” does not. It is set to –100 when the actual value is 0.
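As an illustration of the Notinq definition, here is a minimal sketch. The whitespace tokenization, the case-insensitive matching, and the sample query are assumptions for the example; the slide does not give the actual query or say how the system tokenizes text.

```python
def notinq(span, query):
    """Count span words that do not appear in the query; return -100 if the count is 0."""
    # Assumption: whitespace tokenization and case-insensitive comparison.
    query_words = {word.lower() for word in query.split()}
    missing = sum(1 for word in span.split() if word.lower() not in query_words)
    return missing if missing > 0 else -100


if __name__ == "__main__":
    # Hypothetical query containing "high" and "school" but not "Woodbridge".
    query = "who is the high school baseball coach"
    print(notinq("Woodbridge high school", query))
    print(notinq("high school", query))
```

The special value of -100 penalizes spans that merely repeat words from the question.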

Features (2) • Type: the position of the span type in the list of potential span types. Example: Type (“Lou Vasquez”) = 1, because the span type of “Lou Vasquez”, namely “PERSON” appears first in the SYN-set, “PERSON ORG NAME ROLE”. • Avgdst: the average distance in words between the beginning of the span and the words in the query that also appear in the passage. Example: given the passage “Tim O'Donohue, Woodbridge High School's varsity baseball coach, resigned Monday and will be replaced by assistant Johnny Ceballos, Athletic Director Dave Cowen said.” and the span “Tim O’Donohue”, the value of avgdst is equal to 8. • Sscore: passage relevance as computed by GuruQA.

Combining evidence • TOTAL (span) = – 0.3 * number – 0.5 * rspanno + 3.0 * count + 2.0 * notinq – 15.0 * types – 1.0 * avgdst + 1.5 * sscore
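The linear combination above can be sketched as a small scoring function. The weights are taken from the formula on the slide; the feature values in the example are made up for illustration and are not results from the system.

```python
# Weights copied from the TOTAL(span) formula.
WEIGHTS = {
    "number": -0.3,
    "rspanno": -0.5,
    "count": 3.0,
    "notinq": 2.0,
    "types": -15.0,
    "avgdst": -1.0,
    "sscore": 1.5,
}


def total_score(features):
    """Weighted sum of the seven span features."""
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)


def rank_spans(candidates):
    """Sort (span, features) pairs by descending TOTAL score."""
    return sorted(candidates, key=lambda item: total_score(item[1]), reverse=True)


if __name__ == "__main__":
    # Hypothetical feature values for a single span.
    example = {"number": 1, "rspanno": 1, "count": 2, "notinq": 2,
               "types": 1, "avgdst": 8, "sscore": 10}
    print(total_score(example))
```

The negative weights on number, rspanno, types, and avgdst mean that spans returned earlier, with a preferred type, and closer to the query words score higher.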

Extracted text

Results 50 bytes 250 bytes

Information Extraction

Types of Information Extraction • Template filling • Language reuse • Biographical information • Question answering

INCIDENT: DATE  30 OCT 89
INCIDENT: LOCATION  EL SALVADOR
INCIDENT: TYPE  ATTACK
INCIDENT: STAGE OF EXECUTION  ACCOMPLISHED
INCIDENT: INSTRUMENT ID
INCIDENT: INSTRUMENT TYPE
PERP: INCIDENT CATEGORY  TERRORIST ACT
PERP: INDIVIDUAL ID  "TERRORIST"
PERP: ORGANIZATION ID  "THE FMLN"
PERP: ORG. CONFIDENCE  REPORTED: "THE FMLN"
PHYS TGT: ID
PHYS TGT: TYPE
PHYS TGT: NUMBER
PHYS TGT: FOREIGN NATION
PHYS TGT: EFFECT OF INCIDENT
PHYS TGT: TOTAL NUMBER
HUM TGT: NAME
HUM TGT: DESCRIPTION  "1 CIVILIAN"
HUM TGT: TYPE  CIVILIAN: "1 CIVILIAN"
HUM TGT: NUMBER  1: "1 CIVILIAN"
HUM TGT: FOREIGN NATION
HUM TGT: EFFECT OF INCIDENT  DEATH: "1 CIVILIAN"
HUM TGT: TOTAL NUMBER
On October 30, 1989, one civilian was killed in a reported FMLN attack in El Salvador.
MUC-4 Example

Language reuse • Phrase to be reused (NP): Yugoslav President [description] Slobodan Milosevic [entity]

Example • NP → NP Punc NP • Andrija Hebrang [entity] , The Croatian Defense Minister [description]

Issues involved • Text generation depends on lexical resources • Lexical choice • Corpus processing vs. manual compilation • Deliberate decisions by writers • Difficult to encode by hand • Dynamically updated (Scott O’Grady) • No full semantic representation

Named entities Richard Butler met Tareq Aziz Monday after rejecting Iraqi attempts to set deadlines for finishing his work. Yitzhak Mordechai will meet Mahmoud Abbas at 7 p.m. (1600 GMT) in Tel Aviv after a 16-month-long impasse in peacemaking. Sinn Fein deferred a vote on Northern Ireland's peace deal Sunday. Hundreds of troops patrolled Dili on Friday during the anniversary of Indonesia's 1976 annexation of the territory.

Entities + Descriptions Chief U.N. arms inspector Richard Butler met Iraq’s Deputy Prime Minister Tareq Aziz Monday after rejecting Iraqi attempts to set deadlines for finishing his work. Israel's Defense Minister Yitzhak Mordechai will meet senior Palestinian negotiator Mahmoud Abbas at 7 p.m. (1600 GMT) in Tel Aviv after a 16-month-long impasse in peacemaking. Sinn Fein, the political wing of the Irish Republican Army, deferred a vote on Northern Ireland's peace deal Sunday. Hundreds of troops patrolled Dili, the Timorese capital, on Friday during the anniversary of Indonesia's 1976 annexation of the territory.

Building a database of descriptions • Size of database: 59,333 entities and 193,228 descriptions as of 08/01/98 • Text processed: 494 MB (ClariNet, Reuters, UPI) • Length: 1-15 lexical items • Accuracy: (precision 94%, recall 55%)

Multiple descriptions per entity • Profile for Ung Huot: A senior member; Cambodia’s; Cambodian foreign minister; Co-premier; First prime minister; Foreign minister; His excellency; Mr.; New co-premier; New first prime minister; Newly-appointed first prime minister; Premier

CONCEPTS + CONSTRAINTS = CONSTRUCTS • Language reuse and regeneration • Corpus analysis: determining constraints • Text generation: applying constraints

Language reuse and regeneration • Understanding: full parsing is expensive • Generation: expensive to use full parses • Bypassing certain stages (e.g., syntax) • Not(!) template-based: still requires extraction, analysis, context identification, modification, and generation • Factual sentences, sentence fragments • Reusability of a phrase

Redefining the relation: DescriptionOf(E, C) = {D_i,C : D_i,C is a description of E in context C} • If named entity E appears in text and the context is C: insert DescriptionOf(E, C) in text. • Context-dependent solution
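A minimal sketch of the context-dependent lookup, assuming the relation is stored as a table keyed by (entity, context). The table entries reuse the Bill Clinton descriptions and context labels from the slides; the function names, the choice of the first candidate, and the sample sentence are assumptions made for illustration.

```python
# Hypothetical storage for DescriptionOf(E, C): (entity, context) -> descriptions.
DESCRIPTIONS = {
    ("Bill Clinton", "foreign relations"): ["U.S. President"],
    ("Bill Clinton", "national affairs"): ["President"],
    ("Bill Clinton", "elections"): ["Democratic presidential candidate"],
}


def description_of(entity, context):
    """Return the descriptions of an entity that fit the given context."""
    return DESCRIPTIONS.get((entity, context), [])


def insert_description(text, entity, context):
    """Prefix the first mention of the entity with a context-appropriate description."""
    candidates = description_of(entity, context)
    if not candidates or entity not in text:
        return text  # nothing known for this context, leave the text unchanged
    return text.replace(entity, f"{candidates[0]} {entity}", 1)


if __name__ == "__main__":
    # Made-up sentence, used only to show the insertion.
    print(insert_description("Bill Clinton arrived for talks on Monday.",
                             "Bill Clinton", "foreign relations"))
```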

Multiple descriptions per entity • Profile for Bill Clinton: U.S. President; President; An Arkansas native; Democratic presidential candidate

Choosing the right description (Bill Clinton; description: context) • U.S. President: foreign relations • President: national affairs • An Arkansas native: false bomb alert in AR • Democratic presidential candidate: elections • Pragmatic and semantic constraints on lexical choice.

Semantic information from WordNet • All words contribute to the semantic representation • Only the first sense is used • What is a synset?

WordNet synset hierarchy
{00001740} entity, something
  {00002086} life form, organism, being, living thing
    {00004123} person, individual, someone, somebody, human
      {06950891} leader
        {07311393} head, chief, top dog
          {07063507} administrator, decision maker
            {07063762} director, manager, managing director
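To show how parent synset offsets can be read off such a hierarchy, here is a small sketch. It hardcodes the seven synsets listed above and assumes each one is the parent of the next one listed; a real system would query WordNet itself rather than a hand-built table.

```python
# Child offset -> parent offset, assuming the listing is a single chain.
PARENT = {
    "07063762": "07063507",  # director, manager -> administrator, decision maker
    "07063507": "07311393",  # administrator -> head, chief, top dog
    "07311393": "06950891",  # head, chief -> leader
    "06950891": "00004123",  # leader -> person, individual
    "00004123": "00002086",  # person -> life form, organism
    "00002086": "00001740",  # life form -> entity, something
}


def hypernym_path(offset):
    """Return the chain of offsets from a synset up to the root."""
    path = [offset]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


if __name__ == "__main__":
    print(PARENT["07063762"])         # immediate parent of "director, manager"
    print(hypernym_path("07063762"))  # full chain up to "entity, something"
```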

Lexico-semantic matrix Profile for Ung Huot

Choosing the right description • Topic approximation by context: words that appear near the entity in the text (bag) • Name of the entity (set) • Length of article (continuous) • Profile: set of all descriptions for that entity (bag) - parent synset offsets for all words wi. • Semantic information: WordNet synset offsets (bag)

Choosing the right description Ripper feature vector [Cohen 1996] (Context, Entity, Description, Length, Profile, Parent) Classes

Example (training)

Sample rules Total number of rules: 4085 for 100,000 inputs

Evaluation • 35,206 tuples; 11,504 distinct entities; 3.06 distinct descriptions per entity (DDPE) • Training: 90% of corpus (10,353 entities) • Test: 10% of corpus (1,151 entities)
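The figures on this slide can be checked with a few lines of arithmetic. The sketch assumes DDPE is the ratio of tuples to distinct entities, which is consistent with the numbers given.

```python
tuples = 35206
entities = 11504
train_entities = 10353
test_entities = 1151

# Assumption: DDPE = tuples / distinct entities.
ddpe = tuples / entities
train_share = train_entities / entities

print(round(ddpe, 2))         # descriptions per entity
print(round(train_share, 2))  # fraction of entities used for training
print(train_entities + test_entities == entities)
```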

Evaluation • Rule format (each matching rule adds constraints): X → [A] (evidence of A); Y → [B] (evidence of B); X Y → [A] [B] (evidence of A and B) • Classes are in 2^W (powerset of WN nodes) • P&R on the constraints selected by system
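The way matching rules accumulate constraints can be sketched as a set union. Representing a rule as a set of required indicators paired with a set of constraints is an assumption made for this example; actual Ripper rules have richer conditions.

```python
# Each rule: (indicators that must be present, constraints it contributes).
RULES = [
    ({"X"}, {"A"}),
    ({"Y"}, {"B"}),
]


def select_constraints(indicators, rules=RULES):
    """Union of the constraints from every rule whose condition is satisfied."""
    selected = set()
    for condition, constraints in rules:
        if condition <= indicators:  # all required indicators are present
            selected |= constraints
    return selected


if __name__ == "__main__":
    print(select_constraints({"X"}))
    print(select_constraints({"X", "Y"}))
```

Because the output is a set of constraints, the predicted class is an element of the powerset of WordNet nodes.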

Definition of precision and recall
Model          System         P        R
[B] [D]        [A] [B] [C]    33.3 %   50.0 %
[A] [B] [C]    [A] [B] [D]    66.7 %   66.7 %
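The two rows can be reproduced with set operations, treating the model and system outputs as sets of constraints. This is a minimal sketch; the function name and the handling of empty sets are assumptions.

```python
def precision_recall(model, system):
    """Precision and recall of the system's constraints against the model's."""
    model, system = set(model), set(system)
    correct = len(model & system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(model) if model else 0.0
    return precision, recall


if __name__ == "__main__":
    for model, system in [("BD", "ABC"), ("ABC", "ABD")]:
        p, r = precision_recall(model, system)
        print(f"{100 * p:.1f} % {100 * r:.1f} %")
```

In the first row only [B] is shared, so precision is 1 of 3 selected constraints and recall is 1 of 2 model constraints.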

Precision and recall
