Relation Extraction for Academic Collaboration 10-709 Project Presentation

Relation Extraction for Academic Collaboration • 10-709 Project Presentation • Justin Betteridge, Matthew Bilotti, Simon Fung, Sophie Wang • February 16, 2006

Academic Collaboration • When two academic researchers work together... • on a proposal • by co-authoring a paper • by co-chairing a committee • in the same project or research group • This is evidence of Academic Collaboration • Binary, symmetric relation • Arguments are of type <person>

Motivation • Why might we be interested in extracting Academic Collaboration relations? • Social Networking • Explore the transitivity of the relation • Proof-of-concept for extending relation extraction machinery to other types of relations

Architectural Overview • [Architecture diagram; components shown: Query Formulator (two instances), Pattern Bank, Relation Bank, IR, Pattern Extractor, Relation Extractor]

Co-training Algorithm • Do until the termination condition is reached: • For each pattern in the pattern bank: • Generate an IR query and send it to the IR engine, getting back a set of documents • For each document in the set: • Extract relations • Score all relations (new and old) • Remove relations below the threshold

Co-training Algorithm II • For each relation in the relation bank: • Generate an IR query and send it to the IR engine, getting back a set of documents • For each document in the set: • Extract context strings for patterns • Score all patterns (new and old) • Remove patterns below the threshold • Loop
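The two halves of the loop above can be sketched as a single function. This is a minimal sketch under stated assumptions, not the project's code: `search`, `extract_relations`, `extract_patterns`, `score_relations`, and `score_patterns` are hypothetical callables supplied by the caller, the quoted-phrase query format is a guess, and the termination condition (a fixed point or an iteration cap) is assumed, since the slides do not specify one.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Relation = Tuple[str, str]  # (x, y) arguments of CollaboratesWith


def co_train(
    patterns: Set[str],
    relations: Set[Relation],
    search: Callable[[str], Iterable[str]],
    extract_relations: Callable[[str, str], Set[Relation]],
    extract_patterns: Callable[[str, Relation], Set[str]],
    score_relations: Callable[[Set[Relation], Set[str]], Dict[Relation, float]],
    score_patterns: Callable[[Set[str], Set[Relation]], Dict[str, float]],
    threshold: float = 0.5,
    max_iterations: int = 5,
) -> Tuple[Set[str], Set[Relation]]:
    """Alternate between growing the relation bank and the pattern bank."""
    patterns, relations = set(patterns), set(relations)
    for _ in range(max_iterations):
        before = (frozenset(patterns), frozenset(relations))

        # Phase 1: each pattern becomes a query; documents yield relations.
        for pattern in list(patterns):
            for doc in search(f'"{pattern}"'):
                relations |= extract_relations(doc, pattern)
        rel_scores = score_relations(relations, patterns)
        relations = {r for r in relations if rel_scores.get(r, 0.0) >= threshold}

        # Phase 2: each relation becomes a query; documents yield patterns.
        for x, y in list(relations):
            for doc in search(f'"{x}" "{y}"'):
                patterns |= extract_patterns(doc, (x, y))
        pat_scores = score_patterns(patterns, relations)
        patterns = {p for p in patterns if pat_scores.get(p, 0.0) >= threshold}

        # Assumed termination condition: stop when neither bank changed.
        if (frozenset(patterns), frozenset(relations)) == before:
            break
    return patterns, relations
```

Passing the scoring functions in as parameters keeps the loop independent of the confidence metric, so a different metric can be tried without touching the control flow.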

Extraction Pattern Formalism • From the proposal: • “left” <x> “between” <y> “right” • Arguments extracted with respect to context • Current status: • “context string” <y> • <x> argument extracted from page title • Extracts the relation • CollaboratesWith( <x>, <y> )
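As an illustration of the “context string” <y> form, the sketch below takes <x> from the page title and reads <y> as the run of capitalized tokens that follows the context string. The function name, the capitalized-token rule for delimiting <y>, and the sample page text are assumptions made for this example; the slides do not say how the argument boundary is found.

```python
import re
from typing import List, Tuple


def extract_with_context(
    page_title: str, page_text: str, context: str
) -> List[Tuple[str, str]]:
    """Return (x, y) pairs: x is the page title, y follows the context string."""
    # The context string is matched case-insensitively; <y> is assumed to be
    # one or more capitalized tokens (periods allowed, e.g. "Prof." or "M.").
    y_pattern = re.compile(
        "(?i:" + re.escape(context) + ")"
        + r"\s+((?:[A-Z][\w.]*)(?:\s+[A-Z][\w.]*)*)"
    )
    pairs = []
    for match in y_pattern.finditer(page_text):
        # Drop a sentence-final period that the token rule may have absorbed.
        y = match.group(1).rstrip(".")
        pairs.append((page_title, y))
    return pairs


# Hypothetical page text, for illustration only.
pairs = extract_with_context(
    "Yan Liu", "My advisor is Jaime Carbonell and I study text mining.", "my advisor is"
)
```

Each returned pair corresponds to one candidate CollaboratesWith( <x>, <y> ) relation, which would still need argument type checking.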

Detecting Argument Types • CollaboratesWith( <x>, <y> ) • <x> and <y> must be of type <person> • Essential to weed out low quality relations produced by noisy patterns such as “in collaboration with” • Heuristics currently encoded as regular expressions
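A regular-expression type check of this kind might look like the sketch below. The specific expressions and the stopword list are assumptions chosen for illustration, not the heuristics actually used in the project, and the rule is deliberately crude (for example, it rejects names with internal capitals).

```python
import re

# Optional honorific at the start of the candidate (assumed list).
TITLE_RE = re.compile(r"^(?:Prof\.|Professor|Dr\.)\s+")
# One capitalized word, followed by more capitalized words or initials.
NAME_RE = re.compile(r"^[A-Z][a-z]+(?:\s+(?:[A-Z][a-z]+|[A-Z]\.))*$")
# Capitalized words that commonly appear in page titles but are not names.
STOPWORDS = {"Personal", "Research", "Professor", "Professors", "Courses"}


def looks_like_person(candidate: str) -> bool:
    """Heuristically decide whether a string could be of type <person>."""
    name = TITLE_RE.sub("", candidate.strip())
    if not NAME_RE.match(name):
        return False
    return not any(token in STOPWORDS for token in name.split())
```

With these rules, `looks_like_person("Jeannette M. Wing")` is accepted, while strings such as "Research" or "L1 Norm with" are rejected.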

Measuring Confidence with Coverage • Confidence for an Extraction Pattern • Intuitively, relations “vote” for patterns • Query each relation, try to extract the pattern • Score = proportion of successful relations • Confidence for a Relation • Query each pattern, try to extract the relation • Score = proportion of successful patterns
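The two coverage scores are the same proportion computed in opposite directions, which the following sketch makes explicit. The `cooccurs` set is a stand-in assumed for this example: it records which (pattern, relation) pairs were successfully extracted together, replacing the actual IR queries.

```python
from typing import Callable, Iterable, Set, Tuple

Relation = Tuple[str, str]


def coverage(voters: Iterable, supports: Callable[[object], bool]) -> float:
    """Proportion of voters that support the candidate (0.0 if no voters)."""
    voters = list(voters)
    if not voters:
        return 0.0
    return sum(1 for voter in voters if supports(voter)) / len(voters)


def pattern_confidence(
    pattern: str, relations: Iterable[Relation], cooccurs: Set[Tuple[str, Relation]]
) -> float:
    # Relations vote: a relation succeeds if the pattern was extracted for it.
    return coverage(relations, lambda r: (pattern, r) in cooccurs)


def relation_confidence(
    relation: Relation, patterns: Iterable[str], cooccurs: Set[Tuple[str, Relation]]
) -> float:
    # Patterns vote: a pattern succeeds if it extracted the relation.
    return coverage(patterns, lambda p: (p, relation) in cooccurs)
```

For example, a relation extracted by one of three patterns in the bank scores 1/3, and one extracted by two of three scores 2/3.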

Issues with Coverage as Confidence • Seed relations and patterns must co-occur • Very little tolerance for “new” information • It is difficult for a new pattern that broadens the scope of the extracted relations to gain enough confidence to surpass the threshold • Scores tend to zero as pools grow • However, ad hoc methods of combining confidence scores from one iteration to the next introduce a new problem: there is no way to oust bad relations or patterns once they have been extracted
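The tendency of scores toward zero follows directly from the proportion: if a pattern is supported by a fixed number of relations, its score shrinks as the relation pool grows. The sketch below works through that arithmetic; the function names and the numbers in the usage line are illustrative assumptions, not measurements from the project.

```python
def coverage_score(supporting: int, pool_size: int) -> float:
    """Coverage = supporting voters / total voters (0.0 for an empty pool)."""
    return supporting / pool_size if pool_size else 0.0


def first_pool_size_below(supporting: int, threshold: float) -> int:
    """Smallest pool size at which a fixed support count falls below threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    pool_size = supporting
    while coverage_score(supporting, pool_size) >= threshold:
        pool_size += 1
    return pool_size


# A pattern supported by 3 relations survives a 0.5 threshold only while
# the relation pool holds at most 6 relations.
cutoff = first_pool_size_below(supporting=3, threshold=0.5)
```

Here `cutoff` is 7: with 3 supporting relations out of 7 the score is about 0.43, which is below the threshold even though nothing about the pattern itself has changed.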

Example Seed Data for Co-Training • Extraction Patterns • <x> “in collaboration with” <y> • <x> “my advisor is” <y> • Relations • CollaboratesWith( Tom, Roni ) • CollaboratesWith( William, Ken )

Extraction Pattern Examples Query: “my advisor is” site:cs.cmu.edu

Extracted Relations (<x>, <y>, score)
"Miroslav Dudik" "Rob Schapire" 0.3333333333333333
"Personal" "Prof. Sanjeev" 0.3333333333333333
"Research" "Professors Jonathan" 0.3333333333333333
"Sharon Whiteman" "Mary Vernon" 0.3333333333333333
"Sudhakar" "Prof. Edward" 0.3333333333333333
"Ting" "Professor Andrew" 0.3333333333333333
"Adriana Karagiozova" "Moses Charikar" 0.6666666666666666
"Akash Lal" "Tom Reps" 0.6666666666666666
"Amy Karlson" "Benjamin B. Bederson" 0.6666666666666666
"Aravind Kalaiah" "Dr. Amitabh" 0.6666666666666666
"Chi Zhang" "Randolph Y. Wang" 0.6666666666666666
"Gaurav Shah" "Matt Blaze" 0.6666666666666666
"Jennifer Beckmann" "Jeff Naughton" 0.6666666666666666
"Lucja Kot" "Dexter Kozen" 0.6666666666666666
"Mark Sandler" "Jon Kleinberg" 0.6666666666666666
"Nina" "Prof. Avrim" 0.6666666666666666
"Patrick Ng" "Uri Keich" 0.6666666666666666
"Pavlos Papageorgiou" "Prof. Michael" 0.6666666666666666
"Pratyusa Manadhata" "Jeannette M. Wing" 0.6666666666666666
"Sudipta" "Marc Pollefeys" 0.6666666666666666
"Sven Koenig" "Reid Simmons" 0.6666666666666666
"Yan Liu" "Jaime Carbonell" 0.6666666666666666

Learned Patterns • “My advisor is” <y> 0.6 • Near misses (hard to assess confidence): • “I work with” 0.4 • “Together with” 0.0667 • “Languages Research under” 0.0333 • “Computer Science advisor” 0.0333 • “Languages under Prof” 0.0 • “Study under Prof” 0.0 • “currently working with” 0.0 • “user studies with” 0.0

Bad Patterns • From citations: • “Amit Agarwal and”, etc. (other authors) • “L1 Norm with” (part of a title) • From professional titles: • “Professor”, “Professor of Mathematics”, etc. • From course web pages: • “courses cs686 2003sp” • Other: • “be addressed to”

Software and Datasets Used • Indri retrieval engine • Locally crawled collection of pages from CS departments of universities • Using a local collection greatly improved the development experience by shortening the debugging cycle, and freed us from the Google API query quota • We used only Indri features that Google also supports, so that Google could be substituted for Indri in the future

Future Work • Different methods of combining confidence scores • including weighting of votes during scoring • Different confidence metrics, e.g., pointwise mutual information (PMI) • Additional useful sources of information: • bibliographies, anchor text and link structure: advisor-advisee cross-refs, department or lab organization • Better argument type checking • Tuning of the threshold • Termination condition • Integration with citations group • Integration with Google • Make code run faster
