Summary Report, Spring 2010
• Research:
  • Google News Project (with Bo, Yintao, Xide and Wei)
    • Ongoing work
      • Cross-document entity annotation and resolution
    • Future work
      • Sentiment analysis on entities
      • Entity relation prediction/propagation
      • Event tracking on news
• INARC:
  • Quarterly Report
  • ARM onsite visit
• Courses:
  • CS512 Data Mining: Principles and Algorithms
  • CS410 Introduction to Text Information Systems
  • CS591 Data Mining Seminar
Effective Information Extraction
• Goal: To extract and identify the same entities across documents
• Pipeline (diagram): Unstructured Text Corpus → Entity Extraction → Entity Resolution
• Extracted dimensions:
  • Dimension 1: Person (Obama, McCain)
  • Dimension 2: Loc (Illinois, Washington)
  • Dimension 3: Org (CNN, NY Times)
Why Entity Resolution?
• The same entity may be referred to by different names, and the same name may refer to different entities.
• Example mentions (diagram): Sen. Barack Obama, Dr. Martin Luther King Jr., Mr. Obama, Barack Obama, Dr. King, Senator Obama, Dr. Martin King, Martin, Obama, Mrs. Obama, Martin Luther King Jr., Michelle Obama, Michelle
Cross-Document Entity Resolution
• Challenge: To identify the underlying entities from different mentions.
• Resolved clusters (diagram):
  • [Barack Obama, Senator Obama, Mr. Obama, Sen. Barack Obama]
  • [Michelle, Mrs. Obama, Michelle Obama]
Features for Resolution
• String similarity features
  • Character-based similarity
    • e.g., edit distance (“John Smith” & “Jhn Smith”)
  • Token-based similarity
    • e.g., TFIDF (“John Smith” & “Smith, John”)
  • Hybrid scheme (SoftTFIDF)
    • e.g., “Computer Science Dept.” & “Dep. Of CompterScence”
• Contextual features
  • Local context (evidence) around the entities
  • Related named-entities (mentioned inside the same sentence/document)
• Many other features …
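The two basic string similarity features above can be illustrated with a small sketch. This is not the project's actual feature code: the function names, the tokenizer, and the smoothed IDF formula are assumptions chosen for illustration, and the hybrid SoftTFIDF scheme is not covered here.

```python
import math
import re
from collections import Counter


def edit_distance(a, b):
    """Character-based similarity: Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def tokenize(name):
    """Assumed tokenizer: lowercase, keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", name.lower())


def tfidf_vector(name, corpus_tokens):
    """TF-IDF weights for one name; the smoothed IDF is an illustrative choice."""
    n_docs = len(corpus_tokens)
    vec = {}
    for tok, count in Counter(tokenize(name)).items():
        df = sum(1 for toks in corpus_tokens if tok in toks)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        vec[tok] = count * idf
    return vec


def tfidf_cosine(a, b, corpus):
    """Token-based similarity: cosine between the TF-IDF vectors of two names."""
    corpus_tokens = [set(tokenize(name)) for name in corpus]
    va = tfidf_vector(a, corpus_tokens)
    vb = tfidf_vector(b, corpus_tokens)
    dot = sum(w * vb.get(tok, 0.0) for tok, w in va.items())
    norm = (math.sqrt(sum(w * w for w in va.values()))
            * math.sqrt(sum(w * w for w in vb.values())))
    return dot / norm if norm else 0.0


# Toy list of mentions (hypothetical), used only to estimate document frequencies.
names = ["John Smith", "Smith, John", "Jhn Smith", "Jane Doe"]

typo_distance = edit_distance("John Smith", "Jhn Smith")
reordered_sim = tfidf_cosine("John Smith", "Smith, John", names)
unrelated_sim = tfidf_cosine("John Smith", "Jane Doe", names)
```

Edit distance handles the single-character typo in “Jhn Smith”, while the token-based score ignores word order and punctuation, which is why “Smith, John” matches “John Smith”. Neither alone handles a reordered and misspelled name, which is the gap a hybrid scheme is meant to fill.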
Information Network Construction
D1: Barack Obama and former President Bill Clinton will do lunch Thursday ... Obama or Hillary Clinton … John McCain … in New York City …
D2: Barack Obama's "lipstick on a pig" comment regarding Republican nominee John McCain's proposals and George W. Bush's policies has taken on a life of its own. … McCain and Obama …
• Inferring links between entities
  • Document-level
  • Sentence-level or window-level
• Heterogeneous Information Network
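Link inference by co-occurrence can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes entities are already resolved to canonical names, that each document arrives pre-split into sentences, and that matching is plain substring search. The entity table, the function name `cooccurrence_edges`, and the toy sentences (simplified stand-ins loosely based on D1 and D2) are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical resolved entities with their node types (Person, Loc, Org),
# which is what makes the resulting network heterogeneous.
ENTITY_TYPES = {
    "Barack Obama": "Person",
    "Bill Clinton": "Person",
    "John McCain": "Person",
    "George W. Bush": "Person",
    "New York City": "Loc",
}


def cooccurrence_edges(docs, entity_types, level="document"):
    """Count how often two entities co-occur in the same document or sentence.

    Each document is a list of sentence strings (segmentation assumed done upstream).
    Returns a Counter keyed by alphabetically ordered (entity, entity) pairs.
    """
    if level not in ("document", "sentence"):
        raise ValueError("level must be 'document' or 'sentence'")
    edges = Counter()
    for sentences in docs:
        units = [" ".join(sentences)] if level == "document" else sentences
        for unit in units:
            # Naive substring matching; a real system would use resolved mentions.
            present = sorted(name for name in entity_types if name in unit)
            for a, b in combinations(present, 2):
                edges[(a, b)] += 1
    return edges


# Simplified toy sentences, not the original article text.
docs = [
    ["Barack Obama and former President Bill Clinton will do lunch Thursday.",
     "John McCain was in New York City."],
    ["Barack Obama's comment regarding John McCain's proposals and "
     "George W. Bush's policies has taken on a life of its own."],
]

doc_edges = cooccurrence_edges(docs, ENTITY_TYPES, level="document")
sent_edges = cooccurrence_edges(docs, ENTITY_TYPES, level="sentence")
```

Document-level linking yields a denser graph (here Barack Obama is linked to New York City), while sentence-level linking keeps only pairs mentioned close together. A window-level variant would replace sentences with fixed-size token windows.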
Before Entity Resolution Top 40 nodes
After Entity Resolution Top 20 nodes