
Summary Report, Spring 2010



Presentation Transcript


  1. Summary Report, Spring 2010
     • Research:
       • Google News Project (with Bo, Yintao, Xide and Wei)
         • Ongoing work
           • Cross-document entity annotation and resolution
         • Future work
           • Sentiment analysis on entities
           • Entity relation prediction/propagation
           • Event tracking on news
     • INARC:
       • Quarterly Report
       • ARM onsite visit
     • Courses:
       • CS512 Data Mining: Principles and Algorithms
       • CS410 Introduction to Text Information Systems
       • CS591 Data Mining Seminar

  2. Effective Information Extraction
     • Goal: To extract entities and identify the same entity across documents
     • [Diagram] Unstructured Text Corpus → Entity Extraction → Entity Resolution, producing:
       • Dimension 1: Person (Obama, McCain)
       • Dimension 2: Loc (Illinois, Washington)
       • Dimension 3: Org (CNN, NY Times)

  3. Why Entity Resolution?
     • The same entity may be referred to by different names, and the same name may refer to different entities.
     • [Diagram: Cross-Document Entity Resolution]
       • Mentions: Sen. Barack Obama, Dr. Martin Luther King Jr., Mr. Obama, Barack Obama, Dr. King, Senator Obama, Dr. Martin King, Martin Obama, Mrs. Obama, Martin Luther King Jr., Michelle Obama, Michelle
       • Resolved cluster: [Barack Obama, Senator Obama, Mr. Obama, Sen. Barack Obama]
       • Resolved cluster: [Michelle, Mrs. Obama, Michelle Obama]
     • Challenge: To identify underlying entities from different mentions.
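
To make the challenge on this slide concrete, here is a minimal Python sketch, not the project's method. It groups the bracketed mentions from the slide by a naive key (the last name token) and shows that such a key both merges two different people and splits one person's mentions. The function name `group_by_last_token` is hypothetical.

```python
import re
from collections import defaultdict

# Mentions taken from the two bracketed clusters on the slide.
MENTIONS = [
    "Barack Obama", "Senator Obama", "Mr. Obama", "Sen. Barack Obama",
    "Michelle", "Mrs. Obama", "Michelle Obama",
]

def group_by_last_token(mentions):
    """Naive baseline (an assumption for illustration): key = last word, lowercased."""
    groups = defaultdict(list)
    for mention in mentions:
        words = re.findall(r"[A-Za-z]+", mention)
        groups[words[-1].lower()].append(mention)
    return dict(groups)

groups = group_by_last_token(MENTIONS)
for key, members in groups.items():
    print(key, "->", members)
```

Under this key, "Mr. Obama" and "Mrs. Obama" fall into the same group although they refer to different people, while "Michelle" is separated from "Michelle Obama". This is why name strings alone are not enough.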

  4. Features for Resolution
     • String similarity features
       • Character-based similarity
         • e.g., edit distance (“John Smith” & “Jhn Smith”)
       • Token-based similarity
         • e.g., TFIDF (“John Smith” & “Smith, John”)
       • Hybrid scheme (SoftTFIDF)
         • e.g., “Computer Science Dept.” & “Dep. Of CompterScence”
     • Contextual features
       • Local context (evidence) around the entities
       • Related named-entities (mentioned inside the same sentence/document)
     • Many other features …
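
The two basic string-similarity families on this slide can be sketched in a few lines of Python. This is an illustrative sketch only: `edit_distance` is the standard Levenshtein dynamic program, and `token_jaccard` is a simple token-set overlap used here as a stand-in for a TFIDF-weighted token similarity (it is not TFIDF and not SoftTFIDF). Both function names are hypothetical.

```python
import re

def edit_distance(a: str, b: str) -> int:
    """Character-based: Levenshtein distance via a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def tokens(name: str) -> set:
    """Lowercased alphanumeric tokens; punctuation and word order are ignored."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def token_jaccard(a: str, b: str) -> float:
    """Token-based: Jaccard overlap of token sets (unweighted stand-in for TFIDF)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

print(edit_distance("John Smith", "Jhn Smith"))    # one missing character
print(token_jaccard("John Smith", "Smith, John"))  # same tokens, different order
print(token_jaccard("John Smith", "Jhn Smith"))    # the typo breaks the token match
```

The character-based measure tolerates the typo in "Jhn Smith" but is sensitive to word order, while the token-based measure handles "Smith, John" but not the typo. A hybrid scheme such as SoftTFIDF is meant to cover both cases.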

  5. Information Network Construction
     • D1: Barack Obama and former President Bill Clinton will do lunch Thursday ... Obama or Hillary Clinton … John McCain … in New York City …
     • D2: Barack Obama's "lipstick on a pig" comment regarding Republican nominee John McCain's proposals and George W. Bush's policies has taken on a life of its own. … McCain and Obama …
     • Inferring links between entities
       • Document-level
       • Sentence-level or window-level
     • Heterogeneous Information Network
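
Document-level link inference can be sketched as co-occurrence counting. The Python sketch below assumes the entities in D1 and D2 have already been extracted and resolved to canonical names; the type labels and the function name `cooccurrence_edges` are assumptions for illustration, not taken from the slides.

```python
from collections import Counter
from itertools import combinations

# Resolved entities per document (assumed output of extraction + resolution).
DOCS = {
    "D1": ["Barack Obama", "Bill Clinton", "Hillary Clinton",
           "John McCain", "New York City"],
    "D2": ["Barack Obama", "John McCain", "George W. Bush"],
}

# Node types make the network heterogeneous (labels are assumptions).
NODE_TYPE = {name: "Person" for doc in DOCS.values() for name in doc}
NODE_TYPE["New York City"] = "Loc"

def cooccurrence_edges(docs):
    """Document-level links: weight = number of documents mentioning both entities."""
    edges = Counter()
    for entities in docs.values():
        # sorted(set(...)) gives each unordered pair one canonical key per document
        for pair in combinations(sorted(set(entities)), 2):
            edges[pair] += 1
    return edges

edges = cooccurrence_edges(DOCS)
for (u, v), weight in edges.most_common(3):
    print(f"{u} ({NODE_TYPE[u]}) -- {v} ({NODE_TYPE[v]}): {weight}")
```

Sentence-level or window-level linking would follow the same pattern, with each sentence or token window treated as the co-occurrence unit instead of the whole document.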

  6. Before Entity Resolution
     • [Figure: top 40 nodes]

  7. After Entity Resolution
     • [Figure: top 20 nodes]
