Explore the past, present, and future of entity extraction and mapping on the Web with a focus on information networks and schema discovery. Learn about tools like Google Sets and visual list extraction. Understand the importance of finding lists and how it relates to data extraction and recommendation systems. Dive into mapping pages to records and inferring schemas from web pages. Discover the potential of information networks for ranking, clustering, and semantic search. WinaCS aims to be an innovative information network-based web search engine.
WinaCS Project: Web Entity Extraction and Mapping. Discovering and Propagating Context. Tim Weninger, Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL
Past, Present, Future
• Past: Entity search and retrieval is one of the dreams of the Web (TBL)
• Present: Ranking and retrieval, a bi-directional approach
  1) Information Networks
  2) Web Mining and Information Extraction
     a) List Finding
     b) Entity-page Discovery
     c) Entity-page Mapping
• Future: InfoBase Project, information extraction via schema discovery
Finding lists on the Web is hard! (KDD Explorations, Dec. 2010)
1. Google Sets
2. WebTables
3. Mining Data Records (MDR)
4. World Wide Tables (WWT)
5. Tag Path Clustering
6. RoadRunner
7. SEAL
8. Visual List Extraction
9. VIsion-based Page Segmentation (VIPS)
10. Visualized Element Nodes Table extraction (VENTex)
Why is finding lists important? • Charu Aggarwal • Deepayan Chakrabarti • Ed Chang • Kevin Chang • Olivier Chapelle • Chris Clifton • Jiawei Han • … • Jiawei Han • ChengXiang Zhai • Kevin Chang • Dan Roth • Marianne Winslett • Jiawei Han • ChengXiang Zhai • Kevin Chang • Dan Roth • Marianne Winslett • Sarita Adve • Tarek Abdelzaher • Vikram Adve • Gul Agha • … Lists like these support correction, inference, disambiguation, recommendation, etc.
Mapping Pages to Records (CIKM'10)
Example:
$A_{p_1}$ = {People, Faculty, Dan Roth, Personal Site}
$A_{p_2}$ = {Research, Data Mining, Dan Roth, Personal Site}
Bag of Anchors: {Research: 1, People: 1, Faculty: 1, Data Mining: 1, Dan Roth: 2, Personal Site: 2}
Sorted Bag of Anchors: $A_{u;v_1}$ = {Dan Roth: 2/2 = 1, Research: 1/2 = 0.5, Data Mining: 1/2 = 0.5, Personal Site: 2/5 = 0.4, People: 1/3 = 0.33, Faculty: 1/3 = 0.33}
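The arithmetic in this example can be reproduced with a short sketch. It merges the two anchor paths into a bag, divides each count by a per-anchor denominator, and sorts by the resulting score. The slide does not say what the denominators (2, 2, 2, 5, 3, 3) count, so the `normalizer` table below simply copies them as given input; the names `bag_of_anchors`, `sorted_bag`, and `normalizer` are hypothetical and not taken from the CIKM'10 paper.

```python
from collections import Counter

# Anchor-text paths that lead to the same page, copied from the example.
paths = [
    ["People", "Faculty", "Dan Roth", "Personal Site"],
    ["Research", "Data Mining", "Dan Roth", "Personal Site"],
]

# Denominators exactly as they appear in the example. What they measure is
# not stated there, so they are treated here as given input (an assumption).
normalizer = {
    "Dan Roth": 2,
    "Research": 2,
    "Data Mining": 2,
    "Personal Site": 5,
    "People": 3,
    "Faculty": 3,
}


def bag_of_anchors(anchor_paths):
    """Count how often each anchor text occurs across all paths."""
    bag = Counter()
    for path in anchor_paths:
        bag.update(path)
    return bag


def sorted_bag(bag, denominators):
    """Score each anchor as count / denominator and sort by descending score."""
    scored = {anchor: count / denominators[anchor] for anchor, count in bag.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


bag = bag_of_anchors(paths)
ranked = sorted_bag(bag, normalizer)

for anchor, score in ranked:
    print(f"{anchor}: {score:.2f}")
```

With these inputs, "Dan Roth" receives the highest score, which matches the intuition that the most specific anchor text identifies the record the page belongs to.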
CSMap: Locations of the top 25 computer science departments, automatically generated by extracting and ranking 5-digit numbers from entity Web pages.
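A minimal sketch of the extraction step, assuming that "ranking" means ordering candidates by how often they occur across an entity's pages (the slide does not specify the ranking function). The function name `rank_five_digit_numbers` and the sample page texts are made up for illustration.

```python
import re
from collections import Counter

# Exactly five digits, not embedded in a longer run of digits.
FIVE_DIGITS = re.compile(r"(?<!\d)\d{5}(?!\d)")


def rank_five_digit_numbers(page_texts):
    """Return (number, count) pairs, most frequent first."""
    counts = Counter()
    for text in page_texts:
        counts.update(FIVE_DIGITS.findall(text))
    return counts.most_common()


# Illustrative page texts for one entity (not real crawl data).
pages = [
    "Department of Computer Science, Urbana, IL 61801",
    "Contact: Urbana, IL 61801. Phone 2175551234. Room 12345",
    "Mailing address: Urbana, IL 61801-2302",
]

ranked = rank_five_digit_numbers(pages)
print(ranked)
```

The lookaround assertions keep the 10-digit phone number from contributing spurious candidates, while a ZIP+4 code still yields its 5-digit prefix. The top-ranked number would then be treated as the entity's ZIP code and placed on the map.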
Next Steps: The hard part!
Infer categories/schemas from a set of Web pages.
Example: Name, Address, ZipCode, Publications, Collaborators, Organizations
How can we infer this schema? Wikipedia?
How can we populate it?
What do these entities have in common?
Next Steps: The hardest part!
[Figure: schema diagram with parts labeled "Inferred" and "Given"]
This can be modeled as a heterogeneous information network.
Thus, ranking and clustering are possible.
So are semantic search, keyword search, and typal search.
Cube operations are possible.
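To make the idea of a heterogeneous information network concrete, here is a toy sketch in which every node carries a type label taken from the example schema (Name, ZipCode, Organization). The class `HeteroNetwork`, its methods, the placeholder organizations, and the links between nodes are all hypothetical illustrations, and degree-based ranking is only a stand-in for whatever ranking function the project actually uses.

```python
from collections import defaultdict


class HeteroNetwork:
    """Toy heterogeneous information network: each node has a type label."""

    def __init__(self):
        self.node_type = {}
        self.adjacent = defaultdict(set)

    def add_edge(self, a, a_type, b, b_type):
        """Add an undirected link, registering both nodes with their types."""
        self.node_type[a] = a_type
        self.node_type[b] = b_type
        self.adjacent[a].add(b)
        self.adjacent[b].add(a)

    def neighbors_of_type(self, node, wanted_type):
        """Type-restricted lookup: neighbors of `node` having `wanted_type`."""
        return sorted(
            n for n in self.adjacent.get(node, ()) if self.node_type[n] == wanted_type
        )

    def rank_by_degree(self, wanted_type):
        """Rank nodes of one type by link count (ties broken by name)."""
        nodes = [n for n, t in self.node_type.items() if t == wanted_type]
        return sorted(nodes, key=lambda n: (-len(self.adjacent[n]), n))


# Illustrative links only; "Org A" and "Org B" are placeholders.
net = HeteroNetwork()
net.add_edge("Dan Roth", "Name", "61801", "ZipCode")
net.add_edge("Jiawei Han", "Name", "61801", "ZipCode")
net.add_edge("Dan Roth", "Name", "Org A", "Organization")
net.add_edge("Jiawei Han", "Name", "Org A", "Organization")
net.add_edge("Jiawei Han", "Name", "Org B", "Organization")

people_at_zip = net.neighbors_of_type("61801", "Name")
ranked_people = net.rank_by_degree("Name")
print(people_at_zip, ranked_people)
```

Because every node is typed, a query can be restricted to one type ("which names link to this ZIP code?"), which is the kind of lookup an untyped link graph cannot express directly.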