Collectively Representing Semi-Structured Data from the Web

Collectively Representing Semi-Structured Data from the Web. Bhavana Dalvi, William W. Cohen and Jamie Callan. Language Technologies Institute, Carnegie Mellon University. Paper ID: 02. This work is supported by Google and the Intelligence Advanced Research Projects Activity.


Presentation Transcript


  1. Collectively Representing Semi-Structured Data from the Web. Bhavana Dalvi, William W. Cohen and Jamie Callan. Language Technologies Institute, Carnegie Mellon University. Paper ID: 02. This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.

  2. Motivation • Entities on the Web can appear in multiple datasets, e.g. HTML tables and text documents. • Traditional systems represent an entity as a sparse vector of the IDs of the documents in which it occurs. • We propose a low-dimensional representation for such entities. • It helps perform different tasks efficiently with a small number of primitive operations: • Semi-Supervised Learning (SSL) • Set Expansion (SE) • Automatic Set Instance Acquisition (ASIA)

  3. Entities in HTML tables. [Figure: Entity-Column bipartite graph linking table columns (TC-1, TC-2, TC-3, TC-4) to entities (USA, India, Hockey, Cricket, Tennis).]
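The entity-column graph on this slide can be held as a simple adjacency map. The sketch below is only illustrative: the column contents are hypothetical (the transcript does not preserve the edges of the figure), and `build_bipartite` is a name introduced here, not something from the talk.

```python
from collections import defaultdict

# Hypothetical toy input: each table column (TC-*) lists the entities found in it.
# The actual edges of the slide's figure are not recoverable from the transcript.
table_columns = {
    "TC-1": ["USA", "India"],
    "TC-2": ["India", "USA"],
    "TC-3": ["Hockey", "Cricket", "Tennis"],
    "TC-4": ["Cricket", "Tennis"],
}


def build_bipartite(columns):
    """Return a map from entity to the set of column ids it occurs in."""
    graph = defaultdict(set)
    for column_id, entities in columns.items():
        for entity in entities:
            graph[entity].add(column_id)
    return dict(graph)


entity_to_columns = build_bipartite(table_columns)
```

Each entity's set of column ids is the sparse, high-dimensional representation that the later slides compress.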

  4. Entities in unstructured text. Example sentences: “Countries such as India are developing rapidly in terms of infrastructure.” “Outdoor sports include Tennis and Cricket.” [Figure: “Such as” bipartite graph linking such-as concepts (Country, Location, Sports) to entities (USA, India, Hockey, Cricket, Tennis).]
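As a rough illustration of how "such as" edges could be harvested from text, the sketch below uses a single regular expression. This is an assumption made for illustration, not the extractor used in the work: it only handles a one-word concept followed by one capitalised entity, and `SUCH_AS` and `extract_such_as` are hypothetical names.

```python
import re

# Assumed pattern: a one-word concept, the phrase "such as", then one
# capitalised entity. Real extractors also handle lists and multi-word phrases.
SUCH_AS = re.compile(r"\b(\w+)\s+such\s+as\s+([A-Z]\w*)")


def extract_such_as(sentence):
    """Return (concept, entity) pairs found in one sentence."""
    return [(concept.lower(), entity) for concept, entity in SUCH_AS.findall(sentence)]


pairs = extract_such_as(
    "Countries such as India are developing rapidly in terms of infrastructure."
)
```

Each extracted pair becomes one edge between a concept node and an entity node in the bipartite graph.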

  5. Resultant Tri-partite Graph. [Figure: the Entity-Column bipartite graph and the “Such as” bipartite graph joined on their shared entity nodes: table columns (TC-1, TC-2, TC-3, TC-4), entities (USA, India, Hockey, Cricket, Tennis), such-as concepts (Country, Location, Sports).]

  6. Encoding the graph: “Entity-Column” bipartite graph. Low-dimensional embedding using bipartite Power Iteration Clustering (Lin & Cohen, ICML 2010/ECAI 2010). Entities with similar X1/X2 values should be ontologically similar; the values summarize tabular co-occurrence. [Figure: table columns (TC-1, TC-2, TC-3, TC-4) linked to entities (USA, India, Hockey, Cricket, Tennis).]

  7. Encoding the graph: “Such as” bipartite graph. Low-dimensional embedding using bipartite Power Iteration Clustering (Lin & Cohen, ICML 2010/ECAI 2010). Entities with similar Y1/Y2 values should be ontologically similar; the values summarize “such as” pattern co-occurrence. [Figure: such-as concepts (Country, Location, Sports) linked to entities (USA, India, Hockey, Cricket, Tennis).]
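A minimal sketch of a power-iteration embedding over one bipartite graph follows. It makes several assumptions that the slides do not confirm: the entity affinity is A = F Fᵀ for the 0/1 entity-feature matrix F, each of the m dimensions comes from an independent random start vector, and iteration stops after a fixed number of steps rather than by a convergence test. The function name `pic_embedding` and the toy edges are hypothetical.

```python
import random


def pic_embedding(rows, n_features, m=2, iterations=10, seed=0):
    """Embed n entities into m dimensions with truncated power iteration.

    rows[i] is the set of feature indices (table columns or "such as"
    concepts) that entity i co-occurs with. The affinity matrix A = F F^T
    is never materialised; each step computes F (F^T v) instead.
    """
    rng = random.Random(seed)
    n = len(rows)

    def affinity_times(v):
        # u = F^T v: accumulate entity values onto their features.
        u = [0.0] * n_features
        for i, feats in enumerate(rows):
            for f in feats:
                u[f] += v[i]
        # F u: sum the feature values back onto each entity.
        return [sum(u[f] for f in feats) for feats in rows]

    # Row sums of A, used to row-normalise A into W = D^-1 A.
    degree = affinity_times([1.0] * n)

    embedding = [[] for _ in range(n)]
    for _ in range(m):
        v = [rng.random() for _ in range(n)]
        for _ in range(iterations):
            av = affinity_times(v)
            v = [a / d if d > 0 else 0.0 for a, d in zip(av, degree)]
            norm = sum(abs(x) for x in v) or 1.0
            v = [x / norm for x in v]  # L1 normalisation, as in PIC
        for i in range(n):
            embedding[i].append(v[i])
    return embedding


# Hypothetical edges: USA, India, Hockey, Cricket, Tennis against TC-1..TC-4.
rows = [{0, 1}, {0, 1}, {2}, {2, 3}, {2, 3}]
emb = pic_embedding(rows, n_features=4, m=2)
```

In this toy graph, entities with identical co-occurrence rows (USA and India, or Cricket and Tennis) receive identical coordinates, which is the behaviour the slide describes informally as "similar values for ontologically similar entities".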

  8. Low-dimensional PIC3 embedding • n * t entity-tableColumn bipartite graph → PIC → n * m PIC embedding (m << t) • n * s entity-suchas bipartite graph → PIC → n * m PIC embedding (m << s) • Concatenate the two PIC embeddings → n * 2m PIC3 embedding
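The concatenation step can be sketched as a row-wise join of the two n * m embeddings. The numeric values below are made up for illustration, and `concat_embeddings` is a hypothetical helper name.

```python
def concat_embeddings(table_embedding, suchas_embedding):
    """Join two n x m embeddings row by row into one n x 2m embedding."""
    if len(table_embedding) != len(suchas_embedding):
        raise ValueError("both embeddings must cover the same n entities")
    return [x + y for x, y in zip(table_embedding, suchas_embedding)]


# Hypothetical values for n = 3 entities and m = 2 dimensions per graph.
x = [[0.30, 0.31], [0.30, 0.29], [0.05, 0.04]]  # from the entity-tableColumn graph
y = [[0.25, 0.24], [0.26, 0.25], [0.02, 0.03]]  # from the entity-suchas graph
pic3 = concat_embeddings(x, y)
```

Both inputs must list the entities in the same order, so that row i of each embedding refers to the same entity.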

  9. Using PIC3 Representation • Semi-Supervised Learning: Given a few seed examples for each class, predict class labels for unlabeled data points. • Set Expansion: Given a set of seed entities, find more entities similar to the seed entities. • Automatic Set Instance Acquisition (ASIA): Given a concept name, automatically find instances of that concept.

  10. Quantitative Evaluation: Datasets. Link to dataset: http://rtw.ml.cmu.edu/wk/WebSets/wsdm_2012_online

  11. SSL using PIC3. Input: a few seed examples for each class label. Output: class labels for unlabeled data points. PIC clusters similar entities together → better SVM classifier on unlabeled data (use of background data).
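To show the shape of the task, the sketch below classifies an unlabeled entity from a handful of seeds in PIC3 space. It deliberately swaps in a nearest-centroid classifier in place of the SVM named on the slide, to stay within the Python standard library; the vectors, labels and function names are all hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def fit_centroids(seeds):
    """seeds maps class label -> list of PIC3 vectors of its seed entities."""
    return {label: centroid(vectors) for label, vectors in seeds.items()}


def predict(centroids, vector):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))


# Hypothetical 4-dimensional PIC3 vectors (2m with m = 2) for the seed entities.
seeds = {
    "country": [[0.30, 0.31, 0.25, 0.24], [0.30, 0.29, 0.26, 0.25]],
    "sport": [[0.05, 0.04, 0.02, 0.03], [0.06, 0.05, 0.03, 0.02]],
}
model = fit_centroids(seeds)
label = predict(model, [0.07, 0.05, 0.02, 0.02])  # an unlabeled entity
```

Because the embedding is computed over labeled and unlabeled entities together, even a simple classifier benefits from the cluster structure of the background data.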

  12. SSL Task - I. # dimensions: 2504 → 10

  13. SSL Task - II. # dimensions: 2574 → 10

  14. Set Expansion using PIC3. Input: a few seed entities, e.g. Football, Hockey, Tennis. Output: more entities of the same type as the seeds, e.g. Baseball, Badminton, Cricket, Golf, ... The K-NN operation is extremely efficient using KD-trees.
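A small sketch of set expansion as a nearest-neighbour query follows. Two assumptions are made here that the slide does not state: candidates are ranked by distance to the centroid of the seed vectors, and a brute-force scan stands in for the KD-tree lookup. The embedding values and the name `expand_set` are hypothetical.

```python
import heapq
import math


def expand_set(embedding, seeds, k):
    """Return the k non-seed entities closest to the centroid of the seeds."""
    seed_vectors = [embedding[name] for name in seeds]
    center = [sum(col) / len(seed_vectors) for col in zip(*seed_vectors)]
    candidates = [name for name in embedding if name not in seeds]
    # Brute-force scan; a KD-tree would answer the same query faster.
    return heapq.nsmallest(
        k, candidates, key=lambda name: math.dist(embedding[name], center)
    )


# Hypothetical 2-dimensional vectors, for illustration only.
embedding = {
    "Football": [0.90, 0.10],
    "Hockey": [0.85, 0.15],
    "Tennis": [0.80, 0.10],
    "Cricket": [0.82, 0.12],
    "Golf": [0.88, 0.20],
    "India": [0.10, 0.90],
    "USA": [0.15, 0.85],
}
expanded = expand_set(embedding, ["Football", "Hockey", "Tennis"], k=2)
```

With a low-dimensional embedding, the scan can be replaced by a KD-tree without changing the result, which is where the efficiency claimed on the slide comes from.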

  15. Query Times • PIC3 preprocessing: 0.02 sec • # SE queries = 881 • Precision-Recall Curve: K-NN+PIC3 consistently beats the K-NN baseline. The Modified Adsorption method is better on 2/5 query classes at the expense of larger query time. Modified Adsorption is a graph-based label propagation algorithm.

  16. Automatic Set Instance Acquisition (ASIA) using PIC3. Input: a class label, e.g. Country. Output: entities belonging to the given class label, e.g. India, China, USA, Canada, Japan, ... The previously described Set Expansion algorithm is used as a subroutine here.
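One way to read "Set Expansion as a subroutine" is sketched below. The seed-selection step is an assumption of this sketch, not something the slide specifies: seeds are taken from the entities linked to the class name in a "such as" concept map, and the result is the seeds followed by their nearest neighbours. All names and values are hypothetical.

```python
import heapq
import math


def expand_set(embedding, seeds, k):
    """Return the k non-seed entities closest to the centroid of the seeds."""
    seed_vectors = [embedding[name] for name in seeds]
    center = [sum(col) / len(seed_vectors) for col in zip(*seed_vectors)]
    candidates = [name for name in embedding if name not in seeds]
    return heapq.nsmallest(
        k, candidates, key=lambda name: math.dist(embedding[name], center)
    )


def asia(class_name, concept_to_entities, embedding, k, n_seeds=2):
    """Find instances of class_name: pick seeds, then expand them."""
    # Assumed seed selection: entities seen with the concept in "such as" text.
    linked = concept_to_entities.get(class_name, set())
    seeds = sorted(e for e in linked if e in embedding)[:n_seeds]
    if not seeds:
        return []
    return seeds + expand_set(embedding, seeds, k)


# Hypothetical data, for illustration only.
concept_to_entities = {"country": {"India", "USA"}}
embedding = {
    "India": [0.10, 0.90],
    "USA": [0.15, 0.85],
    "China": [0.12, 0.88],
    "Japan": [0.14, 0.90],
    "Tennis": [0.80, 0.10],
    "Cricket": [0.82, 0.12],
}
instances = asia("country", concept_to_entities, embedding, k=2)
```

A class name with no linked entities yields an empty result in this sketch, since there is nothing to expand from.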

  17. Query Times • PIC3 preprocessing: 0.02 sec • # ASIA queries = 25 • Precision-Recall Curve: K-NN+PIC3 consistently beats the K-NN baseline. The Modified Adsorption method is better on 2/4 query classes at the expense of much larger query time.

  18. Conclusions & Future Work • Presented a novel low-dimensional PIC3 representation for entities on the Web using Power Iteration Clustering (PIC). • Simple primitive operations on PIC3 perform the following tasks: • Semi-Supervised Learning • Set Expansion • Automatic Set Instance Acquisition • Future work: use the PIC3 representation for • Named entity disambiguation • Unsupervised class-instance pair acquisition

  19. Thank You! Please visit our poster, ID: 02. This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.

  20. Examples : Set Expansion

  21. Examples : ASIA

  22. Set Expansion

  23. ASIA Task