Collectively Representing Semi-Structured Data from the Web
Bhavana Dalvi, William W. Cohen and Jamie Callan
Language Technologies Institute, Carnegie Mellon University
Paper ID : 02
This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.
Motivation
• Entities on the Web can appear in multiple datasets, e.g. HTML tables and text documents.
• Traditional systems represent an entity as a sparse vector of the IDs of the documents in which it occurs.
• We propose a low-dimensional representation for such entities.
• It helps to perform different tasks efficiently with a small number of primitive operations:
  • Semi-Supervised Learning (SSL)
  • Set Expansion (SE)
  • Automatic Set Instance Acquisition (ASIA)
Entities in HTML tables: Entity-Column Bi-partite Graph
[Figure: bipartite graph linking table-column nodes (TC-1, TC-2, TC-3, TC-4) to entity nodes (USA, India, Hockey, Cricket, Tennis).]
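The entity-column graph on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' extraction pipeline: the toy `tables` data, the helper name `build_entity_column_graph`, and the sequential `TC-<n>` numbering are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy input: each table is a list of columns,
# and each column is a list of cell strings.
tables = [
    [["USA", "India"], ["Hockey", "Cricket"]],
    [["India", "USA"], ["Cricket", "Tennis"]],
]


def build_entity_column_graph(tables):
    """Return {entity: set of table-column ids} for a list of tables.

    Every column of every table gets its own id (TC-1, TC-2, ...),
    and an edge is added between that id and each cell value in it.
    """
    graph = defaultdict(set)
    column_number = 0
    for table in tables:
        for column in table:
            column_number += 1
            column_id = f"TC-{column_number}"
            for cell in column:
                graph[cell.strip()].add(column_id)
    return dict(graph)


graph = build_entity_column_graph(tables)
for entity in sorted(graph):
    print(entity, sorted(graph[entity]))
```

Entities that share many table-columns (here USA and India) end up with the same neighbor sets, which is the co-occurrence signal the later embedding step is meant to summarize.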
Entities in unstructured text: “Such as” Bi-partite Graph
Example sentences: “Countries such as India are developing rapidly in terms of infrastructure.” “Outdoor sports include Tennis and Cricket.”
[Figure: bipartite graph linking “such as” concept nodes (Country, Location, Sports) to entity nodes (USA, India, Hockey, Cricket, Tennis).]
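A rough sketch of how "such as" edges might be pulled from text is shown below. It assumes a deliberately simplified regular expression (single-token capitalized entities, the literal phrase "such as" only), so it does not cover the "include" pattern from the second example sentence, and it does not map plural concept words such as "Countries" to the singular node names used in the figure. The pattern `SUCH_AS`, the function `extract_such_as_edges`, and the second input sentence are illustrative assumptions, not taken from the original work.

```python
import re
from collections import defaultdict

# Simplified pattern: "<concept> such as <Entity>[, <Entity>]* [and <Entity>]".
# Only single-token, capitalized entities are recognized.
SUCH_AS = re.compile(
    r"(\w+)\s+such\s+as\s+((?:[A-Z]\w*(?:,\s*|\s+and\s+)?)+)"
)


def extract_such_as_edges(sentences):
    """Return {entity: set of concept words} found via the 'such as' pattern."""
    edges = defaultdict(set)
    for sentence in sentences:
        for match in SUCH_AS.finditer(sentence):
            concept = match.group(1).lower()
            # Split the entity list on commas and on the word "and".
            for entity in re.split(r",\s*|\s+and\s+", match.group(2)):
                if entity:
                    edges[entity].add(concept)
    return dict(edges)


edges = extract_such_as_edges([
    "Countries such as India are developing rapidly in terms of infrastructure.",
    # Hypothetical sentence added for illustration:
    "Sports such as Tennis, Cricket and Hockey are played outdoors.",
])
for entity in sorted(edges):
    print(entity, sorted(edges[entity]))
```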
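Resultant Tri-partite Graph
[Figure: the “Such as” bi-partite graph and the Entity-Column bi-partite graph joined on their shared entity nodes (USA, India, Hockey, Cricket, Tennis), with “such as” concept nodes (Country, Location, Sports) on one side and table-column nodes (TC-1 to TC-4) on the other.]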
Encoding the graph: “Entity-Column” Bi-partite Graph
• Low-dimensional embedding using bipartite Power Iteration Clustering (Lin & Cohen, ICML 2010/ECAI 2010).
• Entities with similar X1/X2 values should be ontologically similar; the values summarize tabular co-occurrence.
[Figure: table-column nodes (TC-1 to TC-4) linked to entity nodes (USA, India, Hockey, Cricket, Tennis).]
Encoding the graph: “Such as” Bi-partite Graph
• Low-dimensional embedding using bipartite Power Iteration Clustering (Lin & Cohen, ICML 2010/ECAI 2010).
• Entities with similar Y1/Y2 values should be ontologically similar; the values summarize “such as pattern” co-occurrence.
[Figure: “such as” concept nodes (Country, Location, Sports) linked to entity nodes (USA, India, Hockey, Cricket, Tennis).]
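The embedding step can be illustrated with a small power-iteration sketch over a bipartite graph. It is a simplification under stated assumptions, not the authors' implementation: the entity-entity affinity is taken to be A = F Fᵀ (shared-column counts) and is never materialized, each of the m dimensions comes from a different random start vector, a fixed iteration count stands in for PIC's early-stopping test, and every entity is assumed to have at least one neighbor. The function name `pic_embedding` and the toy `rows` are hypothetical.

```python
import random


def pic_embedding(rows, n_cols, m=2, iters=10, seed=0):
    """Power-iteration embedding of the entities of a bipartite graph.

    rows   : list of n sets; rows[i] holds the column indices linked to entity i
    n_cols : number of nodes on the other side of the bipartite graph
    Returns a list of n vectors with m values each.
    """
    rng = random.Random(seed)
    n = len(rows)

    # Degree of entity i under A = F F^T is the summed size of its columns.
    col_size = [0] * n_cols
    for cols in rows:
        for c in cols:
            col_size[c] += 1
    degree = [sum(col_size[c] for c in cols) for cols in rows]

    embedding = [[] for _ in range(n)]
    for _ in range(m):
        v = [rng.random() for _ in range(n)]
        total = sum(v)
        v = [x / total for x in v]
        for _ in range(iters):  # fixed count instead of an early-stopping rule
            # Step 1: F^T v, i.e. push entity values onto their columns.
            col_sum = [0.0] * n_cols
            for i, cols in enumerate(rows):
                for c in cols:
                    col_sum[c] += v[i]
            # Step 2: D^-1 F (F^T v), i.e. pull back and row-normalize.
            v = [sum(col_sum[c] for c in cols) / degree[i]
                 for i, cols in enumerate(rows)]
            # Step 3: L1-normalize so values stay in a comparable range.
            norm = sum(abs(x) for x in v)
            v = [x / norm for x in v]
        for i in range(n):
            embedding[i].append(v[i])
    return embedding


# Toy graph: two country entities share columns 0 and 1,
# three sport entities are spread over columns 2 and 3.
names = ["USA", "India", "Hockey", "Cricket", "Tennis"]
rows = [{0, 1}, {0, 1}, {2}, {2, 3}, {3}]
embedding = pic_embedding(rows, n_cols=4)
for name, vector in zip(names, embedding):
    print(name, [round(x, 4) for x in vector])
```

On this toy graph the two countries receive matching values and the three sports receive values close to one another, which is the "similar values, similar entities" behavior the slide describes.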
Low-dimensional PIC3 embedding
• n * t entity-tableColumn bipartite graph → PIC → n * m PIC embedding (m << t)
• n * s entity-suchas bipartite graph → PIC → n * m PIC embedding (m << s)
• Concatenate the two PIC embeddings → n * 2m PIC3 embedding
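The concatenation step amounts to joining the two per-entity vectors row by row. A minimal sketch follows, assuming both embeddings list the same n entities in the same order; the function name `concat_embeddings` and the numeric values are made up for illustration.

```python
def concat_embeddings(table_embedding, suchas_embedding):
    """Join two n x m embeddings row by row into one n x 2m embedding."""
    if len(table_embedding) != len(suchas_embedding):
        raise ValueError("both embeddings must cover the same n entities")
    return [list(x) + list(y)
            for x, y in zip(table_embedding, suchas_embedding)]


# Hypothetical values for n = 3 entities and m = 2 dimensions per graph.
table_embedding = [[0.30, 0.31], [0.29, 0.30], [0.10, 0.12]]    # X1, X2
suchas_embedding = [[0.25, 0.26], [0.26, 0.25], [0.08, 0.09]]   # Y1, Y2

pic3 = concat_embeddings(table_embedding, suchas_embedding)
print(len(pic3), "entities,", len(pic3[0]), "dimensions each")
```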
Using PIC3 Representation
• Semi-Supervised Learning : Given a few seed examples for each class, predict class labels for the unlabeled data points.
• Set Expansion : Given a set of seed entities, find more entities similar to the seeds.
• Automatic Set Instance Acquisition (ASIA) : Given a concept name, automatically find instances of that concept.
Quantitative Evaluation: Datasets
Link to dataset: http://rtw.ml.cmu.edu/wk/WebSets/wsdm_2012_online
SSL using PIC3
• Input : a few seed examples for each class label
• Output : class labels for the unlabeled data points
• PIC clusters similar entities together → better SVM classifier on unlabeled data (use of background data)
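The input/output contract of this task can be illustrated with a small stand-in. Note the substitution: the slide names an SVM classifier, while the sketch below uses a nearest-centroid rule instead so that it runs with the standard library alone. The function name `nearest_centroid_ssl` and the toy vectors are assumptions made for the example.

```python
import math


def nearest_centroid_ssl(embedding, seeds):
    """Label unlabeled entities by the closest class centroid.

    embedding : {entity: low-dimensional vector}
    seeds     : {class label: list of seed entities}
    Returns {entity: predicted label} for every non-seed entity.
    """
    # One centroid per class, averaged over that class's seed vectors.
    centroids = {}
    for label, members in seeds.items():
        dims = zip(*(embedding[e] for e in members))
        centroids[label] = [sum(d) / len(members) for d in dims]

    labeled = {e for members in seeds.values() for e in members}
    predictions = {}
    for entity, vector in embedding.items():
        if entity in labeled:
            continue
        predictions[entity] = min(
            centroids, key=lambda label: math.dist(vector, centroids[label])
        )
    return predictions


# Hypothetical 2-dimensional vectors; real PIC3 vectors would have 2m values.
embedding = {
    "USA": [0.30, 0.31], "India": [0.29, 0.30], "China": [0.31, 0.29],
    "Hockey": [0.10, 0.12], "Cricket": [0.11, 0.10], "Tennis": [0.12, 0.11],
}
seeds = {"country": ["USA"], "sport": ["Hockey"]}
predictions = nearest_centroid_ssl(embedding, seeds)
print(predictions)
```

Because similar entities already sit close together in the low-dimensional space, even one seed per class is enough to separate the two groups in this toy setting.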
SSL Task - I (# dimensions : 2504 vs. 10)
SSL Task - II (# dimensions : 2574 vs. 10)
Set Expansion using PIC3
• Input : a few seed entities, e.g. Football, Hockey, Tennis
• Output : more entities of the same type as the seeds, e.g. Baseball, Badminton, Cricket, Golf, …
• The K-NN operation is extremely efficient using KD-trees.
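Set expansion as a K-NN query can be sketched as below. Two assumptions are made for the example: the seeds are summarized by their centroid (the slides do not say how multiple seeds are combined), and a linear scan replaces the KD-tree, which would only change how the nearest neighbors are found, not which ones are returned. The function name `expand_set` and the vectors are hypothetical.

```python
import heapq
import math


def expand_set(embedding, seeds, k=3):
    """Return the k non-seed entities closest to the centroid of the seeds."""
    dims = zip(*(embedding[s] for s in seeds))
    centroid = [sum(d) / len(seeds) for d in dims]
    # Linear scan over all candidates; a KD-tree would speed up this lookup.
    candidates = (
        (math.dist(vector, centroid), name)
        for name, vector in embedding.items()
        if name not in seeds
    )
    return [name for _, name in heapq.nsmallest(k, candidates)]


# Hypothetical low-dimensional vectors for a handful of entities.
embedding = {
    "Football": [0.10, 0.11], "Hockey": [0.11, 0.10], "Tennis": [0.12, 0.12],
    "Baseball": [0.10, 0.12], "Badminton": [0.12, 0.10], "Cricket": [0.11, 0.13],
    "India": [0.30, 0.29], "USA": [0.31, 0.30],
}
expanded = expand_set(embedding, ["Football", "Hockey", "Tennis"], k=3)
print(expanded)
```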
Query Times
• PIC3 preprocessing : 0.02 sec
• # SE queries = 881
• Precision-Recall Curve : K-NN+PIC3 consistently beats K-NN-Baseline. The Modified Adsorption method is better on 2/5 query classes at the expense of larger query time.
• Modified Adsorption : graph-based label propagation algorithm
Automatic Set Instance Acquisition (ASIA) using PIC3
• Input : a class label, e.g. Country
• Output : entities belonging to the given class label, e.g. India, China, USA, Canada, Japan, …
• The previously described Set Expansion algorithm is used as a subroutine here.
Query Times
• PIC3 preprocessing : 0.02 sec
• # ASIA queries = 25
• Precision-Recall Curve : K-NN+PIC3 consistently beats K-NN-Baseline. The Modified Adsorption method is better on 2/4 query classes at the expense of much larger query time.
Conclusions & Future Work
• Presented a novel low-dimensional PIC3 representation for entities on the Web using Power Iteration Clustering (PIC).
• Simple primitive operations on PIC3 perform the following tasks:
  • Semi-Supervised Learning
  • Set Expansion
  • Automatic Set Instance Acquisition
• Future work : use the PIC3 representation for
  • Named entity disambiguation
  • Unsupervised class-instance pair acquisition
Thank You! Please visit our poster (ID : 02).