I2R-NUS-MSRA at TAC 2011: Entity Linking. Wei Zhang1, Jian Su2, Bin Chen2, Wenting Wang2, Zhiqiang Toh2, Yanchuan Sim2, Yunbo Cao3, Chin Yew Lin3 and Chew Lim Tan1. 1 National University of Singapore. 2 Institute for Infocomm Research. 3 Microsoft Research Asia.


Presentation Transcript


  1. I2R-NUS-MSRA at TAC 2011: Entity Linking Wei Zhang1, Jian Su2, Bin Chen2, Wenting Wang2, Zhiqiang Toh2, Yanchuan Sim2, Yunbo Cao3, Chin Yew Lin3 and Chew Lim Tan1 1 National University of Singapore 2 Institute for Infocomm Research 3 Microsoft Research Asia Text Analysis Conference, November 14-15, 2011

  2. I2R-NUS-MSRA at TAC 2011: Entity Linking Outline • I2R-NUS team at TAC • incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011) • Acronym Expansion • Semantic Features • Instance Selection • Investigate three algorithms for NIL query clustering • Spectral Graph Partitioning (SGP) • Hierarchical Agglomerative Clustering (HAC) • Latent Dirichlet allocation (LDA) • Combination system • Offline Combination with the system of MSRA team at KB linking step Text Analysis Conference, November 14-15, 2011

  3. I2R-NUS-MSRA at TAC 2011: Entity Linking Outline • I2R-NUS team at TAC • incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011) • Acronym Expansion • Semantic Features • Instance Selection • Investigate three algorithms for NIL query clustering • Spectral Graph Partitioning (SGP) • Hierarchical Agglomerative Clustering (HAC) • Latent Dirichlet allocation (LDA) • Combination system • Combine with the system of MSRA team at KB linking step Text Analysis Conference, November 14-15, 2011

  4. I2R-NUS-MSRA at TAC 2011: Entity Linking Acronym Expansion - Motivation • Expanding an acronym using its document context reduces the ambiguity of a name • E.g. "TSE" in Wikipedia refers to 33 entries, whereas "Tokyo Stock Exchange" is unambiguous. Text Analysis Conference, November 14-15, 2011

  5. I2R-NUS-MSRA at TAC 2011: Entity Linking Step 1 – Find Expansion Candidates • Identifying Candidate Expansions (e.g. for ACM) Text Analysis Conference, November 14-15, 2011
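
A minimal sketch of how candidate expansions for an acronym might be collected from the surrounding document: any short run of words whose initials (ignoring common stopwords) contain the acronym's letters as a subsequence is kept as a candidate. The function names and the matching heuristic are illustrative assumptions, not the system's exact candidate-generation rules.

```python
STOPWORDS = {"of", "for", "the", "and", "in"}

def initials(words):
    # Concatenate the initial letters of the words, skipping common stopwords.
    return "".join(w[0].upper() for w in words if w.lower() not in STOPWORDS)

def is_subsequence(small, big):
    it = iter(big)
    return all(ch in it for ch in small)

def candidate_expansions(acronym, sentences):
    """Return (phrase, sentence_index) pairs whose word initials cover the
    acronym as a subsequence; the ranking step (next slide) decides among them."""
    acro = acronym.upper()
    cands = []
    for idx, sent in enumerate(sentences):
        words = sent.replace(",", " ").replace(".", " ").split()
        for i in range(len(words)):
            for j in range(i + 2, min(i + 8, len(words)) + 1):  # 2..7-word phrases
                phrase = words[i:j]
                if is_subsequence(acro, initials(phrase)):
                    cands.append((" ".join(phrase), idx))
    return cands

sents = ["He joined the Association for Computing Machinery in 2001.",
         "ACM organises many conferences each year."]
print(candidate_expansions("ACM", sents))
```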

  6. I2R-NUS-MSRA at TAC 2011: Entity Linking Step 2 – Candidate Expansion Ranking • Use an SVM classifier to rank the candidates • Our SVM-based acronym expansion • can link an acronym to a full form that appears in a different sentence of the article • Feature: sentence distance between the acronym and the expansion • can handle acronyms with swapped letters • E.g. Communist Party of China vs. CCP • Feature: number of common characters between the acronym and the leading characters of the expansion Text Analysis Conference, November 14-15, 2011
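
A hedged sketch of the ranking step described above: each acronym/candidate pair is turned into a small feature vector (only the two features named on the slide) and scored with a linear SVM. The toy training data, the kernel choice and the helper names are placeholders, not the system's actual configuration.

```python
from sklearn.svm import SVC

def expansion_features(acronym, expansion, acro_sent_idx, exp_sent_idx):
    # Feature 1: common characters between the acronym and the expansion's
    # leading characters (tolerates swapped letters, e.g. CCP vs.
    # "Communist Party of China").
    leading = [w[0].upper() for w in expansion.split()]
    common = sum(1 for ch in acronym.upper() if ch in leading)
    # Feature 2: sentence distance between acronym and expansion.
    distance = abs(acro_sent_idx - exp_sent_idx)
    return [common, distance]

# Toy labelled pairs (1 = correct expansion, 0 = incorrect); real training
# data would come from annotated acronym/expansion examples.
X = [[3, 0], [3, 2], [1, 0], [0, 5]]
y = [1, 1, 0, 0]
ranker = SVC(kernel="linear").fit(X, y)

candidates = {
    "Communist Party of China": expansion_features("CCP", "Communist Party of China", 4, 1),
    "City Council Plaza": expansion_features("CCP", "City Council Plaza", 4, 9),
}
# Rank candidates by their signed distance to the SVM hyperplane.
best = max(candidates, key=lambda c: ranker.decision_function([candidates[c]])[0])
print(best)
```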

  7. I2R-NUS-MSRA at TAC 2011: Entity Linking Outline • I2R-NUS team at TAC • incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011) • Acronym Expansion • Semantic Features • Instance Selection • Investigate three algorithms for NIL query clustering • Spectral Graph Partitioning (SGP) • Hierarchical Agglomerative Clustering (HAC) • Latent Dirichlet allocation (LDA) • Combination system • Combine with the system of MSRA team at KB linking step Text Analysis Conference, November 14-15, 2011

  8. A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection Related Work on Context Similarity • Zhang et al., 2010; Zheng et al., 2010; Dredze et al., 2010 • Term Matching • However, term matching breaks down when contexts of the same entity share no terms: 1) Michael Jordan is a leading researcher in machine learning and artificial intelligence. 2) Michael Jordan is currently a full professor at the University of California, Berkeley. 3) Michael Jordan (born February 1963) is a former American professional basketball player. 4) Michael Jordan wins NBA MVP of the 91-92 season. → No term match The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand

  9. A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection Our System - A Wikipedia-LDA model • 1) Michael Jordan is a leading researcher in machine learning and artificial intelligence. • 2) Michael Jordan is currently a full professor at the University of California, Berkeley. • 3) Michael Jordan (born February 1963) is a former American professional basketball player. • 4) Michael Jordan wins NBA MVP of the 91-92 season. • The model groups 1) and 2) under Topic: Science, and 3) and 4) under Topic: Basketball The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand

  10. A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection Wikipedia – LDA Model • The model estimates P(word | category) and P(category | document) over Wikipedia documents (model diagram not reproduced in this transcript) The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand
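
To make this concrete, here is a small sketch that uses gensim's generic LDA as a stand-in for the Wikipedia-LDA model above: topic distributions are inferred for the query context and for a candidate KB entry, and their cosine similarity becomes a semantic feature for the linker. The toy corpus, the topic count and the use of plain LDA (rather than topics anchored to Wikipedia categories) are all simplifying assumptions.

```python
from gensim import corpora, models, matutils

# Toy stand-in for Wikipedia text; in the actual model, topics correspond to
# Wikipedia categories rather than being learned freely.
texts = [doc.split() for doc in [
    "machine learning professor berkeley artificial intelligence research",
    "basketball player nba mvp season chicago bulls",
    "stock exchange market shares trading tokyo",
]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=3,
                      passes=10, random_state=0)

def topic_vector(text):
    # P(topic | document) for a new piece of text.
    bow = dictionary.doc2bow(text.lower().split())
    return lda.get_document_topics(bow, minimum_probability=0.0)

query_context = "michael jordan research machine learning"
candidate_entry = "professor berkeley artificial intelligence"
similarity = matutils.cossim(topic_vector(query_context), topic_vector(candidate_entry))
print(round(similarity, 3))  # semantic-similarity feature for the linker
```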

  11. A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection Wikipedia – LDA Model • 1) Michael Jordan is a leading researcher in machine learning and artificial intelligence. • 2) Michael Jordan is currently a full professor at the University of California, Berkeley. • 3) Michael Jordan (born February 1963) is a former American professional basketball player. • 4) Michael Jordan wins NBA MVP of the 91-92 season. The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand

  12. I2R-NUS-MSRA at TAC 2011: Entity Linking Outline • I2R-NUS team at TAC • incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011) • Acronym Expansion • Semantic Features • Instance Selection • Investigate three algorithms for NIL query clustering • Spectral Graph Partitioning (SGP) • Hierarchical Agglomerative Clustering (HAC) • Latent Dirichlet allocation (LDA) • Combination system • Combine with the system of MSRA team at KB linking step Text Analysis Conference, November 14-15, 2011

  13. A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection Related Work • Vector Space Model • Difficult to combine bag of words (BOW) with other features • Performance still needs improvement • Supervised Approaches • Using manually annotated training instances • Dredze et al., 2010; Zheng et al., 2010 • Using automatically generated training instances • Zhang et al., 2010 The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand

  14. A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection Related Work • (News Article) Obama Campaign Drops The George W. Bush Talking Point … • Auto-generated training instances (Zhang et al., 2010) The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand

  15. A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection Related Work • From "George W. Bush" articles • No positive instances are generated for "George H. W. Bush", "George P. Bush" or "George Washington Bush" • No negative instances are generated for "George W. Bush" • Such positive/negative training-instance distributions may not match those of the originally ambiguous cases in the raw text collection • The distribution of the unambiguous mentions may also differ in the test data The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand

  16. The Approach in Our System A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection • An instance selection approach • Select an informative, representative, and diverse subset from the auto-generated data set • Reduce the effect of the distribution differences The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand

  17. A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection Instance Selection • Start from a small initial data set and train an SVM classifier on it • Apply the classifier to the auto-generated data set and select informative, representative and diverse instances (illustrated on the slide in 2-D relative to the SVM hyperplane) • Add the selected instances to the initial data set and repeat The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand
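
A rough sketch of this selection loop, assuming scikit-learn and vector-valued instances: pool instances that lie close to the current SVM hyperplane (informative) and are not near-duplicates of instances already picked (diverse) are added to the training set, and the classifier is retrained. The batch size, thresholds and the exact informativeness/representativeness/diversity criteria of the paper are not reproduced here.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import cosine_similarity

def select_instances(X_init, y_init, X_pool, y_pool, rounds=3, batch=5, div_thr=0.95):
    """Grow the training set from the auto-generated pool (illustrative only)."""
    X_tr, y_tr = list(X_init), list(y_init)
    pool = list(range(len(X_pool)))
    clf = SVC(kernel="linear").fit(X_tr, y_tr)
    for _ in range(rounds):
        # Informative = close to the current hyperplane.
        margins = np.abs(clf.decision_function([X_pool[i] for i in pool]))
        picked = []
        for idx in (pool[i] for i in np.argsort(margins)):
            if len(picked) == batch:
                break
            # Diverse = not too similar to instances already picked this round.
            if all(cosine_similarity([X_pool[idx]], [X_pool[j]])[0, 0] < div_thr
                   for j in picked):
                picked.append(idx)
        for idx in picked:
            X_tr.append(X_pool[idx]); y_tr.append(y_pool[idx])
            pool.remove(idx)
        clf = SVC(kernel="linear").fit(X_tr, y_tr)  # retrain after each batch
    return clf
```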

  18. I2R-NUS-MSRA at TAC 2011: Entity Linking Outline • I2R-NUS team at TAC • incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011) • Acronym Expansion • Semantic Features • Instance Selection • Investigate three algorithms for NIL query clustering • Spectral Graph Partitioning (SGP) • Hierarchical Agglomerative Clustering (HAC) • Latent Dirichlet allocation (LDA) • Combination system • Combine with the system of MSRA team at KB linking step Text Analysis Conference, November 14-15, 2011

  19. Spectral Clustering • Advantages over other clustering techniques • Globally optimized results • Efficient in time and space • Generally produces better results • Success in many areas • Image segmentation • Gene expression clustering

  20. Spectral Clustering • Eigendecomposition of the graph Laplacian: A = QΛQ⁻¹ • Dimensionality reduction • (Luxburg, 2006) • Illustration: partitioning the document-similarity graph separates the "George W. Bush" documents from the "George H. W. Bush" documents
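
A small sketch of what this step could look like with scikit-learn: build a document-similarity graph over the NIL-query documents and let spectral clustering (eigendecomposition of the graph Laplacian followed by k-means in the reduced space) partition them. The toy documents, the tf-idf representation and the known cluster count are assumptions; the real system has to estimate the number of underlying entities.

```python
from sklearn.cluster import SpectralClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy NIL-query documents that all mention an ambiguous "Bush".
docs = [
    "George W. Bush won the 2004 presidential election.",
    "Bush addressed the nation after the 2003 Iraq invasion.",
    "George H. W. Bush served as vice president under Reagan.",
    "Bush ordered Operation Desert Storm in 1991.",
]
# Pairwise document similarities play the role of the affinity graph.
affinity = cosine_similarity(TfidfVectorizer().fit_transform(docs))

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
print(labels)  # documents sharing a label are grouped under one entity
```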

  21. I2R-NUS-MSRA at TAC 2011: Entity Linking Hierarchical Agglomerative Clustering • Convert each doc into a feature vector: Wikipedia concepts, bag-of-words and named entities • Estimate the weight of each feature using the Query Relevance Weighting Model (Long and Shi, 2010): • this model shows good performance in Web People Search • In our work, the original query name, its Wikipedia redirect names and its coreference chain mentions are all treated as appearances of the query name in the text • Similarity scores: cosine similarity and overlap similarity Text Analysis Conference, November 14-15, 2011

  22. I2R-NUS-MSRA at TAC 2011: Entity Linking Hierarchical Agglomerative Clustering • Docs referring to the same entity are clustered according to doc pair-wise similarity scores • Start with singletons: each doc is its own cluster • Merge two clusters Ci and Cj into a new cluster Cij if there exist docs D in Ci and D' in Cj with Sim(D, D') > γ, where γ = 0.25 • Calculate the similarity between the new cluster Cij and all remaining clusters, and repeat Text Analysis Conference, November 14-15, 2011
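
A minimal sketch of the merging rule above, assuming plain tf-idf vectors and cosine similarity (the actual system also uses Wikipedia concepts, named entities and query-relevance feature weights): clusters start as singletons and two clusters are merged whenever some document pair across them exceeds the γ = 0.25 threshold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

GAMMA = 0.25  # merge threshold from the slide

def hac(docs, gamma=GAMMA):
    sim = cosine_similarity(TfidfVectorizer().fit_transform(docs))
    clusters = [{i} for i in range(len(docs))]  # start with singletons
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Merge Ci and Cj if some pair (D, D') across them is similar enough.
                if any(sim[i, j] > gamma for i in clusters[a] for j in clusters[b]):
                    clusters[a] = clusters[a] | clusters[b]
                    del clusters[b]
                    merged = True
                    break
            if merged:
                break
    return clusters

print(hac(["tokyo stock exchange shares rise",
           "trading on the tokyo exchange",
           "bush wins the election campaign",
           "bush presidential election results"]))
```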

  23. I2R-NUS-MSRA at TAC 2011: Entity Linking Latent Dirichlet Allocation (LDA) • LDA has been applied to many NLP tasks such as summarization and text classification • In our approach, the learned topics can represent the underlying entities of the ambiguous names • Generative story: (not reproduced in this transcript) Text Analysis Conference, November 14-15, 2011
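
As a concrete illustration of using LDA this way, the sketch below (again with gensim as a stand-in for the actual model) treats each learned topic as one underlying entity and assigns every NIL-query document to its most probable topic. The toy documents, topic count and hyperparameters are assumptions.

```python
from gensim import corpora, models

# Toy documents that all mention the same NIL query name.
docs = ["machine learning professor berkeley research lab",
        "basketball mvp nba season player",
        "nba playoffs basketball chicago",
        "artificial intelligence machine learning research"]
texts = [d.split() for d in docs]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Assumption: each learned topic stands for one underlying entity.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      passes=20, random_state=0)

def entity_cluster(bow):
    # Assign a document to its most probable topic (= entity cluster).
    return max(lda.get_document_topics(bow), key=lambda tp: tp[1])[0]

print([entity_cluster(bow) for bow in corpus])
```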

  24. I2R-NUS-MSRA at TAC 2011: Entity Linking Three Clustering Systems Combination • A three-class SVM classifier decides which system to trust • Features: the scores given by the three systems Combination with the system of the MSRA team at the KB linking step • A binary SVM classifier decides which system to trust • Features: the scores given by the two systems Text Analysis Conference, November 14-15, 2011
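
A sketch of the combination idea, under the assumption that each clustering system emits a confidence score per query: a three-class SVM, trained on development queries where the correct system is known, picks which system's output to trust. The feature values, labels and kernel below are placeholders; the binary combination with the MSRA system at the KB-linking step works the same way with two score features.

```python
from sklearn.svm import SVC

# Per-query confidence scores from the three clustering systems
# (SGP, HAC, LDA); labels say which system was right on that dev query
# (0 = SGP, 1 = HAC, 2 = LDA). All values here are made up.
X_dev = [[0.9, 0.4, 0.3], [0.2, 0.8, 0.5], [0.3, 0.4, 0.9],
         [0.7, 0.2, 0.1], [0.1, 0.9, 0.4], [0.2, 0.3, 0.8]]
y_dev = [0, 1, 2, 0, 1, 2]

selector = SVC(kernel="linear").fit(X_dev, y_dev)

# At test time, trust whichever system the classifier selects for each query.
print(selector.predict([[0.5, 0.9, 0.2]]))  # -> index of the system to trust
```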

  25. I2R-NUS-MSRA at TAC 2011: Entity Linking Experiment for Three Clustering Algorithms Text Analysis Conference, November 14-15, 2011

  26. I2R-NUS-MSRA at TAC 2011: Entity Linking Submissions Text Analysis Conference, November 14-15, 2011

  27. I2R-NUS-MSRA at TAC 2011: Entity Linking Conclusion • Incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011) • Acronym Expansion • Semantic Features • Instance Selection • Investigate three algorithms for NIL query clustering • Spectral Graph Partitioning (SGP) • Hierarchical Agglomerative Clustering (HAC) • Latent Dirichlet allocation (LDA) Text Analysis Conference, November 14-15, 2011
