Cross-Lingual Linking of News Stories using ESA
Nitish Aggarwal, Kartik Asooja, Paul Buitelaar, Tamara Polajnar, Jorge Gracia
DERI, NUI Galway, Ireland; OEG, UPM, Madrid, Spain
Tuesday, 18 Dec, 2012. CL!NSS, FIRE-2012
Overview
• Problem Space
• Approach
• Search Space Reduction
• Semantic Ranking
• Cross-Lingual Explicit Semantic Analysis (CL-ESA)
• Evaluations
• Conclusion & Future Work
Problem Space
• Cross-lingual news story linking
  • Identify the same news articles in different languages
  • Cross-lingual plagiarism detection
• Data set
  • 50 English news stories
  • 50K Hindi news stories
• Challenge
  • Stories are not direct translations of one another
  • Similar keywords in different stories
  • Different keywords in similar stories
Approach
• Search Space Reduction
  • News publication dates
    • Keep only stories within a K-day window
  • Vocabulary overlap
    • Translate the English news stories using Google Translate
• Semantic Ranking
  • Rank the news stories by their semantic relatedness
  • CL-ESA semantic relatedness score
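The publication-date filter can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, story IDs, and dates are hypothetical, and the window is assumed to be symmetric around the English story's date.

```python
from datetime import date

def within_window(query_date, candidate_date, days_each_side):
    """True if candidate_date lies at most days_each_side days from query_date."""
    return abs((candidate_date - query_date).days) <= days_each_side

def reduce_by_date(query_date, candidates, days_each_side):
    """Keep the IDs of (story_id, publication_date) pairs inside the window."""
    return [
        story_id
        for story_id, published in candidates
        if within_window(query_date, published, days_each_side)
    ]

# Hypothetical toy data: one English story date and four Hindi candidates.
english_date = date(2012, 6, 10)
hindi_stories = [
    ("hi-001", date(2012, 6, 8)),
    ("hi-002", date(2012, 6, 11)),
    ("hi-003", date(2012, 6, 15)),
    ("hi-004", date(2012, 5, 30)),
]

narrow = reduce_by_date(english_date, hindi_stories, 2)  # 2 days either side
wide = reduce_by_date(english_date, hindi_stories, 7)    # 7 days either side
```

A wider window keeps more candidates for the semantic ranker, so fewer true matches are lost at the cost of more stories to score.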
Semantic Ranking/Relatedness
• Corpus-based Relatedness
  • Semantic meaning as a distributional vector
  • Words that occur in similar contexts tend to have similar or related meanings, i.e. the meaning of a word can be defined in terms of its context (Distributional Hypothesis, Harris, 1954)
• Latent Semantic Analysis (LSA)
  • Latent or implicit semantics (unsupervised)
• Explicit Semantic Analysis (ESA)
  • Explicit semantics from explicitly derived concepts (supervised)
Cross-lingual ESA (CL-ESA)
• Multilingual Wikipedia Index
  • EN, DE, ES, PT, FR, NL, HI
• Easily extendable to other languages
• Performed better than CL-latent models
[Figure: for each language (EN, HI, ES), an inverted index maps every word (Word1 … Wordn) to a weighted concept vector w1*URI1 + w2*URI2 + … + wn*URIn. Term@en and Term@hi are each mapped to such a vector (w11*URI1 + w12*URI2 + … + w1n*URIn), and the cosine of the two vectors gives their semantic relatedness.]
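The relatedness computation on this slide can be sketched as below. This is a toy illustration under stated assumptions: the tiny `INDEX`, its concept IDs and weights are invented, and summing term vectors to get a text vector is an assumption. A real system would derive the weights from the Wikipedia index of each language and rely on shared concept identifiers across languages.

```python
import math
from collections import defaultdict

# Hypothetical toy index: language -> term -> {concept_id: weight}.
# Concept IDs are shared across languages, which is what makes the
# English and Hindi vectors comparable.
INDEX = {
    "en": {
        "election": {"C_Election": 0.9, "C_Parliament": 0.4},
        "minister": {"C_Parliament": 0.7, "C_Election": 0.2},
        "cricket": {"C_Cricket": 1.0},
    },
    "hi": {
        "चुनाव": {"C_Election": 0.8, "C_Parliament": 0.3},
        "मंत्री": {"C_Parliament": 0.6, "C_Election": 0.1},
        "क्रिकेट": {"C_Cricket": 1.0},
    },
}

def concept_vector(text, lang):
    """Sum the concept vectors of all known terms in a whitespace-tokenised text."""
    vector = defaultdict(float)
    for term in text.lower().split():
        for concept, weight in INDEX[lang].get(term, {}).items():
            vector[concept] += weight
    return dict(vector)

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(concept, 0.0) for concept, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def relatedness(text_a, lang_a, text_b, lang_b):
    """Cross-lingual relatedness as the cosine of the two concept vectors."""
    return cosine(concept_vector(text_a, lang_a), concept_vector(text_b, lang_b))

same_topic = relatedness("election minister", "en", "चुनाव मंत्री", "hi")
other_topic = relatedness("election minister", "en", "क्रिकेट", "hi")
```

In this toy data the two political texts share concepts and score high, while the cricket text shares none and scores zero, with no translation step involved.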
Experiments
• Run1
  • Window of 4 days (2 days before and 2 days after)
  • Rank all news stories using CL-ESA
• Run2
  • Window of 14 days (7 days before and 7 days after)
  • Rank all news stories using Modified CL-ESA
• Run3
  • English stories were translated into Hindi using Google Translate
  • Took the top 1000 Hindi news stories by vocabulary overlap
  • Re-rank these news stories using CL-ESA
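The two-stage shape of Run3 (shortlist by vocabulary overlap, then re-rank) can be sketched as follows. The slides do not say how overlap was measured, so counting shared distinct tokens is an assumption. The function names and toy tokens are hypothetical, and the scoring callable is only a stand-in for a CL-ESA relatedness score.

```python
def vocabulary_overlap(query_tokens, doc_tokens):
    """Number of distinct tokens shared by the query and the document (assumed measure)."""
    return len(set(query_tokens) & set(doc_tokens))

def shortlist(query_tokens, docs, top_n):
    """Return the IDs of the top_n documents by vocabulary overlap.

    docs is a list of (doc_id, tokens) pairs; ties keep their input order
    because Python's sort is stable.
    """
    ranked = sorted(
        docs,
        key=lambda item: vocabulary_overlap(query_tokens, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:top_n]]

def rerank(doc_ids, score):
    """Order the shortlisted IDs by a relatedness score, highest first."""
    return sorted(doc_ids, key=score, reverse=True)

# Hypothetical toy data: tokens of a translated query and three candidates.
translated_query = ["a", "b", "c"]
candidates = [
    ("d1", ["a", "b", "x"]),
    ("d2", ["a", "y"]),
    ("d3", ["z"]),
]
# Stand-in relatedness scores; a real run would compute them with CL-ESA.
toy_scores = {"d1": 0.2, "d2": 0.7, "d3": 0.9}

top_two = shortlist(translated_query, candidates, top_n=2)
final_order = rerank(top_two, toy_scores.get)
```

The toy data also shows the risk of this design: `d3` has the highest stand-in relatedness score but never reaches the ranker, because it shares no vocabulary with the translated query.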
Evaluation: Results
• CL!NSS challenge
Conclusion & Future Work
• Initial approach for cross-lingual linking of news stories
• Bigger window with modified CL-ESA works best
• Translated vocabulary overlap did not work well
• Use other ranking scores
  • LSA, LDA
• Evaluate the separate effect of components
  • Bigger window size vs. ranking function
Thank You
Questions?