This paper explores a Latent Semantic Indexing (LSI)-based method for clustering multilingual documents. We detail the LSI technique, which employs singular value decomposition to reduce dimensionality while preserving meaningful relationships between terms and documents. This study includes an empirical evaluation of the technique’s effectiveness in clustering multi-language documents, analyzing factors like synonymy, polysemy, and vocabulary differences across languages. The results demonstrate the efficacy of LSI in creating a multilingual semantic space, ultimately enhancing decision support systems.
A Latent Semantic Indexing-based approach to multilingual document clustering Chih-Ping Wei, Christopher C. Yang, Chia-Min Lin Decision Support Systems 45 (2008) 606-620 Reporter: Yi-Ru Lee
Outline • Introduction • Latent Semantic Indexing (LSI) • LSI-based multilingual document clustering technique • Empirical evaluation • Conclusion
Introduction • Translation-based approach: hampered by synonymy, polysemy, and vocabulary differences across languages • Multilingual-space approach: Latent Semantic Indexing (LSI) • Goes beyond lexical matching • Reduces the dimensionality of the representation
Latent Semantic Indexing (cont.) Singular Value Decomposition (SVD)
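The SVD step behind LSI can be sketched as follows: decompose the term-document matrix and keep only the k largest singular values. The matrix values here are illustrative only, not the paper's data.

```python
import numpy as np

# Toy term-document matrix: rows = terms (from both languages in a
# parallel corpus), columns = documents. Values are illustrative only.
A = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
])

# Full SVD: A = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (rank-k approximation).
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
A_k = U_k @ np.diag(s_k) @ Vt_k

# By the Eckart-Young theorem, A_k is the closest rank-k matrix to A
# in the Frobenius norm; the error equals the discarded singular values.
err = np.linalg.norm(A - A_k)
print(err)
```

The columns of `Vt_k` (scaled by `s_k`) give the reduced-dimension document representations used for clustering.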
LSI-based multilingual document clustering technique (cont.) Multilingual semantic space analysis
LSI-based multilingual document clustering technique (cont.) Document folding-in
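Folding-in projects a new document into an existing LSI space without recomputing the SVD, via the standard formula d_hat = d^T U_k Sigma_k^{-1}. A minimal sketch (the training matrix here is random, for illustration only):

```python
import numpy as np

# Pretend we already trained an LSI space: U_k (terms x k) and s_k
# (the k largest singular values) from some term-document matrix A.
rng = np.random.default_rng(0)
A = rng.random((6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]

def fold_in(d, U_k, s_k):
    """Project a new document's term vector d into the k-dim LSI
    space: d_hat = d^T U_k Sigma_k^{-1}."""
    return d @ U_k / s_k

d_new = rng.random(6)          # term vector of an unseen document
d_hat = fold_in(d_new, U_k, s_k)
print(d_hat.shape)
```

A sanity check: folding in a training document exactly reproduces its coordinates in `Vt_k`, since U_k^T A = Sigma_k Vt_k.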
LSI-based multilingual document clustering technique (cont.) Dimension selection Let Dj denote LSI dimension j, and wji the weight of document i on dimension Dj
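One plausible reading of dimension selection, sketched below as an assumption (the paper's exact criterion may differ): for each document, retain only the h dimensions with the largest absolute weights w_ji and zero out the rest.

```python
import numpy as np

def select_dimensions(V_k, h):
    """For each document (a row of V_k, its LSI representation),
    keep the h dimensions with the largest absolute weights w_ji
    and zero the rest. Sketch of dimension selection; the exact
    criterion follows the paper."""
    out = np.zeros_like(V_k)
    for i, row in enumerate(V_k):
        top = np.argsort(-np.abs(row))[:h]   # indices of h largest |w_ji|
        out[i, top] = row[top]
    return out

docs = np.array([[0.9, -0.1, 0.4],
                 [0.2,  0.8, -0.5]])
print(select_dimensions(docs, 2))
```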
LSI-based multilingual document clustering technique (cont.) • Clustering • Hierarchical clustering algorithm
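Hierarchical (agglomerative) clustering starts with each document in its own cluster and repeatedly merges the closest pair. A generic single-linkage sketch on toy 2-D points; the paper's linkage criterion and distance measure may differ.

```python
import numpy as np

def agglomerative(points, n_clusters):
    """Minimal single-linkage agglomerative clustering: merge the
    two clusters whose closest members are nearest, until only
    n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between closest members.
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)   # merge the closest pair
    return clusters

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(agglomerative(pts, 2))
```

In the MLDC setting, `pts` would be the (dimension-selected) LSI representations of the documents rather than raw coordinates.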
Empirical evaluation (cont.) TA is the set of associations in the true categories. GA is the set of associations in the clusters generated by the document clustering technique. CA is the set of correct associations that exist in both the clusters and the true categories.
Empirical evaluation (cont.) Examples TA={(e1−e2), (c1−c2), (e1−c1), (e1−c2), (e2−c1), (e2−c2), (e3−e4), (c3−c4), (c3−c5), (c4−c5), (e3−c3), (e3−c4), (e3−c5), (e4−c3), (e4−c4), (e4−c5)} GA={(e1−e2), (c1−c3), (e1−c1), (e1−c3), (e2−c1), (e2−c3), (e3−e4), (e3−c2), (e4−c2), (c4−c5)} CA={(e1−e2), (e1−c1), (e2−c1), (e3−e4), (c4−c5)}
Empirical evaluation (cont.) PRT curves of the LSI-based MLDC technique
Empirical evaluation (cont.) Comparisons of different representation schemes
Empirical evaluation (cont.) Effect of dimension selection (h=5 for MLDC with dimension selection; k=5 for MLDC without dimension selection)
Empirical evaluation (cont.) Effect of dimension selection (h=20 for MLDC with dimension selection; k=20 for MLDC without dimension selection)
Empirical evaluation (cont.) Best-scenario versus best-scenario comparison
Empirical evaluation (cont.) PRT curves of overall, monolingual, and cross-lingual performance
Conclusion • Performance ordering: monolingual PRT curve > overall PRT curve > cross-lingual PRT curve • Future work: evaluating the technique on specific domains