Extracting Parallel Texts from Massive Web Documents
Extracting Parallel Texts from Massive Web Documents Chikayama Taura lab. M2 Dai Saito
Construct Parallel Corpora from the Web • Purpose • Parallel corpus: a set of parallel texts • Parallel texts: translated pairs of texts • Example (English): --it was the black kitten's fault entirely. One thing was certain, that the WHITE kitten had had nothing to do with it. • Example (Japanese, 日本語): ――もうなにもかも、黒い子ネコのせいだったのです。一つ確実なのは、白い子ネコはなんの関係もなかったということ。
Parallel Texts • Useful resource for • Statistical machine translation • Dictionary construction • But existing corpora are not enough • Amount: small, and building them takes a lot of human labor • Genre: public documents, software manuals • Language: limited (e.g. English-French)
Parallel Texts from the Web • Extracting Parallel Texts from Massive Web Documents • Very large amount of text • Varied languages • Little human labor required
Problems • How to detect parallel texts automatically • How to reduce calculation cost • To construct a parallel corpus • Extract candidate pairs • Judge whether they really are parallel texts
Agenda • Introduction • Related work • Proposal • Detect parallel texts • Extract candidate pairs • Experiment • Conclusion
STRAND [Resnik et al. 03] • URL Matching • Remove language-specific substrings (LSSs) (Japanese: ja, jp, jpn, euc, sjis, …) • Match URLs after LSS removal • Make a detailed comparison • Example: http://www.hostname.com/index.html.en and http://www.hostname.com/index.html.ja match once the LSSs are removed
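The URL-matching idea on this slide can be sketched as follows. This is a minimal illustration, not STRAND's actual implementation: the Japanese LSSs come from the slide, while the English entries (`en`, `eng`), the separator set, and the function names `strip_lss` and `match_urls` are assumptions made for the example.

```python
import re
from collections import defaultdict

# Japanese LSSs are taken from the slide; "en" and "eng" are assumed here.
LSS = {"ja", "jp", "jpn", "euc", "sjis", "en", "eng"}
SEPARATORS = r"[./_\-?=&]"  # assumed token separators inside a URL

def strip_lss(url):
    """Drop every URL token that is an LSS, plus the separator before it."""
    parts = re.split("(" + SEPARATORS + ")", url)
    kept = []
    for part in parts:
        if part.lower() in LSS:
            if kept and re.fullmatch(SEPARATORS, kept[-1]):
                kept.pop()  # remove the separator that preceded the LSS
            continue
        kept.append(part)
    return "".join(kept)

def match_urls(urls):
    """Group URLs that become identical once their LSSs are removed."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[strip_lss(url)].append(url)
    return [sorted(group) for group in buckets.values() if len(group) > 1]

urls = [
    "http://www.hostname.com/index.html.en",
    "http://www.hostname.com/index.html.ja",
    "http://www.hostname.com/about.html",
]
print(match_urls(urls))
```

Matching whole tokens rather than raw substrings avoids deleting "en" from the middle of an unrelated word, at the price of missing LSSs that are not set off by a separator.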
DOM Tree Alignment [Lei et al. 06] • HTML → DOM Tree • Searching linked pages • “alt” attribute • link name (“English version”, “In English”, etc.) • Parallel link: a pair of the same hyperlinks in parallel texts
Outline • Web → Crawler → Extract candidate pairs → Detect parallel texts
Detecting parallel texts • Low comparison cost • Without HTML information • Word (noun) → semantic ID → comparison [Fukushima et al. 06]
Semantic ID Conversion • Constructing a graph from dictionaries • Treating Japanese and English texts at the same level • # of Semantic IDs: about 10,000 • Example from the figure • ID 1: Sense, 感覚, 意味 • ID 2: Movie, 映画, Film • ID 3: Hobby, 趣味, Taste, 味
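One way to read "constructing a graph from dictionaries" is to give every connected group of mutually translated words a shared ID. The sketch below assumes exactly that (connected components found with union-find); the slide does not say how the graph is turned into IDs, and `build_semantic_ids` and the tiny dictionary are hypothetical.

```python
def build_semantic_ids(entries):
    """Map each word to an ID shared by all words connected through entries."""
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for source, target in entries:
        root_a, root_b = find(source), find(target)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    ids = {}
    semantic_id = {}
    for word in list(parent):
        root = find(word)
        semantic_id[word] = ids.setdefault(root, len(ids) + 1)
    return semantic_id

# Hypothetical dictionary entries modelled on the slide's figure.
entries = [
    ("sense", "感覚"), ("sense", "意味"),
    ("movie", "映画"), ("film", "映画"),
    ("hobby", "趣味"), ("taste", "趣味"), ("taste", "味"),
]
semantic_id = build_semantic_ids(entries)
print(semantic_id)
```

With this reading, "hobby" and "味" end up under one ID because 趣味 links "hobby" to "taste", which shows how such a scheme trades precision for a small ID space.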
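Texts to Vector • Example sentence: 辞書を使ってテキストを数列に変える。 ("Convert a text into a number sequence using a dictionary.") • Nouns and their semantic IDs: テキスト (text) → 955, 辞書 (dictionary) → 1704, 数列 (number sequence) → 3173 • In sentence order: 1704, 955, 3173 • Sort: (955, 1704, 3173) + position information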
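The conversion on this slide can be sketched as below, assuming the nouns have already been extracted by a morphological analyzer. How the position information is stored is not stated on the slide; keeping it as the second element of each pair is an assumption, as is the function name `to_vector`.

```python
def to_vector(nouns, semantic_id):
    """Return (semantic ID, position) pairs sorted by ID.

    Nouns missing from the dictionary are skipped.
    """
    pairs = [
        (semantic_id[noun], position)
        for position, noun in enumerate(nouns)
        if noun in semantic_id
    ]
    pairs.sort()
    return pairs

# IDs taken from the slide's example.
semantic_id = {"テキスト": 955, "辞書": 1704, "数列": 3173}
nouns = ["辞書", "テキスト", "数列"]  # nouns in sentence order
vector = to_vector(nouns, semantic_id)
print(vector)
```

Sorting by ID is what later allows two texts to be compared in a single linear pass.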
Comparison • tscore (translation score) • T1: (106, 335, 455, 567, 1704, 3173, 7421) • T2: (335, 567, 567, 1704, 4014, 5449, 7421) • The score counts up from 0 to 4 as matching IDs are found (335, 567, 1704, 7421) • tscore = 4/(7+7)
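The comparison above can be reproduced with a merge-style walk over the two sorted ID lists. This is a sketch that matches the slide's example; the handling of repeated IDs (each occurrence can match at most once) is an assumption, since the slide does not spell it out.

```python
def tscore(ids_a, ids_b):
    """Count matching IDs in two sorted lists, divided by the total length."""
    if not ids_a and not ids_b:
        return 0.0
    i = j = matches = 0
    while i < len(ids_a) and j < len(ids_b):
        if ids_a[i] == ids_b[j]:
            matches += 1
            i += 1
            j += 1
        elif ids_a[i] < ids_b[j]:
            i += 1
        else:
            j += 1
    return matches / (len(ids_a) + len(ids_b))

t1 = [106, 335, 455, 567, 1704, 3173, 7421]
t2 = [335, 567, 567, 1704, 4014, 5449, 7421]
print(tscore(t1, t2))  # 4 matches over 7 + 7 IDs
```

Because both lists are already sorted, one comparison costs time linear in the text lengths, which fits the slide's emphasis on low comparison cost.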
tscore threshold • Fry Corpus [Fry 05], 400 pairs • F-measure • Speed: 200,000 pairs/sec • tscore threshold: 0.102
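A threshold of this kind could be chosen by sweeping candidate values and keeping the one with the best F-measure. The sketch below assumes that procedure and that a pair is judged parallel when its tscore is at or above the threshold; the slide states neither, and the scores and labels are invented toy data, not results from the Fry Corpus.

```python
def f_measure(scores, labels, threshold):
    """F-measure when pairs with score >= threshold are judged parallel."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, labels):
    """Try every observed score as a threshold and keep the best one."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f_measure(scores, labels, t))

# Toy data (hypothetical): tscore of each pair and whether it is parallel.
scores = [0.30, 0.25, 0.12, 0.08, 0.05, 0.02]
labels = [True, True, True, False, False, False]
print(best_threshold(scores, labels))
```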
Extract candidate pairs • Calculation cost of each comparison • Calculation cost of extracting parallel texts • Number of comparisons: n^2 • URL matching is too strict • Japanese and English • 90,000,000 URLs → 4,000 URL pairs → 1,000 real pairs
Calculation Cost Reduction • Reduce the number of comparisons by using sample texts • Distance score: tscore • Compare only texts close to each other • The two texts of a parallel pair (English and Japanese) should be at the same distance from a given sample text
Calculation Cost Reduction • Flow • Select sample texts (far fewer than n) • Calculate the distance score between each text and the sample texts • Classify each text by its top m scores • Compare only texts in the same group
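The flow above can be sketched as follows. This is one possible reading, not the author's code: each text joins the groups of its m best-scoring samples, and only English-Japanese pairs that share a group become candidates. The function names, the toy vectors, and the tie-breaking by sample order are assumptions.

```python
from collections import defaultdict
from itertools import product

def tscore(ids_a, ids_b):
    """Matching IDs of two sorted lists divided by their total length."""
    i = j = matches = 0
    while i < len(ids_a) and j < len(ids_b):
        if ids_a[i] == ids_b[j]:
            matches, i, j = matches + 1, i + 1, j + 1
        elif ids_a[i] < ids_b[j]:
            i += 1
        else:
            j += 1
    total = len(ids_a) + len(ids_b)
    return matches / total if total else 0.0

def candidate_pairs(en_texts, ja_texts, samples, m=3):
    """Return (English index, Japanese index) pairs that share a sample group."""
    groups = defaultdict(lambda: ([], []))  # sample index -> (en ids, ja ids)
    for side, texts in enumerate((en_texts, ja_texts)):
        for index, vector in enumerate(texts):
            ranked = sorted(
                range(len(samples)),
                key=lambda s: tscore(vector, samples[s]),
                reverse=True,
            )
            for sample_index in ranked[:m]:
                groups[sample_index][side].append(index)
    pairs = set()
    for en_ids, ja_ids in groups.values():
        pairs.update(product(en_ids, ja_ids))
    return pairs

# Toy semantic-ID vectors (hypothetical).
en_texts = [[1, 2, 3], [10, 11, 12]]
ja_texts = [[1, 2, 4], [10, 11, 13]]
samples = [[1, 2], [10, 11]]
print(candidate_pairs(en_texts, ja_texts, samples, m=1))
```

Here only 2 of the 4 possible pairs are kept for the full comparison; a larger m lowers the risk of separating a true pair but leaves more pairs to compare.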
Sampling • Number of samples • Calculation cost • Accuracy (low risk of mislabeling) • Methods to select samples • Random • k-means
k-means (example with k=2) • Select k samples • Classify all texts • Calculate centers • Re-classify
Calculation of tscore in k-means • Normal • Text1: (106, 335, 455, 567, 1704, 3173, 7421) • Text2: (335, 567, 567, 1704, 4014, 5449, 7421) • tscore = 4/(7+7) • k-means • Text1: (106, 335, 455, 567, 1704, 3173, 7421) • Average1: ((567, 0.2), (4014, 0.14), (7421, 0.5), …) • tscore = (0.2+0.5)
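The k-means variant can be sketched by representing a cluster center as a mapping from semantic ID to weight and summing the weights of the IDs that the text contains, as in the slide's 0.2 + 0.5 example. The slide shows no normalizing term, so none is applied here. How the weights are averaged is not given either; `cluster_center` assumes the weight is the fraction of member texts containing the ID, and both function names are hypothetical.

```python
from collections import Counter

def cluster_center(texts):
    """Assumed averaging: weight = share of member texts containing the ID."""
    counts = Counter()
    for ids in texts:
        counts.update(set(ids))
    return {sid: count / len(texts) for sid, count in counts.items()}

def tscore_to_center(ids, center):
    """Sum the center weights of the IDs that occur in the text."""
    return sum(center.get(sid, 0.0) for sid in set(ids))

text1 = [106, 335, 455, 567, 1704, 3173, 7421]
# Only the three weights shown on the slide; the elided entries are omitted.
average1 = {567: 0.2, 4014: 0.14, 7421: 0.5}
print(tscore_to_center(text1, average1))  # 567 and 7421 match: 0.2 + 0.5
```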
Converting HTML on the Web • Guess language and character encoding • English, SJIS, EUC-JP, UTF-8 • Convert character code • Remove HTML tags • Morphological analysis → pick up nouns
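The first three steps might look like the sketch below, which uses only the Python standard library. Guessing the encoding by trying strict decoders in a fixed order is a simple heuristic assumed for this example, not the method used in the talk, and morphological analysis is left out because it needs an external analyzer.

```python
from html.parser import HTMLParser

# Candidate encodings from the slide, tried in this (assumed) order.
CANDIDATE_ENCODINGS = ["ascii", "utf-8", "euc_jp", "shift_jis"]

def guess_and_decode(raw):
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace"), "unknown"

class TextExtractor(HTMLParser):
    """Collect the text between tags and drop the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_text(raw):
    """Decode raw bytes, strip HTML tags, and normalize whitespace."""
    text, encoding = guess_and_decode(raw)
    parser = TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join(" ".join(parser.chunks).split()), encoding

print(html_to_text("<p>これ</p>".encode("euc_jp")))
```

A page in pure ASCII is reported as "ascii", which here stands in for the slide's "English" case; real pages would also need script and style contents filtered out.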
Experiment • Calculation Cost • Accuracy vs. Calculation time • Clustering • k-means
Environment • Dataset:Fry Corpus [Fry 05] • Corpus of Japanese-English news pages • Convert HTML to Semantic ID in advance • Machine • CPU : Xeon 2.4GHz Dual • Memory : 2GB • OS : Linux (Debian)
Calculation Cost • Fry Corpus • 200 - 6400 pairs • Compared: normal all-to-all comparison vs. random sampling (top 3) • As the number of texts grows, the gap becomes wider • Low cost with √n samples
Accuracy vs. Calculation time • Fry Corpus • 400 pairs • Random sampling • As the number of samples grows • Misclassification ratio → high • Execution time → low • Trade-off between misclassification ratio and execution time
Sample selection with k-means • Accuracy and execution time with k-means • Flow • Random sampling • Number of samples: √n • Calculating the centers and re-sampling • Measuring misclassification ratio and execution time
Evaluation of k-means • Low misclassification ratio → highly biased
Conclusion and Future work • Parallel texts from the Web • Detecting parallel texts • Extracting candidate pairs • Random sampling • k-means
Future work • Better clustering methods • Hierarchical • Dimension reduction • About 10,000 dimensions is too many • Processing real HTML texts from the Web