Extracting parallel texts from the web for translation and dictionary construction using automatic detection methods to reduce human involvement and improve language resources.
Large Scale Crawling the Web for Parallel Texts Chikayama Taura lab. M1 Dai Saito
Parallel Texts
• Parallel texts: a translated pair of multilingual texts
• Parallel corpus: a set of parallel texts
• Example
  • English: One thing was certain, that the WHITE kitten had had nothing to do with it. --it was the black kitten's fault entirely.
  • Japanese: 一つ確実なのは、白い子ネコはなんの関係もなかったということ。――もうなにもかも、黒い子ネコのせいだったのです。
Parallel Texts
• Useful resource for
  • Statistical machine translation
  • Dictionary construction
• But… existing corpora are small
  • Number
    • Not enough
    • Need human resources
  • Language
    • English-French
  • Genre
    • Public documents
    • Software manuals
Parallel Texts from the Web
• Crawling parallel texts from the Web
  • A very large number of texts exist
  • Varied languages are used
  • Little human effort is needed
• Problems
  • How to detect parallel texts automatically
  • Calculation cost
Parallel Texts from the Web
[Figure: Web → ① → ② → Parallel texts; labels on the figure: "Not parallel", "Maybe parallel", "Parallel texts"]
Agenda • Introduction • Related work • Proposal • Detecting parallel texts • Large scale crawling • Experiment • Conclusion
STRAND [Resnik et al. 03]
• URL Matching
  • Removing language-specific substrings (LSSs), e.g. for Japanese: ja, jp, jpn, euc, sjis, …
  • Matching the LSS-removed URLs
• Making a detailed comparison
• Example pair: http://www.hostname.com/index.html.en ⇔ http://www.hostname.com/index.html.ja
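The URL matching step could be sketched as follows. This is a minimal illustration, not STRAND's actual implementation: the English LSS list, the whole-token matching rule, and the names `LSS`, `strip_lss` and `url_candidate_pairs` are assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumed LSS lists. The slide names only some Japanese LSSs (ja, jp, jpn, euc, sjis);
# the English list and the extra "japanese" entry are illustrative additions.
LSS = {
    "en": ["english", "eng", "en"],
    "ja": ["japanese", "sjis", "jpn", "euc", "ja", "jp"],
}

def strip_lss(url, lang):
    """Remove every LSS of `lang` that appears as a whole alphanumeric token."""
    for s in LSS[lang]:
        pattern = r"(?<![A-Za-z0-9])" + re.escape(s) + r"(?![A-Za-z0-9])"
        url = re.sub(pattern, "", url, flags=re.IGNORECASE)
    return url

def url_candidate_pairs(en_urls, ja_urls):
    """Pair English and Japanese URLs whose LSS-removed forms are identical."""
    index = defaultdict(list)
    for u in en_urls:
        index[strip_lss(u, "en")].append(u)
    return [(e, j) for j in ja_urls for e in index.get(strip_lss(j, "ja"), [])]

en = ["http://www.hostname.com/index.html.en", "http://www.hostname.com/english.php"]
ja = ["http://www.hostname.com/index.html.ja", "http://www.hostname.com/japanese.php"]
pairs = url_candidate_pairs(en, ja)
```

Indexing one language by its stripped URL keeps the matching linear in the number of URLs, which is why this step is cheap compared with content comparison.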
URL Matching Experiment
• URL matching on the URLs of crawled pages
  • 90,000,000 URLs
  • English ⇔ Japanese
  • Looking only at the URL
• 90,000,000 → 4,000
  • Too strict?
  • Useless pages are included
• Example matches: japanese.php / english.php, index.html.ja / index.html.en
DOM Tree Alignment [Lei et al. 06]
• Searching linked pages
  • "alt" attribute
  • Link name
  • e.g. "English version", "In English", etc.
• HTML → DOM Tree
• Parallel link: a pair of the same hyperlinks in parallel texts
Pros and Cons
• URL Matching
  • ○ High speed and easy to implement
  • × Small number of pages
• DOM Tree
  • ○ High accuracy and small storage
  • × Execution speed is slow
Detecting Parallel Texts
• [Fukushima 06]
  • Reducing comparison cost
  • Without HTML information
  • Word (noun) → semantic ID → comparison
Semantic ID Conversion
• Constructing a graph from dictionaries
• Treating Japanese and English texts on the same level
• # of semantic IDs: about 10,000
• Example
  • 1: Sense, 感覚, 意味
  • 2: Movie, Film, 映画
  • 3: Hobby, Taste, 趣味, 味
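One way to obtain such IDs is to treat each dictionary entry as an edge and give every connected group of words one ID. The sketch below assumes that reading of "constructing a graph from dictionaries"; the function name `build_semantic_ids` and the seven entries are illustrative, taken from the example words on the slide.

```python
def build_semantic_ids(entries):
    """Map each word to an ID shared by all words connected through dictionary entries."""
    parent = {}

    def find(w):
        parent.setdefault(w, w)
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving keeps the trees shallow
            w = parent[w]
        return w

    # Union the two sides of every (English, Japanese) dictionary entry.
    for en, ja in entries:
        ra, rb = find(en), find(ja)
        if ra != rb:
            parent[rb] = ra

    # Number the groups in order of first appearance, starting at 1.
    group_ids, word_to_id = {}, {}
    for w in parent:
        root = find(w)
        word_to_id[w] = group_ids.setdefault(root, len(group_ids) + 1)
    return word_to_id

entries = [
    ("sense", "感覚"), ("sense", "意味"),
    ("movie", "映画"), ("film", "映画"),
    ("hobby", "趣味"), ("taste", "趣味"), ("taste", "味"),
]
ids = build_semantic_ids(entries)
```

With these entries, "hobby" and "味" end up in the same group because both are linked through "taste" and "趣味", so words in either language map to the same ID.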
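Texts to Vector
• Look up each word in the dictionary to get its semantic ID
  • テキスト (text) → 955
  • 辞書 (dictionary) → 1704
  • 数列 (number sequence) → 3173
• Example: 辞書を使ってテキストを数列に変える。 ("Convert a text into a number sequence using a dictionary.") → 1704, 955, 3173
• Sort → (955, 1704, 3173) + position information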
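The conversion above can be sketched in a few lines. The sketch assumes the text has already been split into nouns (for Japanese this would need a morphological analyzer, which is not shown), and the name `text_to_vector` is hypothetical.

```python
def text_to_vector(tokens, word_to_id):
    """Return (semantic ID, position) pairs sorted by ID; unknown words are skipped."""
    pairs = [(word_to_id[t], pos) for pos, t in enumerate(tokens) if t in word_to_id]
    return sorted(pairs)

# IDs taken from the slide's example.
word_to_id = {"テキスト": 955, "辞書": 1704, "数列": 3173}
tokens = ["辞書", "テキスト", "数列"]  # nouns in order of appearance
vector = text_to_vector(tokens, word_to_id)
# vector == [(955, 1), (1704, 0), (3173, 2)]
```

Keeping the original position next to each ID preserves the "position information" mentioned on the slide after the IDs are sorted.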
Comparison
• tscore (translation score)
  • T1: (106, 335, 455, 567, 1704, 3173, 7421)
  • T2: (335, 567, 567, 1704, 4014, 5449, 7421)
  • score = 0 1 2
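Because both ID sequences are sorted, shared IDs can be counted in a single merge-style pass. The sketch below only counts matches; how the count is normalized into the tscore is not shown on the slide, so no formula is assumed here. The name `count_matches` is hypothetical.

```python
def count_matches(t1, t2):
    """Count IDs shared by two sorted sequences, matching each occurrence at most once."""
    i = j = matches = 0
    while i < len(t1) and j < len(t2):
        if t1[i] == t2[j]:
            matches += 1
            i += 1
            j += 1
        elif t1[i] < t2[j]:
            i += 1
        else:
            j += 1
    return matches

T1 = (106, 335, 455, 567, 1704, 3173, 7421)
T2 = (335, 567, 567, 1704, 4014, 5449, 7421)
matches = count_matches(T1, T2)  # 335, 567, 1704 and 7421 match, so 4
```

The pass is linear in the combined length of the two sequences, which fits the goal of keeping each comparison cheap.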
tscore threshold
• Fry Corpus [Fry 05]
• F-measure
• tscore threshold: 0.102
• Speed: 250,000 pairs/sec
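A threshold like this is typically chosen by sweeping candidate values and keeping the one with the best F-measure on labeled pairs. The sketch below shows that procedure on made-up scores; the data, and the names `f_measure` and `best_threshold`, are illustrative and are not the values behind the 0.102 reported on the slide.

```python
def f_measure(scored_pairs, threshold):
    """F-measure when every pair scoring at or above `threshold` is called parallel."""
    tp = sum(1 for s, gold in scored_pairs if s >= threshold and gold)
    fp = sum(1 for s, gold in scored_pairs if s >= threshold and not gold)
    fn = sum(1 for s, gold in scored_pairs if s < threshold and gold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scored_pairs):
    """Try every observed score as a threshold and keep the best one."""
    candidates = sorted({s for s, _ in scored_pairs})
    return max(candidates, key=lambda t: f_measure(scored_pairs, t))

# Hypothetical (score, is_really_parallel) pairs.
scored = [(0.30, True), (0.20, True), (0.12, True),
          (0.10, False), (0.05, False), (0.02, False)]
threshold = best_threshold(scored)
```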
Large Scale Crawling
• Calculation cost of each comparison
• Calculation cost of entire crawling
  • Number of comparisons:
  • URL matching is too strict
  • Alt text and link names are not available for all parallel pages
HTML on the Web to Natural Language
• Guess the language and encoding
  • English, SJIS, EUC-JP, UTF-8
• Convert the character code
• Remove HTML tags
  • For crawling, <a> and <link> tags are used
  • <title> and <Hn> tags may be useful
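These preprocessing steps can be sketched with Python's standard library alone. The encoding guess here is a naive trial decode in a fixed order, which is an assumption of this example and can misjudge short byte strings; a real crawler would likely use HTTP headers, meta tags or a statistical detector. The names `decode`, `TextAndLinks` and `extract` are hypothetical.

```python
from html.parser import HTMLParser

# Trial order is an assumption; ASCII-only English pages decode as UTF-8.
CANDIDATE_ENCODINGS = ["utf-8", "euc_jp", "shift_jis"]

def decode(raw):
    """Return the text from the first candidate encoding that decodes without error."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")

class TextAndLinks(HTMLParser):
    """Collect visible text and the href values of <a> and <link> tags."""
    def __init__(self):
        super().__init__()
        self.text, self.links, self._skip = [], [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("a", "link"):
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text.append(data.strip())

def extract(raw):
    """Decode raw bytes, strip the tags, and return (text, links)."""
    parser = TextAndLinks()
    parser.feed(decode(raw))
    parser.close()
    return " ".join(parser.text), parser.links

page = ('<html><head><title>日本語</title><style>p{}</style></head>'
        '<body><h1>Hello</h1><a href="index.html.ja">Japanese</a></body></html>')
text, links = extract(page.encode("utf-8"))
```

The links feed the crawler, while the stripped text is what gets converted to semantic IDs.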
Calculation Cost Reduction
• Distance score of vectors
  • Compare only near vectors
  • Distance score: tscore
• Label every text with its nearest sample text
• If the distance score of two texts is far, they are not parallel texts.
Calculation Cost Reduction
• Flow
  • Select sample texts (≪ n)
  • When crawling, calculate the distance score against the sample texts
  • Classify each text by its top m scores
  • Compare only texts in the same group
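The flow above could look roughly like this. The similarity function `overlap` is a simple stand-in (shared IDs over all IDs), not the tscore from the slides, and the names `label` and `grouped_pairs` are hypothetical.

```python
from collections import defaultdict

def overlap(a, b):
    """Stand-in similarity: shared semantic IDs divided by all IDs (not the tscore)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def label(vec, samples, m=1):
    """Indices of the m sample texts most similar to `vec`."""
    ranked = sorted(range(len(samples)),
                    key=lambda i: overlap(vec, samples[i]), reverse=True)
    return ranked[:m]

def grouped_pairs(en_vecs, ja_vecs, samples, m=1):
    """Return only the (English, Japanese) index pairs that share a sample label."""
    groups = defaultdict(lambda: ([], []))
    for i, v in enumerate(en_vecs):
        for g in label(v, samples, m):
            groups[g][0].append(i)
    for j, v in enumerate(ja_vecs):
        for g in label(v, samples, m):
            groups[g][1].append(j)
    pairs = set()
    for en_ids, ja_ids in groups.values():
        pairs.update((i, j) for i in en_ids for j in ja_ids)
    return pairs

samples = [[1, 2, 3], [10, 11, 12]]
en_vecs = [[1, 2, 4], [10, 11, 13]]
ja_vecs = [[2, 3, 5], [11, 12, 14]]
pairs = grouped_pairs(en_vecs, ja_vecs, samples)
```

Here only 2 of the 4 possible pairs are compared in detail. Raising m puts each text in more groups, which costs more comparisons but lowers the risk of separating a true pair.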
Sampling
• Number of samples
  • Accuracy (risk of mislabeling)
  • Calculation cost
• Size of the groups
  • Should be equal
  • Large groups are divided into smaller ones recursively
Crawling link pages
• Pages reached by the same links from parallel texts will be parallel texts
• Evaluation of same links
  • DOM Tree [Lei et al. 06]
• Evaluation function
  • Position of <A> tag
  • Pages on the same host
  • Diff of URLs
    • hoge.html.en -> fuga.html.en : hoge - fuga
    • hoge.html.ja -> fuga.html.ja : hoge - fuga
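The "diff of URLs" criterion can be illustrated by stripping the common prefix and suffix of a link's source and target, then checking that both languages change in the same way. This is one possible reading of the slide; `url_diff` and `same_link` are hypothetical names, and the position and host criteria are not covered.

```python
import os

def url_diff(src, dst):
    """Return the parts of src and dst left after removing their common prefix and suffix."""
    prefix = os.path.commonprefix([src, dst])
    a, b = src[len(prefix):], dst[len(prefix):]
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return a[:len(a) - n], b[:len(b) - n]

def same_link(en_src, en_dst, ja_src, ja_dst):
    """True when the English link and the Japanese link change the URL identically."""
    return url_diff(en_src, en_dst) == url_diff(ja_src, ja_dst)

en_change = url_diff("hoge.html.en", "fuga.html.en")  # ("hoge", "fuga")
ja_change = url_diff("hoge.html.ja", "fuga.html.ja")  # ("hoge", "fuga")
```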
Evaluation of tscore
• Fry Corpus [Fry 05]
  • 200 (Japanese) x 200 (English)
• Flow
  • Convert all texts to vectors
  • Calculate the distance score for all pairs (40,000)
  • Check that the scores of real parallel texts are high
    • The score of a parallel pair should be the top score
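The evaluation flow can be sketched as an all-pairs ranking check. The sketch assumes the i-th English text is the translation of the i-th Japanese text and takes the scoring function as a parameter; `count_misses` and the stand-in `overlap` score are hypothetical, not the tscore itself.

```python
def overlap(a, b):
    """Stand-in score: shared semantic IDs divided by all IDs (not the tscore)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def count_misses(en_vecs, ja_vecs, score):
    """Count English texts whose true partner (same index) is not the top-scoring text."""
    misses = 0
    for i, e in enumerate(en_vecs):
        best = max(range(len(ja_vecs)), key=lambda j: score(e, ja_vecs[j]))
        if best != i:
            misses += 1
    return misses

en_vecs = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ja_vecs = [[1, 2, 9], [4, 5, 6], [1, 2, 3]]  # the third text is a deliberate mismatch
misses = count_misses(en_vecs, ja_vecs, overlap)
```

For 200 x 200 texts this scores all 40,000 pairs, so the same loop doubles as a timing benchmark for a candidate score.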
Evaluation of tscore
• Other distance scores, computed on sparse count vectors
  • ID sequence (1,1,1,2,4,4,…) → sparse vector (3,1,0,2,…)
  • Example: a = (3,1,0,2,0), b = (3,0,0,1,2)
  • NOT XOR: 3
  • AND: 2
  • AND - XOR: 0
  • EUCLID
  • COS
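The alternative scores can be sketched on count vectors as below. Treating AND and XOR as counts of positions where both, or exactly one, of the vectors is non-zero is an interpretation that reproduces the numbers on the slide (3, 2 and 0); the function names are hypothetical.

```python
import math
from collections import Counter

def to_counts(ids, size):
    """Turn an ID sequence into a vector of occurrence counts for IDs 1..size."""
    c = Counter(ids)
    return [c.get(k, 0) for k in range(1, size + 1)]

def and_score(a, b):
    """Positions where both vectors are non-zero."""
    return sum(1 for x, y in zip(a, b) if x and y)

def xor_score(a, b):
    """Positions where exactly one vector is non-zero."""
    return sum(1 for x, y in zip(a, b) if bool(x) != bool(y))

def not_xor_score(a, b):
    """Positions where the vectors agree on zero versus non-zero."""
    return len(a) - xor_score(a, b)

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

a = to_counts([1, 1, 1, 2, 4, 4], 5)  # [3, 1, 0, 2, 0]
b = [3, 0, 0, 1, 2]
scores = {
    "NOT XOR": not_xor_score(a, b),              # 3
    "AND": and_score(a, b),                      # 2
    "AND - XOR": and_score(a, b) - xor_score(a, b),  # 0
    "EUCLID": euclid(a, b),
    "COS": cos(a, b),
}
```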
Evaluation of tscore • Number of miss score ([200+200]texts)
Calculation Time
• Fry Corpus
  • 200, 400, 800, 1600, 3200
• NORMAL tscore (Top 3)
• # of samples: √(# of all texts)
• Mislabeling: 11 (in 200 pairs)
Conclusion and Future work
• Parallel texts from the Web
  • Detecting parallel texts
  • Large scale crawling
• Future work
  • Crawling many texts from the Web
  • Crawling with parallel link structure
  • Detecting parallel texts in real HTML texts
  • Proper sampling