Large-Scale Crawling for Parallel Texts

Large-Scale Crawling of the Web for Parallel Texts. Dai Saito (M1), Chikayama Taura Lab.

Parallel Texts • Parallel texts: translated pairs of multilingual texts • Parallel corpus: a set of parallel texts • Example (English): One thing was certain, that the WHITE kitten had had nothing to do with it. --it was the black kitten's fault entirely. • Example (日本語, Japanese): 一つ確実なのは、白い子ネコはなんの関係もなかったということ。 ――もうなにもかも、黒い子ネコのせいだったのです。

Parallel Texts • Useful resource for • Statistical machine translation • Dictionary construction • But existing corpora are limited in • Number: not enough, and building them needs human effort • Language: English-French • Genre: public documents, software manuals

Parallel Texts from the Web • Crawling parallel texts from the Web • A very large number of texts exist • Various languages are used • Little human effort is needed • Problems • How to detect parallel texts automatically • Calculation cost

Parallel Texts from the Web • [Figure: two-stage filtering of Web texts. Labels: ①, ②, "Not parallel", "Maybe parallel", "Parallel texts".]

Agenda • Introduction • Related work • Proposal • Detecting parallel texts • Large scale crawling • Experiment • Conclusion

STRAND [Resnik et al. 03] • URL matching • Remove language-specific substrings (LSSs) from URLs (Japanese: ja, jp, jpn, euc, sjis, …) • Match the LSS-removed URLs • Make a detailed comparison of the matched pages • Example: http://www.hostname.com/index.html.en ⇔ http://www.hostname.com/index.html.ja
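The URL-matching step can be sketched as follows. This is a minimal illustration, not the STRAND implementation: the Japanese LSSs are the ones listed on the slide, the English LSSs (`en`, `eng`, `english`) are assumed, and the helper names `strip_lss` and `match_urls` are hypothetical.

```python
import re
from collections import defaultdict

# Language-specific substrings (LSSs). The Japanese list follows the slide
# (plus "japanese"); the English list is an assumption for illustration.
LSS = {
    "ja": ["japanese", "jpn", "ja", "jp", "euc", "sjis"],
    "en": ["english", "eng", "en"],
}


def strip_lss(url, lang):
    """Remove LSSs that occur as whole tokens (not inside a longer word)."""
    # Longest alternatives first, so that "jpn" is tried before "jp".
    alternatives = "|".join(sorted(LSS[lang], key=len, reverse=True))
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(?:%s)(?![A-Za-z0-9])" % alternatives,
        re.IGNORECASE,
    )
    return pattern.sub("", url)


def match_urls(urls_en, urls_ja):
    """Pair English and Japanese URLs whose LSS-removed forms are identical."""
    index = defaultdict(list)
    for url in urls_en:
        index[strip_lss(url, "en")].append(url)
    pairs = []
    for url in urls_ja:
        for candidate in index.get(strip_lss(url, "ja"), []):
            pairs.append((candidate, url))
    return pairs


pairs = match_urls(
    ["http://www.hostname.com/index.html.en", "http://www.hostname.com/about.html"],
    ["http://www.hostname.com/index.html.ja"],
)
```

Indexing the stripped English URLs in a dictionary keeps the matching linear in the number of URLs, which matters at the scale of a crawl.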

URL Matching Experiment • URL matching applied to the URLs of crawled pages • 90,000,000 URLs • English ⇔ Japanese • Looks only at the URL • 90,000,000 → 4,000 • Too strict? • Useless pages are included • Example pairs: japanese.php / english.php, index.html.ja / index.html.en

DOM Tree Alignment [Lei et al. 06] • Searching linked pages • "alt" attribute • Link name: "English version", "In English", etc. • HTML → DOM tree • Parallel link: a pair of the same hyperlinks in parallel texts

Pros and Cons • URL Matching • ○ High speed and easy to implement • × Finds only a small number of pages • DOM Tree • ○ High accuracy and small storage • × Slow execution speed

Detecting Parallel Texts • [Fukushima 06] • Reduces comparison cost • Uses no HTML information • Word (noun) → semantic ID → comparison

Semantic ID Conversion • Construct a graph from dictionaries • Treat Japanese and English texts on the same level • Number of semantic IDs: about 10,000 • Example • ID 1: Sense, 感覚, 意味 • ID 2: Movie, Film, 映画 • ID 3: Hobby, Taste, 趣味, 味
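One way to picture the graph construction is sketched below. It assumes that a semantic ID is a connected component of the bilingual dictionary graph; the slide does not state this, so treat it as a guess. The dictionary entries are toy data modelled on the slide's example, and `build_semantic_ids` is a hypothetical name.

```python
def build_semantic_ids(entries):
    """Assign one ID per connected component of the word graph.

    entries: iterable of (english_word, japanese_word) dictionary pairs.
    Assumption: words linked through any chain of entries share an ID.
    """
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for en, ja in entries:
        root_en, root_ja = find(en), find(ja)
        if root_en != root_ja:
            parent[root_ja] = root_en

    ids, result = {}, {}
    for word in list(parent):
        root = find(word)
        if root not in ids:
            ids[root] = len(ids) + 1
        result[word] = ids[root]
    return result


# Toy entries modelled on the slide's example (not a real dictionary).
toy_entries = [
    ("sense", "感覚"), ("sense", "意味"),
    ("movie", "映画"), ("film", "映画"),
    ("hobby", "趣味"), ("taste", "趣味"), ("taste", "味"),
]
semantic_id = build_semantic_ids(toy_entries)
```

With these entries, "movie", "film", and 映画 end up with the same ID, so English and Japanese words can be compared on the same level.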

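Texts to Vector • Example text: 辞書を使ってテキストを数列に変える。 ("Use the dictionary to convert a text into a number sequence.") • Dictionary: テキスト (text) → 955, 辞書 (dictionary) → 1704, 数列 (number sequence) → 3173, … • IDs in text order: 1704, 955, 3173 • Sort: (955, 1704, 3173) + position information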
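The conversion can be sketched as below, using the three IDs shown on the slide. Noun extraction is assumed to have been done already, and storing the position as the second element of a tuple is only one possible reading of "+ position information"; `text_to_vector` is a hypothetical name.

```python
# Word-to-ID entries taken from the slide's example.
SEMANTIC_ID = {"テキスト": 955, "辞書": 1704, "数列": 3173}


def text_to_vector(nouns):
    """Map nouns to semantic IDs and sort by ID, keeping each original position.

    nouns: the nouns of a text in order of appearance (extraction is assumed
    to happen elsewhere). Words missing from the dictionary are skipped.
    """
    tagged = [
        (SEMANTIC_ID[noun], position)
        for position, noun in enumerate(nouns)
        if noun in SEMANTIC_ID
    ]
    return sorted(tagged)


vector = text_to_vector(["辞書", "テキスト", "数列"])
id_sequence = [semantic_id for semantic_id, _ in vector]
```

Sorting by ID is what later allows two texts to be compared in a single linear pass.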

Comparison • tscore (translation score) • T1: (106, 335, 455, 567, 1704, 3173, 7421) • T2: (335, 567, 567, 1704, 4014, 5449, 7421) • [Figure: running score = 0, 1, 2, … as the two sequences are compared]
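The slide does not give the tscore formula, so the sketch below only shows the kind of linear merge that sorted ID sequences make possible: it counts the IDs shared by T1 and T2. Dividing by the longer length is an assumed normalisation, not the definition of tscore, and both function names are hypothetical.

```python
def common_count(t1, t2):
    """Count IDs shared by two ascending ID sequences with one merge pass."""
    i = j = count = 0
    while i < len(t1) and j < len(t2):
        if t1[i] == t2[j]:
            count += 1
            i += 1
            j += 1
        elif t1[i] < t2[j]:
            i += 1
        else:
            j += 1
    return count


def overlap_score(t1, t2):
    """Assumed normalisation: shared IDs divided by the longer length."""
    longest = max(len(t1), len(t2))
    return common_count(t1, t2) / longest if longest else 0.0


T1 = (106, 335, 455, 567, 1704, 3173, 7421)
T2 = (335, 567, 567, 1704, 4014, 5449, 7421)
shared = common_count(T1, T2)  # 335, 567, 1704 and 7421 are shared
```

Each comparison costs time proportional to the combined length of the two sequences, which is consistent with the goal of a cheap pairwise score.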

tscore Threshold • Fry Corpus [Fry 05] • F-measure • tscore threshold: 0.102 • Speed: 250,000 pairs/sec

Large-Scale Crawling • Calculation cost of each comparison • Calculation cost of the entire crawl • Number of comparisons • URL matching is too strict • Alt text and link names do not apply to all parallel pages

HTML on the Web to Natural Language • Guess the language and encoding • English, SJIS, EUC-JP, UTF-8 • Convert the character encoding • Remove HTML tags • For crawling, <a> and <link> tags are used • <title> and <Hn> tags may be useful
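These preprocessing steps can be sketched with the Python standard library. The encoding guess here is deliberately crude (try UTF-8, then EUC-JP, then Shift_JIS, and accept the first that decodes), which is an assumption for illustration and can misjudge some pages; `decode_page`, `PageExtractor`, and `extract` are hypothetical names.

```python
from html.parser import HTMLParser


def decode_page(raw):
    """Crude encoding guess: accept the first codec that decodes cleanly."""
    for encoding in ("utf-8", "euc_jp", "shift_jis"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


class PageExtractor(HTMLParser):
    """Collect visible text and the href targets of <a> and <link> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in ("a", "link"):
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(raw):
    """Return (plain text, list of links) for one raw HTML page."""
    parser = PageExtractor()
    parser.feed(decode_page(raw))
    parser.close()
    return " ".join(parser.text_parts), parser.links


page = (
    "<html><head><title>T</title><style>b{}</style></head>"
    '<body><h1>Hello</h1><a href="index.html.ja">Japanese</a></body></html>'
).encode("utf-8")
text, links = extract(page)
```

The links feed the crawler, while the text goes on to the semantic ID conversion.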

Calculation Cost Reduction • Distance score of vectors • Compare only nearby vectors • Distance score: tscore • Label every text with its nearest sample text • If two texts are far apart by distance score, they are not parallel texts

Calculation Cost Reduction • Flow • Select sample texts (count << n) • While crawling, calculate the distance score between each text and the sample texts • Classify each text by its top m scores • Compare only texts in the same group
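The flow above can be sketched as follows. The `similarity` function is a placeholder (number of shared IDs), not the tscore from the talk, and the names `top_m_labels` and `candidate_pairs` are hypothetical.

```python
from collections import defaultdict


def similarity(a, b):
    """Placeholder score: number of distinct IDs shared (not the talk's tscore)."""
    return len(set(a) & set(b))


def top_m_labels(vec, samples, m):
    """Indices of the m sample texts that score highest against vec."""
    ranked = sorted(
        range(len(samples)),
        key=lambda k: similarity(vec, samples[k]),
        reverse=True,
    )
    return set(ranked[:m])


def candidate_pairs(en_vecs, ja_vecs, samples, m=1):
    """Return (en_index, ja_index) pairs that share at least one sample label."""
    groups = defaultdict(lambda: ([], []))
    for i, vec in enumerate(en_vecs):
        for label in top_m_labels(vec, samples, m):
            groups[label][0].append(i)
    for j, vec in enumerate(ja_vecs):
        for label in top_m_labels(vec, samples, m):
            groups[label][1].append(j)
    pairs = set()
    for en_ids, ja_ids in groups.values():
        pairs.update((i, j) for i in en_ids for j in ja_ids)
    return pairs


samples = [[1, 2, 3], [10, 11, 12]]
en_vecs = [[1, 2, 4], [10, 11, 13]]
ja_vecs = [[2, 3, 5], [11, 12, 14]]
pairs = candidate_pairs(en_vecs, ja_vecs, samples, m=1)
```

Only pairs inside the same group are ever scored against each other, so the full all-pairs comparison is avoided; a larger m trades extra comparisons for a lower risk of separating a true pair.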

Sampling • Number of samples • Accuracy (risk of mislabeling) • Calculation cost • Size of the groups • Should be equal • Large groups are divided into smaller ones recursively

Crawling Linked Pages • The same links from parallel texts will lead to parallel texts • Evaluation of same links • DOM tree [Lei et al. 06] • Evaluation function • Position of the <A> tag • Pages on the same host • Diff of URLs • hoge.html.en -> fuga.html.en: hoge - fuga • hoge.html.ja -> fuga.html.ja: hoge - fuga
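The "diff of URLs" idea can be illustrated with a small sketch: trim the common prefix and suffix of a page URL and a link URL, and treat two links as a parallel candidate when the remaining diffs agree in both languages. How the talk actually computes the diff is not stated, so this is an assumption, and `url_diff` and `parallel_link_candidates` are hypothetical names.

```python
import os


def url_diff(src, dst):
    """Return the differing middle parts of two URLs as (from_part, to_part)."""
    prefix = os.path.commonprefix([src, dst])
    a, b = src[len(prefix):], dst[len(prefix):]
    # Common suffix = common prefix of the reversed remainders.
    suffix_len = len(os.path.commonprefix([a[::-1], b[::-1]]))
    return a[: len(a) - suffix_len], b[: len(b) - suffix_len]


def parallel_link_candidates(src_en, links_en, src_ja, links_ja):
    """Pair links whose diff against their own source page is identical."""
    return [
        (link_en, link_ja)
        for link_en in links_en
        for link_ja in links_ja
        if url_diff(src_en, link_en) == url_diff(src_ja, link_ja)
    ]


candidates = parallel_link_candidates(
    "http://host/hoge.html.en",
    ["http://host/fuga.html.en", "http://host/piyo.html.en"],
    "http://host/hoge.html.ja",
    ["http://host/fuga.html.ja"],
)
```

In a full evaluation function this signal would be combined with the other cues on the slide, such as the position of the link and whether both pages are on the same host.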

Evaluation of tscore • Fry Corpus [Fry 05] • 200 (Japanese) x 200 (English) • Flow • Convert all texts to vectors • Calculate the distance score for all pairs (40,000) • Check that the scores of real parallel texts are high • The score of a parallel pair should be the top score
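The evaluation flow can be sketched as a miss counter. It assumes the two lists are aligned so that `en_vecs[i]` and `ja_vecs[i]` are a true parallel pair, and it counts a miss whenever some other Japanese text scores strictly higher than the true partner. The `shared_ids` score is a placeholder, not the tscore from the talk.

```python
def shared_ids(a, b):
    """Placeholder score: number of distinct IDs shared by two texts."""
    return len(set(a) & set(b))


def count_misses(en_vecs, ja_vecs, score):
    """Count English texts whose true partner is not the top-scoring Japanese text.

    Assumes en_vecs[i] and ja_vecs[i] form a true parallel pair.
    """
    misses = 0
    for i, en in enumerate(en_vecs):
        scores = [score(en, ja) for ja in ja_vecs]  # one row of the all-pairs table
        if scores[i] < max(scores):
            misses += 1
    return misses


en_vecs = [[1, 2, 3], [4, 5, 6]]
ja_aligned = [[1, 2, 9], [4, 5, 6]]
ja_swapped = [[4, 5, 6], [1, 2, 3]]
misses_aligned = count_misses(en_vecs, ja_aligned, shared_ids)
misses_swapped = count_misses(en_vecs, ja_swapped, shared_ids)
```

Passing the score as a function makes it easy to run the same check with different distance scores.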

Evaluation of tscore • ID sequence (1,1,1,2,4,4,…) → count vector (3,1,0,2,…) • Other distance scores • NOT XOR • AND • EUCLID • COS • AND - XOR • [Figure: worked examples of these scores on the sparse count vectors (3,1,0,2,0) and (3,0,0,1,2)]
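Two of the alternative scores, COS and EUCLID, have standard definitions and can be sketched on the slide's example vectors. The AND, XOR, and NOT XOR variants are not defined on the slide, so they are left out here; the function names are hypothetical.

```python
import math
from collections import Counter


def count_vector(ids, size):
    """Turn an ID sequence into a count vector over IDs 1..size."""
    counts = Counter(ids)
    return [counts.get(k, 0) for k in range(1, size + 1)]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclid(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


u = count_vector([1, 1, 1, 2, 4, 4], 5)  # the slide's (3,1,0,2,0)
v = [3, 0, 0, 1, 2]
cos_uv = cosine(u, v)      # 11 / 14
dist_uv = euclid(u, v)     # sqrt(6)
```

Note that cosine is a similarity (higher means closer) while Euclidean distance is the opposite, so any threshold has to be applied in the matching direction.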

Evaluation of tscore • Number of misses per score (200 + 200 texts)

Calculation Time • Fry Corpus • 200, 400, 800, 1600, 3200 • NORMAL tscore (top 3) • Number of samples: √(number of all texts) • Mislabeling: 11 (in 200 pairs)

Conclusion and Future Work • Parallel texts from the Web • Detecting parallel texts • Large-scale crawling • Future work • Crawling many texts from the Web • Crawling with parallel link structure • Detecting parallel texts in real HTML texts • Proper sampling

Thank you for your attention!
