
The Web as a Parallel Corpus


Presentation Transcript


    Slide 1: The Web as a Parallel Corpus. A paper by Philip Resnik and Noah A. Smith (2003, Computational Linguistics). My interpretation of their research. http://www.thebritishmuseum.ac.uk/compass/ixbin/goto?id=OBJ67

    Slide 2: Contents: Introduction to parallel corpora; the STRAND Web-mining architecture (established 1999); content-based matching; exploiting the Internet Archive; conclusions and further work.

    Slide 3: Introduction to parallel corpora. The Rosetta Stone dates from around 190 BC. Its three texts carry the same content in hieroglyphic, demotic, and Greek script. The Canadian Hansard and the Hong Kong Hansard are two other famous parallel corpora, especially because they are available electronically and are of high quality. Motivation: bitexts provide indispensable training data for statistical translation models, and the Web can be mined for suitable bilingual and multilingual texts.

    Slide 4: STRAND: Web-Mining Architecture (1). Structural Translation Recognition, Acquiring Natural Data (STRAND) is the authors' software for finding pairs of Web pages that are translations of each other. More parallel text is always to the advantage of machine translation research and implementation. How does STRAND work? 1) Location of pages that might have parallel translations: look for parent pages and sibling pages; the page author has most probably embedded a language link such as "Chinese" or "Arabic" in the page. 2) Generation of candidate pairs that might be translations, e.g. by comparing URLs. 3) Structural filtering out of non-translation candidate pairs by examining the structure and content of the pages.

    Slide 5: STRAND: Web-Mining Architecture (2). 1) Locating pairs: candidate pairs typically come from one Web site. STRAND looks for sibling pairs; these pages are often linked to each other by links offering the user "Français", "Español", or other options. 2) Generating pairs: for many Web sites the URLs can be compared: http://www.ottawa.ca/index_en.html vs. http://www.ottawa.ca/index_fr.html. 3) Structural filtering: first look at the HTML structure; Web-page writers often use the same or a very similar template. Next, a markup analyzer using three token types produces a linear representation of each of the two candidate Web pages.
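The URL-comparison step can be sketched as follows. This is a minimal illustration, not STRAND's implementation: `LANG_SUBS` and `candidate_pairs` are hypothetical names, and the real system uses a much richer set of language-marker substitution rules.

```python
# Hypothetical language-marker substitutions; STRAND's actual rules differ.
LANG_SUBS = [("_en", "_fr"), ("-en", "-fr"), ("/en/", "/fr/")]

def candidate_pairs(urls):
    """Pair URLs that become identical after swapping a language marker.

    Note: str.replace swaps every occurrence of the marker, which is
    good enough for this sketch.
    """
    url_set = set(urls)
    pairs = set()
    for url in urls:
        for en, fr in LANG_SUBS:
            if en in url:
                partner = url.replace(en, fr)
                if partner in url_set:
                    pairs.add((url, partner))
    return sorted(pairs)

crawl = [
    "http://www.ottawa.ca/index_en.html",
    "http://www.ottawa.ca/index_fr.html",
    "http://www.ottawa.ca/news_en.html",   # no French sibling in the crawl
]
print(candidate_pairs(crawl))
```

Only `index_en.html` gets paired here, since `news_en.html` has no French counterpart in the input.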

    Slide 6: STRAND: Web-Mining Architecture (3). Candidate pairs (English page | French page):
    <HTML>                        | <HTML>
    <TITLE>City Hall</TITLE>      | <TITLE>Hotel de Ville</TITLE>
    <BODY>                        | <BODY>
    <H1>Regional Government</H1>  | ...
    The business ...              | Les affaires ...
    The candidate pairs are now formed into two linear alignments:
    [START:HTML]   | [START:HTML]
    [START:TITLE]  | [START:TITLE]
    [Chunk:8]      | [Chunk:12]
    [END:TITLE]    | [END:TITLE]
    [START:H1]     | [START:BODY]
    [Chunk:18]     | [Chunk:138]
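A minimal sketch of such a markup linearizer, built on Python's standard `html.parser`. The class name is illustrative, and chunk lengths are measured in characters here, which is only one possible definition; the paper's analyzer differs in detail.

```python
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    """Flatten HTML into STRAND-style tokens: [START:x], [Chunk:n], [END:x].

    A simplified sketch; the real markup analyzer differs in detail.
    """
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"[START:{tag.upper()}]")

    def handle_endtag(self, tag):
        self.tokens.append(f"[END:{tag.upper()}]")

    def handle_data(self, data):
        text = data.strip()
        if text:  # record non-markup text as a length-annotated chunk
            self.tokens.append(f"[Chunk:{len(text)}]")

def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    return parser.tokens

page = "<html><title>City Hall</title><body><h1>Regional Government</h1></body></html>"
print(linearize(page))
```

Running two candidate pages through `linearize` yields the two token sequences that the alignment step then compares position by position.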

    Slide 7: Using these two linear alignments, we use four scalar values to characterize the quality of the alignment: dp (difference percentage) = percentage of alignment mismatches (that is, tokens that don't match); n = number of aligned non-markup text chunks; r = correlation of the lengths of the aligned non-markup chunks; p = level of significance of the correlation r. Analysts can then manually set thresholds on these parameters and check the results. 100% precision and 68.6% recall have been obtained using STRAND to find English-French Web pages.
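A rough sketch of computing three of these values (dp, n, r) from two already-aligned token sequences. The significance test for p is omitted (it requires a t-test on r), the mismatch definition is a simplification of the paper's, and `alignment_scores` is a hypothetical name.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def alignment_scores(tokens_a, tokens_b):
    """Return (dp, n, r) for two position-aligned token sequences."""
    is_chunk = lambda t: t.startswith("[Chunk:")
    pairs = list(zip(tokens_a, tokens_b))
    # A pair mismatches if one side is a chunk and the other is markup,
    # or if two markup tokens differ; chunks of different lengths align.
    mism = sum(1 for a, b in pairs
               if is_chunk(a) != is_chunk(b) or (not is_chunk(a) and a != b))
    dp = 100.0 * mism / len(pairs)
    lens_a = [int(t[7:-1]) for t in tokens_a if is_chunk(t)]
    lens_b = [int(t[7:-1]) for t in tokens_b if is_chunk(t)]
    n = min(len(lens_a), len(lens_b))
    r = pearson(lens_a[:n], lens_b[:n]) if n > 1 else 0.0
    return dp, n, r

a = ["[START:HTML]", "[START:TITLE]", "[Chunk:8]", "[END:TITLE]", "[Chunk:18]"]
b = ["[START:HTML]", "[START:TITLE]", "[Chunk:12]", "[END:TITLE]", "[Chunk:21]"]
print(alignment_scores(a, b))  # no markup mismatches; chunk lengths correlate
```

Low dp together with high, significant r is the signal that the two pages really are structural translations of each other.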

    Slide 8: Optimizing Parameters Using Machine Learning. A ninefold cross-validation experiment using decision-tree induction was used to predict the class assigned by the human judges. The learned classifiers were substantially different from the manually set (heuristic) thresholds. Manually set: 31% of good document pairs were discarded. ML-set: 16% of good pairs were discarded (4% false positives).
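The cross-validation setup can be sketched on toy data, with a one-node decision stump standing in for full decision-tree induction; this is a simplification for illustration, not the authors' experiment, and all names and numbers below are invented.

```python
def kfold_indices(n_samples, k=9):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n_samples))
    fold = n_samples // k
    for i in range(k):
        test = idx[i * fold:(i + 1) * fold] if i < k - 1 else idx[i * fold:]
        train = [j for j in idx if j not in test]
        yield train, test

def fit_stump(X, y):
    """Learn the best single-feature threshold test (a one-node 'tree')."""
    best = None  # (accuracy, feature, threshold, positive_side)
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for pos in (True, False):
                preds = [(x[f] <= t) == pos for x in X]
                acc = sum(p == lab for p, lab in zip(preds, y)) / len(y)
                if best is None or acc > best[0]:
                    best = (acc, f, t, pos)
    return best[1:]

def predict(stump, x):
    f, t, pos = stump
    return (x[f] <= t) == pos

# Toy (dp, r) feature vectors: true translations have low dp and high r.
good = [(2.0, 0.95), (5.0, 0.90), (8.0, 0.85)] * 3
bad = [(40.0, 0.20), (55.0, 0.10), (70.0, 0.05)] * 3
X, y = good + bad, [True] * 9 + [False] * 9

correct = 0
for train, test in kfold_indices(len(X), k=9):
    stump = fit_stump([X[i] for i in train], [y[i] for i in train])
    correct += sum(predict(stump, X[i]) == y[i] for i in test)
print(correct / len(X))  # 1.0 on this cleanly separable toy set
```

The point of the learned classifier, as on the slide, is that thresholds induced from judged data discard far fewer good pairs than hand-set heuristics.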

    Slide 9: Other Related Work / Other Linguistic Researchers. Some analysts use the Parallel Text Miner (PTMiner), which uses existing search engines to locate pages that are likely to be in the other language of interest; a final filtering stage then cleans the corpus. Bilingual Internet Text Search (BITS) is used by other researchers and employs different matching techniques. STRAND, PTMiner, and BITS are all largely independent of linguistic knowledge about particular languages, and therefore very easily ported to new language pairs. Resnik has looked into English-Arabic, English-Chinese (Big5), and English-Basque.

    Slide 10: Mining the Web. Researchers can and do mine the internet every day. The physicist Barabási and his team have studied the size, shape, and structure of the internet, as well as hit frequencies of numerous Web pages. Spiders (crawlers) are used in this research. The Internet Archive (www.archive.org/web/researcher/) is also instrumental in obtaining useful information.

    Slide 11: The Internet Archive. The Internet Archive is a nonprofit organization attempting to archive the entire publicly available Web, preserving the content and providing free access to researchers, historians, scholars, and the general public (120 terabytes of information in 2002; over 10 billion Web pages). Properties of the Archive: 1) The Archive is a temporal database, but it is not stored in temporal order. 2) Extracting a document is an expensive operation (text extraction). 3) Computational complexity must be kept low when mining this database. 4) Data relevant for linguistic purposes are clearly available. 5) A suite of tools exists for linguistic processing of the Archive.

    Slide 12: Building an English-Arabic Corpus. Step 1: Search for English-Arabic pairs. Look at 24 top-level national domains for countries where Arabic is spoken: Egypt (.eg), Saudi Arabia (.sa), Kuwait (.kw), and so on, plus other .com domains believed to be useful to Arabic-speaking people. Step 2: Resnik et al. mined two crawls of the Internet Archive comprising 8 TB and 12 TB; the relevant domains contained 19,917,923 pages. Step 3: Only 8,294 pairs of English-Arabic bitexts were found. EVALUATION:
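Step 1's domain filter can be sketched as follows. The TLD list is truncated to the three examples on the slide (the full list of 24, and the .com heuristics, are not reproduced), and `in_target_domains` is a hypothetical name.

```python
from urllib.parse import urlparse

# Three of the 24 Arabic-country top-level domains named on the slide.
TARGET_TLDS = (".eg", ".sa", ".kw")

def in_target_domains(url):
    """True if the URL's host ends in one of the target country-code TLDs."""
    host = urlparse(url).hostname or ""
    return host.endswith(TARGET_TLDS)  # str.endswith accepts a tuple

crawl = [
    "http://www.example.eg/news.html",
    "http://www.example.sa/index_ar.html",
    "http://www.example.com/blog.html",
]
print([u for u in crawl if in_target_domains(u)])  # keeps the .eg and .sa URLs
```

Filtering a crawl down to these domains first keeps the expensive extraction step (property 2 on slide 11) confined to pages likely to matter.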

    Slide 13: Conclusions and Further Work. Initial Web searches for parallel texts were undertaken in 1998; Resnik's report is from 2002. The author laments the lack of different languages available on the internet, as well as the lack of data made available by some countries. The growth of both the internet and the Internet Archive will considerably add to the expansion of parallel corpora. Chen and Nie (2000), for example, found around 15,000 English-Chinese document pairs. One of the early STRAND projects for English-Chinese parallel texts found over 70,000 pairs. Because STRAND expects pages to be very similar in structural terms, the resulting document collections are particularly amenable to sentence- or segment-level alignment.
