
The Web as a Parallel Corpus


Presentation Transcript


    Slide 1: The Web as a Parallel Corpus. A paper by Philip Resnik and Noah A. Smith (2003, Computational Linguistics). My interpretation of their research. http://www.thebritishmuseum.ac.uk/compass/ixbin/goto?id=OBJ67

    Slide 2: Contents: Introduction to parallel corpora; the STRAND Web-mining architecture (established 1999); content-based matching; exploiting the Internet Archive; conclusions and further work.

    Slide 3: Introduction to parallel corpora. The Rosetta Stone dates from around 190 BC. Its three texts carry the same content in hieroglyphic, demotic, and Greek script. The Canadian Hansard and the Hong Kong Hansard are two other famous parallel corpora, especially because they are available electronically and are of high quality. Motivation: bitexts provide indispensable training data for statistical translation models, and the Web can be mined for suitable bilingual and multilingual texts.

    Slide 4: STRAND: Web-Mining Architecture (1). Structural Translation Recognition, Acquiring Natural Data (STRAND) is the authors' software for finding pairs of Web pages that are translations of each other. More parallel text is always to the advantage of machine translation research and implementation. How does STRAND work? 1) Location of pages that might have parallel translations: look for parent pages and sibling pages; the page author has most probably embedded a language link such as "Chinese" or "Arabic" in the page. 2) Generation of candidate pairs that might be translations, e.g. by comparing URLs. 3) Structural filtering out of non-translation candidate pairs by examining the structure and content of the pages.

    Slide 5: STRAND: Web-Mining Architecture (2). 1) Locating pairs: candidate pairs typically come from one Web site. STRAND looks for sibling pairs; these pages are often linked to each other by links offering the user "Français", "Español", or other options. 2) Generating pairs: for many Web sites the URLs can be compared: http://www.ottawa.ca/index_en.html vs. http://www.ottawa.ca/index_fr.html. 3) Structural filtering: first look at the HTML structure; Web-page writers often use the same or a very similar template. Next, a markup analyzer using three token types produces a linear representation of each of the two candidate Web pages.
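The URL-comparison step can be sketched as follows. This is a minimal illustration, not STRAND's implementation: `LANG_SUBS` and `candidate_pairs` are hypothetical names, and the real system uses a much richer set of language-marker substitution rules.

```python
# Hypothetical language-marker substitutions; STRAND's actual rules differ.
LANG_SUBS = [("_en", "_fr"), ("-en", "-fr"), ("/en/", "/fr/")]

def candidate_pairs(urls):
    """Pair URLs that become identical after swapping a language marker.

    Note: str.replace swaps every occurrence of the marker, which is
    good enough for this sketch.
    """
    url_set = set(urls)
    pairs = set()
    for url in urls:
        for en, fr in LANG_SUBS:
            if en in url:
                partner = url.replace(en, fr)
                if partner in url_set:
                    pairs.add((url, partner))
    return sorted(pairs)

crawl = [
    "http://www.ottawa.ca/index_en.html",
    "http://www.ottawa.ca/index_fr.html",
    "http://www.ottawa.ca/news_en.html",   # no French sibling in the crawl
]
print(candidate_pairs(crawl))
```

Only `index_en.html` gets paired here, since `news_en.html` has no French counterpart in the input.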

    Slide 6: STRAND: Web-Mining Architecture (3). Candidate pairs (English page | French page):
    <HTML>                        | <HTML>
    <TITLE>City Hall</TITLE>      | <TITLE>Hotel de Ville</TITLE>
    <BODY>                        | <BODY>
    <H1>Regional Government</H1>  | ...
    The business ...              | Les affaires ...
    The candidate pairs are now formed into two linear alignments:
    [START:HTML]   | [START:HTML]
    [START:TITLE]  | [START:TITLE]
    [Chunk:8]      | [Chunk:12]
    [END:TITLE]    | [END:TITLE]
    [START:H1]     | [START:BODY]
    [Chunk:18]     | [Chunk:138]
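A minimal sketch of such a markup linearizer, built on Python's standard `html.parser`. The class name is illustrative, and chunk lengths are measured in characters here, which is only one possible definition; the paper's analyzer differs in detail.

```python
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    """Flatten HTML into STRAND-style tokens: [START:x], [Chunk:n], [END:x].

    A simplified sketch; the real markup analyzer differs in detail.
    """
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"[START:{tag.upper()}]")

    def handle_endtag(self, tag):
        self.tokens.append(f"[END:{tag.upper()}]")

    def handle_data(self, data):
        text = data.strip()
        if text:  # record non-markup text as a length-annotated chunk
            self.tokens.append(f"[Chunk:{len(text)}]")

def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    return parser.tokens

page = "<html><title>City Hall</title><body><h1>Regional Government</h1></body></html>"
print(linearize(page))
```

Running two candidate pages through `linearize` yields the two token sequences that the alignment step then compares position by position.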

    Slide 7: Using these two linear alignments, we use four scalar values to characterize the quality of the alignment: dp (difference percentage) = percentage of alignment mismatches (that is, tokens that don't match); n = number of aligned non-markup text chunks; r = correlation of the lengths of the aligned non-markup chunks; p = level of significance of the correlation r. Analysts can then manually set thresholds on these parameters and check the results. 100% precision and 68.6% recall have been obtained using STRAND to find English-French Web pages.
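A rough sketch of computing three of these values (dp, n, r) from two already-aligned token sequences. The significance test for p is omitted (it requires a t-test on r), the mismatch definition is a simplification of the paper's, and `alignment_scores` is a hypothetical name.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def alignment_scores(tokens_a, tokens_b):
    """Return (dp, n, r) for two position-aligned token sequences."""
    is_chunk = lambda t: t.startswith("[Chunk:")
    pairs = list(zip(tokens_a, tokens_b))
    # A pair mismatches if one side is a chunk and the other is markup,
    # or if two markup tokens differ; chunks of different lengths align.
    mism = sum(1 for a, b in pairs
               if is_chunk(a) != is_chunk(b) or (not is_chunk(a) and a != b))
    dp = 100.0 * mism / len(pairs)
    lens_a = [int(t[7:-1]) for t in tokens_a if is_chunk(t)]
    lens_b = [int(t[7:-1]) for t in tokens_b if is_chunk(t)]
    n = min(len(lens_a), len(lens_b))
    r = pearson(lens_a[:n], lens_b[:n]) if n > 1 else 0.0
    return dp, n, r

a = ["[START:HTML]", "[START:TITLE]", "[Chunk:8]", "[END:TITLE]", "[Chunk:18]"]
b = ["[START:HTML]", "[START:TITLE]", "[Chunk:12]", "[END:TITLE]", "[Chunk:21]"]
print(alignment_scores(a, b))  # no markup mismatches; chunk lengths correlate
```

Low dp together with high, significant r is the signal that the two pages really are structural translations of each other.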

    Slide 8: Optimizing Parameters Using Machine Learning. A ninefold cross-validation experiment using decision-tree induction was used to predict the class assigned by the human judges. The learned classifiers were substantially different from the manually set (heuristic) thresholds. Manually set: 31% of good document pairs were discarded. ML-set: 16% of good pairs were discarded (4% false positives).
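The cross-validation setup can be sketched on toy data, with a one-node decision stump standing in for full decision-tree induction; this is a simplification for illustration, not the authors' experiment, and all names and numbers below are invented.

```python
def kfold_indices(n_samples, k=9):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n_samples))
    fold = n_samples // k
    for i in range(k):
        test = idx[i * fold:(i + 1) * fold] if i < k - 1 else idx[i * fold:]
        train = [j for j in idx if j not in test]
        yield train, test

def fit_stump(X, y):
    """Learn the best single-feature threshold test (a one-node 'tree')."""
    best = None  # (accuracy, feature, threshold, positive_side)
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for pos in (True, False):
                preds = [(x[f] <= t) == pos for x in X]
                acc = sum(p == lab for p, lab in zip(preds, y)) / len(y)
                if best is None or acc > best[0]:
                    best = (acc, f, t, pos)
    return best[1:]

def predict(stump, x):
    f, t, pos = stump
    return (x[f] <= t) == pos

# Toy (dp, r) feature vectors: true translations have low dp and high r.
good = [(2.0, 0.95), (5.0, 0.90), (8.0, 0.85)] * 3
bad = [(40.0, 0.20), (55.0, 0.10), (70.0, 0.05)] * 3
X, y = good + bad, [True] * 9 + [False] * 9

correct = 0
for train, test in kfold_indices(len(X), k=9):
    stump = fit_stump([X[i] for i in train], [y[i] for i in train])
    correct += sum(predict(stump, X[i]) == y[i] for i in test)
print(correct / len(X))  # 1.0 on this cleanly separable toy set
```

The point of the learned classifier, as on the slide, is that thresholds induced from judged data discard far fewer good pairs than hand-set heuristics.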

    Slide 9: Other Related Work / Other Linguistic Researchers. Some analysts use the Parallel Text Miner (PTMiner), which uses existing search engines to locate pages that are likely to be in the other language of interest; a final filtering stage then cleans the corpus. Bilingual Internet Text Search (BITS) is used by other researchers and employs different matching techniques. STRAND, PTMiner, and BITS are all largely independent of linguistic knowledge about particular languages, and therefore very easily ported to new language pairs. Resnik has looked into English-Arabic, English-Chinese (Big5), and English-Basque.

    Slide 10: Mining the Web. Researchers can and do mine the internet every day. The physicist Barabási and his team have studied the size, shape, and structure of the internet, as well as hit frequencies of numerous Web pages. Spiders (crawlers) are used in this research. The Internet Archive (www.archive.org/web/researcher/) is also instrumental in obtaining useful information.

    Slide 11: The Internet Archive. The Internet Archive is a nonprofit organization attempting to archive the entire publicly available Web, preserving the content and providing free access to researchers, historians, scholars, and the general public (120 terabytes of information in 2002; over 10 billion Web pages). Properties of the Archive: 1) The Archive is a temporal database, but it is not stored in temporal order. 2) Extracting a document is an expensive operation (text extraction). 3) Computational complexity must be kept low when mining this database. 4) Data relevant for linguistic purposes are clearly available. 5) A suite of tools exists for linguistic processing of the Archive.

    Slide 12: Building an English-Arabic Corpus. Step 1: Search for English-Arabic pairs. Look at 24 top-level national domains for countries where Arabic is spoken: Egypt (.eg), Saudi Arabia (.sa), Kuwait (.kw), and so on, plus other .com domains believed to be useful to Arabic-speaking people. Step 2: Resnik et al. mined two crawls of the Internet Archive comprising 8 TB and 12 TB; the relevant domains contained 19,917,923 pages. Step 3: Only 8,294 pairs of English-Arabic bitexts were found. EVALUATION:
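Step 1's domain filter can be sketched as follows. The TLD list is truncated to the three examples on the slide (the full list of 24, and the .com heuristics, are not reproduced), and `in_target_domains` is a hypothetical name.

```python
from urllib.parse import urlparse

# Three of the 24 Arabic-country top-level domains named on the slide.
TARGET_TLDS = (".eg", ".sa", ".kw")

def in_target_domains(url):
    """True if the URL's host ends in one of the target country-code TLDs."""
    host = urlparse(url).hostname or ""
    return host.endswith(TARGET_TLDS)  # str.endswith accepts a tuple

crawl = [
    "http://www.example.eg/news.html",
    "http://www.example.sa/index_ar.html",
    "http://www.example.com/blog.html",
]
print([u for u in crawl if in_target_domains(u)])  # keeps the .eg and .sa URLs
```

Filtering a crawl down to these domains first keeps the expensive extraction step (property 2 on slide 11) confined to pages likely to matter.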

    Slide 13: Conclusions and Further Work. Initial Web searches for parallel texts were undertaken in 1998; Resnik's report is from 2002. The author laments the lack of different languages available on the internet, as well as the lack of data made available by some countries. The growth of both the internet and the Internet Archive will considerably add to the expansion of parallel corpora. Chen and Nie (2000), for example, found around 15,000 English-Chinese document pairs. One of the early STRAND projects for English-Chinese parallel texts found over 70,000 pairs. Because STRAND expects pages to be very similar in structural terms, the resulting document collections are particularly amenable to sentence- or segment-level alignment.
