BootCaT: Leveraging Web Corpora for Multilingual Term Extraction and Domain-Specific Translation

BootCaT (Bootstrapping Corpora and Terms) is a method developed by Marco Baroni and Silvia Bernardini (2004) for building domain-specific corpora from the web. Starting from a set of seed terms, it sends random combinations of seeds to a search engine such as Google, Yahoo, or Bing, gathers the pages returned, removes duplicates, and extracts terms; the process can be iterated. This gives translators, who know the language but are not domain experts, an instant domain corpus in which to find terminology. WebBootCaT adds a web interface, improved cleaning and duplicate removal, and integration with the Sketch Engine corpus tool. The presentation extends the method to comparable corpora in several languages by translating the seed terms, with the aim of proposing bilingual terminology.

Presentation Transcript


  1. Comparable Corpora BootCaT (CCBC) • Adam Kilgarriff, Avinesh PVS • Lexical Computing Ltd

  2. BootCaT • Bootstrapping Corpora and Terms • Translators • Know the language • Not domain experts • Can interpret domain terms but can’t guess them • Instant domain corpus from the web • Marco Baroni and Silvia Bernardini (2004)

  3. BootCaT method • Piggyback on a search engine • Google, Yahoo, Bing • Set of seed terms • Repeat • Take 3 random seeds • Send to search engine • Gather ‘search hits’ pages • Remove duplicates, find terms • Can iterate
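The loop on this slide can be sketched in Python as follows. This is a minimal illustration, not the BootCaT implementation: the function names, the `search` callback (any function mapping a query string to a list of URLs), and URL-level deduplication are all assumptions made for the example, and no real search engine API is called.

```python
import itertools
import random
from typing import Callable, Iterable, List, Optional, Sequence, Tuple


def seed_tuples(seeds: Sequence[str], n_tuples: int, size: int = 3,
                rng: Optional[random.Random] = None) -> List[Tuple[str, ...]]:
    """Return up to n_tuples distinct random combinations of `size` seeds.

    Enumerating every combination is fine for a few dozen seeds;
    a larger seed set would need sampling instead.
    """
    rng = rng or random.Random()
    combos = list(itertools.combinations(sorted(set(seeds)), size))
    rng.shuffle(combos)
    return combos[:n_tuples]


def build_query(combo: Iterable[str]) -> str:
    """Join seeds into one query, quoting multiword seeds as phrases."""
    return " ".join(f'"{term}"' if " " in term else term for term in combo)


def collect_urls(seeds: Sequence[str],
                 search: Callable[[str], Iterable[str]],
                 n_tuples: int = 10,
                 rng: Optional[random.Random] = None) -> List[str]:
    """Query once per seed tuple and keep each URL the first time it appears.

    `search` is a hypothetical caller-supplied function (query -> URLs).
    Dropping repeated URLs is a simplification of duplicate removal.
    """
    seen = set()
    urls = []
    for combo in seed_tuples(seeds, n_tuples, rng=rng):
        for url in search(build_query(combo)):
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls
```

Passing the search step in as a function keeps the sketch independent of any particular engine; the pages behind the collected URLs would then be downloaded, cleaned, and mined for new terms, which can serve as seeds for the next iteration.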

  4. WebBootCaT • Web interface • Improved cleaning, duplicate removal • Integrated with corpus tool (Sketch Engine)

  5. Going multilingual • Google Translate • English: volcanology volcanologist "volcanic eruption" seismographs Eyjafjallajokull geodic "deformation monitoring" tephra magma stratigraphic tephrochronology geochronological "volcanic ash" ablation rhyolitic • French: vulcanologue volcanologie "éruption volcanique" sismographes Eyjafjallajokull "surveillance de la déformation" géodiques tephra magma téphrochronologie stratigraphique géochronologiques "de cendres volcaniques" ablation rhyolitiques • And do the same thing for French
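The seed-translation step might look roughly like the sketch below. It parses a seed line in which multiword terms are double-quoted, then maps each seed through a translation function. The names `parse_seeds` and `translate_seeds` are invented for this example, and the small dictionary is only a stand-in for a machine translation service; the slides name Google Translate but do not show how it is called.

```python
import re
from typing import Callable, List, Optional

# A seed is either a double-quoted phrase or a run of non-space characters.
_SEED = re.compile(r'"([^"]+)"|(\S+)')


def parse_seeds(line: str) -> List[str]:
    """Split a seed line into terms, keeping quoted phrases together."""
    return [phrase or word for phrase, word in _SEED.findall(line)]


def translate_seeds(seeds: List[str],
                    translate: Callable[[str], Optional[str]]) -> List[str]:
    """Translate each seed, skipping failures and repeated translations."""
    translated = []
    for seed in seeds:
        target = translate(seed)
        if target and target not in translated:
            translated.append(target)
    return translated


# Toy English-to-French lookup, an assumption standing in for real MT output.
toy_en_fr = {
    "volcanology": "volcanologie",
    "volcanic eruption": "éruption volcanique",
    "tephra": "tephra",
    "magma": "magma",
}

english = parse_seeds('volcanology "volcanic eruption" tephra magma')
french = translate_seeds(english, toy_en_fr.get)
```

The translated list then plays the role of the seed set for a second run of the same procedure in the target language.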

  6. By July 2011 • All steps integrated • Propose bilingual terminology
