BootCaT is a method developed by Marco Baroni and Silvia Bernardini for building domain-specific corpora from the web. Starting from a set of seed terms, it sends random combinations of seeds to a search engine such as Google, Yahoo, or Bing, gathers the pages returned, removes duplicates, and extracts new terms, which can feed a further iteration. This gives translators, who know the language but are not domain experts, quick access to relevant terminology. Running the same process on translated seed terms yields comparable corpora in several languages and a basis for proposing bilingual terminology. WebBootCaT adds a web interface, improved cleaning and duplicate removal, and integration with the Sketch Engine corpus tool.
Comparable Corpora BootCaT (CCBC) • Adam Kilgarriff, Avinesh PVS • Lexical Computing Ltd
BootCaT • Bootstrapping Corpora and Terms • Translators • Know the language • Not domain experts • Can interpret domain terms but can’t guess them • Instant domain corpus from the web • Marco Baroni and Silvia Bernardini (2004)
BootCaT method • Piggyback on a search engine • Google, Yahoo, Bing • Set of seed terms • Repeat • Take 3 random seeds • Send to search engine • Gather ‘search hits’ pages • Remove duplicates, find terms • Can iterate
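The loop on this slide can be sketched in Python as follows. This is a minimal illustration, not the BootCaT implementation: `search` is a hypothetical callable supplied by the caller that maps a query string to `(url, text)` pairs, the term finder is a crude frequency count standing in for real term extraction, and all function and parameter names are assumptions made for the example.

```python
import random
import re
from collections import Counter


def make_query(seeds, rng, size=3):
    """Pick `size` random seeds; quote multiword seeds so they match as phrases."""
    pool = sorted(seeds)  # sorted so a fixed rng gives reproducible queries
    chosen = rng.sample(pool, min(size, len(pool)))
    return " ".join(f'"{s}"' if " " in s else s for s in chosen)


def extract_terms(pages, known, top_n=10):
    """Stand-in term finder: most frequent longer words that are not yet seeds."""
    counts = Counter(
        word
        for text in pages
        for word in re.findall(r"[a-z]+", text.lower())
        if len(word) > 5 and word not in known
    )
    return [word for word, _ in counts.most_common(top_n)]


def bootcat(seeds, search, n_queries=5, n_iterations=2, seed=0):
    """Build a corpus from seed terms; `search` is a hypothetical search backend."""
    rng = random.Random(seed)
    seeds = set(seeds)
    corpus = {}          # url -> page text
    seen_texts = set()   # used to drop exact duplicate pages
    for _ in range(n_iterations):
        for _ in range(n_queries):
            for url, text in search(make_query(seeds, rng)):
                if url in corpus or text in seen_texts:
                    continue  # remove duplicates
                seen_texts.add(text)
                corpus[url] = text
        # newly found terms become seeds for the next iteration
        seeds |= set(extract_terms(corpus.values(), seeds))
    return corpus, seeds
```

Passing the search backend in as a function keeps the sketch independent of any particular search engine API, so it can be exercised with a stub that returns canned pages.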
WebBootCaT • Web interface • Improved cleaning, duplicate removal • Integrated with corpus tool (Sketch Engine)
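The slides do not say how WebBootCaT detects duplicates. One common approach, shown here purely as an assumption-laden sketch, compares pages by their overlapping word n-grams (shingles) and drops any page whose Jaccard similarity to an already kept page reaches a threshold. The function names, the shingle size, and the 0.8 threshold are illustrative choices, not values from the source.

```python
def shingles(text, n=3):
    """Return the set of word n-grams in `text` (one short tuple for tiny texts)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_near_duplicates(texts, threshold=0.8):
    """Keep each text only if it is not too similar to any text kept so far."""
    kept, kept_shingles = [], []
    for text in texts:
        current = shingles(text)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept
```

This pairwise comparison is quadratic in the number of pages, which is acceptable for the small corpora a handful of queries produces; larger collections would usually need a hashing scheme instead.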
Going multilingual • Google-translate • English: volcanology volcanologist "volcanic eruption" seismographs Eyjafjallajokull geodic "deformation monitoring" tephra magma stratigraphic tephrochronology geochronological "volcanic ash" ablation rhyolitic • French: vulcanologue volcanologie "éruption volcanique" sismographes Eyjafjallajokull "surveillance de la déformation" géodiques tephra magma téphrochronologie stratigraphique géochronologiques "de cendres volcaniques" ablation rhyolitiques • And do the same thing for French
By July 2011 • All steps integrated • Propose bilingual terminology