
Mining Gazetteer Data from Digital Library Collections



Presentation Transcript


  1. Mining Gazetteer Data from Digital Library Collections. David Smith, Perseus Project, Tufts University

  2. Corpus Preview. Perseus Project, JCDL 2002

  3. Preview: 1400-1600

  4. What DLs can do for gazetteers
     • Directly manage gazetteers
     • Raw materials for gazetteers
     • Reference works
     • Monolingual and parallel corpora
     • Testbeds for improving these technologies
     • E.g. alignment helps name tagging, and name tagging helps alignment

  5. Lexicographical parallels
     • Original “slipping” process
     • First, get a madman ...
     • Creation of Brown and other corpora
     • Kučera and Francis
     • Cobuild dictionary and friends
     • But names “get no respect” in lexicography (McDonald, 1996)

  6. Cultural dependencies

  7. Toponym Results

  8. Projection principles
     • Exploits asymmetry in human language technologies (Yarowsky, HLT 2001)
     • English, French, Chinese, Czech (!) have
     • POS taggers, morphological analyzers
     • Named entity identifiers
     • Parsers and bracketers
     • Parallel corpus alignment allows projection of these resources
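The projection idea on this slide can be illustrated with a minimal sketch: tags produced by an existing tool on the resource-rich side are copied through an alignment onto the other side. Everything here is an assumption for illustration and not taken from the talk: the function name `project_tags`, the tag labels, the word-level alignment, and the transliterated Greek tokens.

```python
from typing import List, Tuple


def project_tags(
    source_tags: List[str],
    alignment: List[Tuple[int, int]],
    target_len: int,
    default: str = "O",
) -> List[str]:
    """Copy each source token's tag onto the target tokens aligned to it.

    `alignment` holds (source_index, target_index) pairs. Target tokens
    with no aligned, tagged source token keep the `default` tag.
    """
    target_tags = [default] * target_len
    for src_i, tgt_j in alignment:
        tag = source_tags[src_i]
        # The first non-default tag projected onto a target token wins.
        if tag != default and target_tags[tgt_j] == default:
            target_tags[tgt_j] = tag
    return target_tags


# Hypothetical data: English tagged by an existing named-entity tool,
# Greek (transliterated here) with no tagger of its own.
english_tags = ["PERSON", "O", "O", "PLACE"]      # Socrates went to Athens
greek_tokens = ["Sokrates", "elthen", "eis", "Athenas"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]

greek_tags = project_tags(english_tags, alignment, len(greek_tokens))
```

A real pipeline would have to cope with one-to-many and missing alignments; this sketch simply lets the first projected tag stand.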

  9. Projection principles

  10. Projection on the cheap
     • Align texts at coarse structural level
     • Geocode source text (English)
     • Optionally winnow target text (e.g. non-capitalized words where applicable)
     • Calculate mutual information (Church & Hanks, 1990)
     • Transliteration may be too ad hoc
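The steps on this slide can be sketched roughly as follows, assuming the texts are already aligned into coarse segments and the English side is already geocoded into a set of toponyms per segment. The helper names (`winnow`, `pmi_table`, `best_candidates`), the capitalization filter, and the toy data are hypothetical; the score is pointwise mutual information, log2 of P(x, y) / (P(x) P(y)), estimated from segment counts.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def winnow(words: Iterable[str]) -> Set[str]:
    """Keep only capitalized target words (one possible winnowing rule)."""
    return {w for w in words if w[:1].isupper()}


def pmi_table(
    segments: List[Tuple[Set[str], Set[str]]]
) -> Dict[Tuple[str, str], float]:
    """Score every (source toponym, target word) pair that co-occurs.

    Each segment is (toponyms found in the source, candidate target words).
    Probabilities are estimated as fractions of aligned segments.
    """
    n = len(segments)
    src_count: Counter = Counter()
    tgt_count: Counter = Counter()
    joint: Counter = Counter()
    for toponyms, words in segments:
        src_count.update(toponyms)
        tgt_count.update(words)
        joint.update((t, w) for t in toponyms for w in words)
    return {
        (t, w): math.log2(c * n / (src_count[t] * tgt_count[w]))
        for (t, w), c in joint.items()
    }


def best_candidates(scores: Dict[Tuple[str, str], float]) -> Dict[str, str]:
    """Pick the highest-scoring target word for each source toponym."""
    best: Dict[str, Tuple[str, float]] = {}
    for (t, w), s in scores.items():
        if t not in best or s > best[t][1]:
            best[t] = (w, s)
    return {t: w for t, (w, _) in best.items()}


# Toy aligned segments: (geocoded English toponyms, transliterated target text).
raw = [
    ({"Athens"}, ["Sokrates", "elthen", "eis", "Athenas"]),
    ({"Athens", "Sparta"}, ["Athenas", "kai", "Sparten"]),
    ({"Sparta"}, ["Sparten", "eide"]),
    (set(), ["Sokrates", "eipe"]),
]
segments = [(toponyms, winnow(words)) for toponyms, words in raw]
scores = pmi_table(segments)
matches = best_candidates(scores)
```

Because the method only counts co-occurrence within aligned segments, it needs no transliteration rules, which fits the slide's remark that transliteration may be too ad hoc.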

  11. Preliminary results
     • Greek/English testbed
     • 98% precision
     • 70.8% recall (Why?)
     • Ethnic designations present interesting problems
     • “Stephanus of Byzantium”
     • Morphology outside of English
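For readers less familiar with the two figures on this slide, here is a small sketch of how precision and recall are computed for a set of proposed toponym pairs against a hand-checked gold set. The function name and the example pairs are invented for illustration and do not reproduce the talk's data or its 98% and 70.8% figures.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def precision_recall(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float]:
    """Precision = correct / proposed; recall = correct / gold total."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard and system output (transliterated forms).
gold = {
    ("Athens", "Athenas"),
    ("Sparta", "Sparten"),
    ("Thebes", "Thebas"),
    ("Corinth", "Korinthon"),
}
predicted = {
    ("Athens", "Athenas"),
    ("Sparta", "Sparten"),
    ("Athens", "Sokrates"),  # a wrong proposal lowers precision
}

precision, recall = precision_recall(predicted, gold)
```

In this toy case two of three proposals are correct and two of four gold pairs are found. High precision with lower recall, as on the slide, is the pattern of a system that is usually right about what it proposes but misses some pairs.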

  12. Proposals
     • Preservation of gazetteer source materials
     • DLs as home for gazetteer “slips”
     • Parallel texts as key resource
     • (also cf. Berkeley TIDES work)
     • Persistent documents as training sets for automatic methods
     • http://www.perseus.tufts.edu
