Metadata as Infrastructure for Information Retrieval and Text Mining

Presentation Transcript


  1. Metadata as Infrastructure for Information Retrieval and Text Mining. Prof. Ray R. Larson, University of California, Berkeley, School of Information. NaCTeM – Ray R. Larson

  2. Overview • Metadata as Infrastructure • What, Where, When and Who? • What are Entry Vocabulary Indexes? • Notion of an EVI • How are EVIs Built • Time Period Directories • Mining Metadata for new metadata

  3. Metadata as Infrastructure • The difference between memorization and understanding lies in knowing the context and relationships of whatever is of interest. When setting out to learn about a new topic, a well-tested practice is to follow the traditional “5Ws and the H”: Who?, What?, When?, Where?, Why?, and How?

  4. Metadata as Infrastructure • The reference collections of paper-based libraries provide a structured environment for resources, with encyclopedias and subject catalogs, gazetteers, chronologies, and biographical dictionaries, offering direct support for at least What, Where, When, and Who. • The digital environment does not yet provide an effective, and easily exploited, infrastructure comparable to the traditional reference library.

  5. What? Searching texts by topic, e.g. Dewey, LCSH, any subject index, or category scheme applied to documents. Two kinds of mapping in every search: • Documents are assigned to topic categories, e.g. Dewey • Queries have to map to topic categories, e.g. Dewey’s Relativ Index from ordinary words/phrases to Decimal Classification numbers. Also mapping between topic systems, e.g. US Patent classification and International Patent Classification.

  6. ‘What’ searches involve mapping to controlled vocabularies. [Diagram: Thesaurus/Ontology, Texts]

  7. Start with a collection of documents.

  8. Classify and index with a controlled vocabulary, or use a pre-indexed collection. [Diagram: Index]

  9. Problem: Controlled vocabularies can be difficult for people to use. For “Wirtschaftspolitik” in the Library of Congress subject index, use “Economic Policy”. Index term: “pass mtr veh spark ign eng”

  10. “pass mtr veh spark ign eng” = “Automobile”. Solution: Entry Level Vocabulary Indexes. [Diagram: Index, EVI]

  11. “What” and Entry Vocabulary Indexes • EVIs are a means of mapping from user’s vocabulary to the controlled vocabulary of a collection of documents…

  12. Building and Searching EVIs. Searching: • User has question but is unfamiliar with the domain he wants to search. • User selects a subject domain of interest (domains to select from: Engineering, Medicine, Biology, Social science, etc.). • Has an Entry Vocabulary Module been built? YES: use an existing EVI. NO: build one. • Map user’s query to ranked list of controlled vocabulary terms. • User selects search terms from the ranked list of terms returned by the EVI. Building an Entry Vocabulary Module (EVI): • Internet DB indexed with a controlled vocabulary. • Download a set of training data. • Extract terms (words and noun phrases) from titles and abstracts; part of speech tagging for noun phrases. • Build associations between extracted terms & controlled vocabularies.

  13. Building an Entry Vocabulary Module (EVI): Technical Details • Internet DB indexed with a controlled vocabulary. • Download a set of training data. • Extract terms (words and noun phrases) from titles and abstracts; part of speech tagging for noun phrases. • Build associations between extracted terms & controlled vocabularies.
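
The association-building step can be sketched as counting, for every term and class pair in the training data, how often the two occur together and apart. This is a minimal sketch, not the project's actual code: it assumes terms have already been extracted from titles and abstracts (the part-of-speech tagging step is skipped), and the function name `contingency_counts` and the record layout are hypothetical.

```python
from collections import Counter

def contingency_counts(records):
    """Build a 2x2 table (a, b, c, d) for every co-occurring (term, class) pair.

    records: iterable of (terms, classes) pairs, one per training document.
    This input layout is an assumption made for illustration.
    a = docs with term and class      b = docs with term but not class
    c = docs with class but not term  d = docs with neither
    """
    n_docs = 0
    term_df, class_df, pair_df = Counter(), Counter(), Counter()
    for terms, classes in records:
        n_docs += 1
        terms, classes = set(terms), set(classes)  # count each doc once
        term_df.update(terms)
        class_df.update(classes)
        pair_df.update((t, cls) for t in terms for cls in classes)

    tables = {}
    for (t, cls), a in pair_df.items():
        b = term_df[t] - a
        c = class_df[cls] - a
        d = n_docs - a - b - c
        tables[(t, cls)] = (a, b, c, d)
    return tables
```

Each resulting table has the same layout as the contingency table used by the association measure, so it can be passed directly to whatever weighting function is chosen.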

  14. Association Measure
             C     ¬C
        t    a     b
       ¬t    c     d
      where t is the occurrence of a term and C is the occurrence of a class in the training set

  15. Association Measure • Maximum likelihood ratio (viz. Dunning): $W(C,t) = 2\,[\log L(p_1, a, a+b) + \log L(p_2, c, c+d) - \log L(p, a, a+b) - \log L(p, c, c+d)]$, where $\log L(p, k, n) = k \log(p) + (n - k)\log(1 - p)$ and $p_1 = \frac{a}{a+b}$, $p_2 = \frac{c}{c+d}$, $p = \frac{a+c}{a+b+c+d}$
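
As a worked illustration, the likelihood ratio can be computed directly from the four cell counts a, b, c, d of the term/class contingency table. This is a minimal sketch under stated assumptions: `log_l` takes the success count k before the total n, 0·log(0) is treated as 0, and both rows of the table are assumed non-empty (a+b > 0 and c+d > 0). The function names are illustrative, not from the slides.

```python
import math

def log_l(p, k, n):
    """Binomial log-likelihood k*log(p) + (n - k)*log(1 - p).

    Terms with a zero count are dropped so that 0*log(0) counts as 0.
    """
    def xlogy(x, y):
        return 0.0 if x == 0 else x * math.log(y)
    return xlogy(k, p) + xlogy(n - k, 1 - p)

def association_weight(a, b, c, d):
    """W(C, t) for a 2x2 table: rows t / not-t, columns C / not-C."""
    p1 = a / (a + b)                  # P(C | t)
    p2 = c / (c + d)                  # P(C | not t)
    p = (a + c) / (a + b + c + d)     # P(C) overall
    return 2 * (log_l(p1, a, a + b) + log_l(p2, c, c + d)
                - log_l(p, a, a + b) - log_l(p, c, c + d))
```

When the term and the class are independent (for example a = b = c = d) the weight is 0; the more the class rate differs between documents with and without the term, the larger the weight becomes.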

  16. Alternatively • Because the “evidence” terms in EVIs can be considered a document, you can also use IR techniques and use the top-ranked classes for classification or query expansion.
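
One way to read this alternative: pool the evidence terms for each class into a pseudo-document, then rank classes against the user's query with an ordinary retrieval model. The sketch below uses TF-IDF weighting with cosine similarity purely as an example; the slides do not say which IR technique was used, and the name `rank_classes` and the input layout are assumptions.

```python
import math
from collections import Counter

def rank_classes(query_terms, class_evidence):
    """Rank classes by cosine similarity between the query and each class's
    evidence terms, using TF-IDF weights.

    class_evidence: dict mapping class label -> list of evidence terms
    (an assumed layout; each list acts as that class's 'document').
    """
    n = len(class_evidence)
    df = Counter()
    for terms in class_evidence.values():
        df.update(set(terms))

    def weigh(terms):
        # Ignore terms never seen as evidence for any class.
        tf = Counter(t for t in terms if t in df)
        return {t: f * math.log(1 + n / df[t]) for t, f in tf.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = (math.sqrt(sum(w * w for w in u.values()))
                * math.sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0

    q = weigh(query_terms)
    scored = [(label, cosine(q, weigh(terms)))
              for label, terms in class_evidence.items()]
    # Keep only classes sharing at least one term with the query.
    return sorted((s for s in scored if s[1] > 0),
                  key=lambda s: s[1], reverse=True)
```

The top-ranked labels can then be offered to the user as candidate controlled vocabulary terms or appended to the query for expansion.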

  17. Find “Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil. [Diagram: Digital library resources, Statistical association]

  18. EVI example • User query: “Automobile” • EVI 1: Index term: “pass mtr veh spark ign eng” • EVI 2: Index term: “automobiles” OR “internal combustion engines”

  19. But why stop there? [Diagram: Index, EVI]

  20. “Which EVI do I use?” [Diagram: multiple Index and EVI pairs]

  21. EVI to EVIs. [Diagram: EVI2 linked to multiple Index and EVI pairs]

  22. Why not treat language the same way? Find “Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil.

  23. It is also difficult to move between different media forms. [Diagram: Thesaurus/Ontology, Texts, EVI, Numeric datasets]

  24. Searching across data types • Different media can be linked indirectly via metadata, but often (e.g. for socio-economic numeric data series) you also need to specify WHERE to get correct results.

  25. But texts associated with numeric data can be mapped as well… [Diagram: Thesaurus/Ontology, Texts, EVI, EVI, captions, Numeric datasets]

  26. EVI to Numeric Data example. [Diagram, steps 1 to 11: search interface 1, online catalog, EVI, LCSH, MARC, search results, captions, new query, search interface 2, numeric database, numeric table]

  27. But there are also geographic dependencies… [Diagram: Thesaurus/Ontology, Texts, EVI, EVI, Maps/Geo Data, captions, Numeric datasets]

  28. WHERE: Place names are problematic… • Variant forms: St. Petersburg, Санкт Петербург, Saint-Pétersbourg, . . . • Multiple names: Cluj, in Romania / Roumania / Rumania, is also called Klausenburg and Kolozsvar. • Name changes: Bombay → Mumbai. • Homographs: Vienna, VA, and Vienna, Austria; 50 Springfields. • Anachronisms: No Germany before 1870 • Vague, e.g. Midwest, Silicon Valley • Unstable boundaries: 19th century Poland; Balkans; USSR • Use a gazetteer!

  29. WHERE. Geo-temporal search interface. Place names found in documents. Gazetteer provided lat. & long. Places displayed on map. Timebar

  30. Zoom on map. Click on place for a list of records. Click on record to display text.

  31. Catalogs and gazetteers should talk to each other! • Catalog search • Gazetteer search • Geographic sort / display of catalog search result.

  32. So geographic search becomes part of the infrastructure. [Diagram: Thesaurus/Ontology, Texts, EVI, Maps/Geo Data, Gazetteers, captions, Numeric datasets]

  33. WHEN: Search by time is also weakly supported… • Calendars are the standard for time • But people use the names of events to refer to time periods • Named time periods resemble place names in being: • Unstable: European War, Great War, First World War • Multiple: Second World War, Great Patriotic War • Ambiguous: “Civil war” in different centuries in England, USA, Spain, etc. • Places have temporal aspects & periods have geographical aspects: When the Stone Age was, varies by region

  34. Similarity between place names and period names • Suggests a similar solution: A gazetteer-like Time Period Directory. • Gazetteer: Place name – Type – Spatial markers (Lat & long) – When • Time Period Directory: Period name – Type – Time markers (Calendar) – Where • Note the symmetry in the connections between Where and When.
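
The symmetry between the two record types can be made concrete with a pair of small data structures. This is an illustrative sketch only: the field names, the use of whole years as time markers, and the lookup function `periods_at` are assumptions, not the schema of any actual gazetteer or Time Period Directory.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class GazetteerEntry:
    place_name: str
    place_type: str
    lat: float                               # spatial markers
    lon: float
    when: Optional[Tuple[int, int]] = None   # years the name applies, if known

@dataclass
class TimePeriodEntry:
    period_name: str
    period_type: str
    start_year: int                          # time markers (calendar)
    end_year: int
    where: Optional[str] = None              # place the period applies to

def periods_at(directory: List[TimePeriodEntry], year: int,
               where: Optional[str] = None) -> List[str]:
    """Names of periods covering `year`, optionally restricted to a place."""
    return [e.period_name for e in directory
            if e.start_year <= year <= e.end_year
            and (where is None or e.where == where)]

# Sample entries; the type labels are hypothetical.
DIRECTORY = [
    TimePeriodEntry("Siege", "siege", 1550, 1551, "Magdeburg (Germany)"),
    TimePeriodEntry("First World War", "war", 1914, 1918),
]
```

For example, `periods_at(DIRECTORY, 1550, where="Magdeburg (Germany)")` returns the siege entry, while the same year restricted to another place returns nothing, which is the Where and When dependency the slide points to.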

  35. Solution - Time Period Directories • Initial development involved mining the Library of Congress Subject Authority file for named time periods…

  36. LC MARC Authorities Records
      <USMARC>
        <Fld001>sh 00000613 </Fld001>
        <Fld151><a>Magdeburg (Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>
        <Fld550><w>g</w><a>Sieges</a><z>Germany</z></Fld550>
        <Fld670><a>Work cat.: 45053442: Besselmeier, S. Warhafftige history vnd beschreibung des Magdeburgischen Kriegs, 1552.</a></Fld670>
        <Fld670><a>Cath. encyc.</a><b>(Magdeburg: besieged (1550-51) by the Margrave Maurice of Saxony)</b></Fld670>
        <Fld670><a>Ox. encyc. reformation</a><b>(Magdeburg: ... during the 1550-1551 siege of Magdeburg ...)</b></Fld670>
      </USMARC>
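
To show how a named time period might be mined from such a record, the sketch below reads the Fld151 heading and splits its chronological subdivision into a period name and a year range. The slides do not describe the actual extraction procedure, so the function `extract_period`, the regular expression, and the output layout are all assumptions; only the element names and the sample values come from the record above (shown here as an excerpt).

```python
import re
import xml.etree.ElementTree as ET

# Excerpt of the authority record shown on the slide.
RECORD = (
    "<USMARC><Fld001>sh 00000613 </Fld001>"
    "<Fld151><a>Magdeburg (Germany)</a><x>History</x>"
    "<y>Siege, 1550-1551</y></Fld151></USMARC>"
)

# Matches e.g. "Siege, 1550-1551" or "Siege, 1550" (four-digit years only).
CHRONO = re.compile(r"(?P<name>.*?),?\s*(?P<start>\d{4})(?:-(?P<end>\d{4}))?$")

def extract_period(record_xml):
    """Return place, period name, and year range from a Fld151 heading,
    or None when the record has no parsable chronological subdivision."""
    root = ET.fromstring(record_xml)
    heading = root.find("Fld151")
    if heading is None:
        return None
    chrono = heading.findtext("y")
    if not chrono:
        return None
    m = CHRONO.match(chrono.strip())
    if m is None:
        return None
    start = int(m.group("start"))
    end = int(m.group("end")) if m.group("end") else start
    return {
        "id": (root.findtext("Fld001") or "").strip(),
        "place": heading.findtext("a"),
        "period": m.group("name"),
        "start": start,
        "end": end,
    }
```

Real authority files contain many date formats (centuries, open ranges, B.C. dates) that this pattern does not handle, so a production extractor would need a richer parser.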

  39. Time periods by named location

  40. Catalog Search Result

  41. Web Interface - Access by map

  42. Zoomable interface gives access to geographically focused info…

  43. Web Interface - Access by timeline. Link initiates search of the Library of Congress catalog for all records relating to this time period.

  44. WHEN and WHAT • These named time periods are derived from Library of Congress catalog subject headings, and so can be used for catalog searching, which finds books on topics important for that time period.

  45. Time period directories link via the place (or time). [Diagram: Thesaurus/Ontology, Texts, EVI, Maps/Geo Data, Gazetteers, captions, Numeric datasets, Time Period Directory, Time lines, Chronologies]

  46. WHEN, WHERE and WHO • Catalog records found from a time period search commonly include names of persons important at that time. Their names can be forwarded to, e.g., biographies in the Wikipedia encyclopedia.

  47. Place and time are broadly important across numerous tools and genres including, e.g., Language atlases, Library catalogs, Biographical dictionaries, Bibliographies, Archival finding aids, Museum records, etc. Biographical dictionaries are heavy on place and time: Emanuel Goldberg, Born Moscow 1881. PhD under Wilhelm Ostwald, Univ. of Leipzig, 1906. Director, Zeiss Ikon, Dresden, 1926-33. Moved to Palestine 1937. Died Tel Aviv, 1970. Life as a series of episodes involving Activity (WHAT), WHERE, WHEN, and WHO else.

  48. A new form of biographical dictionary would link to all. [Diagram: Biographical Dictionary, Thesaurus/Ontology, Texts, EVI, Maps/Geo Data, Gazetteers, captions, Numeric datasets, Time Period Directory, Time lines, Chronologies]

  49. A Metadata Infrastructure • RESOURCES / CATALOGS: Audio, Images, Numeric Data, Objects, Texts, Virtual Reality, Webpages • Archives, Historical Societies, Libraries, Museums, Public Television, Publishers, Booksellers • Learners, Dossiers • INTERMEDIA INFRASTRUCTURE (Facet: Authority Control; Special Display Tools) • WHAT: Thesaurus; Syndetic Structure • WHERE: Gazetteer; Maps • WHEN: Time Period Directory; Timelines • WHO: Biographical Dictionary; Text and Images

  50. Acknowledgements • Electronic Cultural Atlas Initiative project • This work was partially supported by the Institute of Museum and Library Services through a National Leadership Grant for Libraries, award number LG-02-04-0041-04, Oct 2004 - Sept 2006 entitled “Supporting the Learner: What, Where, When and Who” – See: http://ecai.org/imls2004 • Michael Buckland, Fred Gey, Vivien Petras, Matt Meiske, Kim Carl • Contact: ray@sims.berkeley.edu
