The automatic encoding of lexical knowledge in RDF topicmaps

Carol Jean Godby, OCLC Online Computer Library Center, March 6, 2001. Topicmaps of Web resources: for navigating complex Web sites, for managing bookmark files, and for creating views of the Web that are organized by subject.

Presentation Transcript


  1. The automatic encoding of lexical knowledge in RDF topicmaps. Carol Jean Godby, OCLC Online Computer Library Center, March 6, 2001

  2. Topicmaps of Web resources • For navigating complex Web sites • For managing bookmark files • For creating views of the Web that are organized by subject

  3. Terminology identification • ...is an essential first step in the analysis of a document's content. • ...is one of the most mature research subjects in natural language processing.

  4. Lexical phrases • Are the names of persistent concepts. • Act like words. • Are commonly used to name new concepts in rapidly evolving technical subject domains.

  5. Not a lexical phrase: “Recurrent problem”

  6. A lexical phrase: “Recurrent erosion”

  7. Identifying lexical phrases • Tokenized text: ...Planetary scientists think the convex shape came about as lava welled up beneath the crater's solid floor…. • Ngrams: planetary scientists think, convex shape, welled up, coincided with, five times greater than, easiest way, Milky Way, absolute magnitudes brighter than, added material, advanced study, African American • Index filter: planetary scientists, convex shape, easiest way, Milky Way, absolute magnitudes, added material, advanced study, African American • Topic filter: planetary scientists, Milky Way
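The first two stages on this slide can be sketched as follows. This is a minimal illustration, not the system described in the talk: the function-word list and the rule "drop any n-gram that begins or ends with a function word" are assumptions standing in for the real index filter, and the names `tokenize`, `ngrams`, and `index_filter` are hypothetical.

```python
import re

# Assumed, illustrative function-word list; the real index filter is not
# specified on the slide and is certainly richer than this.
FUNCTION_WORDS = {"the", "a", "an", "of", "as", "up", "with", "than",
                  "about", "beneath", "in", "on", "to"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def ngrams(tokens, n_max=3):
    """Yield every contiguous word sequence of length 2..n_max."""
    for n in range(2, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def index_filter(candidates):
    """Keep candidates that neither begin nor end with a function word."""
    return [" ".join(gram) for gram in candidates
            if gram[0] not in FUNCTION_WORDS and gram[-1] not in FUNCTION_WORDS]


text = ("Planetary scientists think the convex shape came about as lava "
        "welled up beneath the crater's solid floor")
kept = index_filter(ngrams(tokenize(text)))
print(kept)
```

A boundary rule this simple still lets through candidates such as "planetary scientists think", which is why the slide shows a further topic filter after the index filter.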

  8. Terminology identification: process flow Tokenized text 9.8m Ngrams 1.9 M Index filter 59k, 2331 phrases 35k, 1632 phrases Topic filter

  9. Strategies in the topic filter • Word/phrase frequency and strength of association • “Knowledge-poor” text analysis • More sophisticated but computable text analysis
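"Strength of association" is not defined further on the slide. One widely used measure is pointwise mutual information (PMI), sketched below for adjacent word pairs. The choice of PMI and the function name `pmi_scores` are assumptions for illustration, not necessarily what the topic filter uses.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Return PMI (log base 2) for every adjacent word pair in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # Positive when the pair co-occurs more often than chance predicts.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


tokens = "the milky way the milky way the way the star the moon".split()
scores = pmi_scores(tokens)
print(scores[("milky", "way")], scores[("the", "way")])
```

In this toy input "milky way" scores higher than "the way", because "the" combines freely with many words while "milky" appears only before "way".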

  10. Word and phrase frequencies • Word/phrase frequency: high: dublin core, metadata, element, electronic resources; low: availability period, background, applicable terminologies • Weighted frequency: 1. core element, date element, metadata element 2. author name, entity name, corporate name 3. HTML tag, end tag, meta tag

  11. Knowledge-poor techniques 1: • Some noun phrase heads usually appear in text only with adjective or noun modifiers. Example: holes -- black holes, grey holes, central holes • Others usually appear without modifiers. Example: galaxy -- cartwheel galaxy, spiral galaxy; a galaxy, our galaxy, this galaxy
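The contrast between "holes" and "galaxy" can be expressed as the share of a head noun's occurrences that carry an adjective or noun modifier. The sketch below assumes the noun phrases have already been extracted with the head as the last word; the determiner list and the name `modified_ratio` are illustrative assumptions, and no lemmatization is attempted.

```python
from collections import defaultdict

# Assumed list of words that precede a head without modifying its meaning.
DETERMINERS = {"a", "an", "the", "our", "this", "that", "these", "those"}


def modified_ratio(noun_phrases):
    """Map each head noun to the fraction of its phrases that have a modifier."""
    modified = defaultdict(int)
    total = defaultdict(int)
    for phrase in noun_phrases:
        words = phrase.lower().split()
        head = words[-1]
        # Anything before the head that is not a determiner counts as a modifier.
        modifiers = [w for w in words[:-1] if w not in DETERMINERS]
        total[head] += 1
        if modifiers:
            modified[head] += 1
    return {head: modified[head] / total[head] for head in total}


phrases = ["black holes", "grey holes", "central holes",
           "cartwheel galaxy", "spiral galaxy",
           "a galaxy", "our galaxy", "this galaxy"]
print(modified_ratio(phrases))
```

On the slide's own examples, "holes" is always modified while "galaxy" is modified in two of five phrases, which is the signal that "galaxy" can stand alone as a topical term.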

  12. Consequences • We can identify topical single terms: galaxy, star, sun, moon; government, abortion, communism; metadata, html, Internet, information • We can create subject taxonomies: galaxy (-ies): cartwheel galaxy, elliptical galaxy, spiral galaxy; *hole(s): black holes, drill holes, grey holes
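A taxonomy of the kind shown on this slide can be approximated by grouping multiword phrases under their final word. This is a minimal sketch under the assumption that the last word of an English noun phrase is its head; `taxonomy_by_head` is a hypothetical name, and singular and plural heads are not merged.

```python
from collections import defaultdict


def taxonomy_by_head(phrases):
    """Group multiword phrases under their last word, treated as the head."""
    tree = defaultdict(list)
    for phrase in phrases:
        words = phrase.split()
        if len(words) > 1:  # single words have no narrower phrase to file
            tree[words[-1]].append(phrase)
    return {head: sorted(children) for head, children in tree.items()}


phrases = ["spiral galaxy", "cartwheel galaxy", "black holes",
           "elliptical galaxy", "drill holes", "grey holes"]
for head, children in taxonomy_by_head(phrases).items():
    print(head, "->", children)
```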

  13. Knowledge-poor techniques 2: subject probes • Goal: to get high-quality subject terms • Look for markers of a subject that is talked about, written about or studied: topics in, study of, analysis of, (on the) subject of, major in… • Probes differ in specificity. topics in: sciences, arts, humanities, library science, astronomy, physics, business, data visualization, computer science, mathematics, computer and network security, mathematics, number theory, medicine. analysis of: metabolic regulation, numerical analysis, saline water phenomena, coals, iron ore, cereal grains, income dynamics among men, working hours, inflation, mass belief systems, aerial photography
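A subject probe can be approximated with a regular expression that finds a marker and captures the words that follow it. The sketch below is an assumption-laden illustration: it takes the probe list from the slide, captures everything up to the next punctuation mark, and does no noun phrase analysis; `probe_candidates` is a hypothetical name.

```python
import re

# Probe markers listed on the slide.
PROBES = ["topics in", "study of", "analysis of", "subject of", "major in"]

PROBE_RE = re.compile(
    r"\b(%s)\s+([A-Za-z][A-Za-z \-]*)" % "|".join(re.escape(p) for p in PROBES),
    re.IGNORECASE,
)


def probe_candidates(text):
    """Return (probe, following words) pairs; the capture stops at punctuation."""
    return [(m.group(1).lower(), m.group(2).strip())
            for m in PROBE_RE.finditer(text)]


sample = ("She teaches topics in number theory. His thesis is an analysis of "
          "income dynamics among men, written in 1999.")
print(probe_candidates(sample))
```

Grouping the captured strings by probe gives lists like those on the slide, and makes the difference in specificity between probes easy to inspect.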

  14. Some results

  15. The identification of term relationships • Singular/Plural: library, libraries • Acronyms: Standard Generalized Markup Language--SGML; Library of Congress Subject Headings--LCSH • Coordination: library and information science--library science, information science; information storage and retrieval--information storage, information retrieval; cataloging and interlibrary loan--cataloging, interlibrary loan • Ellipsis: abbreviated key title--abbreviated title; authority file records--authority records
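Three of these relationships lend themselves to simple string tests, sketched below. These are illustrative heuristics, not the rules used in the talk: the skipped-word list, the idea of checking coordination expansions against a known vocabulary, and all three function names are assumptions.

```python
SKIPPED = {"of", "and", "the", "for"}  # assumed words ignored in acronyms


def is_acronym_of(acronym, phrase):
    """True if the acronym equals the initials of the phrase's content words."""
    initials = "".join(w[0] for w in phrase.split() if w.lower() not in SKIPPED)
    return acronym.upper() == initials.upper()


def expand_coordination(phrase, vocabulary):
    """Propose the terms joined by 'and'; keep those attested in vocabulary."""
    left, sep, right = phrase.partition(" and ")
    if not sep:
        return []
    lw, rw = left.split(), right.split()
    candidates = {left, right}
    if len(rw) > 1:   # shared head: "library and information science"
        candidates.add(" ".join(lw + rw[-1:]))
    if len(lw) > 1:   # shared modifier: "information storage and retrieval"
        candidates.add(" ".join(lw[:-1] + rw))
    return sorted(c for c in candidates if c in vocabulary)


def is_ellipsis_of(short, long):
    """True if short is long with words dropped and the order preserved."""
    s, l = short.split(), long.split()
    remaining = iter(l)
    return len(s) < len(l) and all(word in remaining for word in s)


print(is_acronym_of("LCSH", "Library of Congress Subject Headings"))
print(expand_coordination("library and information science",
                          {"library science", "information science"}))
print(is_ellipsis_of("authority records", "authority file records"))
```

Filtering the coordination candidates against terms already seen in the corpus is one way to tell "library and information science" (shared head) from "cataloging and interlibrary loan" (no sharing) without parsing.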

  16. A more abstract relationship: hypernym/hyponym • “…electronic formats, such as text/HTML, ASCII, or PostScript….” • Other examples from our data: controlled vocabularies: Medical Subject Headings, Art and Architecture Thesaurus; metadata element set: Dublin Core; protocol server applications: NFS server, FTP server, Web server; moving images: films, videos, simulations
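The "such as" construction in the quoted example can be matched with a lexical pattern. The sketch below handles only that one pattern and assumes the hypernym is the whole run of words before "such as"; a real extractor would need a noun phrase chunker to isolate it. The name `hyponyms_from_such_as` is hypothetical.

```python
import re

# "<hypernym>, such as <item>, <item>, or <item>"
SUCH_AS = re.compile(r"([A-Za-z][A-Za-z /-]*?),?\s+such as\s+([^.;]+)",
                     re.IGNORECASE)
# List separators: a comma (optionally followed by or/and), or a bare or/and.
SEPARATOR = re.compile(r",\s*(?:or\s+|and\s+)?|\s+(?:or|and)\s+")


def hyponyms_from_such_as(sentence):
    """Return (hypernym, [hyponyms]) for the first 'such as' list, or None."""
    match = SUCH_AS.search(sentence)
    if match is None:
        return None
    hypernym = match.group(1).strip()
    items = [item.strip() for item in SEPARATOR.split(match.group(2))]
    return hypernym, [item for item in items if item]


print(hyponyms_from_such_as(
    "electronic formats, such as text/HTML, ASCII, or PostScript."))
```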

  17. A graph representation of relationships [Figure: graph whose nodes are the terms Dewey Subject Headings, Dewey call numbers, Library of Congress Subject Headings, Dewey Decimal, Dewey numbers, Dewey decimal classification, DDC and LCSH, numbers, cutter numbers, and DDC, connected by edges labeled B/N (Broad/Narrow), Ellipsis, Acronym, and Coordination.]

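  18. RDF Topic Representation [Figure: RDF graph with two topic nodes, Dewey numbers (name “Dewey Numbers”) and Numbers (name “numbers”), linked by narrow and broad arcs, with isDefinedIn arcs pointing to the resources http://r1, http://r2, and http://r3.]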
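The topic structure on this slide can be written down as RDF triples. The sketch below serializes one topic in N-Triples syntax using plain Python. The property names name, broad, narrow, and isDefinedIn come from the slide; the namespace `http://example.org/topicmap#`, the topic identifiers, the direction of the broad arc, and the choice of `http://r1` as the defining resource are all assumptions made for illustration.

```python
NS = "http://example.org/topicmap#"  # hypothetical namespace, not from the talk


def triple(subject, predicate, obj, literal=False):
    """Format one N-Triples line (no escaping of quotes in literals)."""
    rendered = '"%s"' % obj if literal else "<%s>" % obj
    return "<%s> <%s> %s ." % (subject, predicate, rendered)


def topic_triples(topic_id, name, broader=(), narrower=(), defined_in=()):
    """Build the triples for one topic using the slide's property names."""
    subject = NS + topic_id
    lines = [triple(subject, NS + "name", name, literal=True)]
    lines += [triple(subject, NS + "broad", NS + b) for b in broader]
    lines += [triple(subject, NS + "narrow", NS + n) for n in narrower]
    lines += [triple(subject, NS + "isDefinedIn", url) for url in defined_in]
    return lines


# Assumed reading of the figure: "Dewey numbers" is narrower than "Numbers".
for line in topic_triples("DeweyNumbers", "Dewey Numbers",
                          broader=["Numbers"], defined_in=["http://r1"]):
    print(line)
```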

  19. System flow 1: processing steps 1. Harvest Web text. 2. Extract terminology and relationships. 3. Organize terminology into an RDF graph. 4. Import the RDF graph into the Extended Open RDF Toolkit.

  20. System flow 2: User interaction [Figure: diagram connecting The Web, the RDF concept graph, the RDF search engine, and the User.]

  21. A screen shot

  22. Future plans • Develop a user interface that fully exploits the richness of the RDF graph structure. • Merge terminology extracted from source documents with other sources of information. • Improve processes for automatically extracting terminology.

  23. References • The Extended Open RDF Toolkit. Accessible at: http://eor.dublincore.org/ • “Automatically generated topic maps of World Wide Web resources.” Accessible at: http://www.oclc.org/oclc/research/publications/review99/godby/topicmaps.htm • “The WordSmith indexing system.” Accessible at: http://www.oclc.org/oclc/research/publications/review98/godby_reighart/wordsmith.htm