
Terminology identification from full text: OCLC’s WordSmith experience

Jean Godby, Senior Research Scientist, OCLC Online Computer Library Center, Inc. SOASIST Full-Day Workshop on Aboutness, June 21, 2001.



  1. Terminology identification from full text: OCLC’s WordSmith experience. Jean Godby, Senior Research Scientist, OCLC Online Computer Library Center, Inc. SOASIST Full-Day Workshop on Aboutness, June 21, 2001

  2. Outline of this talk • The need for terminology • Sources of terminology • Extracting terminology from free text • Organizing it • Mapping it to library classification schemes

  3. Increasing subject access to document collections [Diagram with two axes: more vs. less human effort, and more vs. less abstract view of the data. Items placed on the axes: Cataloging, Tokenizing, Classification, Indexing, Scorpion, Classification Research, WordSmith]

  4. Subject terminology from library classification schemes • Strengths • Derived from scholarship in subject analysis and classification theory • Permits interoperability between Web resources and traditional published materials • Weaknesses • Literary warrant is based on traditional published materials. • Human effort is required to keep them current. • They must be modified for use in automated systems. • They aren’t free.

  5. Subject terminology from full text • Strengths • Literary warrant is based on current text. • Coverage is not restricted to traditionally published material. • The style is closer to the user’s vocabulary. • Weaknesses • The data is noisy and difficult to organize.

  6. Terminology identification • ...is an essential first step in the analysis of a document's content. • ...is one of the most mature research subjects in natural language processing.

  7. Lexical phrases • Are the names of persistent concepts. • Act like words. • Are commonly used to name new concepts in rapidly evolving technical subject domains.

  8. A lexical phrase: “Recurrent erosion”

  9. Not a lexical phrase: “Recurrent problem”

  10. Identifying lexical phrases • Tokenized text: ...Planetary scientists think the convex shape came about as lava welled up beneath the crater's solid floor.... • Ngrams: planetary scientists think, convex shape, welled up, coincided with, five times greater than, easiest way, Milky Way, absolute magnitudes brighter than, added material, advanced study, African American • Index filter: planetary scientists, convex shape, easiest way, Milky Way, absolute magnitudes, added material, advanced study, African American • Topic filter: planetary scientists, Milky Way
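The first two stages on this slide (n-gram candidates, then an index filter) can be sketched roughly as follows. This is a minimal illustration, not the WordSmith implementation: the stop list, the function names, and the rule "drop any n-gram containing a stop word" are all assumptions made for the example.

```python
import re

# Hypothetical stop list; the slides do not describe WordSmith's actual filters.
STOP_WORDS = {"the", "a", "an", "of", "as", "up", "with", "than",
              "came", "about", "think", "welled", "beneath"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (with an optional 's)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def ngrams(tokens, n_max=3):
    """Yield every contiguous 2..n_max word sequence as a tuple."""
    for n in range(2, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def index_filter(candidates):
    """Assumed rule: keep only candidates that contain no stop word."""
    return [c for c in candidates if not any(w in STOP_WORDS for w in c)]

sentence = ("Planetary scientists think the convex shape came about "
            "as lava welled up beneath the crater's solid floor")
kept = index_filter(ngrams(tokenize(sentence)))
```

A topic filter would then run over `kept`; the strategies listed on the following slides are candidates for that stage.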

  11. Strategies in the topic filter • Word/phrase frequency and strength of association • “Knowledge-poor” text analysis • More sophisticated but computable text analysis

  12. Word and phrase frequencies • Word/phrase frequency. High: dublin core, metadata, element, electronic resources. Low: availability period, background, applicable terminologies • Weighted frequency: (1) core element, date element, metadata element; (2) author name, entity name, corporate name; (3) HTML tag, end tag, meta tag
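As a rough sketch of "frequency and strength of association", the snippet below counts bigrams and scores each with pointwise mutual information (PMI). The slides do not name the association measure WordSmith used, so PMI is a stand-in chosen for illustration, and `bigram_scores` is a hypothetical helper.

```python
import math
from collections import Counter

def bigram_scores(tokens):
    """Return {bigram: (raw count, PMI in bits)} for adjacent word pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI > 0 means the pair co-occurs more often than chance predicts.
        scores[(w1, w2)] = (count, math.log2(p_pair / (p_w1 * p_w2)))
    return scores

tokens = ("dublin core metadata element dublin core "
          "element set the metadata element").split()
scores = bigram_scores(tokens)
```

On a sample this small the numbers mean little; the point is only that a recurring pair such as "dublin core" scores higher than a one-off adjacency such as "core metadata".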

  13. Knowledge-poor techniques 1: parts of speech in local context • Some noun phrase heads usually appear in text only with adjective or noun modifiers. holes: black holes, grey holes, central holes • Others usually appear without modifiers. galaxy: cartwheel galaxies, spiral galaxy; a galaxy; if galaxies; ...however, galaxy formation
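One way to make the modifier observation computable is to measure, for each head noun, how often it appears with a modifier. The sketch below assumes the noun phrases have already been chunked and their heads reduced to singular form; both the input format and the `modified_ratio` helper are assumptions for illustration.

```python
from collections import defaultdict

def modified_ratio(noun_phrases):
    """Fraction of each head's occurrences that carry at least one modifier.

    Each phrase is a space-separated string whose last word is the head.
    """
    total = defaultdict(int)
    modified = defaultdict(int)
    for phrase in noun_phrases:
        words = phrase.split()
        head = words[-1]
        total[head] += 1
        if len(words) > 1:
            modified[head] += 1
    return {head: modified[head] / total[head] for head in total}

phrases = ["black hole", "grey hole", "central hole",
           "cartwheel galaxy", "spiral galaxy", "galaxy", "galaxy", "galaxy"]
ratios = modified_ratio(phrases)
```

Under this reading, a head with a ratio near 1.0 ("hole") is a poor topic term on its own, while a head with a low ratio ("galaxy") is a plausible single-word topic.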

  14. Consequences • We can identify topical single terms: galaxy, star, sun, moon; government, abortion, communism; metadata, html, Internet, information • We can create subject taxonomies: galaxy (-ies): cartwheel galaxy, elliptical galaxy, spiral galaxy; *hole(s): black holes, drill holes, grey holes
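The taxonomies on this slide amount to grouping phrases under a shared head. A minimal sketch, assuming phrases arrive as strings with singular heads (the `group_by_head` name and input format are illustrative only):

```python
from collections import defaultdict

def group_by_head(phrases):
    """Group multiword phrases under their final word (the head)."""
    taxonomy = defaultdict(set)
    for phrase in phrases:
        words = phrase.split()
        if len(words) > 1:  # single words have no narrower term to file
            taxonomy[words[-1]].add(phrase)
    return {head: sorted(members) for head, members in taxonomy.items()}

taxonomy = group_by_head([
    "cartwheel galaxy", "elliptical galaxy", "spiral galaxy",
    "black hole", "drill hole", "grey hole",
])
```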

  15. Knowledge-poor techniques 2: subject probes • Goal: to get high-quality subject terms • Look for indications that something is talked about, written about, or studied: topics in, study of, analysis of, (on the) subject of, major in, is called, is known as • Probes differ in specificity. • topics in: sciences, arts, humanities, library science, astronomy, physics, business, data visualization, computer science, mathematics, computer and network security, mathematics, number theory, medicine • analysis of: metabolic regulation, numerical analysis, saline water phenomena, coals, iron ore, cereal grains, income dynamics among men, working hours, inflation, mass belief systems, aerial photography
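A subject probe can be approximated with a pattern match followed by a crude phrase-boundary rule. In the sketch below the probe list comes from the slide, but the boundary words, the four-word cap, and the `probe_terms` function are assumptions; a real system would presumably use better phrase detection.

```python
import re

PROBES = ["topics in", "study of", "analysis of", "subject of",
          "major in", "is called", "is known as"]

# Hypothetical boundary words that end the captured term.
BOUNDARY_WORDS = {"and", "or", "the", "a", "an", "in", "of", "for",
                  "that", "which", "is", "are", "to", "by"}

def probe_terms(text, max_words=4):
    """Return (probe, term) pairs for the words that follow each probe."""
    results = []
    lowered = text.lower()
    for probe in PROBES:
        pattern = r"\b" + re.escape(probe) + r"\s+([a-z\s-]+)"
        for match in re.finditer(pattern, lowered):
            words = []
            for word in match.group(1).split():
                if word in BOUNDARY_WORDS or len(words) == max_words:
                    break
                words.append(word)
            if words:
                results.append((probe, " ".join(words)))
    return results

found = probe_terms("She wrote a study of number theory and an analysis of "
                    "iron ore. The field is known as data visualization.")
```

Because a leading article counts as a boundary here, "study of the galaxy" yields nothing; that is a limitation of this sketch, not a claim about the original technique.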

  16. More clues can be identified with “knowledge-rich” processing • “You can sum up the big difference between beans on the one hand and Java applets and applications on the other in one word (okay, two words): component model. Chapter 2 contains a nice, thorough discussion of component models (which is a pretty important concept, so I devoted an entire chapter to the subject).” • Source: Emily Vander Veer. Java Beans for Dummies. Chicago, IL: IDG Books Worldwide, 1997, p. 14.

  17. Some results

  18. Terminology lists: tokenizing vs. indexing • Tokenizing: have, havei, havel, haven, havens, havera, haverty, havey, havice, havill, havilland • Indexing: health care, health care coverage, health insurance, housing, housing policy, ..., world trade, world trade accord, world trade agreement, world trade center, world trade center bombing

  19. Terminology extraction works best with: • Full text • Collections of text, not isolated documents • Text from a single subject domain • Algorithms that are tuned to the style of the text

  20. An application: browse displays

  21. Organizing terminology [Diagram: extracted terms linked by relation type. Relation labels: Broad/Narrow (B/N), Ellipsis, Acronym, Coordination. Terms shown: Dewey Subject Headings, Dewey call numbers, Library of Congress Subject Headings, Dewey Decimal, Dewey numbers, Dewey decimal classification, DDC and LCSH, numbers, cutter numbers, DDC]

  22. An application: a topic map for a collection of Web resources

  23. Another application: a terminology server

  24. Mapping vocabulary to library classification schemes • Explicit • For each document in a collection, extract terminology using WordSmith. • Assign Dewey Decimal Classification (DDC) numbers using Scorpion. • Identify the highest associations between extracted terms and DDC numbers. • Implicit • Make both sources of subject information available in a user interface.
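The explicit mapping can be sketched as a co-occurrence count between extracted terms and assigned class numbers. The input below stands in for WordSmith and Scorpion output, whose real formats are not shown in the slides; the class numbers are illustrative strings, and scoring by the share of a term's documents is an assumed, simple choice of association measure.

```python
from collections import Counter, defaultdict

def best_ddc_per_term(docs, min_docs=2):
    """Map each term to its most frequent DDC number.

    docs: iterable of (terms, ddc_number) pairs, one per document.
    Returns {term: (ddc_number, share of the term's documents)}.
    Terms seen in fewer than min_docs documents are skipped.
    """
    pair_counts = defaultdict(Counter)
    for terms, ddc in docs:
        for term in set(terms):  # count each term once per document
            pair_counts[term][ddc] += 1
    mapping = {}
    for term, counts in pair_counts.items():
        total = sum(counts.values())
        if total < min_docs:
            continue
        ddc, count = counts.most_common(1)[0]
        mapping[term] = (ddc, count / total)
    return mapping

# Hypothetical per-document output: extracted terms plus one class number.
docs = [
    (["air pollution", "emissions"], "363.73"),
    (["air pollution", "smog"], "363.73"),
    (["air pollution", "scrubbers"], "628.5"),
    (["scrubbers"], "628.5"),
]
mapping = best_ddc_per_term(docs)
```

The `min_docs` threshold reflects the point made on the next slide: associations are only trustworthy when they are drawn from a large collection.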

  25. Terminology mapping works best when: • The upstream processes for extracting terminology are clean. • It operates on a large collection of domain-specific text. • The classification scheme is simplified.

  26. The Desire database of Web documents about engineering

  27. Science aspects

  28. Social science aspects

  29. Links to documents about other types of pollution

  30. In sum • We can automatically extract useful terminology from full text. • The terminology can be embedded in applications of varying complexity. • There is a tradeoff between accuracy and technical sophistication.

  31. For more information • Godby, Jean and Reighart, Ray. 1998. “The WordSmith indexing system.” Accessible at: http://www.oclc.org/oclc/research/publications/review98/godby_reighart/wordsmith.htm • Godby, Jean; Miller, Eric; and Reighart, Ray. 2000. “Automatically generated topic maps of World Wide Web resources.” Accessible at: http://www.oclc.org/oclc/research/publications/review99/godby/topicmaps.htm • Godby, Jean and Reighart, Ray. 2001. “Terminology identification in a collection of Web resources.” In: K. Calhoun and J. Riemer, eds. CORC: New tools and possibilities for electronic resource description. New York: The Haworth Press, Inc., 49-66.
