CS 430 / INFO 430 Information Retrieval

Presentation Transcript


  1. CS 430 / INFO 430 Information Retrieval Lecture 22 Metadata 4

  2. Course Administration

  3. Automated Creation of Metadata Records
  Sometimes it is possible to generate metadata automatically from the content of a digital object. The effectiveness varies from field to field. Examples:
  • Images -- characteristics of color, texture, shape, etc. (crude)
  • Music -- optical recognition of score (good)
  • Bird song -- spectral analysis of sounds (good) (see the sketch below)
  • Fingerprints (good)
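As a rough illustration of the bird-song example, the sketch below uses NumPy to reduce an audio clip to a small spectral feature vector that could be stored as automatically generated metadata. It is a minimal sketch under assumed parameters (band count, synthetic signal), not the method of any particular system.

    import numpy as np

    def spectral_features(signal, sample_rate, n_bands=8):
        """Summarize an audio clip as mean power in a few frequency bands."""
        spectrum = np.abs(np.fft.rfft(signal)) ** 2               # power spectrum
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
        edges = np.linspace(0, sample_rate / 2, n_bands + 1)      # equal-width bands
        return np.array([spectrum[(freqs >= lo) & (freqs < hi)].mean()
                         for lo, hi in zip(edges[:-1], edges[1:])])

    # Synthetic "bird song": a 3 kHz tone sampled at 22,050 Hz (placeholder data).
    rate = 22050
    t = np.linspace(0, 1.0, rate, endpoint=False)
    clip = np.sin(2 * np.pi * 3000 * t)
    print(spectral_features(clip, rate))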

  4. Automated Information Retrieval Using Feature Extraction
  Example: features extracted from images
  • Spectral features: color or tone, gradient, spectral parameters, etc.
  • Geometric features: edge, shape, size, etc.
  • Textural features: pattern, spatial frequency, homogeneity, etc.
  Features can be recorded in a feature vector space (as in a term vector space). A query can be expressed in terms of the same features. Machine learning methods, such as a support vector machine, can be used with training data to create a similarity metric between image and query.
  Example: searching satellite photographs for dams in California
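The sketch below shows the feature-vector idea at its simplest: images and a query are described by the same features and ranked by cosine similarity. The feature names and values are invented for illustration; as the slide notes, a real system would typically learn the similarity metric, for example with a support vector machine.

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Hypothetical feature vectors: [mean color value, edge density, texture homogeneity]
    images = {
        "photo_01": np.array([0.82, 0.10, 0.55]),
        "photo_02": np.array([0.20, 0.75, 0.30]),
        "photo_03": np.array([0.25, 0.70, 0.35]),
    }

    # The query is expressed in the same feature space.
    query = np.array([0.22, 0.72, 0.33])

    ranking = sorted(images, key=lambda name: cosine(query, images[name]), reverse=True)
    print(ranking)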

  5. Example: Blobworld

  6. Effective Information Discovery With Homogeneous Digital Information
  Comprehensive metadata with Boolean retrieval: can be excellent for well-understood categories of material, but requires standardized metadata and relatively homogeneous content (e.g., MARC catalogs).
  Full-text indexing with ranked retrieval: can be excellent, but the methods were developed and validated for relatively homogeneous textual material (e.g., the TREC ad hoc track).
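To make the contrast concrete, here is a minimal, hypothetical sketch (not from the lecture) of the two styles: an exact Boolean filter over structured metadata fields, and a tf-idf-style ranked search over free text.

    from collections import Counter
    from math import log

    records = [
        {"id": 1, "subject": "volcanoes", "creator": "Smith",
         "text": "field observations of active volcanoes in iceland"},
        {"id": 2, "subject": "earthquakes", "creator": "Jones",
         "text": "seismic records of earthquakes near volcanoes"},
    ]

    # Boolean retrieval over metadata: a record either matches or it does not.
    boolean_hits = [r["id"] for r in records
                    if r["subject"] == "volcanoes" and r["creator"] == "Smith"]

    # Ranked retrieval over full text: score every record, then sort.
    def tf_idf(query, doc, docs):
        tf = Counter(doc["text"].split())
        score = 0.0
        for term in query.split():
            df = sum(1 for d in docs if term in d["text"].split())
            if df:
                score += tf[term] * log(len(docs) / df)
        return score

    ranked = sorted(records, reverse=True,
                    key=lambda d: tf_idf("volcanoes iceland", d, records))
    print(boolean_hits, [d["id"] for d in ranked])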

  7. Mixed Content
  Examples: NSDL-funded collections at Cornell
  • Atlas: data sets of earthquakes, volcanoes, etc.
  • Reuleaux: digitized kinematics models from the nineteenth century.
  • Laboratory of Ornithology: sound recordings, images, and videos of birds and other animals.
  • Nuprl: logic-based tools to support programming and to implement formal computational mathematics.

  8. Mixed Metadata: the Chimera of Standardization
  • Technical reasons
  • Characteristics of formats and genres
  • Differing user needs
  • Social and cultural reasons
  • Economic factors
  • Installed base

  9. Information Discovery in a Messy World
  Building blocks:
  • Brute force computation
  • The expertise of users -- human in the loop
  Methods:
  (a) Better understanding of how and why users seek for information
  (b) Relationships and context information
  (c) Multi-modal information discovery
  (d) User interfaces for exploring information

  10. Understanding How and Why Users Seek for Information
  Homogeneous content:
  • All documents are assumed equal
  • Criterion is relevance (a binary measure)
  • Goal is to find all relevant documents (high recall)
  • Hits ranked in order of similarity to the query
  Mixed content:
  • Some documents are more important than others
  • Goal is to find the most useful documents on a topic and then browse
  • Hits ranked in an order that combines importance and similarity to the query (see the sketch below)
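A minimal sketch of the last point, with assumed weights and scores (none of this comes from the lecture): the final ranking blends a query-independent importance value with similarity to the query.

    def combined_score(importance, similarity, alpha=0.4):
        """Blend query-independent importance with query similarity.

        alpha is an assumed tuning parameter; a real system would set it
        experimentally or learn it from user behaviour.
        """
        return alpha * importance + (1 - alpha) * similarity

    # Hypothetical hits: (document id, importance, similarity to the query)
    hits = [("d1", 0.9, 0.3), ("d2", 0.1, 0.9), ("d3", 0.7, 0.7)]

    ranked = sorted(hits, key=lambda h: combined_score(h[1], h[2]), reverse=True)
    print([doc_id for doc_id, _, _ in ranked])   # ['d3', 'd2', 'd1']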

  11. Automatic Creation of Surrogates for Non-textual Materials
  Discovery of non-textual materials usually requires surrogates.
  • How far can these surrogates be created automatically?
  • Automatically created surrogates are much less expensive than manually created ones, but have high error rates.
  • If surrogates have high rates of error, is it possible to have effective information discovery?

  12. Example: Informedia Digital Video Library
  Collections: segments of video programs, e.g., TV and radio news and documentary broadcasts, from Cable Network News, the British Open University, and WQED television.
  Segmentation: automatically broken into short segments of video, such as the individual items in a news broadcast.
  Size: more than 4,000 hours, 2 terabytes.
  Objective: research into automatic methods for organizing and retrieving information from video.
  Funding: NSF, DARPA, NASA, and others.
  Principal investigator: Howard Wactlar (Carnegie Mellon University).

  13. Informedia Digital Video Library History
  • Carnegie Mellon has broad research programs in speech recognition, image recognition, and natural language processing.
  • 1994: a basic mock-up demonstrated the general concept of a system using speech recognition to build an index from a sound track, matched against spoken queries. (DARPA funded.)
  • 1994-1998: Informedia developed the concept of multi-modal information discovery with a series of user interface experiments. (NSF/DARPA/NASA Digital Libraries Initiative.)
  • 1998 onward: continued research, particularly in human-computer interaction. A commercial spin-off failed.

  14. The Challenge
  A video sequence is awkward for information discovery:
  • Textual methods of information retrieval cannot be applied.
  • Browsing requires the user to view the sequence; fast skimming is difficult.
  • Computing requirements are demanding (MPEG-1 requires 1.2 Mbits/sec; see the back-of-the-envelope figure below).
  Surrogates are required.
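A rough back-of-the-envelope figure (not on the slide, but consistent with the collection size quoted on slide 12) shows why storage alone is demanding at this bitrate:

$$ 1.2\ \text{Mbit/s} \times 3600\ \text{s/h} \times 4000\ \text{h} \approx 1.7 \times 10^{13}\ \text{bits} \approx 2.2\ \text{TB}. $$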

  15. Multi-Modal Information Discovery
  The multi-modal approach to information retrieval:
  • Computer programs analyze video materials for clues, e.g., changes of scene.
  • Methods from artificial intelligence, e.g., speech recognition, natural language processing, image recognition.
  • Analysis of the video track, sound track, closed captioning if present, and any other information.
  • Each mode gives imperfect information; therefore use many approaches and combine the evidence.

  16. Multi-Modal Information Discovery
  With mixed content and mixed metadata, the amount of information about the various resources varies greatly, but clues from many different sources can be combined.
  "The fundamental premise of the research was that the integration of these technologies, all of which are imperfect and incomplete, would overcome the limitations of each, and improve the overall performance in the information retrieval task." [Wactlar, 2000]
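Below is a minimal sketch of combining imperfect evidence; the modality names and weights are assumptions for illustration, not Informedia's actual fusion method. Each mode scores a segment independently, and the scores are merged so that no single error-prone source dominates.

    # Per-modality scores for one video segment against a query (hypothetical values).
    evidence = {
        "speech_transcript": 0.62,   # from automatic speech recognition (error-prone)
        "screen_text_ocr":   0.10,   # text recognized on screen
        "closed_captions":   0.80,   # accurate when present, but often missing
    }

    # Assumed reliability weights; a missing modality simply contributes nothing.
    weights = {"speech_transcript": 0.4, "screen_text_ocr": 0.2, "closed_captions": 0.4}

    def fused_score(evidence, weights):
        present = [m for m in evidence if m in weights]
        total = sum(weights[m] for m in present)
        return sum(weights[m] * evidence[m] for m in present) / total if total else 0.0

    print(round(fused_score(evidence, weights), 3))   # 0.588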

  17. Informedia Library Creation
  [Diagram: video is segmented, and the text, audio, and video of each segment are analyzed by speech recognition, image extraction, and natural language interpretation to produce segments with derived metadata.]

  18. Text Extraction
  Sources:
  • Sound track: automatic speech recognition using the Sphinx II and III recognition systems (unrestricted vocabulary, speaker independent, multi-lingual, background sounds). Error rates of 25% and up.
  • Closed captions: digitally encoded text. (Not on all video; often inaccurate.)
  • Text on screen: can be extracted by image recognition and optical character recognition. (Matches a speaker with a name.) A sketch of merging these sources into one surrogate follows this slide.
  Query:
  • Spoken query: automatic speech recognition using the same system as is used to index the sound track.
  • Typed query: typed by the user.
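As a hedged illustration of how text from these sources might be merged into one searchable surrogate (the field names and boost values are assumptions, not Informedia's design), less reliable sources can be given lower weight when terms are counted:

    from collections import Counter

    # Text recovered for one segment from each source (hypothetical strings).
    segment_text = {
        "speech_transcript": "white house press briefing on the budget",
        "closed_captions":   "White House press briefing on the federal budget",
        "screen_ocr":        "WASHINGTON",
    }

    # Assumed per-source boosts, roughly reflecting their error rates.
    boosts = {"speech_transcript": 1.0, "closed_captions": 2.0, "screen_ocr": 1.5}

    def build_surrogate(segment_text, boosts):
        """Weighted term counts that a ranked-retrieval index could store."""
        terms = Counter()
        for source, text in segment_text.items():
            for token in text.lower().split():
                terms[token] += boosts.get(source, 1.0)
        return terms

    print(build_surrogate(segment_text, boosts).most_common(5))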

  19. Multimodal Metadata Extraction

  20. Informedia: Information Discovery
  [Diagram: the user queries via natural language and browses via multimedia surrogates; the system returns the requested segments and metadata from the store of segments with derived metadata.]

  21. Limits to Scalability
  Informedia has demonstrated effective information discovery with moderately large collections.
  Problems with increased scale:
  • Technical -- storage, bandwidth, etc.
  • Diversity of content -- difficult to tune heuristics
  • User interfaces -- complexity of browsing grows with scale

  22. Lessons Learned
  • Searching and browsing must be considered integrated parts of a single information discovery process.
  • Data (content and metadata), computing systems (e.g., search engines), and user interfaces must be designed together.
  • Multi-modal methods compensate for incomplete or error-prone data.

  23. Interoperability
  The problem:
  • Conventional approaches require partners to support agreements (technical, content, and business).
  • But a Web-based digital library program needs thousands of very different partners ... most of whom are not directly part of the program.
  The challenge is to create incentives for independent digital libraries to adopt agreements.

  24. Approaches to interoperability
  The conventional approach:
  • Wise people develop standards: protocols, formats, etc.
  • Everybody implements the standards.
  • This creates an integrated, distributed system.
  Unfortunately ...
  • Standards are expensive to adopt.
  • Concepts are continually changing.
  • Systems are continually changing.
  • Different people have different ideas.

  25. Interoperability is about agreements
  • Technical agreements cover formats, protocols, security systems so that messages can be exchanged, etc.
  • Content agreements cover the data and metadata, and include semantic agreements on the interpretation of the messages.
  • Organizational agreements cover the ground rules for access, for changing collections and services, payment, authentication, etc.
  The challenge is to create incentives for independent digital libraries to adopt agreements.

  26. Function versus cost of acceptance
  [Chart: cost of acceptance plotted against function; approaches with a low cost of acceptance attract many adopters, those with a high cost of acceptance attract few.]

  27. Example: security
  [Chart: security mechanisms placed on the cost-of-acceptance versus function curve, from IP address (low) through login ID and password to public key infrastructure (high).]

  28. Example: metadata standards
  [Chart: metadata standards placed on the cost-of-acceptance versus function curve, from free text (low) through Dublin Core to MARC (high).]

  29. NSDL: The Spectrum of Interoperability
  • Federation -- Agreements: strict use of standards (syntax, semantics, and business). Examples: AACR, MARC, Z39.50.
  • Harvesting -- Agreements: digital libraries expose metadata; a simple harvesting protocol and registry. Example: Open Archives metadata harvesting (see the sketch below).
  • Gathering -- Agreements: digital libraries do not cooperate; services must seek out information. Examples: Web crawlers and search engines.
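The harvesting level relies on the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Below is a minimal harvesting sketch; the repository base URL is a placeholder, and error handling and resumption-token paging are omitted.

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    # Placeholder repository; substitute a real OAI-PMH base URL.
    BASE_URL = "https://example.org/oai"

    # Ask the repository for its records in simple Dublin Core.
    params = urllib.parse.urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    with urllib.request.urlopen(f"{BASE_URL}?{params}") as response:
        tree = ET.parse(response)

    # Print the Dublin Core titles of the harvested records.
    DC = "{http://purl.org/dc/elements/1.1/}"
    for title in tree.iter(f"{DC}title"):
        print(title.text)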
