multimatch Gareth J. F. Jones Dublin City University, Ireland

The MultiMatch Project: Searching Multilingual Multimedia Cultural Heritage Collections www.multimatch.org Gareth J. F. Jones Dublin City University, Ireland

Background • Problem: There is a wealth of fragmented cultural heritage (CH) information available from multiple sources, but users are left to discover, interpret and aggregate it themselves using general search tools. • MultiMatch Goal: To provide enhanced multilingual access to a multimedia collection of cultural heritage objects.

Objectives • Develop a search engine that provides targeted, enriched access to heterogeneous CH objects: • across media types and language boundaries • supporting various user classes • with aggregate views on complex task scenarios • Assist CH institutions to raise visibility and disseminate content.

The Consortium Academia • Istituto di Scienza e Tecnologie dell’Informazione (ISTI-CNR) • University of Sheffield (USFD) • Dublin City University (DCU) • University of Amsterdam (UvA) • University of Geneva (UniGE) • Universidad Nacional de Educación a Distancia (UNED) Industry • OCLC PICA (FDI) • WIND Telecomunicazioni S.p.A. (WIND) Cultural Heritage • Fratelli Alinari Istituto Edizioni Artistiche SpA (Alinari) • Netherlands Institute for Sound and Vision (BandG) • University of Alicante – Biblioteca Virtual Miguel de Cervantes (UA-BVMC)

[Diagram: content sources feeding MultiMatch through crawling and acquisition. Web resources: museums, libraries, archives, newspapers, news agencies, personal pages, blogs. Museum databases: Van Gogh Museum (NL), National Gallery (UK), Musée d’Orsay (FR).]

Overview • System architecture • Indexing • Multilingual Search • Four languages for prototype 1 • English, Dutch, Italian, Spanish • For prototype 2 + German, Polish • Multimedia Search • Image, video, audio

System Architecture • MultiMatch Interface (GWT based) • Client-side interface • Entry point • Widgets • UI logic (client side) • RPCs • Server-side interface (via Java API) • Interface services • UI logic (server side) • Connection with MM core using web services API • Java API (to make calls to Web Services) • Web Services • MultiMatch CORE: MILOS + GIFT + Lucene

Indexing: Text • Text indexing • Web pages • Metadata describing videos and images • Speech recognition transcripts • Translation of documents • Transformation of metadata • Converts native formats (Alinari, BandG, BVMC, TEL) to MultiMatch metadata format using XSLT • Pre-processing • Stemming and Stopword removal
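
The pre-processing step named above (stemming and stopword removal) can be sketched as follows. This is a minimal illustration, not the MultiMatch implementation: the stopword list is a tiny hypothetical sample, `toy_stem` is a stand-in suffix stripper rather than a real stemming algorithm, and the tokeniser handles unaccented Latin letters only.

```python
import re

# Tiny illustrative stopword list; a real system would use a full list per language.
STOPWORDS = {"a", "an", "and", "by", "in", "of", "the", "to"}

# Suffixes stripped by this toy stemmer, longest first. This is NOT the Porter
# algorithm, only a stand-in that shows where stemming sits in the pipeline.
SUFFIXES = ("ings", "ing", "es", "ed", "s")


def toy_stem(token: str) -> str:
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Lowercase, tokenise, drop stopwords, then stem what is left."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]
```

The same function would be applied to documents at indexing time and to queries at search time, so that both sides are reduced to the same index terms.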

Indexing: Speech • Indexed audio corpus • Generate Automatic Speech Recognition (ASR) transcriptions using test Nuance Dragon NaturallySpeaking 9 SDK Server Edition • Prototype 1: 20 hours of podcasts in all 4 languages • Transcription format • Flexible with respect to granularity • Links in to MultiMatch metadata format • Challenge: create appropriate units for retrieval

Indexing: Video • Separate modalities while maintaining alignment • Visual: Keyframe extraction • Audio: Speech recognition transcription • Challenges similar to speech indexing • Determining appropriate document units: • Documents vs articles vs shots • Combining information sources

Indexing: Image • Indexes • Still images: Alinari images (5,000) • Video keyframes from BandG (ca. 14,000) • GIFT (GNU Image Finding Tool) • Content-based (visual only) image retrieval service • Embedded into metadata repository • Low-level MPEG-7 image features

Metadata Search • MILOS • Metadata search, e.g. content source, dates, authors • Visual search: GIFT plug-in • Textual search: Lucene plug-in • Metadata search can be used in conjunction with visual and textual search to constrain result set.

Text and Speech Search • Text and speech retrieval using Lucene • Speech and video transcripts are handled as text • Prototype 1: Default Lucene information retrieval model • Prototype 2: Investigated use of the Okapi BM25 model
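
For readers unfamiliar with Okapi BM25, the ranking function can be sketched in a few lines. This is a generic textbook-style illustration, not the code used in Prototype 2: the parameter values `k1=1.2` and `b=0.75` are common defaults assumed here, the IDF uses a smoothed variant that stays non-negative, and documents are assumed to be non-empty lists of already pre-processed tokens.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document in `docs` against the tokenised `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Smoothed IDF (the "1 +" inside the log keeps it non-negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1; b controls length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Because speech and video transcripts are handled as text, the same scoring would apply to them unchanged once they are tokenised.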

Visual Search • GIFT search engine integrated as a service provider (plug-in) into the larger MILOS-based architecture • Open source content-based image retrieval software • Low-level MPEG-7 image features • colour and texture • local and global scales

Query Expansion Summary • Thesaurus expansion • Terms from EuroWordNet • Relevance feedback • Terms added from the user-selected relevant documents • Blind feedback • Terms added from the top system-ranked, assumed-relevant documents
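
Blind feedback as described above can be sketched as follows. This is a minimal illustration under stated assumptions: expansion terms are chosen by raw frequency in the top-ranked documents, and the cut-offs `top_docs` and `new_terms` are hypothetical parameters. The slides do not say how MultiMatch selected or weighted terms.

```python
from collections import Counter


def blind_feedback(query, ranked_docs, top_docs=2, new_terms=2):
    """Expand `query` with frequent terms from the top-ranked documents.

    `ranked_docs` is a list of tokenised documents, best first. The top
    `top_docs` are assumed relevant without asking the user.
    """
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(t for t in doc if t not in query)
    # Sort by descending frequency, then alphabetically for a stable result.
    candidates = sorted(counts, key=lambda t: (-counts[t], t))
    return list(query) + candidates[:new_terms]
```

Relevance feedback follows the same pattern, except that the documents passed in are the ones the user marked as relevant rather than the top of the ranking.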

Cross-lingual Search (Italian-to-Dutch example) • Query (in Italian) • Query translation • Query (in Dutch) • Lucene search, with a separate index for each language (NL, IT, EN, ES) • Ranked documents (in Dutch) • Document translation • Ranked documents (in Italian)

Multilingual Search • Searching documents in a specific language • English, Italian, Spanish, Dutch, German, Polish • Query translation • Separate index and search for each language • Searching documents in all six languages • Translate all non-English documents to English • Store all in a single English index • Translate incoming queries to English

Machine Translation • WorldLingo commercial machine translation system used under licence • Supports all 20 directed language pairs (5 × 4) for the five languages for which MT was available • Easy to use and integrate into prototype • Well-documented API

Machine Translation • MT is able to provide reasonable translations for general terms • Not sufficient for domain-specific terms (in particular, multiple-word phrases) • Personal names • Organization names • Location names • Titles of art works

Hybrid Translation • MultiMatch improves translation accuracy of phrases previously untranslated or inappropriately translated by a standard MT system • Objective: improve the CLIR effectiveness and facilitate MLIA • Augmented MT combining domain-specific dictionaries mined from the web

Dictionary-based Query Translation Example Original English Query Italian Translation Processed English Query

Domain-specific Dictionary Construction • Multilingual Wikipedia: a Wikipedia page written in one language can contain hyperlinks to its counterparts in other languages; the titles and basenames are translation pairs. • For example …

Hyperlink Feature of Wikipedia

Dictionary Construction Process A 3-stage automatic process: • Crawling the English Wikipedia, Category:Culture (pages and subcategories) • Extracting hyperlinks to query languages (Italian and Spanish) • Generating translation pairs using hyperlink basenames Multiple-word phrases were added into a phrase dictionary for each language
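
The last two stages of this process (extracting counterpart hyperlinks and generating translation pairs from their basenames) can be sketched as below. This is an illustration only: the crawl is assumed to have already produced a mapping from each English page URL to the URLs of its counterparts, the function names are hypothetical, and the URLs in the usage note are toy input written for this example.

```python
from urllib.parse import unquote, urlparse


def basename_title(url):
    """Return (language code, page title) for a Wikipedia page URL."""
    parsed = urlparse(url)
    lang = parsed.netloc.split(".")[0]
    # The basename is the last path component; underscores stand for spaces.
    title = unquote(parsed.path.rsplit("/", 1)[-1]).replace("_", " ")
    return lang, title


def build_dictionaries(pages, target_langs=("it", "es")):
    """Build word and phrase dictionaries.

    `pages` maps an English page URL to the URLs of its counterparts.
    """
    words = {lang: {} for lang in target_langs}
    phrases = {lang: {} for lang in target_langs}
    for english_url, counterpart_urls in pages.items():
        _, source = basename_title(english_url)
        for url in counterpart_urls:
            lang, translation = basename_title(url)
            if lang not in target_langs:
                continue
            # Multiple-word titles go into the phrase dictionary.
            target = phrases if " " in source else words
            target[lang][source.lower()] = translation
    return words, phrases
```

Given a page such as `https://en.wikipedia.org/wiki/The_Starry_Night` with an Italian counterpart `https://it.wikipedia.org/wiki/Notte_stellata`, the phrase dictionary for Italian would gain the entry "the starry night" → "Notte stellata".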

Lexicon Coverage The number of distinct entries in each of the language pairs: • EN‒ES: 137,897 • ES‒EN: 110,568 • EN‒IT: 133,570 • IT‒EN: 114,237 • EN‒NL: 151,147 • NL‒EN: 131,652 • ES‒IT: 67,470 • IT‒ES: 67,498 • ES‒NL: 66,625 • NL‒ES: 66,005 • IT‒NL: 82,006 • NL‒IT: 81,503

Hybrid Translation Process • Dictionary-based phrase translation • Lexical rule-based phrase identification • Phrase translation • WorldLingo machine translation • For both the query and the phrases detected • Phrase translation validation • For each of the recognized phrases, replace its WorldLingo translation by the translation(s) from our domain-specific dictionary, if they are not identical.
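
The validation step can be sketched as follows. This is a simplified illustration, not the MultiMatch code: phrases are identified by plain substring lookup in the phrase dictionary rather than by lexical rules, `machine_translate` is any callable standing in for the MT service, and `toy_mt` with its word list is a made-up stand-in whose output is deliberately poor.

```python
def hybrid_translate(query, phrase_dict, machine_translate):
    """Translate `query`, preferring dictionary translations for known phrases.

    `phrase_dict` maps lowercase source phrases to target phrases.
    `machine_translate` is any callable str -> str standing in for the MT system.
    """
    translated = machine_translate(query)
    lowered = query.lower()
    # Longest phrases first, so a long title wins over a phrase inside it.
    for phrase in sorted(phrase_dict, key=len, reverse=True):
        if phrase not in lowered:
            continue
        mt_phrase = machine_translate(phrase)
        dict_phrase = phrase_dict[phrase]
        # Validation step: substitute only where the two translations differ.
        if mt_phrase != dict_phrase and mt_phrase in translated:
            translated = translated.replace(mt_phrase, dict_phrase)
    return translated


# Hypothetical word-by-word "MT" used only to exercise the function.
TOY_MT = {"the": "il", "starry": "stellato", "night": "notte", "paintings": "dipinti"}


def toy_mt(text):
    return " ".join(TOY_MT.get(word, word) for word in text.lower().split())
```

With the phrase dictionary `{"the starry night": "Notte stellata"}`, the query "the starry night paintings" is first machine-translated word by word, and the mistranslated title is then replaced by the dictionary entry, while the rest of the MT output is kept.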

Hybrid Query Translation Example

Some Translation Examples

Overview User Interface • Common entry point for (mainly) general users • Starting point for specialised interfaces • Results aggregated across sources and media types • Users can add/remove results windows • Web, Archives, Video, Image, Audio, Creators, RSS • Use of “infinite” scrolling bar • Links to metadata, term clouds, snippets and related terms • Users can filter language of results • Functionality for cross-language support • Use of combined MT and dictionary service • Ability to select/de-select alternative translations • Search all languages (English interlingua) • Localisation (Dutch, Spanish and German)

Example Overview UI

Image UI • Simple functionality to access visual material • Assume that users begin with verbalised query • Results include • Image (thumbnail), title and URN • Users can also view • Metadata, term cloud, snippet, related terms, similar images • Find similar images (“more like this”) • Images with similar visual content • Images with similar visual content and semantic content matching the query (uses fusion service)

Audio UI • Generic playback for any web browser • JavaScript-based wrapper provides playback controls independent of audio file type • Segments in transcribed audio which match query terms • Two approaches for presenting transcripts

Transcript Presentation • Transcript browse • Clickable audio segments • Transcript viewable • Term cloud browse • Clickable audio segments • Term cloud items can be arranged alphabetically or by time
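
The two arrangements of term cloud items can be sketched with a small data structure. This is an assumption-laden illustration: the `CloudItem` type and its `start_seconds` field (the start time of the audio segment a term links to) are hypothetical names, not taken from the MultiMatch interface code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloudItem:
    term: str
    start_seconds: float  # start time of the audio segment the term links to


def arrange(items, order="alphabetical"):
    """Return term cloud items sorted alphabetically or by segment time."""
    if order == "alphabetical":
        return sorted(items, key=lambda item: item.term.lower())
    if order == "time":
        return sorted(items, key=lambda item: item.start_seconds)
    raise ValueError(f"unknown order: {order!r}")
```

Keeping the segment time on each item is what would make every term clickable: selecting a term starts playback at that offset, whichever ordering is displayed.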

Video UI • Similar look-and-feel to audio UI • Users can view metadata, term clouds, snippets and related terms • Initial search queries video metadata • Also provided a search within results based on transcripts (requested by video experts) • For selected video users can • Playback video from various starting points • View keyframes and associated transcript text • Search for (and playback) specific keyframe segments containing keywords

Creators UI • Utilisation of MultiMatch metadata schema and annotation • Provides biographical information and cross-referencing • Provides users with access to biographical information related to a particular artist • Birth/death places/dates, nationality, alternative names, description • Query suggestions (names of artists) • List of creations by selected artist • Link to page for selected artist displaying • Related creations (images and videos) • Related web pages • “More like this” for images (derived from image UI)

Non-Integrated Extensions

Collection Browser: Motivation • A query places the user at some point of the information space • Feedback may be used to explore further • However, relevance feedback makes it difficult to control the exploration and converge to a certain point • Similarity-based browsing effectively completes the query mechanism

Implementation (http://viper.unige.ch/explore) • A basic search interface (text-based) • Locates initial relevant documents • A document viewer • Selects a seed for browsing and shows details • A collection browser • Allows similarity-based browsing

Browser Interface • Sampling of the collection content for fast navigation • Clicking on any non-central image brings it to the centre • Clicking on the central image returns to the search view • Possible dimensions for display

Semantic Browsing • Design of prototype faceted browsing system • Material crawled from Tate Online • Derived from Tate Collection (30,000 artworks, 3,000 artists) • Linked together Tate Collection and ULAN (100,000 artists) • Content can be accessed using one of four views • Artist view (search artists and titles of artworks) • Artwork view (more information on artworks and artwork thumbnails) • Timeline view of artist’s birth/death dates • Map view of artist’s birth places

Dynamic Summarisation (DCU) • Focused browsing provides query-biased summary (term clouds) of hyperlinked pages

Conclusions • MultiMatch successfully developed a system for multilingual multimedia search for the CH domain • Standard text, speech and video indexing tools are used as the underlying platform • Domain-specific methods were developed to support improved multilingual studies • An integrated demonstration system was deployed • Extensive scope for extending the existing MultiMatch demonstration system and developing new indexing, search and interaction tools

Questions?
