multimatch Gareth J. F. Jones Dublin City University, Ireland

Presentation Transcript


  1. The MultiMatch Project: Searching Multilingual Multimedia Cultural Heritage Collections www.multimatch.org Gareth J. F. Jones Dublin City University, Ireland

  2. Background • Problem: There is a wealth of fragmented CH information available from multiple sources, but users are left to discover, interpret and aggregate this themselves using general search tools. • MultiMatch Goal: To provide enhanced multilingual access to a multimedia collection of cultural heritage objects.

  3. Objectives • Develop a search engine that provides targeted, enriched access to heterogeneous CH objects: • across media types and language boundaries • supporting various user classes • with aggregate views on complex task scenarios • Assist CH institutions to raise visibility and disseminate content.

  4. The Consortium Academia • Istituto di Scienza e Tecnologie dell’Informazione (ISTI-CNR) • University of Sheffield (USFD) • Dublin City University (DCU) • University of Amsterdam (UvA) • University of Geneva (UniGE) • Universidad Nacional de Educación a Distancia (UNED) Industry • OCLC PICA (FDI) • WIND Telecomunicazioni S.p.A. (WIND) Cultural Heritage • Fratelli Alinari Istituto Edizioni Artistiche SpA (Alinari) • Netherlands Institute for Sound and Vision (BandG) • University of Alicante – Biblioteca Virtual Miguel de Cervantes (UA-BVMC)

  5. [Diagram] MultiMatch content sources • Web resources: museums, libraries, archives, newspapers, news agencies, personal pages, blogs • Museum databases: Van Gogh Museum (NL), National Gallery (UK), Musée d’Orsay (FR) • Processes shown: crawling, acquisition

  6. Overview • System architecture • Indexing • Multilingual Search • Four languages for prototype 1 • English, Dutch, Italian, Spanish • For prototype 2 + German, Polish • Multimedia Search • Image, video, audio

  7. System Architecture • MultiMatch Interface (GWT based) • Client-side interface: entry point, widgets, UI logic (client side), RPCs • Server-side interface (via Java API): interface services, UI logic (server side), connection with MM core using web services API • Java API (to make calls to Web Services) • Web Services • MultiMatch CORE: MILOS + GIFT + Lucene

  8. Indexing: Text • Text indexing • Web pages • Metadata describing videos and images • Speech recognition transcripts • Translation of documents • Transformation of metadata • Converts native formats (Alinari, BandG, BVMC, TEL) to MultiMatch metadata format using XSLT • Pre-processing • Stemming and Stopword removal
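
The pre-processing step on the slide above (stemming and stopword removal) can be sketched as follows. This is a minimal illustration only: the slides do not name the stemmer or the stopword lists used, so the tiny stopword set and the naive suffix-stripping function here are assumptions standing in for real per-language resources.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real system would use
# a full list for each indexed language.
STOPWORDS = {"a", "an", "and", "by", "in", "of", "the", "to"}

# Naive suffix stripping as a stand-in for a real stemmer (e.g. Porter).
SUFFIXES = ("ings", "ing", "es", "ed", "s")


def naive_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize on ASCII letters, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]
```

The same function would be applied to documents at indexing time and to queries at search time, so that both sides are reduced to the same terms.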

  9. Indexing: Speech • Indexed audio corpus • Generate Automatic Speech Recognition (ASR) transcriptions using test Nuance Dragon Naturally Speaking 9 SDK Server Edition • Prototype 1: 20 hours of Podcasts in all 4 languages • Transcription format • Flexible with respect to granularity • Links in to MultiMatch metadata format • Challenge: create appropriate units for retrieval

  10. Indexing: Video • Separate modalities while maintaining alignment • Visual: Keyframe extraction • Audio: Speech recognition transcription • Challenges similar to speech indexing • Determining appropriate document units: • Documents vs articles vs shots • Combining information sources

  11. Indexing: Image Indexes • Still images: Alinari images (5,000) • Video keyframes from BandG (ca. 14,000) GIFT (GNU Image Finding Tool) • Content-based (visual only) image retrieval service • Embedded into metadata repository • Low-level MPEG-7 image features

  12. Metadata Search • MILOS • Metadata search, e.g. content source, dates, authors • Visual search: GIFT plug-in • Textual search: Lucene plug-in • Metadata search can be used in conjunction with visual and textual search to constrain result set.
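
Using metadata to constrain a result set, as described on the slide above, amounts to filtering a ranked list by field values. The sketch below is not the MILOS API; the function name, the field names and the in-memory dictionary are all hypothetical.

```python
def constrain(ranked_ids, metadata, **filters):
    """Keep ranked results whose metadata matches every given filter.

    `ranked_ids` is a list of document ids in rank order (for example
    from a textual or visual search); `metadata` maps each id to a dict
    of field values. The original ranking order is preserved.
    """
    return [
        doc_id
        for doc_id in ranked_ids
        if all(metadata[doc_id].get(field) == value
               for field, value in filters.items())
    ]
```

For example, `constrain(ranked, metadata, source="Alinari")` would keep only the results whose hypothetical `source` field is `"Alinari"`.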

  13. Text and Speech Search • Text and speech retrieval using Lucene • Speech and video transcripts are handled as text • Prototype 1: Default Lucene information retrieval model • Prototype 2: Investigated use of the Okapi BM25 model
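
For readers unfamiliar with Okapi BM25, the following is a minimal sketch of the standard scoring formula, not the MultiMatch or Lucene implementation. The parameter values `k1=1.2` and `b=0.75` are conventional defaults and an assumption here, as is the choice of a non-negative IDF variant.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the `query` terms."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Non-negative IDF variant (the "+ 1" inside the logarithm).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Because speech and video transcripts are handled as text, the same scoring would apply to them once they are tokenized.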

  14. Visual Search GIFT search engine integrated as a service provider (plug-in) into the larger MILOS-based architecture. • Open source content-based image retrieval software • Low-level MPEG-7 image features • colour and texture • local and global scales

  15. Query Expansion Summary • Thesaurus expansion • Terms from EuroWordnet • Relevance feedback • Terms added from the user-selected relevant documents • Blind feedback • Terms added from the top system-ranked, assumed-relevant documents
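
Blind feedback, as summarised on the slide above, can be sketched as below. This is an illustration under simplifying assumptions: expansion terms are chosen by raw frequency in the top-ranked documents, whereas a real system would normally use a term-selection weight. The cut-offs `top_k` and `n_terms` are hypothetical.

```python
from collections import Counter


def blind_feedback(query, ranked_docs, top_k=2, n_terms=2):
    """Expand `query` with frequent terms from the top-ranked documents."""
    counts = Counter()
    for doc in ranked_docs[:top_k]:  # assumed relevant, no user input
        counts.update(t for t in doc if t not in query)
    # Sort by descending frequency, then alphabetically for determinism.
    best = sorted(counts, key=lambda t: (-counts[t], t))[:n_terms]
    return list(query) + best
```

Relevance feedback would follow the same pattern, with `ranked_docs[:top_k]` replaced by the documents the user has marked as relevant.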

  16. Cross-lingual Search: Italian-to-Dutch example • Query (in Italian) → query translation → query (in Dutch) • Lucene search, separate index for each language (NL, IT, EN, ES) • Ranked documents (in Dutch) → document translation → ranked documents (in Italian)

  17. Multilingual Search • Searching documents in a specific language • English, Italian, Spanish, Dutch, German, Polish • Query translation • Separate index and search for each language • Searching documents in all six languages • Translate all non-English documents to English • Store all in a single English index • Translate incoming queries to English
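
The two search strategies on the slide above can be contrasted in a few lines. Everything here is a toy stand-in: the translation table replaces a real MT service, the indexes are plain dictionaries of word sets rather than Lucene indexes, and all names are hypothetical.

```python
# Hypothetical toy translation table standing in for an MT service.
TOY_MT = {
    ("it", "nl"): {"girasoli": "zonnebloemen"},
    ("it", "en"): {"girasoli": "sunflowers"},
}


def translate(terms, src, tgt):
    """Translate term by term; unknown terms pass through unchanged."""
    if src == tgt:
        return list(terms)
    table = TOY_MT.get((src, tgt), {})
    return [table.get(t, t) for t in terms]


def search(index, terms):
    """Return the sorted ids of documents containing every query term."""
    return sorted(doc_id for doc_id, words in index.items()
                  if all(t in words for t in terms))


def search_one_language(indexes, terms, query_lang, doc_lang):
    """Strategy 1: translate the query, then search that language's index."""
    return search(indexes[doc_lang], translate(terms, query_lang, doc_lang))


def search_all_languages(english_index, terms, query_lang):
    """Strategy 2: translate the query to English, search the single index."""
    return search(english_index, translate(terms, query_lang, "en"))
```

The first strategy translates only the short query at search time; the second shifts the translation cost to indexing time, since every non-English document has to be translated into English before it is stored.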

  18. Machine Translation • WorldLingo commercial machine translation system used under licence • Supports all 20 language pairs for the five languages for which MT was available • Easy to use and integrate into prototype • Well-documented API
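
The figure of 20 language pairs for five languages follows from counting directed pairs: each of the five source languages can be translated into each of the four others, giving 5 × 4 = 20. A short check, using placeholder labels because the slides do not say which five languages had MT available:

```python
from itertools import permutations

# Placeholder labels: the slides do not say which five languages had MT.
languages = ["L1", "L2", "L3", "L4", "L5"]

# A language pair is directed: (source, target) differs from (target, source).
pairs = list(permutations(languages, 2))
pair_count = len(pairs)
```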

  19. Machine Translation • MT is able to provide reasonable translations for general terms • Not sufficient for domain-specific terms (in particular, multiple-word phrases) • Personal names • Organization names • Location names • Titles of art works

  20. Hybrid Translation • MultiMatch improves translation accuracy of phrases previously untranslated or inappropriately translated by a standard MT system • Objective: improve the CLIR effectiveness and facilitate MLIA • Augmented MT combining domain-specific dictionaries mined from the web

  21. Dictionary-based Query Translation Example Original English Query Italian Translation Processed English Query

  22. Domain-specific Dictionary Construction • Multilingual Wikipedia A Wikipedia page written in one language can contain hyperlinks to its counterparts in other languages: titles and basenames are translation pairs. • For example …

  23. Hyperlink Feature of Wikipedia

  24. Dictionary Construction Process A 3-stage automatic process: • Crawling the English Wikipedia, Category:Culture (pages and subcategories) • Extracting hyperlinks to query languages (Italian and Spanish) • Generating translation pairs using hyperlink basenames Multiple-word phrases were added into a phrase dictionary for each language
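
The second and third stages of the process above can be sketched as follows, assuming the crawl has already produced one record per English page. The record layout (`title`, `langlinks`) and the function names are hypothetical, and the sketch does no crawling itself.

```python
from urllib.parse import unquote, urlparse


def basename_to_title(url: str) -> str:
    """Turn the last path segment of a wiki URL into a readable title."""
    basename = urlparse(url).path.rsplit("/", 1)[-1]
    return unquote(basename).replace("_", " ")


def build_dictionaries(pages, target_lang):
    """Return (all_pairs, phrase_pairs) for English -> target_lang."""
    all_pairs, phrase_pairs = {}, {}
    for page in pages:
        link = page["langlinks"].get(target_lang)
        if link is None:
            continue  # no counterpart page in the target language
        source = page["title"].lower()
        target = basename_to_title(link).lower()
        all_pairs[source] = target
        if " " in source:  # multiple-word phrase
            phrase_pairs[source] = target
    return all_pairs, phrase_pairs
```

Given a record such as `{"title": "Sculpture", "langlinks": {"it": "https://it.wikipedia.org/wiki/Scultura"}}`, the pair `sculpture` / `scultura` is added to the Italian dictionary; only titles containing a space also go into the phrase dictionary.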

  25. Lexicon Coverage The number of distinct entries in each of the language pairs: • EN‒ES: 137897 • ES‒EN: 110568 • EN‒IT: 133570 • IT‒EN: 114237 • EN‒NL: 151147 • NL‒EN: 131652 • ES‒IT: 67470 • IT‒ES: 67498 • ES‒NL: 66625 • NL‒ES: 66005 • IT‒NL: 82006 • NL‒IT: 81503

  26. Hybrid Translation Process • Dictionary-based phrase translation • Lexical rule-based phrase identification • Phrase translation • WorldLingo machine translation • For both the query and the phrases detected • Phrase translation validation • For each of the recognized phrases, replace its WorldLingo translation by the translation(s) from our domain-specific dictionary, if they are not identical.
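
The validation step above can be sketched as below. This is an illustration rather than the project's code: phrase identification is reduced to a substring lookup in the phrase dictionary (the slides mention lexical rules, which are not reproduced here), and `mt` is any callable that translates a string, standing in for the WorldLingo service.

```python
def hybrid_translate(query, phrase_dict, mt):
    """Translate `query`, preferring dictionary entries for known phrases.

    `phrase_dict` maps source phrases to target phrases; `mt` is any
    callable that translates a string (a stand-in for the MT service).
    """
    translated = mt(query)
    lowered = query.lower()
    # Longest phrases first so longer matches are handled before sub-phrases.
    for phrase in sorted(phrase_dict, key=len, reverse=True):
        if phrase not in lowered:
            continue
        mt_phrase = mt(phrase)
        dict_phrase = phrase_dict[phrase]
        # Validation: replace the MT output only when the two differ.
        if mt_phrase != dict_phrase and mt_phrase in translated:
            translated = translated.replace(mt_phrase, dict_phrase)
    return translated
```

With a word-by-word MT that renders "the starry night" literally, a dictionary entry for that phrase would overwrite the literal rendering inside the translated query, while phrases the MT already translates identically are left untouched.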

  27. Hybrid Query Translation Example

  28. Some Translation Examples

  29. Overview User Interface • Common entry point for (mainly) general users • Starting point for specialised interfaces • Results aggregated across sources and media types • Users can add/remove results windows • Web, Archives, Video, Image, Audio, Creators, RSS • Use of “infinite” scrolling bar • Links to metadata, term clouds, snippets and related terms • Users can filter language of results • Functionality for cross-language support • Use of combined MT and dictionary service • Ability to select/de-select alternative translations • Search all languages (English interlingua) • Localisation (Dutch, Spanish and German)

  30. Example Overview UI

  31. Image UI • Simple functionality to access visual material • Assume that users begin with verbalised query • Results include • Image (thumbnail), title and URN • Users can also view • Metadata, term cloud, snippet, related terms, similar images • Find similar images (“more like this”) • Images with similar visual content • Images with similar visual content and semantic content matching the query (uses fusion service)

  32. Audio UI • Generic playback for any web browser • JavaScript-based wrapper provided playback controls independent of audio file type • Segments in transcribed audio which match query terms • Two approaches for presenting transcripts

  33. Transcript Presentation • Transcript browse • Clickable audio segments • Transcript viewable • Term cloud browse • Clickable audio segments • Term cloud items can be arranged alphabetically or by time
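
The two term cloud orderings mentioned above (alphabetical or by time) could be modelled as below. The `CloudTerm` structure and its field names are hypothetical; the slides do not describe how term cloud items are stored.

```python
from typing import NamedTuple


class CloudTerm(NamedTuple):
    term: str
    start_seconds: float  # where the matching audio segment begins


def arrange_cloud(terms, order="alphabetical"):
    """Return term-cloud items sorted alphabetically or by time."""
    if order == "alphabetical":
        return sorted(terms, key=lambda t: t.term.lower())
    if order == "time":
        return sorted(terms, key=lambda t: t.start_seconds)
    raise ValueError(f"unknown order: {order!r}")
```

Keeping the segment start time on each item is what would make the terms clickable: selecting a term can start playback at `start_seconds`.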

  34. Video UI • Similar look-and-feel to audio UI • Users can view metadata, term clouds, snippets and related terms • Initial search queries video metadata • Also provided a search within results based on transcripts (requested by video experts) • For selected video users can • Playback video from various starting points • View keyframes and associated transcript text • Search for (and playback) specific keyframe segments containing keywords

  35. Creators UI • Utilisation of MultiMatch metadata schema and annotation • Provides biographical information and cross-referencing • Provides users with access to biographical information related to particular artist • Birth/death places/dates, nationality, alternative names, description • Query suggestions (names of artists) • List of creations by selected artist • Link to page for selected artist displaying • Related creations (images and videos) • Related web pages • “More like this” for images (derived from image UI)

  36. Non-Integrated Extensions

  37. Collection Browser: Motivation • A query places the user at some point of the information space • Feedback may be used to explore further • However relevance feedback makes it difficult to control the exploration and converge to a certain point • Similarity-based browsing effectively completes the query mechanism

  38. Implementation (http://viper.unige.ch/explore) • A basic search interface (text-based) • Locates initial relevant documents • A document viewer • Selects a seed for browsing and shows details • A collection browser • Allows similarity-based browsing

  39. Browser Interface • Sampling of the collection content for fast navigation • Clicking on any non-central image will bring it to the center • Clicking on the central image returns to the search view • Possible dimensions for display

  40. Semantic Browsing • Design of prototype faceted browsing system • Material crawled from Tate Online • Derived from Tate Collection (30,000 artworks, 3,000 artists) • Linked together Tate Collection and ULAN (100,000 artists) • Content can be accessed using one of four views • Artist view (search artists and titles of artworks) • Artwork view (more information on artworks and artwork thumbnails) • Timeline view of artist’s birth/death dates • Map view of artist’s birth places

  41. Dynamic Summarisation (DCU) • Focused browsing provides query-biased summary (term clouds) of hyperlinked pages

  42. Conclusions • MultiMatch successfully developed a system for multilingual multimedia search for the CH domain • Standard text, speech and video indexing tools are used as the underlying platform • Domain-specific methods were developed to support improved multilingual studies • An integrated demonstration system was deployed • Extensive scope for extending the existing MultiMatch demonstration system and developing new indexing, search and interaction tools

  43. Questions?
