
Gerhard Rigoll Munich University of Technology Institute for Human-Machine Communication

The ALERT System: Audiovisual Broadcast Speech Transcription for Selective Dissemination of Multimedia Information. Gerhard Rigoll Munich University of Technology Institute for Human-Machine Communication Munich, Germany rigoll@ei.tum.de. General Project dates. ALERT system for selective





Presentation Transcript


  1. The ALERT System: Audiovisual Broadcast Speech Transcription for Selective Dissemination of Multimedia Information Gerhard Rigoll Munich University of Technology Institute for Human-Machine Communication Munich, Germany rigoll@ei.tum.de

  2. General project dates: ALERT system for selective dissemination of multimedia information • Official start: 01/2000, start of work: 03/2000, duration: 30 months • Manpower effort: ~30 person-years (MY) ---> Budget: ~1.6 million euro EC funding • Web site: http://alert.uni-duisburg.de

  3. Media information flooding [diagram: news and Internet sources; supervision by information brokers]

  4. Media monitoring in the ALERT project [diagram: information (sound, video, text) from news and Internet sources; transcription; topic detection; alert message. Example shown: today's headlines, topic "TAXES"]

  5. General project objectives • To develop a demo system capable of identifying specific information in multimedia data consisting of text, audio and video streams, using advanced speech recognition, video processing techniques and automatic topic detection algorithms • The demonstrator shall alert a user about the existence of requested information and, on the client's further request, send detailed information: extracted text, annotated audio/video data and video clips • Functionality is to be provided in French, German and Portuguese • The demo system will be evaluated mainly by the industrial partners

  6. The ALERT consortium [diagram groups: integration, technologies, users]

  7. WP structure (WP0-WP4) [chart legend: deliverable, milestone, today]

  8. WP structure (WP5-WP7) [chart legend: deliverable, milestone, today]

  9. Collection of pilot corpus • First step to set up similar resources • Purpose: testbed for assessing methods for data collection, annotation and distribution • Collection guidelines: • Minimum amount: 5 hours • Type of data: video, audio and annotation • Video format: MPEG-1 • Audio format: linear PCM, 16 kHz sampling rate, 16 bits/sample, mono, collected from antenna • Annotation based on LDC guidelines • Thematic orientation: news and interview shows

  10. Collection of final databases • Experimental results • Recommendations for final corpus • Quality: mp3, 32 kbps, 16 kHz, mono • Minimum amount: • speech recognition: 50 hours (training), 3 hours (development), 3 hours (evaluation), word-labelled • topic detection: 300 hours, topic-annotated • text corpus: 100 million words • Full data set: • 1300 hours, word- or topic-annotated • > 10k topic-annotated summaries in German • text corpus: > 1 billion words
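
To put the two audio formats side by side, here is a small worked calculation. It assumes, purely for illustration, that the full 1300 hours of audio were stored entirely in one format; the slides do not say how the data was actually archived, and the function names are hypothetical.

```python
# Rough storage arithmetic for the audio formats named on the slides.
# Assumption: all 1300 hours are stored in a single format (not stated in the talk).

def pcm_bitrate_bps(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Bit rate of uncompressed linear PCM in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def storage_gb(hours: float, bitrate_bps: int) -> float:
    """Storage in gigabytes (10^9 bytes) for `hours` of audio at `bitrate_bps`."""
    return hours * 3600 * bitrate_bps / 8 / 1e9


PCM_BPS = pcm_bitrate_bps(16_000, 16, 1)  # pilot corpus: 16 kHz, 16 bit, mono
MP3_BPS = 32_000                          # final corpus recommendation: 32 kbps

if __name__ == "__main__":
    print(f"PCM bit rate: {PCM_BPS // 1000} kbps")
    print(f"Compression factor: {PCM_BPS // MP3_BPS}x")
    print(f"1300 h as PCM: {storage_gb(1300, PCM_BPS):.2f} GB")
    print(f"1300 h as mp3: {storage_gb(1300, MP3_BPS):.2f} GB")
```

Under these assumptions the mp3 recommendation reduces the audio bit rate by a factor of eight compared with the PCM format used for the pilot corpus.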

  11. comparison of coding schemes for broadcast speech databases

  12. Multimedia data labeling and alert generation [flow diagram] • Input: multimedia document • If video is contained: video/image processing, video-based segmentation • If audio is contained: speech processing, transcription (best hypothesis, word graph), segmentation • If text is contained: automatic topic detection (topics, keywords) • Match topics found against user profiles, then alert specific users • Storage: multimedia document database, label database
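
The last step of this flow, matching detected topics against user profiles, can be sketched as a set intersection. This is only an illustrative sketch: `UserProfile` and `generate_alerts` are hypothetical names, and the slides do not describe how profiles are actually stored or matched in the demonstrator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserProfile:
    """Hypothetical user profile: a user id and the topics the user subscribed to."""
    user: str
    topics: frozenset


def generate_alerts(segment_topics, profiles):
    """Return {user: matching topics} for every profile that overlaps the segment."""
    detected = set(segment_topics)
    alerts = {}
    for profile in profiles:
        hits = detected & profile.topics
        if hits:  # only users with at least one matching topic are alerted
            alerts[profile.user] = sorted(hits)
    return alerts


if __name__ == "__main__":
    profiles = [
        UserProfile("broker_a", frozenset({"taxes", "sports"})),
        UserProfile("broker_b", frozenset({"weather"})),
    ]
    print(generate_alerts(["taxes", "elections"], profiles))
```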

  13. Basic principle of video segmentation: stochastic video model (based on HMMs) [diagram]

  14. Result of video-based segmentation

  15. Combined video-audio-segmentation

  16. Topic segmentation results: video-based detection of topic boundaries is feasible • precision rate = 1 - insertion rate = 88.2 % • recall rate = 1 - deletion rate = 82.2 %
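
The two definitions on this slide can be made concrete with a short sketch. It assumes that the insertion rate is the share of detected boundaries with no reference counterpart and the deletion rate is the share of reference boundaries that were missed; the matching tolerance and the greedy matching are illustrative choices, not taken from the talk.

```python
def boundary_scores(reference, detected, tolerance=0.0):
    """Return (precision, recall) for detected topic boundaries.

    precision = 1 - insertion rate, recall = 1 - deletion rate.
    A detected boundary counts as a hit if an unused reference boundary
    lies within `tolerance` (same unit as the boundary times).
    """
    unmatched = sorted(reference)
    hits = 0
    for d in sorted(detected):
        match = next((r for r in unmatched if abs(r - d) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)  # each reference boundary is used once
            hits += 1
    insertions = len(detected) - hits
    deletions = len(reference) - hits
    insertion_rate = insertions / len(detected) if detected else 0.0
    deletion_rate = deletions / len(reference) if reference else 0.0
    return 1 - insertion_rate, 1 - deletion_rate


if __name__ == "__main__":
    # Boundary times in seconds; the values are made up for illustration.
    precision, recall = boundary_scores(
        [10, 20, 30, 40], [10.5, 19.8, 35, 40.2], tolerance=1.0
    )
    print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```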

  17. French BN speech recognizer • continuous density HMM system • 33 phones + 3 non-speech units (silence, filler words, breath) • ~20 % WER (on news) • 65k dictionary • automatic pronunciation with manual verification • 58 hours of acoustic training data, 350 million words of text • real-time (RT) decoding: 5700 states, 92k Gaussians • 10xRT decoding: 11000 states, 350k Gaussians • 4-gram language model: 15M bi-, 15M tri-, 13M four-grams
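
For readers unfamiliar with the WER figures quoted on this and the following recognizer slides, here is a minimal sketch of the usual definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. Scoring details used in the project, such as text normalization, are not given in the slides, so this is the textbook form only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example sentences, not taken from the ALERT corpora.
    wer = word_error_rate("the tax reform passed today", "the tax reforms passed")
    print(f"WER = {wer:.0%}")
```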

  18. Portuguese BN speech recognizer • Based on the AUDIMUS LVCSR system • Hybrid system based on MLP/HMM techniques • Combination of different acoustic models (product of posterior probabilities) • 38 phones + silence, 57k dictionary • 4-gram LM: 5M bi-, 12M tri-, 13M four-grams • Trained on 13 h of BN data • Results at 15xRT: F0: ~20 %, all F conditions: ~40 %
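
The slide only states that the acoustic models are combined by a product of posterior probabilities. A minimal sketch of that idea for a single frame follows; the log-domain computation, the probability floor and the renormalization are assumptions made here for numerical safety, and `combine_posteriors` is a hypothetical name rather than part of AUDIMUS.

```python
import math


def combine_posteriors(model_posteriors, floor=1e-12):
    """Combine per-model phone posteriors for one frame by their product.

    `model_posteriors` is a list with one posterior vector per acoustic model,
    all over the same phone set. The product is computed as a sum of logs and
    renormalized so the result sums to one.
    """
    n_classes = len(model_posteriors[0])
    log_scores = [
        sum(math.log(max(p[k], floor)) for p in model_posteriors)
        for k in range(n_classes)
    ]
    top = max(log_scores)  # subtract the maximum before exponentiating
    unnormalized = [math.exp(s - top) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    # Two hypothetical models scoring three phone classes for one frame.
    combined = combine_posteriors([[0.7, 0.2, 0.1], [0.2, 0.6, 0.2]])
    print([round(p, 3) for p in combined])
```

The product rewards classes on which all models agree: a class that any single model considers very unlikely ends up with a low combined score.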

  19. German Baseline Speech Recognition System

  20. German BN speech recognizer • continuous density HMM system • 50 phones + 17 non-speech units (silence, filler words, breath, rustle, ...) • ~20 % WER (initial DuDeutsch: >70 % WER) • 100k dictionary • initial pronunciation from CELEX, compound word construction • 10xRT: 30-90k Gaussians • 3-gram (cached) language model: 8M bi-, 16M trigrams

  21. Evolution of the German system

      System | Phone models | #Mixtures | WER
      Baseline German system, 100k, spontaneous speech | triphones | 31,780 | ~30 %
      Baseline, not trained on broadcast data | triphones | 31,780 | 79.7 %
      Baseline with broadcast language model | triphones | 31,780 | 72.3 %
      Acoustic models trained on broadcast data | monophones | 1,722 | 54.3 %
      Acoustic models optimized on broadcast data | triphones | 96,417 | 22.8 %

  22. Examples for German transcription results

  23. Automatic topic detection • Objectives: • to automatically divide audio/video streams into topic-specific homogeneous segments • to automatically assign requested topics to distinct segments • Test set: • 22 topics in 2956 training and 1284 test texts • deletion of 150 stop words • no stemming performed
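
The text preparation described for the test set (stop-word deletion, no stemming) might look like the sketch below. The stop-word list here is a tiny hypothetical English stand-in; the 150 stop words used in the project are not listed in the slides, and the tokenization rule is an assumption.

```python
import re

# Hypothetical stand-in; the project's actual 150-word stop list is not given.
STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "is", "to", "in"})


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, split into word tokens and drop stop words; no stemming is applied."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]


if __name__ == "__main__":
    print(preprocess("The taxes of the city are rising."))
```

Because no stemming is performed, inflected forms such as "taxes" and "tax" remain distinct tokens.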

  24. New approach to topic detection [diagram elements: example text "This is a text containing important topics."; binary word vector [00.....0100....0]; MMI neural net; VQ label; word probabilities p(w1), p(w2), p(w3), ...]
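
One possible reading of the diagram's input side is sketched below: each word is encoded as a binary vector with a single 1 at its vocabulary index, and p(w1), p(w2), ... are taken as relative word frequencies. This reading is an assumption, since the transcript preserves only the diagram labels; the MMI neural net that maps such vectors to VQ labels is not reproduced here, and all names are hypothetical.

```python
from collections import Counter


def one_hot(word, vocabulary):
    """Binary vector with a single 1 at the word's vocabulary index (all 0 if unknown)."""
    vec = [0] * len(vocabulary)
    if word in vocabulary:
        vec[vocabulary.index(word)] = 1
    return vec


def word_probabilities(tokens, vocabulary):
    """Relative frequency of each vocabulary word among the in-vocabulary tokens."""
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in vocabulary]


if __name__ == "__main__":
    vocabulary = ["text", "topics", "important"]  # hypothetical toy vocabulary
    tokens = "this is a text containing important topics".split()
    print(one_hot("topics", vocabulary))
    print(word_probabilities(tokens, vocabulary))
```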

  25. Results for clean text • Comparison of feature quantization with k-means clustering and MMI neural net • Comparison of new approach and standard system

  26. Partially corrupted text • Results with partially corrupted texts: • some words are fragmented • similar to speech recognition output • 22 topics in 3037 training and 1319 test texts • no stop words • no stemming

  27. Results for corrupted text [result panels: 22 topics, 173 topics]

  28. Demonstrator specification (details)

  29. Publications • ICASSP 2001 (7/2001) • LIMSI: Automatic transcription of compressed broadcast audio • GMUD: New approaches to audio-visual segmentation of TV news for automatic topic retrieval • TREC-9 (11/2000) • LIMSI: The LIMSI SDR system for TREC-9 • argus press (11/2000) • Observer: Observer Argus Media beteiligt sich am EU-Forschungsprojekt ALERT • ICSLP 2000 (10/2000) • GMUD: Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches • INESC: The Use of Syllable Segmentation Information in Continuous Speech Recognition Hybrid Systems Applied to the Portuguese Language • INESC: Combination of Acoustic Models in Continuous Speech Recognition Hybrid Systems

  30. Publications (II) • ICSLP 2000 (10/2000) • LIMSI: Fast decoding for indexation of broadcast data • LIMSI: Investigating text normalization and pronunciation variants for German broadcast transcription • ECDL 2000, 4th European Conference on Research and Advanced Technology for Digital Libraries (9/2000) • INESC: Topic Detection in Read Documents • ASR 2000 (9/2000) • INESC: A Decoder for Finite-State Structured Search Spaces • ICASSP 2000 (6/2000) • GMUD: A Novel Error Measure for the Evaluation of Video Indexing Systems

  31. Presentations • Schaufenster der Wissenschaft (3/2001) • GMUD: Informationen aus Radio, Fernsehen und Internet: Automatische Themenerkennung in Multimedia-Daten • Euromap Informationstag (12/2000) • GMUD: Das Projekt ALERT - Alert system for selective dissemination of multimedia information • IV Jornadas de Arquivo e Documentação (10/2000) • INESC: Speech recognition and topic detection applied to alert systems for broadcast news • ASR 2000 (9/2000) • GMUD: ALERT System for Selective Dissemination of Multimedia Information • Homme Technologie et Systèmes Complexes (6/2000) • VECSYS: Parlez Naturellement, la Machine Vous Comprend • RIAO'2000 Content-based Multimedia Information Access (4/2000) • VECSYS, LIMSI: An Audio Transcriber for Broadcast Document Indexation

  32. outlook • use of additional data • cross-talker situations • enlarged number of topics • improving rejection mechanisms of unknown topics (confidence for topics) • detection of new topics • summarization • scalable summarization • topic-dependent summarization
