Outline

Multimedia Segmentation and Summarization • Dr. Jia-Ching Wang • Honorary Fellow, ECE Department, UW-Madison

Outline • Introduction • Speaker Segmentation • Video Summarization • Conclusion

What is Multimedia? • Image • Video • Speech • Audio • Text

Multimedia Everywhere • Fax machines: transmission of binary images • Digital cameras: still images • iPod / iPhone & MP3 • Digital camcorders: video sequences with audio • Digital television broadcasting • Compact disk (CD), Digital video disk (DVD) • Personal video recorder (PVR, TiVo) • Images on the World Wide Web • Video streaming, video conferencing • Video on cell phones, PDAs • High-definition televisions (HDTV) • Medical imaging: X-ray, MRI, ultrasound • Military imaging: multi-spectral, satellite, microwave

What is Multimedia Content? • Multimedia content: the syntactic and semantic information inherent in digital material. • Example: text document • Syntactic content: chapter, paragraph • Semantic content: key words, subject, types of text document, etc. • Example: video document • Syntactic content: scene cuts, shots • Semantic content: motion, summary, index, caption, etc.

Why Do We Need to Know Multimedia Content? • Information processing (archiving, indexing, delivering, accessing, and other processing) requires in-depth knowledge of the content to optimize performance.

How to Know Multimedia Content? • Multimedia content analysis • The computerized understanding of the semantic/syntactic content of a multimedia document • Multimedia content analysis usually involves • Segmentation • Segmenting the multimedia document into units • Classification • Classifying each unit into a predefined type • Annotation • Annotating the multimedia document • Summarization • Summarizing the multimedia document

Multimedia Segmentation and Summarization • Multimedia segmentation • Syntactic content • Multimedia summarization • Semantic/syntactic content • The result of the temporal segmentation can benefit the video summarization

Multimedia Segmentation • Image segmentation • Video segmentation • Scene change, shot change • Audio segmentation • Audio class change • Speech segmentation • Speaker change detection • Text segmentation • Word segmentation, sentence segmentation, topic change detection

Multimedia Summarization • Image summarization • Region of interest • Video summarization • Storyboard, highlight • Audio summarization • Main theme in music, chorus in a song, event sound in an environmental sound stream • Speech summarization • Speech abstract • Text summarization • Abstract

What is Speaker Segmentation? • It can also be called speaker change detection (SCD) • Assumption: no two speaker streams overlap • (Figure labels: speaker1, speaker2, speaker3)

Supervised vs. Unsupervised SCD • Supervised manner: acoustic data are made up of distinct speakers who are known a priori • Recognition-based solution • Unsupervised manner: no prior knowledge about the number and identities of speakers • Metric-based criterion • Model selection-based criterion

Supervised Speaker Segmentation-- Gaussian Mixture Model • Gaussian mixture modeling (GMM): $p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i)$, where $x$ is a d-dimensional random vector, $w_i$, i=1,…,M, is the mixture weight, $\mu_i$ the mean vector, and $\Sigma_i$ the covariance matrix. • Incoming audio stream is classified into one of D classes in a maximum likelihood manner at time t
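The maximum-likelihood classification step can be illustrated with a minimal sketch. It assumes diagonal covariance matrices, and the two speaker models and their parameter values are made up for illustration; the slides do not give the actual model configuration.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, gmm):
    """log p(x | model) for a GMM given as a list of (weight, mean, var)."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def classify(x, models):
    """Return the class label whose GMM assigns x the highest likelihood."""
    return max(models, key=lambda label: gmm_log_likelihood(x, models[label]))

# Hypothetical two-class setup over 2-D features (values are illustrative).
models = {
    "speaker1": [(0.6, [0.0, 0.0], [1.0, 1.0]), (0.4, [1.0, 1.0], [0.5, 0.5])],
    "speaker2": [(1.0, [5.0, 5.0], [1.0, 1.0])],
}
print(classify([0.2, -0.1], models))
print(classify([4.8, 5.3], models))
```

In a segmentation setting, a change of the winning label between consecutive frames (or smoothed groups of frames) would mark a candidate speaker boundary.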

Supervised Speaker Segmentation-- Hidden Markov Model

Unsupervised Speaker Segmentation-- Sliding Window Strategy & Detection Criterion • Metric-based criterion (The dissimilarities between the acoustic feature vectors are measured) • Kullback-Leibler distance • Mahalanobis distance • Bhattacharyya distance • Model selection-based criterion • Bayesian information criterion (BIC)
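As a rough illustration of the metric-based criteria, the sketch below computes the three listed distances between two windows that are each summarized by a Gaussian. It assumes diagonal covariances (a simplification not stated in the slides), and the symmetric form of the Kullback-Leibler distance is one common choice rather than necessarily the one used in the talk.

```python
import math

def mahalanobis_sq(mean1, mean2, var):
    """Squared Mahalanobis distance between two means under a shared
    diagonal covariance `var`."""
    return sum((a - b) ** 2 / v for a, b, v in zip(mean1, mean2, var))

def symmetric_kl(mean1, var1, mean2, var2):
    """KL(p||q) + KL(q||p) for two diagonal-covariance Gaussians."""
    total = 0.0
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        diff_sq = (m1 - m2) ** 2
        total += 0.5 * (v1 / v2 + v2 / v1 - 2.0)
        total += 0.5 * diff_sq * (1.0 / v1 + 1.0 / v2)
    return total

def bhattacharyya(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    total = 0.0
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        total += 0.25 * (m1 - m2) ** 2 / (v1 + v2)
        total += 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2)))
    return total

# Two windows with the same variance but shifted means (illustrative values).
print(mahalanobis_sq([0.0], [2.0], [1.0]))
print(symmetric_kl([0.0], [1.0], [2.0], [1.0]))
print(bhattacharyya([0.0], [1.0], [2.0], [1.0]))
```

A speaker change would be hypothesized wherever the chosen distance between adjacent sliding windows exceeds a threshold, which is the quantity that is hard to tune in practice.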

Bayesian Information Criterion • Model selection • Choose one among a set of candidate models $M_i$, i=1,2,...,m, and corresponding model parameters to represent a given data set D = (D1, D2, …, DN). • Model posterior probability • Bayesian information criterion • Maximized log data likelihood for the given model with a model complexity penalty • Bayesian information criterion of model $M_i$: $\mathrm{BIC}(M_i) = \log P(D \mid \hat{\theta}_i, M_i) - \tfrac{1}{2} d_i \log N$, where $d_i$ is the number of independent parameters in the model parameter set

Unsupervised Segmentation Using Bayesian Information Criterion • First model • Second model • Bayesian information criterion
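A minimal sketch of BIC-based change detection follows. It assumes the first model is a single Gaussian over the whole window and the second model is one Gaussian per side of the hypothesized change point; it further assumes diagonal covariances and a tunable `penalty_weight`, neither of which is stated in the slides. A positive value favors the two-model hypothesis, i.e. a speaker change.

```python
import math

def log_det_diag_cov(frames):
    """Sum of log variances: log-determinant of the diagonal ML covariance."""
    n, d = len(frames), len(frames[0])
    total = 0.0
    for k in range(d):
        column = [f[k] for f in frames]
        mean = sum(column) / n
        var = sum((c - mean) ** 2 for c in column) / n
        total += math.log(max(var, 1e-10))  # floor avoids log(0)
    return total

def delta_bic(frames, split, penalty_weight=1.0):
    """BIC(two Gaussians split at `split`) minus BIC(one Gaussian)."""
    left, right = frames[:split], frames[split:]
    n, d = len(frames), len(frames[0])
    gain = 0.5 * (
        n * log_det_diag_cov(frames)
        - len(left) * log_det_diag_cov(left)
        - len(right) * log_det_diag_cov(right)
    )
    extra_params = 2 * d  # one extra mean and diagonal variance vector
    penalty = 0.5 * penalty_weight * extra_params * math.log(n)
    return gain - penalty

# Synthetic 1-D features: one window without and one with a mean shift.
same = [[0.5 if t % 2 == 0 else -0.5] for t in range(100)]
changed = same[:50] + [[5.0 + f[0]] for f in same[50:]]
print(delta_bic(same, 50) > 0)
print(delta_bic(changed, 50) > 0)
```

Because each side of the split needs enough frames for a stable covariance estimate, a criterion of this kind tends to struggle with very short speaker segments.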

Disadvantages of Conventional Unsupervised Speaker Change Detection • For metric-based methods, it is not easy to choose a suitable threshold • For BIC, it is not easy to detect speaker segments shorter than 2 seconds

Proposed Method -- Misclassification Error Rate • Sliding window pairs • Feature vector distribution (figure panels: same speaker; different speakers)

Mathematical Analysis

Discussion • Generative and discriminant classifiers are both applicable • Key point: discriminant classifiers have the benefit that less training data is required • We can use a smaller scanning window • The ability to detect short speaker segments increases

Speaker Segmentation Using Misclassification Error Rate • Steps • Preprocessing • Framing, feature extraction • Hypothesized speaker change point selection • Forcing 2-class labels • Training a discriminant hyperplane • Inside data recognition & calculating misclassification error rate • Accept/reject the hypothesized speaker change point • Significance • The unsupervised speaker segmentation problem is solved by supervised classification
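The steps above can be sketched roughly as follows. The slides do not name the discriminant classifier, so this sketch substitutes a simple nearest-class-mean hyperplane as a stand-in; the function names and the acceptance threshold of 0.25 are hypothetical. The idea is that forced labels are easy to learn when the two windows come from different speakers (low error) and close to arbitrary when they come from the same speaker (error near 0.5).

```python
def mean_vector(frames):
    """Component-wise mean of a list of feature vectors."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]

def misclassification_error_rate(left, right):
    """Force label 0 on `left` and label 1 on `right`, fit a hyperplane
    halfway between the class means, and re-classify the same data."""
    m0, m1 = mean_vector(left), mean_vector(right)
    w = [b - a for a, b in zip(m0, m1)]
    bias = -sum(wi * (a + b) / 2.0 for wi, a, b in zip(w, m0, m1))

    def predict(x):
        score = sum(wi * xi for wi, xi in zip(w, x)) + bias
        return 1 if score > 0 else 0

    errors = sum(predict(x) != 0 for x in left)
    errors += sum(predict(x) != 1 for x in right)
    return errors / (len(left) + len(right))

def is_speaker_change(left, right, threshold=0.25):
    """Accept the hypothesized change point when the forced labels
    turn out to be easy to separate (threshold is an assumed value)."""
    return misclassification_error_rate(left, right) < threshold

# Illustrative windows of 2-D features.
left = [[0, 0], [0, 1], [1, 0], [1, 1]]
far_right = [[5, 5], [5, 6], [6, 5], [6, 6]]
same_right = [[0, 1], [1, 0], [0, 0], [1, 1]]
print(is_speaker_change(left, far_right))
print(is_speaker_change(left, same_right))
```

A margin-based classifier could replace the stand-in hyperplane without changing the surrounding logic.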

Experimental Results

Video Summarization • Dynamic vs. Static Video Summarization • Dynamic video summarization • Sports highlights, movie trailers • Static video summarization • Storyboard • Visual-based approach • Incorporation of semantic information

Static Video Summarization-- Visual Based Approach • Example • Problem • Is the summarization ratio adjustable? • How to generate an effective storyboard under a given summarization ratio?

How to Generate an Effective Storyboard • Question: Assume there are n frames and the summarization ratio is r/n. How do we select the best r frames? • Complexity: • There are C(n,r) different choices

How to Generate an Effective Storyboard • From a visual viewpoint • The most visually distinct frames should be extracted • Dissimilarity between two frames is measured by low-level visual features • How to select the best r frames from n frames • Solution: maximize the overall pairwise dissimilarities • Complexity: C(n,r) x C(r,2) • Infeasible: C(n,r) is usually huge • Fact • Human beings usually browse a storyboard in a sequential way • Optimal solution in a sequential sense • Maximize the sum of dissimilarities between sequentially adjacent images in a storyboard

How to Maximize the Dissimilarity Sum of the Extracted Images • Lattice-based representative frame extraction approach • Extract key components from the temporal sequence • Dynamic programming can be applied • Example: how to select the best 4 images from an 8-image sequence

How to Maximize the Adjacent Dissimilarity Sum of the Extracted Images • Original images: O(1), O(2), O(3), O(4), O(5), O(6), O(7), O(8) • Extracted images: E(1), E(2), E(3), E(4) • E(1) ← O(i); E(2) ← O(j); E(3) ← O(k); E(4) ← O(l); where i < j < k < l • Each legal left-to-right path represents a way to extract images • Each transition results in an adjacent dissimilarity • In this example, the adjacent dissimilarity sum of the extracted images is D[ O(1),O(3) ] + D[ O(3),O(4) ] + D[ O(4),O(7) ]
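The left-to-right path search can be sketched with dynamic programming. This is an illustrative implementation, not necessarily the one used in the talk: the function name `select_frames`, the 0-based indexing, and the scalar "features" with absolute difference as the dissimilarity D are assumptions made for the example.

```python
def select_frames(dissim, n, r):
    """Pick r of n frames, in temporal order, maximizing the sum of
    dissimilarities between adjacent selected frames.

    dissim(i, j) returns the dissimilarity of frames i < j (0-based)."""
    neg_inf = float("-inf")
    # best[k][j]: best sum when the (k+1)-th extracted image is frame j
    best = [[neg_inf] * n for _ in range(r)]
    back = [[None] * n for _ in range(r)]
    for j in range(n):
        best[0][j] = 0.0  # a single image has no adjacent pair yet
    for k in range(1, r):
        for j in range(k, n):
            for i in range(k - 1, j):  # previous image must come earlier
                candidate = best[k - 1][i] + dissim(i, j)
                if candidate > best[k][j]:
                    best[k][j] = candidate
                    back[k][j] = i
    # Trace the best path back from the best final frame.
    end = max(range(r - 1, n), key=lambda j: best[r - 1][j])
    path = [end]
    for k in range(r - 1, 0, -1):
        path.append(back[k][path[-1]])
    path.reverse()
    return path, best[r - 1][end]

# Toy 8-frame sequence with scalar features (illustrative values).
values = [0, 1, 9, 2, 8, 3, 7, 4]
path, total = select_frames(lambda i, j: abs(values[i] - values[j]), 8, 4)
print(path, total)
```

Each call to `dissim` corresponds to one lattice transition, so the work grows polynomially in n and r instead of with C(n,r).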

How to Maximize the Adjacent Dissimilarity Sum of the Extracted Images

Complexity Comparison • Select 4 images from an 8-image sequence • Lattice-based approach • 45 dissimilarity comparisons • Optimal (exhaustive pairwise) approach • C(8,4) x C(4,2) = 420 dissimilarity comparisons
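One way to reproduce these two counts is sketched below. It assumes that each lattice transition costs one dissimilarity comparison and that the k-th extracted image can only be one of original frames k through n-r+k (so that enough frames remain on either side); these assumptions are not spelled out in the slides, but they yield the same figures.

```python
from math import comb

def exhaustive_comparisons(n, r):
    """Every r-subset of n frames, each scored on all of its pairs."""
    return comb(n, r) * comb(r, 2)

def lattice_comparisons(n, r):
    """Count forward transitions between consecutive lattice stages.

    Stage k (1-based) may hold original frames k .. n-r+k."""
    count = 0
    for k in range(1, r):
        for i in range(k, n - r + k + 1):          # frame chosen at stage k
            for j in range(i + 1, n - r + k + 2):  # later frame at stage k+1
                count += 1
    return count

print(exhaustive_comparisons(8, 4))  # 420
print(lattice_comparisons(8, 4))     # 45
```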

Segment-Based Solution

Experimental Results

Incorporation of the Semantic Information • Conventional • The static summarized images are extracted in accordance with low-level visual features • Disadvantage • It is difficult to catch the main story without the support of semantically significant information • We present a semantic-based static video summarization • Each extracted image has an annotation • Related images are connected by edges • Using ‘who’, ‘what’, ‘where’, and ‘when’ to list all extracted images

The Proposed Architecture • Shot annotation: mapping visual content to text • Concept expansion: provides an alternative view and dependency information when measuring the relation between two annotations • Relational graph construction

Concept Tree Construction • The concept tree denotes the dependency structure of the expanded words • Meronym • 'Wheel' is a meronym of 'automobile' • Holonym • 'Tree' is a holonym of 'bark', of 'trunk' and of 'limb' • 'Pencil' used for 'draw' • 'Salesperson' location of 'store' • 'Motorist' capable of 'drive' • 'Eat breakfast' effect of 'full stomach'

Concept Tree Reorganization • Who: names of people, subset of "person" in WordNet • Where: "social group," "building," and "location" in WordNet • What: all the other words that do not belong to "who" or "where" • When: searching for time-period phrases

Relational Graph Construction -- Relation of Two Concept Trees • The relation of the two concept trees • The relation of the two roots • The relation of the two children

Relational Graph Construction -- Remove Unimportant Vertices and Edges • Remove edges with smaller weighting, i.e. lower relation • Remove vertices with smaller term frequency – inverse document frequency (TF-IDF)
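The pruning step might look roughly like the sketch below. It assumes that vertices are annotation terms, that TF-IDF is computed over per-shot annotation word lists and summed across shots, and that edge weights are relation scores in [0, 1]; the thresholds, the example annotations, and the function names are all hypothetical, since the slides do not give these details.

```python
import math
from collections import Counter

def tf_idf_scores(docs):
    """docs: list of token lists (one per annotated shot).
    Returns {term: tf-idf summed over all docs}."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[term])
            scores[term] += tf * idf
    return dict(scores)

def prune_graph(vertex_scores, edges, min_score, min_weight):
    """Drop low-scoring vertices, then drop weak or dangling edges."""
    kept = {v for v, s in vertex_scores.items() if s >= min_score}
    kept_edges = {
        (a, b): w
        for (a, b), w in edges.items()
        if w >= min_weight and a in kept and b in kept
    }
    return kept, kept_edges

# Made-up shot annotations and relation weights.
shots = [
    ["president", "speech", "city"],
    ["president", "visit", "city"],
    ["president", "storm"],
]
edges = {
    ("president", "speech"): 0.9,
    ("speech", "visit"): 0.8,
    ("visit", "storm"): 0.1,
}
scores = tf_idf_scores(shots)
vertices, kept_edges = prune_graph(scores, edges, min_score=0.3, min_weight=0.5)
print(sorted(vertices), kept_edges)
```

A term that appears in every shot receives a score of zero under this weighting, so it is removed as uninformative even if it is frequent.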

The Final Relational Graph • Comparison with conventional storyboard

Conclusion • A novel speaker segmentation criterion is proposed • Misclassification error rate • The unsupervised speaker segmentation problem is solved by supervised classification with label-forcing • The discriminant classifier allows the proposed approach to use a smaller scanning window • The ability to detect short speaker segments increases • Two new static video summarization approaches are proposed • Lattice-based representative frame extraction • Merely using low-level visual features • The summarization ratio is adjustable • Under a given summarization ratio, the dissimilarity sum of sequentially adjacent images is maximized • Concept-organized representative frame extraction • Incorporating semantic information • Mining the four kinds of concept entities: who, what, where, and when • People can efficiently grasp the comprehensive structure of the story and understand the main points of the contents

Future Work • Multimedia segmentation • Speech segmentation • Audio segmentation • Video segmentation • Multimedia summarization • Video summarization • Static, dynamic • Speech summarization • Audio summarization

Thank all of you for your attendance!
