
Semantic Similarity for Music Retrieval

Acoustic Models

Each song, s, is represented as a probability distribution, P(a|s), over the audio feature space, approximated as a Gaussian mixture model (GMM). A bag-of-features extracted from the song's audio content and the expectation-maximization (EM) algorithm are used to train the song-level GMMs (a minimal fitting sketch appears below).

Semantic Models

Each word, w, is represented as a probability distribution, P(a|w), over the same audio feature space. The training data for a word-level GMM is the set of all song-level GMMs from songs labeled with word w. Song-level GMMs are combined into word-level GMMs using the mixture-hierarchies EM algorithm (a simplified stand-in is sketched below); the resulting semantic model, a set of word-level GMMs, is the basis for song similarity.

[Figure: "Romantic" song-level GMMs p(a|s1), ..., p(a|s6) are combined via mixture-hierarchies EM into the "Romantic" word-level GMM p(a|"romantic").]

Sounds → Semantics: Semantic Multinomials

Using the learned word-level GMMs P(a|wi), we compute the posterior probability of word wi given a song. Assuming the feature vectors xm and xn are conditionally independent given wi, the song likelihood factors over the individual feature vectors. The song prior is estimated by summing over all words. Normalizing the posteriors of all words, we represent each song as a semantic multinomial distribution over the vocabulary (the equations are reconstructed below).

It's All Semantics…

Semantic understanding of audio signals enables retrieval of songs that, while acoustically different, are semantically similar to a query song. Given a query with a high-pitched, wailing electric guitar solo, a system based on acoustics alone might retrieve songs with screechy violins or a screaming female singer. Our system retrieves songs with semantically similar content: acoustic, classical, or distorted electric guitars.

Semantic Similarity

We represent every song as a semantic multinomial: a point in a semantic space. A natural similarity measure in this space is the Kullback-Leibler (KL) divergence. Given a query song, we retrieve the database songs that minimize the KL divergence with the query (see the retrieval sketch below).

http://cosmal.ucsd.edu/cal/
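The Acoustic Models section above fits a GMM to each song's bag-of-features with EM. Below is a minimal sketch of that step using scikit-learn's GaussianMixture; the 8 diagonal-covariance components, the synthetic 39-dimensional features, and the function name fit_song_gmm are illustrative assumptions, since the poster does not give a model configuration.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_song_gmm(features, n_components=8):
    """Fit a song-level GMM P(a|s) to a bag of audio feature vectors.

    features: array of shape (n_frames, n_dims), e.g. MFCCs with time deltas.
    The component count and covariance type are assumptions; the poster only
    states that EM is used to train song-level GMMs.
    """
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          max_iter=100,
                          random_state=0)
    gmm.fit(features)  # EM runs inside fit()
    return gmm

# Illustrative usage with synthetic features standing in for a real song
song_features = np.random.randn(5000, 39)   # ~5,000 frames of 39-dim MFCC+delta vectors
song_model = fit_song_gmm(song_features)
print(song_model.score(song_features))      # mean per-frame log-likelihood under P(a|s)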
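The Semantic Models section above combines song-level GMMs into a word-level GMM with the mixture-hierarchies EM algorithm. That algorithm is not reproduced here; the sketch below is a deliberately simpler stand-in that pools the frames of every song labeled with a word and fits one GMM directly. It conveys the idea of a word-level distribution P(a|w) but is not the poster's method.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_word_gmm_pooled(songs_features, n_components=16):
    """Fit a word-level GMM P(a|w) by pooling the frames of all songs labeled with w.

    songs_features: list of (n_frames, n_dims) arrays, one per song labeled with the word.
    NOTE: a simplified stand-in for the mixture-hierarchies EM algorithm cited on the
    poster, which combines the song-level GMMs themselves rather than refitting from
    pooled frames.
    """
    pooled = np.vstack(songs_features)          # stack all bags of features
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          max_iter=100,
                          random_state=0)
    gmm.fit(pooled)
    return gmm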
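The Sounds → Semantics equations were rendered as images on the original poster; the LaTeX below is a hedged reconstruction from the surrounding prose. The bag-of-features notation X = {x_1, ..., x_M}, the vocabulary-size symbol |V|, the mixture symbols alpha_k, mu_k, Sigma_k, and the multinomial symbol pi_i are assumed here, not taken from the poster.

% Song-level acoustic model: a GMM over the audio feature space
P(a \mid s) \;=\; \sum_{k=1}^{K} \alpha_k \, \mathcal{N}(a;\, \mu_k, \Sigma_k)

% Posterior of word w_i given a song's bag-of-features \mathcal{X} = \{x_1, \dots, x_M\},
% with x_m and x_n assumed conditionally independent given w_i
P(w_i \mid \mathcal{X}) \;=\; \frac{P(\mathcal{X} \mid w_i)\, P(w_i)}{P(\mathcal{X})},
\qquad
P(\mathcal{X} \mid w_i) \;=\; \prod_{m=1}^{M} P(x_m \mid w_i)

% The song "prior" (evidence) is estimated by summing over all words in the vocabulary
P(\mathcal{X}) \;=\; \sum_{j=1}^{|\mathcal{V}|} P(\mathcal{X} \mid w_j)\, P(w_j)

% Semantic multinomial: the word posteriors, normalized over the vocabulary
\pi_i \;=\; \frac{P(w_i \mid \mathcal{X})}{\sum_{j=1}^{|\mathcal{V}|} P(w_j \mid \mathcal{X})},
\qquad \sum_{i} \pi_i = 1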
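Putting the Sounds → Semantics and Semantic Similarity sections together: each song is mapped to a semantic multinomial under the word-level GMMs, and database songs are ranked by KL divergence to the query's multinomial. The sketch below assumes scikit-learn GaussianMixture objects as word-level models, a uniform word prior, and the direction KL(query || database); none of these choices are specified on the poster, and the function names are illustrative.

import numpy as np

def semantic_multinomial(features, word_gmms):
    """Map a song's bag-of-features to a semantic multinomial over the vocabulary.

    features: (n_frames, n_dims) array; word_gmms: list of fitted GaussianMixture
    models, one per vocabulary word.  log P(X|w_i) = sum_m log P(x_m|w_i) under the
    conditional-independence assumption; a uniform word prior is assumed here.
    """
    log_lik = np.array([g.score_samples(features).sum() for g in word_gmms])
    log_post = log_lik - log_lik.max()       # subtract max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()                 # normalized word posteriors

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two semantic multinomials (smoothed to avoid log 0)."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def retrieve(query_sm, database_sms, k=10):
    """Indices of the k database songs minimizing KL divergence to the query."""
    divs = [kl_divergence(query_sm, sm) for sm in database_sms]
    return list(np.argsort(divs)[:k])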
Luke Barrington, Doug Turnbull, David Torres & Gert Lanckriet
Electrical & Computer Engineering, University of California, San Diego
lbarrington@ucsd.edu

Audio & Text Features

Our models are trained on the CAL500 data set, a heterogeneous collection of song/caption pairs: 500 popular western songs and a 146-word vocabulary, with each track annotated by at least 3 humans. Audio content is represented as a bag of feature vectors: MFCC features plus 1st and 2nd time deltas, giving 10,000 feature vectors per minute of audio. Annotations are represented as a bag of words: a binary document vector of length 146.

References

Carneiro & Vasconcelos (2005). Formulating semantic image annotation as a supervised learning problem. IEEE CVPR.
Rasiwasia, Vasconcelos & Moreno (2006). Query by Semantic Example. ACM ICIVR.
Barrington, Chan, Turnbull & Lanckriet (2007). Audio Information Retrieval using Semantic Similarity. IEEE ICASSP.
Turnbull, Barrington, Torres & Lanckriet (2007). Towards Musical Query-by-Semantic-Description using the CAL500 Data Set. ACM SIGIR.
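The Audio & Text Features section above specifies the representations: MFCCs plus 1st and 2nd time deltas for audio, and a length-146 binary document vector for annotations. The sketch below shows one way to produce both with librosa and numpy; the sample rate, the choice of 13 MFCCs, and the helper names are assumptions, since the poster does not give extraction parameters.

import numpy as np
import librosa

def audio_bag_of_features(path, n_mfcc=13):
    """Return a (n_frames, 3 * n_mfcc) bag of MFCC + delta + delta-delta vectors."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, n_frames)
    d1 = librosa.feature.delta(mfcc, order=1)                # 1st time deltas
    d2 = librosa.feature.delta(mfcc, order=2)                # 2nd time deltas
    return np.vstack([mfcc, d1, d2]).T                       # one feature vector per frame

def annotation_vector(song_words, vocabulary):
    """Binary document vector over the vocabulary (length 146 for CAL500)."""
    words = set(song_words)
    return np.array([1 if w in words else 0 for w in vocabulary], dtype=np.int8)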
