Query by Singing (CBMR: Content-based Music Retrieval)
MATLAB Conf. 1999
J.-S. Roger Jang (張智星)
CS Dept, Tsing-Hua Univ, Taiwan
http://www.cs.nthu.edu.tw/~jang
Brought to you by Roger Jang

Outline
• Part 1: Introduction
• Part 2: Related Work
• Part 3: Proposed Methods
• Part 4: Experimental Results and Demos
• Part 5: Conclusions and Future Work

About Me
• Experiences
  - 1993-1995: The MathWorks, Inc., U.S.A.
  - 1995-now: Associate Prof. at CS Dept., Tsing Hua Univ., Taiwan
• Special achievements: Have survived
  - 1989 S.F. earthquake (7.1)
  - 1999 Taiwan earthquake (7.6)
  - 2009 ?

Part 1: Introduction to CBMR
• CBMR: Content-based music retrieval
• Goal: Music retrieval by singing/humming
• Traditional database query
  - Text-based search, SQL-based queries
• Features used by CBMR systems
  - melody
  - rhythm
  - chord

Part 2: Related Work
• Query by humming, by Ghias, Logan, and Chamberlin in 1995
  - Modified autocorrelation
  - 183 songs in database
• MELDEX system, by the New Zealand Digital Library Project in 1996
  - Gold/Rabiner algorithm (800 songs)
  - Sing ‘la’ or ‘ta’ when transposition

Pitch Determination Methods
• Time-domain analysis
  - Autocorrelation (1976)
  - AMDF (average magnitude difference function)
  - Gold-Rabiner algorithm (1969)
• Frequency-domain analysis
  - Cepstrum (Noll 1964)
  - Harmonic product spectrum (Schroeder 1968)
  - Chen’s heuristic method (Chen 1998)
• Others
  - Maximum likelihood
  - Simple inverse filter tracking (SIFT)
  - Neural network approaches

PART 3: Proposed Method
[Flow chart]
• On-line processing: Microphone signal input, Sampling (11 kHz), Short-term autocorrelation, Center clipping, Note segmentation, Mid-level representation, Similarity comparison, Query results (ranking list)
• Off-line processing: MIDI song database, Music note extraction

Microphone Signal Input
• Sampling & low-pass filtering
• Wave file: (Happy Birthday)
[Figure: waveform with note starts and note ends marked]

Autocorrelation in Speech Signal
[Figure: speech waveform, with a zoomed-in view showing overlapping frames]

Short-term Autocorrelation
• Autocorrelation of N points ending at M: g(h) = sum of s(n)*s(n-h) for n = M-N+1+h, ..., M
• Frame size 256 points, shift 128 points
• Using rectangular window

Short-term Autocorrelation
• Frame s(n), n = 1, ..., 128, and its shifted copy s(n-h) with h = 30
• x(30) = dot product of the overlapped parts = sum(s(31:128).*s(1:98))
[Figure: autocorrelation g(h), with a peak at pitch period 30]

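The dot-product view of autocorrelation can be sketched in a few lines of Python. This is a minimal illustration, not the deck's MATLAB code: the function names `autocorrelation` and `pitch_period` and the lag search range are assumptions made for the example, and indices are 0-based, so the slide's `s(31:128).*s(1:98)` corresponds to `frame[30:128]` times `frame[0:98]`.

```python
import math


def autocorrelation(frame, lag):
    """Dot product of the frame with a copy of itself delayed by `lag` samples.

    Only the overlapped part contributes (rectangular window).
    """
    return sum(frame[i] * frame[i - lag] for i in range(lag, len(frame)))


def pitch_period(frame, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the largest autocorrelation."""
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: autocorrelation(frame, lag))


# Hypothetical test signal: a 128-point sine wave with a period of 30 samples.
frame = [math.sin(2 * math.pi * n / 30) for n in range(128)]
print(pitch_period(frame, 10, 60))
```

Restricting the search to a lag range keeps the trivial maximum at lag 0 out of the comparison.
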
Center Clipping
• Clipping limits are set to r% of the absolute maximum of the autocorrelation data
[Figure: three input/output clipping characteristics, panels (a), (b), (c), each with limits at r%]

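As a rough sketch of the idea, the Python function below applies one common center-clipping rule: values inside the limits become zero and values outside are shifted toward zero by the limit. Which of the three characteristics on the slide the system actually uses is not recoverable from the text, so this variant and the name `center_clip` are assumptions.

```python
def center_clip(values, r_percent):
    """Center-clip `values` with limits at r_percent of the absolute maximum.

    Assumed variant: samples within +/- limit are set to 0, samples beyond
    the limit are moved toward zero by the limit.
    """
    limit = max(abs(v) for v in values) * r_percent / 100.0
    clipped = []
    for v in values:
        if v > limit:
            clipped.append(v - limit)
        elif v < -limit:
            clipped.append(v + limit)
        else:
            clipped.append(0.0)
    return clipped


print(center_clip([10, -4, 2, -10, 6], 50))
```
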
Computing Fundamental Frequency
• Fundamental frequency: sampling rate / pitch period (in samples)
• Removal of unreasonable pitch:
  1. Outside the fundamental frequency range
  2. Sharp transition from both sides

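The two steps on this slide can be sketched as follows. The sampling rate of 11025 Hz (for the "11 kHz" in the flow chart), the frequency range, and the jump threshold are assumed values chosen for illustration, and both function names are hypothetical.

```python
def fundamental_frequency(sample_rate, period_samples):
    """Convert a pitch period in samples to a frequency in Hz."""
    return sample_rate / period_samples


def remove_unreasonable(pitches, f_min, f_max, max_jump):
    """Replace unreasonable pitch values with None.

    A value is dropped if it lies outside [f_min, f_max], or if it differs
    from BOTH neighbors by more than max_jump (a sharp transition from
    both sides). Thresholds are assumptions, not values from the slides.
    """
    cleaned = []
    for i, f in enumerate(pitches):
        out_of_range = not (f_min <= f <= f_max)
        spike = (0 < i < len(pitches) - 1
                 and abs(f - pitches[i - 1]) > max_jump
                 and abs(f - pitches[i + 1]) > max_jump)
        cleaned.append(None if out_of_range or spike else f)
    return cleaned


print(fundamental_frequency(11025, 30))
print(remove_unreasonable([440, 440, 880, 440, 440, 30], 80, 1000, 100))
```
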
Pitch Tracking
• Pitch tracking via autocorrelation for 茉莉花 (Jasmine)

Pitch Contour
[Figure: pitch contour; the yellow line marks the correct pitch contour]

Note Segmentation
• Segmentation based on energy
  - Necessary to have intensity contrast
  - Hard to define each note boundary
• Segmentation based on pitch
  - No constraints when singing
  - Reasonable in CBMR system

Note Segmentation by Pitch
Proposed approach: sliding window method
  for I = 1:seg_num
    if seg_length <= note_min
      find_note(min(seg(I)));   % Find mean value
    else
      cut_note(min(seg(I)));    % Sliding window
    end
  end

Sliding Window Method
• Window size: 10
• Max standard deviation: 8
  for each window
    if std(window) < 8
      find_note(window)
    else
      go to next window
    end
  end

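A minimal Python sketch of the sliding window idea is shown below. The slides do not define `find_note` or how far the window advances, so two assumptions are made here: a stable window is turned into a note by taking its mean pitch and the search then jumps past it, while an unstable window advances by one frame.

```python
import statistics


def sliding_window_notes(pitches, window_size=10, max_std=8.0):
    """Extract one note per stable window of the pitch contour.

    A window is stable when its sample standard deviation is below max_std.
    The note value (mean of the window) and the step sizes are assumptions.
    """
    notes = []
    start = 0
    while start + window_size <= len(pitches):
        window = pitches[start:start + window_size]
        if statistics.stdev(window) < max_std:
            notes.append(statistics.mean(window))
            start += window_size   # skip past the window just used
        else:
            start += 1             # unstable: go to the next window
    return notes


# Hypothetical contour: 12 frames near 330 Hz followed by 12 frames near 392 Hz.
print(sliding_window_notes([330] * 12 + [392] * 12))
```
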
Transform Pitches into Notes
• After segmentation by pitch
• Identified pitch frequencies (Hz): 329 392 440 523 440 392 440 392
[Figure: pitch contour]

Mid-level Representation (I): Numeric Contour
• Find each note’s distance from A440 (La)
Ex: So Mi Mi Fa Re Re Do Re Mi Fa So So So
=> So Mi Fa Re Do Re Mi Fa So (Removal of repeated notes)
=> 67 64 65 62 60 62 64 65 67 (MIDI representation)
=> -2 -5 -4 -7 -9 -7 -5 -4 -2 (Distance from 69, or “la”)
=> -3 1 -3 -2 2 2 1 2 (Difference between neighbors)

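The numeric contour steps above can be reproduced with a short Python sketch. The function names are chosen for this example and are not from the original system.

```python
A440_MIDI = 69  # MIDI note number of A440 ("la")


def remove_repeats(notes):
    """Collapse runs of identical consecutive notes into one note."""
    result = []
    for note in notes:
        if not result or result[-1] != note:
            result.append(note)
    return result


def numeric_contour(midi_notes):
    """Differences between neighboring notes after removing repeats."""
    offsets = [note - A440_MIDI for note in remove_repeats(midi_notes)]
    return [b - a for a, b in zip(offsets, offsets[1:])]


# So Mi Mi Fa Re Re Do Re Mi Fa So So So as MIDI note numbers
melody = [67, 64, 64, 65, 62, 62, 60, 62, 64, 65, 67, 67, 67]
print(numeric_contour(melody))
```

Because only differences are kept, subtracting 69 does not change the final contour; it is included to mirror the intermediate step shown on the slide.
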
Mid-level Representation (II): Ternary Contour
• Find each note’s distance from A440 (La)
Ex: So Mi Mi Fa Re Re Do Re Mi Fa So So So
=> So Mi Fa Re Do Re Mi Fa So (Removal of repeated notes)
=> 67 64 65 62 60 62 64 65 67 (MIDI representation)
=> -2 -5 -4 -7 -9 -7 -5 -4 -2 (Distance from 69, or “la”)
=> -1 1 -1 -1 1 1 1 1 (Use 1, 0, -1 as contour)

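The ternary contour keeps only the direction of each step. A minimal sketch, assuming the input is the list of neighbor differences and using a hypothetical function name:

```python
def ternary_contour(differences):
    """Map each difference to 1 (up), 0 (same), or -1 (down)."""
    return [(d > 0) - (d < 0) for d in differences]


# Neighbor differences of the example melody
print(ternary_contour([-3, 1, -3, -2, 2, 2, 1, 2]))
```
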
Mid-level Representation (III): Chord Contour
• C: do, mi, sol
• Dm: re, fa, la
• Em: mi, sol, si
• F: fa, la, do
• G: sol, si, re
• Am: la, do, mi
• E1: mi
• E2: mi, sol
E2 D2

Wave Transformation
Frequency (Hz): 293 293 329 440 392 261
=> -7 -7 -5 0 -2 -9 (Semitone offset)
=> -7 -5 0 -2 -9 (Removal of repeated notes)
=> 2 5 -2 -7 (Difference between neighbors)
Frequency to semitone offset: offset = round(12*log2(freq/440))
• freq: Note frequency (Hz)
• offset: Semitones from A440 (La)

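The frequency-to-semitone step can be checked with the standard equal-temperament relation, 12 semitones per octave relative to A440. The sketch below assumes rounding to the nearest semitone, which reproduces the offsets listed on the slide; the function name is hypothetical.

```python
import math


def semitone_offset(freq_hz, reference_hz=440.0):
    """Nearest whole number of semitones between freq_hz and the reference."""
    return round(12 * math.log2(freq_hz / reference_hz))


frequencies = [293, 293, 329, 440, 392, 261]
print([semitone_offset(f) for f in frequencies])
```
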
1-D String Matching
• γ: 1 1 2 0 -1 0 1 2 0
• λ: -3 1 1 2 4 -1 1 2 5
• LCS: Longest common subsequence
  - lcs(γ, λ) = 6
• LCCS: Longest common consecutive subsequence
  - lccs(γ, λ) = 3

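Both measures can be computed with textbook dynamic programming. The sketch below uses the standard LCS and longest-common-substring recurrences, not the modified algorithm the deck proposes, whose details are not recoverable from the slides; the function names are chosen for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (elements need not be adjacent)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lccs_length(a, b):
    """Length of the longest common run of consecutive elements."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


seq_a = [1, 1, 2, 0, -1, 0, 1, 2, 0]
seq_b = [-3, 1, 1, 2, 4, -1, 1, 2, 5]
print(lcs_length(seq_a, seq_b), lccs_length(seq_a, seq_b))
```

Both functions run in time proportional to the product of the two sequence lengths and keep only two rows of the table in memory.
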
Similarity Evaluation
• N: Number of sequence elements
• Standard deviation of the two sequences
• LCS distance
• Euclidean distance
• Cosine distance

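For orientation, the Euclidean and cosine measures can be sketched in their plain textbook forms. The deck's own formulas, including any normalization by N or by the standard deviations, did not survive in the text, so the definitions below are assumptions: equal-length sequences, and cosine distance taken as one minus the cosine similarity.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared differences (equal lengths assumed)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine of the angle between the two sequences.

    Both sequences must be non-zero and have the same length.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(euclidean_distance([0, 3], [4, 0]))
print(cosine_distance([1, 0], [0, 1]))
```
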
Modified LCS/LCCS Algorithm
• Two 1-D strings γ, λ:
• Initial values:

Our Simulink Model
[Figure: Simulink model with blocks for microphone input, pre-recorded wave files, mid-level representation, and similarity measures]

Demo: Query Results
[Figure: query result list showing scores and song names]

Part 4: Experimental Results (I)
• Song database: 212 Chinese/English/Taiwanese songs
• Experiment 1:

Experimental Results (I)
• Experiment 2:

Experimental Results (II)
[Table: ranking and titles, for ternary contour with cosine distance]

Experimental Results (III)
• Method comparison
  1. Use 60 wave files as our acoustic inputs
  2. For each file, find the output ranking number
  3. Find the total rank number of the 60 wave files

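The comparison procedure above amounts to summing, over all query files, the rank at which the intended song appears. A minimal sketch follows, assuming each query yields a score per song and that a higher score means a better match; the song names, scores, and function names are made up for illustration.

```python
def rank_of_target(scores, target):
    """1-based rank of `target` when songs are sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(target) + 1


def total_rank(queries):
    """Sum of target ranks over all (scores, target) query pairs."""
    return sum(rank_of_target(scores, target) for scores, target in queries)


# Two hypothetical queries against a tiny hypothetical database
queries = [
    ({"song_a": 0.9, "song_b": 0.5}, "song_a"),
    ({"song_a": 0.2, "song_b": 0.7, "song_c": 0.4}, "song_c"),
]
print(total_rank(queries))
```

A lower total means the method places the intended songs nearer the top of the list; a perfect method over 60 files would score 60.
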
Computation Time
• Total time = Sound recording time + Retrieval time
• Sound recording time: 5 sec (default setting)
• Retrieval time = Note segmentation + Similarity comparison

Demos
• Pitch determination demo
  - Standard octave
  - On-line display of autocorrelation
• CBMR system demo
• DTW (dynamic time warping) demo
  - Original
  - Input
  - Warping path

PART 5: Conclusions
• MATLAB, Simulink, and DSP Blockset are ideal tools for real-time audio signal processing.
• Different similarity measures lead to different results.
• The most time-consuming part is keying in single-channel MIDI files.

Current Limitations
• Matching only starts at the beginning of a song.
• Retrieval time is proportional to the number of songs in the database.
• Lyrics are neither identified nor used for similarity comparison.

Future Work
• Use DTW to allow matching anywhere in a song
• Use tree search techniques to shorten matching time
• Recognize major channels in MIDI files
• Construct a MIDI search engine on the Web
• Try other types of content-based audio retrieval
• Automatic music score generation from wave input (wave to MIDI converter)
• Chip implementation