Musical Source Separation from Monaural Audio Mert Bay ECE Dept. University of Illinois at Urbana-Champaign mertbay@illinois.edu
Outline • Introduction • Problem Description • Related Research • Overview of the method • Harmonic Modeling of Musical Sources • STFT • Analysis / Synthesis (Overlap-Add) • Harmonic Modeling • Sinusoidal Analysis based on f0s • Spectral Peak Picking based on f0 • DFT Frequency Refinement using Signal Derivatives • Amplitude Refinement • f0 Refinement • Sinusoidal Synthesis (oscillators) • Initial Separation • Results
Outline • Methods for Resolving Overlapping Harmonics • Nonnegative Matrix Factorization using a Spectral Library • Adaptive Overlap Estimation and Repair • Synthesizing Repaired Harmonics • Spectral Estimation using Least-Squares • Common Amplitude Modulation • Evaluation and Results • Evaluation Metrics • Results • Data • Results • Conclusions and Future Work
Problem Description • The goal is to recover the individual sources from a single-channel (monaural) recording. • This is an ill-posed problem that cannot be solved without some prior information about the sources (Wang and Brown 2006). • This thesis is confined to the subset of musical sources that have harmonic partials, i.e., partials at integer multiples of a common fundamental frequency f0.
Problem Description (cont’d) • Monaural musical source separation can be more challenging than speech separation. • Music is written in harmonies, so sources overlap in both time and frequency. • Harmonics are usually overlapped. Resolving overlapped partials is an important problem.
Harmonic Overlap (Collision) Problem: Looking at a single STFT frame • Bassoon f0 = 220 Hz • Clarinet f0 = 330 Hz • Oboe f0 = 440 Hz • Mixture overlaps (Hz): 440, 660, 880, 1320, 1760, 1980
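The overlap frequencies listed for this example follow directly from the three fundamentals. A minimal sketch, assuming ideal harmonics at exact integer multiples of f0 and a 2 kHz upper limit (the function names, the limit, and the 1 Hz grid tolerance are illustrative choices, not taken from the slides):

```python
def harmonics(f0, f_max):
    """Ideal harmonic frequencies h * f0 that do not exceed f_max."""
    return [h * f0 for h in range(1, int(f_max // f0) + 1)]

def find_overlaps(f0s, f_max=2000.0, tol_hz=1.0):
    """Frequencies at which harmonics of two or more sources coincide.

    Frequencies are snapped to a grid of tol_hz, which is a crude
    stand-in for a proper proximity test between partials.
    """
    owners = {}
    for name, f0 in f0s.items():
        for f in harmonics(f0, f_max):
            key = round(f / tol_hz) * tol_hz
            owners.setdefault(key, set()).add(name)
    return sorted(f for f, names in owners.items() if len(names) > 1)

overlaps = find_overlaps({"bassoon": 220.0, "clarinet": 330.0, "oboe": 440.0})
print(overlaps)
```

Under these assumptions, six of the bassoon's nine harmonics below 2 kHz collide with a harmonic of another instrument.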
Example Input Mixture • We work on continuous audio as opposed to individual notes. • Each instrument has a unique timbre. • The instruments have similar levels.
Related Work on Monaural Source Separation • Based on sinusoidal modeling: • Davy 2003, Cemgil 2005, and Vincent & Plumbley 2006 used Bayesian models for sinusoidal parameter estimation. Virtanen 2006 used least-squares (LS) estimation. • Maher 1990, Every and Szymanski 2006, and Li and Wang performed f0-based sinusoidal modeling. Every used linear interpolation and Li used common amplitude modulation (CAM) to fix the overlaps. • These methods try to estimate the parameters of the sinusoidal model (amplitudes and frequencies of harmonics). • Statistical techniques: • Casey and Westner 2000, Dubnov 2002, Fitzgerald 2002, Brown and Smaragdis 2004, and Schmidt 2005 used independent subspace analysis. • Smaragdis and Brown 2003, Helen 2005, Virtanen 2007, and Smaragdis and Mysore 2009 used NMF and PLCA. • Abdallah 2002 and Blumensath and Davies 2006 used sparse coding. • These methods try to recover components whose combination approximates the input mixture. • Computational Auditory Scene Analysis based approaches: • Bregman 1990, Mellinger 1991, Kashino 1993, Brown and Cook 1994, Godsmark and Brown 1999, Virtanen 2000, Abe and Ando 2002 • These used perceptual cues to group sinusoids into different sources. Cues include spectral proximity, common onset, common offset, common amplitude modulation, and common frequency modulation.
Overview of the Principal Method of this Thesis • Harmonic Sinusoidal modeling based on f0s • Estimation of harmonics spectra using the spectral library with Supervised NMF • Estimation and recovery of overlapped harmonics using the reproduced library spectra • Synthesis of repaired spectra
Overview of Other Methods • Least-squares estimation of harmonic amplitudes using a time-domain model based on the estimated harmonic frequencies. • Common amplitude modulation: sinusoidal modeling that uses the amplitude-vs.-time track of a non-overlapped harmonic to estimate the overlapped harmonic tracks.
Harmonic Modeling of Musical Sources: STFT • Start with STFT analysis of the mixture: • 46 ms Hamming window for analysis • 75% overlap • Overlap-add synthesis of modified spectra (used later) • The synthesis window is a triangular window divided by the analysis window.
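Dividing a triangular window by the analysis window means the product of the two windows is a plain triangle, and triangles at 75% overlap sum to a constant, so unmodified frames reconstruct the input. A minimal NumPy sketch of that analysis/synthesis pair, assuming a 2048-sample frame (about 46 ms at 44.1 kHz) and periodic window definitions; the function names and the frame length are illustrative assumptions:

```python
import numpy as np

def stft_windows(n_fft=2048):
    """Hamming analysis window and the matching synthesis window."""
    n = np.arange(n_fft)
    analysis = 0.54 - 0.46 * np.cos(2 * np.pi * n / n_fft)   # periodic Hamming
    triangle = 1.0 - np.abs(n - n_fft / 2) / (n_fft / 2)     # periodic triangle
    synthesis = triangle / analysis                           # Hamming is never zero
    return analysis, synthesis

def stft(x, analysis, hop):
    """One windowed real-input FFT per frame."""
    n_fft = len(analysis)
    starts = range(0, len(x) - n_fft + 1, hop)
    return np.array([np.fft.rfft(x[s:s + n_fft] * analysis) for s in starts])

def istft(frames, synthesis, hop):
    """Overlap-add of inverse FFTs weighted by the synthesis window."""
    n_fft = len(synthesis)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for i, spectrum in enumerate(frames):
        out[i * hop:i * hop + n_fft] += np.fft.irfft(spectrum, n_fft) * synthesis
    # Four overlapping triangles sum to 2; valid only for hop == n_fft // 4.
    return out / 2.0
```

With `hop = n_fft // 4`, every sample that is covered by four frames is reconstructed exactly; only the first and last partial frames of the signal are attenuated.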
Harmonic Modeling of Musical Sources: An example magnitude spectrum
Harmonic Modeling of Musical Sources: Harmonic peak picking on Magnitude Spectra • For each instrument, a quarter-tone range around each of its harmonics is searched for peaks in the corresponding bins of the DFT frame. • All bins k in this range are tested. A bin is declared a peak if its magnitude is greater than that of its two adjacent bins and also greater than half the magnitude of the bins two steps away on either side.
Harmonic Modeling of Musical Sources: Harmonic peak picking on Magnitude Spectrum
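The peak test above can be sketched as follows. Two readings are assumptions here: the quarter-tone range is taken as one quarter tone on each side of the nominal harmonic (a factor of 2^(±1/24)), and the second condition is read literally as "greater than half the magnitude two bins away". When several bins pass, the largest is kept, which is also an assumption. `pick_harmonic_peaks` is a hypothetical name.

```python
import numpy as np

def pick_harmonic_peaks(mag, f0, fs, n_fft, n_harmonics=20):
    """Map harmonic number -> DFT bin of the detected peak (if any).

    mag is the magnitude of one rfft frame of length n_fft // 2 + 1.
    """
    peaks = {}
    for h in range(1, n_harmonics + 1):
        target = h * f0
        lo = int(np.ceil(target * 2.0 ** (-1 / 24) * n_fft / fs))
        hi = int(np.floor(target * 2.0 ** (1 / 24) * n_fft / fs))
        best = None
        # Stay two bins away from the array ends so k-2 and k+2 exist.
        for k in range(max(lo, 2), min(hi, len(mag) - 3) + 1):
            is_peak = (mag[k] > mag[k - 1] and mag[k] > mag[k + 1]
                       and mag[k] > 0.5 * mag[k - 2]
                       and mag[k] > 0.5 * mag[k + 2])
            if is_peak and (best is None or mag[k] > mag[best]):
                best = k
        if best is not None:
            peaks[h] = best
    return peaks
```

Harmonics for which no bin passes the test are simply absent from the returned dictionary, so later stages can treat them as missing rather than as zero-amplitude partials.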
Harmonic Modeling of Musical Sources: Frequency Refinement • DFT bin frequencies are coarse estimates, sampled in increments of fs/N Hz, so the peak frequencies have to be refined. • We use the signal-derivative approach for frequency refinement (Desainte-Catherine & Marchand 2000). The idea is that the Fourier transforms of a harmonic (a sinusoid) and of its time derivative are both delta functions at the same frequency, where the derivative's transform is scaled by a gain equal to the angular frequency. • The ratio of the FT magnitude of the derivative of the signal to the FT magnitude of the signal itself, taken at the peak, is therefore the peak’s exact (angular) frequency. • This might not seem useful, since the continuous-time signal and its derivative are not available. However, the discrete version is very useful.
Harmonic Modeling of Musical Sources: Frequency Refinement • In the discrete version, the peak’s frequency is known only approximately (k_hi · fs/N). Note that the analysis window has the same effect on the DFT of the signal and on the DFT of its derivative, so it cancels in the division. • The derivative can be approximated digitally by a discrete difference of the signal. • This is a filtering operation. However, its practical gain differs from the theoretical gain of the ideal derivative. • This is compensated by scaling the magnitude spectrum of the difference signal by a frequency-dependent correction factor.
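A minimal sketch of the derivative-based refinement for a single peak. It assumes a first-order difference `fs * (x[n] - x[n-1])` as the digital derivative. Instead of a multiplicative correction factor, it undoes the gain mismatch by inverting the known gain of that difference filter, `2 * fs * sin(pi * f / fs)`, with an arcsine. Both choices, and the name `refine_frequency`, are assumptions rather than the exact formulas of the original slides.

```python
import numpy as np

def refine_frequency(x, k, fs):
    """Refine the frequency of the peak at DFT bin k.

    x holds N + 1 samples so that the first-order difference yields
    N samples aligned with x[1:]; k indexes the N-point DFT.
    """
    n_fft = len(x) - 1
    n = np.arange(n_fft)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / n_fft)    # Hamming
    signal = x[1:]
    derivative = fs * (x[1:] - x[:-1])                       # first-order difference
    basis = np.exp(-2j * np.pi * k * n / n_fft)              # single DFT bin
    s_k = np.sum(signal * window * basis)
    d_k = np.sum(derivative * window * basis)
    ratio = np.abs(d_k) / np.abs(s_k)                        # window cancels here
    # The difference filter has gain 2*fs*sin(pi*f/fs), not 2*pi*f.
    return fs / np.pi * np.arcsin(np.clip(ratio / (2 * fs), 0.0, 1.0))
```

For a real sinusoid, leakage from the negative-frequency component leaves a small residual error, so the result is an improved estimate rather than an exact one.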
Harmonic Modeling of Musical Sources: Amplitude Refinement • The amplitude is refined by taking the inner product of the windowed signal with a complex exponential at the refined digital frequency, where k is the (generally fractional) bin value corresponding to the refined frequency. • The refined amplitude is calculated from the magnitude of this inner product. • The refined phase is calculated from its angle.
Harmonic Modeling of Musical Sources • At this point, we have estimated the harmonic frequencies and amplitudes of each source based on the given f0-vs.-time tracks. • However, some of those parameters are corrupted by the overlap problem.
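The inner product at a fractional bin can be sketched as below. The normalization by the window sum (with a factor of 2 for a real sinusoid) and the name `refine_amplitude_phase` are assumptions, since the original formulas are not reproduced here; the phase is referenced to the first sample of the frame.

```python
import numpy as np

def refine_amplitude_phase(x, f_refined, fs):
    """Amplitude and phase of the sinusoid at f_refined in one frame."""
    n = np.arange(len(x))
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / len(x))   # Hamming
    # Inner product with a complex exponential at the refined frequency,
    # i.e. the DFT evaluated at a fractional bin.
    inner = np.sum(x * window * np.exp(-2j * np.pi * f_refined * n / fs))
    amplitude = 2.0 * np.abs(inner) / np.sum(window)
    phase = np.angle(inner)
    return amplitude, phase
```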
f0 Refinement • The initial f0-vs.-time tracks might come from MIDI notes, which are quantized to semitones. • The actual performers might be slightly off pitch. • The f0-vs.-time tracks can be refined based on the non-overlapping harmonics. • To simulate the effect, ground-truth f0 tracks are quantized to semitones.
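A small sketch of both steps. Quantization to the nearest equal-tempered semitone is standard; the refinement rule shown (an amplitude-weighted average of each non-overlapping harmonic's frequency divided by its harmonic number) is only one plausible choice and is an assumption, as are the function names.

```python
import numpy as np

def quantize_to_semitone(f0, a4=440.0):
    """Snap f0 values (Hz) to the nearest equal-tempered semitone."""
    midi = np.round(69 + 12 * np.log2(np.asarray(f0, dtype=float) / a4))
    return a4 * 2.0 ** ((midi - 69) / 12)

def refine_f0(harmonic_numbers, harmonic_freqs, amplitudes):
    """Estimate f0 from the non-overlapping harmonics of one frame.

    Each harmonic votes f_h / h, weighted by its amplitude (assumed rule).
    """
    h = np.asarray(harmonic_numbers, dtype=float)
    f = np.asarray(harmonic_freqs, dtype=float)
    a = np.asarray(amplitudes, dtype=float)
    return float(np.sum(a * f / h) / np.sum(a))
```

For example, a performed 446 Hz quantizes to 440 Hz, and refined harmonics at 446, 1338, and 2230 Hz (harmonics 1, 3, and 5) bring the estimate back to 446 Hz.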
Initial Separation using Binary Mask • At this point we can perform an initial separation of the mixture using the estimated parameters. • One way to do this is with unit-amplitude filters centered on each peak of the DFT. • The lower and upper edges of each filter are the first minima of the mixture magnitude spectrum on either side of the refined harmonic’s bin (within ±2 bins). • Each instrument can be separated from the mixture in each frame by multiplying the mixture magnitude spectrum with these filters. • The time-domain signal can be recovered by IFFT and overlap-add, as described earlier. • If any peak is modified, the magnitudes of the resulting bins are calculated by sampling, at the DFT bins, a 64-times oversampled window function convolved with the modified peak. • Synthesis using IFFT and overlap-add
Synthesis using Oscillators • Alternatively, each source can be synthesized using one sinusoidal oscillator per harmonic, following the sinusoidal model equation. • The estimated parameters (amplitudes and frequencies) of the sinusoids need to be interpolated between frames. • While this method works well, the resulting waveform will not have the correct absolute phase values. Results of Initial Separation • Audio examples: Mixture, Binary Mask, Original, Oscillators
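The unit-amplitude filters can be sketched as a per-frame binary mask. Starting from each harmonic's peak bin, the sketch walks downhill on the mixture magnitude spectrum until it reaches the first local minimum or a reach of two bins, whichever comes first. `harmonic_mask` and `max_reach` are hypothetical names, and the downhill walk is an assumed reading of "first minima within ±2 bins".

```python
import numpy as np

def harmonic_mask(mag, peak_bins, max_reach=2):
    """Binary mask (1.0 inside the harmonic filters, 0.0 elsewhere)."""
    mask = np.zeros(len(mag))
    for k in peak_bins:
        lo = k
        while lo > 0 and k - lo < max_reach and mag[lo - 1] < mag[lo]:
            lo -= 1                       # still descending to the left
        hi = k
        while hi < len(mag) - 1 and hi - k < max_reach and mag[hi + 1] < mag[hi]:
            hi += 1                       # still descending to the right
        mask[lo:hi + 1] = 1.0
    return mask
```

One source's magnitude spectrum for the frame is then `mask * mag`, which is combined with the mixture phase and passed to the inverse FFT and overlap-add.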
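Oscillator synthesis can be sketched as follows. The frame-rate amplitude and frequency tracks are interpolated linearly to the sample rate, and the phase is obtained by accumulating the instantaneous frequency, which is why the absolute phase of the original is not reproduced. Linear interpolation and the function names are assumptions.

```python
import numpy as np

def synthesize_harmonic(frame_times, amps, freqs, fs, duration):
    """One oscillator: interpolate parameters, integrate frequency to phase."""
    t = np.arange(int(round(duration * fs))) / fs
    a = np.interp(t, frame_times, amps)      # amplitude at every sample
    f = np.interp(t, frame_times, freqs)     # frequency (Hz) at every sample
    phase = 2 * np.pi * np.cumsum(f) / fs    # running integral of frequency
    return a * np.cos(phase)

def synthesize_source(frame_times, amp_tracks, freq_tracks, fs, duration):
    """Sum one oscillator per harmonic of a source."""
    return sum(synthesize_harmonic(frame_times, a, f, fs, duration)
               for a, f in zip(amp_tracks, freq_tracks))
```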
Outline • Methods for Resolving Overlapping Harmonics • Supervised Nonnegative Matrix Factorization using a Spectral Library • Adaptive Overlap Estimation and Repair • Synthesizing Repaired Harmonics • Spectral Estimation using Least-Squares • Common Amplitude Modulation • Evaluation and Results • Evaluation Metrics • Results • Data • Results • Conclusions and Future Work
Repairing Overlapping Harmonics using a Spectral Library • If we know some information about the sources, this information can be used to reconstruct each instrument's timbre. • This information comes from a spectral library that is learned in advance. • The spectral library should represent different pitches and dynamics of each instrument (e.g., the RWC musical instrument database (Goto 2003)). • Goal: use the library to repair or replace the overlapped harmonics.
Repairing Overlapping Harmonics using a Spectral Library: Some History • Used K-means cluster centroids to represent each pitch of each instrument from the RWC database (k=10). • Performed sinusoidal analysis on the mixture and determined the non-overlapped harmonics of each source from the proximity of each instrument’s harmonic frequencies to the other instruments' harmonic frequencies. • Did a 2-nearest-neighbor classification against the spectral library templates using the corresponding non-overlapped harmonics. • Chose two templates from the library and performed a least-squares match to the non-overlapped harmonics of the input. • Replaced the overlapped harmonics using the LS solution on the two chosen library templates. • Pitfalls: • Setting a hard threshold on frequency proximity for overlap estimation leaves insufficient data to match a library spectrum (e.g., when all of the strong first harmonics are overlapped). • Quantizing the library (K-means cluster centroids) leads to sudden switching between templates from one frame to the next and to jitter in the upper partials. • Goal: • Find a more intelligent way to declare overlaps. • Use all the available information simultaneously to make a decision from the library.
New Method: Supervised Nonnegative Matrix Factorization using a Spectral Library • Each frame of the input mixture is decomposed onto the library spectra, organized as each instrument's spectral templates for the particular f0 that is active in that frame. • Using NMF, only the weights of the spectral templates are estimated (hence "supervised"). • Each source's harmonic amplitudes are then estimated by combining the templates of the corresponding instrument with the estimated weights. • We choose to represent both the input mixture and the library spectra by their harmonic amplitude vectors, as opposed to the DFT magnitude spectra, since the library will likely be slightly off pitch compared to the input. • Using the harmonic amplitude spectra, we can model each instrument’s note independently of the precise underlying f0.
Supervised Nonnegative Matrix Factorization using a Spectral Library: Method • A harmonic amplitude vector is created from the sinusoidal analysis of the sources by combining the amplitudes of the unique harmonics. • We also keep a corresponding frequency vector, alongside the amplitude vector for the input mixture. • For the spectral library, we only choose the templates with the same f0 as each source. • For each template we create a vector of the same size, in which the template's harmonic amplitudes occupy the dimensions corresponding to their frequencies. The other dimensions are zero.
Supervised Nonnegative Matrix Factorization using a Spectral Library: Method • The input mixture can be approximated as a weighted sum of the library template vectors. • The weights wn are unknown. Once the weights are estimated, each source's harmonic amplitudes can be approximated using the corresponding part of the library. • In matrix notation, the mixture amplitude vector is approximated by the library matrix (one template per column) multiplied by the weight vector.
Supervised Nonnegative Matrix Factorization using a Spectral Library: Method • Figure labels: mixture spectral amplitude vector; first instrument’s spectral amplitude templates for f0im; i-th instrument; I-th instrument
Supervised Nonnegative Matrix Factorization using a Spectral Library: Method • The weights wn could also be solved with least-squares (LS) or other optimization methods. However, LS can assign negative weights to optimize the result, which would render the library useless. • We choose NMF to solve for the weights. NMF estimates only nonnegative weights, which confines the estimated spectra to a timbre space similar to that of the library. • The weights can be solved with the multiplicative NMF update rule. • A proof that this algorithm converges and minimizes the least-squares error can be found in Lee & Seung 2001.
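With the library held fixed, the multiplicative update of Lee & Seung for the Euclidean cost reduces to a one-line iteration on the weight vector. A minimal sketch, assuming the library templates are stacked as columns of a matrix `B` and the mixture harmonic amplitude vector is `x` (the symbol names, function names, iteration count, and initialization are assumptions):

```python
import numpy as np

def solve_weights_nmf(library, mixture, n_iter=2000, eps=1e-12):
    """Nonnegative weights w such that library @ w approximates mixture."""
    B = np.asarray(library, dtype=float)     # (n_harmonics, n_templates)
    x = np.asarray(mixture, dtype=float)     # (n_harmonics,)
    w = np.ones(B.shape[1])                  # positive start keeps w >= 0
    Btx = B.T @ x
    BtB = B.T @ B
    for _ in range(n_iter):
        w *= Btx / (BtB @ w + eps)           # multiplicative update
    return w

def source_amplitudes(library, w, columns):
    """Harmonic amplitudes of one source from its own library columns."""
    B = np.asarray(library, dtype=float)
    return B[:, columns] @ w[columns]
```

Because every factor in the update is nonnegative, the weights never change sign, which is the property that keeps each estimate a nonnegative combination of that instrument's own templates.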
Supervised Nonnegative Matrix Factorization using a Spectral Library:
Adaptive Overlap Estimation and Repair • Common collision estimation methods compare harmonic frequencies: if two harmonics are closer to each other than the main-lobe width of the window function, a collision is assumed (Every 2006, Li 2009). • These methods do not take the relative amplitudes of the harmonics into account. If a harmonic is dominant (e.g., a strong first harmonic colliding with a weak 20th harmonic of a lower instrument), there is no need to attempt to repair it. • We compare the NMF-estimated amplitudes of the harmonics declared overlapped by the frequency-proximity test. If a harmonic is dominant, it is not replaced, and the refined harmonic amplitude from the harmonic sinusoidal modeling stage is used. • Otherwise, the harmonic amplitude is replaced with the NMF estimate and the harmonic frequency is replaced with the corresponding integer multiple of the refined f0.
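The decision logic for one harmonic can be sketched as below. The dominance criterion is not quantified above, so the threshold `dominance_ratio` and the comparison against the strongest colliding partial are assumptions, as is the function name; `main_lobe_hz` stands for the main-lobe width of the analysis window in Hz.

```python
def resolve_harmonic(freq, refined_amp, nmf_amp, harmonic_number, refined_f0,
                     other_freqs, other_nmf_amps, main_lobe_hz,
                     dominance_ratio=10.0):
    """Return the (frequency, amplitude) to use for one harmonic.

    other_freqs / other_nmf_amps describe the harmonics of the other sources.
    """
    colliding = [a for f, a in zip(other_freqs, other_nmf_amps)
                 if abs(f - freq) < main_lobe_hz]
    if not colliding:
        return freq, refined_amp            # no overlap: keep the analysis values
    if nmf_amp >= dominance_ratio * max(colliding):
        return freq, refined_amp            # dominant: no repair needed
    # Overlapped and not dominant: use the NMF amplitude and an ideal
    # harmonic frequency at an integer multiple of the refined f0.
    return harmonic_number * refined_f0, nmf_amp
```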
Repairing Harmonic Collisions without Library: Spectral Estimation using Least-Squares • Method 1: Least-squares spectral estimation • Each instrument is modeled in the time domain as a sum of a cosine and a sine at each harmonic frequency. • The cosine-sine pairs are stacked as the columns of a matrix. • The resulting system is solved by LS for the amplitudes of each cosine-sine pair. • In matrix notation, the time-domain mixture is approximated by this matrix multiplied by the vector of weights.
Repairing Harmonic Collisions without Library: Spectral Estimation using Least-Squares • Figure labels: time-domain mixture; single harmonic; weights
Repairing Harmonic Collisions without Library: Using Common Amplitude Modulation • Harmonic sinusoidal modeling is performed on the input mixture based on the f0s. • The non-overlapped harmonics are separated in a similar way. • The overlapped harmonics are synthesized using a strong non-overlapped harmonic track. • The idea is that the harmonic amplitude-vs.-time tracks of a source are correlated: they follow a similar pattern. • An overlapped harmonic's amplitude track can be estimated by scaling a strong non-overlapped harmonic's amplitude-vs.-time track within the overlapped period. • The scaling factors for each track are solved using least-squares on the overlapped portion of the STFT.
Repairing Harmonic Collisions without Library: Using Common Amplitude Modulation
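The time-domain least-squares model can be sketched with `numpy.linalg.lstsq`. Each harmonic frequency contributes one cosine and one sine column, and the fitted pair is converted to an amplitude and a phase. The function name and the amplitude/phase convention `a * cos(2*pi*f*t + phi)` are assumptions.

```python
import numpy as np

def ls_harmonic_amplitudes(x, freqs, fs):
    """Least-squares amplitude and phase of a sinusoid at each frequency."""
    n = np.arange(len(x))
    columns = []
    for f in freqs:
        columns.append(np.cos(2 * np.pi * f * n / fs))
        columns.append(np.sin(2 * np.pi * f * n / fs))
    A = np.column_stack(columns)                      # (n_samples, 2 * n_freqs)
    coef, *_ = np.linalg.lstsq(A, x, rcond=None)
    c, s = coef[0::2], coef[1::2]
    # a*cos(wt + phi) = a*cos(phi)*cos(wt) - a*sin(phi)*sin(wt)
    return np.hypot(c, s), np.arctan2(-s, c)
```

The system separates partials that are close in frequency, but if two sources have harmonics at exactly the same frequency their columns are identical and only the sum is identifiable.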
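One way to pose the least-squares step is sketched below. It assumes that, over the overlapped frames, each source's colliding harmonic equals a complex constant times its reference amplitude track rotated by a known phase advance (for example, predicted from the harmonic frequency and the hop size). The complex scaling factors, the known phase advances, and the function name are assumptions of this sketch.

```python
import numpy as np

def cam_scaling(mixture_bins, ref_amp_tracks, phase_advances):
    """Fit one complex scale per source on the overlapped frames.

    mixture_bins   : complex STFT values of the colliding bin, one per frame
    ref_amp_tracks : per source, amplitude track of a strong clean harmonic
    phase_advances : per source, assumed phase (radians) at each frame
    Returns the scales and the per-source estimates (frames x sources).
    """
    A = np.column_stack([np.asarray(r) * np.exp(1j * np.asarray(p))
                         for r, p in zip(ref_amp_tracks, phase_advances)])
    y = np.asarray(mixture_bins, dtype=complex)
    scales, *_ = np.linalg.lstsq(A, y, rcond=None)
    return scales, A * scales
```

The magnitude of each column of the returned estimate is the repaired amplitude track of that source's overlapped harmonic.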