Data-Adaptive Source Separation for Audio Spatialization
This project presentation by Pradeep Gaddipati explores innovative methods for audio spatialization and source separation. Highlighting the inefficacy of traditional spatial audio over headphones, it discusses a data-adaptive time-frequency representation (TFR) approach to enhance the listener’s experience. Key topics include audio spatialization techniques, the problem statement regarding mixed audio environments, source separation and re-synthesis processes, and performance evaluation metrics. Future work aims to refine these methods to improve clarity and integration of audio sources in various playback settings.
Presentation Transcript
Data-Adaptive Source Separation for Audio Spatialization M. Tech. project presentation by Pradeep Gaddipati (08307029) Supervisors: Prof. Preeti Rao and Prof. V. Rajbabu
Outline • Problem statement • Audio spatialization • Source separation • Data-adaptive TFR • Concentration measure (sparsity) • Re-construction of signal from TFR • Performance evaluation • Data-adaptive TFR for sinusoid detection • Conclusions and future work
Problem statement • Spatial audio – surround sound • commonly used in movies, gaming, etc. • creates suspension of disbelief • applicable when the playback device is located at a considerable distance from the listener • Mobile phones • headphones – for playback • spatial audio – ineffective over headphones • lacks body-reflection cues – leads to in-the-head localization • content can't be re-recorded – hence the need for audio spatialization
Audio spatialization • Audio spatialization – a spatial rendering technique for converting the available audio into the desired listening configuration • Analysis – separating the individual sources • Re-synthesis – re-creating the desired listener-end configuration
Source separation [Figure: stereo mixtures decomposed into Source 1, Source 2 and Source 3] • Source separation – obtaining estimates of the underlying sources from a set of observations from the sensors • Time-frequency transform • Source analysis – estimation of mixing parameters • Source synthesis – estimation of sources • Inverse time-frequency representation
Mixing model • Anechoic mixing model • mixtures, xi • sources, sj • Under-determined (M < N) • M = number of mixtures • N = number of sources • Mixing parameters • attenuation parameters, aij • delay parameters, δij Figure: Anechoic mixing model – Audio is observed at the microphones with differing intensity and arrival times (because of propagation delays) but with no reverberations Source: P. O'Grady, B. Pearlmutter and S. Rickard, “Survey of sparse and non-sparse methods in source separation,” International Journal of Imaging Systems and Technology, 2005.
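The anechoic mixing model on this slide can be written out in its standard form (reconstructed here, since the slide's equation did not survive the transcript; notation follows the slide's aij/δij symbols):

```latex
x_i(t) \;=\; \sum_{j=1}^{N} a_{ij}\, s_j\!\left(t - \delta_{ij}\right),
\qquad i = 1, \ldots, M,
```

where $a_{ij}$ and $\delta_{ij}$ are the attenuation and propagation delay of source $j$ as observed at microphone $i$, with $M < N$ in the under-determined case considered here.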
Source analysis (estimation of mixing parameters) • Time-frequency representation of mixtures • Requirement for source separation [1] • W-disjoint orthogonality
Source analysis (estimation of mixing parameters) • For every time-frequency bin • estimate the mixing parameters [1] • Create a 2-dimensional histogram • peaks indicate the mixing parameters
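The per-bin estimation plus 2-D histogram step can be sketched as follows (a DUET-style NumPy sketch; the function name, the energy weighting, and the bin count are illustrative choices, not from the slides):

```python
import numpy as np

def estimate_mixing_params(X1, X2, freqs, n_bins=50):
    """Per-bin estimates of attenuation and delay from two mixture STFTs,
    pooled into a 2-D histogram whose peaks indicate the mixing parameters.
    X1, X2: complex STFTs (freq x time); freqs: angular frequencies (rad/s)."""
    eps = 1e-12
    R = (X2 + eps) / (X1 + eps)            # inter-channel ratio per T-F bin
    a = np.abs(R)                          # attenuation estimate per bin
    omega = freqs[:, None] + eps
    delta = -np.angle(R) / omega           # delay estimate from the phase
    # weight bins by mixture energy so strong bins dominate the histogram
    w = (np.abs(X1) * np.abs(X2)).ravel()
    H, a_edges, d_edges = np.histogram2d(a.ravel(), delta.ravel(),
                                         bins=n_bins, weights=w)
    return H, a_edges, d_edges
```

Peaks of `H` then give the (attenuation, delay) pairs, one per source, as described on the slide.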
Source synthesis (estimation of sources) [Figure: time-frequency masks applied to the mixture yield estimates of Source 1, Source 2 and Source 3]
Source synthesis (estimation of sources) • Source estimation techniques • degenerate unmixing technique (DUET) [1] • lq-basis pursuit (LQBP) [2] • delay and scale subtraction scoring (DASSS) [3]
Source synthesis (DUET) • Every time-frequency bin of the mixture is assigned to one of the sources based on a distance measure
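The assignment rule can be sketched as follows (a minimal NumPy sketch assuming the peak (a_j, δ_j) estimates are already available; the distance is the usual DUET likelihood-style measure, and the function name is illustrative):

```python
import numpy as np

def duet_masks(X1, X2, freqs, params):
    """Assign each T-F bin of the mixture to the closest source, producing
    one binary mask per source.  params: list of (a_j, delta_j) estimates."""
    omega = freqs[:, None]
    # distance of each bin to each candidate (a_j, delta_j):
    # |a_j exp(-i omega delta_j) X1 - X2|^2 / (1 + a_j^2)
    dists = [np.abs(a * np.exp(-1j * omega * d) * X1 - X2) ** 2 / (1 + a ** 2)
             for (a, d) in params]
    winner = np.argmin(np.stack(dists), axis=0)   # index of closest source
    return [(winner == j) for j in range(len(params))]
```

Each mask, multiplied with the mixture STFT, gives the estimate of one source, which is the masking operation depicted on the previous slide.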
Source synthesis (LQBP) • Relaxes the assumption of WDO – assumes at most ‘M’ sources present at each T-F bin • M = no. of mixtures, N = no. of sources, (M < N) • lq measure decides which ‘M’ sources are present
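For a single T-F bin, the lq selection described above can be sketched by exhaustively trying every size-M subset of sources, which is feasible for small N (an illustrative sketch, not the paper's implementation; names are hypothetical):

```python
import itertools
import numpy as np

def lqbp_bin(x, A, q=0.5):
    """l_q-basis-pursuit style estimate for one T-F bin: assume at most M of
    the N sources are active, solve the M x M system for every size-M subset
    of mixing-matrix columns, and keep the solution with the smallest l_q
    measure.  x: length-M mixture vector; A: M x N complex mixing matrix."""
    M, N = A.shape
    best, best_cost = None, np.inf
    for subset in itertools.combinations(range(N), M):
        s_sub = np.linalg.solve(A[:, subset], x)   # exact fit on the subset
        cost = np.sum(np.abs(s_sub) ** q)          # l_q measure, q < 1
        if cost < best_cost:
            s = np.zeros(N, dtype=complex)
            s[list(subset)] = s_sub
            best, best_cost = s, cost
    return best
```

With q < 1 this favors solutions where as much energy as possible sits in as few sources as possible, which is the relaxation of WDO the slide describes.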
Source synthesis (DASSS) • Identifies which bins have only one dominant source • uses DUET for those bins • assumes at most ‘M’ sources present in the remaining bins • an error threshold decides which ‘M’ sources are present
Inverse time-frequency transform [Figure: spectrograms of the stereo mixtures and of the original vs. estimated sources 1–3]
Scope for improvement • Requirement for source separation • W-disjoint orthogonality (WDO) amongst the sources • The sparser the TFR of the mixtures [4] • the less overlap there is amongst the sources (i.e. higher WDO) • the easier their separation becomes
Data-adaptive TFR • For music/speech signals • different components (harmonics/transients/modulations) at different time-instants • the best window differs for different components • this suggests using a data-dependent, time-varying window function to achieve high sparsity [6] • To obtain a sparser TFR of the mixture • use different analysis window lengths at different time-instants, choosing the one which gives maximum sparsity
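The per-instant window selection can be sketched as follows (an illustrative NumPy sketch, assuming kurtosis of the magnitude spectrum as the concentration measure, as on the later slides; function and parameter names are not from the slides):

```python
import numpy as np

def adaptive_window_sequence(x, sizes, hop):
    """For each analysis instant, compute the spectrum with several Hamming
    window lengths (zero-padded to a common FFT size) and keep the length
    whose magnitude spectrum has the highest kurtosis.
    sizes: candidate window lengths in samples; hop: hop size in samples."""
    def kurtosis(m):
        m = m - m.mean()
        return np.mean(m ** 4) / (np.mean(m ** 2) ** 2 + 1e-12)

    chosen = []
    n_fft = max(sizes)
    for center in range(n_fft // 2, len(x) - n_fft // 2, hop):
        best_size, best_k = sizes[0], -np.inf
        for n in sizes:
            frame = x[center - n // 2: center + (n - n // 2)] * np.hamming(n)
            mag = np.abs(np.fft.rfft(frame, n_fft))
            k = kurtosis(mag)
            if k > best_k:
                best_size, best_k = n, k
        chosen.append(best_size)
    return chosen
```

For a stationary tonal signal the longer window wins (narrower, sparser peak), while transients pull the choice toward shorter windows, which is exactly the time-varying behavior the slide motivates.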
Data-adaptive TFR Figure: data-adaptive time-frequency representation of a singing voice; window function = Hamming; window sizes = 30, 60 and 90 ms; hop size = 10 ms; concentration measure = kurtosis
Sparsity measure (concentration measure) • What is sparsity? • a small number of coefficients contain a large proportion of the energy • Common sparsity measures [5] • Kurtosis • Gini Index • Which sparsity measure to use for adaptation? • the one which shows the same trend as WDO as a function of analysis window size
WDO and sparsity (some formulae) • W-disjoint orthogonality [4] • Kurtosis • Gini Index
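The formulae this slide refers to are not in the transcript; the standard forms from the cited references [4][5] are reproduced here (for N time-frequency coefficients c_k, and a mask M_j for source j with interference Y_j, the sum of the other sources):

```latex
% WDO measure of mask M_j (PSR = preserved-signal ratio, SIR = signal-to-interference ratio):
\mathrm{WDO}_j \;=\; \frac{\lVert M_j S_j\rVert^2 - \lVert M_j Y_j\rVert^2}{\lVert S_j\rVert^2}
\;=\; \mathrm{PSR}_j - \frac{\mathrm{PSR}_j}{\mathrm{SIR}_j}

% Kurtosis of the coefficient magnitudes:
\kappa \;=\; \frac{\tfrac{1}{N}\sum_{k=1}^{N} \lvert c_k\rvert^{4}}
                  {\left(\tfrac{1}{N}\sum_{k=1}^{N} \lvert c_k\rvert^{2}\right)^{2}}

% Gini Index of the sorted magnitudes |c_{(1)}| \le \dots \le |c_{(N)}|:
G \;=\; 1 - 2\sum_{k=1}^{N} \frac{\lvert c_{(k)}\rvert}{\lVert c\rVert_{1}}
        \left(\frac{N - k + \tfrac{1}{2}}{N}\right)
```

Both kurtosis and the Gini Index grow as the energy concentrates into fewer coefficients, which is why they can stand in for WDO during adaptation.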
Dataset description • Dataset : BSS oracle • Sampling frequency : 22050 Hz • 10 sets each of music and speech signals • One set : 3 signals • Duration : 11 seconds
WDO and sparsity • WDO vs. window size • obtain the TFR of the sources in a set • obtain source-masks based on the magnitude of the TFRs in each of the T-F bins • using the source-masks and the TFRs of the sources, obtain the WDO measure • NOTE: in the case of data-adaptive TFR, obtain the TFR of the sources using the window sequence obtained from the adaptation of the mixture • Sparsity vs. window size • obtain the TFR of one channel of the mixture • calculate the frame-wise sparsity of that TFR
WDO and sparsity (observations) • The highest sparsity (kurtosis/Gini Index) is obtained when the data-adaptive TFR is used • The highest WDO is obtained by using the data-adaptive TFR (with kurtosis as the adaptation criterion) • Kurtosis is observed to follow a similar trend to WDO
Inverse data-adaptive TFR • Constraint (introduced by source separation) • the TFR should be invertible • Solution • select analysis windows such that they satisfy the constant overlap-add (COLA) criterion [7] • Techniques • transition window • modified (extended) window
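The following is not the transition-window or extended-window scheme itself, but a quick numerical check of the COLA condition both schemes must satisfy (an illustrative helper for a fixed window and hop; names are hypothetical):

```python
import numpy as np

def cola_error(window, hop, n_frames=20):
    """Overlap-add copies of `window` at the given hop and measure how far
    the sum deviates from a constant over the fully covered region.
    COLA holds when the returned relative deviation is (near) zero."""
    n = len(window)
    total = np.zeros(hop * (n_frames - 1) + n)
    for f in range(n_frames):
        total[f * hop: f * hop + n] += window
    core = total[n:-n]                  # ignore ramp-up/ramp-down edges
    return np.ptp(core) / core.mean()   # relative peak-to-peak deviation
```

A window/hop pair that fails this check will imprint an amplitude ripple on the re-synthesized signal, which is why the adaptation is constrained to COLA-compatible window sequences.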
Problems with re-construction • Transition window technique • adaptation is carried out only on alternate frames • the WDO obtained amongst the underlying sources is lower • Modified window technique • the extended window has larger side-lobes than a normal Hamming window • spreading the signal energy into neighboring bins • the WDO measure decreases
Dataset description • Dataset – BSS oracle • Mixtures per set (72 = 24 × 3) • attenuation parameters (24 = 4P3) • {10°, 30°, 60°, 80°} • delay parameters • {(0, 0, 0), (0, 1, 2), (0, 2, 1)} • A total of 720 (72 × 10) mixtures (test cases) for each of the music and speech groups
Performance (source estimation) • Estimate the source-masks using one of the source estimation techniques (DUET or LQBP) • Using the estimated source-masks and the TFRs of the original sources, calculate the WDO measure of each source-mask • The WDO measure indicates how well a mask • preserves the source of interest • suppresses the interfering sources
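The evaluation step above can be sketched as follows (a minimal sketch of the WDO measure in the PSR/SIR form of [4]; the function name and argument layout are illustrative):

```python
import numpy as np

def wdo_measure(mask, S, Y):
    """WDO measure of a T-F mask: the fraction of target-source energy the
    mask preserves, minus the interference energy it lets through,
    normalized by the total target energy (PSR - PSR/SIR).
    S: target-source TFR; Y: sum of the interfering-source TFRs."""
    kept_target = np.sum(np.abs(mask * S) ** 2)   # target energy preserved
    kept_interf = np.sum(np.abs(mask * Y) ** 2)   # interference leaked in
    return (kept_target - kept_interf) / np.sum(np.abs(S) ** 2)
```

A value of 1 means the mask keeps all of the target and none of the interference; lower values reflect either lost target energy or leaked interference, matching the two criteria on the slide.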
Data-adaptive TFR (for sinusoid detection) Figure: data-adaptive time-frequency representation of a singing voice; window function = Hamming; window sizes = 20, 40 and 60 ms; hop size = 10 ms; concentration measure = kurtosis; frequency range = 1000 to 3000 Hz
Conclusions • Mixing model – anechoic • Kurtosis can be used as the adaptation criterion for data-adaptive TFR • Data-adaptive TFR provides a higher WDO measure amongst the underlying sources than fixed-window STFT • Better estimates of the mixing parameters and the sources are obtained using data-adaptive TFR • The performance of DUET is better than that of LQBP
Future work • Testing of the DASSS source estimation technique • Re-construction of the signal from TFR • Need to consider a more realistic mixing model to account for reverberation effects, like echoic mixing model
Acknowledgments I would like to thank Nokia, India for providing financial support and technical inputs for the work reported here
References [1] A. Jourjine, S. Rickard and O. Yilmaz, “Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures,” IEEE Conference on Acoustics, Speech and Signal Processing (ICASSP), 2000 [2] R. Saab, O. Yilmaz, M. J. McKeown and R. Abugharbieh, “Underdetermined anechoic blind source separation via lq-basis pursuit with q<1,” IEEE Transactions on Signal Processing, 2007 [3] A. S. Master, “Bayesian two source modelling for separation of N sources from stereo signal,” IEEE Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, pp. 281-284, 2004
References [4] S. Rickard, “Sparse sources are separated sources,” European Signal Processing Conference (EUSIPCO), 2006 [5] N. Hurley and S. Rickard, “Comparing measures of sparsity,” IEEE Transactions on Information Theory, 2009 [6] D. L. Jones and T. W. Parks, “A high resolution data-adaptive time-frequency representation,” IEEE Transactions on Acoustics, Speech and Signal Processing, 1990 [7] P. Basu, P. J. Wolfe, D. Rudoy, T. F. Quatieri and B. Dunn, “Adaptive short-time analysis-synthesis for speech enhancement,” IEEE Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008
Thank you. Questions?