
Audio Segregation

Audio Segregation. 2010. 4. 26. Hyung-Min Park. Contents: independent component analysis (ICA); conventional methods for acoustic mixtures; filter bank approach to ICA; degenerate unmixing and estimation technique (DUET); target speech enhancement; zero-crossing-based binaural processing.


Presentation Transcript


  1. Audio Segregation 2010. 4. 26. Hyung-Min Park

  2. Contents • Independent component analysis (ICA) • Conventional methods for acoustic mixtures • Filter bank approach to ICA • Degenerate unmixing and estimation technique (DUET) • Target speech enhancement • Zero-crossing-based binaural processing • Inter-aural time difference (ITD) • Zero crossings vs. cross-correlation • Continuously-variable mask vs. binary mask

  3. Cocktail Party Problem

  4. Independent Component Analysis

  5. Blind Source Separation: A Demo [Figure: sources, mixing environment, and initial system]

  6. Independent Component Analysis • Blind source separation • Sensor signals • Recover the original source signals without knowing how they are mixed • ICA • Assume sources are independent • Estimate the unmixing system W from mixtures x(t) [Figure: sources s, mixing system A, mixtures x, unmixing system W, outputs u]
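
The idea on this slide can be illustrated for the simplest case, an instantaneous mixture x = A s with outputs u = W x. The sketch below is not taken from the talk: it assumes the natural-gradient infomax update with a tanh score function, a common choice for super-Gaussian sources such as speech. The function name, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

def natural_gradient_ica(x, n_iter=300, lr=0.1):
    """Estimate an unmixing matrix W so that the rows of u = W x are independent.

    x: array of shape (n_sensors, n_samples) holding zero-mean mixtures.
    Assumption: tanh score function (suits super-Gaussian sources).
    """
    n, t = x.shape
    w = np.eye(n)
    for _ in range(n_iter):
        u = w @ x                      # current source estimates
        phi = np.tanh(u)               # score function applied element-wise
        # Natural-gradient update: W <- W + lr * (I - E[phi(u) u^T]) W
        w += lr * (np.eye(n) - (phi @ u.T) / t) @ w
    return w

# Toy demo: two Laplacian sources mixed by a hypothetical matrix A.
rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 20000))
A = np.array([[1.0, 0.6], [0.5, 1.0]])
W = natural_gradient_ica(A @ s)
print(np.round(W @ A, 2))              # ideally close to a scaled permutation
```

If separation succeeds, each row of W A has one dominant entry, which reflects the usual scaling and permutation ambiguity of ICA.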

  7. Acoustic Mixtures • Instantaneous mixtures • Acoustic mixing environments • Time delay • Reverberation • Convolutive mixing [Figure: sources s1, s2 and sensors x1, x2 in a room with a wall]
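
To make the difference from instantaneous mixing concrete, a convolutive mixture can be written as x_i(n) = sum_j sum_k a_ij(k) s_j(n - k), where the filters a_ij model delay and reverberation. The sketch below is a minimal illustration under that standard model; the function name and array layout are assumptions, not part of the slides.

```python
import numpy as np

def convolutive_mix(sources, filters):
    """Mix sources through FIR room filters.

    sources: (n_src, n_samples) array.
    filters: (n_mic, n_src, n_taps) array; filters[i, j] is the impulse
             response from source j to microphone i (hypothetical layout).
    Returns mixtures of shape (n_mic, n_samples).
    """
    n_mic, n_src, _ = filters.shape
    n_samples = sources.shape[1]
    x = np.zeros((n_mic, n_samples))
    for i in range(n_mic):
        for j in range(n_src):
            # Full convolution, truncated to the original signal length.
            x[i] += np.convolve(sources[j], filters[i, j])[:n_samples]
    return x
```

An instantaneous mixture is the special case in which every filter has a single tap.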

  8. Time Domain Approach to ICA • Feedback architecture • Adaptation rules (Torkkola, 1996) • Intensive computations and slow convergence [Figure: 2x2 feedback network with inputs x1(n), x2(n), filters W11, W12, W21, W22, and outputs u1(n), u2(n)]

  9. Frequency Domain Approach to ICA (1) • In the frequency domain • Complex score function • Adaptation rule (Smaragdis, 1998) [Figure: inputs x1(n) ... xN(n) pass through a short-time Fourier transform, per-bin unmixing matrices W1, W2, ..., WK, and an inverse short-time Fourier transform to give outputs u1(n) ... uN(n)]

  10. Frequency Domain Approach to ICA (2) • Performance limitation • Contradiction between covering long reverberation and insufficient learning data • Long reverberation → long frame size • Small number of frames → insufficient input data • Mixtures combined from different time ranges of sources • Delayed mixtures [Figure: kth blocks of the delayed mixtures x1(n) = s1(n) + s2(n-d1) and x2(n) = s1(n-d2) + s2(n)]

  11. Design of a Filter Bank • Filter bank design • Uniform sixteen-channel filter bank • Decimation factor: 10 • Filter length: 220 taps • Frequency response of analysis filters

  12. Filter Bank Approach to ICA (1) • 2x2 network for the filter bank approach to ICA [Figure: each input x1(n), x2(n) passes through analysis filters H1(z) ... HK(z) with decimation by M; ICA networks W1(z) ... WK(z) operate on the subbands; expansion by M and synthesis filters F1(z) ... FK(z) give the outputs u1(n), u2(n)]

  13. Filter Bank Approach to ICA (2) • Adaptation rules • Total number of multiplications • Time domain approach • Filter bank approach • The number of required filter coefficients • Uniform K-channel oversampled filter bank is the number of adaptive filter coefficients

  14. Experimental Setup (1) • Measure for blind source separation • SIR for a 2x2 mixing/unmixing system • Sources • Two streams of speech • 5 second length • 16 kHz sampling rate
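
The SIR formula itself did not survive in this transcript. As a hedged illustration only, the sketch below assumes one definition often used for 2x2 systems: the average, over both outputs, of the power ratio (in dB) between the two source contributions, where u_i_sj denotes output i when only source j is active. The function and argument names are hypothetical.

```python
import numpy as np

def sir_db(u1_s1, u1_s2, u2_s1, u2_s2):
    """SIR in dB for a 2x2 mixing/unmixing system (assumed definition).

    Each argument is the signal at one output when a single source is
    active, e.g. u1_s2 is output 1 driven by source 2 alone.
    """
    def power(v):
        return np.mean(np.square(v))

    ratio = (power(u1_s1) / power(u1_s2)) * (power(u2_s2) / power(u2_s1))
    # Absolute value makes the measure independent of output permutation.
    return 0.5 * abs(10.0 * np.log10(ratio))
```

With this definition, leakage that is 20 dB below the desired component at both outputs yields an SIR of 20 dB.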

  15. Experimental Setup (2) • Mixing system • Virtual room to simulate impulse responses

  16. Experimental Results • Learning curves of the three different approaches

  17. Experiment on Real-Recorded Data (1) • Mixing environment • Filter bank approach • Using the sixteen-channel filter bank • Each adaptive filter: 103 taps [Figure: layout of microphones and speakers; distance labels 40 cm and 60 cm]

  18. Experiment on Real-Recorded Data (2) • Blind source separation of real recorded mixtures [Audio demo: Mixture 1, Mixture 2, Result 1, Result 2]

  19. Motivation of a Nonuniform Filter Bank Approach • Time-averaged magnitude responses of signals • The energy exponentially decreases as the frequency increases. [Figure panels: Speech, Car noise, Music] • Subband division • Result of trade-off between mitigation of undesired properties of the uniform filter bank approach and that of large adaptive filter length

  20. Relationship between Performances and Filter Length • Convergence of gradient-based algorithms • Controlled by condition number • Bordering theorem • Condition number • Monotonically nondecreasing function of filter length • The longer filter length • The slower convergence speed and
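
The claim that the condition number cannot decrease as the filter gets longer can be checked numerically. The sketch below assumes, purely for illustration, an input with autocorrelation r(k) = rho^|k| (a first-order autoregressive process) and computes the eigenvalue spread of the autocorrelation matrix for several filter lengths. The function name and the choice of rho are assumptions.

```python
import numpy as np

def condition_numbers(rho, lengths):
    """Condition number of the Toeplitz autocorrelation matrix per length.

    Assumes r(k) = rho**|k|; each smaller matrix is a leading principal
    submatrix of the larger ones, which is the setting of the bordering
    (eigenvalue interlacing) argument.
    """
    out = []
    for n in lengths:
        idx = np.arange(n)
        r = rho ** np.abs(idx[:, None] - idx[None, :])
        eig = np.linalg.eigvalsh(r)    # ascending eigenvalues
        out.append(eig[-1] / eig[0])
    return out

for n, c in zip([2, 4, 8, 16, 32], condition_numbers(0.9, [2, 4, 8, 16, 32])):
    print(n, round(c, 1))
```

For this input the values grow with length and approach ((1 + rho) / (1 - rho))^2, so longer adaptive filters see a worse-conditioned problem.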

  21. Bark-Scale Filter Banks • Subband division • Result of trade-off between mitigation of undesired properties of the uniform filter bank approach and that of large adaptive filter length • Bark frequency warping function • Bark-scale filter banks • Resemble that of the mammalian cochlea • Somewhat narrow subbands in low frequency region • Wide subbands in high frequency region
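
The warping function on this slide was lost in extraction. As an assumption, the sketch below uses the widely cited Zwicker and Terhardt approximation, z = 13 arctan(0.00076 f) + 3.5 arctan((f / 7500)^2), and places band edges at equal steps on the Bark axis. The helper names and the grid-based inversion are illustrative choices.

```python
import numpy as np

def hz_to_bark(f):
    """Zwicker-Terhardt approximation of the Bark scale (assumed here)."""
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_band_edges(n_bands, f_max):
    """Band edges in Hz that are equally spaced on the Bark axis."""
    grid = np.linspace(0.0, f_max, 8001)
    z = hz_to_bark(grid)                       # monotonically increasing
    targets = np.linspace(0.0, z[-1], n_bands + 1)
    return np.interp(targets, z, grid)         # numerical inverse of the warp

edges = bark_band_edges(16, 8000.0)            # 16 bands up to 8 kHz
print(np.round(np.diff(edges)))                # band widths in Hz
```

The printed widths are narrow at low frequencies and wide at high frequencies, matching the last two bullets of the slide.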

  22. Nonuniform Filter Bank Approach to BSS • 2x2 network for the nonuniform oversampled filter bank approach to BSS [Figure: each input x1(n), x2(n) passes through analysis filters H1(z) ... HK(z) with channel-dependent decimation M1 ... MK; ICA networks W1(z) ... WK(z) operate on the subbands; expansion by M1 ... MK and synthesis filters F1(z) ... FK(z) give the outputs u1(n), u2(n)]

  23. Design of a Bark-Scale Filter Bank • Filter design of a Bark-scale oversampled filter bank • 16-channel, , OSR=167% Bark-scale filter bank Uniform filter bank

  24. Experimental Results • Results on blind source separation in the oversampled filter bank [Figure panels: SIR, PESQ score]

  25. FPGA Implementation (1) [Figure labels: female speech, male speech, noises, noise references, microphones, mic. signals, outputs]

  26. FPGA Implementation (2) • 4 adaptive noise canceling (4 music signals) + 2 blind source separation (2 speech signals) [Audio demo: MIC1, MIC2, OUT1, OUT2]

  27. Application to Hearing Aids • BTE-type hearing aids • Front mic. SNR = 3.20 dB • Rear mic. SNR = 2.45 dB • Output SNR = 21.38 dB [Figure: speech and noise sources, distance labels 1 m and 1 m, front and rear microphones]

  28. Discussion on ICA • Assume sources are independent • Time domain approach • Intensive computations and slow convergence • Frequency domain approach • Fewer computations but inferior performance • Filter bank approach • Moderate computations and good performance • Suitable for parallel processing • Bark-scale filter bank approach

  29. Degenerate Unmixing and Estimation Technique

  30. Introduction • Independent component analysis for blind source separation • Good performance • In general, the number of microphones should not be smaller than the number of sources. • Too many parameters • Heavy computational load and slow convergence • Problems with a source that is active only for a short period

  31. Binaural Processing • Auditory scene analysis (ASA) • Cues: harmonics, pitch, onset, etc. • Spatial cues • Inter-aural time difference (ITD) • Inter-aural intensity difference (IID) [Figure: target and noise sources]

  32. DUET Algorithm (1) • Mixing model • In the time-frequency domain

  33. DUET Algorithm (2) • W-disjoint orthogonality assumption • Parameter estimation
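
The equations on the DUET slides were lost in extraction. The sketch below assumes the standard DUET formulation: x1(t) = sum_j s_j(t) and x2(t) = sum_j a_j s_j(t - d_j), so that under W-disjoint orthogonality each time-frequency point satisfies X2 / X1 = a_j exp(-i w d_j) for the one active source j. The function name, argument layout, and the small constant eps are assumptions.

```python
import numpy as np

def duet_estimates(x1, x2, omega, eps=1e-12):
    """Per-point attenuation and delay estimates from two STFTs.

    x1, x2: complex arrays of shape (n_freq, n_frames).
    omega:  angular frequencies (radians per sample), shape (n_freq,),
            all nonzero and small enough that |omega * delay| < pi.
    """
    ratio = x2 / (x1 + eps)                      # eps guards against division by zero
    attenuation = np.abs(ratio)                  # a = |X2 / X1|
    delay = -np.angle(ratio) / omega[:, None]    # d = -(1 / w) * angle(X2 / X1)
    return attenuation, delay
```

A two-dimensional histogram of these per-point estimates shows one peak per source, and the peak locations give the mixing parameters.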

  34. DUET Algorithm (3) • 2D histogram of amplitude-delay estimates from two mixtures of five sources • Amplitude parameters: (0.98, 1.02, 0.93, 1.06, 0.93) • Delay parameters: (0.3, -0.2, 0.8, -0.7, -0.2)

  35. DUET Algorithm (4) • If the j-th source is active, • Cost function • Parameter estimation • Stochastic gradient descent algorithm

  36. DUET Algorithm (5) • Mask • Demixing [Figure: binary time-frequency masks with entries 0 or 1, used to recover s1 and s2]
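
Masking and demixing can be sketched as follows, again assuming the standard DUET formulation rather than the exact equations of the talk: each time-frequency point is assigned to the source j that minimizes |a_j exp(-i w d_j) X1 - X2|^2 / (1 + a_j^2), and the resulting binary mask is applied to a mixture. The function name and array shapes are assumptions.

```python
import numpy as np

def duet_binary_masks(x1, x2, omega, a, delta):
    """Binary time-frequency masks, one per source.

    x1, x2: complex STFTs of shape (n_freq, n_frames).
    omega:  angular frequencies, shape (n_freq,).
    a, delta: attenuation and delay per source, each of shape (n_src,).
    Returns a boolean array of shape (n_src, n_freq, n_frames).
    """
    a = np.asarray(a, dtype=float)
    delta = np.asarray(delta, dtype=float)
    # Steering term a_j * exp(-i w d_j) for every source and frequency.
    steer = a[:, None] * np.exp(-1j * omega[None, :] * delta[:, None])
    cost = np.abs(steer[:, :, None] * x1[None] - x2[None]) ** 2
    cost = cost / (1.0 + a[:, None, None] ** 2)
    labels = np.argmin(cost, axis=0)             # winning source per point
    return labels[None] == np.arange(len(a))[:, None, None]
```

A simple source estimate is then `masks[j] * x1`, followed by an inverse STFT.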

  37. Target Speech Enhancement • In many practical applications, • Need to estimate a signal from a target source • The target source • Frequently, its approximate direction is known in advance. • Strong utterance in a noisy environment

  38. Proposed Method (1) • Continuously variable mask • Real mask [Figure panels: Continuously variable mask, Real mask]

  39. Proposed Method (2) • Determine a threshold. • Using a top ranking • Binary mask using a threshold [Figure panels: Binary mask, Real mask]

  40. Proposed Method (3) • Overall procedure [Flowchart blocks: STFT, attenuation-delay histogram, initializing attenuation-delay parameters, initial continuous mask, thresholding, learning attenuation-delay parameters, continuous mask, thresholding, binary mask, ISTFT] • Overall procedure of the DUET algorithm [Flowchart blocks: STFT, attenuation-delay histogram, initializing attenuation-delay parameters, learning attenuation-delay parameters, comparing likelihoods, binary mask, ISTFT]

  41. Experimental Setup (1) • Number of sources: 2 (1 target source and 1 noise source) • Input SIR: 5 dB • Simulated mixing in an anechoic environment • Source signals • 10-second-long speech signals uttered by 4 males and 4 females in the TIMIT database • Microphones • Spacing: 2 cm • Angle differences between two sources • 30˚, 60˚, 90˚, 120˚, 150˚, and 180˚ [Figure: Mic1 and Mic2 with source directions -100˚, -70˚, -40˚, -10˚, 20˚, 50˚, 80˚; distance label 50 cm]

  42. Experimental Results (1) [Table: DUET vs. proposed method; row and column labels not recoverable. Values as extracted: 22.03, 21.96, 22.03, 22.11, 21.09, 20.63, 13.82, 13.77, 13.79, 13.61, 10.99, 9.17, 98.35, 98.34, 98.26, 97.24, 96.26, 95.38, 96.26, 89.87, 89.79, 89.53, 89.55, 89.55, 89.51]

  43. Experimental Setup (2) • Number of sources: 2 (1 target source and 1 noise source) • Input SIR: 5 dB • Real recorded mixtures in a normal office room • Source signals • 10-second-long speech signals uttered by 3 male and 3 female speakers in the TIMIT database • Microphones • Spacing: 2 cm • Angle differences between two sources • 30˚, 60˚, 90˚, 120˚, 150˚, and 180˚ [Figure: Mic1 and Mic2 with source directions -90˚, -60˚, -30˚, 0˚, 30˚, 60˚, 90˚; distance label 50 cm]

  44. Experimental Results (2) [Table: DUET vs. proposed method; row and column labels not recoverable. Values as extracted: 15.93, 15.92, 15.99, 14.88, 12.70, 12.57, 12.46, 11.49, 10.99, 10.02, 6.29, 5.51, 96.93, 97.14, 97.03, 85.81, 84.67, 85.14, 83.90, 83.37, 81.77, 80.02, 76.67, 72.27]

  45. Discussion on DUET • DUET (Degenerate Unmixing and Estimation Technique) • Simple • The number of sources must be known in advance. • Estimates the attenuation and delay parameters for all sources. • Described target speech enhancement technique • Estimates the parameters for only one target source • Much faster convergence of all the required parameters • Not robust to reverberation

  46. Zero-Crossing-Based Binaural Processing

  47. Binaural Processing • Auditory scene analysis (ASA) • Spatial cues: ITD, IID • Others: harmonics, pitch, onset, etc. • Conventional methods • Inter-aural cross-correlation • Binary mask (all-or-none) • Developed method • Inter-aural zero-crossing difference • Continuously variable mask [Figure: target and noise sources]

  48. Jeffress' Model [Figure labels: left ear, right ear, multiplication, running integration, running interaural cross-correlation]

  49. Source Localization Based on Cross-Correlation • Signal model for the sensor outputs • ITD estimation based on generalized cross-correlation • Phase transform (PHAT)
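
As a rough illustration of ITD estimation with the phase transform, the sketch below whitens the cross-spectrum of the two sensor signals so that only phase remains, then picks the lag with the largest generalized cross-correlation. It is a generic GCC-PHAT sketch, not the exact formulation of the talk; the function name, sign convention, and regularization constant are assumptions.

```python
import numpy as np

def gcc_phat_delay(x1, x2, max_lag):
    """Delay of x2 relative to x1, in samples, via GCC-PHAT.

    A positive result means x2 lags x1. Only lags in
    [-max_lag, max_lag] are searched.
    """
    n = len(x1) + len(x2)                     # zero-pad to avoid circular wrap
    spec1 = np.fft.rfft(x1, n)
    spec2 = np.fft.rfft(x2, n)
    cross = spec2 * np.conj(spec1)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n)
    # Reorder so that index 0 corresponds to lag -max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(cc)) - max_lag
```

Dividing the returned lag by the sampling rate gives the ITD in seconds.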

  50. Finding Zero-Crossings [Figure labels: two microphones, ITD]
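
The zero-crossing idea can be sketched as follows: locate upward zero crossings in each microphone signal with linear interpolation, pair each crossing in one channel with the nearest crossing in the other, and read the time differences as ITD estimates. This is a simplified single-band illustration; any band-pass front end the actual method may use is omitted, and the function names and the `max_itd` limit are assumptions.

```python
import numpy as np

def upward_zero_crossings(x):
    """Fractional sample positions where x crosses zero going upward."""
    idx = np.flatnonzero((x[:-1] < 0) & (x[1:] >= 0))
    frac = -x[idx] / (x[idx + 1] - x[idx])    # linear interpolation
    return idx + frac

def zero_crossing_itd(x1, x2, fs, max_itd):
    """ITD estimates in seconds (positive when x2 lags x1).

    Assumes each channel contains at least two upward zero crossings.
    """
    t1 = upward_zero_crossings(x1) / fs
    t2 = upward_zero_crossings(x2) / fs
    pos = np.clip(np.searchsorted(t2, t1), 1, len(t2) - 1)
    left, right = t2[pos - 1], t2[pos]
    nearest = np.where(np.abs(t1 - left) <= np.abs(right - t1), left, right)
    itd = nearest - t1
    return itd[np.abs(itd) <= max_itd]        # discard implausible pairings
```

Each crossing yields its own estimate, so the result is a set of ITD samples that can be summarized, for example by a median.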
