
Voice Conversion – Part II


Presentation Transcript


  1. Voice Conversion – Part II By: Rafi Lemansky Noam Elron Under the supervision of: Dr. Yizhar Lavner

  2. Contents • Project Aims. • Voice Conversion Results (sample). • Conversion Scheme. • Restatement of Results and Discussion. • Suggestions for Future Work.

  3. Project Aims [Diagram: Input voice → Voice Conversion → Output voice] Converting speech into another speaker’s voice, based on offline training. The emphasis is on: • Good output voice quality • Robustness • Low computational complexity

  4. Voice Conversion Results (sample) [Audio samples: Original, Target, Converted]

  5. Conversion Scheme Voice Conversion is achieved in two stages: • Training – Creating a transformation function* between two speakers, using speech segments. • Conversion – Using the transformation in order to convert an input sentence. *the transformation is not symmetric.

  6. Conversion Scheme - Training [Block diagram. Source branch: training signal (source) → parametric representation (HNM, MFCC) → segmentation into speech events → codebook construction (VQ). Alignment: DTW alignment at analysis-frame resolution → alignment at speech-event resolution. Target branch: training signal (target) → parametric representation → segmentation into speech events → induced clustering of target events → codebook construction.] Training Output: a quantized vector space for each speaker’s “speech events”, with a 1-to-1 transformation table (a sketch of this step follows below). In the Future: histograms of personal characteristics (pitch, length of phonemes, etc.)
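The slides do not give code for this step; the following is a minimal Python sketch of the VQ codebook and the induced target clustering, assuming one MFCC feature vector per aligned speech event and k-means for the vector quantization (the project may have used a different clustering method). All names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(src_feats, tgt_feats, n_clusters=120):
    """Sketch of the training step: VQ on the source events plus an
    'induced' clustering of the aligned target events.

    src_feats, tgt_feats -- arrays of shape (n_events, n_dims) holding one
    feature vector (e.g. a mean MFCC vector) per aligned speech event.
    """
    # VQ codebook for the source speaker.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(src_feats)

    # Induced clustering of the target: each target event inherits the cluster
    # of the source event it is aligned with, so the 1-to-1 transformation
    # table is simply: cluster index -> centroid of the matching target events.
    tgt_codebook = np.zeros((n_clusters, tgt_feats.shape[1]))
    for c in range(n_clusters):
        tgt_codebook[c] = tgt_feats[km.labels_ == c].mean(axis=0)

    return km, tgt_codebook
```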

  7. Conversion Scheme - Conversion [Block diagram: source speaker → parametric representation → segmentation into speech events → selection of target events and additional features (using the codebook) → concatenation of target events → modification of temporal features → synthesis (target speaker).]

  8. Voice Conversion Results and Discussion

  9. Isolation of Components [Block diagram, as on slide 6: source branch (training signal → parametric representation → segmentation into speech events → codebook construction, VQ), alignment at analysis-frame and speech-event resolution, and target branch (training signal → parametric representation → segmentation into speech events → induced clustering → codebook construction).] The source and the target data come into contact only during alignment. The crux of the algorithm (highlighted in pink in the diagram) can be tested using identical source and target data. Automatic phoneme segmentation can also be bypassed (manual segmentation).

  10. Results I [Audio samples: Source, Target, ManSeg, AutoSeg, FullConv] ManSeg: 90.68 secs; 1353 phones; 120 clusters. AutoSeg: 441.87 secs; 9378 phones; 150 clusters. FullConv: 499.02 secs; 8314 phones; 150 clusters.

  11. Results II [Audio samples: Source, Target, ManSeg, AutoSeg, FullConv] Things to notice: • Good vocal imitation, but sentences are hard to decipher (even in the ManSeg conversion) → the selection of phonemes is not good. • A lot of added noise in the full conversion → the alignment needs improvement. • Mis-chosen half-phonemes sound like background speakers.

  12. Results III: The Effects of Over-Splitting and of Extended Phonemes When using phonemes that are over-split, an echo occurs. The use of extended phonemes (especially for larger codebooks) cancels this echo. [Audio samples: Source, Target, ManSeg, AutoSeg, FullConv; OL = 0 and OL = 3]

  13. Results IV: Alternative Conversion Methods After listening to the results, several variations on the algorithm were partially tested in order to explore avenues for further experimentation: • Nearest Neighbour Selection – after the cluster is chosen, the phoneme selected is not the centroid. • Conversion of Voiced Sections Only – unvoiced sections are copied directly from the source sentence. • Use of K Nearest Clusters – half way to fuzzy clustering. • Improved MFCC distance – Euclidean distance with a V/unV flag (a small sketch follows below).
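As a concrete illustration of the last variation, here is a minimal sketch of a "Euclidean distance with V/unV flag"; the penalty value and the function name are assumptions, not taken from the project.

```python
import numpy as np

def improved_mfcc_distance(mfcc_a, mfcc_b, voiced_a, voiced_b, vuv_penalty=1e3):
    """Euclidean MFCC distance augmented with a voiced/unvoiced flag:
    frames with mismatched voicing are pushed far apart."""
    d = np.linalg.norm(np.asarray(mfcc_a) - np.asarray(mfcc_b))
    if voiced_a != voiced_b:
        d += vuv_penalty  # discourage matching voiced frames to unvoiced ones
    return d
```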

  14. Results V: Alternative Conversion Methods [Audio samples] Conversion of Voiced Segments: Source, Target, Regular, Special. Improved Distance / Semi-Fuzzy Clustering: Source, Target, Regular, Special. Nearest Neighbours: ManSeg, FullConv.

  15. Discussion and Summary • Output speech quality is not satisfactory. • Most building blocks produce good output; the exceptions are phoneme separation and alignment. • The distance in phoneme space, in combination with VQ, does not perform well enough in hearing tests. • Feasibility of the paradigm for voice conversion has not been proven.

  16. Suggestions for Future Work Improving Phoneme Selection • Neural net classification. • Implementation of fuzzy clustering to extended phonemes. • Further investigation of the improved MFCC distance. • Examination of different distance measures.

  17. Suggestions for Future Work Improving Phoneme Separation and Alignment • Combined work on separation and alignment. • Use of auto-correlation to separate voiced phonemes (LPC for unvoiced phonemes). • Use of improved MFCC distance in alignment (reference to V/unV). • Develop robust two-stage DTW.

  18. The End

  19. The Benefits of a Parametric Representation of Speech Allows for relatively simple: • Comparison between speech events. • Manipulation of recorded speech events. The requirements of a parameterization scheme are: • Quality of synthesis. • Low computational needs. • Low BPS (bits per second).

  20. Back Parametric Representation - Notes • Different parametric representations can be used for different tasks. • In this project we use two parametric representations: • HNM – used mainly for speech synthesis and manipulation of recorded speech events. • MFCC – used for comparison between speech events. Both parameterizations are pitch-cycle based.

  21. Back The Harmonic + Noise Model • During the i-th pitch interval the signal is modelled as a sum of harmonics plus noise (a standard form of the equation is sketched below). • The noise ε(t) is modeled using LPC. • Advantage: utilizes the characteristics of human auditory perception. • What is the number of harmonics Ki?
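A standard form of the harmonic + noise decomposition, consistent with the a and b coefficients referred to on slide 30 (the exact equation used in the project may differ), is

s(t) = \sum_{k=1}^{K_i} \left[ a_k \cos\big(k\,\omega_0(t_i)\,t\big) + b_k \sin\big(k\,\omega_0(t_i)\,t\big) \right] + \varepsilon(t), \qquad t \in [t_i, t_{i+1}),

where ω0(ti) is the local fundamental frequency, the sum is the harmonic part sH(t), and ε(t) is the noise part.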

  22. Back HNM Analysis [Block diagram with stages: Speech Signal, V/unV decision, Pitch estimation, Time comb, Ki estimation, Noise parameters (LPC), Harmonic parameters, HNM results, Partial HNM]

  23. HNM Analysis: Pitch Estimation [Flow: Input Signal → Gross Pitch Estimation Using Real Cepstrum → Fine Pitch Estimation Using Cross-Correlation] This method eliminates pitch-multiplication errors (a sketch follows below).
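A minimal Python sketch of this two-step estimator, assuming a single Hann-windowed frame several pitch periods long and an allowed pitch range of 60-400 Hz (both assumptions; the project's parameters are not stated):

```python
import numpy as np

def estimate_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Gross pitch estimate from the real cepstrum, refined by
    cross-correlation around the gross lag. Names and ranges are illustrative."""
    # Gross estimate: peak of the real cepstrum within the allowed lag range.
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    cepstrum = np.fft.irfft(np.log(np.abs(spectrum) + 1e-12))
    lo, hi = int(fs / fmax), int(fs / fmin)
    gross_lag = lo + np.argmax(cepstrum[lo:hi])

    # Fine estimate: maximize the normalized cross-correlation of the frame
    # with itself shifted by lags near the gross estimate.
    best_lag, best_corr = gross_lag, -np.inf
    for lag in range(max(lo, gross_lag - 5), min(hi, gross_lag + 6)):
        a, b = frame[:-lag], frame[lag:]
        corr = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return fs / best_lag
```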

  24. Pseudo-Harmonic Analysis – Determining Ki • For every harmonic-centered segment, check whether the local maximum is high, sharp and centered enough. • If it is, mark a harmonic. • The highest harmonic found determines Ki.

  25. HNM Analysis Pseudo-Harmonic Analysis – Determining Ki Take an analysis frame. Concatenate m frames (to highlight periodicity). Zero-pad (to increase spectral resolution).

  26. HNM Analysis Pseudo-Harmonic Analysis • Using the pseudo-inverse method, minimize the squared reconstruction error (a standard formulation is sketched below).
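One standard least-squares formulation of this step, assuming the harmonic model of slide 21 (the project's exact error expression may differ), is

E = \sum_{t \in \text{frame}} \Big( s(t) - \sum_{k=1}^{K_i} \big[ a_k \cos(k\omega_0 t) + b_k \sin(k\omega_0 t) \big] \Big)^2 ,

minimized over the a_k and b_k by stacking the cosine and sine basis vectors into a matrix B and computing [a; b] = B⁺ s with the pseudo-inverse B⁺.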

  27. HNM Analysis Noise Analysis (LPC) For each pitch frame, find A and w such that the noise part is modelled as normalized WGN n(t), shaped by the all-pole filter 1/A(ti, z) and scaled by w(ti), where A(ti, z) is a normalized all-pole filter and w(ti) is the local energy of the noise. In an unvoiced analysis frame sH = 0. The resulting all-pole spectrum serves as an estimate of the power spectral density of the noise.

  28. HNM Analysis: HNM Results [Audio samples, normalized: Original, Harmonic part, Noise part, Reconstructed. Fs = 16000 Hz, bit depth = 16.]

  29. Partial HNM Synthesis I Pseudo-harmonics are not well understood; therefore, they are unyielding to manipulation. For the remainder of the project, synthesis will use only real harmonics. Although the MSE for “harmonic only analysis” is smaller, speech quality is lower than with regular pseudo-harmonic analysis.

  30. HNM Analysis Partial HNM Synthesis II [Audio samples: Original, Full HNM synthesis, Pseudo-Harmonic Analysis / Harmonic Synthesis, Harmonic Analysis / Harmonic Synthesis] Hypothesis: with pseudo-harmonic analysis, the a coefficients give a better approximation of the local harmonics, because the b coefficients represent changes in the local harmonics and time-comb errors.

  31. Prosodic Modifications using HNM Time stretching requires only the addition (or removal) of synthesis frames, and the assignment of parameters to the new frames. Pitch shifting also requires resampling of the spectral envelope. For a Time Stretch Ratio Contour β(t) and a Pitch Shift Ratio Contour α(t), there is an integral equation for the creation of the new time-comb (one common form is sketched below). α, β and P are piecewise constant (within each analysis frame), so the equation is solved iteratively.
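One common formulation of the new time-comb, given here only as an illustration of the iterative solution (the project's integral equation may be stated differently), is

D(t) = \int_0^{t} \beta(\tau)\, d\tau, \qquad \tilde{t}_{k+1} = \tilde{t}_k + \frac{P\big(D^{-1}(\tilde{t}_k)\big)}{\alpha\big(D^{-1}(\tilde{t}_k)\big)},

where D maps original time to time-stretched time, the \tilde{t}_k are the new synthesis pitch marks, and P is the original pitch-period contour. With α, β and P piecewise constant, each step of the recursion has a closed form, so the marks are generated one analysis frame at a time.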

  32. Resampling the Spectral Envelope I - Theory • The harmonic part of speech is quasi-periodic → it can be written as a function convolved with an impulse train. • Therefore, its spectral representation is sampled, where the sampling frequency is dependent on the pitch. • CHANGING THE PITCH REQUIRES RESAMPLING THE SPECTRAL ENVELOPE. • We assume that the HNM harmonic coefficients are a good approximation of the spectral envelope.

  33. Resampling the Spectral Envelope II - Practice • We want to evaluate the real-cepstrum coefficients so that the resulting cepstral envelope function “follows the contour” of the spectral envelope as closely as possible. • We then resample this function at the new frequency locations.

  34. Resampling the Spectral Envelope III - Practice • “Following the contour” is defined as minimizing a regularized least-squares error measure, solved via pseudo-inverse (a standard form is sketched below). [Figure: envelope fits for λ = 0 and λ = 5·10⁻⁴]
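A standard real-cepstrum formulation of this fit, with a smoothness regularizer suggested by the λ values on the slide (the exact penalty used in the project is an assumption), is

C(\omega) = c_0 + 2\sum_{m=1}^{M} c_m \cos(m\omega), \qquad
E(\mathbf{c}) = \sum_{k=1}^{K} \big( \log|a_k| - C(\omega_k) \big)^2 + \lambda \sum_{m=1}^{M} m^2 c_m^2 ,

solved for the c_m by (regularized) pseudo-inverse; the fitted C(ω) is then resampled at the new harmonic frequencies after pitch shifting.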

  35. Back to AutoSeg Resampling the Spectral Envelope IV – Bark Scale • The sensitivity of the human ear is logarithmic in both pitch and amplitude. • The described resampling method is logarithmic in amplitude but linear in pitch. • Therefore: redistribute the harmonics that are to be resampled on a quasi-logarithmic scale.

  36. Resampling the Spectral Envelope V - Demonstration [Audio samples: original /o/ from “strong”; after pitch shifting with α = 0.805]

  37. Time Stretch Results TS ratio 0.6 Original TS ratio 1.3 TS ratio 1.8

  38. Back Pitch Shift Results Original PS ratio 0.6 PS ratio 1.3

  39. Back Mel Frequency Cepstrum Coefficients [Flow: Speech Signal → Frame blocking → Windowing → FFT → Mel-frequency warping → Cepstrum → Mel Cepstrum] An illustrative computation follows below.
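Purely as an illustration of this pipeline (the project implemented the chain itself; librosa is used here only for brevity, and the file name is hypothetical):

```python
import librosa

# Frame blocking, windowing, FFT, mel-frequency warping and the cepstrum
# are all wrapped inside librosa.feature.mfcc.
y, sr = librosa.load("utterance.wav", sr=16000)            # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=512, hop_length=160)      # ~10 ms hop at 16 kHz
print(mfcc.shape)  # (13, n_frames)
```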

  40. Automatic Phoneme Separation • A large training bank cannot be separated manually. • Option for online conversion. The problem of separating N phonemes is equivalent to the problem of placing N-1 “Split Points” in the signal. Split points are characterized by changes in: • Energy • Voiced/Unvoiced sound • SPECTRAL ENVELOPE

  41. Current Results with Automatic Phoneme Separation Ratio of Auto/Manual splits is 1.38

  42. The HNM Distance Measure • The signal is coded in two additive parts; the distance, therefore, is made up of two additive parts, a harmonic distance and a noise distance, where the weight of each part is determined using the local energy of the two parts (a standard weighted form is sketched below).
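A weighted sum consistent with this description (the exact weighting rule is an assumption) is

d(f_1, f_2) = \lambda\, d_H(f_1, f_2) + (1 - \lambda)\, d_N(f_1, f_2), \qquad \lambda = \frac{E_H}{E_H + E_N},

where d_H and d_N are the harmonic and noise distances defined on the next slides, and E_H, E_N are the local energies of the harmonic and noise parts.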

  43. Harmonic Distance Measure Assumptions • We need to measure distances between spectral envelopes. • Distance measurement between orthogonal coordinates is usually Euclidean. • Harmonic coefficients a are orthogonal coordinates, so the distance between them “should” be Euclidean. Note: we want only an “envelope shape” distance, disregarding energy; hence the discussion is limited to normalized harmonic coefficients.

  44. Harmonic Distance Measure Problem • Due to pitch differences between frames, a1 and a2 are the coefficients of different members of the orthonormal family. [Figure: the harmonics of the two frames fall at different frequencies.] Euclidean distance between them may result in a large distance even for identical spectral envelopes!

  45. Harmonic Distance Measure Solution • Harmonic coefficients a1 and a2 should undergo pitch shifting to a common pitch before distance calculation. • This involves converting harmonic coefficients ai into cepstral coefficients ci and back. • By Parseval’s theorem, since cepstral coefficients are orthonormal coordinates, the Euclidean distance between them is equal to the distance between the harmonic coefficients.

  46. Harmonic Distance Measure Summary The harmonic distance is defined as the distance between the normalized cepstral coefficients, calculated using the bark scale. • The distance between two unvoiced frames is zero. • The distance between a voiced and an unvoiced frame is defined as the maximum in the calculated batch.

  47. Noise Distance Measure – Itakura Saito Distance
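The standard Itakura-Saito distance between a measured power spectrum P(ω) and a model (e.g. LPC) spectrum P̂(ω) is

d_{IS}(P, \hat{P}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ \frac{P(\omega)}{\hat{P}(\omega)} - \log\frac{P(\omega)}{\hat{P}(\omega)} - 1 \right] d\omega ,

which matches the LPC-based noise model of slide 27; how the project evaluated it in practice is not specified in the slides.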

  48. Back Splitting Algorithm • Mark silences (parameters are nil). • Mark voiced/unvoiced changes. • Mark points where the energy ratio is larger than a threshold; “cleaning”. • Mark HNM distance peaks; “cleaning”. A peak is marked only if it is high enough and narrow enough (a sketch of this step follows below).
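A minimal sketch of the final peak-marking step, assuming a 1-D numpy array of frame-to-frame HNM distances; the height and width criteria and their thresholds are illustrative assumptions:

```python
import numpy as np

def mark_distance_peaks(hnm_dist, height_thr, max_width):
    """Keep a peak in the frame-to-frame HNM distance only if it is a local
    maximum that is high enough and narrow enough; return its frame indices."""
    splits = []
    for i in range(1, len(hnm_dist) - 1):
        if hnm_dist[i] <= max(hnm_dist[i - 1], hnm_dist[i + 1]):
            continue                      # not a local maximum
        if hnm_dist[i] < height_thr:
            continue                      # not high enough
        # width = number of neighbouring frames above half the peak height
        half = hnm_dist[i] / 2.0
        width = np.sum(hnm_dist[max(0, i - max_width):i + max_width + 1] > half)
        if width <= max_width:
            splits.append(i)              # narrow enough -> split point
    return splits
```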

  49. Dynamic Time Warping [Figure: an I×J grid with sequence S along the I axis and sequence T along the J axis.] DTW is used for aligning “parallel” events in two sequences of data. The two series are arranged in a grid, and DTW determines the “least-cost” path through the grid. The optimal path determines the desired alignment through node pairs.

  50. Dynamic Time Warping [Figure: the DTW grid with start point (1,1), end point (m,n), and the region allowed for movement.] The sequences used for aligning events in the source and target training sets are the MFCCs of the analysis frames. The cost of the path was calculated thus: • Node Cost – MFCC distance (Euclidean) between S and T frames. • Local Constraint – movement only as depicted in the left figure*. • Global Constraint I – parallelogram constraint (right figure)*. • Global Constraint II – exponential cost for movement parallel to the axes. * Realized through infinite cost on illegal paths. A simplified sketch of such a constrained DTW follows below.
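A simplified Python sketch of such a constrained DTW over MFCC frame matrices. The band constraint and the fixed axis-parallel penalty stand in for the parallelogram and exponential-cost constraints described above, so this is an illustration of the idea rather than the project's exact implementation:

```python
import numpy as np

def dtw_align(S, T, band=0.2, axis_penalty=1.5):
    """Constrained DTW sketch. S and T are arrays of MFCC frames with shapes
    (n, d) and (m, d). Returns the total cost and the aligned frame pairs."""
    n, m = len(S), len(T)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Global constraint: forbid nodes too far from the diagonal.
            if abs(i / n - j / m) > band:
                continue
            cost = np.linalg.norm(S[i - 1] - T[j - 1])        # node cost (Euclidean MFCC distance)
            D[i, j] = min(D[i - 1, j - 1] + cost,              # diagonal move
                          D[i - 1, j] + axis_penalty * cost,   # move parallel to the I axis
                          D[i, j - 1] + axis_penalty * cost)   # move parallel to the J axis
    # Backtrack from (n, m) to (1, 1) to recover the aligned frame pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]
```

The returned path is the list of aligned (source frame, target frame) pairs, which is what the training stage uses to put source and target speech events in correspondence.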
