Comparing Audio Signals
What makes it difficult?
• Phase misalignment
• Deeper peaks and valleys
• Pitch misalignment
• Energy misalignment
• Embedded noise
• Length of vowels
• Phoneme variance
Review: Minimum Distance Algorithm
Array[i,j] = min{1 + Array[i-1,j], cost(i,j) + Array[i-1,j-1], 1 + Array[i,j-1]}
Pseudo Code (minDistance(target, source))
n = number of characters in source
m = number of characters in target
Create array, distance, with dimensions n+1, m+1
FOR r = 0 TO n
    distance[r,0] = r
FOR c = 0 TO m
    distance[0,c] = c
FOR each row r = 1 TO n
    FOR each column c = 1 TO m
        (source and target are indexed from 1 here)
        IF source[r] = target[c] cost = 0 ELSE cost = 1
        distance[r,c] = minimum of
            distance[r-1,c] + 1,       // deletion (source[r] is dropped)
            distance[r,c-1] + 1,       // insertion (target[c] is added)
            distance[r-1,c-1] + cost   // substitution, or a match when cost = 0
Result is in distance[n,m]
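The pseudo code above translates almost line for line into Python. This is a minimal sketch, assuming plain strings as input and a unit cost for every edit; the function name `min_distance` is chosen here for illustration and is not from the slides.

```python
def min_distance(target: str, source: str) -> int:
    """Minimum number of single-character edits that turn source into target."""
    n, m = len(source), len(target)
    # distance[r][c] = cost of turning the first r source characters
    # into the first c target characters.
    distance = [[0] * (m + 1) for _ in range(n + 1)]
    for r in range(n + 1):
        distance[r][0] = r  # drop all r source characters
    for c in range(m + 1):
        distance[0][c] = c  # add all c target characters
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            # Python strings are 0-based, hence the r - 1 and c - 1.
            cost = 0 if source[r - 1] == target[c - 1] else 1
            distance[r][c] = min(
                distance[r - 1][c] + 1,         # deletion
                distance[r][c - 1] + 1,         # insertion
                distance[r - 1][c - 1] + cost,  # substitution or match
            )
    return distance[n][m]
```

For example, `min_distance("sitting", "kitten")` evaluates to 3 (two substitutions and one insertion).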
Is Minimum Distance Applicable?
• Maybe?
  • The optimal distance at indices [a,b] is a function of the costs at smaller indices.
  • This suggests that a dynamic programming approach may work.
• Problems
  • The cost function is more complex. A binary equal-or-not-equal test doesn’t work.
  • We need to define a distance metric. But what should that metric be? Answer: It depends on which audio features we use.
  • Longer vowels may still represent the same speech. The classical solution is not to apply a cost when going from index [i-1,j] or [i,j-1] to [i,j]. Unfortunately, this assumption can lead to singularities, which result in incorrect comparisons.
Complexity of Minimum Distance
• The basic algorithm is O(m*n), where m is the length (in samples) of one audio signal and n is the length of the other. If m = n, the algorithm is O(n²). Why? Count the number of cells that need to be filled in.
• O(n²) may be too slow. Alternate solutions have been devised.
  • Don’t fill in all of the cells.
  • Use a multilevel approach.
• Question: Are the faster approaches needed for our purposes? Perhaps not!
Don’t Fill in all of the Cells
Problem: May miss the optimal minimum distance path
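The slide does not say which cells are skipped. One common choice, assumed here purely for illustration, is to fill only a diagonal band of cells with |r - c| <= band. The sketch below (the name `banded_min_distance` and the `band` parameter are hypothetical) applies that restriction to the edit-distance recurrence and shows how a narrow band can miss the optimal path.

```python
import math

def banded_min_distance(target: str, source: str, band: int) -> float:
    """Edit distance that only fills cells with |r - c| <= band."""
    n, m = len(source), len(target)
    # Cells outside the band stay at infinity, so no path can use them.
    distance = [[math.inf] * (m + 1) for _ in range(n + 1)]
    distance[0][0] = 0
    for r in range(n + 1):
        for c in range(max(0, r - band), min(m, r + band) + 1):
            if r == 0 and c == 0:
                continue
            best = math.inf
            if r > 0:
                best = min(best, distance[r - 1][c] + 1)   # deletion
            if c > 0:
                best = min(best, distance[r][c - 1] + 1)   # insertion
            if r > 0 and c > 0:
                cost = 0 if source[r - 1] == target[c - 1] else 1
                best = min(best, distance[r - 1][c - 1] + cost)
            distance[r][c] = best
    # The result is infinite when the corner cell lies outside the band.
    return distance[n][m]
```

With source "abcdef" and target "cdefgh", the true minimum distance is 4 (drop "ab", add "gh"). That path needs an offset of two cells from the diagonal, so `banded_min_distance("cdefgh", "abcdef", 1)` returns 6 while a band of 2 recovers 4.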
The Multilevel Approach
Concept
1. Down sample to coarsen the array
2. Run the algorithm
3. Refine the array (up sample)
4. Adjust the solution
5. Repeat steps 3-4 until the original sample rate is restored
Notes
• The multilevel approach is a common technique for reducing the complexity of many algorithms from O(n²) to O(n lg n)
• An example is partitioning a graph to balance workloads among threads or processors
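The five steps can be sketched roughly as follows. This is a simplified illustration in the spirit of the FastDTW idea, not the slides' own implementation. It assumes each signal is a list of numbers, uses the absolute difference as the per-cell cost, charges nothing extra for repeated or skipped samples, and coarsens by averaging neighboring pairs. The names `dtw`, `halve`, `multilevel_dtw`, and `radius` are all hypothetical, and `radius` is expected to be at least 1.

```python
def dtw(a, b, window=None):
    """Align a with b using only the allowed cells; returns (cost, path)."""
    n, m = len(a), len(b)
    if window is None:
        cells = [(i, j) for i in range(n) for j in range(m)]
    else:
        cells = sorted(window)  # row-major order, so predecessors come first
    best = {}  # (i, j) -> (accumulated cost, predecessor cell)
    for i, j in cells:
        d = abs(a[i] - b[j])
        if i == 0 and j == 0:
            best[(i, j)] = (d, None)
            continue
        candidates = [(best[p][0], p)
                      for p in ((i - 1, j - 1), (i - 1, j), (i, j - 1))
                      if p in best]
        if candidates:
            cost, prev = min(candidates)
            best[(i, j)] = (cost + d, prev)
    # Walk the predecessor links back from the last cell to recover the path.
    path, cell = [], (n - 1, m - 1)
    while cell is not None:
        path.append(cell)
        cell = best[cell][1]
    path.reverse()
    return best[(n - 1, m - 1)][0], path

def halve(x):
    """Step 1: down sample by averaging neighboring pairs."""
    return [(x[k] + x[k + 1]) / 2 for k in range(0, len(x) - 1, 2)]

def multilevel_dtw(a, b, radius=1):
    """Coarse-to-fine alignment; the result may be above the true minimum."""
    if len(a) <= radius + 2 or len(b) <= radius + 2:
        return dtw(a, b)  # step 2: small enough to solve exactly
    _, coarse_path = multilevel_dtw(halve(a), halve(b), radius)
    # Step 3: project the coarse path onto the finer grid, padded by radius.
    window = set()
    for ci, cj in coarse_path:
        for i in range(max(0, 2 * ci - radius), min(len(a), 2 * ci + 2 + radius)):
            for j in range(max(0, 2 * cj - radius), min(len(b), 2 * cj + 2 + radius)):
                window.add((i, j))
    # Step 4: adjust the solution by re-running inside the window only.
    return dtw(a, b, window)
```

Because each level only searches near the projected coarse path, the returned cost can never be lower than the full-table result and may be higher, which is the same trade-off as skipping cells.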
Singularities
• Assumption
  • The minimum distance comparing two signals depends only on the previous adjacent entries
  • The cost function accounts for the varied length of a particular phoneme, which causes the cost at particular array indices to no longer be well-defined
• Problem: The algorithm can compute incorrect results because of mismatched alignments
• Possible solutions:
  • Compare based on the change of feature values between windows instead of the values themselves
  • Pre-process to eliminate the causes of the mismatches
Possible Preprocessing
• Remove the phase from the audio:
  • Compute the Fourier transform
  • Perform a discrete cosine transform on the amplitudes
• Normalize the energy of voiced audio:
  • Compute the energy of both signals
  • Multiply the larger by the percentage difference
• Remove the DC offset: Subtract the average amplitude from all samples
• Brick wall normalize the peaks and valleys:
  • Find the average peak and valley value
  • Set values larger than the average equal to the average
• Normalize the pitch: Use PSOLA to align the pitch of the two signals
• Remove duplicate frames: Autocorrelate frames at pitch points
• Remove noise from the signal: Implement a noise removal algorithm
• Normalize the speed of the speech:
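Three of the simpler steps above (DC offset, energy, and brick wall normalization) can be sketched in a few lines. The sketch rests on several assumptions that are not in the slides: samples are plain lists of numbers, energy is the sum of squared samples, "percentage difference" is read as scaling the louder signal's amplitudes by the square root of the energy ratio, and the brick wall step clips symmetrically at the average peak and the average valley. All function names are hypothetical.

```python
import math

def remove_dc_offset(samples):
    """Subtract the average amplitude from every sample."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]

def energy(samples):
    """Assumed definition: sum of squared sample values."""
    return sum(s * s for s in samples)

def match_energy(a, b):
    """Scale the higher-energy signal down so both energies are equal."""
    ea, eb = energy(a), energy(b)
    if ea == 0 or eb == 0:
        return list(a), list(b)  # a silent signal cannot be matched
    if ea > eb:
        gain = math.sqrt(eb / ea)  # amplitude gain is the root of the energy ratio
        return [s * gain for s in a], list(b)
    gain = math.sqrt(ea / eb)
    return list(a), [s * gain for s in b]

def brick_wall(samples):
    """Clip samples to the average local peak and average local valley."""
    inner = range(1, len(samples) - 1)
    peaks = [samples[i] for i in inner
             if samples[i - 1] < samples[i] >= samples[i + 1]]
    valleys = [samples[i] for i in inner
               if samples[i - 1] > samples[i] <= samples[i + 1]]
    if not peaks or not valleys:
        return list(samples)  # nothing to measure against
    ceiling = sum(peaks) / len(peaks)
    floor = sum(valleys) / len(valleys)
    return [min(max(s, floor), ceiling) for s in samples]
```

For instance, `remove_dc_offset([1, 2, 3])` gives `[-1.0, 0.0, 1.0]`. Real audio would normally be processed frame by frame, and only on voiced frames for the energy step, which this sketch does not attempt.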
Which Audio Features?
• Cepstrals: They are statistically independent, and phase differences are removed
• ΔCepstrals or ΔΔCepstrals: Reflect how the signal is changing from one frame to the next
• Energy: Distinguishes the frames that are voiced versus those that are unvoiced
• Normalized LPC coefficients: Represent the shape of the vocal tract, normalized by vocal tract length for different speakers
These are the popular features used for speech recognition
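The Δ and ΔΔ features can be illustrated with a simple first difference between consecutive frames. This is a minimal sketch, assuming each frame is a list of feature values and that the first frame's delta is zero. Practical front ends often use a regression over several neighboring frames instead; the name `deltas` is hypothetical.

```python
def deltas(frames):
    """Frame-to-frame change of each feature; the first frame gets zeros."""
    if not frames:
        return []
    out = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out

# Hypothetical two-feature frames, used only to show the call pattern.
frames = [[1.0, 2.0], [2.0, 4.0], [4.0, 4.0]]
delta = deltas(frames)       # how each feature changes per frame
delta_delta = deltas(delta)  # how that change itself changes
```

Applying the function twice yields the ΔΔ values, so the same helper covers both feature types.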
Which Distance Metric?
• General formula: array[i,j] = distance(i,j) + min{array[i-1,j], array[i-1,j-1], array[i,j-1]}
• Assumption: There is no cost assessed for duplicate or eliminated frames.
• Distance formula:
  • Euclidean: sum the squared differences between corresponding features (and take the square root)
  • Linear: sum the absolute values of the differences between corresponding features
• Weighting the features: Multiply each feature’s difference by a weighting factor to give greater or lesser emphasis to certain features
• Example of a distance metric using linear distance: ∑ w[i] |fa[i] – fb[i]|, where fa[i] and fb[i] are the values of a particular audio feature for signals a and b, and w[i] is that feature’s weight
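Putting the general formula and the weighted linear distance together gives something like the sketch below. It assumes each signal is a list of frames, each frame a list of feature values, and that the border of the table is initialized to infinity except for the starting corner. The function names are chosen for illustration only.

```python
import math

def weighted_linear_distance(fa, fb, weights):
    """Sum of w[i] * |fa[i] - fb[i]| over the features of one frame pair."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, fa, fb))

def signal_distance(a, b, weights):
    """array[i,j] = distance(i,j) + min of the three earlier neighbors."""
    n, m = len(a), len(b)
    array = [[math.inf] * (m + 1) for _ in range(n + 1)]
    array[0][0] = 0.0  # both signals must start aligned
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = weighted_linear_distance(a[i - 1], b[j - 1], weights)
            # No extra charge for moving from [i-1,j] or [i,j-1],
            # so duplicate or eliminated frames cost nothing by themselves.
            array[i][j] = d + min(array[i - 1][j],
                                  array[i - 1][j - 1],
                                  array[i][j - 1])
    return array[n][m]
```

With this recurrence a signal compared against a copy of itself with one frame repeated scores 0, which is the intended effect for longer vowels and also the source of the singularity problem noted earlier.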