Speaker Recognition Controversy

Prof. Dr. P. Chandra Sekharan

The CD said to contain recordings of a conversation between Shanti Bhushan and Mulayam Singh has been examined forensically. The Central Forensic Science Laboratory (CFSL) appears to have opined that the CD is not doctored, while Truth Labs, Hyderabad, has said it is fabricated. It is also claimed that a laboratory in the United States has said that the CD is fabricated.

Inasmuch as the recording on a CD can only be a secondary recording, which will always carry the inherent effects of editing, volume management, noise suppression and so on, the finding that the CD is doctored or fabricated has no forensic evidentiary value.

What is important in this case is speaker verification [or speaker authentication] and identification. • Speaker recognition uses the acoustic features of speech that have been found to differ between individuals. These acoustic patterns reflect both anatomy, namely the size and shape of the throat and mouth, and learned behavioural patterns, namely voice pitch and speaking style. Speaker verification has earned speaker recognition its classification as a "behavioural biometric."

There are two major applications of speaker recognition technologies and methodologies. If the speaker claims to be of a certain identity and the voice is used to verify this claim, this is called verification or authentication. On the other hand, identification is the task of determining an unknown speaker's identity. • In a sense, speaker verification is a 1:1 match, where one speaker's voice is matched to one "voice print" or "voice model", whereas speaker identification is a 1:n match, where the voice is compared against n voice prints.

Each speaker recognition system has two phases: i) enrolment (specimen recording) and ii) verification. During enrolment, the speaker's voice is recorded and typically a number of features are extracted to form a voice print, template, or model. In the verification phase [which is similar to our case in question], a speech sample or "utterance" is compared against a previously created voice print.
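The enrolment, 1:1 verification and 1:n identification steps described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the three-number feature vectors, the speaker names, the cosine-similarity score and the 0.9 threshold are all hypothetical, and real systems extract far richer features from the audio.

```python
import math

def cosine_similarity(a, b):
    """Similarity between two feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def enrol(feature_vectors):
    """Enrolment: average several specimen vectors into one voice print."""
    count = len(feature_vectors)
    dim = len(feature_vectors[0])
    return [sum(v[i] for v in feature_vectors) / count for i in range(dim)]

def verify(utterance, voice_print, threshold=0.9):
    """Verification (1:1): does the utterance match the claimed voice print?"""
    return cosine_similarity(utterance, voice_print) >= threshold

def identify(utterance, voice_prints):
    """Identification (1:n): name of the best-matching enrolled voice print."""
    return max(voice_prints,
               key=lambda name: cosine_similarity(utterance, voice_prints[name]))

# Hypothetical toy features; real features come from the audio signal.
prints = {
    "speaker_a": enrol([[1.0, 0.2, 0.1], [0.9, 0.3, 0.1]]),
    "speaker_b": enrol([[0.1, 1.0, 0.8], [0.2, 0.9, 0.9]]),
}
questioned = [0.95, 0.25, 0.12]

is_a = verify(questioned, prints["speaker_a"])   # one comparison
best = identify(questioned, prints)              # n comparisons
print(is_a, best)
```

Verification performs a single comparison while identification performs one comparison per enrolled voice print, which is why identification takes longer as the number of enrolled speakers grows.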

For identification systems, the utterance is compared against multiple voice prints in order to determine the best match(es) while verification systems compare an utterance against a single voice print. Because of the process involved, verification is faster than identification. • Speaker recognition systems fall into two categories: text-dependent and text-independent.

If the text must be the same for enrolment and verification, this is called text-dependent recognition. In a text-dependent system, prompts can either be common across all speakers (e.g., a common pass phrase) or unique. In addition, shared secrets (e.g., passwords and PINs) or knowledge-based information can be employed to create a multi-factor authentication scenario.

Text-independent systems are most often used for speaker identification, as they require very little, if any, cooperation by the speaker. In this case the text during enrolment and test is different. In fact, the enrolment may happen without the user's knowledge, as is the case in many forensic applications. As text-independent technologies do not compare what was said at enrolment and verification, verification applications tend also to employ speech recognition to determine what the user is saying at the point of authentication.

CDs examined by both the labs • What types of CDs are normally submitted or distributed to the media in cases of this nature? [The questioned CD is said to contain recordings of Shanti Bhushan purportedly telling Mulayam Singh that his lawyer son Prashant could manage a judge for Rs. 4 crore.] Such CDs are write-once CD-Rs, which become read-only after recording, and not rewritable CD-RWs.

Let us call such a CD ‘CD-X’. If someone wants to splice different matter into the recordings on such a CD-R, he has first to copy the recordings from the CD-R to a CD-RW, or alternatively to the hard disc of a computer, and only then splice extraneous matter into the original recordings. The spliced recordings can then be burnt (copied) onto a CD-R and distributed. Let such a CD be named ‘CD-Y’.

Let us give credit to both labs and assume that the reports they submitted are the outcome of unbiased, honest scientific examination. Then the conflict between their opinions is due to each of them examining a different CD; say, CFSL examining ‘CD-X’ while Truth Labs examined ‘CD-Y’. It then becomes just like the views expressed by the six blind men after sensing different parts of the body of the elephant!

First, to resolve the dispute between the two labs, and to be on the safer side, true copies of the two CDs (CD-X and CD-Y) should be exchanged between the labs so that each can satisfy itself by examining the other CD. • Alternatively, true copies of both CDs can be examined by a third lab, preferably in Europe (France or Germany).

Then, of course, the voice spectrum from ‘CD-X’ can be analysed along with voice samples of the suspected individuals to attribute the questioned voice in CD-X to a particular individual, in this case Shanti Bhushan. • Otherwise I would consider that vested interests are abusing or misusing ‘bits and pieces’ of scientific information with ulterior motives to hoodwink the public and the judiciary.

Voice analysis was first used in World War II for military intelligence purposes. • Its use in forensic investigation dates back to the 1960s. • The method relies on the premise that each person's voice has a unique quality that can be recorded as a voiceprint, rather like a fingerprint, on an instrument called a ‘sound spectrograph’.

Suspects knowingly or unknowingly leave recordings of their voices on the telephone, voice mail, answering machines, or hidden tape recorders, and these samples can be used as evidence. • Forensic voice analysis has been used in a wide range of criminal cases, from murder, rape, drug dealing and bomb threats to terrorism.

Each person's voice is different because the anatomy of the vocal cords, vocal cavity, and oral and nasal cavities is specific to the individual. • In addition, each person coordinates the muscles of the lips, tongue, soft palate, and jaw differently to produce words.

The teeth also have an impact on the way speech is formed. • The body's voice-producing apparatus is like an organ pipe producing notes, a tube in which sound waves vibrate, producing sounds which can readily be recorded.

The sound spectrograph records a voiceprint in terms of the frequencies and intensities of the sounds made by an individual while speaking. A good mimic may sound like the person they are imitating, but the voiceprint will be quite different. Of course, a person's voice changes with age, but the voiceprint remains distinctive.
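What a spectrograph records, intensity per frequency over time, can be illustrated with a short-time Fourier transform. The sketch below is an assumption-laden toy and not the instrument's actual method: the frame size, hop, Hann window, 8000 Hz sample rate and the synthetic 1000 Hz tone standing in for a voice are all chosen only for illustration.

```python
import cmath
import math

def spectrogram(samples, frame_size=64, hop=32):
    """Return one list of magnitudes (per frequency bin) for each time frame."""
    # Hann window reduces leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            # Plain discrete Fourier transform for bin k.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames

# Synthetic stand-in for a voice: a pure 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
tone_hz = 1000
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(256)]

spec = spectrogram(signal)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64   # bin index -> frequency in Hz
print(len(spec), "frames; strongest frequency in first frame:", peak_hz, "Hz")
```

Each row of `spec` corresponds to one vertical slice of a printed spectrogram; a real voice would show several bands of energy that shift from frame to frame rather than a single steady peak.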

Voiceprint samples may be obtained through covert police operations, such as by investigators wearing hidden microphones or putting surveillance equipment on a suspect's phone. • As with fingerprints and shoeprints, samples for comparison can be taken from a suspect, by court order if necessary. • The investigator will ask the suspect to speak the same words as those recorded on the voice evidence that has been collected. • That evidence may be an anonymous call from a murderer or a terrorist. There is always the possibility that the suspect will try to disguise his or her voice, but the voiceprint expert will probably be able to allow for this.

The investigator has two complementary ways of making an identification through voice analysis. • 1) The expert will listen to the evidence sample and the sample taken from the suspect, comparing accent, speech habits, breath patterns, and inflections. • 2) Then a comparison of the corresponding voiceprints is made. • There is no international standard for the minimum number of points of identity needed in this comparison, but ten to twenty corresponding speech sounds are often taken as good proof of identification.

It has been argued that voiceprints may not be as individual as fingerprints. Certainly the technology for analysis has been well developed during the last few years. • However, in one analysis of 2,000 cases by the Federal Bureau of Investigation (FBI), the error rate in both false identification and false elimination of suspects was found to be very low.

Ambient noise levels can impede collection of both the initial and subsequent voice samples. • Noise reduction algorithms can be employed to improve accuracy, but incorrect application can have the opposite effect. • Performance degradation can result from changes in behavioural attributes of the voice and from enrolment using one telephone and verification on another telephone ("cross channel"). Integration with ‘two-factor authentication’ products is expected to increase.

Voice changes due to ageing may affect system performance over time. Some systems adapt the speaker models after each successful verification to capture such long-term changes in the voice, though there is debate regarding the overall security impact of automated adaptation. • Capture of the biometric is seen as non-invasive. The technology traditionally uses existing microphones and voice transmission technology, allowing recognition over long distances via ordinary telephones (wired or wireless).

How are speeches recorded? • Speeches are recorded on cell phones or sound recorders. • Many newer cell phone models have recording capabilities. • They can also be connected to a PC via cable, Bluetooth or infrared, and some even via WLAN. • Data transfer is done by software that came with the phone (or with the cable) or that can be freely downloaded from the manufacturer.

WM Sound Recorder is an easy-to-use application that can automatically record sound and phone calls on a Windows Mobile Pocket PC. • It can also play back the recorded file so that you can check whether the recording is satisfactory. • In addition, recorded files can be distributed easily with it.

The technologies • There are various technologies used to process and store voice prints. They include i) frequency estimation, ii) hidden Markov models, iii) Gaussian mixture models, iv) pattern matching algorithms, v) neural networks, vi) matrix representation, vii) Vector Quantization and viii) decision trees. Some systems also use "anti-speaker" techniques, such as cohort models, and world models.

Frequency Estimation • The spoken signal has an unknown noise component, such as background noise and audio equipment noise. Frequency estimation methods estimate the noise component by using techniques such as solving for eigenvectors, a type of mathematics important in physics and engineering; subtracting the noise from the input to get an approximation to the signal of interest; and decomposing that signal as a sum of complex frequency components.

The most important fact about this method is that the noise-free voice of a given speaker is reduced to a more manageable representation: the voice's intensity on a few frequency components (that happen to be the most intense ones). • This method works well when background noise is a problem and when the words spoken when the system was trained may not be exactly the same words spoken when trying to authenticate the speaker.
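The idea of reducing a voice to its intensity on a few dominant frequency components can be illustrated as below. Note the assumption: this sketch uses a plain discrete Fourier transform and simply keeps the strongest components; it does not implement the eigenvector-based noise estimation mentioned above. The two-tone test signal, the noise level and the function name `dominant_components` are hypothetical.

```python
import cmath
import math
import random

def dominant_components(samples, sample_rate, k=2):
    """Return the k most intense (frequency_hz, amplitude) pairs of the signal."""
    n = len(samples)
    spectrum = []
    for b in range(1, n // 2):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * b * t / n)
                  for t in range(n))
        amplitude = abs(acc) * 2 / n          # scale back to sine amplitude
        spectrum.append((amplitude, b * sample_rate / n))
    spectrum.sort(reverse=True)               # most intense first
    return [(freq, amp) for amp, freq in spectrum[:k]]

# Hypothetical "voice": two tones buried in mild background noise.
rng = random.Random(0)
sample_rate = 8000
noisy_signal = [
    1.0 * math.sin(2 * math.pi * 500 * t / sample_rate)
    + 0.6 * math.sin(2 * math.pi * 1500 * t / sample_rate)
    + rng.gauss(0.0, 0.1)
    for t in range(256)
]

components = dominant_components(noisy_signal, sample_rate, k=2)
for freq, amp in components:
    print(f"{freq:7.1f} Hz  amplitude {amp:.2f}")
```

The compact list of (frequency, amplitude) pairs is the kind of manageable representation the text refers to; the noise spreads thinly over all frequencies and so barely disturbs the strongest components.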

Hidden Markov Models • A hidden Markov model is always in one of a set of states, but the current state is not visible to the observer. Such a model is constantly making transitions from the current state to the next at rates, and with probabilities, determined by the model's parameters. When making a transition, the model may emit an output with a known probability. The same output can be generated by a transition from multiple states, with different probabilities.

In the particular case of speaker recognition, a hidden Markov model emits outputs representing phonemes with probabilities that depend on the prior sequence of visited states. A speaker uttering a sequence of phonemes (i.e., talking) corresponds to the model visiting a sequence of states and emitting outputs corresponding to the same phonemes. This method works well to authenticate the speaker by having him utter a sequence of words forming complete sentences.
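How such a model scores a sequence of phonemes can be sketched with the standard forward algorithm. The sketch assumes the common textbook formulation in which each state emits a symbol, which differs slightly from the emit-on-transition wording above; the two states, the two phoneme symbols "a" and "b", and every probability are invented for illustration.

```python
def forward_likelihood(observations, start_prob, trans_prob, emit_prob):
    """Probability that the model produces the observed symbol sequence."""
    states = list(start_prob)
    # alpha[s] = probability of the symbols seen so far, ending in state s.
    alpha = {s: start_prob[s] * emit_prob[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: sum(alpha[p] * trans_prob[p][s] for p in states)
               * emit_prob[s][symbol]
            for s in states
        }
    return sum(alpha.values())

# Hypothetical two-state speaker model over phoneme symbols "a" and "b".
start_prob = {"S1": 0.6, "S2": 0.4}
trans_prob = {"S1": {"S1": 0.7, "S2": 0.3},
              "S2": {"S1": 0.4, "S2": 0.6}}
emit_prob = {"S1": {"a": 0.5, "b": 0.5},
             "S2": {"a": 0.1, "b": 0.9}}

likelihood = forward_likelihood(["a", "b"], start_prob, trans_prob, emit_prob)
print(likelihood)
```

In a speaker recognition setting one such model would be trained per enrolled speaker, and the questioned utterance would be attributed to the model giving the highest likelihood, or accepted only if the likelihood clears a threshold.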

Pattern Recognition • This technique, among the most complex being used for speaker recognition, compares two voice streams: the one spoken by the authenticated speaker while training the system, and the one spoken by the unknown speaker who is attempting to gain access. The speaker utters the same words when training the system and, later, when trying to prove his identity. The computer aligns the training sound stream with the one just obtained (to account for small variations in rhythm and for delays in beginning to speak).

Then, the computer discretizes each of the two streams as a sequence of frames and computes the probability that each pair of frames was spoken by the same speaker by running them through a multilayer perceptron--a particular type of neural network trained for this task. This method works well in low-noise conditions, and when the speaker is uttering exactly the same words used to train the system.
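The alignment step, which absorbs small differences in rhythm and delays in starting to speak, can be illustrated with dynamic time warping (DTW). DTW is a technique swapped in here for illustration, since the text does not name the alignment method, and the sketch covers only the alignment and not the neural-network scoring of frame pairs. The one-number-per-frame sequences are hypothetical stand-ins for real frame features.

```python
def dtw_distance(seq_a, seq_b):
    """Minimum cumulative frame distance over all monotonic alignments."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # seq_a frame repeated
                                 cost[i][j - 1],      # seq_b frame repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# Hypothetical per-frame features (one number per frame for simplicity).
enrolled = [1.0, 2.0, 3.0, 2.0, 1.0]
slower = [1.0, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0]   # same shape, spoken more slowly
other = [3.0, 1.0, 3.0, 1.0, 3.0]              # a different pattern

same_speaker_cost = dtw_distance(enrolled, slower)
different_cost = dtw_distance(enrolled, other)
print(same_speaker_cost, different_cost)
```

The slower rendition aligns with zero cost because DTW may repeat frames, whereas the different pattern cannot be warped into agreement and keeps a positive cost.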

Multi-Speech • Multi-Speech, Model 3700, software is a low-cost, Windows-based, speech analysis program, which uses standard multimedia hardware (e.g., Sound Blaster™ boards) to capture, analyze, and play speech samples. Multi-Speech software brings KayPENTAX quality to the Windows-based software product category.

Multi-Speech software is a comprehensive speech recording, analysis, feedback, and measurement software program. It includes the same analysis features as CSL software, the most widely used speech analysis system. Multi-Speech software also has numerous application-specific program options. It is only limited by the specifications, features, and S/N limitations typical of audio devices in the host computer.

GoldWave • GoldWave is a digital-audio editor for Windows. The program has real-time oscilloscopes, intelligent editing, and numerous effects (including echo, flange, distortion, mechanize, and reverse). An intuitive user interface makes GoldWave easy to learn and use. Version 4.0 adds an improved interface, time stretching, CD-audio extraction, and vertical zooming. GoldWave 5.06, the latest version, is free software that can be downloaded from the internet.

Voice identification played a key role in the investigation of the crimes of Peter Sutcliffe, the so-called Yorkshire Ripper, who murdered several women in the North of England in the late 1970s. Tapes purporting to be from the Ripper were sent to the police team involved in the case, taunting them for their lack of success in catching him. Voice analysis was at first inconclusive, but it now looks as if the tapes were probably the work of a hoaxer.

OSAMA BIN LADEN'S VOICE • Voice analysis has also been applied to the investigation of tapes said to be made by Osama bin Laden, the world's most-wanted terrorist. Since the terror attacks in New York and Washington on September 11, 2001, bin Laden has apparently issued a number of video and audiotapes.

Corresponding words on the tapes, like "America," can be compared, but the voiceprints do not match exactly because the same person will never say a word in exactly the same way each time. • If there is enough similarity, however, identification can be made even if it is tentative, especially if there is other evidence.

Of course, bin Laden speaks in Arabic, but there is software to handle this and other languages. • It may be significant that the most recent utterances by bin Laden have been by audio rather than video tape, raising the possibility that he has been dead for some time and the tape has been made by someone else hoping to raise the morale of al Qaeda. The tape is of poor quality and difficult for analysts to work with.

It is unlikely, however, that a mimic could fool a voice analysis expert, even under these conditions. Yet there is the possibility that the tape has been created from previous ones that feature bin Laden's real voice, with new information pasted in to update it. The final possibility is that the tape has been made by one of his sons; parents and children tend to sound similar and may give similar voiceprints.

The identification of bin Laden looks as if it will be an ongoing challenge to the forensic voice analysts.

Voice analysis for emotion detection and risk assessment • Voice analysis is also used for emotion detection and risk assessment. It is used in non-invasive investigation and security tools and in fraud prevention solutions, through the patented technology known as “Layered Voice Analysis” (LVA™).

‘Nemesysco’, an Israel-based company dedicated to the advancement of voice analysis technologies, addresses different needs of the security, corporate and financial markets, enabling organizations to enhance crime detection and prevention; expedite investigations; identify and fight fraud more effectively; improve veracity assessment during recruitment processes; and provide better services to the public in need.
