speaker theory

This project encompasses the implementation of a speaker recognition program in Matlab. Speaker recognition systems can be characterised as text-independent or text-independent. The system we have developed is the latter, text-independent, meaning the system can identify the speaker regardless of what is being said.

The program will contain two functionalities: A training mode, a recognition mode. The training mode will allow the user to record voice and make a feature model of that voice. The recognition mode will use the information that the user has provided in the training mode and attempt to isolate and identify the speaker.

The System has been developed using more advanced speaker recognition principles and concepts. Concepts such as Mel-Frequency Cepstral Analysis and Vector Quantization have been integral to the development of this system.

2.1DESCRIPTION

Speaker recognition systems are classified based on the Functions and Methods.

2.1.1Classification based on Functions:

Based on Functions Speaker Recognition Systems are classified into two categories.They are

v Speaker Identification

v Speaker Verification

2.2.2Classification based on Methods:

Based on methods Speaker Recognition System is classified into two categories. They are

v Text-independent Recognition

v Text-dependent Recognition

2.2 Speaker Identification

Speaker identification is the process of determining which registered speaker provides a given utterance.

2.4Text-Independent Recognition

Recognition system does not know text spoken by person.

Examples:

· User selected phrase.

· Conversational speech – Used for applications with less control over user input.

More flexible system but also more difficult problem.

Speech recognition can provide knowledge of spoken text.

Techniques: In text-independent recognition two types of techniques are employed.

They are

· Vector quantization

· Advanced Gaussian mixture model

2.4.1 Vector quantization:

In this project, the VQ approach will be used, due to ease of implementation and high accuracy. VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center called a codeword. The collection of all codeword’s is called a codebook. The Automatic Speaker Recognition System will compare the codebooks of the tested speaker with the codebooks of the trained speaker. The best matching result will be the desired speaker.

2.4.2 Gaussian mixture models:

Gaussian mixture models (GMM) are similar to code books in the regard that clusters in feature space are estimated as well. In addition to the mean vectors, the covariances of the clusters are computed, resulting in a more detailed speaker model if there is a sufficient amount of training speech.

2.5 Text-dependent Recognition

Examples:

· Fixed phrase.

· Prompted phrase – Used for applications with strong control over user input.

Knowledge of spoken text can improve system performance.

Techniques: In text dependent recognition two types of techniques are employed.

They are

· Dynamic time wraping model

· Hidden markov model

2.5.1 Dynamic time warping (DTW):

Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach. Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and if in another they were walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics -- indeed, any data which can be turned into a linear representation can be analyzed with DTW.

A well known application has been automatic speech recognition, to cope with different speaking speeds. In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g. time series) with certain restrictions, i.e. the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.

2.5.2 Hidden Markov model (HMM):

Modern general-purpose speech recognition systems are generally based on HMMs. These are statistical models which output a sequence of symbols or quantities.

Reason to Use HMM:

v One possible reason why HMMs are used in speech recognition is that a speech signal could be viewed as a piecewise stationary signal or a short-time stationary signal.

v Another reason why HMMs are popular is because they can be trained automatically and are simple and computationally feasible to use.

In speech recognition, the hidden Markov model would output a sequence of n-dimensional real-valued vectors (with n being a small integer, such as 10), outputting one of these every 10 milliseconds. The vectors would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech and decorrelating the spectrum using a cosine transform, then taking the first (most significant) coefficients.

Each word, or (for more general speech recognition systems), each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes.

A typical large-vocabulary system would need context dependency for the phonemes (so phonemes with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation.

Many systems use so-called discriminative training techniques which dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum mutual information (MMI), minimum classification error (MCE) and minimum phone error (MPE).

3.1 Introduction:

At the highest level, all speaker recognition systems contain two main modules.

v Feature Extraction

v Feature Matching

Feature Extraction:

Feature extraction is the process that extracts a small amount of data from

the voice signal that can later be used to represent each speaker.The purpose of this module is to convert the speech waveform, using digital signal processing (DSP) tools, to a set of features (at a considerably lower information rate) for further analysis. This is often referred as the signal-processing front end.

A wide range of possibilities exist for parametrically representing the spe-ech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequ-ency Cepstrum Coefficients (MFCC).LPC analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the remain-ing signal is called the residue .

Feature Matching:

Feature matching involves the actual procedure to identify the unknown speaker by comparing extracted features from his/her voice input with the ones from a set of known speakers. We will discuss each module in detail in later sections.

The problem of speaker recognition belongs to a much broader topic in scien-tific and engineering so called pattern recognition. The goal of pattern recognition is to classify objects of interest into one of a number of categories or classes. The objects of interest are gene-rically called patterns and in our case are sequences of acoustic vectors that are extracted from an input speech using the techniques described in the previous section. The classes here refer to ind-ividual speakers. Since the classification procedure in our case is applied on extracted features, it can be also referred to as feature matching.

4.1 Introduction:

The speech signal is a slowly timed varying signal (it is called quasist-ationary). An example of speech signal is shown in Figure 4.1(a). When examined over a suf-ficiently short period of time (between 5 and 100 msec), its characteristics are fairly station-ary. However, over long periods of time (on the order of 1/5 seconds or more) the signal characteri-stic change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal.

4.1(a) Example of speech signal

A wide range of possibilities exist for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. MFCC is perhaps the best known and most popular, and will be described in this paper.

MFCC’s are based on the known variation of the human ear’s critical bandwidths with frequ-ency, filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which is a linear frequency spacing below 1000 Hz and a logarithmic spac-ing above 1000 Hz. The process of computing MFCCs is described in more detail next.

4.2 Mel-frequency cepsrum coefficients processor

A block diagram of the structure of an MFCC processor is given in Figure 4.2(a). The spee-ch input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency was chosen to minimize the effects of aliasing in the analog-to-digital conversion. These sampled si-gnals can capture all frequencies up to 5 kHz, which cover most energy of sounds that are gener-ated by humans. As been discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ears. In addition, rather than the speech waveforms themselv-es, MFFC’s are shown to be less susceptible to mentioned variations.

4.2.2 Windowing:

The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The concept here is to minimize the spectral distortion by using the window to taper the signal to zero at the begin-ning and end of each frame. If we define the window as w(n),0_< style="mso-bidi-font-style:normal">N is the num-ber of samples in each frame, then the result of windowing is the signal

5.1 Overview

The goal of pattern recognition is to classify objects of interest into one of a number of categories or classes. The objects of interest are generically called patterns and in our case are sequences of acoustic vectors that are extracted from an input speech using the techniques described in the previous section. The classes here refer to individual speakers.

Furthermore, if there exists some set of patterns that the individual classes of which are already known, then one has a problem in supervised pattern recognition. These patterns comprise the training set and are used to derive a classification algorithm. The remaining patterns are then used to test the classification algorithm; these patterns are collecti-vely referred to as the test set. If the correct classes of the individual patterns in the test set are also known, then one can evaluate the performance of the algorithm.

The state-of-the-art in feature matching techniques used in speaker recognition includes Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector Quantization (VQ). In this project, the VQ approach will be used, due to ease of imple-mentation and high accuracy. VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center called a codeword. The collection of all codeword’s is called a codebook.

5.1.1Vector Quantization:

Figure 5 shows a conceptual diagram to illustrate this recognition process. In the figure, only two speakers and two dimensions of the acoustic space are shown. The circles refer to the acoustic vectors from the speaker 1 while the triangles are from the speaker 2. In the training phase, using the clustering algorithm described in Section 4.2, a speaker-specific VQ codebook is generated for each known speaker by clustering his/her training acoustic vectors. The result codeword’s (centroids) are shown in Figure 5 by black circles and black triangles for speaker 1 and 2, respectively. The distance from a vector to the closest codeword of a codebook is called a VQ-distortion. In the recognition phase, an input utterance of an unknown voice is “vector-quantized” using each trained codebook and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with smallest total distortion is identified as the speaker of the input utterance.

5.2 Clustering the Training Vectors

After the enrolment session, the acoustic vectors extracted from input speech of each speaker provide a set of training vectors for that speaker. As described above, the next important step is to build a speaker-specific VQ codebook for each speaker using those training vectors. There is a well-know algorithm, namely LBG algorithm [Linde, Buzo and Gray, 1980], for clustering a set of L training vectors into a set of M codebook vectors. The algorithm is formally implemented by the following recursive procedure:

LBG algorithm:

1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors (hence, no iteration is required here).

2. Double the size of the codebook by splitting each current codebook y_n according to the rule

Where n varies from 1 to the current size of the codebook, and is a splitting parameter (we choose =0.01).

3. Nearest-Neighbor Search: for each training vector, find the codeword in the current codebook that is closest (in terms of similarity measurement), and assign that vector to the corresponding cell (associated with the closest codeword).

4. Centroid Update: update the codeword in each cell using the centroid of the training vectors assigned to that cell.

5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold

6. Iteration 2: repeat steps 2, 3 and 4 until a codebook size of M is designed.

Intuitively, the LBG algorithm designs an M-vector codebook in stages. It starts first by designing a 1-vector codebook, then uses a splitting technique on the codeword’s to initialize the search for a 2-vector codebook, and continues the splitting process until the desired M-vector codebook is obtained.

Figure 6 shows, in a flow diagram, the detailed steps of the LBG algorithm. “Cluster vectors” is the nearest-neighbor search procedure which assigns each training vector to a cluster associated with the closest codeword. “Find centroids” is the centroid update procedure. “Compute D (distortion)” sums the distances of all training vectors in the nearest-neighbor search so as to determine whether the procedure has converged.

6.1.1 Process description of System

To implement the System we have used three auxiliary functions. The functions are:

* melfilterbank()

* elucid_dist()

* test()

6.2 Phases:

All speaker recognition systems have to serve two distinguished phases.

· The first one is referred to the enrolment or training phase

· The second one is referred to as the operational or testing phase.

In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In case of speaker verification systems, in addition, a speaker-specific threshold is also computed from the training samples. In the testing phase, the input speech is matched with stored reference model(s) and a recognition decision is made.

Speaker recognition is a difficult task. Automatic speaker recognition works based on the premise that a person’s speech exhibits characteristics that are unique to the speaker. However this task has been challenged by the highly variant of input speech signals. The principle source of variance is the speaker himself/herself. Speech signals in training and testing sessions can be greatly different due to many facts such as people voice change with time, health conditions (e.g. the speaker has a cold), speaking rates, and so on. There are also other factors, beyond speaker variability, that present a challenge to speaker recognition technology. Examples of these are acoustical noise and variations in recording environments (e.g. speaker uses different telephone handsets).