Speaker recognition is broadly divided into two tasks, speaker identification and speaker verification, and it is the method of automatically recognizing who is speaking on the basis of individual information embedded in speech waves. Speaker recognition is widely applicable wherever a speaker's voice is used to verify identity and control access to services such as banking by telephone, database access services, voice dialling, telephone shopping, information services, voice mail, security control for confidential information areas, and remote access to computers. AT&T and TI, together with Sprint, have started field tests and actual applications of speaker recognition technology; Sprint's Voice Phone Card is already used by many customers. Speaker recognition is one of the most promising technologies for creating new services that will make our everyday lives more secure. Another important application of speaker recognition technology is forensics. Speaker recognition has been an appealing research field for the last few decades and still poses a number of unsolved problems.
The main aim of this project is speaker identification, which consists of comparing a speech signal from an unknown speaker against a database of known speakers. The system, after being trained with a number of speakers, can then recognize the speaker. The figure below shows the fundamental structure of speaker identification and verification systems. Speaker identification is the process of determining which registered speaker produced a given utterance; speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. In most applications, voice is used as the key to confirm a speaker's identity.
The above structure of speaker identification and verification systems can also be modified to cover the open-set identification case, in which a reference model for the unknown speaker may not exist. This is usually the case in forensic applications. In these circumstances an additional decision alternative, "the unknown does not match any of the models", is required. A threshold test can also be used in both verification and identification to decide whether the match is close enough to accept the decision or whether more speech data are needed.
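As a rough sketch of how such a threshold test could look in Matlab (the speaker names, distances and threshold value below are purely illustrative, not taken from this project):

% Hypothetical open-set decision: distances holds the average Euclidean
% distortion of the test utterance against each registered speaker's
% codebook, names holds the matching speaker labels.
names     = {'speakerA', 'speakerB', 'speakerC'};
distances = [7.0, 10.1, 6.0];
threshold = 8.0;                         % illustrative acceptance threshold
[dmin, idx] = min(distances);            % closest registered speaker
if dmin < threshold
    fprintf('The test voice is most likely from %s\n', names{idx});
else
    fprintf('The test voice does not match any registered speaker\n');
end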
Speaker recognition can also be divided into two kinds of methods: text-dependent and text-independent. Text-dependent methods require the speaker to say key words or sentences with the same text for both training and recognition trials, whereas text-independent methods do not rely on a specific text being spoken. Formerly, text-dependent methods were the most widely used, but text-independent methods are now common. Both text-dependent and text-independent methods share a problem, however.
These systems can be easily deceived by playing back the recorded voice of a registered speaker. Different techniques are used to cope with such attacks. For example, a small set of words or digits is used as input, and each user is prompted to utter a specified sequence of key words that is randomly selected every time the system is used. Still, this method is not completely reliable: it can be deceived by highly developed electronic recording systems that can replay the secret key words in the requested order. Therefore, T. Matsui and S. Furui have proposed a text-prompted speaker recognition method.
Speech Feature Extraction:
In this project the most important step is to extract features from the speech signal. Speech feature extraction, viewed as a classification problem, is about reducing the dimensionality of the input vector while maintaining the discriminating power of the signal. As we know from the fundamental structure of speaker identification and verification systems above, the number of training and test vectors needed for the classification problem grows exponentially with the dimension of the input vector, so feature extraction is needed.
The extracted features should, however, meet some criteria when dealing with the speech signal:
They should be easy to measure.
They should distinguish between speakers while being tolerant of intra-speaker variability.
They should not be susceptible to mimicry.
They should show little fluctuation from one speaking environment to another.
They should be stable over time.
They should occur frequently and naturally in speech.
In this project we use the Mel Frequency Cepstral Coefficients (MFCC) technique to extract features from the speech signal and compare the unknown speaker with the existing speakers in the database. The figure below shows the complete pipeline of Mel Frequency Cepstral Coefficient computation.
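As a rough illustration of this feature extraction step (a sketch under stated assumptions, not the project's exact code), the snippet below reads one training utterance and computes its MFCC matrix with the VOICEBOX function melcepst mentioned in the conclusion; the file name is an assumption.

% Sketch of MFCC feature extraction using the VOICEBOX toolbox.
% 'train_brian.wav' is an illustrative file name; melcepst is called
% with its default analysis settings.
[speech, fs] = audioread('train_brian.wav');   % samples and sampling rate
trainMFCC    = melcepst(speech, fs);           % one MFCC row vector per frame
% trainMFCC is a (number of frames) x (number of coefficients) matrix
% that is later clustered into a VQ codebook for this speaker.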
Result:
For example, we are going to test a speech wave file recorded by Brian, called 'test_brian.wav'. Assume that at the beginning we do not know the speaker is Brian; therefore we need to feed the .wav file into our speaker recognition system to find out who the speaker is. We run the program twice in order to confirm the result. The Matlab output is as follows:
% First run
>> speakerID('test_brian')
Loading data...
Calculating mel-frequency cepstral coefficients for training set...
Harry
Carli
Brian
In___
Hojin
Performing K-means...
Calculating mel-frequency cepstral coefficients for test set...
Compute a distortion measure for each codebook...
Display the result...
The average of Euclidean distances between database and test wave file
Harry
7.0183
Carli
10.0679
Brian
5.9630
In___
8.4237
Hojin
7.6526
The test voice is most likely from
Brian
% Second run
>> speakerID('test_brian')
Loading data...
Calculating mel-frequency cepstral coefficients for training set...
Harry
Carli
Brian
In___
Hojin
Performing K-means...
Calculating mel-frequency cepstral coefficients for test set...
Compute a distortion measure for each codebook...
Display the result...
The average of Euclidean distances between database and test wave file
Harry
6.9995
Carli
9.9876
Brian
5.8339
In___
8.7075
Hojin
7.6390
The test voice is most likely from
Brian
From the above Matlab outputs, we obtain 5 measurements for each run, which are the calculated Euclidean distances between the test wave file and the codebooks in the database. We can see that, compared with the other codebooks in the database, the distortion distances calculated for Brian have the smallest values in both runs, 5.9630 and 5.8339. Therefore, we can conclude that the speaker is Brian, according to the rule that the most likely speaker's voice has the smallest Euclidean distance to its codebook in the database.
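The sketch below shows one plausible way of computing this distortion measure in Matlab; the function and variable names are illustrative, and the actual speakerID implementation may differ.

% Sketch of the VQ distortion measure: for every MFCC frame of the test
% utterance, find the nearest codeword in one speaker's codebook and
% average those minimum Euclidean distances.
% testMFCC : (frames x coefficients) feature matrix of the test utterance
% codebook : (codewords x coefficients) VQ codebook of a registered speaker
function d = vqDistortion(testMFCC, codebook)
    nFrames = size(testMFCC, 1);
    minDist = zeros(nFrames, 1);
    for i = 1:nFrames
        diffs      = bsxfun(@minus, codebook, testMFCC(i, :));  % frame vs. all codewords
        minDist(i) = min(sqrt(sum(diffs.^2, 2)));               % nearest codeword
    end
    d = mean(minDist);                                          % average distortion
end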
Conclusion:
The goal of this project was to create a speaker recognition system and apply it to the speech of an unknown speaker, by investigating the extracted features of the unknown speech and then comparing them to the stored extracted features of each speaker in the database in order to identify the unknown speaker.
Feature extraction is done using MFCC (Mel Frequency Cepstral Coefficients). The function 'melcepst' is used to calculate the mel cepstrum of a signal. Each speaker is modeled using Vector Quantization (VQ): a VQ codebook is generated by clustering the training feature vectors of each speaker and is then stored in the speaker database. In this method, the K-means algorithm is used for the clustering. In the recognition stage, a distortion measure based on minimizing the Euclidean distance is used when matching an unknown speaker against the speaker database. During this project, we found that the VQ-based clustering approach provides a fast speaker identification process.
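A minimal sketch of this codebook-generation step is given below; it assumes Matlab's built-in kmeans function (Statistics and Machine Learning Toolbox) and an illustrative codebook size, and may differ from the clustering code actually used in speakerID.

% Sketch of VQ codebook generation: cluster one speaker's training MFCC
% vectors into a small codebook with K-means.
% trainMFCC : (frames x coefficients) matrix, e.g. the output of melcepst.
numCodewords = 16;                                     % illustrative codebook size
[~, codebook] = kmeans(trainMFCC, numCodewords, ...
                       'Replicates', 3, 'MaxIter', 200);
% codebook (numCodewords x coefficients) is stored in the speaker database
% and later compared against test utterances with the distortion measure above.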