EEN 540 - Computer Project No. 3: Isolated Word Recognition

By Arsén Ivanóv, under direction of Dr. Michael Scordilis, University of Miami, Spring 2007.

Abstract

This project implements an automatic speech recognition system using the Dynamic Time Warping (DTW) algorithm. The recognition corpus consists of the spoken digits zero through nine, recorded by one speaker (the author) in one sitting. The implementation consists of these steps:

  1. creating a reference set for each word in the vocabulary (i.e., training the system)
  2. applying DTW to compare the test (unknown) utterance to each of the reference utterances, obtaining a measure of similarity (also known as the Total Accumulated Distance).

Table of Contents

  1. The DTW Algorithm
  2. Results
  3. Discussion and Conclusions
  4. Possible Improvements
  5. MATLAB code and Audio files
  6. References

The DTW Algorithm

The DTW-based recognition procedure can be summarized as follows:

  1. several reference waveforms of the same word are summed together in the time domain and normalized to 1, to produce a single reference waveform;
  2. the resulting reference waveform is analyzed using a Short-Time Fourier Transform, i.e. it is split into windowed segments of a chosen length and overlap. Each windowed segment is transformed into the frequency domain using a Fast Fourier Transform, producing a frame. The frequency coefficients within each critical band are then summed and averaged, and the band energies are normalized to the maximum critical-band energy of the given frame. All frames are arranged in time order, producing a two-dimensional time-spectrum representation of the sound called the "reference template", similar to a spectrogram;
  3. a test utterance is analyzed in the same manner, producing a time-spectrum representation called the "test template";
  4. a Local Distance Matrix is computed, one per reference template, by comparing the test template with each reference template;
  5. an Accumulated Distance Matrix is computed, one per reference template;
  6. the Total Accumulated Distance is computed for each reference template. It is essentially a measure of how dissimilar the test template is to the given reference template. A small Total Accumulated Distance means that the reference and test templates are very similar, so the probability that the test utterance is the same word as the reference utterance is high. A large Total Accumulated Distance means that the two templates are very dissimilar, so that probability is low;
  7. among all Total Accumulated Distances, the smallest one marks the most probable match. At this point, we check whether the match is correct and how far it is from the second-smallest Total Accumulated Distance (i.e., with how much confidence the match was made).
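
Steps 4 through 7 can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's MATLAB code: the Euclidean local distance, the unweighted three-way step pattern (horizontal, vertical, diagonal), the absence of path constraints, and all function names are assumptions, since the handout's exact choices are not given here. A template is represented as a list of frames, each frame being a list of band energies.

```python
import math


def local_distance_matrix(test, ref):
    """Step 4: distance between every test frame and every reference frame.

    Euclidean distance is an assumption; the original metric is not stated.
    """
    return [[math.dist(t, r) for r in ref] for t in test]


def accumulated_distance_matrix(local):
    """Step 5: cheapest cumulative cost of reaching each cell from (0, 0).

    Assumed step pattern: one step right, one step down, or one diagonal step,
    all with equal weight.
    """
    n, m = len(local), len(local[0])
    acc = [[math.inf] * m for _ in range(n)]
    acc[0][0] = local[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1][j] if i > 0 else math.inf,
                acc[i][j - 1] if j > 0 else math.inf,
                acc[i - 1][j - 1] if i > 0 and j > 0 else math.inf,
            )
            acc[i][j] = local[i][j] + best_prev
    return acc


def total_accumulated_distance(test, ref):
    """Step 6: the accumulated distance at the end of both templates."""
    acc = accumulated_distance_matrix(local_distance_matrix(test, ref))
    return acc[-1][-1]


def recognize(test, references):
    """Step 7: pick the closest template and report the gap to the runner-up."""
    distances = {
        word: total_accumulated_distance(test, template)
        for word, template in references.items()
    }
    ranked = sorted(distances, key=distances.get)
    best = ranked[0]
    confidence = distances[ranked[1]] - distances[best]
    return best, confidence, distances


# Toy templates with two bands per frame (made-up values, not project data).
references = {
    "a": [[0.0, 1.0], [1.0, 0.0]],
    "b": [[1.0, 1.0], [1.0, 1.0]],
}
# A time-stretched version of "a": its first frame is repeated.
test_template = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

best, confidence, distances = recognize(test_template, references)
print(best, confidence, distances)
```

In this toy case the warping absorbs the repeated frame, so the stretched utterance still matches template "a" at zero cost, which is the property that makes DTW suitable for words spoken at different speeds.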

Results: Trial 1

Window: 20 msec (160 samples), Hamming window, overlap=0.5
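
As a worked example of these window settings, the sketch below converts the window length from milliseconds to samples and cuts a signal into overlapping Hamming-windowed frames. The 8000 Hz sampling rate is an assumption inferred from "20 msec (160 samples)"; the function names and the placeholder signal are hypothetical and are not taken from the project's MATLAB code.

```python
import math


def frame_parameters(window_ms, sample_rate_hz, overlap):
    """Return (frame length, hop size) in samples."""
    frame_len = round(window_ms * sample_rate_hz / 1000)
    hop = round(frame_len * (1 - overlap))
    return frame_len, hop


def hamming(n):
    """Symmetric Hamming window of n points (n must be at least 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def windowed_frames(signal, frame_len, hop):
    """Split the signal into overlapping frames and apply the window to each."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        segment = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(segment, window)])
    return frames


# Assumed sampling rate: 8000 Hz (so that 20 msec equals 160 samples).
frame_len, hop = frame_parameters(window_ms=20, sample_rate_hz=8000, overlap=0.5)

# One second of constant placeholder samples stands in for a recording.
signal = [1.0] * 8000
frames = windowed_frames(signal, frame_len, hop)
print(frame_len, hop, len(frames))
```

Each frame produced this way would then go through the FFT and critical-band averaging described in the algorithm section.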

Normalized-Bark Energy Results
                                                      Reference Templates
Test Utterance      Zero       One       Two     Three      Four      Five       Six     Seven     Eight      Nine  Correct Match?  Distance to Closest Neighbor (Confidence)
Zero             56.4256   80.1390  108.2439   59.8043   85.1671   71.7628  110.2530  109.9047   65.8139   68.2818  Yes             3.3787
One              82.6255   72.1513   75.9218  107.7745  126.1208  134.8430  147.8016  145.0947   92.0218   78.6653  Yes             3.7705
Two              60.0401   62.5386   47.3730  100.8893   69.9329   77.0554   89.5916   82.9683   54.9619  130.2274  Yes             7.5889
Three            78.6856  103.8081   80.6923   56.8189   73.3303   65.4852  113.0276  108.5861   72.6364  102.1738  Yes             8.6663
Four             83.8541  139.2983   71.0875   84.6918  111.8913  161.8228  142.3920  135.3159   64.2175  130.8920  No              N/A
Five             97.7762  117.6094   84.8519   93.1297  106.6733  113.0875  128.7775  117.0927   93.4818  114.2581  No              N/A
Six             172.8461  168.6293  121.1009  157.1766  190.6625  223.0134  155.8142  160.9377  159.5971  173.9432  No              N/A
Seven            95.4183  142.3294  115.0825   95.2566  148.3594  162.7888  119.0059  123.5402   77.2117  128.1571  No              N/A
Eight            56.6915   81.2364   66.9252   57.3227   58.9160   50.2318   78.4490   80.0069   41.2319   60.7897  Yes             8.9999
Nine             55.3910   81.1640   63.6884   49.5573   80.4213   60.7370   97.8662   95.5027   58.6548   42.9859  Yes             6.5714
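
The last two columns of the table follow directly from the ten distances in each row: the match is the reference template with the smallest distance, and the confidence is the gap to the second-smallest distance. The sketch below illustrates this with the "Zero" and "Four" rows of Trial 1. The function name and the use of None for "N/A" are assumptions made for illustration.

```python
WORDS = ["Zero", "One", "Two", "Three", "Four",
         "Five", "Six", "Seven", "Eight", "Nine"]


def score_row(test_word, distances):
    """Return (matched word, correct?, confidence) for one table row.

    Confidence is the second-smallest distance minus the smallest one and is
    reported only for correct matches (None stands in for "N/A").
    """
    ranked = sorted(zip(distances, WORDS))
    best_dist, best_word = ranked[0]
    second_dist, _ = ranked[1]
    correct = best_word == test_word
    confidence = second_dist - best_dist if correct else None
    return best_word, correct, confidence


# Rows copied from the Trial 1 table.
zero_row = [56.4256, 80.1390, 108.2439, 59.8043, 85.1671,
            71.7628, 110.2530, 109.9047, 65.8139, 68.2818]
four_row = [83.8541, 139.2983, 71.0875, 84.6918, 111.8913,
            161.8228, 142.3920, 135.3159, 64.2175, 130.8920]

zero_result = score_row("Zero", zero_row)
four_result = score_row("Four", four_row)
print(zero_result)  # matched "Zero"; confidence is about 3.3787
print(four_result)  # matched "Eight", so the match is incorrect
```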

Results: Trial 2

Window: 30 msec (240 samples), Hamming window, overlap=0.5

Normalized-Bark Energy Results
                                                      Reference Templates
Test Utterance      Zero       One       Two     Three      Four      Five       Six     Seven     Eight      Nine  Correct Match?  Distance to Closest Neighbor (Confidence)
Zero             36.2064   45.8094   76.6086   49.2861   56.1699   75.1615   85.1737   75.2841   43.1693   53.5918  Yes             6.9629
One              61.6812   45.9605   52.4323   58.0059   79.4239   84.2070   97.2300   94.2898   69.0730   53.4150  Yes             6.4718
Two              40.1147   37.3977   24.4036   70.1130   41.4166   52.4931   57.9425   70.5491   34.9085   40.0762  Yes             10.5049
Three            41.4355   68.2561   46.5751   32.2351   55.6051   42.1311   80.6432   69.0830   46.4739   51.7426  Yes             9.8960
Four             60.7684   91.2816   50.7711   49.6837   68.4320  101.5672   94.6502   91.1534   48.4744   75.4980  No              N/A
Five             66.8791   79.4318   56.7268   61.1178   67.9902   84.4352   85.5993   88.9362   65.9698   76.7196  No              N/A
Six             112.2202  113.2865   79.4277  102.8140  127.6908  144.7135   96.6866  102.4596  103.8494  116.5784  No              N/A
Seven            71.5899   93.7538   72.3037   66.0189   79.7796  109.0510   66.5481   85.4794   56.6138   82.4248  No              N/A
Eight            38.4369   54.6732   46.0791   36.9963   39.3292   37.5188   53.9062   80.7746   24.9407   44.4719  Yes             12.0556
Nine             43.4804   56.6221   70.6874   32.9435   49.8862   49.1736   72.3237   66.5296   30.5334   46.3745  No              N/A

Results: Trial 3

Window: 40 msec (320 samples), Hamming window, overlap=0.5

Normalized-Bark Energy Results
                                                      Reference Templates
Test Utterance      Zero       One       Two     Three      Four      Five       Six     Seven     Eight      Nine  Correct Match?  Distance to Closest Neighbor (Confidence)
Zero             28.2598   41.5705   61.3345   37.9966   44.1510   44.2357   67.6840   64.3755   33.9891   31.6468  Yes             3.387
One              41.8230   29.0019   41.1519   41.2958   49.5727   64.9650   72.1253   67.2403   43.9189   39.2081  Yes             10.2062
Two              28.6177   27.4816   20.4998   49.8186   28.2336   26.8519   40.3461   46.4477   27.6304   27.9546  Yes             6.3521
Three            30.2945   50.3242   35.6699   23.9683   25.7686   30.9346   61.0620   60.4368   37.7762   40.4791  Yes             1.8003
Four             48.7906   66.3799   60.3192   48.3050   43.3070   73.1194   65.1496   64.5462   61.4572   62.8617  Yes             4.998
Five             49.0082   57.2663   43.9660   44.1310   48.4064   61.6288   65.0797   68.5955   55.4195   55.6039  No              N/A
Six              85.0703   84.1097   55.7597   77.6745   79.1677  109.8530   69.2289   81.9101   84.3601   87.2103  No              N/A
Seven            40.4793   63.8107   55.8562   51.8183   67.7065   80.9153   46.4665   55.9509   44.3850   61.8762  No              N/A
Eight            23.7352   35.0319   31.9963   23.4581   36.2972   27.0228   56.9895   47.5354   18.1631   27.4224  Yes             5.295
Nine             30.5387   41.5433   54.3758   25.7596   38.6188   36.0943   63.5565   64.3479   35.5187   35.5377  No              N/A

Discussion and Conclusions

The overall rate of correct matches is roughly 60% (6, 5, and 6 correct out of 10 in Trials 1, 2, and 3, respectively). Success or failure in matching correlates with the sounds that make up the word being matched. Words composed only of plosives and vowels are generally matched correctly, while words containing fricatives, such as "six", "seven", "four" and "five", are generally not. Perhaps the white-noise-like spectrum of fricatives makes these words difficult to match. For instance, the word "six" produced very high accumulated distances against every template, which means that the algorithm had a very difficult time matching it to any of them.

In general, the words "four" and "five" are very similar, and so are "six" and "seven." Both pairs of words start with the same sound. It would make sense if "four" were confused with "five", but not with "two", as it is in the results. It is also very interesting that the incorrectly identified words are mostly matched to either "two" or "eight." Perhaps this has to do with phonemes: both "five" and "six" contain the "i" sound ("five" at the end of "ai", and "six" as a plain "i"), but again it is counter-intuitive to match these to "two".

The final remark is about taking the computer's perspective: a person who had heard the words "six" and "seven", or "four" and "five", merely four times in their life would have difficulty distinguishing them. Even for a human, this task would be difficult with just four samples to train on. Clearly, a much larger training set, or a more sophisticated method of training the reference templates, is needed to make automatic speech recognition more reliable.

Possible Improvements

One potentially key improvement, in my (inexperienced) opinion, is phoneme-oriented automatic speech recognition. It does not make much sense to recognize words per se. Humans do not recognize words as an indivisible whole, but rather recognize sequences of distinct phonemes. Obviously, some kind of intelligent approach is needed to do that, with a very large corpus of different words separated into their component phonemes, so that the algorithm has access to the actual phonemes.

The most important improvement would be to use a larger corpus of training data.

MATLAB code and Audio files

The complete MATLAB code is available together with test and reference wave sound files, through a zip file.

References

Dr. Michael Scordilis. Dynamic Time Warping and Search. EEN 540 Class Handout. University of Miami, Spring 2007.