This project aims to implement an automatic speech recognition system using the Dynamic Time Warping (DTW) algorithm. The recognition corpus is the spoken digits zero through nine, recorded by one speaker (the author) in one sitting. The method of implementation consists of these steps:
The DTW can be quickly described as follows:
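As an illustration only, the textbook form of DTW can be sketched in a few lines. This is not the author's MATLAB implementation: the function name `dtw_distance`, the Euclidean frame-to-frame distance, and the three-way step pattern (insertion, deletion, match) are all assumptions, and the kind of feature vector extracted from each frame is left open.

```python
import math

def dtw_distance(test, ref):
    """Accumulated DTW distance between two sequences of feature vectors.

    Hypothetical helper: `test` and `ref` are lists of equal-dimension
    numeric vectors, one vector per analysis frame.
    """
    n, m = len(test), len(ref)
    inf = float("inf")
    # acc[i][j] = cheapest accumulated cost of aligning test[:i] with ref[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: Euclidean distance between the two frames (assumed)
            local = math.dist(test[i - 1], ref[j - 1])
            # Extend the cheapest of the three allowed predecessor cells
            acc[i][j] = local + min(acc[i - 1][j],      # test frame repeated
                                    acc[i][j - 1],      # reference frame repeated
                                    acc[i - 1][j - 1])  # both advance
    return acc[n][m]
```

With this sketch, a time-stretched copy of a sequence has distance zero to the original, which is the property that makes DTW suitable for comparing utterances spoken at different speeds.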

| Test Utterance (rows) / Reference Template (columns) | Zero | One | Two | Three | Four | Five | Six | Seven | Eight | Nine | Correct Match? | Distance to Closest Neighbor (Confidence) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zero | 56.4256 | 80.1390 | 108.2439 | 59.8043 | 85.1671 | 71.7628 | 110.2530 | 109.9047 | 65.8139 | 68.2818 | Yes | 3.3787 |
| One | 82.6255 | 72.1513 | 75.9218 | 107.7745 | 126.1208 | 134.8430 | 147.8016 | 145.0947 | 92.0218 | 78.6653 | Yes | 3.7705 |
| Two | 60.0401 | 62.5386 | 47.3730 | 100.8893 | 69.9329 | 77.0554 | 89.5916 | 82.9683 | 54.9619 | 130.2274 | Yes | 7.5889 |
| Three | 78.6856 | 103.8081 | 80.6923 | 56.8189 | 73.3303 | 65.4852 | 113.0276 | 108.5861 | 72.6364 | 102.1738 | Yes | 8.6663 |
| Four | 83.8541 | 139.2983 | 71.0875 | 84.6918 | 111.8913 | 161.8228 | 142.3920 | 135.3159 | 64.2175 | 130.8920 | No | N/A |
| Five | 97.7762 | 117.6094 | 84.8519 | 93.1297 | 106.6733 | 113.0875 | 128.7775 | 117.0927 | 93.4818 | 114.2581 | No | N/A |
| Six | 172.8461 | 168.6293 | 121.1009 | 157.1766 | 190.6625 | 223.0134 | 155.8142 | 160.9377 | 159.5971 | 173.9432 | No | N/A |
| Seven | 95.4183 | 142.3294 | 115.0825 | 95.2566 | 148.3594 | 162.7888 | 119.0059 | 123.5402 | 77.2117 | 128.1571 | No | N/A |
| Eight | 56.6915 | 81.2364 | 66.9252 | 57.3227 | 58.9160 | 50.2318 | 78.4490 | 80.0069 | 41.2319 | 60.7897 | Yes | 8.9999 |
| Nine | 55.3910 | 81.1640 | 63.6884 | 49.5573 | 80.4213 | 60.7370 | 97.8662 | 95.5027 | 58.6548 | 42.9859 | Yes | 6.5714 |

| Test Utterance (rows) / Reference Template (columns) | Zero | One | Two | Three | Four | Five | Six | Seven | Eight | Nine | Correct Match? | Distance to Closest Neighbor (Confidence) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zero | 36.2064 | 45.8094 | 76.6086 | 49.2861 | 56.1699 | 75.1615 | 85.1737 | 75.2841 | 43.1693 | 53.5918 | Yes | 6.9629 |
| One | 61.6812 | 45.9605 | 52.4323 | 58.0059 | 79.4239 | 84.2070 | 97.2300 | 94.2898 | 69.0730 | 53.4150 | Yes | 6.4718 |
| Two | 40.1147 | 37.3977 | 24.4036 | 70.1130 | 41.4166 | 52.4931 | 57.9425 | 70.5491 | 34.9085 | 40.0762 | Yes | 10.5049 |
| Three | 41.4355 | 68.2561 | 46.5751 | 32.2351 | 55.6051 | 42.1311 | 80.6432 | 69.0830 | 46.4739 | 51.7426 | Yes | 9.8960 |
| Four | 60.7684 | 91.2816 | 50.7711 | 49.6837 | 68.4320 | 101.5672 | 94.6502 | 91.1534 | 48.4744 | 75.4980 | No | N/A |
| Five | 66.8791 | 79.4318 | 56.7268 | 61.1178 | 67.9902 | 84.4352 | 85.5993 | 88.9362 | 65.9698 | 76.7196 | No | N/A |
| Six | 112.2202 | 113.2865 | 79.4277 | 102.8140 | 127.6908 | 144.7135 | 96.6866 | 102.4596 | 103.8494 | 116.5784 | No | N/A |
| Seven | 71.5899 | 93.7538 | 72.3037 | 66.0189 | 79.7796 | 109.0510 | 66.5481 | 85.4794 | 56.6138 | 82.4248 | No | N/A |
| Eight | 38.4369 | 54.6732 | 46.0791 | 36.9963 | 39.3292 | 37.5188 | 53.9062 | 80.7746 | 24.9407 | 44.4719 | Yes | 12.0556 |
| Nine | 43.4804 | 56.6221 | 70.6874 | 32.9435 | 49.8862 | 49.1736 | 72.3237 | 66.5296 | 30.5334 | 46.3745 | No | N/A |

| Test Utterance (rows) / Reference Template (columns) | Zero | One | Two | Three | Four | Five | Six | Seven | Eight | Nine | Correct Match? | Distance to Closest Neighbor (Confidence) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zero | 28.2598 | 41.5705 | 61.3345 | 37.9966 | 44.1510 | 44.2357 | 67.6840 | 64.3755 | 33.9891 | 31.6468 | Yes | 3.387 |
| One | 41.8230 | 29.0019 | 41.1519 | 41.2958 | 49.5727 | 64.9650 | 72.1253 | 67.2403 | 43.9189 | 39.2081 | Yes | 10.2062 |
| Two | 28.6177 | 27.4816 | 20.4998 | 49.8186 | 28.2336 | 26.8519 | 40.3461 | 46.4477 | 27.6304 | 27.9546 | Yes | 6.3521 |
| Three | 30.2945 | 50.3242 | 35.6699 | 23.9683 | 25.7686 | 30.9346 | 61.0620 | 60.4368 | 37.7762 | 40.4791 | Yes | 1.8003 |
| Four | 48.7906 | 66.3799 | 60.3192 | 48.3050 | 43.3070 | 73.1194 | 65.1496 | 64.5462 | 61.4572 | 62.8617 | Yes | 4.998 |
| Five | 49.0082 | 57.2663 | 43.9660 | 44.1310 | 48.4064 | 61.6288 | 65.0797 | 68.5955 | 55.4195 | 55.6039 | No | N/A |
| Six | 85.0703 | 84.1097 | 55.7597 | 77.6745 | 79.1677 | 109.8530 | 69.2289 | 81.9101 | 84.3601 | 87.2103 | No | N/A |
| Seven | 40.4793 | 63.8107 | 55.8562 | 51.8183 | 67.7065 | 80.9153 | 46.4665 | 55.9509 | 44.3850 | 61.8762 | No | N/A |
| Eight | 23.7352 | 35.0319 | 31.9963 | 23.4581 | 36.2972 | 27.0228 | 56.9895 | 47.5354 | 18.1631 | 27.4224 | Yes | 5.295 |
| Nine | 30.5387 | 41.5433 | 54.3758 | 25.7596 | 38.6188 | 36.0943 | 63.5565 | 64.3479 | 35.5187 | 35.5377 | No | N/A |
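
The last two columns of each table appear to follow from a nearest-template decision: the recognized digit is the template with the smallest distance, and the confidence is the gap between the smallest and second-smallest distance. The sketch below assumes that reading; the function name `classify` is hypothetical, and the sample row is the "Zero" row of the first table.

```python
DIGITS = ["Zero", "One", "Two", "Three", "Four",
          "Five", "Six", "Seven", "Eight", "Nine"]

def classify(distances):
    """Return (nearest template label, margin to the runner-up template)."""
    ranked = sorted(zip(distances, DIGITS))  # smallest distance first
    (best, label), (second, _) = ranked[0], ranked[1]
    return label, second - best

# "Zero" test utterance, first results table
zero_row = [56.4256, 80.1390, 108.2439, 59.8043, 85.1671,
            71.7628, 110.2530, 109.9047, 65.8139, 68.2818]

label, margin = classify(zero_row)
print(label, round(margin, 4))  # Zero 3.3787
```

A small margin means the runner-up template was almost as close as the winner, so the match is less trustworthy even when it is correct.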
The overall percentage of correct matches is about 60% (6, 5, and 6 correct out of 10 in the three tables above). Success and failure correlate with the sounds in the word being matched. Words made up only of plosives and vowels are generally matched correctly, while words with fricatives, such as "six", "seven", "four" and "five", are mostly not. Perhaps the white-noise-like spectrum of the fricatives makes these words difficult to match. For instance, the word "six" produced very high accumulated distances, which means that the algorithm had a very difficult time matching it to any template.
In general, the words "four" and "five" are very similar, and so are "six" and "seven": the words in each pair start with the same sound. It would make sense for "four" to be confused with "five", but not with "two", as it is in the results. It is also very interesting that the incorrectly identified words are consistently matched to either "two" or "eight". Perhaps it has to do with phonemes: both "five" and "six" contain the "i" sound ("five" at the end of the "ai", and "six" as a plain "i"), but again it is counter-intuitive to match these to "two".
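
The pattern of confusions can be checked directly against the data. The snippet below takes the four misrecognized rows of the first results table and reports which reference template each one landed on; the variable names are illustrative only.

```python
DIGITS = ["Zero", "One", "Two", "Three", "Four",
          "Five", "Six", "Seven", "Eight", "Nine"]

# Misrecognized test utterances, copied from the first results table
failed_rows = {
    "Four":  [83.8541, 139.2983, 71.0875, 84.6918, 111.8913,
              161.8228, 142.3920, 135.3159, 64.2175, 130.8920],
    "Five":  [97.7762, 117.6094, 84.8519, 93.1297, 106.6733,
              113.0875, 128.7775, 117.0927, 93.4818, 114.2581],
    "Six":   [172.8461, 168.6293, 121.1009, 157.1766, 190.6625,
              223.0134, 155.8142, 160.9377, 159.5971, 173.9432],
    "Seven": [95.4183, 142.3294, 115.0825, 95.2566, 148.3594,
              162.7888, 119.0059, 123.5402, 77.2117, 128.1571],
}

# Map each spoken digit to the template with the smallest distance
confusions = {spoken: DIGITS[row.index(min(row))]
              for spoken, row in failed_rows.items()}

for spoken, recognized in confusions.items():
    print(f"{spoken} -> {recognized}")
# Four -> Eight
# Five -> Two
# Six -> Two
# Seven -> Eight
```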
A final remark about taking the computer's perspective: if a person had heard the words "six" and "seven", "four" and "five" merely four times in their life, that person would have difficulty distinguishing them. Even for a human this task would be difficult with just four samples to train with. Clearly, a much larger training corpus, or a more sophisticated method of training the reference templates, is needed to make automatic speech recognition more reliable.
One potentially key improvement, in my (inexperienced) opinion, is phoneme-oriented automatic speech recognition. It does not make much sense to recognize words per se. Humans do not recognize words as an indivisible whole, but rather as sequences of distinct phonemes. Some kind of intelligent approach would be needed to do that, with a very large corpus of different words separated into their component phonemes, so that the algorithm would have access to the actual phonemes.
The most important improvement would be to use a larger corpus of training data.
The complete MATLAB code is available, together with the test and reference WAV sound files, as a zip file.