CTU Prague,
Faculty of Electrical Engineering,
department K331
Speech Processing Group
A Postscript version of the paper distributed during the meeting is also available for download, to view or print.
A closer look at the real speech signal often reveals that the concepts used in model structure design do not correspond very well with what is really present in the utterance. The great flexibility of HMM parameters allows us to largely neglect this fact and to trust that a massive training process will overcome the problem. However, there may still be potential for improvement hidden in the structure design stage.
Our work aims to explore ways of basing the model structure not only on the experience of the designer, but also on the real structure of the speech data itself.
We used the HTK toolkit to train EHMMs. Using HTK for this purpose is a bit tricky, but possible. HTK allows a full transition matrix in a model trained with the HRest tool, but it offers no direct way to initialise such a model and no way to discover the most probable sequence of states inside a model. The first problem can be solved by using HInit to initialise a model with 1 emitting state and N mixtures and then converting it to an N-state, 1-mixture model, thus gaining access to the quite suitable clustering algorithm hidden in HInit. The second problem can be solved by converting the trained N-state model into an equivalent network of N 1-state models, using the transition matrix as bigram probabilities. The resulting network can be used with the HVite tool to code the speech as a sequence of EHMM states.
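The decoding step described above amounts to a Viterbi search over N single-state units whose between-unit probabilities are the rows of the transition matrix. The following is a minimal Python sketch of that idea, not the HVite implementation; the function name `viterbi`, its parameters, and the uniform initial distribution are assumptions made for illustration.

```python
import math

def viterbi(log_emis, trans, init=None):
    """Most probable state sequence for a fully connected (ergodic) HMM.

    log_emis: list of frames, each a list of per-state log-likelihoods.
    trans:    N x N transition probabilities, playing the role of the
              bigram probabilities between the N one-state models.
    init:     initial state probabilities (assumed uniform if omitted).
    """
    n = len(trans)
    if init is None:
        init = [1.0 / n] * n

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # Scores for the first frame.
    score = [log(init[s]) + log_emis[0][s] for s in range(n)]
    back = []  # back-pointers, one list per frame after the first
    for frame in log_emis[1:]:
        prev, score, ptr = score, [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + log(trans[i][j]))
            ptr.append(best)
            score.append(prev[best] + log(trans[best][j]) + frame[j])
        back.append(ptr)

    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

Given per-frame log-likelihoods from any source, `viterbi(log_emis, trans)` returns one state index per frame.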
Observation of the resulting state sequences revealed that EHMMs naturally tend to stay in one state for several consecutive frames, thus defining the speech units as parts of the signal during which the state does not change.
The following table summarises the sizes of the units obtained using EHMMs with different numbers of states. Sizes are compared with the average length of a phoneme in the corpus.
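Turning a frame-level state sequence into units is a run-length segmentation. The sketch below illustrates this in Python; the function names are hypothetical, and the ratio computed by `units_per_phoneme` (number of units divided by number of phonemes) is an assumption about how such a figure could be derived, not a statement of the authors' procedure.

```python
from itertools import groupby

def units_from_states(states):
    """Collapse a per-frame state sequence into (state, length_in_frames) units."""
    return [(state, len(list(run))) for state, run in groupby(states)]

def units_per_phoneme(states, n_phonemes):
    """Assumed measure: number of units divided by number of phonemes."""
    return len(units_from_states(states)) / n_phonemes
```

For example, the sequence `[3, 3, 3, 1, 1, 2]` yields three units lasting 3, 2 and 1 frames.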
| N | units/phoneme |
| 4 | 1.01 |
| 8 | 1.31 |
| 16 | 1.67 |
| 23 | 1.88 |
| 32 | 1.98 |
| 45 | 2.18 |
Sequences of units obtained for different realisations of the same utterance by the same speaker exhibit surprisingly high similarity, and a simple recognition system can be constructed just by storing reference sequences and comparing them with the unknown one using dynamic programming.
Methods of using the resulting units in continuous speech recognition are currently being studied. Preliminary experiments show a phoneme recognition score of 75% on the test corpus.
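A simple recogniser of this kind can be sketched as a nearest-reference search under an edit distance. The Python example below assumes unit costs for insertion, deletion and substitution; the source does not state which local costs were used, and the names `edit_distance` and `recognise` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two unit sequences (unit costs assumed)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution or match
        prev = cur
    return prev[-1]

def recognise(unknown, references):
    """Return the label of the stored reference closest to the unknown sequence.

    references: dict mapping a label to its reference unit sequence.
    """
    return min(references, key=lambda label: edit_distance(unknown, references[label]))
```

With references stored as a dictionary of label to unit sequence, `recognise(unknown, references)` returns the label with the smallest distance.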
The corpus used contains 331,881 words. The words were first transcribed using pronunciation rules. The results were used to create a dictionary of Czech words and their pronunciations, sorted by frequency. The dictionary was then checked manually for transcription errors, and the pronunciation rules were extended to work correctly for exceptional cases (usually foreign words). This process was repeated until all the frequent words were transcribed correctly. The original text was then transcribed using the final rules, and the spaces between words were removed, giving a simulated transcription of continuous speech. This text was searched for chunks, defined as strings beginning and ending with either a vowel or an occlusive. The chunks found were counted separately in four groups, according to the type of their boundaries.
The results of this analysis show that a relatively small number of chunks occur very often, which gives a good chance of building a dictionary of chunks to be used for recognition. The following table shows the number of different chunks required to cover a given percentage of all chunks in each class.
| type | 30% | 60% | 90% |
| occlusive to occlusive | 5 | 18 | 62 |
| occlusive to vowel | 9 | 31 | 104 |
| vowel to occlusive | 14 | 39 | 166 |
| vowel to vowel | 36 | 143 | 221 |
The table shows that a large part of the signal could probably be recognised using a recogniser capable of recognising only the most frequent chunks. One possible recognition technique being evaluated is based on Time Delay Neural Networks.
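Coverage counts like those in the table can be obtained by sorting the chunk types of one class by frequency and accumulating their counts until the target share is reached. The following Python sketch shows this; the function name `chunks_needed` and its input format (one list entry per chunk occurrence) are assumptions for illustration.

```python
from collections import Counter

def chunks_needed(chunks, fraction):
    """Number of distinct chunk types needed to cover `fraction` of all occurrences.

    chunks:   iterable with one entry per chunk occurrence in a given class.
    fraction: target coverage between 0 and 1.
    """
    counts = Counter(chunks)
    total = sum(counts.values())
    covered = 0
    # Walk through chunk types from most to least frequent.
    for rank, (_, count) in enumerate(counts.most_common(), 1):
        covered += count
        if covered >= fraction * total:
            return rank
    return len(counts)
```

Calling the function with fractions 0.3, 0.6 and 0.9 on the occurrences of one boundary class would give one row of such a table.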
Our experiments conducted on the French corpus Meteo Marine show that this initial phase can be avoided and that the missing information can be inferred from the data itself. We used a modified initialisation, and in the training process we inserted an 'annealing' of the models (tying/untying), which can help the HMMs escape local minima reached due to the lack of conventional bootstrap training.
The Meteo Marine corpus contains many weather forecasts, each about 5 minutes long, for individual sea sectors around France. It was recorded from FM radio and sampled at 16 kHz with 16-bit resolution. Only whole sentences were labelled (the only time information used is where each sentence begins and ends), and the phonetic transcription was generated automatically from the written text alone. The part of the database used comes from one speaker and has the following characteristics:
We constructed word models with the number of states equal to the number of phonemes in the word, left-to-right with no skips, 12 cepstral coefficients, 1 mixture only, and a diagonal covariance matrix. The training process was conducted as follows:
In the recognition process we used a simple network allowing transitions between words, with probabilities estimated on the training corpus. We made no provision for recognising words not present in the training corpus, so there is a theoretical accuracy limit of 95.5% due to missing models, and this limit is lowered further by word pairs that are missing from the training corpus.
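The word-pair probabilities and the limit imposed by unseen words can both be estimated by simple counting. The Python sketch below uses relative-frequency estimates without smoothing, which is an assumption; the function names and the toy sentences in the usage note are hypothetical and are not taken from the corpus.

```python
from collections import Counter, defaultdict

def bigram_probs(sentences):
    """Relative-frequency estimate of P(next word | word); no smoothing (assumed)."""
    pair_counts = defaultdict(Counter)
    for words in sentences:
        for first, second in zip(words, words[1:]):
            pair_counts[first][second] += 1
    return {
        first: {second: c / sum(nxt.values()) for second, c in nxt.items()}
        for first, nxt in pair_counts.items()
    }

def known_word_share(train_sentences, test_sentences):
    """Share of test word tokens that have a model, i.e. occur in training."""
    vocab = {w for s in train_sentences for w in s}
    test_words = [w for s in test_sentences for w in s]
    return sum(w in vocab for w in test_words) / len(test_words)
```

For instance, with training sentences `[["vent", "de", "nord"], ["vent", "de", "sud"]]`, the word "de" is followed by "nord" and "sud" with probability 0.5 each, and any test word outside this vocabulary lowers the share returned by `known_word_share`.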
Training consisted of 9 passes of HERest with states tied, 6 passes of HERest with states untied, and 15 further passes of HERest (training was stopped because of limited resources, while still far from convergence). Using the HVite tool with a fixed per-model penalty (p = -40) to reduce insertions, we then achieved an HResults word recognition rate of Corr=91.73%, Acc=86.20%. We estimate that eliminating sentences containing unknown words from the test corpus would lead to a recognition score of Corr=96.05%, Acc=90.26%.
The recognition scores show that bootstrap training can be avoided and that an HMM recogniser can be built from training databases in which only whole sentences are manually labelled, using phoneme-like units inferred by the training process itself.
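For reference, the two HResults figures differ only in whether insertions are penalised: Corr counts the correctly recognised words, while Acc additionally subtracts the insertions. A minimal Python sketch of both formulas follows; the function name is hypothetical and the counts in the usage note are illustrative, not taken from the experiment.

```python
def hresults_scores(n, deletions, substitutions, insertions):
    """Percent correct and percent accuracy as defined for HTK's HResults.

    n is the number of words in the reference transcription.
    Corr = (N - D - S) / N * 100
    Acc  = (N - D - S - I) / N * 100
    """
    hits = n - deletions - substitutions
    corr = 100 * hits / n
    acc = 100 * (hits - insertions) / n
    return corr, acc
```

With 100 reference words, 3 deletions, 5 substitutions and 6 insertions, the function returns Corr = 92.0 and Acc = 86.0, which shows why a penalty that reduces insertions raises Acc without changing Corr.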