Different Approaches to Definition of Elements Used in Speech Recognition

Presented by Vaclav Hanzl and Jan Uhlir at the COST 249 meeting in Kosice, 1996

CTU Prague, Faculty of Electrical Engineering, department K331
Speech Processing Group


You can download a color Postscript version of slides presented at the meeting and look at it using ghostview on some UNIX machines.

You can also download a Postscript version of the paper distributed during the meeting and look at it or print it.


Our presentation in Nancy, 1995 is also available online.

Abstract

The traditional approach to speech recognition borrows its concepts of speech units from both phonetics and linguistics. Phonemes and words are viable concepts for humans; however, they might not be the ideal units for a machine. Three possibilities for using different elements are described in this paper: units inferred by an ergodic HMM, chunks of signal cut in the middle of certain phonemes, and phoneme-like units found by an HMM during partly unsupervised training.

Introduction

Continuous speech recognition techniques based on Hidden Markov Models (HMMs) require the model structure to be designed a priori. This design is usually driven by the idea that speech is composed of sentences, sentences are composed of words, and words are composed of phonemes. Various modifications of this basic idea exist, enhancing the capabilities of the model structure by using context-dependent phonemes, groupings of phonemes into bigger subword units, etc. Once the model structure has been designed, the model is trained on real data.

A closer look at the real speech signal often reveals that the concepts used in the model structure design do not correspond very well with what is really present in the utterance. The great flexibility of HMM parameters allows us to largely neglect this fact and to believe that a massive training process will overcome the problem. However, there might still be potential for improvement hidden in the structure design stage.

Our work aims to explore the possibilities of basing the model structure not only on the experience of the designer but also on the real structure of the speech data itself.

Ergodic Hidden Markov Models

An ergodic HMM (EHMM) is an HMM that potentially allows any state transition, in contrast with the usual left-to-right model. An EHMM can be trained on speech data without any labeling and without any assumptions regarding the signal structure, which makes it a natural candidate for the task of capturing the low-level structure hidden in real speech.

We used the HTK toolkit to train EHMMs. Using HTK for this purpose is a bit tricky but possible. HTK allows a full transition matrix in a model trained with the HRest tool, but there is no direct way to initialise such a model and no way to discover the most probable sequence of states inside a model. The first problem can be solved by using HInit to initialise a model with 1 emitting state and N mixtures and then converting this model to an N-state, 1-mixture model, thus gaining access to the quite suitable clustering algorithm hidden in HInit. The second problem can be solved by converting the trained N-state model into an equivalent network of N 1-state models, using the transition matrix as bigram probabilities. The resulting network can be used with the HVite tool to code the speech as a sequence of EHMM states.
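The second conversion can be illustrated with a minimal sketch, under stated assumptions. The function name, the unit names s1..sN and the probability floor are hypothetical and not taken from HTK. The text does not say how self-transitions were handled, so this sketch simply keeps them as ordinary bigram entries. It maps each entry of a row-stochastic transition matrix to a bigram log-probability between two one-state units.

```python
import math

def transitions_to_bigrams(trans, floor=1e-10):
    """Map an N x N transition matrix to bigram log-probabilities
    between N one-state units (hypothetical names s1..sN)."""
    names = [f"s{i + 1}" for i in range(len(trans))]
    bigrams = {}
    for i, row in enumerate(trans):
        total = sum(row)
        for j, p in enumerate(row):
            # Renormalise the row and floor the value to avoid log(0).
            prob = max(p / total, floor)
            bigrams[(names[i], names[j])] = math.log(prob)
    return bigrams

# Toy 2-state transition matrix (illustrative values only).
example = [[0.8, 0.2],
           [0.3, 0.7]]
bigrams = transitions_to_bigrams(example)
```

Writing the result in the network or language-model format expected by a particular decoder is a separate step that this sketch does not attempt.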

Observation of the resulting state sequences revealed that EHMMs naturally tend to stay in one state for several consecutive frames. This defines the speech units as the parts of the signal during which the state does not change.
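This definition of a unit amounts to run-length encoding of the per-frame state sequence. The sketch below assumes the decoded sequence is available as a plain list of state indices; the function names and the toy data are hypothetical.

```python
from itertools import groupby

def states_to_units(frame_states):
    """Collapse a per-frame state sequence into (state, n_frames) units."""
    return [(state, len(list(run))) for state, run in groupby(frame_states)]

def units_per_phoneme(frame_states, n_phonemes):
    """Average number of units per phoneme for one utterance."""
    return len(states_to_units(frame_states)) / n_phonemes

# Toy sequence of 10 frames (illustrative values only).
frames = [3, 3, 3, 1, 1, 4, 4, 4, 4, 1]
units = states_to_units(frames)
```

For the toy sequence, `units` holds four units; if the utterance were known to contain two phonemes, `units_per_phoneme(frames, 2)` would give 2.0.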

The following table summarises the sizes of the units obtained using EHMMs with different numbers of states N. The table reports the average number of units per phoneme, i.e. unit size is compared with the average length of a phoneme in the corpus.

States (N)   Units per phoneme
 4           1.01
 8           1.31
16           1.67
23           1.88
32           1.98
45           2.18

Sequences of units obtained for different realisations of the same utterance by the same speaker exhibit surprisingly high similarity. A simple recognition system can therefore be constructed just by storing reference sequences and comparing them with the unknown one using dynamic programming.
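A minimal sketch of such a comparison follows. The text only says "dynamic programming"; the choice of the Levenshtein (edit) distance with unit costs is an assumption, as are the function names and the toy reference sequences.

```python
def edit_distance(a, b):
    """Levenshtein distance between two unit sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def recognise(unknown, references):
    """Return the label of the stored sequence closest to `unknown`."""
    return min(references,
               key=lambda label: edit_distance(unknown, references[label]))

# Hypothetical reference sequences of EHMM state indices.
references = {"ano": [3, 1, 4, 1], "ne": [2, 7, 2]}
best = recognise([3, 1, 4, 4, 1], references)
```

Here the unknown sequence differs from the "ano" reference by a single inserted unit, so "ano" is selected. A practical system might also weight the costs by unit duration or by the acoustic distance between states.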

Methods of using the resulting units in continuous speech recognition are currently being studied. Preliminary experiments show a phoneme recognition score of 75% on the test corpus.

Signal Chunks

Another approach we evaluated concentrates on points in the signal that can be found repeatably by signal analysis, e.g. local minima of energy and centers of steady parts, and uses those points to split the signal into so-called 'chunks'. Chunk boundaries defined in this way may well correspond to the centers of plosives and vowels. We carried out an extensive statistical analysis of the Czech language to estimate the prospects of the chunk-based recognition approach described above.
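As a rough illustration of the first kind of boundary point, the sketch below computes a frame energy contour and splits it at local minima. It covers only energy minima, not centers of steady parts. The frame length, the function names and the toy contour are assumptions, and a real detector would need smoothing and thresholds that are not described here.

```python
def frame_energy(samples, frame_len):
    """Sum of squared samples in consecutive non-overlapping frames."""
    return [sum(s * s for s in samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def local_minima(values):
    """Indices of frames lower than the previous frame and not above the next."""
    return [i for i in range(1, len(values) - 1)
            if values[i] < values[i - 1] and values[i] <= values[i + 1]]

def split_into_chunks(n_frames, boundaries):
    """Turn boundary frame indices into (start, end) chunk intervals."""
    edges = [0] + list(boundaries) + [n_frames]
    return [(edges[k], edges[k + 1]) for k in range(len(edges) - 1)]

# Toy energy contour (illustrative values only).
energy = [5, 3, 1, 4, 6, 2, 2, 7]
minima = local_minima(energy)
chunks = split_into_chunks(len(energy), minima)
```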

The corpus used contains 331,881 words. The words were first transcribed using pronunciation rules. The results were used to create a dictionary of Czech words and their pronunciations, sorted by frequency. The dictionary was then checked manually for transcription errors, and the pronunciation rules were extended to work correctly for exceptional cases (usually foreign words). This process was repeated until all the frequent words were transcribed correctly. The original text was then transcribed using the final rules, and the spaces between words were removed, giving a simulated transcription of continuous speech. This text was searched for chunks, defined as strings beginning and ending with either a vowel or an occlusive. The chunks found were counted separately in four groups, according to the type of their boundaries.
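The chunk search can be sketched as follows, under several assumptions: each phoneme is written as a single character, the vowel and occlusive sets below are simplified ASCII stand-ins for the real Czech inventory, and a chunk runs from one boundary phoneme to the next one, inclusive. None of these details are specified in the text.

```python
from collections import Counter

VOWELS = set("aeiouy")       # assumed, simplified vowel symbols
OCCLUSIVES = set("pbtdkg")   # assumed, simplified occlusive symbols

def boundary_type(symbol):
    """Return 'vowel', 'occlusive', or None for other phonemes."""
    if symbol in VOWELS:
        return "vowel"
    if symbol in OCCLUSIVES:
        return "occlusive"
    return None

def extract_chunks(transcription):
    """Yield (group, chunk) for each span between neighbouring boundary phonemes."""
    anchors = [i for i, s in enumerate(transcription) if boundary_type(s)]
    for start, end in zip(anchors, anchors[1:]):
        group = (f"{boundary_type(transcription[start])} to "
                 f"{boundary_type(transcription[end])}")
        yield group, transcription[start:end + 1]

# Toy transcription with the spaces between words already removed.
found = list(extract_chunks("dobryden"))
counts = Counter(found)
```

Counting the `(group, chunk)` pairs with a `Counter` gives per-group frequency lists of the kind the analysis relies on.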

The results of this analysis show that a relatively small number of chunks occur very often, which gives a good chance of building a dictionary of chunks to be used for recognition. The following table shows the number of different chunks required to cover a given percentage of all chunk occurrences in each class.

Boundary type            30%   60%   90%
occlusive to occlusive     5    18    62
occlusive to vowel         9    31   104
vowel to occlusive        14    39   166
vowel to vowel            36   143   221

The table shows that a large part of the signal could probably be recognised by a recogniser capable of recognising only the most frequent chunks. One possible recognition technique being evaluated is based on Time Delay Neural Networks.
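Coverage figures of the kind shown in the table above can be obtained by sorting the chunks of one class by frequency and accumulating their counts. The sketch below assumes the counts are held in a `Counter`; the function name and the toy counts are hypothetical and unrelated to the real corpus.

```python
from collections import Counter

def chunks_needed(counts, fraction):
    """Smallest number of most frequent chunks covering `fraction` of occurrences."""
    total = sum(counts.values())
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= fraction * total:
            return rank
    return len(counts)

# Toy chunk counts for one class (illustrative values only).
toy_counts = Counter({"do": 50, "de": 30, "ta": 15, "ku": 5})
needed = [chunks_needed(toy_counts, f) for f in (0.3, 0.6, 0.9)]
```

For the toy counts, one chunk covers 30% of the occurrences, two cover 60% and three cover 90%.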

Avoiding Manual Labeling of Phonemes

The usual HMM training process requires part of the training set to be manually labeled. This bootstrap part of the database transfers the human idea of what the speech elements are into the initial model parameters and gives the training process a starting point from which it can converge during subsequent training.

Our experiments conducted on the French corpus Meteo Marine show that this initial phase can be avoided and the missing information can be inferred from the data itself. We used a modified initialisation, and we inserted 'annealing' of the models (tying/untying) into the training process, which can help the HMMs escape local minima reached due to the lack of conventional bootstrap training.

The Meteo Marine corpus contains many weather forecasts, each about 5 minutes long, for individual sea sectors around France. It was recorded from FM radio and sampled at 16 kHz with 16-bit resolution. Only whole sentences were labeled (the only time information used is where each sentence begins and ends), and a phonetic transcription was automatically generated from the written text only. The part of the database used is from one speaker and has the following characteristics:

We constructed word models with the number of states equal to the number of phonemes in the word, left-to-right with no skips, using 12 cepstral coefficients, 1 mixture only, and a diagonal covariance matrix. The training process was conducted as follows:

  1. Every state initialised with the average means & variances of the whole training corpus
  2. Spurious 'labels' generated by uniform segmentation according to phonetic transcription of sentences
  3. HInit used on words with at least 3 occurrences
  4. HERest used several times
  5. All the states corresponding to the same 'phoneme' tied
  6. HERest used several times
  7. All the states untied
  8. HERest used several times
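Step 2 of the procedure above can be sketched as follows. The sketch assumes that sentence boundaries are given as integer frame indices and that the transcription is a list of phoneme symbols; the function name and the example symbols are hypothetical, and the actual label file format used with HTK is not shown.

```python
def uniform_segmentation(start, end, phonemes):
    """Split the interval [start, end) evenly among the phonemes of one sentence."""
    if not phonemes:
        raise ValueError("transcription must contain at least one phoneme")
    n = len(phonemes)
    # Integer arithmetic keeps the segment edges on whole frames.
    edges = [start + (end - start) * k // n for k in range(n + 1)]
    return [(edges[k], edges[k + 1], ph) for k, ph in enumerate(phonemes)]

# Toy sentence of 90 frames with three phonemes (illustrative only).
labels = uniform_segmentation(0, 90, ["m", "E", "R"])
```

Each phoneme receives an equal share of the sentence, which is why such labels are only a crude starting point that later re-estimation passes have to correct.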

In the recognition process we used a simple network allowing transitions between words, with probabilities estimated on the training corpus. We made no provision for recognising words not present in the training corpus. This gives a theoretical accuracy limit of 95.5% due to missing models, and a further decrease of this limit due to word pairs missing from the training corpus.
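A minimal sketch of how such word transition probabilities and the vocabulary coverage might be computed is given below. It assumes plain relative-frequency estimates without smoothing, which the text does not confirm; the function names and the toy sentences are hypothetical.

```python
from collections import Counter, defaultdict

def estimate_word_transitions(sentences):
    """Relative-frequency estimate of P(next word | previous word)."""
    pair_counts = defaultdict(Counter)
    for words in sentences:
        for prev, nxt in zip(words, words[1:]):
            pair_counts[prev][nxt] += 1
    return {prev: {nxt: n / sum(followers.values())
                   for nxt, n in followers.items()}
            for prev, followers in pair_counts.items()}

def known_word_fraction(train_sentences, test_sentences):
    """Fraction of test word tokens that also occur in the training data."""
    vocab = {w for s in train_sentences for w in s}
    test_words = [w for s in test_sentences for w in s]
    return sum(w in vocab for w in test_words) / len(test_words)

# Toy training sentences (illustrative only).
train = [["vent", "de", "nord"], ["vent", "de", "sud"], ["vent", "faible"]]
probs = estimate_word_transitions(train)
coverage = known_word_fraction(train, [["vent", "fort"]])
```

Without smoothing, any word pair unseen in training gets no transition at all, which corresponds to the additional loss from missing word pairs mentioned above.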

Using the training process with 9x HERest, states tied, 6x HERest, states untied, 15x HERest (training was stopped due to limited resources, still far from convergence) and then using the HVite tool with a fixed per-model penalty (p = -40) to decrease insertions, we achieved HResults word recognition rates of Corr=91.73%, Acc=86.20%. We estimate that eliminating sentences containing unknown words from the test corpus would lead to recognition scores of Corr=96.05%, Acc=90.26%.
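For reference, the two scores are defined from the number of reference words N and the numbers of deletions D, substitutions S and insertions I as Corr = (N - D - S) / N and Acc = (N - D - S - I) / N, both expressed in percent. The sketch below computes them from hypothetical counts. It also shows that the estimated scores quoted above are numerically consistent with dividing the measured scores by the 95.5% limit; whether the estimate was actually derived this way is an assumption.

```python
def corr_and_acc(n_ref, deletions, substitutions, insertions):
    """Percent correct and percent accuracy from word-level error counts."""
    hits = n_ref - deletions - substitutions
    corr = 100.0 * hits / n_ref
    acc = 100.0 * (hits - insertions) / n_ref
    return corr, acc

def rescale_for_known_words(score, known_fraction):
    """Assumed estimate of the score if unknown words were excluded."""
    return score / known_fraction

# Hypothetical counts, not taken from the experiment.
corr, acc = corr_and_acc(n_ref=1000, deletions=20, substitutions=60, insertions=50)

# Measured scores from the text, rescaled by the 95.5% limit.
est_corr = round(rescale_for_known_words(91.73, 0.955), 2)
est_acc = round(rescale_for_known_words(86.20, 0.955), 2)
```

Because Acc also penalises insertions, it is always less than or equal to Corr, which is why the insertion penalty passed to the decoder affects the two scores differently.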

The recognition scores show that bootstrap training can be avoided and an HMM recogniser can be built using training databases in which only whole sentences are manually labeled, using phoneme-like units inferred by the training process itself.