Automatic Speech Recognition: The Development of the SPHINX System

Springer Science & Business Media, Oct 31, 1988 - Technology & Engineering - 207 pages
Speech recognition has a long history of being one of the difficult problems in artificial intelligence and computer science. As one goes from problem-solving tasks such as puzzles and chess to perceptual tasks such as speech and vision, the problem characteristics change dramatically: knowledge poor to knowledge rich; low data rates to high data rates; slow response time (minutes to hours) to instantaneous response time. Taken together, these characteristics increase the computational complexity of the problem by several orders of magnitude. Further, speech provides a challenging task domain that embodies many of the requirements of intelligent behavior: operate in real time; exploit vast amounts of knowledge; tolerate errorful, unexpected, unknown input; use symbols and abstractions; communicate in natural language; and learn from the environment.

Voice input to computers offers a number of advantages. It provides a natural, fast, hands-free, eyes-free, location-free input medium. However, many as yet unsolved problems prevent routine use of speech as an input device by non-experts. These include cost, real-time response, speaker independence, robustness to variations such as noise, microphone, speech rate and loudness, and the ability to handle non-grammatical speech. Satisfactory solutions to each of these problems can be expected within the next decade. Recognition of unrestricted spontaneous continuous speech appears unsolvable at present. However, with the addition of simple constraints, such as clarification dialog to resolve ambiguity, we believe it will be possible to develop systems capable of accepting very large vocabulary continuous speech dictation.
 

Contents

Introduction
Achievements and Limitations
1.1.1 Speaker Independence
1.1.2 Continuous Speech
1.1.3 Large Vocabulary
The SPHINX System
A Representation of Speech
1.2.2 Adding Human Knowledge
1.2.3 Finding a Good Unit of Speech
1.2.4 Speaker Learning and Adaptation
1.3 Summary and Monograph Outline
Hidden Markov Modeling of Speech
2.2 Three HMM Problems
The Forward Algorithm
The Viterbi Algorithm
The Forward-Backward Algorithm
2.3 Implementational Issues
2.3.2 Null Transitions
2.3.4 Scaling or Log Compression
2.3.5 Multiple Independent Observations
2.4 Using HMMs for Speech Recognition
2.4.1.2 HMM Representation of Speech Units
2.4.1.3 HMM Representation of Other Knowledge Sources
2.4.2.2 Recognition
2.4.3 Using HMM for Continuous Speech Tasks
2.4.3.2 Recognition
Task and Databases
3.1.2 The Grammar
3.1.3 The TIRM Database
3.2 The TIMIT Database
The Baseline SPHINX System
4.2 Vector Quantization
4.2.2 A Hierarchical VQ Algorithm
4.3 The Phone Model
4.4 The Pronunciation Dictionary
4.5 HMM Training
4.6 HMM Recognition
4.7 Results and Discussion
4.8 Summary
Adding Knowledge
5.1 Fixed-Width Speech Parameters
5.1.2 Differenced Cepstrum Coefficients
5.1.3 Power and Differenced Power
5.1.4 Integrating Frame-Based Parameters
5.1.4.2 Composite Distance Metric
5.1.4.3 Multiple Codebooks
5.2 Variable-Width Speech Parameters
5.2.2 Knowledge-based Parameters
5.3.1 Insertion/Deletion Modeling
5.3.2 Multiple Pronunciations
5.3.3 Other Dictionary/Phone-Set Improvements
5.3.3.3 Tailoring HMM Topology
5.3.3.4 Final Phone Set and Dictionary
5.4 Results and Discussion
5.5 Summary
Finding a Good Unit of Speech
6.1.2 Phones
6.1.3 Multi-Phone Units
6.1.4 Explicit Transition Modeling
6.1.5 Word-Dependent Phones
6.1.7 Summary of Previous Units
6.3 Function-Word-Dependent Phones
6.4 Generalized Triphones
6.5 Summary of SPHINX Training Procedure
6.6 Results and Discussion
6.7 Summary
Learning and Adaptation
7.1 Speaker Adaptation through Speaker Cluster Selection
7.1.1 Speaker Clustering
7.1.2 Speaker Cluster Identification
7.2.1 Different Speaker-Adaptive Estimates
7.2.2 Interpolated Reestimation
7.3 Results and Discussion
7.4 Summary
Summary of Results
8.2 Comparison with Other Systems
8.3 Error Analysis
Conclusion
9.2 Contributions
9.3 Future Work
9.4 Final Remarks
Evaluating Speech Recognizers
I.2 Computing Error Rate
The Resource Management Task
II.2 The Grammar
Examples of SPHINX Recognition
References
Index

Popular passages

Page 11 - ... 635 Mbytes per hour for uncompressed speech, without any noticeable loss of quality. Speech Recognition. Speech recognition has a long history of being one of the difficult problems in artificial intelligence (AI) and computer science. As one goes from problem solving tasks in AI to perceptual tasks, the problem characteristics change dramatically: knowledge poor to knowledge rich; low data rates to high data rates; slow response time (minutes to hours) to instantaneous response time. These characteristics...
Page 191 - The acoustic-modeling problem in automatic speech recognition. Ph.D. Thesis, Computer Science Department, Carnegie Mellon University.

About the author (1988)

Kai-Fu Lee was born on December 3, 1961 in Taipei, Taiwan. He earned a B.S. degree in computer science from Columbia University and a Ph.D. in computer science from Carnegie Mellon University. In 1988, he completed his doctoral dissertation on SPHINX, the first large-vocabulary, speaker-independent, continuous speech recognition system. The dissertation was published the same year as a Kluwer monograph, Automatic Speech Recognition: The Development of the SPHINX System. Lee has written two books on speech recognition and more than 60 papers in computer science. Together with Alex Waibel, another Carnegie Mellon researcher, he edited Readings in Speech Recognition.
