In speech recognition, an important problem is how to select speech training material economically so that it covers as many phonetic phenomena as possible. Traditionally, training material is picked by hand and then checked and supplemented, an approach that makes it hard to guarantee the phonetic coverage of the selected material. This paper proposes a search algorithm that automatically selects material from a large-scale corpus. The algorithm not only makes the selected material cover almost all phonetic phenomena, but also ensures that triphones and triphone classes have enough samples in the training corpus, so that the training data is not overly sparse. This lays a solid foundation for training accurate and reliable speech models.
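The abstract does not spell out the search procedure, so the following is only an illustrative sketch of one common way to pursue the same goal: a greedy coverage heuristic that keeps picking the sentence contributing the most still-needed triphone samples. The function names `triphones` and `select_sentences`, the `min_count` parameter, and the `sil` boundary padding are all assumptions made for this example; none of them come from the paper, and triphone classes are not modeled here.

```python
from collections import Counter


def triphones(phones):
    """Return (left, center, right) triphones of a phone sequence.

    Assumption: sentence boundaries are padded with a 'sil' symbol.
    """
    padded = ["sil"] + list(phones) + ["sil"]
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]


def select_sentences(corpus, min_count=2):
    """Greedily choose sentence indices until each triphone has min_count samples.

    corpus: list of phone sequences, one per sentence.
    A triphone occurring fewer than min_count times in the whole corpus
    only needs as many samples as the corpus can supply.
    """
    units = [Counter(triphones(p)) for p in corpus]
    total = Counter()
    for u in units:
        total.update(u)

    # Samples still needed per triphone, capped by availability.
    need = {t: min(min_count, c) for t, c in total.items()}

    def gain(i):
        # Number of still-needed samples that sentence i would supply.
        return sum(min(c, need[t]) for t, c in units[i].items())

    selected = []
    remaining = set(range(len(corpus)))
    while remaining and any(need.values()):
        # Prefer the largest gain; break ties by the lowest index.
        best = max(remaining, key=lambda i: (gain(i), -i))
        if gain(best) == 0:
            break  # defensive: nothing left can help
        selected.append(best)
        remaining.remove(best)
        for t, c in units[best].items():
            need[t] = max(0, need[t] - c)
    return selected
```

With `min_count=1` this reduces to a plain greedy set cover over triphones. Raising `min_count` forces the selection to keep adding sentences until rare triphones also have several samples, which is the sparsity concern the abstract raises.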