Two single-stream, monophone dynamic Bayesian network (DBN) models are constructed for continuous speech recognition from audio features and from video features, and phoneme-level time segmentation is obtained from the explicit word-to-phoneme relationship encoded in the models. Experimental results show that, for audio-based recognition at low signal-to-noise ratios (0–15 dB), the recognition rate of the DBN model is on average 12.79% higher than that of the HMM model, while on clean speech the phoneme time segmentation produced by the DBN model is very close to that of a triphone HMM model. For video-based speech recognition, the recognition rate of the DBN model is 2.47% higher than that of the HMM. Finally, the asynchrony between the phoneme time segmentations of the audio and video data is analyzed, laying the groundwork for audio-visual continuous speech recognition with multi-stream DBN models and for determining the asynchronous relationship between audio and video.
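To make the single-stream monophone DBN structure more concrete, the sketch below is a minimal Python illustration (hypothetical lexicon and variable names, not the authors' implementation, which such systems typically build with a dedicated DBN toolkit): each time slice carries a word hypothesis, a phoneme position within that word, and a phoneme-transition variable; the pronunciation lexicon supplies the word-to-phoneme relationship, and phoneme boundaries are read off wherever the transition variable fires along the decoded path.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical pronunciation lexicon (word -> monophone sequence); the real
# dictionary encodes the word-to-phoneme relationship used in the paper.
LEXICON: Dict[str, List[str]] = {
    "yes": ["y", "eh", "s"],
    "no":  ["n", "ow"],
}

@dataclass
class Slice:
    """Hidden variables of one time slice of a single-stream monophone DBN."""
    word: str          # current word hypothesis
    phone_pos: int     # index of the current phoneme within the word
    phone_trans: bool  # stochastic: does the current phoneme end at this frame?

def current_phone(s: Slice) -> str:
    """The per-frame observation (audio MFCC or video feature vector) would be
    emitted conditioned on this phoneme, e.g. by a per-phoneme GMM."""
    return LEXICON[s.word][s.phone_pos]

def next_slice(s: Slice) -> Tuple[int, bool]:
    """Deterministic slice-to-slice dependency: a phoneme transition advances
    phone_pos; leaving the last phoneme of the word triggers a word transition.
    Returns (next_phone_pos, word_transition)."""
    if not s.phone_trans:
        return s.phone_pos, False
    if s.phone_pos == len(LEXICON[s.word]) - 1:
        return 0, True          # word ends; the decoder may start a new word
    return s.phone_pos + 1, False

# Phoneme time segmentation falls out of decoding: the frames at which
# phone_trans fires along the best path mark the phoneme boundaries.
```

In a multi-stream extension, the audio and video streams would each keep their own phoneme-position and transition variables, and the offset between the frames where the two streams' transitions fire is precisely the audio-video asynchrony discussed in the abstract.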