Asynchrony between speech and lip motion is a key issue in multimodal fusion for speech recognition. This paper first introduces a multi-stream asynchronous dynamic Bayesian network (MS-ADBN) model, which describes the asynchrony between the audio and video streams at the word level; both streams adopt a word-phoneme hierarchy. The multi-stream multi-state asynchronous DBN (MM-ADBN) model extends MS-ADBN by adopting a word-phoneme-state hierarchy for both streams. In essence, MS-ADBN is a whole-word model, while MM-ADBN is a phoneme model suitable for large-vocabulary continuous speech recognition. Experimental results on a continuous audio-visual database show that, in a clean speech environment, MM-ADBN improves the recognition rate over the MS-ADBN model and the multi-stream HMM by 35.91% and 9.97%, respectively.
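The word-level asynchrony constraint described above can be illustrated with a minimal sketch. This is not the paper's implementation; the data layout and function name are hypothetical, chosen only to show the idea: within a word the audio and video streams may traverse their phoneme sequences at different rates, but both must reach the word boundary at the same frame.

```python
# Minimal sketch (hypothetical, for illustration only) of the
# word-level asynchrony constraint in an MS-ADBN-style model:
# phoneme timing may differ between streams inside a word, but
# both streams must emit the word-boundary event at the same frame.

def word_boundary_synchronized(audio_word_end, video_word_end):
    """Return True if both streams finish the word at the same frame.

    audio_word_end / video_word_end: frame index at which each stream
    completes its last phoneme of the word (assumed representation).
    """
    return audio_word_end == video_word_end

# Within the word, the streams are free to desynchronize...
audio_phone_ends = [3, 7, 12]   # audio stream finishes its phonemes here
video_phone_ends = [5, 9, 12]   # video stream lags on early phonemes
# ...but the final phoneme (the word boundary) must align:
print(word_boundary_synchronized(audio_phone_ends[-1], video_phone_ends[-1]))
```

The MM-ADBN extension refines each phoneme into states, so the same boundary constraint applies on top of a deeper word-phoneme-state hierarchy.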