DBN Based Multi-stream Multi-state Model for Continuous Audio-Visual Speech Recognition
Lü Guo-yun① Jiang Dong-mei① Zhang Yan-ning① Zhao Rong-chun① H Sahli② Ilse Ravyse② W Verhelst②
①(Northwestern Polytechnical University, School of Computer Science, Xi’an 710072, China) ②(Vrije Universiteit Brussel, Department ETRO, Brussel B-1050, Belgium)
|
|
Abstract  Asynchrony between speech and lip motion is a key issue in multi-modal fusion for Audio-Visual Speech Recognition (AVSR). In this paper, a Multi-Stream Asynchrony Dynamic Bayesian Network (MS-ADBN) model is introduced, which loosens the asynchrony between the audio and visual streams to the word level and uses a word-phone topology in both the audio and the visual stream. Furthermore, a Multi-stream Multi-state Asynchrony DBN (MM-ADBN) model, an augmentation of the MS-ADBN model, is proposed for large-vocabulary AVSR; it adopts a word-phone-state topology in both the audio and the visual stream. In essence, the MS-ADBN model is a word model, while the MM-ADBN model is a phone model whose basic recognition units are phones. Experiments were carried out on small-vocabulary and large-vocabulary audio-visual databases. The results show that, on the large-vocabulary database in a clean speech environment, the MM-ADBN model achieves improvements of 35.91% and 9.97% over the MS-ADBN model and the MSHMM, respectively, which indicates that describing the asynchrony between the audio and visual streams is important for AVSR systems.
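As a point of reference only (this formula is not taken from the paper), a standard way to fuse the audio and visual streams in an MSHMM-style baseline is to combine the per-state observation likelihoods with stream weights, where $\lambda_A$ and $\lambda_V$ are assumed exponents weighting the audio and visual streams:

$$
b_j(\mathbf{o}_t) \;=\; \big[\,b_{jA}(\mathbf{o}_{A,t})\,\big]^{\lambda_A}\,\big[\,b_{jV}(\mathbf{o}_{V,t})\,\big]^{\lambda_V},
\qquad \lambda_A + \lambda_V = 1 .
$$

The DBN models discussed in this paper go beyond such state-synchronous fusion by allowing the two streams to desynchronize up to the word level.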
|
Received: 11 June 2007
|
|
Corresponding Author:
Lü Guo-yun
|
|
|
|
|
|
|