Abstract: Cross-modal speaker tagging aims to learn the latent relationship between different biometrics for mutual annotation, which can potentially be utilized in various human-computer interactions. To bridge the "semantic gap" between the face and audio modalities, this paper presents an efficient supervised joint correspondence auto-encoder that links each face to its audio counterpart, whereby the speaker can be tagged across modalities. First, a Convolutional Neural Network (CNN) and a Deep Belief Network (DBN) are used to extract discriminative features from the face and audio samples, respectively. Then, a supervised neural network model with softmax regression is embedded into a joint auto-encoder model, which discriminatively preserves the inter-modal and intra-modal similarities. Accordingly, three kinds of supervised joint correspondence auto-encoder models are presented to correlate the semantic relationships between the face and audio counterparts, so that the speaker can be cross-annotated efficiently. The experimental results show that the proposed supervised joint auto-encoder performs cross-modal speaker tagging with outstanding accuracy and is robust to facial posture variations and sample diversity.
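The abstract describes a composite objective: per-modality reconstruction, a correspondence term tying the face and audio codes together, and a supervised softmax term on the speaker label. The sketch below is a minimal, hypothetical illustration of such an objective in plain Python; it assumes the face and audio features have already been extracted (e.g., by the CNN and DBN), and all dimensions, weight shapes, and the trade-off parameters `lam` and `mu` are illustrative assumptions, not the paper's actual model.

```python
import math
import random

random.seed(0)

def matvec(W, x):
    # Dense matrix-vector product over plain Python lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes: 8-d face/audio features, 4-d shared code, 3 speakers.
D, H, C = 8, 4, 3
W_enc_f, W_enc_a = rand_matrix(H, D), rand_matrix(H, D)  # modality encoders
W_dec_f, W_dec_a = rand_matrix(D, H), rand_matrix(D, H)  # modality decoders
W_cls = rand_matrix(C, H)                                # softmax classifier

def joint_loss(face, audio, label, lam=1.0, mu=1.0):
    h_f = sigmoid(matvec(W_enc_f, face))    # face code
    h_a = sigmoid(matvec(W_enc_a, audio))   # audio code
    # Reconstruction of each modality from its own code.
    rec = mse(face, matvec(W_dec_f, h_f)) + mse(audio, matvec(W_dec_a, h_a))
    # Correspondence term: pull the two codes of the same speaker together.
    corr = mse(h_f, h_a)
    # Supervised term: softmax cross-entropy on the speaker label.
    p = softmax(matvec(W_cls, h_f))
    cls = -math.log(p[label])
    return rec + lam * corr + mu * cls

face = [random.random() for _ in range(D)]
audio = [random.random() for _ in range(D)]
print(round(joint_loss(face, audio, label=1), 4))
```

In this formulation the correspondence term is what makes cross-modal tagging possible: once the two encoders map matching face/audio pairs to nearby codes, a sample from either modality can be annotated via its nearest neighbors from the other.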