Abstract: Cross-modal speaker tagging aims to learn the latent relationship between different biometrics for mutual annotation, which can potentially be utilized in various human-computer interactions. To bridge the "semantic gap" between the face and audio modalities, this paper presents an efficient supervised joint correspondence auto-encoder that links each face to its audio counterpart, whereby the speaker can be tagged across modalities. First, a Convolutional Neural Network (CNN) and a Deep Belief Network (DBN) are used to extract discriminative features from the face and audio samples, respectively. Then, a supervised neural network model with softmax regression is embedded into a joint auto-encoder model, which discriminatively preserves the inter-modal and intra-modal similarities. Accordingly, three kinds of supervised joint correspondence auto-encoder models are presented to correlate the semantic relationships between the face and audio counterparts, so that the speaker can be cross-annotated efficiently. The experimental results show that the proposed supervised joint auto-encoder performs cross-modal speaker tagging with outstanding accuracy and is robust to facial posture variations and sample diversity.
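The abstract describes a composite objective: per-modality reconstruction, a correspondence term tying the face and audio codes together, and a supervised softmax term on the speaker label. The sketch below is a minimal, hypothetical illustration of such an objective in plain Python; it assumes the face and audio features have already been extracted (e.g., by the CNN and DBN), and all dimensions, weight shapes, and the trade-off parameters `lam` and `mu` are illustrative assumptions, not the paper's actual model.

```python
import math
import random

random.seed(0)

def matvec(W, x):
    # Dense matrix-vector product over plain Python lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes: 8-d face/audio features, 4-d shared code, 3 speakers.
D, H, C = 8, 4, 3
W_enc_f, W_enc_a = rand_matrix(H, D), rand_matrix(H, D)  # modality encoders
W_dec_f, W_dec_a = rand_matrix(D, H), rand_matrix(D, H)  # modality decoders
W_cls = rand_matrix(C, H)                                # softmax classifier

def joint_loss(face, audio, label, lam=1.0, mu=1.0):
    h_f = sigmoid(matvec(W_enc_f, face))    # face code
    h_a = sigmoid(matvec(W_enc_a, audio))   # audio code
    # Reconstruction of each modality from its own code.
    rec = mse(face, matvec(W_dec_f, h_f)) + mse(audio, matvec(W_dec_a, h_a))
    # Correspondence term: pull the two codes of the same speaker together.
    corr = mse(h_f, h_a)
    # Supervised term: softmax cross-entropy on the speaker label.
    p = softmax(matvec(W_cls, h_f))
    cls = -math.log(p[label])
    return rec + lam * corr + mu * cls

face = [random.random() for _ in range(D)]
audio = [random.random() for _ in range(D)]
print(round(joint_loss(face, audio, label=1), 4))
```

In this formulation the correspondence term is what makes cross-modal tagging possible: once the two encoders map matching face/audio pairs to nearby codes, a sample from either modality can be annotated via its nearest neighbors from the other.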