
[1]KONG Yinghui,CUI Wenting,ZHANG Ke,et al.Two-stream network video expression recognition by fusing key region information[J].CAAI Transactions on Intelligent Systems,2025,20(3):658-669.[doi:10.11992/tis.202401031]


Two-stream network video expression recognition by fusing key region information


CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume: 20 Year: 2025 Issue: 3 Pages: 658-669 Column: Academic Papers: Machine Perception and Pattern Recognition Publication date: 2025-05-05

Title:: Two-stream network video expression recognition by fusing key region information

Author(s):: KONG Yinghui¹,²; CUI Wenting¹; ZHANG Ke¹,²; CHE Linlin¹,²
1. Department of Electronic and Communication Engineering, North China Electric Power University, Baoding 071003, China
2. Hebei Key Laboratory of Power Internet of Things Technology, North China Electric Power University, Baoding 071003, China

Keywords:: video expression recognition; two-stream network; attention mechanism; optical flow; convolutional neural networks; mask; feature fusion; facial expression recognition

CLC:: TP39

DOI:: 10.11992/tis.202401031

Abstract:: Facial expression recognition is an important research topic in computer vision, and recognizing facial expressions in video has practical value in many scenarios. Video sequences contain rich intra-frame spatial information and inter-frame temporal information, and key facial regions strongly influence the recognition result. This paper proposes a two-stream network method for video expression recognition that fuses key region information. First, a spatial-temporal two-stream network is constructed. The spatial branch combines facial action units with the CSFA attention mechanism to focus on the key facial regions that affect the recognition result, so that spatial features are extracted effectively. The temporal branch computes optical flow with the Farnebäck method to capture expression motion between frames, and uses the key region mask from the spatial branch to select where flow is computed, which reduces the computational cost of optical flow. Finally, the video expression recognition result is obtained by decision-level fusion of the outputs of the spatial and temporal branches. The method is tested on the eNTERFACE’05 and CK+ datasets. The results show that the proposed method improves both recognition accuracy and running efficiency.
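
The decision-fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the fusion rule, so the weighted average of per-class probabilities used here is an assumption, as are the function names `fuse_decisions` and `predict_label`, the `spatial_weight` parameter, and the example probabilities; none of them come from the paper.

```python
from typing import List, Sequence


def fuse_decisions(
    spatial_probs: Sequence[float],
    temporal_probs: Sequence[float],
    spatial_weight: float = 0.5,
) -> List[float]:
    """Fuse two per-class probability vectors by weighted averaging.

    Assumption: the paper's actual fusion rule is not given in the
    abstract; weighted averaging is one common late-fusion choice.
    """
    if len(spatial_probs) != len(temporal_probs):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= spatial_weight <= 1.0:
        raise ValueError("spatial_weight must lie in [0, 1]")
    temporal_weight = 1.0 - spatial_weight
    return [
        spatial_weight * s + temporal_weight * t
        for s, t in zip(spatial_probs, temporal_probs)
    ]


def predict_label(fused_probs: Sequence[float]) -> int:
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused_probs)), key=lambda i: fused_probs[i])


if __name__ == "__main__":
    # Hypothetical softmax outputs of each branch for three classes.
    spatial = [0.6, 0.3, 0.1]
    temporal = [0.2, 0.7, 0.1]
    fused = fuse_decisions(spatial, temporal, spatial_weight=0.5)
    print(fused, predict_label(fused))
```

With equal weights the temporal branch's stronger vote for the second class wins. Raising `spatial_weight` shifts the decision toward the spatial branch, which is the kind of trade-off a late-fusion weight controls.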

