[1] KONG Yinghui, CUI Wenting, ZHANG Ke, et al. Two-stream network video expression recognition by fusing key region information[J]. CAAI Transactions on Intelligent Systems, 2025, 20(3): 658-669. [doi:10.11992/tis.202401031]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume: 20
Issue: 3 (2025)
Pages: 658-669
Column: Academic Papers - Machine Perception and Pattern Recognition
Publication date: 2025-05-05
- Title: Two-stream network video expression recognition by fusing key region information
- Author(s): KONG Yinghui (1, 2); CUI Wenting (1); ZHANG Ke (1, 2); CHE Linlin (1, 2)
  1. Department of Electronic and Communication Engineering, North China Electric Power University, Baoding 071003, China
  2. Hebei Key Laboratory of Power Internet of Things Technology, North China Electric Power University, Baoding 071003, China
- Keywords: video expression recognition; two-stream network; attention mechanism; optical flow; convolutional neural networks; mask; feature fusion; facial expression recognition
- CLC: TP39
- DOI: 10.11992/tis.202401031
- Abstract:
Facial expression recognition is an important research topic in computer vision, and recognizing expressions in video has practical value in many scenarios. Video sequences carry rich intra-frame spatial information and inter-frame temporal information, and key facial regions strongly influence the recognition results. This paper proposes a two-stream network method for video expression recognition that fuses key region information. First, a spatial-temporal two-stream network is constructed. The spatial branch combines facial action units with the CSFA attention mechanism to focus on the key facial regions that affect recognition, enabling effective extraction of spatial features. The temporal branch computes optical flow with the Farneback method to capture expression motion between frames, and uses a mask of the spatial key regions to reduce the computational cost of the optical flow. Finally, the video expression recognition result is obtained by decision-level fusion of the outputs of the spatial and temporal branches. The method is evaluated on the eNTERFACE’05 and CK+ datasets, and the results show that it improves both recognition accuracy and runtime efficiency.
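
The decision fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the fusion rule, so the weighted average of per-class probabilities used here is an assumption, not the authors' method. The function names, the weight `spatial_weight`, the label list, and the probability values are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def fuse_decisions(
    spatial_probs: Sequence[float],
    temporal_probs: Sequence[float],
    spatial_weight: float = 0.5,
) -> List[float]:
    """Fuse two per-class probability vectors by weighted averaging.

    ASSUMPTION: weighted averaging is one common decision-level fusion
    rule; the paper's actual rule is not given in the abstract.
    """
    if len(spatial_probs) != len(temporal_probs):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= spatial_weight <= 1.0:
        raise ValueError("spatial_weight must lie in [0, 1]")
    temporal_weight = 1.0 - spatial_weight
    return [
        spatial_weight * s + temporal_weight * t
        for s, t in zip(spatial_probs, temporal_probs)
    ]


def predict_label(fused_probs: Sequence[float], labels: Sequence[str]) -> str:
    """Return the label whose fused probability is highest."""
    best_index = max(range(len(fused_probs)), key=fused_probs.__getitem__)
    return labels[best_index]


# Hypothetical label set and stream outputs, for illustration only.
LABELS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]
spatial = [0.10, 0.05, 0.05, 0.40, 0.10, 0.30]   # spatial branch alone favors "happiness"
temporal = [0.05, 0.05, 0.10, 0.25, 0.05, 0.50]  # temporal branch alone favors "surprise"

fused = fuse_decisions(spatial, temporal, spatial_weight=0.5)
print(predict_label(fused, LABELS))  # prints "surprise"
```

With equal weights, the temporal branch's stronger confidence in one class outweighs the spatial branch's weaker preference for another, which is the kind of disagreement decision-level fusion is meant to resolve. In practice the weight would be tuned on validation data.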