[1] YAN He, LIU Lingkun, HUANG Junbin, et al. Video summarization model based on the multiscale attention mechanism and bidirectional gated recurrent network[J]. CAAI Transactions on Intelligent Systems, 2024, 19(2): 446-454. [doi:10.11992/tis.202209048]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume:
19
Issue:
2024, No. 2
Page number:
446-454
Column:
Academic Papers: Fundamentals of Artificial Intelligence
Publication date:
2024-03-05
- Title:
-
Video summarization model based on the multiscale attention mechanism and bidirectional gated recurrent network
- Author(s):
-
YAN He; LIU Lingkun; HUANG Junbin; ZHANG Ye; DUAN Siyu
-
Liangjiang College of Artificial Intelligence, Chongqing University of Technology, Chongqing 401135, China
-
- Keywords:
-
video summary; self-attention mechanism; importance score; long-range dependence; computer vision; BiGRU; nonmaximum suppression (NMS); kernel temporal segmentation (KTS)
- CLC:
-
TP391.41
- DOI:
-
10.11992/tis.202209048
- Abstract:
-
In video summarization, global attention values vary widely over long video sequences, the importance scores of the generated key frames deviate substantially, and the semantic coherence of fragments is poor because the boundary values of time-series nodes lack long-range dependence. To address these problems, the attention module is improved by merging segmented local self-attention with global self-attention, which captures the key features of both local and global video sequences and lowers the variance of the attention values. A bidirectional gated recurrent network (BiGRU) is introduced in parallel, the outputs are fed into the improved classification-regression module, and the results are then fused by addition. Finally, non-maximum suppression (NMS) and kernel temporal segmentation (KTS) are applied to filter the fragments and segment them into high-quality representative shots. The final summary is generated by a knapsack combinatorial optimization algorithm. The resulting video summarization model, LG-RU, integrates the multiscale attention mechanism with BiGRU and is evaluated on the standard and augmented settings of the TvSum and SumMe datasets. The experiments show that the model achieves a higher F-score, which indicates that it can summarize videos robustly while maintaining high accuracy.
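The final selection step mentioned in the abstract can be illustrated with a minimal sketch of 0/1 knapsack shot selection. This is not the authors' implementation: the function name `select_shots`, the example scores and shot lengths, and the 90-frame budget are all hypothetical. A budget of about 15% of the video length is a common convention in video summarization and is only assumed here; the abstract does not state the budget used.

```python
from typing import List, Sequence


def select_shots(scores: Sequence[float], lengths: Sequence[int], budget: int) -> List[int]:
    """Pick the shot indices that maximize total score with total length <= budget.

    Classic 0/1 knapsack solved by dynamic programming (hypothetical helper,
    not taken from the paper).
    """
    n = len(scores)
    # best[i][c] = best total score using only the first i shots with capacity c
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        length, score = lengths[i - 1], scores[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip shot i-1
            if length <= c:
                candidate = best[i - 1][c - length] + score  # option 2: take it
                if candidate > best[i][c]:
                    best[i][c] = candidate

    # Walk the table backwards to recover which shots were taken.
    chosen: List[int] = []
    c = budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    chosen.reverse()
    return chosen


if __name__ == "__main__":
    # Assumed example: four shots with per-shot importance scores and frame counts,
    # and a budget of 90 frames (e.g. 15% of a 600-frame video).
    shot_scores = [0.9, 0.2, 0.7, 0.4]
    shot_lengths = [40, 30, 50, 20]
    print(select_shots(shot_scores, shot_lengths, budget=90))  # [0, 2]
```

In this example, shots 0 and 2 fill the budget exactly and give the highest total score (1.6), beating the alternative of shots 0, 1, and 3 (1.5).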