[1] WANG Zhongmei, AO Wenxiu, LIU Jianhua, et al. An audio-visual multimodal balanced learning method based on adaptive gradient modulation[J]. CAAI Transactions on Intelligent Systems, 2025, 20(5): 1217-1226. [doi:10.11992/tis.202412009]

An audio-visual multimodal balanced learning method based on adaptive gradient modulation

References:
[1] HUANG Xuejian, MA Tinghuai, WANG Gensheng. Multimodal learning method based on intra- and inter-sample cooperative representation and adaptive fusion[J]. Journal of computer research and development, 2024, 61(5): 1310-1324.
[2] PAN Jiahui, HE Zhipeng, LI Zina, et al. A review of multimodal emotion recognition[J]. CAAI transactions on intelligent systems, 2020, 15(4): 633-645.
[3] CHANG Yicong, XUE Feng, SHENG Fei, et al. Fast road segmentation via uncertainty-aware symmetric network[C]//2022 International Conference on Robotics and Automation. Philadelphia: IEEE, 2022: 11124-11130.
[4] CUI Can, MA Yunsheng, CAO Xu, et al. A survey on multimodal large language models for autonomous driving[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2024: 958-979.
[5] WANG Weiyao, TRAN D, FEISZLI M. What makes training multi-modal classification networks hard?[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 12695-12705.
[6] HUANG Yu, LIN Junyang, ZHOU Chang, et al. Modality competition: what makes joint training of multi-modal network fail in deep learning? (provably)[C]//International Conference on Machine Learning. Baltimore: PMLR, 2022: 9226-9259.
[7] XU Ruize, FENG Ruoxuan, ZHANG Shixiong, et al. MMCosine: multi-modal cosine loss towards balanced audio-visual fine-grained learning[C]//2023 IEEE International Conference on Acoustics, Speech and Signal Processing. Rhodes Island: IEEE, 2023: 1-5.
[8] LI Hong, LI Xingyu, HU Pengbo, et al. Boosting multi-modal model performance with adaptive gradient modulation[C]//2023 IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 22157-22167.
[9] WEI Yake, HU Di, DU Henghui, et al. On-the-fly modulation for balanced multimodal learning[J]. IEEE transactions on pattern analysis and machine intelligence, 2025, 47(1): 469-485.
[10] YANG Liu, WU Zhenjie, HONG Junkun, et al. MCL: a contrastive learning method for multimodal data fusion in violence detection[J]. IEEE signal processing letters, 2022, 30: 408-412.
[11] DU Chenzhuang, TENG Jiaye, LI Tingle, et al. On unimodal feature learning in supervised multimodal learning[C]//International Conference on Machine Learning. Honolulu: PMLR, 2023: 8632-8656.
[12] LIU Shilei, LI Lin, SONG Jun, et al. Multimodal pre-training with self-distillation for product understanding in E-commerce[C]//Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining. Singapore: ACM, 2023: 1039-1047.
[13] PENG Xiaokang, WEI Yake, DENG Andong, et al. Balanced multimodal learning via on-the-fly gradient modulation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 8238-8247.
[14] LIU Chengguang, WANG Shanmin, LIU Qingshan. Class-balanced modulation for facial expression recognition[J]. Journal of frontiers of computer science and technology, 2023, 17(12): 3029-3038.
[15] FAN Yunfeng, XU Wenchao, WANG Haozhao, et al. PMR: prototypical modal rebalance for multimodal learning[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 20029-20038.
[16] LIN Xun, WANG Shuai, CAI Rizhao, et al. Suppress and rebalance: towards generalized multi-modal face anti-spoofing[C]//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 211-221.
[17] LIU Jia, SONG Hong, CHEN Dapeng, et al. A multimodal sentiment analysis model enhanced with non-verbal information and contrastive learning[J]. Journal of electronics & information technology, 2024, 46(8): 3372-3381.
[18] ZHOU Yipin, LIM S N. Joint audio-visual deepfake detection[C]//2021 IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 14780-14789.
[19] YU Wenmeng, XU Hua, YUAN Ziqi, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis[J]. Proceedings of the AAAI conference on artificial intelligence, 2021, 35(12): 10790-10797.
[20] XIAO Yi, CODEVILLA F, GURRAM A, et al. Multimodal end-to-end autonomous driving[J]. IEEE transactions on intelligent transportation systems, 2020, 23(1): 537-547.
[21] LIU Hui, ZHU Jicheng, WANG Xinyu, et al. Multi-scale feature frequency domain decomposition filtering for medical image fusion[J]. Journal of software, 2024, 35(12): 5687-5709.
[22] SUN Ya, MAI Sijie, HU Haifeng. Learning to balance the learning rates between various modalities via adaptive tracking factor[J]. IEEE signal processing letters, 2021, 28: 1650-1654.
[23] XIAO Fanyi, LEE Y J, GRAUMAN K, et al. Audiovisual SlowFast networks for video recognition[EB/OL]. (2020-01-23)[2024-12-11]. https://arxiv.org/abs/2001.08740.
[24] LUO Yuanyi, WU Rui, LIU Jiafeng, et al. Multimodal sentiment analysis method based on adaptive weight fusion[J]. Journal of software, 2024, 35(10): 4781-4793.
[25] WU Nan, JASTRZEBSKI S, CHO K, et al. Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks[C]//International Conference on Machine Learning. Baltimore: PMLR, 2022: 24043-24055.
[26] CAO Houwei, COOPER D G, KEUTMANN M K, et al. CREMA-D: crowd-sourced emotional multimodal actors dataset[J]. IEEE transactions on affective computing, 2014, 5(4): 377-390.
[27] LIVINGSTONE S R, RUSSO F A. The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English[J]. PLoS one, 2018, 13(5): e0196391.
[28] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
[29] JIN Qin, LI Chengxin, CHEN Shizhe, et al. Speech emotion recognition with acoustic and lexical features[C]//2015 IEEE International Conference on Acoustics, Speech and Signal Processing. South Brisbane: IEEE, 2015: 4749-4753.
[30] TANG Guichen, XIE Yue, LI Ke, et al. Multimodal emotion recognition from facial expression and speech based on feature fusion[J]. Multimedia tools and applications, 2023, 82(11): 16359-16373.
Similar Articles:
[1] XIAO Jianli, HUANG Xingyu, JIANG Fei. A survey of large language models in smart education[J]. CAAI Transactions on Intelligent Systems, 2025, 20(5): 1054. [doi:10.11992/tis.202406040]

Memo

Received: 2024-12-11.
Foundation items: National Key Research and Development Program of China (2021YFF0501101); National Natural Science Foundation of China (52272347); National Natural Science Foundation of China Youth Science Fund (62106074).
About the authors: WANG Zhongmei, lecturer, member of the Institute of Electrical and Electronics Engineers (IEEE); research interests include artificial intelligence, computer vision, and remote sensing information processing. E-mail: wangzhongmei@hut.edu.cn. AO Wenxiu, master's student; research interests include modality fusion and multimodal balanced learning. E-mail: m23081100020@stu.hut.edu.cn. LIU Jianhua, professor and doctoral supervisor; research interests include electric traction drive control and intelligent operation and maintenance for rail transit; principal investigator of two National Natural Science Foundation of China projects and one National Key Research and Development Program project. E-mail: jhliu@hut.edu.cn.
Corresponding author: WANG Zhongmei. E-mail: wangzhongmei@hut.edu.cn

Last Update: 2025-09-05