
[1] WANG Zhongmei, AO Wenxiu, LIU Jianhua, et al. An audio-visual multimodal balanced learning method based on adaptive gradient modulation[J]. CAAI Transactions on Intelligent Systems, 2025, 20(5): 1217-1226. [doi:10.11992/tis.202412009]


An audio-visual multimodal balanced learning method based on adaptive gradient modulation


CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume: 20 Issue: 2025, No. 5 Pages: 1217-1226 Column: Academic Papers - Natural Language Processing and Understanding Publication date: 2025-09-05

Title: An audio-visual multimodal balanced learning method based on adaptive gradient modulation

Author(s): WANG Zhongmei¹; AO Wenxiu¹; LIU Jianhua¹; JIA Lin¹; ZHANG Changfan¹; PENG Shen’ao¹; LIU Jinping²
1. School of Railway Transportation, Hunan University of Technology, Zhuzhou 412007, China
2. College of Information Science and Engineering, Hunan Normal University, Changsha 410081, China

Keywords: balanced learning; multimodal learning; gradient modulation; adaptive learning; multimodal gradient balancing; learning rate; audio-visual multimodal; collaborative decision-making

CLC: TP391

DOI: 10.11992/tis.202412009

Abstract: In audio-visual multimodal learning, modalities learn at different rates, so one modality tends to dominate and suppress the others, which weakens multimodal collaborative decision-making. To address this, a multimodal balanced learning method based on adaptive gradient modulation (AGM-CR) is proposed. The method employs modulation coefficients that dynamically adjust the learning rate of each modality according to its gradient variations. It also incorporates a gradient balancing strategy that adds modality-specific gradient losses to the total loss as a regularization term. Together, these mechanisms reduce gradient disparities between modalities and yield a more balanced and effective learning process. Experimental evaluation on the CREMA-D and RAVDESS datasets shows that AGM-CR improves classification accuracy by 2.5 and 3.3 percentage points, respectively. AGM-CR also stabilizes training by reducing gradient fluctuations across iterations, which accelerates convergence. Finally, AGM-CR works as a plug-and-play component, making it more flexible and more broadly applicable than existing balancing approaches.
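
The two mechanisms named in the abstract can be illustrated with a toy sketch. This is not the AGM-CR formulation, which the abstract does not give: the coefficient rule (mean gradient norm divided by a modality's own norm), the penalty (variance of per-modality gradient norms), and all function names below are assumptions chosen only to show the general idea of per-modality learning-rate modulation plus a gradient-balancing term.

```python
import math


def grad_norm(grad):
    """Euclidean norm of a flat gradient vector."""
    return math.sqrt(sum(g * g for g in grad))


def modulation_coefficients(grads, eps=1e-12):
    """Hypothetical rule (not from the paper): mean_norm / own_norm.

    A modality whose gradients are larger than average gets a
    coefficient below 1, and a weaker modality gets one above 1.
    """
    norms = {m: grad_norm(g) for m, g in grads.items()}
    mean_norm = sum(norms.values()) / len(norms)
    return {m: mean_norm / (n + eps) for m, n in norms.items()}


def balance_penalty(grads):
    """Hypothetical regularizer: variance of per-modality gradient norms.

    In a real training loop this value would be added to the task loss,
    which requires an autodiff framework that can differentiate through
    gradient norms; here it is only computed as a diagnostic.
    """
    norms = [grad_norm(g) for g in grads.values()]
    mean_norm = sum(norms) / len(norms)
    return sum((n - mean_norm) ** 2 for n in norms) / len(norms)


def modulated_step(params, grads, base_lr=0.1):
    """One SGD step with a separate effective learning rate per modality."""
    coeffs = modulation_coefficients(grads)
    return {
        m: [p - base_lr * coeffs[m] * g for p, g in zip(params[m], grads[m])]
        for m in params
    }


# Toy gradients: the audio branch is ten times stronger than the visual one.
grads = {"audio": [3.0, 4.0], "visual": [0.3, 0.4]}
params = {"audio": [1.0, 1.0], "visual": [1.0, 1.0]}

coeffs = modulation_coefficients(grads)
new_params = modulated_step(params, grads)
penalty = balance_penalty(grads)
```

Under this assumed rule, both modalities end up taking parameter steps of equal magnitude even though their raw gradient norms differ by a factor of ten, and the penalty shrinks toward zero as the norms converge.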



Last Update: 2025-09-05
