[1] WANG Zhongmei, AO Wenxiu, LIU Jianhua, et al. An audio-visual multimodal balanced learning method based on adaptive gradient modulation[J]. CAAI Transactions on Intelligent Systems, 2025, 20(5): 1217-1226. [doi:10.11992/tis.202412009]

An audio-visual multimodal balanced learning method based on adaptive gradient modulation

References:
[1] HUANG Xuejian, MA Tinghuai, WANG Gensheng. Multimodal learning method based on intra- and inter-sample cooperative representation and adaptive fusion[J]. Journal of computer research and development, 2024, 61(5): 1310-1324.
[2] PAN Jiahui, HE Zhipeng, LI Zina, et al. A review of multimodal emotion recognition[J]. CAAI transactions on intelligent systems, 2020, 15(4): 633-645.
[3] CHANG Yicong, XUE Feng, SHENG Fei, et al. Fast road segmentation via uncertainty-aware symmetric network[C]//2022 International Conference on Robotics and Automation. Philadelphia: IEEE, 2022: 11124-11130.
[4] CUI Can, MA Yunsheng, CAO Xu, et al. A survey on multimodal large language models for autonomous driving[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2024: 958-979.
[5] WANG Weiyao, TRAN D, FEISZLI M. What makes training multi-modal classification networks hard?[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 12695-12705.
[6] HUANG Yu, LIN Junyang, ZHOU Chang, et al. Modality competition: what makes joint training of multi-modal network fail in deep learning? (provably)[C]//International Conference on Machine Learning. Baltimore: PMLR, 2022: 9226-9259.
[7] XU Ruize, FENG Ruoxuan, ZHANG Shixiong, et al. MMCosine: multi-modal cosine loss towards balanced audio-visual fine-grained learning[C]//2023 IEEE International Conference on Acoustics, Speech and Signal Processing. Rhodes Island: IEEE, 2023: 1-5.
[8] LI Hong, LI Xingyu, HU Pengbo, et al. Boosting multi-modal model performance with adaptive gradient modulation[C]//2023 IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 22157-22167.
[9] WEI Yake, HU Di, DU Henghui, et al. On-the-fly modulation for balanced multimodal learning[J]. IEEE transactions on pattern analysis and machine intelligence, 2025, 47(1): 469-485.
[10] YANG Liu, WU Zhenjie, HONG Junkun, et al. MCL: a contrastive learning method for multimodal data fusion in violence detection[J]. IEEE signal processing letters, 2022, 30: 408-412.
[11] DU Chenzhuang, TENG Jiaye, LI Tingle, et al. On unimodal feature learning in supervised multimodal learning[C]//International Conference on Machine Learning. Honolulu: PMLR, 2023: 8632-8656.
[12] LIU Shilei, LI Lin, SONG Jun, et al. Multimodal pre-training with self-distillation for product understanding in E-commerce[C]//Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining. Singapore: ACM, 2023: 1039-1047.
[13] PENG Xiaokang, WEI Yake, DENG Andong, et al. Balanced multimodal learning via on-the-fly gradient modulation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 8238-8247.
[14] LIU Chengguang, WANG Shanmin, LIU Qingshan. Class-balanced modulation for facial expression recognition[J]. Journal of frontiers of computer science and technology, 2023, 17(12): 3029-3038.
[15] FAN Yunfeng, XU Wenchao, WANG Haozhao, et al. PMR: prototypical modal rebalance for multimodal learning[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 20029-20038.
[16] LIN Xun, WANG Shuai, CAI Rizhao, et al. Suppress and rebalance: towards generalized multi-modal face anti-spoofing[C]//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 211-221.
[17] LIU Jia, SONG Hong, CHEN Dapeng, et al. A multimodal sentiment analysis model enhanced with non-verbal information and contrastive learning[J]. Journal of electronics & information technology, 2024, 46(8): 3372-3381.
[18] ZHOU Yipin, LIM S N. Joint audio-visual deepfake detection[C]//2021 IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 14780-14789.
[19] YU Wenmeng, XU Hua, YUAN Ziqi, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis[J]. Proceedings of the AAAI conference on artificial intelligence, 2021, 35(12): 10790-10797.
[20] XIAO Yi, CODEVILLA F, GURRAM A, et al. Multimodal end-to-end autonomous driving[J]. IEEE transactions on intelligent transportation systems, 2020, 23(1): 537-547.
[21] LIU Hui, ZHU Jicheng, WANG Xinyu, et al. Multi-scale feature frequency domain decomposition filtering for medical image fusion[J]. Journal of software, 2024, 35(12): 5687-5709.
[22] SUN Ya, MAI Sijie, HU Haifeng. Learning to balance the learning rates between various modalities via adaptive tracking factor[J]. IEEE signal processing letters, 2021, 28: 1650-1654.
[23] XIAO Fanyi, LEE Y J, GRAUMAN K, et al. Audiovisual SlowFast networks for video recognition[EB/OL]. (2020-01-23)[2024-12-11]. https://arxiv.org/abs/2001.08740.
[24] LUO Yuanyi, WU Rui, LIU Jiafeng, et al. Multimodal sentiment analysis method based on adaptive weight fusion[J]. Journal of software, 2024, 35(10): 4781-4793.
[25] WU Nan, JASTRZEBSKI S, CHO K, et al. Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks[C]//International Conference on Machine Learning. Baltimore: PMLR, 2022: 24043-24055.
[26] CAO Houwei, COOPER D G, KEUTMANN M K, et al. CREMA-D: crowd-sourced emotional multimodal actors dataset[J]. IEEE transactions on affective computing, 2014, 5(4): 377-390.
[27] LIVINGSTONE S R, RUSSO F A. The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English[J]. PLoS one, 2018, 13(5): e0196391.
[28] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
[29] JIN Qin, LI Chengxin, CHEN Shizhe, et al. Speech emotion recognition with acoustic and lexical features[C]//2015 IEEE International Conference on Acoustics, Speech and Signal Processing. South Brisbane: IEEE, 2015: 4749-4753.
[30] TANG Guichen, XIE Yue, LI Ke, et al. Multimodal emotion recognition from facial expression and speech based on feature fusion[J]. Multimedia tools and applications, 2023, 82(11): 16359-16373.
Similar Articles:
[1] XIAO Jianli, HUANG Xingyu, JIANG Fei. A survey of large language models in smart education[J]. CAAI Transactions on Intelligent Systems, 2025, 20(5): 1054. [doi:10.11992/tis.202406040]

Memo

Received: 2024-12-11.
Foundation items: National Key Research and Development Program of China (2021YFF0501101); National Natural Science Foundation of China (52272347); National Natural Science Foundation of China Youth Science Fund (62106074).
About the authors: WANG Zhongmei, lecturer, member of the Institute of Electrical and Electronics Engineers (IEEE); research interests include artificial intelligence, computer vision, and remote sensing information processing. E-mail: wangzhongmei@hut.edu.cn. AO Wenxiu, master's student; research interests include modality fusion and multimodal balanced learning. E-mail: m23081100020@stu.hut.edu.cn. LIU Jianhua, professor and doctoral supervisor; research interests include electric traction drive control and intelligent operation and maintenance for rail transit; principal investigator of two National Natural Science Foundation of China projects and one National Key Research and Development Program project. E-mail: jhliu@hut.edu.cn.
Corresponding author: WANG Zhongmei. E-mail: wangzhongmei@hut.edu.cn

Last Update: 2025-09-05