[1]赵荣峰,卢宝莉,唐小江,等.面向智能座舱的多源混合模态数据集及层次化融合分类方法[J].智能系统学报,2026,21(1):83-94.[doi:10.11992/tis.202507024]
　ZHAO Rongfeng,LU Baoli,TANG Xiaojiang,et al.Multi-source hybrid-modality dataset and hierarchical fusion classification method for intelligent cockpits[J].CAAI Transactions on Intelligent Systems,2026,21(1):83-94.[doi:10.11992/tis.202507024]

面向智能座舱的多源混合模态数据集及层次化融合分类方法

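CAAI Transactions on Intelligent Systems (《智能系统学报》) [ISSN 1673-4785/CN 23-1538/TP]; Volume: 21; Issue: No. 1, 2026; Pages: 83-94; Column: Academic Papers - Machine Learning; Publication date: 2026-03-05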

Title: Multi-source hybrid-modality dataset and hierarchical fusion classification method for intelligent cockpits

作者: 赵荣峰¹,², 卢宝莉¹, 唐小江¹, 胡敏⁴, 李卫军¹,³, 宁欣¹,²
1. 中国科学院半导体研究所人工智能与高速电路实验室, 北京 100083;
2. 中国科学院大学材料科学与光电技术学院, 北京 100049;
3. 中国科学院大学集成电路学院, 北京 100049;
4. 北京中科睿途科技有限公司, 北京 100096

Author(s): ZHAO Rongfeng¹,², LU Baoli¹, TANG Xiaojiang¹, HU Min⁴, LI Weijun¹,³, NING Xin¹,²
1. Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China;
2. College of Materials Science and Opto-Electronic Technology, University of Chinese Academy of Sciences, Beijing 100049, China;
3. School of Integrated Circuits, University of Chinese Academy of Sciences, Beijing 100049, China;
4. Beijing Ratu Technology Co., Ltd, Beijing 100096, China

关键词: 智能座舱; 数据集; 多模态融合; 视觉多模态; 行为分类; 危险行为; 行为识别; 多源数据

Keywords: intelligent cockpit; dataset; multimodal fusion; visual multimodality; behavior classification; dangerous behavior; behavior recognition; multi-source data

CLC number: TP391.4

DOI: 10.11992/tis.202507024

摘要: 针对驾驶领域智能座舱数据开源少、数据模态维度单一、标注力度不足和场景多样性受限的问题，构建了面向智能座舱的多源混合模态数据集，包含彩色数据、深度数据和红外数据的视觉模态数据与包含车辆信息和多维度驾驶场景的结构化文本模态数据，使用双层行为联合标注规则完成了数据集十类标签的标注。同时，基于该数据集提出了层次化混合模态融合框架，通过跨模态信息交换机制与语义引导融合机制提升了模型对数据特征的提取能力，完成了数据集中彩色数据与其余各数据的不同组合对行为分类任务性能影响的实验。实验表明：多源混合模态数据集能够有效提升对智能座舱的环境理解。在该数据集上，逐渐增加数据集中与彩色数据的不同数据源能够提升所提出方法对数据集分类的能力，当使用所有数据时性能达到最佳，相较于只用彩色数据的准确率提升了15.75%，验证了数据集内多源混合模态数据的有效性。

Abstract: Intelligent cockpit data in the driving domain suffers from scarce open-source releases, limited modality dimensions, insufficient annotation, and restricted scene diversity. To address these problems, a multi-source hybrid-modality dataset for intelligent cockpits is constructed. It comprises visual-modality data (RGB, depth, and infrared) and structured text-modality data covering vehicle information and multi-dimensional driving scenarios. A dual-layer joint behavior annotation scheme is used to label the dataset with ten categories. Based on this dataset, a hierarchical hybrid-modality fusion framework is proposed, which improves feature extraction through a cross-modal information exchange mechanism and a semantically guided fusion mechanism. Experiments examine how different combinations of RGB data with the other data sources affect performance on the behavior classification task. The results show that the multi-source hybrid-modality dataset effectively improves environmental understanding of the intelligent cockpit. Progressively adding other data sources to the RGB data improves the classification performance of the proposed method, which peaks when all data are used, with accuracy 15.75% higher than with RGB data alone. These results validate the effectiveness of the multi-source hybrid-modality data in the dataset.
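
The experiments summarized in the abstract combine RGB data with different subsets of the remaining data sources. As a rough illustration only, the Python sketch below enumerates such subsets. The modality labels and the function name `modality_combinations` are assumptions made for this sketch, and the abstract does not state which subsets the authors actually evaluated.

```python
from itertools import combinations

# Hypothetical modality labels. The abstract names RGB, depth, infrared,
# and structured text, but not how the authors identify them in code.
BASE = "rgb"
EXTRA = ("depth", "infrared", "text")


def modality_combinations(base=BASE, extra=EXTRA):
    """Return every modality set that contains the base modality.

    Sets are ordered from fewest to most modalities, so the first entry
    is the base-only baseline and the last one uses every modality.
    """
    combos = []
    for k in range(len(extra) + 1):
        for subset in combinations(extra, k):
            combos.append((base,) + subset)
    return combos


if __name__ == "__main__":
    for combo in modality_combinations():
        print(" + ".join(combo))
```

With three additional sources this yields eight configurations, from the RGB-only baseline to the full set, which is the range over which the abstract reports accuracy being compared.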

备注/Memo

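Received: 2025-07-16.
Funding: Beijing Natural Science Foundation-Xiaomi Innovation Joint Fund (L233036).
About the authors:
ZHAO Rongfeng, master's student. Main research interests: multimodal intelligent cockpits, multimodal large models, and video understanding. Honors include the "Outstanding Conscript" title and a commendation, a provincial-level gold award in the "Qinghui Tuanshi" special track of the 2022 "Qingchuang Beijing" "Challenge Cup" Capital University Students Entrepreneurship Plan Competition, the 2022 National Encouragement Scholarship, and the 2023 Beijing "Outstanding Graduate" title. E-mail: zhaorongfeng23@semi.ac.cn.
LU Baoli, assistant researcher, Ph.D., senior member of the China Computer Federation and member of the Youth Working Committee of the Chinese Association for Artificial Intelligence; served as organizing chair of the IEEE HPBD&IS 2021 and IEEE HDIS 2022 international conferences. Main research interests: computer vision, intelligent systems, and AI-assisted diagnosis and treatment. Has participated, as sub-project leader and key project member, in more than 10 projects funded by the National Key R&D Program of China, the National Natural Science Foundation of China, the Beijing Natural Science Foundation, and others; holds 10 granted invention patents; won first place in the algorithm track of the 2025 Yangtze River Delta (Wuhu) Computing Power and Algorithm Innovation Application Competition; has published more than 20 academic papers. E-mail: lubaoli@semi.ac.cn.
NING Xin, researcher, doctoral supervisor. Senior member of the China Computer Federation, the Chinese Association for Artificial Intelligence, and the China Society of Image and Graphics; listed among the world's top 2% scientists for 2022-2024; member of the Youth Innovation Promotion Association of the Chinese Academy of Sciences. Has led 5 projects, including ones funded by the National Key R&D Program of China, the National Natural Science Foundation of China (Young Scientists Fund and General Program), and the Beijing Natural Science Foundation. Holds more than 30 granted national invention patents; received a Second Prize of the Science and Technology Progress Award of the Chinese Institute of Electronics and a First Prize of the inaugural Young "Chuangxin" Award (青年创芯奖) of the Institute of Semiconductors, Chinese Academy of Sciences; selected for the Young Researcher Program of the Institute of Semiconductors, Chinese Academy of Sciences. Has published more than 100 academic papers and authored one English-language monograph. E-mail: ningxin@semi.ac.cn.
Corresponding author: LU Baoli. E-mail: lubaoli@semi.ac.cn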

更新日期/Last Update: 2026-01-05
