
[1]陆军,赵颢然,鲁林超.基于多模态融合的三维目标检测方法研究[J].智能系统学报,2025,20(5):1167-1177.[doi:10.11992/tis.202502015]
　LU Jun,ZHAO Haoran,LU Linchao.Research on 3D object detection based on multi-modal fusion[J].CAAI Transactions on Intelligent Systems,2025,20(5):1167-1177.[doi:10.11992/tis.202502015]

基于多模态融合的三维目标检测方法研究

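《智能系统学报》(CAAI Transactions on Intelligent Systems) [ISSN 1673-4785/CN 23-1538/TP] Volume: 20; Issue: 2025, No. 5; Pages: 1167-1177; Column: Academic Papers (Machine Perception and Pattern Recognition); Publication date: 2025-09-05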

Title:: Research on 3D object detection based on multi-modal fusion

作者:: 陆军, 赵颢然, 鲁林超; 哈尔滨工程大学智能科学与工程学院, 黑龙江哈尔滨 150001

Author(s):: LU Jun, ZHAO Haoran, LU Linchao; College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150001, China

关键词:: 三维目标检测; 多模态融合; 深度学习; 深度估计; 特征聚合; 注意力机制; 激光雷达; 自动驾驶

Keywords:: 3D target detection; multimodal fusion; deep learning; depth estimation; feature aggregation; attention mechanism; LiDAR; autonomous driving

分类号/CLC number:: TP391

DOI:: 10.11992/tis.202502015

摘要:: 在自动驾驶场景中，由于多模态的融合，三维目标检测效果易受传感器未充分校准的影响，同时，对于目标密集的复杂场景，检测过程中易对目标造成误检，从而降低模型的召回率和检测精度。针对以上问题，设计了多模态融合网络SoftFusion-QC(softfusion with query contrast)用以实现三维目标检测。为了自适应地融合来自激光雷达的点云数据和摄像头捕获的图像信息，提出可变形跨模态特征聚合模块(deformable cross-modality feature aggregate, DCFA)，实现深层次的特征融合。为了有效应对传感器校准不足问题，引入查询对比机制(query contrast, QC)，通过基于Transformer的查询交互策略和查询框对比学习策略，显著提升了检测的精度和鲁棒性，解决了密集目标检测的误检问题。在nuScenes自动驾驶数据集上，取得了69.8%的mAP(mean average precision)与72.8%的NDS(normalized detection score)。通过定量的性能分析和消融实验验证了算法的有效性。

Abstract:: In autonomous driving, 3D object detection based on multimodal fusion is sensitive to insufficient sensor calibration. In addition, in complex scenes with densely packed targets, detectors are prone to false detections, which lowers the model’s recall and detection accuracy. To address these problems, we design a multimodal fusion network, SoftFusion-QC (SoftFusion with query contrast), for 3D object detection. To adaptively fuse LiDAR point clouds with camera images, we propose a deformable cross-modality feature aggregate (DCFA) module, which performs deep feature fusion and mitigates the effect of inadequate sensor calibration. To reduce false detections among dense objects, we introduce a query contrast (QC) mechanism that combines a Transformer-based query interaction strategy with a query box contrastive learning strategy, improving detection accuracy and robustness. On the nuScenes autonomous driving dataset, the method achieves 69.8% mAP (mean average precision) and 72.8% NDS (nuScenes detection score). Quantitative performance analysis and ablation studies validate the effectiveness of the algorithm.
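
The two headline metrics in the abstract are linked by the nuScenes scoring formula, NDS = (5 * mAP + sum(1 - min(1, err))) / 10, where the sum runs over five mean true-positive error metrics (mATE, mASE, mAOE, mAVE, mAAE). The sketch below is a minimal illustration of that formula and not the authors' evaluation code; the function names `nds` and `implied_mean_tp_score` are hypothetical, and official scores come from the nuScenes evaluation toolkit.

```python
from typing import Mapping

# The five mean true-positive error metrics used by the nuScenes benchmark.
TP_ERROR_NAMES = ("mATE", "mASE", "mAOE", "mAVE", "mAAE")


def nds(map_score: float, tp_errors: Mapping[str, float]) -> float:
    """nuScenes detection score: (5 * mAP + sum(1 - min(1, err))) / 10.

    Hypothetical helper for illustration; mAP is a fraction in [0, 1].
    """
    missing = [name for name in TP_ERROR_NAMES if name not in tp_errors]
    if missing:
        raise ValueError(f"missing TP error metrics: {missing}")
    # Each error is clipped at 1 and converted into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, tp_errors[name]) for name in TP_ERROR_NAMES]
    return (5.0 * map_score + sum(tp_scores)) / 10.0


def implied_mean_tp_score(map_score: float, nds_score: float) -> float:
    """Average true-positive score implied by a reported (mAP, NDS) pair."""
    # Rearranging the formula: sum(tp_scores) = 10 * NDS - 5 * mAP.
    return (10.0 * nds_score - 5.0 * map_score) / 5.0


if __name__ == "__main__":
    # Values reported in the abstract: 69.8% mAP and 72.8% NDS.
    print(round(implied_mean_tp_score(0.698, 0.728), 3))
```

With the reported 69.8% mAP and 72.8% NDS, the implied average true-positive score is about 0.758. The abstract does not list the individual error metrics, so they are not reconstructed here.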

相似文献/Similar Articles:: [1]温晓红,刘华平,阎高伟,等.基于超限学习机的非线性典型相关分析及应用[J].智能系统学报,2018,13(4):633.[doi:10.11992/tis.201703034]
　WEN Xiaohong,LIU Huaping,YAN Gaowei,et al.Nonlinear canonical correlation analysis and application based on extreme learning machine[J].CAAI Transactions on Intelligent Systems,2018,13(4):633.[doi:10.11992/tis.201703034]
[2]贾晨,刘华平,续欣莹,等.基于宽度学习方法的多模态信息融合[J].智能系统学报,2019,14(1):150.[doi:10.11992/tis.201803022]
　JIA Chen,LIU Huaping,XU Xinying,et al.Multi-modal information fusion based on broad learning method[J].CAAI Transactions on Intelligent Systems,2019,14(1):150.[doi:10.11992/tis.201803022]
[3]王召新,续欣莹,刘华平,等.基于级联宽度学习的多模态材质识别[J].智能系统学报,2020,15(4):787.[doi:10.11992/tis.201908021]
　WANG Zhaoxin,XU Xinying,LIU Huaping,et al.Cascade broad learning for multi-modal material recognition[J].CAAI Transactions on Intelligent Systems,2020,15(4):787.[doi:10.11992/tis.201908021]
[4]赵小明,唐志伟,张石清.面向听视觉信息的多模态人格识别研究进展[J].智能系统学报,2021,16(2):189.[doi:10.11992/tis.202101034]
　ZHAO Xiaoming,TANG Zhiwei,ZHANG Shiqing.Research advance of multimodal personality recognition based on audio and visual cues[J].CAAI Transactions on Intelligent Systems,2021,16(2):189.[doi:10.11992/tis.202101034]
[5]鲁斌,孙洋,杨振宇.融合体素图注意力的三维目标检测算法[J].智能系统学报,2024,19(3):598.[doi:10.11992/tis.202209008]
　LU Bin,SUN Yang,YANG Zhenyu.3D object detection algorithm with voxel graph attention[J].CAAI Transactions on Intelligent Systems,2024,19(3):598.[doi:10.11992/tis.202209008]
[6]鲁斌,杨振宇,孙洋,等.基于多通道交叉注意力融合的三维目标检测算法[J].智能系统学报,2024,19(4):885.[doi:10.11992/tis.202305029]
　LU Bin,YANG Zhenyu,SUN Yang,et al.3D object detection algorithm with multi-channel cross attention fusion[J].CAAI Transactions on Intelligent Systems,2024,19(4):885.[doi:10.11992/tis.202305029]
[7]潘在宇,徐家梦,王军,等.基于模态信息度评估策略的掌纹掌静脉特征识别方法[J].智能系统学报,2024,19(5):1136.[doi:10.11992/tis.202310002]
　PAN Zaiyu,XU Jiameng,WANG Jun,et al.Palmprint and palm vein recognition method based on modal information evaluation strategy[J].CAAI Transactions on Intelligent Systems,2024,19(5):1136.[doi:10.11992/tis.202310002]
[8]黄志鸿,杜瑞,张辉.面向复杂电力环境场景理解的可见光和红外图像特征级融合方法[J].智能系统学报,2025,20(3):631.[doi:10.11992/tis.202404014]
　HUANG Zhihong,DU Rui,ZHANG Hui.Feature-level fusion method of visible and infrared images for scene understanding in complex power environments[J].CAAI Transactions on Intelligent Systems,2025,20(3):631.[doi:10.11992/tis.202404014]
[9]仲兆满,樊继冬,张渝,等.基于卷积交叉注意力与跨模态动态门控的多模态情感分析模型[J].智能系统学报,2025,20(4):999.[doi:10.11992/tis.202409012]
　ZHONG Zhaoman,FAN Jidong,ZHANG Yu,et al.Multimodal sentiment analysis model with convolutional cross-attention and cross-modal dynamic gating[J].CAAI Transactions on Intelligent Systems,2025,20(4):999.[doi:10.11992/tis.202409012]

备注/Memo

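Received: 2025-02-26.
Funding: Natural Science Foundation of Heilongjiang Province (F201123).
About the authors: LU Jun (陆军), professor, doctoral supervisor, Ph.D. Main research interests: computer vision, machine perception, and robotic arm control. Review expert for the Innovation Fund for Technology-Based Small and Medium-Sized Enterprises of the Ministry of Science and Technology, and peer reviewer for the National Natural Science Foundation of China. Has published more than 80 academic papers and 5 books. E-mail: lujun0260@sina.com. ZHAO Haoran (赵颢然), master's student. Main research interests: 3D object detection and computer vision. E-mail: 1793961894@qq.com. LU Linchao (鲁林超), master's degree. Main research interests: 3D object detection and computer vision. E-mail: llczsr@163.com.
Corresponding author: LU Jun. E-mail: lujun0260@sina.com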

更新日期/Last Update: 2025-09-05
