
[1]张含笑,邢向磊.融合深度学习与神经隐式表征的视觉SLAM系统[J].智能系统学报,2026,21(1):120-131.[doi:10.11992/tis.202505029]
　ZHANG Hanxiao,XING Xianglei.Deep-learning-enhanced visual SLAM with neural implicit scene representation[J].CAAI Transactions on Intelligent Systems,2026,21(1):120-131.[doi:10.11992/tis.202505029]


融合深度学习与神经隐式表征的视觉SLAM系统


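CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume: 21 Issue: 2026, No. 1 Pages: 120-131 Column: Academic Papers: Machine Perception and Pattern Recognition Publication date: 2026-03-05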

Title:: Deep-learning-enhanced visual SLAM with neural implicit scene representation

作者:: 张含笑, 邢向磊; 哈尔滨工程大学智能科学与工程学院, 黑龙江哈尔滨 150001

Author(s):: ZHANG Hanxiao, XING Xianglei; College of Intelligent Science and Engineering, Harbin Engineering University, Harbin 150001, China

关键词:: 神经辐射场; 视觉SLAM; 回环检测; 位姿估计; 深度学习; 三维重建; 语义嵌入; 轨迹预测

Keywords:: neural radiance field; visual SLAM; loop closure detection; pose estimation; deep learning; 3D reconstruction; semantic embedding; trajectory prediction

CLC number:: TP391.41

DOI:: 10.11992/tis.202505029

摘要:: 近年来，神经辐射场在三维重建任务中展现出卓越性能。然而，应用在视觉同时定位与地图构建(simultaneous localization and mapping, SLAM)中因缺乏全局优化机制容易导致系统定位精度不足以及重建失败。针对该问题，本文提出一种融合深度学习位姿估计与神经隐式表征的视觉SLAM系统。通过稠密束调整层以及高效的全局优化机制对相机位姿和深度进行像素级的循环迭代，并基于神经辐射场方法更新全局一致的隐式重建表面，使得系统在精准定位的同时能够重建高保真场景，并且在此基础上引入语言查询机制，增强系统的交互能力。在EuRoC和Replica数据集上进行大量实验，在不同的输入条件下，分别与3类基准方法进行对比，结果表明该系统在跟踪鲁棒性和重建精度方面相较于现有方法表现更优。本方法可为后续基于神经辐射场的视觉SLAM方法提供参考。

Abstract:: In recent years, neural radiance fields have demonstrated strong capability in high-fidelity three-dimensional scene reconstruction. However, visual simultaneous localization and mapping (SLAM) systems that employ neural radiance fields still face challenges in localization accuracy and the flexibility of explicit scene representation. To address these limitations, this work proposes a visual SLAM system that integrates deep-learning-based pose estimation with neural implicit scene representation. Through dense bundle adjustment layers and an efficient global optimization mechanism, the camera pose and depth are iteratively optimized at the pixel level, and a globally consistent implicit reconstruction surface is incrementally updated based on neural radiance fields, enabling the system to reconstruct high-fidelity scenes while achieving accurate localization. Furthermore, a language query mechanism is introduced to enhance the system's interactive capability. Extensive experiments are conducted on the EuRoC and Replica datasets, comparing the system against three categories of baseline methods under different input conditions. The results show that the proposed system outperforms existing methods in tracking robustness and reconstruction accuracy, providing a reference for subsequent visual SLAM methods based on neural radiance fields.
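
The abstract links per-pixel depth estimates to a radiance-field map. One way to see how the two connect is the standard NeRF volume-rendering quadrature, which composites densities and colors sampled along a camera ray into a rendered color and an expected depth. The following is a minimal sketch of that general technique only, not the authors' implementation; the function name `render_ray` and its arguments are hypothetical, and the paper's actual network, sampling scheme, and loss terms are not stated on this page.

```python
import math


def render_ray(sigmas, colors, t_vals):
    """Composite samples along one ray (standard NeRF quadrature).

    sigmas: non-negative volume densities, one per sample bin
    colors: RGB triples, one per sample bin
    t_vals: increasing bin edges along the ray (len(sigmas) + 1 entries)
    Returns (rgb, depth, opacity). All names here are illustrative.
    """
    transmittance = 1.0  # fraction of light that has survived so far
    rgb = [0.0, 0.0, 0.0]
    depth = 0.0
    opacity = 0.0
    for i, (sigma, color) in enumerate(zip(sigmas, colors)):
        delta = t_vals[i + 1] - t_vals[i]        # bin length
        alpha = 1.0 - math.exp(-sigma * delta)   # absorption within the bin
        weight = transmittance * alpha           # contribution of this bin
        midpoint = 0.5 * (t_vals[i] + t_vals[i + 1])
        for c in range(3):
            rgb[c] += weight * color[c]
        depth += weight * midpoint               # expected ray termination
        opacity += weight
        transmittance *= 1.0 - alpha
    return rgb, depth, opacity


# Example: empty space in the first bin, an opaque green surface in the second.
rgb, depth, opacity = render_ray(
    sigmas=[0.0, 1000.0],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    t_vals=[0.0, 1.0, 2.0],
)
```

In a SLAM setting, a rendered depth of this kind can be compared against depth estimated by the tracking front end, which is one plausible way a pose-and-depth optimizer and an implicit map could be kept consistent. Whether the paper does exactly this is an assumption.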

备注/Memo

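Received: 2025-05-28.
Funding: National Natural Science Foundation of China (62076078, 61703119); Fundamental Research Funds for the Central Universities (3072024LJ0403).
About the authors: ZHANG Hanxiao, master's degree; main research interest: computer vision. E-mail: 2682706067@qq.com. XING Xianglei, professor and doctoral supervisor; main research interests: pattern recognition and computer vision. Recipient of the first prize of the Heilongjiang Provincial University Science and Technology Award (Natural Science category) and of the Outstanding Paper Award of CAAI Transactions on Intelligent Systems. Has published more than 60 academic papers. E-mail: xingxl@hrbeu.edu.cn.
Corresponding author: XING Xianglei. E-mail: xingxl@hrbeu.edu.cn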

更新日期/Last Update: 2026-01-05
