
[1]ZHANG Hanxiao,XING Xianglei.Deep-learning-enhanced visual SLAM with neural implicit scene representation[J].CAAI Transactions on Intelligent Systems,2026,21(1):120-131.[doi:10.11992/tis.202505029]


Deep-learning-enhanced visual SLAM with neural implicit scene representation


CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume: 21 Issue: 1 (2026) Pages: 120-131 Column: Academic Papers: Machine Perception and Pattern Recognition Publication date: 2026-01-05

Title: Deep-learning-enhanced visual SLAM with neural implicit scene representation

Author(s): ZHANG Hanxiao; XING Xianglei (College of Intelligent Science and Engineering, Harbin Engineering University, Harbin 150001, China)

Keywords: neural radiance field; visual SLAM; loop detection; pose estimation; deep learning; 3D reconstruction; semantic embedding; trajectory prediction

CLC: TP391.41

DOI: 10.11992/tis.202505029

Abstract: In recent years, neural radiance fields have demonstrated strong capability in high-fidelity three-dimensional scene reconstruction. However, visual simultaneous localization and mapping (SLAM) systems that employ neural radiance fields still face challenges in localization accuracy and in the flexibility of explicit scene representation. To address these limitations, this work proposes a visual SLAM system that integrates deep-learning-based pose estimation with neural implicit scene representation. Using dense bundle adjustment layers and an efficient global optimization mechanism, the system iteratively optimizes camera pose and depth at the pixel level and incrementally updates a globally consistent implicit reconstruction surface based on neural radiance fields, so that it reconstructs high-fidelity scenes while achieving accurate localization. In addition, a language query mechanism is introduced to enhance the system’s interactive capability. Extensive experiments were conducted on the EuRoC and Replica datasets, and the results were compared with those of three benchmark methods under different input conditions. The proposed system outperformed these methods in tracking robustness and reconstruction accuracy, providing a reference for subsequent visual SLAM methods based on neural radiance fields.



Last Update: 2026-01-05
