[1] WANG Yefei, GE Quanbo, LIU Huaping, et al. A perceptual manipulation system for audio-visual fusion of robots[J]. CAAI Transactions on Intelligent Systems, 2023, 18(2): 381–389. [doi: 10.11992/tis.202111036]

A perceptual manipulation system for audio-visual fusion of robots

References:
[1] HE Kaiming, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN[C]//2017 IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2980–2988.
[2] REDMON J, FARHADI A. YOLOv3: an incremental improvement[EB/OL]. (2018-04-08)[2021-01-01]. https://arxiv.org/abs/1804.02767.
[3] 周俊佐, 朱宗奎, 何正球, 等. 面向人机对话意图分类的混合神经网络模型[J]. 软件学报, 2019, 30(11): 3313–3325.
ZHOU Junzuo, ZHU Zongkui, HE Zhengqiu, et al. Hybrid neural network models for human-machine dialogue intention classification[J]. Journal of software, 2019, 30(11): 3313–3325.
[4] CHOWDHARY K R. Natural language processing[J]. Fundamentals of artificial intelligence, 2020, 17(6): 603–649.
[5] MUREZ Z, VAN AS T, BARTOLOZZI J, et al. Atlas: end-to-end 3D scene reconstruction from posed images[M]//Computer Vision-ECCV 2020. Cham: Springer International Publishing, 2020: 414–431.
[6] HODAŇ T, BARÁTH D, MATAS J. EPOS: estimating 6D pose of objects with symmetries[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 11700–11709.
[7] CORONA E, PUMAROLA A, ALENYÀ G, et al. GanHand: predicting human grasp affordances in multi-object scenes[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 5030–5040.
[8] QURESHI A H, SIMEONOV A, BENCY M J, et al. Motion planning networks[C]//2019 International Conference on Robotics and Automation. Montreal: IEEE, 2019: 2118–2124.
[9] QI Yuankai, WU Qi, ANDERSON P, et al. REVERIE: remote embodied visual referring expression in real indoor environments[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 9979–9988.
[10] GAO Chen, CHEN Jinyu, LIU Si, et al. Room-and-object aware knowledge reasoning for remote embodied referring expression[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 3063–3072.
[11] ZHANG Hanbo, LU Yunfan, YU Cunjun, et al. INVIGORATE: interactive visual grounding and grasping in clutter[EB/OL]. (2021-08-25)[2021-10-10]. https://arxiv.org/abs/2108.11092.
[12] MI Jinpeng, LYU Jianzhi, TANG Song, et al. Interactive natural language grounding via referring expression comprehension and scene graph parsing[J]. Frontiers in neurorobotics, 2020, 14: 43.
[13] ONDRAS J, CELIKTUTAN O, BREMNER P, et al. Audio-driven robot upper-body motion synthesis[J]. IEEE transactions on cybernetics, 2021, 51(11): 5445–5454.
[14] LATHUILIÈRE S, MASSÉ B, MESEJO P, et al. Neural network based reinforcement learning for audio-visual gaze control in human-robot interaction[J]. Pattern recognition letters, 2019, 118: 61–71.
[15] HÖNEMANN A, BENNETT C, WAGNER P, et al. Audio-visual synthesized attitudes presented by the German speaking robot SMiRAE[C]//The 15th International Conference on Auditory-Visual Speech Processing. Melbourne: ISCA, 2019: 10–11.
[16] YAMAGUCHI A, ATKESON C G. Recent progress in tactile sensing and sensors for robotic manipulation: can we turn tactile sensing into vision?[J]. Advanced robotics, 2019, 33(14): 661–673.
[17] 朱文霖, 刘华平, 王博文, 等. 基于视-触跨模态感知的智能导盲系统[J]. 智能系统学报, 2020, 15(1): 33–40.
ZHU Wenlin, LIU Huaping, WANG Bowen, et al. An intelligent blind guidance system based on visual-touch cross-modal perception[J]. CAAI transactions on intelligent systems, 2020, 15(1): 33–40.
[18] LALONDE J F, VANDAPEL N, HUBER D F, et al. Natural terrain classification using three-dimensional ladar data for ground robot mobility[J]. Journal of field robotics, 2006, 23(10): 839–861.
[19] 张新钰, 邹镇洪, 李志伟, 等. 面向自动驾驶目标检测的深度多模态融合技术[J]. 智能系统学报, 2020, 15(4): 758–771.
ZHANG Xinyu, ZOU Zhenhong, LI Zhiwei, et al. Deep multi-modal fusion in object detection for autonomous driving[J]. CAAI transactions on intelligent systems, 2020, 15(4): 758–771.
[20] LIU Hongyi, FANG Tongtong, ZHOU Tianyu, et al. Deep learning-based multimodal control interface for human-robot collaboration[J]. Procedia CIRP, 2018, 72: 3–8.
[21] YOO Y, LEE C Y, ZHANG B T. Multimodal anomaly detection based on deep auto-encoder for object slip perception of mobile manipulation robots[C]//2021 IEEE International Conference on Robotics and Automation. Xi’an: IEEE, 2021: 11443–11449.
[22] GAN Chuang, ZHANG Yiwei, WU Jiajun, et al. Look, listen, and act: towards audio-visual embodied navigation[C]//2020 IEEE International Conference on Robotics and Automation. Paris: IEEE, 2020: 9701–9707.
[23] JIN Shaowei, LIU Huaping, WANG Bowen, et al. Open-environment robotic acoustic perception for object recognition[J]. Frontiers in neurorobotics, 2019, 13: 96.
[24] JONETZKO Y, FIEDLER N, EPPE M, et al. Multimodal object analysis with auditory and tactile sensing using recurrent neural networks[M]//Communications in Computer and Information Science. Singapore: Springer Singapore, 2021: 253–265.
[25] 靳少卫, 刘华平, 王博文, 等. 开放环境下未知材质的识别技术[J]. 智能系统学报, 2020, 15(5): 1020–1027.
JIN Shaowei, LIU Huaping, WANG Bowen, et al. Recognition of unknown materials in an open environment[J]. CAAI transactions on intelligent systems, 2020, 15(5): 1020–1027.

Copyright © CAAI Transactions on Intelligent Systems