WANG Yefei, GE Quanbo, LIU Huaping, et al. A perceptual manipulation system for audio-visual fusion of robots[J]. CAAI Transactions on Intelligent Systems, 2023, 18(2): 381-389. [doi:10.11992/tis.202111036]

A perceptual manipulation system for audio-visual fusion of robots

References:
[1] HE Kaiming, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN[C]//2017 IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2980–2988.
[2] REDMON J, FARHADI A. YOLOv3: an incremental improvement[EB/OL]. (2018-04-08)[2021-01-01]. https://arxiv.org/abs/1804.02767.
[3] ZHOU Junzuo, ZHU Zongkui, HE Zhengqiu, et al. Hybrid neural network models for human-machine dialogue intention classification[J]. Journal of software, 2019, 30(11): 3313–3325.
[4] CHOWDHARY K R. Natural language processing[J]. Fundamentals of artificial intelligence, 2020, 17(6): 603–649.
[5] MUREZ Z, VAN AS T, BARTOLOZZI J, et al. Atlas: end-to-end 3D scene reconstruction from posed images[M]//Computer Vision-ECCV 2020. Cham: Springer International Publishing, 2020: 414–431.
[6] HODAŇ T, BARÁTH D, MATAS J. EPOS: estimating 6D pose of objects with symmetries[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 11700–11709.
[7] CORONA E, PUMAROLA A, ALENYÀ G, et al. GanHand: predicting human grasp affordances in multi-object scenes[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 5030–5040.
[8] QURESHI A H, SIMEONOV A, BENCY M J, et al. Motion planning networks[C]//2019 International Conference on Robotics and Automation. Montreal: IEEE, 2019: 2118–2124.
[9] QI Yuankai, WU Qi, ANDERSON P, et al. REVERIE: remote embodied visual referring expression in real indoor environments[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 9979–9988.
[10] GAO Chen, CHEN Jinyu, LIU Si, et al. Room-and-object aware knowledge reasoning for remote embodied referring expression[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 3063–3072.
[11] ZHANG Hanbo, LU Yunfan, YU Cunjun, et al. INVIGORATE: interactive visual grounding and grasping in clutter[EB/OL]. (2021-08-25)[2021-10-10]. https://arxiv.org/abs/2108.11092.
[12] MI Jinpeng, LYU Jianzhi, TANG Song, et al. Interactive natural language grounding via referring expression comprehension and scene graph parsing[J]. Frontiers in neurorobotics, 2020, 14: 43.
[13] ONDRAS J, CELIKTUTAN O, BREMNER P, et al. Audio-driven robot upper-body motion synthesis[J]. IEEE transactions on cybernetics, 2021, 51(11): 5445–5454.
[14] LATHUILIÈRE S, MASSÉ B, MESEJO P, et al. Neural network based reinforcement learning for audio-visual gaze control in human-robot interaction[J]. Pattern recognition letters, 2019, 118: 61–71.
[15] HÖNEMANN A, BENNETT C, WAGNER P, et al. Audio-visual synthesized attitudes presented by the German speaking robot SMiRAE[C]//The 15th International Conference on Auditory-Visual Speech Processing. Melbourne: ISCA, 2019: 10–11.
[16] YAMAGUCHI A, ATKESON C G. Recent progress in tactile sensing and sensors for robotic manipulation: can we turn tactile sensing into vision?[J]. Advanced robotics, 2019, 33(14): 661–673.
[17] ZHU Wenlin, LIU Huaping, WANG Bowen, et al. An intelligent blind guidance system based on visual-touch cross-modal perception[J]. CAAI transactions on intelligent systems, 2020, 15(1): 33–40.
[18] LALONDE J F, VANDAPEL N, HUBER D F, et al. Natural terrain classification using three-dimensional ladar data for ground robot mobility[J]. Journal of field robotics, 2006, 23(10): 839–861.
[19] ZHANG Xinyu, ZOU Zhenhong, LI Zhiwei, et al. Deep multi-modal fusion in object detection for autonomous driving[J]. CAAI transactions on intelligent systems, 2020, 15(4): 758–771.
[20] LIU Hongyi, FANG Tongtong, ZHOU Tianyu, et al. Deep learning-based multimodal control interface for human-robot collaboration[J]. Procedia CIRP, 2018, 72: 3–8.
[21] YOO Y, LEE C Y, ZHANG B T. Multimodal anomaly detection based on deep auto-encoder for object slip perception of mobile manipulation robots[C]//2021 IEEE International Conference on Robotics and Automation. Xi’an: IEEE, 2021: 11443–11449.
[22] GAN Chuang, ZHANG Yiwei, WU Jiajun, et al. Look, listen, and act: towards audio-visual embodied navigation[C]//2020 IEEE International Conference on Robotics and Automation. Paris: IEEE, 2020: 9701–9707.
[23] JIN Shaowei, LIU Huaping, WANG Bowen, et al. Open-environment robotic acoustic perception for object recognition[J]. Frontiers in neurorobotics, 2019, 13: 96.
[24] JONETZKO Y, FIEDLER N, EPPE M, et al. Multimodal object analysis with auditory and tactile sensing using recurrent neural networks[M]//Communications in Computer and Information Science. Singapore: Springer Singapore, 2021: 253–265.
[25] JIN Shaowei, LIU Huaping, WANG Bowen, et al. Recognition of unknown materials in an open environment[J]. CAAI transactions on intelligent systems, 2020, 15(5): 1020–1027.
Similar Articles:
[1] LI Nan, ZHAN Xin, CHEN Tao, et al. Data fusion method of vision and SBL position for UUV underwater docking[J]. CAAI Transactions on Intelligent Systems, 2013, 8(2): 156. [doi:10.3969/j.issn.1673-4785.201301020]
[2] PENG Gang, XIONG Chao, XIA Chenglin, et al. A method of vision target localization for dispensing robot based on mark point[J]. CAAI Transactions on Intelligent Systems, 2018, 13(5): 728. [doi:10.11992/tis.201705010]

Memo

Received: 2021-11-18.
Foundation item: National Natural Science Foundation of China (U1613212).
About the authors: WANG Yefei, master's degree candidate; his research interests include computer vision and human-robot interaction. GE Quanbo, professor and doctoral supervisor; his research interests include engineering information fusion methods and their applications, and intelligent evaluation of human-machine hybrid systems; he is the principal investigator of one Young Scientists Fund project of the National Natural Science Foundation of China. LIU Huaping, associate professor and doctoral supervisor; council member of the Chinese Association for Artificial Intelligence (CAAI) and secretary-general of its Cognitive Systems and Information Processing Technical Committee; recipient of the Wu Wenjun Artificial Intelligence Science and Technology Award; his research interests include robot perception, learning and control, and multimodal information fusion; he is the principal investigator of two key projects of the National Natural Science Foundation of China and has published more than 100 academic papers.
Corresponding author: LIU Huaping. E-mail: hpliu@tsinghua.edu.cn
