[1]WANG Yefei,GE Quanbo,LIU Huaping,et al.A perceptual manipulation system for audio-visual fusion of robots[J].CAAI Transactions on Intelligent Systems,2023,18(2):381-389.[doi:10.11992/tis.202111036]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume: 18
Issue: 2023, No. 2
Pages: 381-389
Column: Wu Wenjun Artificial Intelligence Science and Technology Award Forum
Publication date: 2023-05-05
Title: A perceptual manipulation system for audio-visual fusion of robots
Author(s): WANG Yefei1; GE Quanbo2; LIU Huaping3; LU Zhenyu4
Affiliation(s):
1. School of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China;
2. School of Automation, Nanjing University of Information Science and Technology, Nanjing 210044, China;
3. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
4. School of AI, Nanjing University of Information Science and Technology, Nanjing 210044, China
Keywords: visual positioning; audio recognition; deep learning; visual perception; auditory perception; audio-visual fusion; multi-modal data; active operation
CLC number: TP391
DOI: 10.11992/tis.202111036
Abstract: Enabling intelligent robots to function in complex environments has been a longstanding challenge in robotic applications. Referring expressions are frequently used to locate objects and are therefore a common means of interacting with robots. However, relying on the visual modality alone is not adequate for all tasks in real-world scenarios. This study proposes a robot perception system based on the fusion of the visual and auditory modalities. The system employs deep learning models to realize the robot's visual and auditory perception: it processes natural language and scene information for visual positioning, and it collects data from 12 types of sound signals for audio recognition. The experimental results indicate that the system, integrated with a UR robot, delivers strong visual positioning and audio recognition performance and successfully carries out an instruction-based audio-visual manipulation task. The results confirm that audio-visual data have a higher expressive capability than single-modal data.
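To illustrate the kind of pipeline the abstract describes, the sketch below shows one possible way to organize a late-fusion audio-visual perception module: a grounding network scores detected image regions against a referring-expression embedding, and a small convolutional network classifies audio clips into 12 sound categories. This is a minimal sketch in PyTorch; the module names, feature dimensions, and network sizes are illustrative assumptions and are not taken from the paper.

# Minimal sketch (not the authors' code): late-fusion audio-visual perception.
# A grounding head scores candidate image regions against a language query,
# and a small CNN classifies log-mel spectrograms into 12 sound categories.
# All feature sizes and layer choices below are illustrative assumptions.
import torch
import torch.nn as nn

class VisualGrounder(nn.Module):
    """Scores candidate object regions against a referring-expression embedding."""
    def __init__(self, region_dim=2048, text_dim=768, hidden=512):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.scorer = nn.Linear(hidden, 1)

    def forward(self, region_feats, text_feat):
        # region_feats: (num_regions, region_dim), text_feat: (text_dim,)
        fused = torch.tanh(self.region_proj(region_feats) + self.text_proj(text_feat))
        return self.scorer(fused).squeeze(-1)  # (num_regions,) matching scores

class AudioClassifier(nn.Module):
    """Classifies a log-mel spectrogram into one of 12 sound categories."""
    def __init__(self, n_classes=12):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, spectrogram):
        # spectrogram: (batch, 1, n_mels, time)
        return self.head(self.backbone(spectrogram).flatten(1))  # (batch, n_classes)

if __name__ == "__main__":
    grounder, listener = VisualGrounder(), AudioClassifier()
    regions = torch.randn(5, 2048)      # 5 candidate regions from an object detector
    query = torch.randn(768)            # embedding of an instruction, e.g. "pick up the red cup"
    audio = torch.randn(1, 1, 64, 100)  # one log-mel spectrogram
    target_region = grounder(regions, query).argmax()
    sound_class = listener(audio).argmax(dim=-1)
    print(int(target_region), int(sound_class))  # indices the manipulation policy would act on

In a system of this kind, the region index selects the object to grasp, while the predicted sound class supplies the auditory cue that disambiguates or triggers the manipulation step.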