Authors: Cardenas-Perez, Carlos and Romualdi, Giulio and Elobaid, Mohammed and Dafarra, Stefano and L'erario, Giuseppe and
Traversaro, Silvio and Morerio, Pietro and Del Bue, Alessio and Pucci, Daniele
This paper presents XBG (eXteroceptive Behaviour Generation), a multimodal end-to-end Imitation Learning (IL) system for a whole-body autonomous humanoid robot used in real-world Human-Robot Interaction (HRI) scenarios. The main contribution of this paper is an architecture for learning HRI behaviors using a data-driven approach. Through teleoperation, a diverse dataset is collected, comprising demonstrations across multiple HRI scenarios, including handshaking, handwaving, payload reception, walking, and walking with a payload. After synchronizing, filtering, and transforming the data, different Deep Neural Network (DNN) models are trained. The final system integrates different modalities, comprising exteroceptive and proprioceptive sources of information, to provide the robot with an understanding of its environment and its own actions. The robot takes sequences of images (RGB and depth) and joint state information during the interactions and then reacts accordingly, demonstrating learned behaviors. By fusing multimodal signals in time, we encode new autonomous capabilities into the robotic platform, allowing the understanding of context changes over time. The models are deployed on ergoCub, a real-world humanoid robot, and their performance is measured by calculating the success rate of the robot's behavior under the mentioned scenarios.
XBG leverages the avatar system [1], which enables a human operator to embody a humanoid robotic platform. It is composed of two networks, the operator one and the robot one, which are connected through diverse teleoperation and teleperception interfaces. Figure 1 shows the operator network (left-top) and the robot network (right) connected through the retargeted joint positions, walking velocities, and sensor feedback interfaces. This work introduces the embodied AI network (left-bottom), where XBG connects to the same interfaces as the operator network, allowing the system to switch from human teleoperation to an embodied AI autonomous behavior (the XBG network architecture is presented in detail in Figure 2).
Fig. 1: From Teleoperation to Autonomy: the robot network uses the retargeted joint positions and walking velocities interfaces and provides the information for the sensor feedback interface. The robot can be operated either by a human operator or by XBG.
In an autonomous humanoid robot system, much like humans being aware of their surroundings and actions, it is essential for the robot to understand its environment and its own actions. To achieve this, we incorporate two components into the XBG (eXteroceptive Behaviour Generation) model: one for processing environmental information (such as what the robot sees and senses) and another for understanding the robot's state (such as its position, movements, and internal conditions). Similar to humans, who do not make decisions or act based on a single moment in time but rather consider a longer context to make sense of an interaction, the robot also needs to understand sequences of events. XBG analyzes the images captured by the robot's cameras and monitors its joint positions over time to assess its behaviour. This involves observing changes over time, both in visual perception and in physical movements. To this purpose, the model comprises 4 primary components, shown in Figure 2: one for extracting features from the environmental sensory inputs (exteroceptive), a component to assess its state (proprioceptive), another one for integrating these diverse sources of information (modality fusion), and a last one for analyzing temporal dependencies and patterns. These components collaborate to provide the robot with a comprehensive understanding of how it should interact with its surroundings and humans.
Fig. 2: XBG Neural Network Architecture. a) Exteroceptive Component comprising RGB branch (top) and Depth branch (bottom). b) Proprioceptive Component. c) Modal Fusion Component. d) Temporal Component.
The exteroceptive component aims at enhancing the robot's perception capabilities. By gathering information from the surrounding environment, the robot gains insight into its context and learns to respond accordingly. For such a task, this work uses both RGB and depth data, as shown in Figure 2a. Each modality is initially handled by a separate branch that extracts a set of features, which are then fused. The RGB branch receives input images of size 224x224 pixels by 3 channels. These images go through feature extraction using a pre-trained model backbone; XBG uses the smallest version of YOLOv8 (You Only Look Once Version 8 nano) [2]. The purpose of using this pretrained sub-network is to leverage its capacity to extract a meaningful set of visual features. The depth branch, on the other hand, employs a series of 7 standard convolutional layers; each layer uses a 3x3 kernel, with stride and padding set to 1, and a ReLU activation function. The number of channels increases gradually up to 256 throughout the branch, while its dimensionality is reduced by pooling layers. This branch's design ensures an output with dimensions identical to those of the RGB branch, facilitating the multimodal fusion in subsequent network layers.
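To make the branch dimensions concrete, the following PyTorch sketch (not from the paper) mirrors the description above. The YOLOv8-nano backbone is replaced by a small stand-in encoder so the snippet stays self-contained; for a 224x224 input, the stride-32 output of the real yolov8n backbone is a 256x7x7 feature map, which is the shape assumed here. The depth-branch channel progression is illustrative, since the paper only states that channels grow up to 256.

```python
# Hypothetical sketch of the exteroceptive component (Fig. 2a).
import torch
import torch.nn as nn


class DepthBranch(nn.Module):
    """7 conv layers (3x3, stride 1, padding 1, ReLU); channels grow to 256
    while five 2x poolings reduce 224x224 down to 7x7."""
    def __init__(self):
        super().__init__()
        chans = [1, 16, 32, 64, 128, 256, 256, 256]   # illustrative progression
        layers = []
        for i in range(7):
            layers += [nn.Conv2d(chans[i], chans[i + 1], 3, stride=1, padding=1),
                       nn.ReLU(inplace=True)]
            if i < 5:                                  # 224 -> 112 -> 56 -> 28 -> 14 -> 7
                layers.append(nn.MaxPool2d(2))
        self.net = nn.Sequential(*layers)

    def forward(self, depth):                          # depth: (B, 1, 224, 224)
        return self.net(depth)                         # -> (B, 256, 7, 7)


class RGBBranchStandIn(nn.Module):
    """Placeholder for the pretrained YOLOv8-nano backbone used in the paper."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(7),                   # match the 7x7 backbone output
        )

    def forward(self, rgb):                            # rgb: (B, 3, 224, 224)
        return self.net(rgb)                           # -> (B, 256, 7, 7)
```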
The proprioceptive component plays a crucial role in furnishing the robot with self-awareness by extracting pertinent information regarding its own actions. It encompasses the robot's upper-body joint positions, locomotion velocities (including linear, angular, and sideways velocities), and motor currents in the arm joints (shoulders and elbows), which provide insight into the effort exerted by these joints. Figure 3 shows the degrees of freedom of each joint controlled by XBG in the upper-body of the robot and the walking velocities signal for the walking controller in charge of the lower-body joints. These inputs are fed into two fully-connected layers, each comprising 512 neurons with a ReLU activation function, obtaining a robot state feature tensor.
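A minimal sketch of this encoder, assuming the joint positions, walking velocities, and arm motor currents are flattened into a single state vector; `state_dim` is a placeholder, since the exact input dimension is not given here.

```python
# Hypothetical sketch of the proprioceptive component (Fig. 2b).
import torch.nn as nn


class ProprioceptiveEncoder(nn.Module):
    def __init__(self, state_dim: int):
        super().__init__()
        # Two fully-connected layers of 512 neurons with ReLU, as described.
        self.net = nn.Sequential(
            nn.Linear(state_dim, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 512), nn.ReLU(inplace=True),
        )

    def forward(self, state):          # state: (B, state_dim)
        return self.net(state)         # -> (B, 512) robot-state features
```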
Fig. 3: XBG proprioceptive input/output signals. XBG receives as input the upper-body joint positions, the motor currents from the shoulders and elbows, and the walking velocities. It outputs the joint positions for the same upper-body joints and the walking velocities for the walking controller.
XBG employs a mid-fusion strategy to fuse the various modalities within the system (Figure 2c). It starts by stacking the channels from the RGB and depth branches, forming a 7x7x512 tensor that passes through a convolutional layer with a kernel size of 7. This explicitly provides some spatial relationship between the RGB and the depth information and also reduces the dimensionality of the exteroceptive component, yielding an already flattened 512x1x1 tensor. This tensor contains the exteroceptive features, which are concatenated with the proprioceptive features to obtain a 1024-length tensor that constitutes the whole set of spatial features for a single instant of time.
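The fusion step can be summarised with the following sketch, which assumes each exteroceptive branch outputs a 256x7x7 feature map as described above.

```python
# Hypothetical sketch of the mid-fusion step (Fig. 2c).
import torch
import torch.nn as nn


class ModalityFusion(nn.Module):
    def __init__(self):
        super().__init__()
        # 7x7 convolution reduces the stacked 512x7x7 map to 512x1x1.
        self.reduce = nn.Conv2d(512, 512, kernel_size=7)

    def forward(self, rgb_feat, depth_feat, proprio_feat):
        # rgb_feat, depth_feat: (B, 256, 7, 7); proprio_feat: (B, 512)
        extero = torch.cat([rgb_feat, depth_feat], dim=1)   # (B, 512, 7, 7)
        extero = self.reduce(extero).flatten(1)             # (B, 512)
        return torch.cat([extero, proprio_feat], dim=1)     # (B, 1024) per time step
```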
Thanks to the exteroceptive and proprioceptive components, the robot possesses a set of spatially related features that map the external observation of the environment and the internal robot state. To construct a successful autonomous system, it is crucial to account for both the spatial and temporal dimensions, specifically understanding how the environment evolves over time and how the robot reacts accordingly. To capture these temporal relationships, XBG utilizes a recurrent neural network layer, specifically a Long Short-Term Memory (LSTM) with a hidden size of 512, and considers a time window of 1.6 seconds during training. Finally, the network includes a non-linear fully connected layer followed by a linear one, forming the network's head. This allows us to decode the extracted features directly into the joint positions for the entire upper body and the locomotion velocities for the walking controller in charge of the lower body of the robot.
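A sketch of the temporal component and output head follows; the output size `n_joints + 3` (upper-body joint positions plus linear, angular, and sideways walking velocities) is an assumption based on the signals in Figure 3, and the sequence length T is whatever number of frames covers the 1.6 s training window.

```python
# Hypothetical sketch of the temporal component and head (Fig. 2d).
import torch
import torch.nn as nn


class TemporalHead(nn.Module):
    def __init__(self, n_joints: int):
        super().__init__()
        # LSTM over the per-frame 1024-d fused features, hidden size 512.
        self.lstm = nn.LSTM(input_size=1024, hidden_size=512, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(512, 512), nn.ReLU(inplace=True),   # non-linear FC layer
            nn.Linear(512, n_joints + 3),                 # linear output layer
        )

    def forward(self, fused_seq, hidden=None):
        # fused_seq: (B, T, 1024), e.g. T frames covering ~1.6 s
        out, hidden = self.lstm(fused_seq, hidden)
        # Decode the last time step into joint positions and walking velocities.
        return self.head(out[:, -1]), hidden
```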
Handwave: The human raises both arms and waves them to greet the robot. The robot should raise its arms to greet back (in some examples only one arm was raised during the interaction).
Handshake: The human approaches the robot from different angles and extends their hand to shake hands with the robot. The robot should extend one of its arms until touching the human's hand to shake hands.
Payload Reception: The human approaches the robot from different angles and extends a payload to the robot. The robot should extend its arms to receive the payload and then hold it until the human takes it back.
Walk: The human walks away from the robot while moving both arms towards their body, as if calling the robot, signaling the robot to walk and follow the person.
Walk with Payload: Once the human delivers a payload to the robot, they can signal the robot to walk while carrying the payload, using a movement similar to the one described in the "Walk" behaviour. The robot should be able to walk while balancing the payload and not letting it fall.
In this experiment, we aimed to evaluate the model's ability to transition seamlessly between different behaviours, simulating real-world scenarios where the robot may need to perform various tasks consecutively without explicit instruction. We randomly selected a sequence of 30 interactions from our set of scenarios and executed them one after another. The robot autonomously determined how to behave based on its understanding of the environment and the changing context, successfully completing 70% of these scenarios.