Abstract
We introduce a novel dataset for multi-robot activity recognition (MRAR) using two robotic arms integrating WiFi channel state information (CSI), video, and audio data. This multimodal dataset utilizes signals of opportunity, leveraging existing WiFi infrastructure to provide detailed indoor environmental sensing without additional sensor deployment. Data were collected using two Franka Emika robotic arms, complemented by three cameras, three WiFi sniffers to collect CSI, and three microphones capturing distinct yet complementary audio data streams. The combination of CSI, visual, and auditory data can enhance robustness and accuracy in MRAR. This comprehensive dataset enables a holistic understanding of robotic environments, facilitating advanced autonomous operations that mimic human-like perception and interaction. By repurposing ubiquitous WiFi signals for environmental sensing, this dataset offers significant potential to advance robotic perception and autonomous systems. It provides a valuable resource for developing sophisticated decision-making and adaptive capabilities in dynamic environments.
Background & Summary
Signals of opportunity refer to the use of pre-existing, non-dedicated signals in the environment for secondary purposes beyond their original intent. WiFi signals, for instance, are primarily used for communication, but can also provide valuable information about the environment through channel state information (CSI). This data captures intricate details about the propagation of WiFi signals, including reflections, scattering, and absorption caused by objects and activities in the environment. By repurposing these ubiquitous WiFi signals, we can achieve comprehensive indoor environmental sensing without the need for additional sensor infrastructure.
Combining WiFi CSI with video and audio data creates a powerful multimodal system that significantly enhances activity recognition capabilities. Video data provides rich visual information, capturing spatial and temporal changes in the environment. Audio data adds another layer of contextual information, identifying auditory cues that correlate with specific activities. Integrating these modalities with CSI data leads to a more holistic understanding of the environment, enabling more accurate and reliable multi-robot activity recognition (MRAR). The true potential of such integration lies in the broader concept of sensor fusion. Sensor fusion combines data from multiple sensors to achieve a more accurate and comprehensive understanding of an environment than could be obtained from any single sensor. This approach not only improves accuracy but also enhances system resilience and supports robust decision-making across various applications, including robotics and autonomous systems. For example, in scenarios where visual information is compromised due to obstructions or lack of line-of-sight, CSI data can detect movements and objects without requiring direct visual contact, compensating for the limitations of video and audio sensors. Such multimodal sensor fusion is critical for complex and dynamic environments, leveraging the complementary strengths of different sensing modalities. The effectiveness of this approach has been demonstrated in various applications, such as multi-sensor fusion for autonomous vehicle tracking, which integrates diverse sensor inputs to enhance object detection and tracking performance1,2.
The use of signals of opportunity, such as WiFi CSI, offers several key advantages. Firstly, it leverages existing infrastructure, reducing the cost and complexity associated with deploying additional sensors. This makes it an economically viable option for large-scale implementations. Secondly, the multimodal nature of the sensing network enhances robustness by compensating for the limitations of individual sensors. For instance, in scenarios where visual data may be obscured or audio data may be noisy, the complementary information from CSI can help maintain accurate activity recognition.
In recent decades, the integration of robotics and automated systems into various sectors has markedly transformed human life and industrial processes3. The advent of multi-agent systems, where multiple robots collaborate, has subtly yet significantly enhanced task efficiency and adaptability4,5. In healthcare, robots perform complex surgeries with unprecedented precision, reducing recovery times and improving outcomes. In manufacturing, automated assembly lines and robotic arms ensure consistent quality and high productivity, revolutionizing production techniques. Additionally, robots are deployed in hazardous environments, like disaster sites6, minimizing human risks.
In the realm of robotics, the integration of multimodal learning systems represents a significant leap towards achieving autonomous operations that closely mimic human-like perception and interaction with the environment7. This paper explores the application of such an advanced learning paradigm through the deployment of two Franka Emika robotic arms, a choice inspired by their versatility and precision in complex tasks8. To enrich the sensory framework essential for multimodal learning, we employ a comprehensive array of sensors: three cameras, three WiFi sniffers, and three microphones. Each modality is strategically chosen to capture distinct yet complementary data streams (visual, wireless signal-based, and auditory), thereby enabling a more holistic understanding of the robotic arms' surroundings and activities. This multi-sensory approach not only facilitates the robust perception required for intricate manipulations and interactions but also paves the way for groundbreaking advancements in autonomous robotic systems capable of sophisticated decision-making and adaptation in dynamic environments.
The MNIST dataset9, featuring handwritten digits classified into ten categories, was first introduced by LeCun et al. in 1998. At that time, the significant advancements and performance of today's deep learning techniques were unimaginable. Despite the current extensive capabilities of deep learning, the simple MNIST dataset remains the most widely used benchmark in the field, even surpassing CIFAR-1010 and ImageNet11 in popularity according to Google Trends. Its simplicity has not diminished its usage, despite some in the deep learning community advocating for its decline12.
In this paper, we introduce the RoboMNIST dataset, an innovative extension of the traditional MNIST dataset tailored for robotic applications. This dataset features two Franka Emika robots writing different digits on an imaginary plane within a 3D environment. Our sensor-rich modules, comprising CSI, video, and audio, capture comprehensive data from the environment. We have validated the dataset’s integrity across individual data modalities through a series of experiments.
Methods
The data collection process was conducted in a laboratory, featuring desks, chairs, monitors, and various other office objects in the environment. The layout of the laboratory, along with its physical dimensions, is illustrated in Fig. 1.
Hardware Specifications & Communication
In all the experiments, we used three sensor-rich modules positioned in the room to capture data while the two robotic arms performed different activities. Each sensor-rich module is capable of simultaneously capturing three modalities, namely CSI, video, and audio, from the environment. Each module is equipped with the following hardware:
-
CSI: A Raspberry Pi 4 Model B, integrated with the Nexmon project13, which passively captures the CSI data.
-
Video: A ZED 2 Stereo Camera for video recording.
-
Audio: A CG CHANGEEK Mini USB Microphone, featuring omni-directional directivity, to record audio.
Figure 2 depicts the sensor-rich modules used in our data collection setup. We use M ∈ {1, 2, 3} to denote the modules based on the numbering notation in Fig. 1 in the rest of the paper. To facilitate the collection of CSI data, an Apple Mac Mini equipped with the 802.11ax WiFi 6 standard served as the WiFi transmitter.
Figure 3 shows the Franka Emika Panda robot used in this dataset and its kinematic parameters according to the Denavit-Hartenberg convention. This robot has seven revolute joints, with the angle of joint i defined in radians as qi for i ∈ {1, 2, … , 7}. The Franka Emika robotic arm is a collaborative robot (cobot) designed to work safely alongside humans in various environments, ranging from industrial settings to direct interaction scenarios. Unlike conventional industrial robots, which are typically enclosed for safety reasons, the Franka Emika arm can perform tasks in close proximity to people without posing a hazard8. This capability makes it ideal for operations that require direct physical interaction, such as drilling, screwing, polishing, and a wide range of inspection and assembly tasks. The Franka Emika robotic arm provides a 3 kg payload capacity and a reach of 850 mm. The robot weighs approximately 18 kg and its repeatability is 0.1 mm. Repeatability is a measure of the ability of the robot to consistently reach a specified point.
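To make the role of the kinematic parameters concrete, the following Python sketch composes per-joint homogeneous transforms to map the joint angles q1 through q7 to an end effector pose. The DH parameter values are left as placeholders to be filled in from Fig. 3 or the Franka documentation25; note that Franka's documentation uses the modified (Craig) DH convention, whereas the classic form is shown here for simplicity.

import numpy as np

def dh_transform(a, d, alpha, theta):
    """Homogeneous transform of one joint under the classic DH convention."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    ct, st = np.cos(theta), np.sin(theta)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(q, dh_params):
    """Compose the seven joint transforms; returns the 4x4 end effector pose."""
    T = np.eye(4)
    for (a, d, alpha), theta in zip(dh_params, q):
        T = T @ dh_transform(a, d, alpha, theta)
    return T

# Placeholder (a, d, alpha) triples for the seven joints; substitute the values
# from Fig. 3 / the Franka Emika documentation.
dh_params = [(0.0, 0.0, 0.0)] * 7
q = np.zeros(7)                  # joint angles q1..q7 in radians
ee_position = forward_kinematics(q, dh_params)[:3, 3]  # x-y-z of the end effector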
Experiments
Our dataset is composed of 60 different primary combinations performed by the robotic arms, with activities captured through our sensor-rich modules. It encompasses four variations:
Activity
The Franka Emika robotic arms were programmed to draw the digits 0 through 9 on a vertical imaginary plane, creating ten distinct activity classes. The positions of the end effector for each activity are shown in Fig. 4. We denote the activities performed by the robots as A ∈ {0, 1, …, 9}.
Numbers 0 through 9 are drawn by the robotic arm on a vertical imaginary plane, resulting in 10 distinct classes of activities. The plot shows the end effector trajectories that form these numbers, with the robotic arm and background removed for clarity. For illustration purposes, the initial and final parts of the robot's trajectory, where the robot positions itself from its starting point to the imaginary plane and back, are omitted.
To execute these activities, we generated seven fixed waypoints. These waypoints are depicted in Fig. 5, illustrating their positions in both the joint space and the corresponding end effector space. The activities are performed using the waypoints in the joint position space. For each activity, the robot follows an appropriate sequence of waypoints from waypoint 2 to waypoint 7. The robot always starts at waypoint 1 and returns to it after completing the digit on the imaginary plane.
Each activity involves writing a digit using a robotic arm, defined by seven distinct waypoints. The left plot illustrates the waypoints in the end-effector space, while the right plot represents them in the joint space. Each digit is written by following a specific sequence of waypoints in the joint space, with the robot always starting and ending at waypoint 1. For example, to write the digit 7, the robot follows the sequence {1, 2, 5, 7, 1}.
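A minimal sketch of how such a sequence could be executed in joint space is given below. Only the sequence for digit 7 is stated in the caption above; the dictionary entries for the other digits and the motion command are hypothetical placeholders rather than the controller interface used in this work.

# Only the sequence for digit 7 is given explicitly (Fig. 5 caption); the other
# digits follow analogous sequences over waypoints 2-7, always starting and
# ending at waypoint 1.
DIGIT_TO_WAYPOINTS = {7: [1, 2, 5, 7, 1]}

def execute_digit(digit, joint_waypoints, move_to_joint_position):
    """Step through the joint-space waypoints that trace the given digit.

    joint_waypoints: dict mapping waypoint index -> seven joint angles (radians)
    move_to_joint_position: hypothetical callable issuing the motion command
    """
    for wp in DIGIT_TO_WAYPOINTS[digit]:
        move_to_joint_position(joint_waypoints[wp])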
Robot Number
Indicated by R ∈ {1, 2}, this specifies which of the two available robotic arms is performing the activity based on the numbering notation in Fig. 1.
Robot velocity
Denoted by V ∈ {High, Medium, Low}, this describes the velocity level at which the robot performs the activity. To enforce these velocity levels, we applied the following limits on the maximum acceleration and deceleration of the joint angles (a minimal configuration sketch follows the list):
-
High: A maximum joint acceleration and deceleration of amax = 2.5 radian/s2 was applied to each joint in this setting. This corresponds to 50% of the robot's maximum allowed acceleration.
-
Medium: A maximum joint acceleration and deceleration of amax = 2.0 radian/s2 was applied to each joint in this setting. This corresponds to 40% of the robot's maximum allowed acceleration.
-
Low: A maximum joint acceleration and deceleration of amax = 1.5 radian/s2 was applied to each joint in this setting. This corresponds to 30% of the robot's maximum allowed acceleration.
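The mapping from velocity level to acceleration limit can be summarized as follows; the set_acceleration_limit call is a hypothetical stand-in for the actual controller interface, and the 5.0 radian/s2 ceiling is simply implied by the stated percentages (e.g., 2.5 radian/s2 corresponds to 50%).

# Acceleration/deceleration limits per velocity level (radian/s^2), as listed above.
ACCEL_LIMITS = {"High": 2.5, "Medium": 2.0, "Low": 1.5}
MAX_ALLOWED_ACCEL = 5.0  # implied by the stated percentages

def configure_velocity_level(robot, level):
    a_max = ACCEL_LIMITS[level]
    print(f"{level}: {a_max} rad/s^2 ({100 * a_max / MAX_ALLOWED_ACCEL:.0f}% of max)")
    robot.set_acceleration_limit(a_max)  # hypothetical controller call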
Motion uncertainty
Denoted by \(U\in {{\mathbb{R}}}^{+}\), where \({{\mathbb{R}}}^{+}\) represents the positive real numbers, this measures the L2-norm error of the end effector's position relative to its intended ground truth trajectory over time.
The Franka Emika robotic arm's repeatability of 0.1 mm ensures that, left unperturbed, the robot performs a nearly identical motion for each activity. To make the dataset realistic, we manually and purposefully introduced a layer of uncertainty into the robots' motion so that different repetitions in our dataset are not identical. This approach simulates the variability of handwritten digits produced by humans, which are inherently non-identical. To achieve this, we added multiplicative uniform noise \(\epsilon \sim {\mathcal{U}}(0.7,1.3)\) to the first three joints of the robot (q1, q2, q3) for each waypoint in the sequence, except for the first waypoint. This ensures that the start and end positions of the robot remain consistent. We chose the first three joints because they have the most significant effect on the robots' final trajectory, while the remaining four joints primarily influence the orientation of the robot's end effector. Fig. 6 illustrates the impact of these uncertainties on the first three joints and the end effector position of the robot in different repetitions of the same activity. With this approach, the effective variability of the robot's motion increases to an average of 32 cm. To illustrate this, Fig. 7 presents the maximum distance between the robot's trajectory and the ground truth across all repetitions in the dataset.
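A minimal sketch of this perturbation and of one plausible reading of the uncertainty metric U is given below, assuming the waypoints are stored as arrays of seven joint angles; the exact implementation used for the dataset is available in the GitHub repository.

import numpy as np

rng = np.random.default_rng()

def perturb_waypoints(waypoint_sequence):
    """Multiplicative uniform noise U(0.7, 1.3) on joints q1-q3 of every waypoint
    except waypoint 1, which opens and closes the sequence and stays fixed."""
    noisy = []
    for i, q in enumerate(waypoint_sequence):
        q = np.asarray(q, dtype=float).copy()
        if 0 < i < len(waypoint_sequence) - 1:      # keep the start/end waypoint unchanged
            q[:3] *= rng.uniform(0.7, 1.3, size=3)  # noise on q1, q2, q3 only
        noisy.append(q)
    return noisy

def motion_uncertainty(ee_traj, ee_ground_truth):
    """One plausible reading of U: the per-timestep L2 position error of the end
    effector (arrays of shape T x 3), aggregated here as the maximum over time."""
    errors = np.linalg.norm(np.asarray(ee_traj) - np.asarray(ee_ground_truth), axis=1)
    return float(errors.max())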
The effect of uncertainty on the end effector position (left) and on the joint positions (right) of the robot. Each color represents the result of a repetition where the robot performs the activity with A = 8, R = 1, and V = High. Only the first three joints are shown in this figure, as no uncertainty is applied to the remaining joints. Additionally, waypoint 1 has no added uncertainty, ensuring that the robot always starts and ends in the same position.
The combination of ten activities, two robots performing these activities, and three velocity levels results in a total of 60 unique primary combinations. For each primary combination, we have collected 32 repetitions. Each repetition spans 15 seconds, during which the robot performs the action with the activity, robot arm, and velocity held fixed for that combination, while incorporating motion uncertainty. This introduces deviations across repetitions as the robot writes on the imaginary plane, adding a realistic layer of complexity to the dataset. Fig. 8 shows the uncertainty in motion of all the repetitions in four of the primary combinations as a sample projected on a 2D imaginary plane.
Motion uncertainty of all the repetitions in four of the primary combinations as a sample. All with R = 1, and V = High, but with different values of A. For illustration purposes, the path is projected on a 2D imaginary plane and the initial and final parts of the robot’s trajectory, where the robot positions itself from its starting point to the imaginary plane and back, are omitted.
WiFi CSI Modality
As wireless signals propagate, they encounter various obstacles in the environment, leading to reflections and scattering, a phenomenon known as multipath fading14. WiFi CSI facilitates the analysis of subcarrier propagation from the transmitter to the receiver in wireless communications15. The channel model is represented as
\({\bf{y}}={\bf{H}}{\bf{x}}+{\boldsymbol{\eta }},\)
where x, y, and η denote the transmitted signal vector, received signal vector, and additive noise vector, respectively15. The channel matrix \({\bf{H}}\in {{\mathbb{C}}}^{T\times S}\) encapsulates the characteristics of the wireless channel, including multipath propagation, fading, and other impairments, and is defined as
\({\bf{H}}={\left[{h}_{t,s}\right]}_{t=1,\ldots ,T;\,s=1,\ldots ,S},\)
where S and T represent the number of subcarriers for each antenna and the number of transmitted packets, respectively. Each element of the matrix H corresponds to a complex value, known as the channel frequency response, and is given by
\({h}_{t,s}={a}_{s}{e}^{j{\phi }_{s}},\)
where as and ϕs denote the amplitude and phase of subcarrier s at timestamp t, respectively. For human activity recognition (HAR)16,17,18 and robot activity recognition (RAR)19,20,21, studies primarily focus on \({\bf{A}}\in {{\mathbb{R}}}^{T\times S}\), which corresponds to the element-wise amplitude of H, disregarding the phase component.
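In practice, the amplitude matrix A is obtained by taking the element-wise magnitude of the complex CSI; the short sketch below illustrates this for a T × S complex matrix (the random values stand in for measured CSI).

import numpy as np

# H: complex CSI matrix of shape (T, S); in this dataset each sensor-rich module
# yields 450 packets x 256 subcarriers per repetition.
H = np.random.randn(450, 256) + 1j * np.random.randn(450, 256)  # stand-in data

A = np.abs(H)      # element-wise amplitude, shape (T, S)
phi = np.angle(H)  # element-wise phase, typically discarded in HAR/RAR studies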
For each 15-second repetition, we collected CSI measurements at a 30 Hz frequency over an 80 MHz bandwidth, which yields 256 subcarriers at each timestamp. This resulted in a 450 × 256 complex matrix for each sensor-rich module. These matrices are stored in a json file. Additionally, we included received signal strength (RSS) information for each timestamp as well. Fig. 9 displays a sample plot of the amplitudes of a CSI matrix.
Video Modality
For each 15-second repetition, we collected video measurements at a frequency of 30 Hz, synchronized with the CSI measurements, using three sensor-rich modules, each containing a stereo camera. For each sample, three ZED 2 stereo cameras simultaneously recorded RGB videos at a resolution of 2560 × 720 pixels, with the frames from the left and right lenses of each stereo camera horizontally concatenated. This setup allowed us to capture three different views of the same action. Each camera, equipped with two lenses, provided stereo video, resulting in a dataset that includes 3 (cameras) × 2 (lenses) = 6 videos for each repetition.
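Because the left and right views are concatenated side by side in each 2560 × 720 frame, they can be recovered by splitting each frame down the middle; a minimal OpenCV sketch follows (the file name is illustrative, not the exact naming convention of the dataset).

import cv2

cap = cv2.VideoCapture("example_repetition_Rx1_cam.mp4")  # illustrative file name
while True:
    ok, frame = cap.read()        # frame shape: (720, 2560, 3)
    if not ok:
        break
    half = frame.shape[1] // 2    # 1280 pixels per lens
    left, right = frame[:, :half], frame[:, half:]
    # ... process the 1280 x 720 left/right views here ...
cap.release()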
Audio Modality
Audio signals are continuous waveforms that represent sound waves in a format that can be processed by digital systems. These waveforms are characterized by frequency, amplitude, and phase, which contain rich information about the environment and the sources of sound. In the context of activity recognition, audio signals offer a non-intrusive and cost-effective means to infer activities and interactions. By analyzing the acoustic patterns and variations over time, it is possible to identify specific activities based on the distinct sounds associated with each activity. Advanced machine learning algorithms, particularly those leveraging deep learning, have demonstrated significant success in classifying and recognizing activities from audio data. These algorithms extract useful information, such as the frequency and amplitude of the sound wave over time, to analyze and predict activities22,23.
In this paper, we employ auditory perception as many robot activities produce characteristic sounds from which we can effectively infer corresponding actions. We view audio not as a replacement but as a complement to existing sensory modalities. By fusing audio with other sensory data, we aim to achieve particularly robust activity recognition across a wide range of conditions. This multimodal approach enhances the accuracy and reliability of activity recognition systems, making them more effective in diverse environments.
For each 15-second repetition, we collected audio measurements from each sensor-rich module at a sampling rate of 44,100 Hz, while the start and end times were synchronized with our other modalities. To analyze the audio data, we preprocess the audio files by computing their spectrograms. While we delve more into the calculation of the audio spectrogram in the Technical Validation section, a sample spectrogram plot of one of the captured audio signals is shown in Fig. 10.
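A minimal sketch of the spectrogram computation for one 15-second recording is given below using scipy; the window length and overlap are illustrative choices, not necessarily those used in the Technical Validation section, and the file name is a placeholder.

import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, samples = wavfile.read("example_repetition_Rx1_mic.wav")  # fs = 44,100 Hz
if samples.ndim > 1:
    samples = samples.mean(axis=1)  # fold multi-channel audio to mono if needed

# Illustrative STFT parameters: ~23 ms Hann windows with 50% overlap.
freqs, times, Sxx = spectrogram(samples.astype(float), fs=fs,
                                window="hann", nperseg=1024, noverlap=512)
log_spec = 10 * np.log10(Sxx + 1e-12)  # spectrogram in dB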
True Trajectory
For each 15-second repetition, we provided the true joint positions of the robots at a frequency of 30 Hz, synchronized with the CSI and video measurements. At each timestamp, each joint position, consisting of q1 to q7 and corresponding to the seven revolute joints of the robot, has been stored in radians in a json file. For convenience, the x − y − z position of the end effector (EE) in the Cartesian coordinate system is also provided, presented in the robot frame aligned with the x0 and z0 axes and following the right-hand rule, as shown in Fig. 3. Fig. 11 displays a sample plot of the end effector's Cartesian position of one of the robots performing an activity.
Plot of the end effector’s position of the robot during a repetition with A = 2, R = 1, and V = High. The black trajectory represents the robot drawing the number on an imaginary plane, while the gray trajectory illustrates the robot’s movement from a fixed starting position to the imaginary plane and back.
Synchronization
During data collection, each module incorporates local timestamping directly on the hardware. The timestamped data is subsequently transmitted across the network for further processing. Although this configuration effectively captures data from individual modules, synchronizing timestamps is critical for deployments involving multiple modules. To address this challenge, we developed a method whereby packets containing CSI, video, and audio data, each with its own timestamp, are redirected to a specialized system known as the monitor.
The monitor serves as a central hub, collecting packets from various modules and assigning synchronized timestamps to the data. A visual depiction of this intercommunication process is presented in Fig. 12. Notably, according to the Nexmon project’s specifications, Raspberry Pis configured as CSI sniffers forfeit their WiFi communication capabilities. To overcome this restriction, we connected the sniffers and the monitor using Ethernet cables, thus ensuring uninterrupted communication between them.
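The restamping idea can be illustrated with a simplified sketch of a monitor process that receives timestamped packets from the modules over the wired network and attaches a common reference time; this is an illustration of the concept, not the implementation used for the dataset (see the GitHub repository for the actual code).

import json
import socket
import time

def run_monitor(host="0.0.0.0", port=9000):
    """Accept packets from a sensor-rich module and attach a monitor timestamp."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, port))
    srv.listen()
    conn, _ = srv.accept()
    with conn, conn.makefile("r") as stream:
        for line in stream:                       # one JSON packet per line (illustrative framing)
            packet = json.loads(line)             # carries the module's local timestamp
            packet["monitor_time"] = time.time()  # common reference clock
            # ... store or forward the synchronized packet ...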
Using this configuration, the CSI, the video, the true trajectories of the robots, and the start and end times of the audio are synchronized with each other and across all the sensor-rich modules, providing a comprehensive dataset for multi-modal passive MRAR. Fig. 13 compactly shows the different synchronized modalities in an experiment.
Plot of the three modalities and the robot’s true trajectory in a repetition with A = 0, R = 1, and V = High, captured by module M = 1. The video, CSI, true trajectory, and start and end time of the audio are synchronized. For illustration purposes, the initial and final parts of the robot’s movement, where it positions itself from its starting point to the imaginary plane and back, are omitted.
Figure 14 shows MNIST formatted plots of the 10 activities and different repetitions based on our modalities.
Data Records
The dataset is available for download from our Figshare repository24.
Based on variations in activity, robot, and velocity, we have 60 primary combinations. Each combination has a dedicated folder in the dataset, named according to the standard described in Fig. 15(a). Within each primary combination, there are at least 32 repetitions, all with 15-second duration and specific R, V, and A variations, differing only by assigned motion uncertainty. Each repetition contains 8 files as listed in Table 1, following the naming convention outlined in Fig. 15(b).
Naming standard for folders and files in our dataset. The naming standard for folders, dedicated to each primary combination, specifies the robot performing the activity (denoted by R), the velocity of the activity (denoted by V), and the specific activity being performed (denoted by A). The naming standard for each file within a folder shares the same values for A, V, and R, and varies by the robot’s motion uncertainty (denoted by U). The value of U is always a floating-point number with two decimal places. For example, UNC258 represents U = 2.58. The Rx indicates the module from which the data is sourced; it can be a specific number or “all,” where “all” signifies that data from all sensors/robots are collected in the same file.
WiFi CSI Description
This section describes the structure of the CSI files, which end with csi.json. Each repetition is represented by a single CSI file, which contains the CSI data for all three sensor-rich modules. These CSI files are in json format and consist of an array of three json objects. Each json object corresponds to one of the sensor-rich modules. Within each json object, the data is organized as a set of key-value pairs, as detailed in Table 2.
Video Description
This section describes the structure of the video files, identified by the cam.mp4 extension. Each repetition corresponds to three video files, one for each sensor-rich module. Each video file is in mp4 format, where each frame consists of the horizontally concatenated left and right frames of the lenses of the stereo camera, resulting in a final frame with dimensions of 2560 × 720. Each frame is synchronized with the timestamps provided in the CSI file of the same repetition.
Audio Description
This section describes the structure of the audio files, identified by the mic.wav extension. Each repetition corresponds to three audio files, one for each sensor-rich module. Each audio file is in wav format and has been recorded for the 15-second duration of the repetition.
True Robot Trajectory Description
This section describes the structure of the position files, which have the pos.json extension. Each repetition corresponds to one position file containing the position information of both robots. Even though only one robot is moving in each repetition, as indicated by the file name, we have included the position data for both robots. Each position file is in json format and consists of an array of two json objects, one for each robot. Each json object is a set of key-value pairs, as detailed in Table 3.
Technical Validation
In this section, we validate our dataset across various aspects, including robot configuration, uncertainty, and the synchronization of CSI, video and audio.
Joint Positions
The documentation of the Franka Emika robot25 provides the safety specifications for the joint positions of each of the seven joints. The safe operating range for the joints is summarized in Table 4.
Figure 16 shows the joint positions, in radians, used throughout our entire dataset for both robots separately. As evident from the figure, all joint positions in the dataset fall within the safe operating range of the robots.
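A check of this kind can be reproduced with a simple range test over the recorded joint positions; the limit values below are approximate figures from the public Franka documentation and should be replaced with the authoritative values in Table 4 / ref. 25.

import numpy as np

# Approximate joint position limits (radians); see Table 4 for the authoritative values.
Q_MIN = np.array([-2.8973, -1.7628, -2.8973, -3.0718, -2.8973, -0.0175, -2.8973])
Q_MAX = np.array([ 2.8973,  1.7628,  2.8973, -0.0698,  2.8973,  3.7525,  2.8973])

def joints_within_limits(q_traj):
    """q_traj: array of shape (T, 7) with joint angles in radians."""
    q_traj = np.asarray(q_traj)
    return bool(np.all(q_traj >= Q_MIN) and np.all(q_traj <= Q_MAX))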
The joint positions for each robot are shown using filled circles, with each joint represented separately in radians. For each joint, all positions used throughout the dataset are displayed. The boxes indicate the maximum and minimum values allowed based on the specifications of the robots. Given the similarity in actions performed by both robots, their joint position ranges are nearly identical and closely aligned.
Joint Velocities
The documentation of the Franka Emika robot also provides the safety specifications for the joint velocities of each of the seven joints. The safe operating velocity ranges for the joints are summarized in Table 5.
Figure 17 shows the joint velocities, in radian/s, used throughout our entire dataset for both robots separately. As evident from the figure, the joint velocities in the dataset fall within the safe operating range of the robots.
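Since the true joint positions are provided at 30 Hz, the joint velocities can be approximated by finite differences and tested against the documented limits; the limit values below are approximate figures from the public Franka documentation, with the authoritative values in Table 5.

import numpy as np

FS = 30.0  # Hz, sampling rate of the true joint trajectories

# Approximate joint velocity limits (radian/s); see Table 5 for the authoritative values.
DQ_MAX = np.array([2.1750, 2.1750, 2.1750, 2.1750, 2.6100, 2.6100, 2.6100])

def joint_velocities(q_traj):
    """Finite-difference joint velocities from a (T, 7) array of joint angles."""
    return np.gradient(np.asarray(q_traj, dtype=float), 1.0 / FS, axis=0)

def velocities_within_limits(q_traj):
    return bool(np.all(np.abs(joint_velocities(q_traj)) <= DQ_MAX))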
Figure 18 shows the linear velocity of the robot’s end effector for the different velocity levels used in this dataset. It illustrates how the High, Medium, and Low velocity levels affect the motion of the end effector. As expected, the velocity is higher when using the High setting compared to the Medium setting, resulting in the activity being completed more quickly. A similar pattern is observed when comparing the Medium and Low velocity levels. When comparing the velocity levels, it is important to note that, for instance, the Medium velocity exhibits a lag in motion compared to the High velocity. This lag should be taken into account when analyzing and comparing the velocities.
The linear velocity of the end effector over time for different velocity levels: High, Medium, and Low. This comparison is based on repetitions with A = 7 and R = 1. The solid lines represent the average velocity across all repetitions, while the shaded areas around them indicate the variance. As expected, the High velocity is consistently greater than the Medium velocity, and the Medium velocity exceeds the Low velocity. In A = 7, the robot follows five waypoints, corresponding to four motions between these waypoints. Each peak in the plot represents one of these motions. It is clear that the different velocity levels behave as expected, with the High velocity showing higher peaks compared to Medium, and Medium showing higher peaks compared to Low. Additionally, the High velocity completes the activity faster than Medium, and Medium finishes sooner than Low.
Motion Uncertainty
As mentioned in the Methods section, we manually added multiplicative uniform noise \(\epsilon \sim {\mathcal{U}}(0.7,1.3)\) to the first three joints of the robot (q1, q2, q3) for each waypoint in the sequence, except for waypoint 1. Fig. 19 shows the actual samples for each waypoint of the first three joints. The filled boxes represent the interquartile range (IQR) of the noisy waypoints, while the whiskers extend to 1.5 times the IQR. The maximum range of uncertainty for each waypoint is illustrated using a dashed rectangle. It is evident that the noisy waypoints consistently fall within the range defined by the specified uncertainty.
The sampled waypoints with added uncertainty are plotted for each waypoint, using all repetitions for A = 8, R = 1, and V = High. We chose A = 8 because it traverses all the waypoints multiple times, providing a representative example of the entire dataset. The sampled waypoints with uncertainty are visualized using a box plot, where the boxes represent the IQR and the whiskers extend to 1.5 times the IQR. The maximum allowed uncertainty is indicated with dashed rectangles, showing that the noisy waypoints consistently fall within this range.
CSI Synchronization
To validate the synchronization of WiFi signals captured by the three sniffers, we analyzed the normalized cross-correlation of the WiFi CSI amplitude signals. This metric measures the similarity of the signal patterns and verifies that the data streams are temporally aligned across the devices.
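The same measure is used for the video and audio checks in the following subsections. A minimal sketch for two equally long one-dimensional signals is given below; taking the peak of the lag-normalized cross-correlation is one common definition, and the choice of feature (e.g., the per-packet mean CSI amplitude across subcarriers from two sniffers) is an assumption here.

import numpy as np

def normalized_cross_correlation(x, y):
    """Peak of the normalized cross-correlation between two 1-D signals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = (x - x.mean()) / (x.std() * len(x))
    y = (y - y.mean()) / y.std()
    return float(np.max(np.correlate(x, y, mode="full")))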
Figure 20 presents the normalized cross-correlation of each pair of sniffers for each repetition. Table 6 summarizes the statistics of the normalized cross-correlation values. As observed, the average normalized correlation values for all sniffer pairs exceed 0.923, with a minimum value of 0.874 and a maximum value of 0.966. The IQR values are also small (e.g., 0.0118 to 0.0145), indicating low variability among the central data points. These results confirm that the CSI signals are highly linearly correlated, demonstrating strong synchronization across the sniffers.
Video Synchronization
To validate the synchronization of the video recordings captured by the three cameras, we analyzed the normalized cross-correlation of pixel intensity values extracted from the video frames. This metric measures the similarity of pixel patterns across the video streams, ensuring that the cameras are temporally aligned. The resulting values show consistent similarity across the dataset, demonstrating stable temporal alignment between the cameras and reliable synchronization.
Figure 21 illustrates the normalized cross-correlation values across all repetitions. The x-axis denotes the repetitions, while the y-axis represents the magnitude of the correlation. These results validate the consistent synchronization of the video data across the three cameras, demonstrating strong alignment. Building on these observations, Table 7 presents key statistical metrics for the normalized cross-correlation values, including the mean, standard deviation, minimum, maximum, and IQR. These metrics quantitatively confirm the strong synchronization across all pairs with minimal variation.
Audio Synchronization
To assess the synchronization of audio signals captured by the microphones, we computed the normalized cross-correlation of their spectrograms. This check verifies that the audio signals recorded by the devices are aligned in time, a key requirement for synchronization.
The normalized cross-correlation results confirm the synchronization of audio signals across the devices. The small standard deviations in the normalized values indicate stable synchronization with minimal deviations across the dataset. Fig. 22 illustrates the normalized cross-correlation for each pair of microphones across all repetitions, with each colored line representing a pair. Table 8 provides a detailed summary of the normalized cross-correlation statistics. As shown, the average normalized correlation values for all microphone pairs are above 0.784, with minimum values ranging from 0.773 to 0.777 and maximum values close to 0.788. With IQR values ranging from 0.000862 to 0.00138, the central data points show minimal variation. These results confirm that audio signals are highly synchronized between devices.
Additionally, Fig. 23 illustrates the data from various sensor modalities, showing the start and end times of the activity performed by the robot during one of the repetitions in the dataset. This figure also validates the synchronization between the different sensors and modules in the dataset.
The plot shows the data from three sensor modules across CSI, audio, and video modalities. For CSI, a heatmap of subcarrier amplitudes is used; for audio, the waveform of the audio signal is presented, showing variations in amplitude over time; and for video, pixel-wise variance (squared differences) between consecutive frames is depicted. The start and end of the activity performed by the robot are indicated by black dashed lines. Comparing the signals from different sensors demonstrates their synchronization in time with respect to each other. This plot is taken from one of the repetitions (a.k.a. samples) in the dataset with A = 0, R = 1, and V = High.
Code availability
The entire dataset is available for download from our Figshare repository24. Interested readers are encouraged to visit our GitHub repository (https://github.com/SiamiLab/RoboMNIST), where example Python notebooks for loading and visualizing our data are provided:
-
wifi_csi_read.ipynb: This Python notebook loads a CSI json file from a repetition and visualizes the CSI amplitudes and RSS values.
-
true_trajectory_read.ipynb: This Python notebook loads a true trajectory json file from a repetition and visualizes the robot's motion.
The repository also includes the complete code, along with detailed explanations of how the robots were controlled, how the data was collected, and how synchronization was achieved across the different modules and sensors, and it provides instructions on how to reproduce these processes.
References
Vinoth, K. & Sasikumar, P. Multi-sensor fusion and segmentation for autonomous vehicle multi-object tracking using deep q networks. Scientific Reports 14, 31130 (2024).
Celik, Y. & Godfrey, A. Bringing it all together: Wearable data fusion. npj Digital Medicine 6, 149 (2023).
Moran, M. E. Evolution of robotic arms. Journal of robotic surgery 1, 103–111 (2007).
Prajapat, M., Turchetta, M., Zeilinger, M. & Krause, A. Near-optimal multi-agent learning for safe coverage control. Advances in Neural Information Processing Systems 35, 14998–15012 (2022).
Hosseini, S. H., Tavazoei, M. S. & Kuznetsov, N. V. Agent-based time delay margin in consensus of multi-agent systems by an event-triggered control method: Concept and computation. Asian Journal of Control 25, 1866–1876, https://doi.org/10.1002/asjc.2814 (2023).
Jorge, V. A. et al. A survey on unmanned surface vehicles for disaster robotics: Main challenges and directions. Sensors 19, 702 (2019).
Duan, S., Shi, Q. & Wu, J. Multimodal sensors and ml-based data fusion for advanced robots. Advanced Intelligent Systems 4, 2200213 (2022).
Haddadin, S. et al. The franka emika robot: A reference platform for robotics research and education. IEEE Robotics & Automation Magazine 29, 46–64 (2022).
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324 (1998).
Krizhevsky, A. et al. Learning multiple layers of features from tiny images. Computer Science University of Toronto (2009).
Deng, J. et al. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (IEEE, 2009).
Xiao, H., Rasul, K. & Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
Schulz, M., Wegemer, D. & Hollick, M. Nexmon: The c-based firmware patching framework. https://nexmon.org (2017).
Yang, Z., Zhou, Z. & Liu, Y. From rssi to csi: Indoor localization via channel response. ACM Computing Surveys (CSUR) 46, 1–32 (2013).
Wang, Z. et al. A survey on human behavior recognition using channel state information. IEEE Access 7, 155986–156024 (2019).
Salehinejad, H. & Valaee, S. Litehar: lightweight human activity recognition from wifi signals with random convolution kernels. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4068–4072 (IEEE, 2022).
Yousefi, S., Narui, H., Dayal, S., Ermon, S. & Valaee, S. A survey on behavior recognition using wifi channel state information. IEEE Communications Magazine 55, 98–104 (2017).
Zheng, Y. et al. Zero-effort cross-domain gesture recognition with wi-fi. In Proceedings of the 17th annual international conference on mobile systems, applications, and services, MobiSys ’19, 313–325 (Association for Computing Machinery, New York, NY, USA, 2019).
Zandi, R., Salehinejad, H., Behzad, K., Motamedi, E. & Siami, M. Robot motion prediction by channel state information. In 2023 IEEE 33rd International Workshop on Machine Learning for Signal Processing (MLSP), 1–6 (IEEE, 2023).
Zandi, R., Behzad, K., Motamedi, E., Salehinejad, H. & Siami, M. Robofisense: Attention-based robotic arm activity recognition with wifi sensing. IEEE Journal of Selected Topics in Signal Processing (2024).
Zandi, R., Behzad, K., Motamedi, E., Salehinejad, H. & Siami, M. Enhancing robotic arm activity recognition with vision transformers and wavelet-transformed channel state information. In 2024 IEEE 35th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Valencia, Spain, 1–6, https://doi.org/10.1109/PIMRC59610.2024.10817193 (IEEE, 2024).
Reinolds, F., Neto, C. & Machado, J. Deep learning for activity recognition using audio and video. Electronics 11, 782 (2022).
Stork, J. A., Spinello, L., Silva, J. & Arras, K. O. Audio-based human activity recognition using non-markovian ensemble voting. In 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication, 509–514 (IEEE, 2012).
Behzad, K., Zandi, R., Motamedi, E., Salehinejad, H. & Siami, M. Robomnist: A multimodal dataset for multi-robot activity recognition using wifi sensing, video, and audio. figshare https://doi.org/10.6084/m9.figshare.28179383 (2024).
Franka Emika. Robot and interface specifications. https://frankaemika.github.io/docs/control_parameters.html. Accessed: 2024-12-27 (2023).
He, Y. & Liu, S. Analytical inverse kinematics for franka emika panda - a geometrical solver for 7-dof manipulators with unconventional design. In 2021 9th International Conference on Control, Mechatronics and Automation (ICCMA), 194–199 (2021).
Franka Emika. Denavit-Hartenberg parameters. https://frankaemika.github.io/docs/control_parameters.html#denavithartenberg-parameters. Accessed: 2024-06-10 (2024).
Acknowledgements
This material is based upon work supported in part by grants ONR N00014-21-1-2431, NSF 2121121, the U.S. Department of Homeland Security under Grant Award Number 22STESE00001-03-02, and by the Army Research Laboratory under Cooperative Agreement Number W911NF-22-2-0001 (KB, RZ, EM, MS). The views and conclusions contained in this document are solely those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Department of Homeland Security, the Army Research Office, or the U.S. Government.
Author information
Authors and Affiliations
Contributions
Kian Behzad contributed significantly to hardware development, the experimental setup, and data collection, managed the GitHub repository for code availability and the Figshare repository for data sharing, and wrote the initial draft of the manuscript with feedback from all authors. Rojin Zandi led Wi-Fi and audio validation, contributed to data collection, and helped develop the experimental setup. Elaheh Motamedi performed video validation, contributed to data collection, and helped develop the experimental setup. Hojjat Salehinejad, as a senior author, contributed to shaping the research and provided critical feedback. Milad Siami conceived the project, secured funding, provided experimental resources, supervised all aspects, and guided the study throughout. All authors reviewed and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Behzad, K., Zandi, R., Motamedi, E. et al. RoboMNIST: A Multimodal Dataset for Multi-Robot Activity Recognition Using WiFi Sensing, Video, and Audio. Sci Data 12, 326 (2025). https://doi.org/10.1038/s41597-025-04636-2