Abstract
We introduce a novel dataset for multi-robot activity recognition (MRAR) using two robotic arms integrating WiFi channel state information (CSI), video, and audio data. This multimodal dataset utilizes signals of opportunity, leveraging existing WiFi infrastructure to provide detailed indoor environmental sensing without additional sensor deployment. Data were collected using two Franka Emika robotic arms, complemented by three cameras, three WiFi sniffers to collect CSI, and three microphones capturing distinct yet complementary audio data streams. The combination of CSI, visual, and auditory data can enhance robustness and accuracy in MRAR. This comprehensive dataset enables a holistic understanding of robotic environments, facilitating advanced autonomous operations that mimic human-like perception and interaction. By repurposing ubiquitous WiFi signals for environmental sensing, this dataset offers significant potential to advance robotic perception and autonomous systems. It provides a valuable resource for developing sophisticated decision-making and adaptive capabilities in dynamic environments.
Background & Summary
Signals of opportunity refer to the use of pre-existing, non-dedicated signals in the environment for secondary purposes beyond their original intent. WiFi signals, for instance, are primarily used for communication, but can also provide valuable information about the environment through channel state information (CSI). This data captures intricate details about the propagation of WiFi signals, including reflections, scattering, and absorption caused by objects and activities in the environment. By repurposing these ubiquitous WiFi signals, we can achieve comprehensive indoor environmental sensing without the need for additional sensor infrastructure.
Combining WiFi CSI with video and audio data creates a powerful multimodal system that significantly enhances activity recognition capabilities. Video data provides rich visual information, capturing spatial and temporal changes in the environment. Audio data adds another layer of contextual information, identifying auditory cues that correlate with specific activities. Integrating these modalities with CSI data leads to a more holistic understanding of the environment, enabling more accurate and reliable multi-robot activity recognition (MRAR). The true potential of such integration lies in the broader concept of sensor fusion. Sensor fusion combines data from multiple sensors to achieve a more accurate and comprehensive understanding of an environment than could be obtained from any single sensor. This approach not only improves accuracy but also enhances system resilience and supports robust decision-making across various applications, including robotics and autonomous systems. For example, in scenarios where visual information is compromised due to obstructions or lack of line-of-sight, CSI data can detect movements and objects without requiring direct visual contact, compensating for the limitations of video and audio sensors. Such multimodal sensor fusion is critical for complex and dynamic environments, leveraging the complementary strengths of different sensing modalities. The effectiveness of this approach has been demonstrated in various applications, such as multi-sensor fusion for autonomous vehicle tracking, which integrates diverse sensor inputs to enhance object detection and tracking performance1,2.
The use of signals of opportunity, such as WiFi CSI, offers several key advantages. Firstly, it leverages existing infrastructure, reducing the cost and complexity associated with deploying additional sensors. This makes it an economically viable option for large-scale implementations. Secondly, the multimodal nature of the sensing network enhances robustness by compensating for the limitations of individual sensors. For instance, in scenarios where visual data may be obscured or audio data may be noisy, the complementary information from CSI can help maintain accurate activity recognition.
In recent decades, the integration of robotics and automated systems into various sectors has markedly transformed human life and industrial processes3. The advent of multi-agent systems, where multiple robots collaborate, has subtly yet significantly enhanced task efficiency and adaptability4,5. In healthcare, robots perform complex surgeries with unprecedented precision, reducing recovery times and improving outcomes. In manufacturing, automated assembly lines and robotic arms ensure consistent quality and high productivity, revolutionizing production techniques. Additionally, robots are deployed in hazardous environments, like disaster sites6, minimizing human risks.
In the realm of robotics, the integration of multimodal learning systems represents a significant leap towards achieving autonomous operations that closely mimic human-like perception and interaction with the environment7. This paper explores the application of such an advanced learning paradigm through the deployment of two Franka Emika robotic arms, a choice inspired by their versatility and precision in complex tasks8. To enrich the sensory framework essential for multimodal learning, we employ a comprehensive array of sensors: three cameras, three WiFi sniffers, and three microphones. Each modality is strategically chosen to capture distinct yet complementary data streams (visual, wireless signal-based, and auditory), thereby enabling a more holistic understanding of the robotic arms' surroundings and activities. This multi-sensory approach not only facilitates the robust perception required for intricate manipulations and interactions but also paves the way for groundbreaking advancements in autonomous robotic systems capable of sophisticated decision-making and adaptation in dynamic environments.
The MNIST dataset9, featuring handwritten digits classified into ten categories, was first introduced by LeCun et al. in 1998. At that time, the significant advancements and performance of today's deep learning techniques were unimaginable. Despite the current extensive capabilities of deep learning, the simple MNIST dataset remains the most widely used benchmark in the field, even surpassing CIFAR-1010 and ImageNet11 in popularity according to Google Trends. Its simplicity has not diminished its usage, despite some in the deep learning community advocating for its decline12.
In this paper, we introduce the RoboMNIST dataset, an innovative extension of the traditional MNIST dataset tailored for robotic applications. This dataset features two Franka Emika robots writing different digits on an imaginary plane within a 3D environment. Our sensor-rich modules, comprising CSI, video, and audio, capture comprehensive data from the environment. We have validated the dataset’s integrity across individual data modalities through a series of experiments.
Methods
The data collection process was conducted in a laboratory, featuring desks, chairs, monitors, and various other office objects in the environment. The layout of the laboratory, along with its physical dimensions, is illustrated in Fig. 1.
Hardware Specifications & Communication
In all the experiments, we used three sensor-rich modules positioned in the room to capture data while the two robotic arms performed different activities. Each sensor-rich module is capable of simultaneously capturing three modalities, namely CSI, video, and audio, from the environment. Each module is equipped with the following hardware:
-
CSI: A Raspberry Pi 4 Model B, integrated with the Nexmon project13, which passively captures the CSI data.
-
Video: A ZED 2 Stereo Camera for video recording.
-
Audio: A CG CHANGEEK Mini USB Microphone, featuring omni-directional directivity, to record audio.
Figure 2 depicts the sensor-rich modules used in our data collection setup. We use M ∈ {1, 2, 3} to denote the modules based on the numbering notation in Fig. 1 in the rest of the paper. To facilitate the collection of CSI data, an Apple Mac Mini equipped with the 802.11ax WiFi 6 standard served as the WiFi transmitter.
Figure 3 shows the Franka Emika Panda robot used in this dataset and its kinematic parameters according to the Denavit-Hartenberg convention. This robot has seven revolute joints, with the angle of joint i defined in radians as qi for i ∈ {1, 2, … , 7}. The Franka Emika robotic arm is a collaborative robot (cobot) designed to work safely alongside humans in various environments, ranging from industrial settings to direct interaction scenarios. Unlike conventional industrial robots, which are typically enclosed for safety reasons, the Franka Emika arm can perform tasks in close proximity to people without posing a hazard8. This capability makes it ideal for operations that require direct physical interaction, such as drilling, screwing, polishing, and a wide range of inspection and assembly tasks. The Franka Emika robotic arm provides a 3 kg payload capacity and a reach of 850 mm. The robot weighs approximately 18 kg and its repeatability is 0.1 mm. Repeatability is a measure of the ability of the robot to consistently reach a specified point.
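To make the role of the kinematic parameters concrete, the following Python sketch composes per-joint homogeneous transforms to map the joint angles q1 through q7 to an end effector pose. The DH parameter values are left as placeholders to be filled in from Fig. 3 or the Franka documentation25; note that Franka's documentation uses the modified (Craig) DH convention, whereas the classic form is shown here for simplicity.

import numpy as np

def dh_transform(a, d, alpha, theta):
    """Homogeneous transform of one joint under the classic DH convention."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    ct, st = np.cos(theta), np.sin(theta)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(q, dh_params):
    """Compose the seven joint transforms; returns the 4x4 end effector pose."""
    T = np.eye(4)
    for (a, d, alpha), theta in zip(dh_params, q):
        T = T @ dh_transform(a, d, alpha, theta)
    return T

# Placeholder (a, d, alpha) triples for the seven joints; substitute the values
# from Fig. 3 / the Franka Emika documentation.
dh_params = [(0.0, 0.0, 0.0)] * 7
q = np.zeros(7)                  # joint angles q1..q7 in radians
ee_position = forward_kinematics(q, dh_params)[:3, 3]  # x-y-z of the end effector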
Experiments
Our dataset is composed of 60 different primary combinations performed by the robotic arms, with activities captured through our sensor-rich modules. It encompasses four variations:
Activity
The Franka Emika robotic arms were programmed to draw the digits 0 through 9 on a vertical imaginary plane, creating ten distinct activity classes. The positions of the end effector for each activity are shown in Fig. 4. We denote the activities performed by the robots as A ∈ {0, 1, …, 9}.
Numbers 0 through 9 are drawn by the robotic arm on a vertical imaginary plane, resulting in 10 distinct classes of activities. The plot shows the end effector trajectories that form these numbers, with the robotic arm and background removed for clarity. For illustration purposes, the initial and final parts of the robot's trajectory, where the robot positions itself from its starting point to the imaginary plane and back, are omitted.
To execute these activities, we generated seven fixed waypoints. These waypoints are depicted in Fig. 5, illustrating their positions in both the joint space and the corresponding end effector space. The activities are performed using the waypoints in the joint position space. For each activity, the robot follows an appropriate sequence of waypoints from waypoint 2 to waypoint 7. The robot always starts at waypoint 1 and returns to it after completing the digit on the imaginary plane.
Each activity involves writing a digit using a robotic arm, defined by seven distinct waypoints. The left plot illustrates the waypoints in the end-effector space, while the right plot represents them in the joint space. Each digit is written by following a specific sequence of waypoints in the joint space, with the robot always starting and ending at waypoint 1. For example, to write the digit 7, the robot follows the sequence {1, 2, 5, 7, 1}.
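A minimal sketch of how such a sequence could be executed in joint space is given below. Only the sequence for digit 7 is stated in the caption above; the dictionary entries for the other digits and the motion command are hypothetical placeholders rather than the controller interface used in this work.

# Only the sequence for digit 7 is given explicitly (Fig. 5 caption); the other
# digits follow analogous sequences over waypoints 2-7, always starting and
# ending at waypoint 1.
DIGIT_TO_WAYPOINTS = {7: [1, 2, 5, 7, 1]}

def execute_digit(digit, joint_waypoints, move_to_joint_position):
    """Step through the joint-space waypoints that trace the given digit.

    joint_waypoints: dict mapping waypoint index -> seven joint angles (radians)
    move_to_joint_position: hypothetical callable issuing the motion command
    """
    for wp in DIGIT_TO_WAYPOINTS[digit]:
        move_to_joint_position(joint_waypoints[wp])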
Robot Number
Indicated by R ∈ {1, 2}, this specifies which of the two available robotic arms is performing the activity based on the numbering notation in Fig. 1.
Robot velocity
Denoted by V ∈ {High, Medium, Low}, this describes the velocity level at which the robot performs the activity. To enforce these velocity levels, we applied the following limits on the maximum acceleration and deceleration of the joint angles (a minimal configuration sketch follows the list):
-
High: A maximum joint acceleration and deceleration of amax = 2.5 radian/s2 was applied to each joint in this setting. This corresponds to 50% of the robot's maximum allowed acceleration.
-
Medium: A maximum joint acceleration and deceleration of amax = 2.0 radian/s2 was applied to each joint in this setting. This corresponds to 40% of the robot's maximum allowed acceleration.
-
Low: A maximum joint acceleration and deceleration of amax = 1.5 radian/s2 was applied to each joint in this setting. This corresponds to 30% of the robot's maximum allowed acceleration.
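The mapping from velocity level to acceleration limit can be summarized as follows; the set_acceleration_limit call is a hypothetical stand-in for the actual controller interface, and the 5.0 radian/s2 ceiling is simply implied by the stated percentages (e.g., 2.5 radian/s2 corresponds to 50%).

# Acceleration/deceleration limits per velocity level (radian/s^2), as listed above.
ACCEL_LIMITS = {"High": 2.5, "Medium": 2.0, "Low": 1.5}
MAX_ALLOWED_ACCEL = 5.0  # implied by the stated percentages

def configure_velocity_level(robot, level):
    a_max = ACCEL_LIMITS[level]
    print(f"{level}: {a_max} rad/s^2 ({100 * a_max / MAX_ALLOWED_ACCEL:.0f}% of max)")
    robot.set_acceleration_limit(a_max)  # hypothetical controller call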
Motion uncertainty
Denoted by \(U\in {{\mathbb{R}}}^{+}\), where \({{\mathbb{R}}}^{+}\) represents the positive real numbers, this measures the L2-norm error of the end effector's position relative to its intended ground truth trajectory over time.
The Franka Emika robotic arm's repeatability of 0.1 mm ensures that, left unperturbed, the robot performs a nearly identical motion for each activity. To make the dataset realistic, we manually and purposefully introduced a layer of uncertainty into the robots' motion so that different repetitions in our dataset are not identical. This approach simulates the variability of handwritten digits produced by humans, which are inherently non-identical. To achieve this, we added multiplicative uniform noise \(\epsilon \sim {\mathcal{U}}(0.7,1.3)\) to the first three joints of the robot (q1, q2, q3) for each waypoint in the sequence, except for the first waypoint. This ensures that the start and end positions of the robot remain consistent. We chose the first three joints because they have the most significant effect on the robots' final trajectory, while the remaining four joints primarily influence the orientation of the robot's end effector. Fig. 6 illustrates the impact of these uncertainties on the first three joints and the end effector position of the robot in different repetitions of the same activity. With this approach, the effective variability of the robot's motion increases to an average of 32 cm. To illustrate this, Fig. 7 presents the maximum distance between the robot's trajectory and the ground truth across all repetitions in the dataset.
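A minimal sketch of this perturbation and of one plausible reading of the uncertainty metric U is given below, assuming the waypoints are stored as arrays of seven joint angles; the exact implementation used for the dataset is available in the GitHub repository.

import numpy as np

rng = np.random.default_rng()

def perturb_waypoints(waypoint_sequence):
    """Multiplicative uniform noise U(0.7, 1.3) on joints q1-q3 of every waypoint
    except waypoint 1, which opens and closes the sequence and stays fixed."""
    noisy = []
    for i, q in enumerate(waypoint_sequence):
        q = np.asarray(q, dtype=float).copy()
        if 0 < i < len(waypoint_sequence) - 1:      # keep the start/end waypoint unchanged
            q[:3] *= rng.uniform(0.7, 1.3, size=3)  # noise on q1, q2, q3 only
        noisy.append(q)
    return noisy

def motion_uncertainty(ee_traj, ee_ground_truth):
    """One plausible reading of U: the per-timestep L2 position error of the end
    effector (arrays of shape T x 3), aggregated here as the maximum over time."""
    errors = np.linalg.norm(np.asarray(ee_traj) - np.asarray(ee_ground_truth), axis=1)
    return float(errors.max())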
The effect of uncertainty on the end effector position (left) and on the joint positions (right) of the robot. Each color represents the result of a repetition where the robot performs the activity with A = 8, R = 1, and V = High. Only the first three joints are shown in this figure, as no uncertainty is applied to the remaining joints. Additionally, waypoint 1 has no added uncertainty, ensuring that the robot always starts and ends in the same position.
The combination of ten activities, two robots performing these activities, and three velocity levels results in a total of 60 unique primary combinations. For each primary combination, we have collected 32 repetitions. Each repetition spans 15 seconds, during which the robot performs the action with the activity, robot arm, and velocity held fixed for that combination, while incorporating motion uncertainty. This introduces deviations across repetitions as the robot writes on the imaginary plane, adding a realistic layer of complexity to the dataset. Fig. 8 shows the uncertainty in motion of all the repetitions in four of the primary combinations as a sample projected on a 2D imaginary plane.
Motion uncertainty of all the repetitions in four of the primary combinations as a sample. All with R = 1, and V = High, but with different values of A. For illustration purposes, the path is projected on a 2D imaginary plane and the initial and final parts of the robot’s trajectory, where the robot positions itself from its starting point to the imaginary plane and back, are omitted.
WiFi CSI Modality
As wireless signals propagate, they encounter various obstacles in the environment, leading to reflections and scattering, a phenomenon known as multipath fading14. WiFi CSI facilitates the analysis of subcarrier propagation from the transmitter to the receiver in wireless communications15. The channel model is represented as
\({\bf{y}}={\bf{H}}{\bf{x}}+{\boldsymbol{\eta }},\)
where x, y, and η denote the transmitted signal vector, received signal vector, and additive noise vector, respectively15. The channel matrix \({\bf{H}}\in {{\mathbb{C}}}^{T\times S}\) encapsulates the characteristics of the wireless channel, including multipath propagation, fading, and other impairments, and is defined as
\({\bf{H}}={\left[{h}_{t,s}\right]}_{t=1,\ldots ,T;\,s=1,\ldots ,S},\)
where S and T represent the number of subcarriers for each antenna and the number of transmitted packets, respectively. Each element of the matrix H corresponds to a complex value, known as the channel frequency response, and is given by
\({h}_{t,s}={a}_{s}{e}^{j{\phi }_{s}},\)
where as and ϕs denote the amplitude and phase of subcarrier s at timestamp t, respectively. For human activity recognition (HAR)16,17,18 and robot activity recognition (RAR)19,20,21, studies primarily focus on \({\bf{A}}\in {{\mathbb{R}}}^{T\times S}\), which corresponds to the element-wise amplitude of H, disregarding the phase component.
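In practice, the amplitude matrix A is obtained by taking the element-wise magnitude of the complex CSI; the short sketch below illustrates this for a T × S complex matrix (the random values stand in for measured CSI).

import numpy as np

# H: complex CSI matrix of shape (T, S); in this dataset each sensor-rich module
# yields 450 packets x 256 subcarriers per repetition.
H = np.random.randn(450, 256) + 1j * np.random.randn(450, 256)  # stand-in data

A = np.abs(H)      # element-wise amplitude, shape (T, S)
phi = np.angle(H)  # element-wise phase, typically discarded in HAR/RAR studies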
For each 15-second repetition, we collected CSI measurements at a 30 Hz frequency over an 80 MHz bandwidth, which yields 256 subcarriers at each timestamp. This resulted in a 450 × 256 complex matrix for each sensor-rich module. These matrices are stored in a json file. Additionally, we included received signal strength (RSS) information for each timestamp as well. Fig. 9 displays a sample plot of the amplitudes of a CSI matrix.
Video Modality
For each 15-second repetition, we collected video measurements at a frequency of 30 Hz, synchronized with the CSI measurements, using three sensor-rich modules, each containing a stereo camera. For each sample, three ZED 2 stereo cameras simultaneously recorded RGB videos at a resolution of 2560 × 720 pixels, with the frames from the left and right lenses of each stereo camera horizontally concatenated. This setup allowed us to capture three different views of the same action. Each camera, equipped with two lenses, provided stereo video, resulting in a dataset that includes 3 (cameras) × 2 (lenses) = 6 videos for each repetition.
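Because the left and right views are concatenated side by side in each 2560 × 720 frame, they can be recovered by splitting each frame down the middle; a minimal OpenCV sketch follows (the file name is illustrative, not the exact naming convention of the dataset).

import cv2

cap = cv2.VideoCapture("example_repetition_Rx1_cam.mp4")  # illustrative file name
while True:
    ok, frame = cap.read()        # frame shape: (720, 2560, 3)
    if not ok:
        break
    half = frame.shape[1] // 2    # 1280 pixels per lens
    left, right = frame[:, :half], frame[:, half:]
    # ... process the 1280 x 720 left/right views here ...
cap.release()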
Audio Modality
Audio signals are continuous waveforms that represent sound waves in a format that can be processed by digital systems. These waveforms are characterized by frequency, amplitude, and phase, which contain rich information about the environment and the sources of sound. In the context of activity recognition, audio signals offer a non-intrusive and cost-effective means to infer activities and interactions. By analyzing the acoustic patterns and variations over time, it is possible to identify specific activities based on the distinct sounds associated with each activity. Advanced machine learning algorithms, particularly those leveraging deep learning, have demonstrated significant success in classifying and recognizing activities from audio data. These algorithms extract useful information, such as the frequency and amplitude of the sound wave over time, to analyze and predict activities22,23.
In this paper, we employ auditory perception as many robot activities produce characteristic sounds from which we can effectively infer corresponding actions. We view audio not as a replacement but as a complement to existing sensory modalities. By fusing audio with other sensory data, we aim to achieve particularly robust activity recognition across a wide range of conditions. This multimodal approach enhances the accuracy and reliability of activity recognition systems, making them more effective in diverse environments.
For each 15-second repetition, we collected audio measurements from each sensor-rich module at a sampling rate of 44,100 Hz, while the start and end times were synchronized with our other modalities. To analyze the audio data, we preprocess the audio files by computing their spectrograms. While we delve more into the calculation of the audio spectrogram in the Technical Validation section, a sample spectrogram plot of one of the captured audio signals is shown in Fig. 10.
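A minimal sketch of the spectrogram computation for one 15-second recording is given below using scipy; the window length and overlap are illustrative choices, not necessarily those used in the Technical Validation section, and the file name is a placeholder.

import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, samples = wavfile.read("example_repetition_Rx1_mic.wav")  # fs = 44,100 Hz
if samples.ndim > 1:
    samples = samples.mean(axis=1)  # fold multi-channel audio to mono if needed

# Illustrative STFT parameters: ~23 ms Hann windows with 50% overlap.
freqs, times, Sxx = spectrogram(samples.astype(float), fs=fs,
                                window="hann", nperseg=1024, noverlap=512)
log_spec = 10 * np.log10(Sxx + 1e-12)  # spectrogram in dB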
True Trajectory
For each 15-second repetition, we provided the true joint positions of the robots at a frequency of 30 Hz, synchronized with the CSI and video measurements. At each timestamp, each joint position, consisting of q1 to q7 and corresponding to the seven revolute joints of the robot, has been stored in radians in a json file. For convenience, the x − y − z position of the end effector (EE) in the Cartesian coordinate system is also provided, presented in the robot frame aligned with the x0 and z0 axes and following the right-hand rule, as shown in Fig. 3. Fig. 11 displays a sample plot of the end effector's Cartesian position of one of the robots performing an activity.
Plot of the end effector’s position of the robot during a repetition with A = 2, R = 1, and V = High. The black trajectory represents the robot drawing the number on an imaginary plane, while the gray trajectory illustrates the robot’s movement from a fixed starting position to the imaginary plane and back.
Synchronization
During data collection, each module incorporates local timestamping directly on the hardware. The timestamped data is subsequently transmitted across the network for further processing. Although this configuration effectively captures data from individual modules, synchronizing timestamps is critical for deployments involving multiple modules. To address this challenge, we developed a method whereby packets containing CSI, video, and audio data, each with its own timestamp, are redirected to a specialized system known as the monitor.
The monitor serves as a central hub, collecting packets from various modules and assigning synchronized timestamps to the data. A visual depiction of this intercommunication process is presented in Fig. 12. Notably, according to the Nexmon project’s specifications, Raspberry Pis configured as CSI sniffers forfeit their WiFi communication capabilities. To overcome this restriction, we connected the sniffers and the monitor using Ethernet cables, thus ensuring uninterrupted communication between them.
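The restamping idea can be illustrated with a simplified sketch of a monitor process that receives timestamped packets from the modules over the wired network and attaches a common reference time; this is an illustration of the concept, not the implementation used for the dataset (see the GitHub repository for the actual code).

import json
import socket
import time

def run_monitor(host="0.0.0.0", port=9000):
    """Accept packets from a sensor-rich module and attach a monitor timestamp."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, port))
    srv.listen()
    conn, _ = srv.accept()
    with conn, conn.makefile("r") as stream:
        for line in stream:                       # one JSON packet per line (illustrative framing)
            packet = json.loads(line)             # carries the module's local timestamp
            packet["monitor_time"] = time.time()  # common reference clock
            # ... store or forward the synchronized packet ...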
Using this configuration, the CSI, the video, the true trajectories of the robots, and the start and end times of the audio are synchronized with each other and across all the sensor-rich modules, providing a comprehensive dataset for multi-modal passive MRAR. Fig. 13 compactly shows the different synchronized modalities in an experiment.
Plot of the three modalities and the robot’s true trajectory in a repetition with A = 0, R = 1, and V = High, captured by module M = 1. The video, CSI, true trajectory, and start and end time of the audio are synchronized. For illustration purposes, the initial and final parts of the robot’s movement, where it positions itself from its starting point to the imaginary plane and back, are omitted.
Figure 14 shows MNIST formatted plots of the 10 activities and different repetitions based on our modalities.
Data Records
The dataset is available for download from our Figshare repository24.
Based on variations in activity, robot, and velocity, we have 60 primary combinations. Each combination has a dedicated folder in the dataset, named according to the standard described in Fig. 15(a). Within each primary combination, there are at least 32 repetitions, all with 15-second duration and specific R, V, and A variations, differing only by assigned motion uncertainty. Each repetition contains 8 files as listed in Table 1, following the naming convention outlined in Fig. 15(b).
Naming standard for folders and files in our dataset. The naming standard for folders, dedicated to each primary combination, specifies the robot performing the activity (denoted by R), the velocity of the activity (denoted by V), and the specific activity being performed (denoted by A). The naming standard for each file within a folder shares the same values for A, V, and R, and varies by the robot’s motion uncertainty (denoted by U). The value of U is always a floating-point number with two decimal places. For example, UNC258 represents U = 2.58. The Rx indicates the module from which the data is sourced; it can be a specific number or “all,” where “all” signifies that data from all sensors/robots are collected in the same file.
WiFi CSI Description
This section describes the structure of the CSI files, which end with csi.json. Each repetition is represented by a single CSI file, which contains the CSI data for all three sensor-rich modules. These CSI files are in json format and consist of an array of three json objects. Each json object corresponds to one of the sensor-rich modules. Within each json object, the data is organized as a set of key-value pairs, as detailed in Table 2.
Video Description
This section describes the structure of the video files, identified by the cam.mp4 extension. Each repetition corresponds to three video files, one for each sensor-rich module. Each video file is in mp4 format, where each frame consists of the horizontally concatenated left and right frames of the lenses of the stereo camera, resulting in a final frame with dimensions of 2560 × 720. Each frame is synchronized with the timestamps provided in the CSI file of the same repetition.
Audio Description
This section describes the structure of the audio files, identified by the mic.wav extension. Each repetition corresponds to three audio files, one for each sensor-rich module. Each audio file is in wav format and has been recorded for the 15-second duration of the repetition.
True Robot Trajectory Description
This section describes the structure of the position files, which have the pos.json extension. Each repetition corresponds to one position file containing the position information of both robots. Even though only one robot is moving in each repetition, as indicated by the file name, we have included the position data for both robots. Each position file is in json format and consists of an array of two json objects, one for each robot. Each json object is a set of key-value pairs, as detailed in Table 3.
Technical Validation
In this section, we validate our dataset across various aspects, including robot configuration, uncertainty, and the synchronization of CSI, video and audio.
Joint Positions
The documentation of the Franka Emika robot25 provides the safety specifications for the joint positions of each of the seven joints. The safe operating range for the joints is summarized in Table 4.
Figure 16 shows the joint positions, in radians, used throughout our entire dataset for both robots separately. As evident from the figure, all joint positions in the dataset fall within the safe operating range of the robots.
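A check of this kind can be reproduced with a simple range test over the recorded joint positions; the limit values below are approximate figures from the public Franka documentation and should be replaced with the authoritative values in Table 4 / ref. 25.

import numpy as np

# Approximate joint position limits (radians); see Table 4 for the authoritative values.
Q_MIN = np.array([-2.8973, -1.7628, -2.8973, -3.0718, -2.8973, -0.0175, -2.8973])
Q_MAX = np.array([ 2.8973,  1.7628,  2.8973, -0.0698,  2.8973,  3.7525,  2.8973])

def joints_within_limits(q_traj):
    """q_traj: array of shape (T, 7) with joint angles in radians."""
    q_traj = np.asarray(q_traj)
    return bool(np.all(q_traj >= Q_MIN) and np.all(q_traj <= Q_MAX))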
The joint positions for each robot are shown using filled circles, with each joint represented separately in radians. For each joint, all positions used throughout the dataset are displayed. The boxes indicate the maximum and minimum values allowed based on the specifications of the robots. Given the similarity in actions performed by both robots, their joint position ranges are nearly identical and closely aligned.
Joint Velocities
The documentation of the Franka Emika robot also provides the safety specifications for the joint velocities of each of the seven joints. The safe operating velocity ranges for the joints are summarized in Table 5.
Figure 17 shows the joint velocities, in radian/s, used throughout our entire dataset for both robots separately. As evident from the figure, the joint velocities in the dataset fall within the safe operating range of the robots.
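Since the true joint positions are provided at 30 Hz, the joint velocities can be approximated by finite differences and tested against the documented limits; the limit values below are approximate figures from the public Franka documentation, with the authoritative values in Table 5.

import numpy as np

FS = 30.0  # Hz, sampling rate of the true joint trajectories

# Approximate joint velocity limits (radian/s); see Table 5 for the authoritative values.
DQ_MAX = np.array([2.1750, 2.1750, 2.1750, 2.1750, 2.6100, 2.6100, 2.6100])

def joint_velocities(q_traj):
    """Finite-difference joint velocities from a (T, 7) array of joint angles."""
    return np.gradient(np.asarray(q_traj, dtype=float), 1.0 / FS, axis=0)

def velocities_within_limits(q_traj):
    return bool(np.all(np.abs(joint_velocities(q_traj)) <= DQ_MAX))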
Figure 18 shows the linear velocity of the robot’s end effector for the different velocity levels used in this dataset. It illustrates how the High, Medium, and Low velocity levels affect the motion of the end effector. As expected, the velocity is higher when using the High setting compared to the Medium setting, resulting in the activity being completed more quickly. A similar pattern is observed when comparing the Medium and Low velocity levels. When comparing the velocity levels, it is important to note that, for instance, the Medium velocity exhibits a lag in motion compared to the High velocity. This lag should be taken into account when analyzing and comparing the velocities.
The linear velocity of the end effector over time for different velocity levels: High, Medium, and Low. This comparison is based on repetitions with A = 7 and R = 1. The solid lines represent the average velocity across all repetitions, while the shaded areas around them indicate the variance. As expected, the High velocity is consistently greater than the Medium velocity, and the Medium velocity exceeds the Low velocity. In A = 7, the robot follows five waypoints, corresponding to four motions between these waypoints. Each peak in the plot represents one of these motions. It is clear that the different velocity levels behave as expected, with the High velocity showing higher peaks compared to Medium, and Medium showing higher peaks compared to Low. Additionally, the High velocity completes the activity faster than Medium, and Medium finishes sooner than Low.
Motion Uncertainty
As mentioned in the Methods section, we manually added multiplicative uniform noise \(\epsilon \sim {\mathcal{U}}(0.7,1.3)\) to the first three joints of the robot (q1, q2, q3) for each waypoint in the sequence, except for waypoint 1. Fig. 19 shows the actual samples for each waypoint of the first three joints. The filled boxes represent the interquartile range (IQR) of the noisy waypoints, while the whiskers extend to 1.5 times the IQR. The maximum range of uncertainty for each waypoint is illustrated using a dashed rectangle. It is evident that the noisy waypoints consistently fall within the range defined by the specified uncertainty.
The sampled waypoints with added uncertainty are plotted for each waypoint, using all repetitions for A = 8, R = 1, and V = High. We chose A = 8 because it traverses all the waypoints multiple times, providing a representative example of the entire dataset. The sampled waypoints with uncertainty are visualized using a box plot, where the boxes represent the IQR and the whiskers extend to 1.5 times the IQR. The maximum allowed uncertainty is indicated with dashed rectangles, showing that the noisy waypoints consistently fall within this range.
CSI Synchronization
To validate the synchronization of WiFi signals captured by the three sniffers, we analyzed the normalized cross-correlation of the WiFi CSI amplitude signals. This metric measures the similarity of the signal patterns and verifies that the data streams are temporally aligned across the devices.
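The same measure is used for the video and audio checks in the following subsections. A minimal sketch for two equally long one-dimensional signals is given below; taking the peak of the lag-normalized cross-correlation is one common definition, and the choice of feature (e.g., the per-packet mean CSI amplitude across subcarriers from two sniffers) is an assumption here.

import numpy as np

def normalized_cross_correlation(x, y):
    """Peak of the normalized cross-correlation between two 1-D signals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = (x - x.mean()) / (x.std() * len(x))
    y = (y - y.mean()) / y.std()
    return float(np.max(np.correlate(x, y, mode="full")))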
Figure 20 presents the normalized cross-correlation of each pair of sniffers for each repetition. Table 6 summarizes the statistics of the normalized cross-correlation values. As observed, the average normalized correlation values for all sniffer pairs exceed 0.923, with a minimum value of 0.874 and a maximum value of 0.966. The IQR values are also small (e.g., 0.0118 to 0.0145), indicating low variability among the central data points. These results confirm that the CSI signals are highly linearly correlated, demonstrating strong synchronization across the sniffers.
Video Synchronization
To validate the synchronization of the video recordings captured by the three cameras, we analyzed the normalized cross-correlation of pixel intensity values extracted from the video frames. This metric measures the similarity of pixel patterns across the video streams, ensuring that the cameras are temporally aligned. The resulting values show consistent similarity across the dataset, demonstrating stable temporal alignment between the cameras and reliable synchronization.
Figure 21 illustrates the normalized cross-correlation values across all repetitions. The x-axis denotes the repetitions, while the y-axis represents the magnitude of the correlation. These results validate the consistent synchronization of the video data across the three cameras, demonstrating strong alignment. Building on these observations, Table 7 presents key statistical metrics for the normalized cross-correlation values, including the mean, standard deviation, minimum, maximum, and IQR. These metrics quantitatively confirm the strong synchronization across all pairs with minimal variation.
Audio Synchronization
To assess the synchronization of audio signals captured by the microphones, we computed the normalized cross-correlation of their spectrograms. This check verifies that the audio signals recorded by the devices are aligned in time, a key requirement for synchronization.
The normalized cross-correlation results confirm the synchronization of audio signals across the devices. The small standard deviations in the normalized values indicate stable synchronization with minimal deviations across the dataset. Fig. 22 illustrates the normalized cross-correlation for each pair of microphones across all repetitions, with each colored line representing a pair. Table 8 provides a detailed summary of the normalized cross-correlation statistics. As shown, the average normalized correlation values for all microphone pairs are above 0.784, with minimum values ranging from 0.773 to 0.777 and maximum values close to 0.788. With IQR values ranging from 0.000862 to 0.00138, the central data points show minimal variation. These results confirm that audio signals are highly synchronized between devices.
Additionally, Fig. 23 illustrates the data from various sensor modalities, showing the start and end times of the activity performed by the robot during one of the repetitions in the dataset. This figure also validates the synchronization between the different sensors and modules in the dataset.
The plot shows the data from three sensor modules across CSI, audio, and video modalities. For CSI, a heatmap of subcarrier amplitudes is used; for audio, the waveform of the audio signal is presented, showing variations in amplitude over time; and for video, pixel-wise variance (squared differences) between consecutive frames is depicted. The start and end of the activity performed by the robot are indicated by black dashed lines. Comparing the signals from different sensors demonstrates their synchronization in time with respect to each other. This plot is taken from one of the repetitions (a.k.a. samples) in the dataset with A = 0, R = 1, and V = High.
Code availability
The entire dataset is available for download from our Figshare repository24. Interested readers are encouraged to visit our GitHub repository (https://github.com/SiamiLab/RoboMNIST), where example Python notebooks for loading and visualizing our data are provided:
-
wifi_csi_read.ipynb: This Python notebook loads a CSI json file from a repetition and visualizes the CSI amplitudes and RSS values.
-
true_trajectory_read.ipynb: This Python notebook loads a true trajectory json file from a repetition and visualizes the robot's motion.
The repository also includes the complete code, along with detailed explanations of how the robots were controlled, how the data was collected, and how synchronization was achieved across the different modules and sensors, and it provides instructions on how to reproduce these processes.
References
Vinoth, K. & Sasikumar, P. Multi-sensor fusion and segmentation for autonomous vehicle multi-object tracking using deep q networks. Scientific Reports 14, 31130 (2024).
Celik, Y. & Godfrey, A. Bringing it all together: Wearable data fusion. npj Digital Medicine 6, 149 (2023).
Moran, M. E. Evolution of robotic arms. Journal of robotic surgery 1, 103–111 (2007).
Prajapat, M., Turchetta, M., Zeilinger, M. & Krause, A. Near-optimal multi-agent learning for safe coverage control. Advances in Neural Information Processing Systems 35, 14998–15012 (2022).
Hosseini, S. H., Tavazoei, M. S. & Kuznetsov, N. V. Agent-based time delay margin in consensus of multi-agent systems by an event-triggered control method: Concept and computation. Asian Journal of Control 25, 1866–1876, https://doi.org/10.1002/asjc.2814 (2023).
Jorge, V. A. et al. A survey on unmanned surface vehicles for disaster robotics: Main challenges and directions. Sensors 19, 702 (2019).
Duan, S., Shi, Q. & Wu, J. Multimodal sensors and ml-based data fusion for advanced robots. Advanced Intelligent Systems 4, 2200213 (2022).
Haddadin, S. et al. The franka emika robot: A reference platform for robotics research and education. IEEE Robotics & Automation Magazine 29, 46–64 (2022).
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324 (1998).
Krizhevsky, A. et al. Learning multiple layers of features from tiny images. Computer Science University of Toronto (2009).
Deng, J. et al. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (IEEE, 2009).
Xiao, H., Rasul, K. & Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
Schulz, M., Wegemer, D. & Hollick, M. Nexmon: The c-based firmware patching framework. https://nexmon.org (2017).
Yang, Z., Zhou, Z. & Liu, Y. From rssi to csi: Indoor localization via channel response. ACM Computing Surveys (CSUR) 46, 1–32 (2013).
Wang, Z. et al. A survey on human behavior recognition using channel state information. IEEE Access 7, 155986–156024 (2019).
Salehinejad, H. & Valaee, S. Litehar: lightweight human activity recognition from wifi signals with random convolution kernels. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4068–4072 (IEEE, 2022).
Yousefi, S., Narui, H., Dayal, S., Ermon, S. & Valaee, S. A survey on behavior recognition using wifi channel state information. IEEE Communications Magazine 55, 98–104 (2017).
Zheng, Y. et al. Zero-effort cross-domain gesture recognition with wi-fi. In Proceedings of the 17th annual international conference on mobile systems, applications, and services, MobiSys ’19, 313–325 (Association for Computing Machinery, New York, NY, USA, 2019).
Zandi, R., Salehinejad, H., Behzad, K., Motamedi, E. & Siami, M. Robot motion prediction by channel state information. In 2023 IEEE 33rd International Workshop on Machine Learning for Signal Processing (MLSP), 1–6 (IEEE, 2023).
Zandi, R., Behzad, K., Motamedi, E., Salehinejad, H. & Siami, M. Robofisense: Attention-based robotic arm activity recognition with wifi sensing. IEEE Journal of Selected Topics in Signal Processing (2024).
Zandi, R., Behzad, K., Motamedi, E., Salehinejad, H. & Siami, M. Enhancing robotic arm activity recognition with vision transformers and wavelet-transformed channel state information. In 2024 IEEE 35th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Valencia, Spain, 1–6, https://doi.org/10.1109/PIMRC59610.2024.10817193 (IEEE, 2024).
Reinolds, F., Neto, C. & Machado, J. Deep learning for activity recognition using audio and video. Electronics 11, 782 (2022).
Stork, J. A., Spinello, L., Silva, J. & Arras, K. O. Audio-based human activity recognition using non-markovian ensemble voting. In 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication, 509–514 (IEEE, 2012).
Behzad, K., Zandi, R., Motamedi, E., Salehinejad, H. & Siami, M. Robomnist: A multimodal dataset for multi-robot activity recognition using wifi sensing, video, and audio. figshare https://doi.org/10.6084/m9.figshare.28179383 (2024).
Franka Emika. Robot and interface specifications. https://frankaemika.github.io/docs/control_parameters.html. Accessed: 2024-12-27 (2023).
He, Y. & Liu, S. Analytical inverse kinematics for franka emika panda - a geometrical solver for 7-dof manipulators with unconventional design. In 2021 9th International Conference on Control, Mechatronics and Automation (ICCMA), 194–199 (2021).
Franka Emika. Denavit-Hartenberg parameters. https://frankaemika.github.io/docs/control_parameters.html#denavithartenberg-parameters. Accessed: 2024-06-10 (2024).
Acknowledgements
This material is based upon work supported in part by grants ONR N00014-21-1-2431, NSF 2121121, the U.S. Department of Homeland Security under Grant Award Number 22STESE00001-03-02, and by the Army Research Laboratory under Cooperative Agreement Number W911NF-22-2-0001 (KB, RZ, EM, MS). The views and conclusions contained in this document are solely those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Department of Homeland Security, the Army Research Office, or the U.S. Government.
Author information
Authors and Affiliations
Contributions
Kian Behzad contributed significantly to hardware development, the experimental setup, and data collection, managed the GitHub repository for code availability and the Figshare repository for data sharing, and wrote the initial draft of the manuscript with feedback from all authors. Rojin Zandi led Wi-Fi and audio validation, contributed to data collection, and helped develop the experimental setup. Elaheh Motamedi performed video validation, contributed to data collection, and helped develop the experimental setup. Hojjat Salehinejad, as a senior author, contributed to shaping the research and provided critical feedback. Milad Siami conceived the project, secured funding, provided experimental resources, supervised all aspects, and guided the study throughout. All authors reviewed and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Behzad, K., Zandi, R., Motamedi, E. et al. RoboMNIST: A Multimodal Dataset for Multi-Robot Activity Recognition Using WiFi Sensing, Video, and Audio. Sci Data 12, 326 (2025). https://doi.org/10.1038/s41597-025-04636-2