In this section, we conducted a systematic evaluation of the cross-device sensing method described in the previous section. We first built a cross-device VAHF dataset consisting of 10 users × 20 sentences × (8+1) gestures = 1800 samples. We then evaluated our cross-device sensing method on this dataset along the dimensions of sensor combination, model selection, gesture reduction, and model ablation.
5.2 Data Collection
We collected gesture samples from 10 participants. The data collection entailed recording voice, ultrasound, and motion data while participants performed the VAHF gestures from Section 3 and spoke daily voice commands. Each data collection study lasted about 60 minutes. Initially, participants were asked to read and sign consent forms. They were then shown instruction slides explaining the overall procedure of the data collection session and videos of the VAHF gesture set from Section 3. We then instructed participants to put on the earbuds, the smartwatch, and the ring properly, helping them adjust the fit until they felt comfortable with the devices.
Each participant was required to perform 15 gestures and record 10 voice commands for each gesture (150 gesture samples in total). For each gesture, the participant was shown a slide with the gesture's name, the 10 voice commands, and the posture (sitting or standing). The order of the gestures and the posture conditions were randomized to eliminate order effects. The 10 voice commands were randomly picked from daily Siri voice commands, as shown in Table 2.
During the recording of the 10 gesture samples of each gesture, the experimenter first started recording on the IMU ring, the watch's ultrasound, and the voice recorder. The participant then clapped their hands to provide a synchronization signal used to align the different sensors. For each gesture sample, the experimenter first pressed a key on the PC to label a tick and record the system time (used later for gesture sample segmentation), and then signaled the participant to perform the gesture and read the corresponding voice command while holding the gesture. This process was repeated until the participant finished all 10 gesture samples, after which the experimenter stopped the recording.
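For illustration, the clap-based synchronization and keypress-based segmentation could be implemented roughly as follows. This is a sketch, not the authors' code: it assumes each recording is a 1-D NumPy array, that the clap appears as the highest-energy frame in every stream, and that the PC logs its own clap timestamp (here called clap_pc_time) alongside the keypress ticks; all names and the window length are hypothetical.

```python
import numpy as np

def clap_time(signal, sr, frame_s=0.01):
    """Estimate the clap instant (in seconds) as the center of the highest-energy frame."""
    hop = int(frame_s * sr)
    n = len(signal) // hop * hop
    energy = (signal[:n].reshape(-1, hop) ** 2).sum(axis=1)
    return (np.argmax(energy) + 0.5) * hop / sr

def segment_stream(signal, sr, clap_pc_time, tick_pc_times, length_s=3.0):
    """Cut one fixed-length window per keypress tick, mapping PC timestamps onto this
    stream's own clock via the shared clap event."""
    t_clap = clap_time(signal, sr)
    segments = []
    for tick in tick_pc_times:
        start = int((t_clap + (tick - clap_pc_time)) * sr)
        segments.append(signal[start:start + int(length_s * sr)])
    return segments

# Toy usage: a 5 s, 16 kHz stream with a synthetic "clap" at 1.0 s.
sr = 16_000
stream = np.zeros(5 * sr)
stream[sr:sr + 80] = 1.0
segs = segment_stream(stream, sr, clap_pc_time=100.0, tick_pc_times=[101.0, 102.5], length_s=1.0)
```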
5.4 Evaluation Design
The evaluation consists of three sessions. In the first session, we conducted a two-factor evaluation to analyze the recognition performance with regard to sensor combination and model selection. For sensor combination, corresponding to Section 4.3, we investigated five settings: 1) a single (right) earbud with inner and outer microphones (RE, 2 audio channels), 2) two earbuds with inner and outer microphones (LE+RE, 4 audio channels), 3) two earbuds with outer microphones + watch (LE+RE+W, 3 audio channels), 4) all devices without the earbuds' inner channels (ALL-4ch, 4 audio channels), and 5) all devices with all channels (ALL-6ch, 6 audio channels). For model selection, we investigated the following six models: 1) vocal only (V), 2) ultrasound only (U), 3) IMU only (I), 4) vocal + ultrasound (V+U), 5) vocal + ultrasound + IMU with logit-level fusion (ALL-L), and 6) vocal + ultrasound + IMU with feature-level fusion (ALL-F). It is worth mentioning that the two factors are coupled: the ultrasound channel is only available when the watch is used, and the IMU channel is only available when the ring is used. Other factors, including the network structure, hyperparameters (max training epoch = 100, dropout = 0.5), and optimization strategies, were strictly controlled. We adopted three optimization strategies (pretraining, dropout, and warm-up) to improve the performance and training robustness of our model. For pretraining, we initialized MobileNet V3's parameters with weights pretrained on ImageNet [60]. For dropout, we added a dropout layer with a probability of 0.5 after the input layer to alleviate overfitting during training. For warm-up, we adopted a warm-up and decay strategy on the learning rate using the following piecewise function:
\[
lr(n) =
\begin{cases}
0.1 \times n \times lr(0), & n \le 10,\\
0.97^{\,n-10} \times lr(0), & n > 10,
\end{cases}
\]
where \(n\) is the training epoch and \(lr(n)\) is the learning rate at the \(n\)th epoch.
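As a concrete illustration, the schedule above could be wired into a training loop as follows. This is a minimal sketch assuming a PyTorch setup (the paper does not specify the framework); the model, optimizer, and base learning rate are placeholders.

```python
import torch

def lr_factor(n: int) -> float:
    """Multiplicative factor on lr(0) at epoch n (1-indexed): linear warm-up over the
    first 10 epochs, then exponential decay by 0.97 per epoch."""
    return 0.1 * n if n <= 10 else 0.97 ** (n - 10)

model = torch.nn.Linear(8, 9)                                # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)     # lr(0) = 1e-3 (placeholder)
# LambdaLR passes the 0-indexed epoch count; shift by 1 to match the 1-indexed formula.
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda e: lr_factor(e + 1))

for epoch in range(1, 101):     # max training epoch = 100
    # ... forward/backward passes would go here ...
    optimizer.step()            # no-op here; a real loop would backpropagate first
    scheduler.step()            # sets lr(n) for the next epoch
```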
In the second session, we conducted an extensive evaluation on a reduced gesture set to analyze the optimal performance and usability of each sensor combination for practical deployment. The reduced gesture set contains three signature gestures, namely cover mouth with palm (G1), cover ear with arched palm (G2), and hold up the palm beside nose and mouth (G3), which received high preference scores in the previous user study and intuitively have a strong effect on acoustic propagation. For each sensor combination, we chose the optimal model, as determined above, to compute the classification accuracies for each single gesture and for all three gestures ({G1, E}, {G2, E}, {G3, E}, and {G1, G2, G3, E}, where E refers to the empty gesture). All the evaluation settings were consistent with the first session. Such an evaluation helps ground the practical value of the minimal functionality under different hardware settings.
In the last session, we conducted an ablation study on the optimal model to analyze the effects of the optimization strategies in the model design: 1) pretraining, 2) dropout, and 3) warm-up. After obtaining the optimal model above, we reran it in the same setting while disabling 1) all three optimizations, 2) pretraining, 3) dropout, and 4) warm-up, yielding the recognition accuracy in these four ablation settings. Such a study helped to validate the effectiveness of our model design.
All the above evaluations were conducted with leave-one-user-out cross-validation. For all the numerical comparisons, we reported the results along with Wilcoxon signed-rank tests to indicate statistical significance.
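For reference, this protocol could be reproduced along the following lines. The sketch assumes scikit-learn and SciPy, with a toy classifier and synthetic data standing in for the actual models and the VAHF dataset; the two conditions compared (full vs. reduced feature set) merely mimic comparing two sensor combinations.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

def louo_accuracies(clf, X, y, groups):
    """Leave-one-user-out cross-validation: train on the other users, test on the held-out user."""
    accs = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        clf.fit(X[train_idx], y[train_idx])
        accs.append((clf.predict(X[test_idx]) == y[test_idx]).mean())
    return np.array(accs)

# Synthetic stand-in data: 10 users x 30 samples, 16-dim features, 9 gesture classes.
rng = np.random.default_rng(0)
y = rng.integers(0, 9, size=300)
X = rng.normal(size=(300, 16)) + 0.5 * y[:, None]   # weak class-dependent shift
groups = np.repeat(np.arange(10), 30)               # user ID of each sample

# Paired per-user accuracies of two conditions (placeholders for two sensor/model settings).
acc_a = louo_accuracies(LogisticRegression(max_iter=500), X, y, groups)
acc_b = louo_accuracies(LogisticRegression(max_iter=500), X[:, :4], y, groups)
stat, p = wilcoxon(acc_a, acc_b)   # Wilcoxon signed-rank test over the 10 paired users
print(f"statistic = {stat:.2f}, p = {p:.3f}")
```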
5.5 Results
Table 3 shows the recognition performance with regard to sensor combination and model selection.
For vocal-only models, we observed a consistent increase in recognition accuracy as more sensor nodes were introduced (e.g., from 39.5% with a single earbud to 90.0% with all the sensors, Z = −2.81, p < 0.05). However, the difference between ALL-4ch and ALL-6ch was not significant, indicating that once multiple devices were used, the earbuds' inner channels brought limited additional information to the vocal channel. For ultrasound-only models, the performance increased from 52.5% to 70.9% (Z = −2.81, p < 0.05) as the ring microphone and the earbuds' inner microphones were added. Notably, the independent use of the ultrasound channels has the unique advantage of not relying on vocal features, so the model can still work well in scenarios such as noisy environments and whispering. The IMU-only model achieved an accuracy of 49.0%, indicating that the IMU provides complementary information on hand and finger movement, although it is far from practical as a standalone model.
As for the sensor fusion models, we noticed that the vocal+ultrasound model showed a performance increase over the vocal-only model when fewer input channels were available (LE+RE+W, 84.2% vs. 85.8%, Z = −1.64, p = 0.1), while it showed no increase for ALL-4ch (89.9% vs. 89.8%, Z = −0.18, p = 0.86) and a decrease for ALL-6ch (90.0% vs. 89.2%, Z = −0.98, p = 0.33). This is probably because the vocal-only model with multiple channels (e.g., 6 channels) is a strong baseline, and combining it with an inferior model introduces additional noise. Regarding all-channel fusion, we found that feature-level fusion slightly outperformed logit-level fusion in accuracy (91.5% vs. 90.8%, Z = −0.98, p = 0.33), probably due to the larger parameter space. We also observed a slight, non-significant performance decrease for the fusion models when adding the inner channels of the earbuds to ALL-4ch (91.5% vs. 90.9%, Z = −0.36, p = 0.72). The optimal model (all-channel feature-level fusion for ALL-4ch) achieved a 9-class recognition accuracy of 91.5%, which significantly outperformed the vocal-only model with the same channels (Z = −1.96, p < 0.05).

To gain a better understanding of how each channel (vocal, ultrasound, and IMU) contributed to the recognition, we analyzed the confusion matrices of four models (vocal-only, ultra-only, IMU-only, and feature-level fusion) under ALL-4ch, as shown in Figure 6. The confusion patterns were understandable: for the gesture pairs with larger confusion, their similarity could easily be explained by the gestures' semantics. For example, gesture pairs (0, 4) and (3, 7) yielded larger confusion for the vocal and ultrasound models, and each pair shares a similar touch position (the ear for (0, 4) and the mouth for (3, 7)). Gesture 1 was confused with gestures 3 and 7 in the ultrasound model, probably due to a similar hand position, though it yielded less confusion in the vocal model, probably because of the different occlusion levels (gestures 3 and 7 cause greater occlusion), which influence the frequency response of the human voice.
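For completeness, a row-normalized confusion matrix of the kind shown in Figure 6 can be computed from per-sample predictions, e.g., with scikit-learn; the label arrays below are placeholders for one model's outputs under ALL-4ch (label 8 standing for the empty gesture), with the confusions loosely mirroring the pairs discussed above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder per-sample labels: 0-7 for the VAHF gestures, 8 for the empty gesture E.
y_true = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 3, 4, 7])
y_pred = np.array([4, 3, 2, 7, 0, 5, 6, 3, 8, 0, 3, 4, 7])

# Entry (i, j) is the fraction of class-i samples predicted as class j (rows sum to 1).
cm = confusion_matrix(y_true, y_pred, labels=range(9), normalize="true")
print(np.round(cm, 2))
```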
Results on the reduced gesture set are shown in Table 4. Since ALL-4ch achieved higher recognition accuracy than ALL-6ch in the fusion model (91.5% vs. 90.9%), we dropped ALL-6ch from this table. We made the following observations: 1) Using one earbud with inner and outer microphones (RE), a severely restricted setting, achieved a narrowly applicable accuracy of over 80% for recognizing a specific single gesture (82.3% for G1 and 83.0% for G2), while it performed worse in recognizing other gestures (e.g., G3) or multiple gestures, which is understandable given the limited sensing information. 2) Using a pair of earbuds (LE+RE) significantly boosted the performance, with promising accuracies of 87.3% for recognizing all three gestures and 98.8% for recognizing G2, indicating the high applicability of such a compact hardware form. 3) Additional hardware, namely the watch and the ring, enabled the fusion of more input channels (e.g., ultrasound), which consistently improved the performance to a highly robust level (e.g., 97.3% for recognizing all three gestures and 100% for recognizing G2 and G3) while also enlarging the distinguishable gesture space (from 3 gestures to 8 gestures, see Table 3) with high applicability (e.g., 91.5% for simultaneously recognizing 8 gestures). These results show a leap over previous work with a similar interaction modality (e.g., PrivateTalk [76]), revealing the feasibility of a broadened gesture space (e.g., recognizing 8 gestures simultaneously) and the effectiveness of multi-device sensing.
The results of the ablation study are shown in Table 5. We found that disabling pretraining, dropout, and warm-up caused different levels of performance degradation. Disabling pretraining caused the most significant decrease in performance (−14.3%, Z = −2.81, p < 0.05), probably because the feature extractor network (MobileNet V3), when pretrained on a large-scale dataset, can better extract image features at different levels. Meanwhile, disabling dropout caused a slight, non-significant decrease of 0.3% (Z = −0.18, p = 0.86), and disabling warm-up caused a decrease of 3.7% (Z = −2.67, p < 0.05). The introduction of warm-up and dropout aims to optimize the training procedure (e.g., alleviating overfitting) and improve the robustness of the model. Compared with the raw model with no optimization, our model achieved a significant increase of 15.7% (Z = −2.81, p < 0.05), showing the combined benefit of all the optimization techniques.