1 Introduction
Human environments contain rich contextual information that could be used to power a variety of context-aware computing applications. Users’ presence in a kitchen, for example, often indicates food preparation activities, whereas classrooms indicate learning and theaters indicate entertainment. As a result, accurate and robust sensing of user presence in environments with varying functionalities has long been desired in HCI [1, 70, 71]. Additionally, fine-grained information on user location could also facilitate conventional sensor-aided approaches such as gait analysis [14, 30] and activity logging [26], as well as medical research and many more in-the-wild studies.
In this research, we create a wearable system to recognize ground surfaces, which are a universal and expressive feature of human environments and often strong indicators of user contexts. Surface texture, a distinguishing feature of any ground surface defined by four characteristics (lay, flaw, roughness, and waviness) [43], has recently received considerable attention in the sensing research field. For example, texture-based ground surface detection has been widely used in robotics applications, such as assisting mobile robots in detecting obstacles [73] and enabling autonomous agriculture [42].
When we walk barefoot on ground surfaces, we feel the soft grass of a lawn, the lumpy fabric of a carpet, the gritty soil of a hiking trail, the smooth tiles of a bathroom, the grainy wood of a floor, and the rough sand of a beach. We believe that wearable intelligence could benefit from a similar perceptual capability for sensing ground surfaces, but without human limitations in sensitivity, granularity, latency, and time of operation, in order to better understand environments and user contexts, provide assistance, accommodate natural interactions, and log important information for analysis and diagnosis.
As users’ feet are almost always in contact with ground surfaces, shoe-instrumented wearables serve as an ideal platform for sensing them. To enable shoe wearables to sense ground surfaces, we propose LaserShoes, a low-cost ground surface detection system using the laser speckle imaging technique (Fig. 1). In comparison with conventional vision-based approaches that take RGB photos of ground surfaces, laser speckle imaging reveals richer and more accurate information about surface textures using an active signal: laser beams. It can distinguish surface textures that appear visually similar to cameras. Additionally, unlike conventional imaging systems, laser speckle imaging does not require a lens and thus cannot capture clear visuals of users’ surroundings, which helps preserve privacy.
Our system mainly consists of a laser emitter, an image sensor (CCD), and a Raspberry Pi board. The laser emitter and the image sensor are attached to shoes to capture videos of speckle patterns that reflect surface textures. The Raspberry Pi board is instrumented on a user’s lower leg and runs the detection pipeline, which features a pre-processing phase to eliminate blurry images and a deep learning model to acquire ground surface types. The entire system costs $136. We recruited 15 participants in a user study where they were asked to walk on 24 ground surfaces for 1~2 minutes each. In total, we collected 28,492 1.5s video sessions. We validated our system under within-user and cross-user conditions, achieving classification accuracies of \(86.93\%\) and \(80.57\%\), respectively. We also carried out three additional studies to characterize the performance of our system in detecting dry, wet, and frozen surfaces, sand surfaces of different grain sizes, and surfaces under various lighting conditions. Finally, we demonstrated applications enabled by our system, such as a personal running assistant, gait analysis, surface-aware cleaning equipment, coarse navigation, and daily activity recognition through localization.
In summary, our main contributions include:
•
We designed and implemented LaserShoes, a wearable system that identifies ground surfaces based on laser speckle imaging.
•
We designed a data pre-processing method for LaserShoes to identify relatively stationary frames from collected videos and completed an end-to-end real-time inference pipeline based on contemporary deep learning techniques.
•
We conducted an evaluation with 15 participants to investigate the performance of LaserShoes with two validation mechanisms (i.e., within- and cross-user), and under various surface and environmental conditions.
3 Principles of Operation
LaserShoes is based on two principles of operation: 1) we used Laser Speckle Imaging to detect ground surface textures, and 2) we used the variance of grayscale-converted frames from recorded videos to infer gait status and obtain high-quality speckle images.
First, Laser Speckle Imaging can reveal surface texture characteristics. When a beam of coherent light (e.g., a laser) illuminates a ground surface, the light is reflected and captured by a nearby image sensor, forming an image with laser speckles, as shown in Fig. 2 (a). This phenomenon occurs because ground surfaces are rough: their micro geometry varies the optical paths of the laser beam. Thus, each pixel of the image sensor receives the reflected laser light with different constructive and destructive interference, forming laser speckles. Because different ground surfaces have different micro geometries, the resulting laser speckle patterns vary and can be leveraged to identify ground surfaces.
Second, we applied Laser Speckle Imaging with the consideration that a user’s feet can be in constant motion (e.g., walking and running) relative to ground surfaces. The sensor’s movement relative to the ground manifests as motion blur on images, resulting in blurry laser speckle images with lower variances than those with sharp speckles. As illustrated in Fig. 2 (b), laser speckle images are much clearer when a user’s foot is in contact with the ground than when the foot is moving in the air. We utilized the variances of grayscale speckle images to identify foot-ground contact periods in recorded videos and used only speckle images collected from these periods for the subsequent classification.
4 Hardware Design
We prototype LaserShoes to investigate the capabilities of laser imaging in ground surface detection. Although our current implementation is relatively bulky and impractical for direct adoption, our end-to-end prototype enables us to effectively verify our sensing principle, conduct technical evaluation, and explore potential applications. The form factor of our current prototype is akin to established works in the HCI community [7, 61, 68]. In this section, we introduce our hardware configurations and fabrication.
4.1 Embedded System
We apply Laser Speckle Imaging to capture speckle patterns and recognize ground surfaces. The technique has been used in the HCI community and can be eye-safe [4]. To utilize this technique, our system consists of four parts: 1) a laser emitter, 2) an image sensor, 3) a Raspberry Pi board, and 4) assistant modules. The laser emitter and the image sensor compose the detecting component, while the other parts compose the processing and assistant component. The hardware details of our system are shown in Fig. 3. Compared to prior works [20, 65], the core sensors bundled in our system are more compact and easier to mount on shoes. The enclosure of the system is 3D printed using photosensitive resin. The entire system, including manufacturing, costs $135.23, of which the laser emitter and image sensor together cost $23.14. The cost of each component is shown in Table 1.
Laser Emitter. We select a laser emitter with a 520 nm wavelength and 5 mW output power based on our configuration experiments (see Section 4.2.2). Given that a low-power laser emitter results in insufficient illumination and unclear speckle patterns, while a high-power laser may not be eye-safe, we ultimately choose a 5 mW laser (Class IIIA), which poses a chronic viewing hazard but is safe for transient exposures. Additionally, to maximize laser reflection and preserve the signal-to-noise ratio (SNR), we orient the laser emitter perpendicular to ground surfaces.
Image Sensor. Given that our system is mounted on users’ shoes, it is subject to movement as users walk, leading to the loss of speckle information in parts of the image due to motion blur. To extract images with clear speckle patterns from captured videos, we select an OV2710 image sensor with a relatively high frame rate of 60 fps. We set the resolution of the image sensor to 1280 × 720 pixels, the highest resolution at the 60-fps frame rate. It is worth noting that our system does not use a lens because laser beams reflected by ground surfaces are always in focus, producing sharp speckle patterns distributed uniformly across the captured images when a user’s shoe is relatively still with respect to ground surfaces. To further improve SNR, the image sensor is placed right next to the laser emitter.
Raspberry Pi Board and Assistant Modules. For image acquisition and processing, we choose the Raspberry Pi Zero 2 W for its compact size, superior speed, and wireless connectivity. With the connected laser emitter and image sensor, the Raspberry Pi board carries out three functions: 1) supplying power to the laser emitter from GPIO, 2) acquiring videos from the image sensor through a USB interface, and 3) processing acquired videos and reporting the detected type of ground surface to users. The assistant modules include a battery module, a USB interface module, and a switch module to safely supply power to the entire system.
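To make the acquisition setup concrete, the following minimal Python sketch shows how such a USB image sensor could be opened and read with OpenCV. This is illustrative only: the device index is an assumption, and our actual capture and pre-processing code is implemented in C++ (see Section 5.4).

```python
import cv2

# Illustrative capture setup; device index 0 is an assumption.
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)   # highest resolution at 60 fps
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)
cap.set(cv2.CAP_PROP_FPS, 60)

session = []
while len(session) < 90:                  # one 1.5 s video session (Section 5)
    ok, frame = cap.read()
    if not ok:
        break
    session.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
cap.release()
```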
4.2 Configurations
In order to identify the optimal configuration of our system, we conducted experiments using various combinations of laser wavelengths and sensor-to-ground distances, as these are two significant factors affecting the formation of laser speckles, and investigated their surface classification performance. In these experiments, we used an image sensor of a model commonly used in webcams, with a pixel size of 3 μm × 3 μm.
4.2.1 Image sensor.
Given that our system operates in a moving scenario, an image sensor with a sufficient frame rate is required to ensure the quality of captured videos and to extract clear speckle patterns from them. Through experiments in which researchers collected videos while walking at their normal speed with the camera configured at different frame rates, we discovered that the standard 30-fps frame rate is insufficient due to motion effects, resulting in an excessive number of blurry images. On the other hand, sensors with higher frame rates are often costly, which contradicts our design goal of being low-cost. As a result, we choose a frame rate of 60 fps and rely on a custom pre-processing pipeline to mitigate motion blur (see Section 5.1).
4.2.2 Wavelength and distance.
Since infrared lasers are difficult to debug, we selected laser wavelengths in the visible spectrum. Specifically, in our experiments, we investigated four representative laser wavelengths (405 nm, 450 nm, 520 nm, and 650 nm). In terms of distance, considering that our system is intended to be fixed on shoes, which typically sit a short distance above ground surfaces, we kept the distance as short as possible while maintaining sufficient clearance for the light path (i.e., from the emitter to ground surfaces and back to the image sensor). Thus, for each wavelength, we investigated its performance at distances to the ground surface of 1 cm, 3 cm, 5 cm, 7 cm, 9 cm, 11 cm, 13 cm, and 15 cm (Fig. 4).
For each wavelength-and-distance combination, we collected a number of images with speckle patterns on five surfaces (wood, fabric, concrete, rubber, and ceramic). During the collection, we manually swapped laser emitters of different wavelengths and adjusted the sensor distance to the ground surface. To evaluate the quality of these images, we conducted a quick validation using ResNet-18 [21], with the collected images split into a training set and a testing set. Our assumption is that laser speckle images with high-quality speckle patterns will yield relatively high classification accuracy, revealing optimal wavelength-and-distance combinations.
The average classification accuracies and their standard deviations for all wavelength-and-distance combinations are shown in Appendix A. Results indicate that the green laser (520 nm) exhibits both high accuracy and stability, though almost all combinations reach high classification accuracies. When the distance is under 11 cm, the accuracies of the green laser are all above \(98\%\). Thus, in our subsequent studies, we choose the green laser with a 520 nm wavelength and keep the distance between the sensor and ground surfaces under 11 cm when affixing the sensor to users’ shoes.
4.3 Mechanical Structure and Fabrication
We build a mechanical structure of two modules that together allow adjusting the angle of the detecting component relative to ground surfaces and fixing the system on a user’s leg (Fig. 3). The first module consists of five parts: two semi-cubic shells forming a container (b11, b13), a limiter with two cylindrical channels (b12), a cylindrical housing (b14), and a clamping part (b15). The two semi-cubic shells are joined into a cube container by screws on the side. The image sensor is fixed inside the cube container via slots in the four corners of its inner side, and the laser is fixed on the bottom side of the cube container via the limiter (b12). Rivet structures connect the cube container to the cylindrical housing (b14) and implement the rotatable connection between the cylindrical housing and the clamping part (b15). Screws through a series of discontinuous holes in the cylindrical housing and the clamping part allow an adjustable angle between the cube container and the clamping part, ranging from 0 to 90 degrees in 15-degree steps. As the clamping part of the first module is fixed to the outer side of a user’s ankle, adjusting this angle changes the angle of the laser sensing beam relative to the user’s leg and thus to the ground surfaces.
The second module contains four parts: a supporting part (b8), a square housing (b9), a top lid (b10), and a controller box (b5). Among these, b8, b9, and b10 are joined by three studs on the corners to form a container for the combined structure of the Raspberry Pi board and the battery module. The container measures approximately 65.7 mm in length, 30.6 mm in width, and 46.0 mm in height. The USB port and the charging port are exposed on the exterior of the container. The controller box (b5) contains the switch module and is attached to the rest of the module with a side slide. This module is fixed to the outside of the user’s lower leg with straps fitting through b8, and the main structure of the container is kept away from the user’s skin to avoid possible discomfort due to the heat dissipation of our system. The above mechanical structures are 3D printed with photosensitive resin at a 0.05 mm resolution using a Lite600HD 3D printer.
5 Ground Surface Detection
The whole ground surface detection pipeline of LaserShoes is illustrated in Fig. 5. The LaserShoes device is expected to work despite constant motion relative to ground surfaces while users are walking. Every 90 frames are treated as a video session, taking about 1.5 seconds to collect. This duration is selected based on our observation that at least one foot-ground contact appears within such a session when users walk at normal speeds.
Video sessions are fed into our ground surface detection system, which consists of a pre-processing phase and a deep learning model for classification. Specifically, in the pre-processing phase, we select images with clear speckle patterns from the collected videos and crop the selected speckle images into smaller images before feeding them into a deep learning model for classification, which also serves as a data augmentation technique that increases our data collection efficiency. This pre-processing phase allows LaserShoes to deal with distance changes and motion blur caused by users’ gait.
5.1 Data Pre-processing
The motion of users’ feet causes speckle patterns to blur and thus contain little information about ground surfaces (Fig. 2 (b)). To achieve high detection accuracy, it is necessary to extract high-quality images with clear speckle patterns. Our pre-processing phase contains four stages (Fig. 5 (b)-(e)): 1) identifying foot-ground contact periods, 2) cropping images, 3) removing partially blurry images, and 4) removing fuzzy patterns. Specifically, we first identify images collected during foot-ground contact periods. We then crop these foot-ground contact images into smaller images of size 256 × 256 and discard cropped images with partial blur or fuzzy patterns. After the pre-processing phase, we obtain a group of cropped images with clear speckle patterns to feed into our deep learning model. The details of each stage are explained below, and the efficacy of the data pre-processing is discussed in Section 8.1.
5.1.1 Identifying foot-ground contact periods with a variance-based threshold.
We observe that the distribution of bright and dark regions in speckle images contains the majority of information about ground surfaces, and that color is not a significant factor. Therefore, to increase the efficiency of our pre-processing phase, we convert all speckle images to grayscale.
The first step, after acquiring the grayscale frames, is to identify speckle images that correspond to foot-ground contact periods. These images are often less blurry and reveal more information about ground surfaces. We note that when LaserShoes is moving relative to the ground, the edges of the speckle patterns become fuzzy, resulting in lower variances of pixel intensities across the collected images. Fig. 6 shows example speckle images from a foot-ground contact period and from a user’s foot in motion, illustrating the difference in blurriness. Hence, by comparing pixel variances, we identify speckle images collected during foot-ground contact periods and pass them to the next stage.
We calculate the grayscale variance of each speckle image in each video session. We then recognize a speckle image as one collected during a foot-ground contact period if its cross-pixel variance is larger than the top 8% variance value of the previous 90-frame video segment. To further improve robustness, we use adjacent images to aid identification: we consider a speckle image to be a foot-ground contact image only when both its previous frame and its next frame also have high variance. Finally, before feeding the selected images into the next stage, we center-crop them because the edges of the CCD module lack sensitivity and cannot output clear laser speckles. The pseudo-code of this pre-processing stage is shown in Algorithm 1.
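A minimal Python sketch of the logic in Algorithm 1 follows. Our actual implementation is in C++; the function name is ours, and we read the "top 8% variance value" as the 92nd percentile of the previous segment’s variances, which is an assumption.

```python
import numpy as np

def select_contact_frames(gray_frames, prev_vars, crop_hw=(592, 1024)):
    """Sketch of Algorithm 1: keep frames from foot-ground contact periods.

    gray_frames: 90 grayscale frames of one video session (HxW uint8 arrays).
    prev_vars:   per-frame variances of the previous 90-frame segment.
    """
    threshold = np.percentile(prev_vars, 92)  # "top 8% variance value"
    variances = np.array([float(f.var()) for f in gray_frames])
    sharp = variances > threshold
    selected = []
    for i in range(1, len(gray_frames) - 1):
        # robustness: the previous and next frames must also be sharp
        if sharp[i - 1] and sharp[i] and sharp[i + 1]:
            h, w = gray_frames[i].shape
            ch, cw = crop_hw
            y0, x0 = (h - ch) // 2, (w - cw) // 2
            # center crop: CCD edges lack sensitivity
            selected.append(gray_frames[i][y0:y0 + ch, x0:x0 + cw])
    return selected, variances
```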
5.1.2 Cropping images.
The first stage yields foot-ground contact images of 1024 × 592 pixels. We conduct a test to investigate the effect of image size on the detection performance in Section 5.3, and choose 256 × 256 pixels as the size of our input data. Specifically, we use an extraction window of that size to crop input images out of each foot-ground contact image. This cropping operation also increases the number of samples and improves the efficiency of deep learning model training.
However, within each foot-ground contact image, some regions may still be blurry while others have clear speckle patterns. We eliminate crops with blurry speckle patterns at this stage to further improve our system’s robustness. Instead of the intuitive approach of calculating the pixel variance of every possible cropped image, which could be computationally expensive, we calculate the variances of cropped images along the left edge of a foot-ground contact image to determine the blurriness of the rows in which these crops reside. We note that, owing to the rolling shutter of our image sensor, the distribution of speckle patterns within an image row is often uniform (Fig. 6). Thus, we can determine whether a row has clear speckle patterns by inspecting only one section of it. Specifically, we slide the extraction window in the y direction to crop out different image patches and check whether they are clear by thresholding their pixel variances (Fig. 5). The slide stride is 56 pixels, so six cropped images are extracted from each foot-ground contact image. If a cropped image has a variance higher than the top 20 percent of all variances of all foot-ground contact images belonging to the current video session, we consider it to have clear speckle patterns and save it in a buffer. We also save the indexes of these cropped images and slide the extraction window along the rows at these indexes with a 128-pixel stride. The patches extracted in this step are candidate images. Histogram equalization is applied to candidate images to amplify their contrast before they are fed into the next pre-processing stage. Algorithm 2 shows the pseudo-code of this stage.
5.1.3 Removing partially blurry images with region-based sum comparison.
Blurry images may still result from the aforementioned stages. To eliminate them, we design an additional pre-processing stage for fine selection. Because histogram equalization greatly amplifies the contrast of these potentially blurry candidate images, the pixel statistics of different regions within such images vary greatly (Fig. 7 (a)). Thus, to identify blurry images, each candidate image is divided equally into four sub-images. We calculate the sum of the grayscale values of each sub-image and eliminate the candidate image if the difference between any two sums exceeds a given threshold. The remaining candidate images are then fed into the final pre-processing stage.
5.1.4 Removing fuzzy patterns with Gabor filters.
Since there may still be relative motion between our sensor and ground surfaces during the foot-ground contact period due to the deformation of ground surfaces, fuzzy patterns can appear in the speckle images. These fuzzy patterns often appear as stripes oriented in a particular direction, whereas clear speckle images have patterns with no obvious orientation (Fig. 7 (b) and (c)). To remove images with fuzzy patterns, we apply 8 Gabor filters with different orientations (30, 60, 120, 150, 210, 240, 300, and 330 degrees) and remove images with unbalanced filtered results. Specifically, we eliminate an image if the difference between any two filtered results is greater than a given threshold. The candidate images that survive the third and fourth stages are the output of our pre-processing phase and the input to the deep learning model. The pseudo-code for these two pre-processing stages is described in Algorithm 3.
5.2 Deep Learning Model
Image classification is a mature field in Computer Vision (CV), and many deep learning algorithms have shown remarkable performance. To choose a proper model for our sensing, we conduct a comparison study with different models, including ResNet-18 [21], VGG [47], GoogleNet [55], and MobileNetV3 [24]. As shown in Table 2, ResNet-18 and GoogleNet achieve comparatively high accuracies. We eventually choose ResNet-18 to implement LaserShoes for its smaller size, despite its slightly lower accuracy than GoogleNet.
In the ResNet model, input images first pass through a convolution layer, a batch normalization (BN) layer, and a rectified linear unit (ReLU) layer. The data then goes through a series of basic blocks, each consisting of a residual mapping and an identity mapping. In the residual mapping, the input passes through a convolution layer, a BN layer, a ReLU layer, another convolution layer, and another BN layer; in the identity mapping, the input only passes through a 1 × 1 convolution layer to be downsampled to the same size as the residual mapping result. The two mapping results are then added, and the sum passes through a ReLU layer to produce the output of a basic block. Finally, an average pooling layer and a fully connected layer yield the classification results. During training, we select Cross Entropy Loss as the loss function and use the Adam optimizer. The learning rate and the batch size are set to 0.0001 and 32, respectively. We do not use a pre-trained model to initialize our parameters, and we train for 150 epochs, which we find sufficient for our models to converge.
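The training setup described above maps to a few lines of PyTorch. The following sketch assumes 24 surface classes and grayscale crops replicated to three channels; data loading is omitted.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(num_classes=24)          # trained from scratch, no pretraining
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train(loader, epochs=150):            # 150 epochs suffice to converge
    model.train()
    for _ in range(epochs):
        for images, labels in loader:     # images: N x 3 x 256 x 256 crops
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```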
5.3 Image Size Selection
The model’s input is the set of clear candidate images from the data pre-processing phase, and the model’s output is the type of ground surface. The input image size is set to 256 × 256 in our ground surface detection, the same as the size used in SensiCut [13]. To verify the efficacy of this image size, we extract a number of clear candidate images of different sizes to train a series of ResNet-18 models. The image sizes tested were 64 × 64, 128 × 128, 256 × 256, and 512 × 512. The average accuracy and the inference time for classifying one input image are shown in Table 3. As expected, larger input images lead to higher accuracy but take significantly longer to classify. Given that the improvement in accuracy from 256 × 256 to 512 × 512 is modest, we select 256 × 256 as the input image size to balance accuracy with inference time.
5.4 Real-Time Inference
In real-time detection, the image sensor continually records frames, and every 90 frames constitute a video session that is fed into the pre-processing phase. If the pre-processing phase detects no clear candidate images, the detection pipeline outputs “None” as a neutral label. We conducted testing using 100 video sessions captured during participants’ normal walks on various everyday ground surfaces. Our results show that after the pre-processing phase, an average of 11 input images per video session are fed into the subsequent model. We use C++ to implement the data pre-processing for superior speed and Python to implement the deep learning model. For every input image of a video session, the classification model outputs a corresponding surface type; among these outputs, we choose the most frequent surface type as the surface label of the video session. This label is also provided to the user as detection feedback. We recorded the average time needed to complete pre-processing and inference for one video session, using 100 sessions collected from various participants and ground surfaces, processed on a Raspberry Pi Zero 2 W, on a laptop with a 3.1 GHz dual-core Intel Core i5 CPU, and on an NVIDIA GeForce RTX 3090 GPU, respectively. Results are shown in Table 4. We find that the current implementation of LaserShoes running solely on the Raspberry Pi board cannot perform real-time detection without dropping input images if the duty cycle of users’ feet contacting ground surfaces is too high, which we acknowledge as a limitation of our system.
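Per-session labeling reduces to a majority vote over the per-crop predictions; a minimal sketch (the function name is ours):

```python
from collections import Counter

def session_label(per_crop_predictions):
    """Majority vote over one session's per-crop model outputs; returns
    "None" when pre-processing yielded no clear candidate images."""
    if not per_crop_predictions:
        return "None"
    return Counter(per_crop_predictions).most_common(1)[0][0]
```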
6 Evaluation
Our user study consisted of one main study and three supplementary investigations. The main study involved collecting data on 24 ground surfaces to understand LaserShoes’ ability to classify ground surface materials while the wearer is walking. In the supplementary studies, we aimed to evaluate the robustness of LaserShoes under various conditions (i.e., on dry, wet, and icy surfaces, on sand surfaces of different grain sizes, and under different lighting conditions).
Considering that identifying the foot-ground contact periods of a 1.5s video session is the first pre-processing stage and the basis of the subsequent stages, a high detection accuracy (DA) of identifying foot-ground contact periods (FGCPs) is necessary. Thus, we first evaluated this detection accuracy, which is defined as
\( DA = \frac{\text{number of correctly identified FGCPs}}{\text{number of ground-truth FGCPs}} \).
Then, we used accuracy, precision, recall, and F1 score as our evaluation metrics for the ground surface classification. To calculate them, we only considered the 1.5s video sessions that produced a surface label (SL) output and excluded those with “None” outputs. The classification accuracy (CA) is defined as
\( CA = \frac{\text{number of correctly classified video sessions with an SL}}{\text{number of video sessions with an SL}} \).
6.1 Main Study with 24 Ground Surfaces
6.1.1 Ground surface materials.
We selected a total of 24 common ground surfaces, comprising 15 indoor surfaces and 9 outdoor surfaces, for our study. These surfaces can be classified into five groups: 1) rough, 2) smooth, 3) hard, 4) discontinuous, and 5) granular. The surfaces are shown in detail in Fig. 8. For each ground surface, we prepared at least one continuous area of 20 square meters so that participants could walk naturally (e.g., without needing to turn frequently or keep looking down at the ground) during data collection.
6.1.2 Participants and apparatus.
We recruited 15 participants (7 males and 8 females), with ages ranging from 20 to 27 years old (mean = 23.40, SD = 1.56) via social media and flyers. Their body weights ranged from 48.0kg to 82.6kg (mean = 61.03, SD = 9.93) and their heights ranged from 158.5cm to 182.0cm (mean = 170.13, SD = 6.83). Of all the participants, 5 wore sneakers, 6 wore running shoes, 3 wore canvas shoes, 1 wore ankle boots, and 1 wore snow boots. Their shoe sizes ranged from 23.0cm to 27.0cm, with a mean of 24.67 (SD = 1.12).
Participants wore their own shoes normally along with our LaserShoes as described in Section 4 to collect videos of ground surfaces while walking on them. Considering that our device requires proximity to ground surfaces, we required participants to wear flat shoes. Fig. 9 shows some example shoe styles that LaserShoes is compatible with. Distances between our image sensor and ground surfaces varied from 6 cm to 10 cm across the 15 participants. The detecting component was attached tightly to participants’ shoes through our clamping mechanism, while the processing and assistant component was attached to participants’ lower legs using Nylon tapes.
6.1.3 Data collection procedure.
We started the study with an introduction to the procedure and helped each participant put the devices on. For each surface, we used tape to mark an area within which participants could walk freely. Each study had two sessions. A short practice session came first, in which the participant walked across all surfaces; it familiarized participants with the system, and no data was collected. We asked participants to slow down their walk if LaserShoes captured no clear speckle patterns (i.e., no output from the pre-processing phase). After the practice session, participants walked on each chosen surface for 1~2 minutes in the second session for data collection. The order of surfaces was randomized for each participant to avoid bias (e.g., a change in walking speed or gait caused by fatigue). In addition, to simulate real-world scenarios, participants were asked to re-adjust their LaserShoes after each session and to take breaks (around 2 minutes) in between sessions. The study was conducted under typical indoor and outdoor lighting conditions. To collect ground truth for foot-ground contact periods, a camera recorded participants’ foot movements during the study, and research assistants manually labeled all foot-ground contact timestamps. In total, we collected 28,492 1.5s video sessions on 24 surfaces from the 15 participants. Each participant took around 2 hours to finish the data collection.
6.1.4 Results.
To evaluate the performance of our system for ground surface classification, we used both within-user and cross-user approaches. For the within-user evaluation, to ensure no overlap between the training and test sets, we first split all data into ten folds and randomly selected two folds as the test set; no time-adjacent input images were included in both the training and test sets. For the cross-user evaluation, we used a leave-one-out method, training on 14 participants’ data and testing on the remaining participant.
Detection Accuracy of Identifying Foot-Ground Contact Periods. The collected videos were processed using the method described in Section 5.1, and we first evaluated the performance of identifying foot-ground contact periods using the formula defined above. The detection accuracy is \(90.91\%\), indicating that our method can detect the majority of foot-ground contact periods from recorded data.
Within-User Evaluation Results. Results of the within-user detection accuracy for the 24 ground surfaces are shown in Fig. 10 (a). The average classification accuracy over the 24 ground surfaces is \(86.93\%\), with a recall of \(87.17\%\) (SD = 10.09), a precision of \(85.82\%\) (SD = 13.57), and an F1 score of \(85.94\%\) (SD = 10.59). For the 15 indoor surfaces, the average classification accuracy is \(91.53\%\), with a recall of \(90.60\%\) (SD = 9.62), a precision of \(92.48\%\) (SD = 7.23), and an F1 score of \(91.23\%\) (SD = 7.06); for the 9 outdoor surfaces, the average classification accuracy is \(78.86\%\), with a recall of \(81.46\%\) (SD = 8.07), a precision of \(74.73\%\) (SD = 14.39), and an F1 score of \(77.13\%\) (SD = 9.58). Indoor surface detection is more accurate than outdoor surface detection. The reason could be that outdoor lighting is less stable than indoor lighting due to changes in the intensity and angle of sunlight, which may reduce the quality of collected images and degrade detection results.
We also evaluated the detection accuracy of surfaces with different characteristics; the results are shown in Table 5. Rough surfaces have the highest accuracy and the lowest standard deviation among the five surface groups. This makes sense because the microstructure of rough surfaces is more complex, resulting in more distinctive patterns. Furthermore, discontinuous surfaces have the lowest average accuracy and a large standard deviation.
Cross-User Evaluation Results. For the cross-user evaluation, the detection results are shown in Fig. 10 (b). The average classification accuracy of the cross-user model is \(80.57\%\), with a recall of \(80.36\%\) (SD = 10.48), a precision of \(78.32\%\) (SD = 17.62), and an F1 score of \(78.73\%\) (SD = 13.86). For indoor and outdoor surfaces, the average classification accuracies are \(83.22\%\) and \(73.13\%\), with recalls of \(85.48\%\) (SD = 8.95) and \(71.85\%\) (SD = 6.56), precisions of \(87.79\%\) (SD = 10.00) and \(62.54\%\) (SD = 16.21), and F1 scores of \(86.39\%\) (SD = 8.45) and \(65.97\%\) (SD = 11.53), respectively. In contrast to the within-user results, classification accuracy decreases in the cross-user evaluation. This could be because participants wore different shoes in the study, which caused different distances between the image sensor and ground surfaces. Furthermore, participants’ different foot postures when their feet contact ground surfaces contribute to the decrease in accuracy: some participants’ feet were in eversion, while others were in inversion or in neutral positions. These different foot postures (Fig. 11) change the distance between the image sensor and ground surfaces. The distance differences result in differently formed speckle patterns and thus variance between training and test datasets: the same type of ground surface may correspond to multiple speckle patterns. This variance may decrease the accuracy of the cross-user evaluation. As with the within-user results, indoor detection outperformed outdoor detection.
We also tested the performance of the cross-user model on the five groups of surfaces with different characteristics. The results, shown in Table 5, indicate that compared to the within-user results, detection accuracy did not change much for smooth and hard surfaces. However, for rough, discontinuous, and granular surfaces, there is a large decrease. The reason may be that surfaces with complex microstructures amplify the differences in participants’ foot postures, resulting in larger differences between speckle patterns belonging to the same type of ground surface.
Visually Similar Ground Surfaces. Among our selected ground surfaces, light-colored wood and artificial flooring look very similar and are not easy to distinguish with conventional RGB cameras. The results in Fig. 10 reveal that under both within-user and cross-user conditions, these two visually similar surfaces can be distinguished from each other with LaserShoes.
6.2 Supplementary Investigations
Given the length of the primary data collection, the supplementary studies were not conducted on the same day, to avoid participant fatigue. 12 participants took part in our supplementary studies. The basic procedure was the same as in the main study. In total, we collected 19,319, 4,250, and 41,005 1.5s video sessions for the three studies, respectively.
6.2.1 Dry, wet, and icy surfaces.
In outdoor settings, ground surfaces can be dry, wet, or icy depending on the weather, which may pose a danger to pedestrians. Thus, the capability of LaserShoes to identify ground surface conditions could have real-world uses. We conducted experiments to classify surface conditions on the nine types of outdoor surfaces shown in Fig. 8, under three conditions (i.e., dry, wet, and icy). For the wet condition, we poured water on the ground; for the icy condition, we put crushed ice on the ground. We conducted two evaluations in this study. In the first evaluation, we treated each combination of surface and condition as a separate label (27 in total). In the second evaluation, we combined all surfaces under the icy condition into one label (19 in total). The detailed results are shown in Fig. 12. In the first evaluation, the detection model achieves a \(62.89\%\) recall, a \(66.06\%\) precision, and a \(59.91\%\) F1 score. In the second evaluation, after merging icy surfaces, the detection model achieves a \(76.06\%\) recall, a \(76.75\%\) precision, and a \(74.29\%\) F1 score. These results show the feasibility of LaserShoes detecting ground surface conditions in real-world applications to improve pedestrian safety.
6.2.2 Sand surfaces with different grain sizes.
Even when the material is the same, its physical state (e.g., graininess, looseness) can vary. We also investigated how LaserShoes could perform finer-grained ground surface material sensing. Participants were asked to walk on three different types of sand surfaces following the same procedure as the main study. Specifically, we assessed the classification performance using data collected on sand surfaces with three different grain sizes (i.e., small, medium, and large). The classification accuracy for the sand types is \(92.28\%\), with an \(87.60\%\) recall, a \(95.56\%\) precision, and a \(90.59\%\) F1 score, indicating that LaserShoes can identify the same type of surface with different fine-grained surface geometries.
6.2.3 Different lighting conditions.
Lighting conditions may affect the quality of speckle images and thus the ground surface detection performance. To test the robustness of LaserShoes against this factor, we collected data in five different lighting conditions. These conditions included two for the 15 indoor surfaces and three for the 9 outdoor surfaces, and are listed as follows:
•
Indoor-with-light: lamps (cold light source) on in a room.
•
Indoor-without-light: lamps off in a room.
•
Outdoor-at-daytime: strong sunlight outdoors during daytime.
•
Outdoor-at-dusk: little sunlight outdoors at dusk.
•
Outdoor-at-night: no sunlight, with streetlamps on, outdoors at night.
We trained five classification models, each using the data collected under one lighting condition. Table 6 shows the average surface classification accuracies for the five lighting conditions. The results demonstrate that, with the exception of the outdoor-at-daytime condition, the classification accuracy for all conditions was above 87%. This indicates the robustness of LaserShoes, except under lighting conditions with strong ambient light, which require further improvement.
8 Discussion
8.1 Efficacy of Data Pre-processing
Although machine learning models are somewhat resilient to noisy data points, noisy inputs require more computational power during inference. To alleviate this burden, a denoising process is commonly performed before feeding data into machine learning models [68, 69]. In our case, if we did not remove blurry images, the time consumed by inference would be large, which runs counter to our goal of real-time prediction. Even if we only extracted one image by cropping each raw frame and performed no further data pre-processing, 90 images from each video session would be fed into the classification model. In contrast, the average number of images fed into the classification model after data pre-processing is 11, indicating that our data pre-processing significantly reduces computation costs during inference.
Further, to evaluate the influence of the data pre-processing step in terms of ground surface classification performance, we conducted experiments on data collected from one of our participants. The experiment procedure is the same as our main study except that we replaced the pre-processing part with cropping one 256 × 256 image from each frame. For the classification model trained with raw data, the recall, precision, and F1 are \(64.25\%\), \(67.22\%\), and \(60.60\%\), respectively. For the classification model trained with data after pre-processing, the recall, precision, and F1 are \(88.45\%\), \(88.05\%\), and \(87.60\%\), respectively. Therefore, conducting our data pre-processing step can achieve better performance compared to using raw data.
8.2 Avoiding Overfitting
Overfitting is a common issue in deep learning applications, especially when the number of training samples is small. To prevent the deep model in our system from overfitting, common techniques including data augmentation and normalization were applied during training. Besides, as described in Section 5.1.2, cropping a raw speckle image into multiple smaller input images helps increase the number of training samples. Moreover, we set the number of training epochs to 150 after experiments on a validation set showed that the training loss converged while the validation loss did not degrade at around 150 epochs. The evaluation results with high classification accuracies, especially those from the cross-user study, demonstrate effective mitigation of overfitting.
8.3 Power Consumption
There are three main parts that consume power in our system: the laser emitter with a switch module (51.3 mW), the image sensor (1047.9 mW), and the Raspberry Pi (2643.6 mW). LaserShoes has a relatively high total consumption, which prevents it from being used continuously for a long time without battery exchange. In the future, to reduce the power consumption of the Raspberry Pi, the collected data could be transferred to a cloud server via low-power wireless communication. We could also design a custom circuit that reduces power consumption by removing unused components and using a low-power MCU and communication modules. Besides, the current image sensor captures images of 1280 × 720 pixels for efficient data collection; in live classification, the input images need not be that large and could be taken by smaller image sensors to save power.
8.4 Sensing Surfaces Ahead for Early Alerts
Since LaserShoes uses images captured when a user’s foot is in contact with the ground, our system in its current implementation cannot predict ground surface conditions in advance, limiting use scenarios such as alerts about dangerous surface conditions. To achieve this, LaserShoes would need to leverage in-flight images. To mitigate the motion effect, we could add an IMU sensor to measure motion speed and implement deblurring methods [9, 17]. Image sensors with a short exposure time could also help obtain clear images while the user’s foot is moving in the air. Second, we could tilt the device up to sense ground surfaces in front of a user for early alerts (Fig. 15 (a)). We performed a test to see whether our sensing system could still function with the device tilted up, pointing to the front of a shoe. Results indicate discernible speckle patterns up to 45 degrees for the three types of surfaces we tested (Fig. 15 (b)). However, it merits future research to investigate how this sensor configuration could work in real use cases powered by real-time signal processing and classification.
8.5 Loose or Transparent Ground Surfaces
In practice, we discover that LaserShoes cannot capture frames with high-quality speckle patterns on loose ground surfaces such as grass, owing to insufficient reflected light intensity. We suspect that grass surfaces diffuse or absorb most of the laser energy due to their layered surface micro geometries. Besides, on transparent ground surfaces such as glass (Fig. 15 (b)), the reflected laser is also weakened. Though speckle patterns can still form on transparent ground surfaces, information about the textured surface underneath the transparent layer is much diluted, resulting in less discernible speckle patterns than those induced on surfaces without the transparent layer.
8.6 LaserShoes under Intense Ambient Light
When the ambient light is too intense, the image sensor receives too much ambient light, which lowers the signal-to-noise ratio (SNR). As a result, the speckle patterns became blurry or invisible under some outdoor conditions in our study. To mitigate this issue, future systems could leverage optical filters. Given that laser light is polarized and has a narrow frequency band, we could place a polarizer or a band-pass filter between ground surfaces and the image sensor. These filters would make the laser the dominant signal on captured laser speckle images, preserving sufficient SNR for classification. Another tactic to preserve SNR is synchronous detection, with the image sensor and the laser in sync. Specifically, we could leverage high-speed image sensors to take two consecutive photos, one with the laser on and one with it off. Subtracting one photo from the other should remove most of the effect of ambient light, which is relatively constant across the two exposures.
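As a rough sketch of this synchronous-detection idea (assuming perfectly alternating laser-on/laser-off exposures, which our current hardware does not provide):

```python
import numpy as np

def ambient_rejected(frame_on, frame_off):
    """Subtract a laser-off frame from a laser-on frame; roughly constant
    ambient light cancels, leaving mostly the laser speckle signal."""
    diff = frame_on.astype(np.int16) - frame_off.astype(np.int16)
    return np.clip(diff, 0, 255).astype(np.uint8)
```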
8.7 Form Factor Optimization
Our current implementation is relatively bulky. Furthermore, different image sensor heights, which are affected by shoe styles and foot postures, reduce ground detection accuracy, as discussed in Section 6.1.4. In the future, LaserShoes could be replicated with better form factor designs.
One possible solution is to make the height of the image sensor consistent across shoe styles by adding a height-adjustable mechanical module, as shown in Fig. 16 (a). This module could also mitigate variances introduced by foot posture by asking users to calibrate and adjust LaserShoes before use.
Since the diode of a laser emitter and the chip of an image sensor are both very small, they could be combined into a single integrated component that might be sufficiently thin to integrate into a smart sole under a shoe, as shown in Fig. 16 (b). In this case, the sensing distance would be short and consistent, and the sensor would be isolated from ambient light when the sole is in contact with ground surfaces, all of which could result in an improved SNR and thus higher classification accuracies.