Search Results (98)

Search Parameters:
Keywords = masked face recognition

37 pages, 8629 KiB  
Review
A Comprehensive Survey of Masked Faces: Recognition, Detection, and Unmasking
by Mohamed Mahmoud, Mahmoud SalahEldin Kasem and Hyun-Soo Kang
Appl. Sci. 2024, 14(19), 8781; https://doi.org/10.3390/app14198781 - 28 Sep 2024
Cited by 1 | Viewed by 1169
Abstract
Masked face recognition (MFR) has emerged as a critical domain in biometric identification, especially with the global COVID-19 pandemic, which introduced widespread face masks. This survey paper presents a comprehensive analysis of the challenges and advancements in recognizing and detecting individuals with masked faces, which has seen innovative shifts due to the necessity of adapting to new societal norms. Advanced through deep learning techniques, MFR, along with face mask recognition (FMR) and face unmasking (FU), represents significant areas of focus. These methods address unique challenges posed by obscured facial features, from fully to partially covered faces. Our comprehensive review explores the various deep learning-based methodologies developed for MFR, FMR, and FU, highlighting their distinctive challenges and the solutions proposed to overcome them. Additionally, we explore benchmark datasets and evaluation metrics specifically tailored for assessing performance in MFR research. The survey also discusses the substantial obstacles still facing researchers in this field and proposes future directions for the ongoing development of more robust and effective masked face recognition systems. This paper serves as an invaluable resource for researchers and practitioners, offering insights into the evolving landscape of face recognition technologies in the face of global health crises and beyond. Full article
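For orientation, the matching step shared by most MFR pipelines this survey covers reduces to comparing a masked probe embedding against an enrolled gallery embedding and thresholding their cosine similarity. The sketch below is a generic illustration under that assumption; the 512-dimensional embeddings and the threshold value are placeholders, not anything specified by the survey.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(probe_emb: np.ndarray, gallery_emb: np.ndarray, threshold: float = 0.35) -> bool:
    """Accept the identity claim if the embeddings are similar enough.
    `threshold` is a placeholder; real systems tune it on a validation set
    to reach a target false-accept rate."""
    return cosine_similarity(probe_emb, gallery_emb) >= threshold

# Toy usage with random 512-d vectors standing in for face-encoder outputs.
rng = np.random.default_rng(0)
probe, gallery = rng.normal(size=512), rng.normal(size=512)
print(verify(probe, gallery))
```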
Figures

Figure 1: Illustration showcasing the tasks of masked face recognition (MFR), face mask recognition (FMR), and face unmasking (FU) with varied outputs for the same input.
Figure 2: Illustrates the evolving landscape of MFR and FMR studies from 2019 to 2024. The data were sourced from Scopus using keywords "Masked face recognition" for MFR and "Face mask detection", "Face masks", and "Mask detection" for FMR.
Figure 3: Samples of masked and unmasked faces from the real-mask masked face datasets used in masked face recognition.
Figure 4: Samples from real masked face datasets used in face mask recognition.
Figure 5: Samples of synthetic masked faces from benchmark datasets.
Figure 6: Illustration of the FMR-Net architecture for face mask recognition, depicting two-subtask scenarios: 2-class (with and without mask) and 3-class (with, incorrect, and without mask).
Figure 7: Overview of the GAN network as an example of FU-Net for face mask removal.
Figure 8: Face unmasking outputs from three state-of-the-art models: GANMasker, GUMF, and FFII-GatedCon. The first column shows the input masked face, while the second column displays the original unmasked face for reference.
Figure 9: Three directions in masked face recognition (MFR): face restoration, masked region discarding, and deep learning-based approaches.
16 pages, 730 KiB  
Article
Dimensions of Alexithymia and Identification of Emotions in Masked and Unmasked Faces
by Thomas Suslow, Anette Kersting and Charlott Maria Bodenschatz
Behav. Sci. 2024, 14(8), 692; https://doi.org/10.3390/bs14080692 - 9 Aug 2024
Viewed by 825
Abstract
Alexithymia, a multifaceted personality construct, is known to be related to difficulties in the decoding of emotional facial expressions, especially in case of suboptimal stimuli. The present study investigated whether and which facets of alexithymia are related to impairments in the recognition of emotions in faces with face masks. Accuracy and speed of emotion recognition were examined in a block of faces with and a block of faces without face masks in a sample of 102 healthy individuals. The order of blocks varied between participants. Emotions were recognized better and faster in unmasked than in masked faces. Recognition performance was worst and slowest for participants starting the task with masked faces. In the whole sample, there were no correlations of alexithymia facets with accuracy and speed of emotion recognition for masked and unmasked faces. In participants starting the task with masked faces, the facet externally oriented thinking was positively correlated with reaction latencies of correct responses for masked faces. Our findings indicate that an externally oriented thinking style could be linked to a less efficient identification of emotions from faces wearing masks when task difficulty is high and support the utility of a facet approach in alexithymia research. Full article
Figures

Figure 1: Number of hits (correct responses) in the emotion recognition task as a function of mask and order of blocks, i.e., start with unmasked vs. start with masked faces (error bars: standard error).
Figure 2: Response times (in seconds) for correct answers in the emotion recognition task as a function of mask and order of blocks, i.e., start with unmasked vs. start with masked faces (error bars: standard error).
24 pages, 11993 KiB  
Article
A Method for Extracting Joints on Mountain Tunnel Faces Based on Mask R-CNN Image Segmentation Algorithm
by Honglei Qiao, Xinan Yang, Zuquan Liang, Yu Liu, Zhifan Ge and Jian Zhou
Appl. Sci. 2024, 14(15), 6403; https://doi.org/10.3390/app14156403 - 23 Jul 2024
Viewed by 863
Abstract
The accurate distribution of joints on the tunnel face is crucial for assessing the stability and safety of surrounding rock during tunnel construction. This paper introduces the Mask R-CNN image segmentation algorithm, a state-of-the-art deep learning model, to achieve efficient and accurate identification and extraction of joints on tunnel face images. First, digital images of tunnel faces were captured and stitched, resulting in 286 complete images suitable for analysis. Then, the joints on the tunnel face were extracted using traditional image processing algorithms, the commonly used U-net image segmentation model, and the Mask R-CNN image segmentation model introduced in this paper to address the lack of recognition accuracy. Finally, the extraction results obtained by the three methods were compared. The comparison results show that the joint extraction method based on the Mask R-CNN image segmentation deep learning model introduced in this paper achieved the best joint extraction effect with a Dice similarity coefficient of 87.48%, outperforming traditional methods and the U-net model, which scored 60.59% and 75.36%, respectively, realizing accurate and efficient acquisition of tunnel face rock joints. These findings suggest that the Mask R-CNN model can be effectively implemented in real-time monitoring systems for tunnel construction projects. Full article
(This article belongs to the Section Civil Engineering)
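The Dice similarity coefficient used to rank the three methods above (87.48% vs. 75.36% and 60.59%) can be computed from binary joint masks as follows; this is a generic metric sketch, not the authors' evaluation code. Identical masks score 1, disjoint masks score 0.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-7) -> float:
    """Dice = 2|P ∩ T| / (|P| + |T|) for binary segmentation masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return (2.0 * intersection + eps) / (pred.sum() + truth.sum() + eps)

# Example: two 4x4 masks that overlap on two pixels.
pred = np.array([[1, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
truth = np.array([[1, 1, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]])
print(round(dice_coefficient(pred, truth), 3))  # 2*2 / (3+3) ≈ 0.667
```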
Figures

Figure 1: Image acquisition equipment.
Figure 2: Tunnel-lined platform car light source.
Figure 3: Adverse Interferences in Tunnel Face Photography. (The red boxes are where the tunnel face is obscured).
Figure 4: Map of tunnel locations.
Figure 5: Onsite digital image shooting plan. (Each number corresponds to a part of the tunnel face that is divided.)
Figure 6: Image overlapping area.
Figure 7: Unnatural edges in image stitching. (As the yellow box marks).
Figure 8: Comparison of the Effects Between Stitched and Fused Images and Full Cross-Section Photographed Images.
Figure 9: Grayscale processing result of tunnel face image.
Figure 10: Bilateral filtering effect on tunnel face image.
Figure 11: Binarized image.
Figure 12: Comparison of Joints Before and After Morphological Processing.
Figure 13: Schematic diagram of morphological processing.
Figure 14: Image noise removal process.
Figure 15: Comparison of non-joint and joint areas.
Figure 16: Main interface view of EISeg annotation software.
Figure 17: Using EISeg software for joint data annotation.
Figure 18: Dataset augmentation operations. (The orange line is added later to determine the direction of the picture).
Figure 19: U-Net convolutional neural network architecture.
Figure 20: Schematic diagram of convolution processing.
Figure 21: ReLU Function Graph.
Figure 22: Max pooling diagram.
Figure 23: Average pooling diagram.
Figure 24: Changes in Loss Values for Training and Validation Sets.
Figure 25: Changes in accuracy for training and validation sets.
Figure 26: Comparison of U-Net prediction results.
Figure 27: Mask R-CNN network architecture.
Figure 28: Bilinear interpolation effect.
Figure 29: Changes in loss values.
Figure 30: Comparison of Mask R-CNN prediction results. (The red boxes in subfigure (c) are the identified joints).
Figure 31: Comparison of prediction results.
17 pages, 754 KiB  
Article
A Dynamic Multi-Scale Convolution Model for Face Recognition Using Event-Related Potentials
by Shengkai Li, Tonglin Zhang, Fangmei Yang, Xian Li, Ziyang Wang and Dongjie Zhao
Sensors 2024, 24(13), 4368; https://doi.org/10.3390/s24134368 - 5 Jul 2024
Viewed by 737
Abstract
With the development of data mining technology, the analysis of event-related potential (ERP) data has evolved from statistical analysis of time-domain features to data-driven techniques based on supervised and unsupervised learning. However, there are still many challenges in understanding the relationship between ERP components and the representation of familiar and unfamiliar faces. To address this, this paper proposes a model based on Dynamic Multi-Scale Convolution for group recognition of familiar and unfamiliar faces. This approach uses generated weight masks for cross-subject familiar/unfamiliar face recognition using a multi-scale model. The model employs a variable-length filter generator to dynamically determine the optimal filter length for time-series samples, thereby capturing features at different time scales. Comparative experiments are conducted to evaluate the model’s performance against SOTA models. The results demonstrate that our model achieves impressive outcomes, with a balanced accuracy rate of 93.20% and an F1 score of 88.54%, outperforming the methods used for comparison. The ERP data extracted from different time regions in the model can also provide data-driven technical support for research based on the representation of different ERP components. Full article
(This article belongs to the Section Sensing and Imaging)
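Balanced accuracy and the F1 score reported above (93.20% and 88.54%) are standard classification metrics; a minimal sketch of computing them from per-trial predictions is shown below, with toy labels standing in for the cross-subject ERP test split.

```python
from sklearn.metrics import balanced_accuracy_score, f1_score

# Toy familiar (1) vs. unfamiliar (0) predictions; real inputs would be
# per-trial classifications produced by the multi-scale model.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

print(balanced_accuracy_score(y_true, y_pred))  # mean of per-class recalls
print(f1_score(y_true, y_pred))                 # harmonic mean of precision and recall
```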
Figures

Figure 1: Flowchart of a single experiment. Note: The faces in the flowchart are generated by AI. The experimental paradigm and procedure used are the same as those in [25].
Figure 2: Multi–scale variable–length depthwise separable convolution. In the 0–1 mask, the blue cubes represent a mask value of 1, the red cubes represent a mask value of 0, and the light blue represents values between 0 and 1. ⊕ represents concatenation, ⊗ represents multiplication, and ⊖ indicates the use of Formula (8) on the next gray module to obtain the light blue module, where i is the index of the scale, and j is the time step from 1 onwards. Solid lines represent the flow of ERP data, while dashed lines indicate the flow of weights. The gray dashed box encompasses the dynamic generative weight mask model, while outside the dashed box is the standard model.
Figure 3: The average value of the Pz channel. The blue portion represents the standard deviation of positive samples, while the red region represents the standard deviation of the averaged negative samples. The gray shadow indicates the region of significant difference between familiarity and unfamiliarity (p < 0.05).
Figure 4: Comparison of the validation losses across models.
Figure 5: Differences in the FPR between the masked weight model and the standard model with different numbers of scales (p < 0.01).
Figure 6: Differences in the TPR between the masked weight model and the standard model with different numbers of scales (p < 0.5).
Figure 7: Differences in the BA between the masked weight model and the standard model with different numbers of scales (p < 0.01).
Figure 8: Differences in the F1 score between the masked weight model and the standard model with different numbers of scales (p < 0.01).
Figure 9: Differences in the AUC between the masked weight model and the standard model with different numbers of scales (p < 0.01).
Figure 10: The effect of the window size and window position.
21 pages, 29279 KiB  
Article
Neutral Facial Rigging from Limited Spatiotemporal Meshes
by Jing Hou, Dongdong Weng, Zhihe Zhao, Ying Li and Jixiang Zhou
Electronics 2024, 13(13), 2445; https://doi.org/10.3390/electronics13132445 - 21 Jun 2024
Viewed by 703
Abstract
Manual facial rigging is time-consuming. Traditional automatic rigging methods lack either 3D datasets or explainable semantic parameters, which makes it difficult to retarget a certain 3D expression to a new face. To address the problem, we automatically generate a large 3D dataset containing semantic parameters, joint positions, and vertex positions from a limited number of spatiotemporal meshes. We establish an expression generator based on a multilayer perceptron with vertex constraints from the semantic parameters to the joint positions and establish an expression recognizer based on a generative adversarial structure from the joint positions to the semantic parameters. To enhance the accuracy of key facial area recognition, we add local vertex constraints for the eyes and lips, which are determined by the 3D masks computed by the proposed projection-searching algorithm. We test the generation and recognition effects on a limited number of publicly available Metahuman meshes and self-collected meshes. Compared with existing methods, our generator has the shortest generation time of 14.78 ms and the smallest vertex relative mean square error of 1.57 × 10⁻³, while our recognizer has the highest accuracy of 92.92%. The ablation experiment verifies that the local constraints can improve the recognition accuracy by 3.02%. Compared with other 3D mask selection methods, the recognition accuracy is improved by 1.03%. In addition, our method shows robust results for meshes of different levels of detail, and the rig has more dimensions of semantic space. The source code will be made available if this paper is accepted for publication.
(This article belongs to the Section Electronic Multimedia)
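The "vertex relative mean square error" quoted above (1.57 × 10⁻³) is, in generic form, a mean squared vertex error normalized by some reference scale. The exact normalization is not stated in this listing, so the sketch below, which divides by the squared bounding-box diagonal of the ground-truth mesh, is only an assumption.

```python
import numpy as np

def vertex_relative_mse(pred: np.ndarray, gt: np.ndarray) -> float:
    """MSE over Nx3 vertex positions, normalized by the squared bounding-box
    diagonal of the ground-truth mesh (assumed normalization)."""
    diag = np.linalg.norm(gt.max(axis=0) - gt.min(axis=0))
    return float(np.mean(np.sum((pred - gt) ** 2, axis=1)) / (diag ** 2))

rng = np.random.default_rng(1)
gt = rng.random((1000, 3))                        # toy mesh with 1000 vertices
pred = gt + 0.001 * rng.standard_normal((1000, 3))
print(vertex_relative_mse(pred, gt))
```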
Figures

Figure 1: Overview of our method: (a) dataset expansion, (b) structure of the proposed RigGenNet, and (c) structure of the proposed RigRecogNet.
Figure 2: Explanation of joint rotation transformations. The fixed angles of joint B are represented by the positions of sub-joints B1, B2, B3.
Figure 3: Illustration showing the selection of 3D masks; 3D meshes are projected onto the ZOX plane, then the 3D vertices, which have corresponding 2D points that are less distant from the detected 2D points than a certain value, are selected as masks.
Figure 4: Mask illustration at different levels of detail: (a) the detected 2D points [45], (b) the chosen 2D points, and (c) the 3D mask. The green parts are the adopted vertices with a mask value of 1, while the gray parts are the useless vertices with a mask value of 0.
Figure 5: The absolute vertex error visualized on the generated meshes of different datasets: (a) ground truth, (b) blendshape-based rig [43,48], (c) CTBNET [31], (d) NFR [20], (e) SketchMetaFace [16], and (f) ours.
Figure 6: The absolute vertex error visualized on the reconstructed mesh of different datasets. The reconstructed meshes are computed by inputting the recognized semantic parameters into RigGenNet. (a) Ground truth; (b) blendshape-based rig [43,48]; (c) FFNet [30]; (d) BTCNET [31]; (e) NFR [20]; (f) Shape Transformer [24]; (g) ours.
Figure 7: The absolute vertex error visualized on the generated mesh and the UV mapping of different levels of detail. The generated meshes were computed by inputting the labeled semantic parameters into RigGenNet. (a) LOD6; (b) LOD3; (c) LOD0.
Figure 8: The absolute vertex error visualized on the reconstructed mesh and the UV mapping of different levels of detail. The reconstructed meshes were computed by inputting the recognized semantic parameters into the RigGenNet. (a) Ground truth; (b) recognized semantic parameters; (c) reconstructed rig; (d) error heatmap drawn on the reconstructed mesh; (e) error heatmap drawn on the UV mapping.
Figure 9: Unsmooth phenomena in the lips: (a) true expression and (b) the expression reenacted using our method.
18 pages, 1387 KiB  
Article
KRT-FUAP: Key Regions Tuned via Flow Field for Facial Universal Adversarial Perturbation
by Xi Jin, Yong Liu, Guangling Sun, Yanli Chen, Zhicheng Dong and Hanzhou Wu
Appl. Sci. 2024, 14(12), 4973; https://doi.org/10.3390/app14124973 - 7 Jun 2024
Viewed by 736
Abstract
It has been established that convolutional neural networks are susceptible to elaborate tiny universal adversarial perturbations (UAPs) in natural image classification tasks. However, UAP attacks against face recognition systems have not been fully explored. This paper proposes a spatial perturbation method that generates UAPs with local stealthiness by learning variable flow field to fine-tune facial key regions (KRT-FUAP). We ensure that the generated adversarial perturbations are positioned within reasonable regions of the face by designing a mask specifically tailored to facial key regions. In addition, we pay special attention to improving the effectiveness of the attack while maintaining the stealthiness of the perturbation and achieve the dual optimization of aggressiveness and stealthiness by accurately controlling the balance between adversarial loss and stealthiness loss. Experiments conducted on the frameworks of IResNet50 and MobileFaceNet demonstrate that our proposed method achieves an attack performance comparable to existing natural image universal attack methods, but with significantly improved stealthiness. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
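At application time, a universal perturbation restricted to facial key regions amounts to adding one fixed perturbation through a (binary or soft) region mask under a small norm budget. The sketch below shows only that overlay step; the L-infinity budget and the mask are placeholders, and the KRT-FUAP optimization itself is not reproduced.

```python
import numpy as np

def apply_masked_uap(image: np.ndarray, uap: np.ndarray, region_mask: np.ndarray,
                     epsilon: float = 8 / 255) -> np.ndarray:
    """Overlay a universal perturbation only inside the key-region mask.
    `image` is HxWx3 in [0, 1], `uap` has the same shape, and `region_mask`
    is HxW with values in [0, 1] (1 = key region). `epsilon` is an assumed
    L-infinity budget."""
    delta = np.clip(uap, -epsilon, epsilon) * region_mask[..., None]
    return np.clip(image + delta, 0.0, 1.0)

img = np.random.rand(112, 112, 3)
uap = np.random.uniform(-1, 1, size=img.shape) * 0.05
mask = np.zeros((112, 112))
mask[30:80, 20:92] = 1.0                       # toy "key region" around eyes/nose
adv = apply_masked_uap(img, uap, mask)
```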
Figures

Figure 1: Diagram of universal adversarial perturbations against face recognition. During the normal recognition process, the feature extractor obtains clean face embedding features, which can be used to determine if it is the same face. However, during the adversarial attack process, a universal adversarial perturbation is superimposed on a set of images, causing the feature extractor to incorrectly recognize the embedding features and achieve the attack effect.
Figure 2: Architecture diagram of KRT-FUAP. First, keypoint detection is employed to acquire the positions of keypoints on the facial images. The convex hull algorithm is then utilized to obtain the key regions of these facial images, and the intersection of these key regions is taken to obtain the key regions mask tailored to the dataset. Subsequently, spatial transformation flow field and noise are initialized, and the flow field is utilized to control the spatial transformation of the key regions mask, thereby obtaining learnable key regions. Afterward, perturbation weights are adjusted based on the positions of these regions to obtain universal adversarial perturbation. The perturbation is superimposed onto clean images, and adversarial loss and stealthiness loss are computed separately using a target facial recognition model and a VGG model. The iteration continues until a certain criterion is fulfilled.
Figure 3: Visualization of noise superimposed on key and non-key regions.
Figure 4: (a,b), respectively, depict the variation trends of the face recognition accuracy tested on three backbone extraction networks using the LFW dataset and the CASIA-WebFace dataset. It can be observed from the figure that as the noise level increases, there is a certain decrease in recognition accuracy. Moreover, under the experimental condition of overlaying noise on key regions, the rate of decrease is greater.
Figure 5: Some examples for the adversarial image: (a,f,k) are clean images; (b,g,l) were generated using UAP; (c,h,m) were generated using FG-UAP; (d,i,n) were generated using FTGAP; and (e,j,o) were generated using KRT-FUAP.
Figure 6: The time required to generate universal adversarial perturbations to achieve the designated fooling rates on IResNet50 and MobileFaceNet using four different approaches.
17 pages, 6993 KiB  
Article
An Improved YOLOv5s Model for Building Detection
by Jingyi Zhao, Yifan Li, Jing Cao, Yutai Gu, Yuanze Wu, Chong Chen and Yingying Wang
Electronics 2024, 13(11), 2197; https://doi.org/10.3390/electronics13112197 - 4 Jun 2024
Viewed by 1283
Abstract
With the continuous advancement of autonomous vehicle technology, the recognition of buildings becomes increasingly crucial. It enables autonomous vehicles to better comprehend their surrounding environment, facilitating safer navigation and decision-making processes. Therefore, it is significant to improve detection efficiency on edge devices. However, building recognition faces problems such as severe occlusion and large size of detection models that cannot be deployed on edge devices. To solve these problems, a lightweight building recognition model based on YOLOv5s is proposed in this study. We first collected a building dataset from real scenes and the internet, and applied an improved GridMask data augmentation method to expand the dataset and reduce the impact of occlusion. To make the model lightweight, we pruned the model by the channel pruning method, which decreases the computational costs of the model. Furthermore, we used Mish as the activation function to help the model converge better in sparse training. Finally, comparing it to YOLOv5s (baseline), the experiments show that the improved model reduces the model size by 9.595 MB, and the mAP@0.5 reaches 82.3%. This study will offer insights into lightweight building detection, demonstrating its significance in environmental perception, monitoring, and detection, particularly in the field of autonomous driving.
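Two ingredients named in the abstract are easy to illustrate: the Mish activation, x · tanh(softplus(x)), and the common channel-pruning rule of keeping the channels whose batch-norm scale γ remains largest after sparse training. The sketch is generic and the pruning ratio is assumed; it is not the paper's training code.

```python
import numpy as np

def mish(x: np.ndarray) -> np.ndarray:
    """Mish activation: x * tanh(softplus(x))."""
    return x * np.tanh(np.log1p(np.exp(x)))

def keep_channels(bn_gammas: np.ndarray, prune_ratio: float = 0.5) -> np.ndarray:
    """Indices of channels to keep: those with the largest |gamma| after
    sparse training (a common channel-pruning criterion; ratio is assumed)."""
    k = max(1, int(round(len(bn_gammas) * (1.0 - prune_ratio))))
    return np.argsort(-np.abs(bn_gammas))[:k]

print(mish(np.array([-2.0, 0.0, 2.0])))
print(keep_channels(np.array([0.01, 0.8, 0.0005, 0.3, 0.02]), prune_ratio=0.6))
```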
Figures

Figure 1: Some examples of the dataset: (b,e) are the first type of library, (d,g) are the first type of statue, (c,i) are the second type of library, (a) is the second type of statue, and (f,h) is the gymnasium.
Figure 2: Some examples of labeled results of different buildings (coordinate information) by LabelImg.
Figure 3: The process of data augmentation.
Figure 4: The structure of YOLOv5.
Figure 5: (a) The structure of Conv. (b) The structure of C3. (c) The structure of SPP. (d) The structure of FPN-PAN.
Figure 6: The method of BN layer.
Figure 7: The process of sparse training and pruning.
Figure 8: Mish activation function.
Figure 9: The process of building detection method.
Figure 10: Gymnasium detection with and without data augmentation.
Figure 11: (a) Weights change in BN layers during sparse training process. (b) Weights in one BN layer of the model after sparse training (model.23.m.0.cv1.bn.weight).
Figure 12: Detection performance of different models in precision, recall, and mAP@0.5.
Figure 13: Comparison between different activation functions with different pruning rates in YOLOv5.
Figure 14: The process of the gradient of backpropagation in Conv module.
22 pages, 7124 KiB  
Article
ADM-SLAM: Accurate and Fast Dynamic Visual SLAM with Adaptive Feature Point Extraction, Deeplabv3pro, and Multi-View Geometry
by Xiaotao Huang, Xingbin Chen, Ning Zhang, Hongjie He and Sang Feng
Sensors 2024, 24(11), 3578; https://doi.org/10.3390/s24113578 - 2 Jun 2024
Viewed by 1544
Abstract
Visual Simultaneous Localization and Mapping (V-SLAM) plays a crucial role in the development of intelligent robotics and autonomous navigation systems. However, it still faces significant challenges in handling highly dynamic environments. The prevalent method currently used for dynamic object recognition in the environment is deep learning. However, models such as Yolov5 and Mask R-CNN require significant computational resources, which limits their potential in real-time applications due to hardware and time constraints. To overcome this limitation, this paper proposes ADM-SLAM, a visual SLAM system designed for dynamic environments that builds upon the ORB-SLAM2. This system integrates efficient adaptive feature point homogenization extraction, lightweight deep learning semantic segmentation based on an improved DeepLabv3, and multi-view geometric segmentation. It optimizes keyframe extraction, segments potential dynamic objects using contextual information with the semantic segmentation network, and detects the motion states of dynamic objects using multi-view geometric methods, thereby eliminating dynamic interference points. The results indicate that ADM-SLAM outperforms ORB-SLAM2 in dynamic environments, especially in high-dynamic scenes, where it achieves up to a 97% reduction in Absolute Trajectory Error (ATE). In various highly dynamic test sequences, ADM-SLAM outperforms DS-SLAM and DynaSLAM in terms of real-time performance and accuracy, proving its excellent adaptability. Full article
(This article belongs to the Section Navigation and Positioning)
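Absolute Trajectory Error (ATE), the metric behind the quoted 97% reduction, is typically reported as the RMSE of the translational differences between estimated and ground-truth poses after time association and alignment; the sketch below assumes the two trajectories are already associated and aligned.

```python
import numpy as np

def ate_rmse(est_xyz: np.ndarray, gt_xyz: np.ndarray) -> float:
    """RMSE of per-pose translational error for Nx3 trajectories (already aligned)."""
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

gt = np.cumsum(np.random.randn(100, 3) * 0.01, axis=0)   # toy ground-truth trajectory
est = gt + np.random.randn(100, 3) * 0.005               # noisy estimate
print(ate_rmse(est, gt))
```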
Figures

Figure 1: The schematic of the overall framework of the ADM-SLAM system.
Figure 2: The schematic of the overall structure of Deeplabv3pro.
Figure 3: Modified ASPP framework. The upper part is the improved structure of the ASPP module, and the lower part is the newly added ResidualBlock.
Figure 4: Multi-view geometry. (a) represents a keypoint identified as a static point, and (b) represents a keypoint identified as a dynamic point.
Figure 5: The schematic of the evaluation of homogeneity distribution coefficients.
Figure 6: The quadtree and the SSC are compared on different data sets. The difference is more obvious in the places selected by the red circle. (a) Quadtree; (b) SSC.
Figure 7: Segment images and extract feature points from dynamic scenes. (a) is the result after improved Deeplabv3pro segmentation, masks of different colors represent segmented objects; and (b) is the result of using SSC to extract feature points from the segmented image. Red represents dynamic feature points, green points represent static feature points.
Figure 8: Images segmented using different models; (a) is the original input image, (b) is the image segmented by deeplab v3+, (c) is the image segmented by improved deeplab v3+.
Figure 9: Trajectories of highly dynamic sequences. The black line is the ground truth, the blue line is the estimated value, and the red line is the difference. (a,c) is the result of ORB-SLAM2, (b,d) is the result of ADM-SLAM. (a,b) fr3_w_xyz; (c,d) fr3_w_half.
Figure 10: Absolute trajectory error distribution plot of fr3_w_half sequence. (a) APE of ORB-SLAM2; (b) APE of ADM-SLAM.
Figure 11: Absolute trajectory error distribution plots of fr3_w_half sequence. (a) APE of ORB-SLAM2; (b) APE of ADM-SLAM.
Figure 12: Three-dimensional space absolute trajectory error distribution diagram of fr3_w_half sequence. (a) APE of ORB-SLAM2; (b) APE of ADM-SLAM.
Figure 13: Three-dimensional space absolute trajectory error distribution plots of fr3_w_half sequence. (a) APE of ORB-SLAM2; (b) APE of ADM-SLAM.
15 pages, 6287 KiB  
Article
Research on Improved Road Visual Navigation Recognition Method Based on DeepLabV3+ in Pitaya Orchard
by Lixue Zhu, Wenqian Deng, Yingjie Lai, Xiaogeng Guo and Shiang Zhang
Agronomy 2024, 14(6), 1119; https://doi.org/10.3390/agronomy14061119 - 24 May 2024
Cited by 3 | Viewed by 987
Abstract
Traditional DeepLabV3+ image semantic segmentation methods face challenges in pitaya orchard environments characterized by multiple interference factors, complex image backgrounds, high computational complexity, and extensive memory consumption. This paper introduces an improved visual navigation path recognition method for pitaya orchards. Initially, DeepLabV3+ utilizes a lightweight MobileNetV2 as its primary feature extraction backbone, which is augmented with a Pyramid Split Attention (PSA) module placed after the Atrous Spatial Pyramid Pooling (ASPP) module. This improvement enhances the spatial feature representation of feature maps, thereby sharpening the segmentation boundaries. Additionally, an Efficient Channel Attention Network (ECANet) mechanism is integrated with the lower-level features of MobileNetV2 to reduce computational complexity and refine the clarity of target boundaries. The paper also designs a navigation path extraction algorithm, which fits the road mask regions segmented by the model to achieve precise navigation path recognition. Experimental findings show that the enhanced DeepLabV3+ model achieved a Mean Intersection over Union (MIoU) and average pixel accuracy of 95.79% and 97.81%, respectively. These figures represent increases of 0.59 and 0.41 percentage points when contrasted with the original model. Furthermore, the model’s memory consumption is reduced by 85.64%, 84.70%, and 85.06% when contrasted with the Pyramid Scene Parsing Network (PSPNet), U-Net, and Fully Convolutional Network (FCN) models, respectively. This reduction makes the proposed model more efficient while maintaining high segmentation accuracy, thus supporting enhanced operational efficiency in practical applications. The test results for navigation path recognition accuracy reveal that the angle error between the navigation centerline extracted using the least squares method and the manually fitted centerline is less than 5°. Additionally, the average deviation between the road centerlines extracted under three different lighting conditions and the actual road centerline is only 2.66 pixels, with an average image recognition time of 0.10 s. This performance suggests that the study can provide an effective reference for visual navigation in smart agriculture. Full article
(This article belongs to the Special Issue The Applications of Deep Learning in Smart Agriculture)
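The navigation-line step described above (least-squares fitting of a centerline to the segmented road mask) can be sketched as: take the midpoint of the road mask in each image row, then fit a straight line through those midpoints. The row-midpoint strategy and the function below are illustrative assumptions; the paper's exact extraction algorithm may differ.

```python
import numpy as np

def fit_centerline(road_mask: np.ndarray):
    """Fit x = a*y + b through the row-wise midpoints of a binary road mask."""
    ys, xs_mid = [], []
    for y in range(road_mask.shape[0]):
        cols = np.flatnonzero(road_mask[y])
        if cols.size:                          # only rows that contain road pixels
            ys.append(y)
            xs_mid.append(0.5 * (cols[0] + cols[-1]))
    a, b = np.polyfit(ys, xs_mid, deg=1)       # least-squares line fit
    return a, b

mask = np.zeros((100, 100), dtype=np.uint8)
for y in range(100):                           # toy road region leaning to the right
    mask[y, 40 + y // 5 : 60 + y // 5] = 1
print(fit_centerline(mask))
```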
Figures

Figure 1: Image Data.
Figure 2: Structure diagram of the improved DeepLabv3+ network.
Figure 3: Structural diagram of the SPC module.
Figure 4: Structural diagram of the SE module.
Figure 5: Structural diagram of the PSA module.
Figure 6: Structural diagram of the ECAnet module. (ECA generates channel weights by performing a fast 1D convolution of size k).
Figure 7: Flow chart of the fitted navigation lines.
Figure 8: Fitting navigation lines. (The green dot is the edge point of the split, the white line is the navigation line).
Figure 9: Comparison of orchard road segmentation. (The red area is the road).
Figure 10: The segmentation results of road scenes under different environmental conditions. (The red area is the road).
Figure 11: Navigation path accuracy test.
Figure 12: The deviation statistical curve.
19 pages, 736 KiB  
Article
Recognizing and Looking at Masked Emotional Faces in Alexithymia
by Marla Fuchs, Anette Kersting, Thomas Suslow and Charlott Maria Bodenschatz
Behav. Sci. 2024, 14(4), 343; https://doi.org/10.3390/bs14040343 - 18 Apr 2024
Cited by 1 | Viewed by 1482
Abstract
Alexithymia is a clinically relevant personality construct characterized by difficulties identifying and communicating one’s emotions and externally oriented thinking. Alexithymia has been found to be related to poor emotion decoding and diminished attention to the eyes. The present eye tracking study investigated whether high levels of alexithymia are related to impairments in recognizing emotions in masked faces and reduced attentional preference for the eyes. An emotion recognition task with happy, fearful, disgusted, and neutral faces with face masks was administered to high-alexithymic and non-alexithymic individuals. Hit rates, latencies of correct responses, and fixation duration on eyes and face mask were analyzed as a function of group and sex. Alexithymia had no effects on accuracy and speed of emotion recognition. However, alexithymic men showed less attentional preference for the eyes relative to the mask than non-alexithymic men, which was due to their increased attention to face masks. No fixation duration differences were observed between alexithymic and non-alexithymic women. Our data indicate that high levels of alexithymia might not have adverse effects on the efficiency of emotion recognition from faces wearing masks. Future research on gaze behavior during facial emotion recognition in high alexithymia should consider sex as a moderating variable. Full article
(This article belongs to the Section Psychiatric, Emotional and Behavioral Disorders)
Figures

Figure 1: Examples of facial stimuli administered in the emotion recognition task: (A) happy expression, (B) neutral expression, (C) disgusted expression, and (D) fearful expression. The original images were taken from the MPI FACES database [86]. The depicted face is 066_y_m.
Figure 2: Areas of interest in our eye tracking experiment: eyes (white coloring) and face mask (blue coloring). The depicted face shows model 066_y_m from the MPI FACES database [86].
Figure 3: Overall eyes-to-mask gaze ratio as a function of alexithymia and sex (means with standard errors).
17 pages, 7113 KiB  
Article
Effect of Face Masks on Automatic Speech Recognition Accuracy for Mandarin
by Xiaoya Li, Ke Ni and Yu Huang
Appl. Sci. 2024, 14(8), 3273; https://doi.org/10.3390/app14083273 - 12 Apr 2024
Cited by 1 | Viewed by 800
Abstract
Automatic speech recognition (ASR) has been widely used to realize daily human–machine interactions. Face masks have become everyday wear in our post-pandemic life, and speech through masks may have impaired the ASR. This study explored the effects of different kinds of face masks (e.g., surgical mask, KN95 mask, and cloth mask) on the Mandarin word accuracy of two ASR systems with or without noises. A mouth simulator was used to play speech audio with or without wearing a mask. Acoustic signals were recorded at distances of 0.2 m and 0.6 m. Recordings were mixed with two noises at a signal-to-noise ratio of +3 dB: restaurant noise and speech-shaped noise. Results showed that masks did not affect ASR accuracy without noise. Under noises, masks did not significantly influence ASR accuracy at 0.2 m but had significant effects at 0.6 m. The activated-carbon mask had the most significant impact on ASR accuracy at 0.6 m, reducing the accuracy by 18.5 percentage points compared to that without a mask, whereas the cloth mask had the least effect on ASR accuracy at 0.6 m, reducing the accuracy by 0.9 percentage points. The acoustic attenuation of masks in the high-frequency band at around 3.15 kHz of the speech signal contributed to the effects of masks on ASR accuracy. When training ASR models, it may be important to consider mask robustness.
(This article belongs to the Special Issue Signal Acquisition and Processing for Measurement and Testing)
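Mixing recordings with noise at a fixed signal-to-noise ratio (+3 dB here) is a standard preparation step; a minimal sketch follows, assuming the noise is the component being scaled and RMS is the level measure, which are common choices rather than details confirmed by the abstract.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float = 3.0) -> np.ndarray:
    """Scale `noise` so that 20*log10(rms(speech)/rms(noise)) == snr_db, then add."""
    noise = noise[: len(speech)]                   # trim noise to the speech length
    rms_s = np.sqrt(np.mean(speech ** 2))
    rms_n = np.sqrt(np.mean(noise ** 2))
    gain = rms_s / (rms_n * 10 ** (snr_db / 20.0))
    return speech + gain * noise

fs = 16000
speech = np.random.randn(fs * 2) * 0.1             # stand-ins for real recordings
noise = np.random.randn(fs * 2) * 0.3
mixed = mix_at_snr(speech, noise, snr_db=3.0)
```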
Figures

Figure 1: Connection set-up of apparatus and 7 mask conditions. (a) Connection set-up of apparatus; (b) No mask (M0); (c) Surgical mask (M1); (d) Activated-carbon mask (M2); (e) Hanging-ear medical protective mask (M3); (f) Headwear medical protective mask (M4); (g) Anti-particulate mask (M5); and (h) Cloth mask (M6).
Figure 2: The research procedure.
Figure 3: The word accuracy (ACC) values for all speakers at a recording distance of 0.2 m. (a) ASR_D, no noise; (b) ASR_T, no noise; (c) ASR_D, restaurant noise; (d) ASR_T, restaurant noise; (e) ASR_D, speech-shaped noise; and (f) ASR_T, speech-shaped noise.
Figure 4: The ACC for all speakers at a recording distance of 0.6 m. (a) ASR_D, no noise; (b) ASR_T, no noise; (c) ASR_D, restaurant noise; (d) ASR_T, restaurant noise; (e) ASR_D, speech-shaped noise; and (f) ASR_T, speech-shaped noise. Kruskal–Wallis tests and post-pairwise comparisons: *, p < 0.05; **, p < 0.01; and ***, p < 0.001.
Figure 5: The ACC for male speakers at a recording distance of 0.6 m. (a) ASR_D, no noise; (b) ASR_T, no noise; (c) ASR_D, restaurant noise; (d) ASR_T, restaurant noise; (e) ASR_D, speech-shaped noise; and (f) ASR_T, speech-shaped noise. Kruskal–Wallis tests and post-pairwise comparisons: *, p < 0.05; **, p < 0.01; and ***, p < 0.001.
Figure 6: The ACC for female speakers at a recording distance of 0.6 m. (a) ASR_D, no noise; (b) ASR_T, no noise; (c) ASR_D, restaurant noise; (d) ASR_T, restaurant noise; (e) ASR_D, speech-shaped noise; (f) ASR_T, speech-shaped noise. Kruskal–Wallis tests and post-pairwise comparisons: *, p < 0.05; **, p < 0.01.
Figure 7: The sound-transmission loss of masks.
Figure 8: The average spectra of the restaurant noise, the speech-shaped noise, and the speech material. The amplitude of the spectra was normalized.
20 pages, 6478 KiB  
Article
CSINet: Channel–Spatial Fusion Networks for Asymmetric Facial Expression Recognition
by Yan Cheng and Defeng Kong
Symmetry 2024, 16(4), 471; https://doi.org/10.3390/sym16040471 - 12 Apr 2024
Cited by 1 | Viewed by 1168
Abstract
Occlusion or posture change of the face in natural scenes has typical asymmetry; however, an asymmetric face plays a key part in the lack of information available for facial expression recognition. To solve the problem of low accuracy of asymmetric facial expression recognition, this paper proposes a fusion of channel global features and a spatial local information expression recognition network called the “Channel–Spatial Integration Network” (CSINet). First, to extract the underlying detail information and deepen the network, the attention residual module with a redundant information filtering function is designed, and the backbone feature-extraction network is constituted by module stacking. Second, considering the loss of information in the local key area of face occlusion, the channel–spatial fusion structure is constructed, and the channel features and spatial features are combined to enhance the accuracy of occluded facial recognition. Finally, before the full connection layer, more local spatial information is embedded into the global channel information to capture the relationship between different channel–spatial targets, which improves the accuracy of feature expression. Experimental results on the natural scene facial expression data sets RAF-DB and FERPlus show that the recognition accuracies of the modeling approach proposed in this paper are 89.67% and 90.83%, which are 13.24% and 11.52% higher than that of the baseline network ResNet50, respectively. Compared with the latest facial expression recognition methods such as CVT, PACVT, etc., the method in this paper obtains better evaluation results of masked facial expression recognition, which provides certain theoretical and technical references for daily facial emotion analysis and human–computer interaction applications. Full article
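As a rough illustration of the channel-plus-spatial idea described above, the block below applies squeeze-and-excitation style channel weighting followed by a spatial attention map. It is a generic stand-in with assumed layer sizes and fusion order, not the CSINet definition.

```python
import torch
import torch.nn as nn

class ChannelSpatialFusion(nn.Module):
    """Generic channel attention followed by spatial attention; an
    illustrative stand-in for a channel-spatial fusion block."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w_c = self.channel_fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)   # channel weights
        x = x * w_c
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)     # avg + max maps
        return x * self.spatial_conv(pooled)                         # spatial weights

feat = torch.randn(2, 64, 14, 14)
print(ChannelSpatialFusion(64)(feat).shape)   # torch.Size([2, 64, 14, 14])
```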
Figures

Figure 1: Channel–spatial fusion network structure.
Figure 2: Basic feature-extraction module.
Figure 3: Attention residual module.
Figure 4: Channel–spatial feature fusion structure.
Figure 5: Local global feature coordination enhancement module.
Figure 6: Sample expressions for different types of data sets.
Figure 7: Visualization of different network models for facial expression recognition.
Figure 8: CSINet confusion matrix on the RAF-DB data.
Figure 9: Face images under different occlusion methods.
Figure 10: Real-time natural environment facial expression test image.
Figure 11: Facial expression recognition accuracy of CSINet and AMP-Net in the real-time natural environment.
16 pages, 2880 KiB  
Article
Customizable Presentation Attack Detection for Improved Resilience of Biometric Applications Using Near-Infrared Skin Detection
by Tobias Scheer, Markus Rohde, Ralph Breithaupt, Norbert Jung and Robert Lange
Sensors 2024, 24(8), 2389; https://doi.org/10.3390/s24082389 - 9 Apr 2024
Viewed by 970
Abstract
Due to their user-friendliness and reliability, biometric systems have taken a central role in everyday digital identity management for all kinds of private, financial and governmental applications with increasing security requirements. A central security aspect of unsupervised biometric authentication systems is the presentation attack detection (PAD) mechanism, which defines the robustness to fake or altered biometric features. Artifacts like photos, artificial fingers, face masks and fake iris contact lenses are a general security threat for all biometric modalities. The Biometric Evaluation Center of the Institute of Safety and Security Research (ISF) at the University of Applied Sciences Bonn-Rhein-Sieg has specialized in the development of a near-infrared (NIR)-based contact-less detection technology that can distinguish between human skin and most artifact materials. This technology is highly adaptable and has already been successfully integrated into fingerprint scanners, face recognition devices and hand vein scanners. In this work, we introduce a cutting-edge, miniaturized near-infrared presentation attack detection (NIR-PAD) device. It includes an innovative signal processing chain and an integrated distance measurement feature to boost both reliability and resilience. We detail the device’s modular configuration and conceptual decisions, highlighting its suitability as a versatile platform for sensor fusion and seamless integration into future biometric systems. This paper elucidates the technological foundations and conceptual framework of the NIR-PAD reference platform, alongside an exploration of its potential applications and prospective enhancements. Full article
(This article belongs to the Section Optical Sensors)
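The detection principle rests on the characteristic NIR remission of human skin at a few wavelengths (cf. Figure 1 and Figure 9 below). One common way to turn two remission readings into a material score is a normalized difference checked against a calibrated band; the wavelength pair and band limits in this sketch are placeholders, not the device's calibration.

```python
def normalized_difference(remission_a: float, remission_b: float) -> float:
    """(a - b) / (a + b): contrast between remission at two NIR wavelengths."""
    return (remission_a - remission_b) / (remission_a + remission_b + 1e-9)

def looks_like_skin(remission_a: float, remission_b: float,
                    lo: float = 0.15, hi: float = 0.55) -> bool:
    """Accept if the normalized difference falls inside an assumed skin band."""
    nd = normalized_difference(remission_a, remission_b)
    return lo <= nd <= hi

print(looks_like_skin(0.62, 0.30))   # skin-like spectral contrast -> True
print(looks_like_skin(0.40, 0.41))   # flat spectrum (e.g., silicone) -> False
```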
Figures

Figure 1: Remission spectra of human skin and exemplary artifact materials.
Figure 2: Multispectral sensor working principle.
Figure 3: Simplified multispectral sensor implementation schema.
Figure 4: Images of the PCB stack: (a) CAD. (b) Photo.
Figure 5: Analog filter chain.
Figure 6: The optical shielding of the proposed sensor.
Figure 7: Sensor images: (a) Closed Variant Front (CAD). (b) Focused Variant (Photo). (c,d) Closed Variant Front and Back (Photo).
Figure 8: LED remission/NIR-protection transmission spectra.
Figure 9: Normalized differences of human skin and exemplary artifact materials. (a) Fixed distance at 20 cm. (b) Full range from 5 cm to 30 cm.
17 pages, 10960 KiB  
Article
Deep Learning and YOLOv8 Utilized in an Accurate Face Mask Detection System
by Christine Dewi, Danny Manongga, Hendry, Evangs Mailoa and Kristoko Dwi Hartomo
Big Data Cogn. Comput. 2024, 8(1), 9; https://doi.org/10.3390/bdcc8010009 - 16 Jan 2024
Cited by 6 | Viewed by 4738
Abstract
Face mask detection is a technological application that employs computer vision methodologies to ascertain the presence or absence of a face mask on an individual depicted in an image or video. This technology gained significant attention and adoption during the COVID-19 pandemic, as wearing face masks became an important measure to prevent the spread of the virus. Face mask detection helps to enforce mask-wearing guidelines, which can significantly reduce the spread of respiratory illnesses, including COVID-19. Wearing masks in densely populated areas provides individuals with protection and hinders the spread of airborne particles that transmit viruses. The application of deep learning models in object recognition has shown significant progress, leading to promising outcomes in the identification and localization of objects within images. The primary aim of this study is to annotate and classify face mask entities depicted in authentic images. To mitigate the spread of COVID-19 within public settings, individuals can employ the use of face masks created from materials specifically designed for medical purposes. This study utilizes YOLOv8, a state-of-the-art object detection algorithm, to accurately detect and identify face masks. To analyze this study, we conducted an experiment in which we combined the Face Mask Dataset (FMD) and the Medical Mask Dataset (MMD) into a single dataset. The detection performance of an earlier research study using the FMD and MMD was improved by the suggested model to a “Good” level of 99.1%, up from 98.6%. Our study demonstrates that the model scheme we have provided is a reliable method for detecting faces that are obscured by medical masks. Additionally, after the completion of the study, a comparative analysis was conducted to examine the findings in conjunction with those of related research. The proposed detector demonstrated superior performance compared to previous research in terms of both accuracy and precision. Full article
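YOLOv8 is distributed through the ultralytics Python package; a minimal fine-tune-and-predict sketch is shown below. The dataset YAML, epoch count, and the class names implied by the "Good"/"None" labels are assumptions for a combined FMD/MMD setup, not the authors' released configuration.

```python
# pip install ultralytics  (YOLOv8 reference implementation)
from ultralytics import YOLO

# Start from a pretrained nano checkpoint and fine-tune on a face-mask dataset
# described by a YAML file (path and class list are hypothetical).
model = YOLO("yolov8n.pt")
model.train(data="face_mask.yaml", epochs=100, imgsz=640)

# Run inference on a test image and print the detected mask-wearing classes.
results = model("test_face.jpg")
for box in results[0].boxes:
    print(results[0].names[int(box.cls)], float(box.conf))
```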
Show Figures

Figure 1

Figure 1
<p>System architecture of YOLOv8.</p>
Full article ">Figure 2
<p>Sample images in the experimental Face Mask Dataset (FMD), and Medical Mask Dataset (MMD).</p>
Full article ">Figure 3
<p>The combination of MMD and FMD datasets.</p>
Full article ">Figure 4
<p>Training process for the (<b>a</b>) test batch with 0 labels and (<b>b</b>) the test batch with 0 predictions. The human features depicted in the figures were obtained from publicly available datasets (FMD and MMD).</p>
Figure 4 Cont.">
Full article ">Figure 5
<p>Training performance using (<b>a</b>) YOLOv8n, (<b>b</b>) YOLOv8s, and (<b>c</b>) YOLOv8m.</p>
Figure 5 Cont.">
Full article ">Figure 6
<p>Recognition results using YOLOv8n: (<b>a</b>) the “good” class and (<b>b</b>) the “good” and “none” classes.</p>
Figure 6 Cont.">
Full article ">
23 pages, 22773 KiB  
Article
Anti-Software Attack Ear Identification System Using Deep Feature Learning and Blockchain Protection
by Xuebin Xu, Yibiao Liu, Chenguang Liu and Longbin Lu
Symmetry 2024, 16(1), 85; https://doi.org/10.3390/sym16010085 - 9 Jan 2024
Viewed by 1179
Abstract
Ear recognition has made good progress as an emerging biometric technology. However, ear recognition systems based on hand-crafted features suffer from relatively poor recognition performance, generalization ability, and feature robustness. Deep learning has partly overcome these problems, yet the recognition performance of existing systems still needs to improve on unconstrained ear databases collected in realistic scenarios. Another critical problem is that most systems with ear feature template databases are vulnerable to software attacks that disclose users’ privacy and can even bring down the system. To address these issues, this paper proposes a software-attack-proof ear recognition system that combines deep feature learning with blockchain protection. First, we propose an accommodative DropBlock (AccDrop) that generates drop masks with adaptive shapes, giving it an advantage over DropBlock on unconstrained ear databases. Second, we introduce a simple, parameterless attention module that uses 3D weights to refine the ear features output by the convolutional layers. To protect the ear feature template database and the user’s privacy, we store the ear feature templates in Merkle tree nodes, ensuring the determinism of the root node in the smart contract. We achieve Rank-1 (R1) recognition accuracies of 83.87% and 96.52% on the AWE and EARVN1.0 ear databases, respectively, outperforming most advanced ear recognition systems. Full article
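The blockchain component summarized here stores ear feature templates in Merkle tree nodes so that the root anchored in a smart contract exposes any tampering with the template database. The snippet below is a minimal sketch of that idea in plain Python with SHA-256; the leaf encoding, tree construction rules, and on-chain interface are assumptions rather than the authors' implementation.

```python
# Minimal Merkle-tree sketch for tamper-evident template storage (illustrative assumptions only).
import hashlib
from typing import List

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def leaf_hash(template: List[float]) -> bytes:
    # Deterministically serialize a feature vector before hashing (encoding is an assumption).
    return _h(",".join(f"{v:.6f}" for v in template).encode())

def merkle_root(leaves: List[bytes]) -> bytes:
    if not leaves:
        return _h(b"")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Enrollment: hash every stored template and anchor the root (e.g., in a smart contract).
templates = {"subject_01": [0.12, -0.98, 0.33], "subject_02": [0.45, 0.07, -0.61]}
anchored_root = merkle_root([leaf_hash(t) for t in templates.values()])

# Before each identification run, recompute the root; a mismatch means the database was modified.
assert merkle_root([leaf_hash(t) for t in templates.values()]) == anchored_root
```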
Show Figures

Figure 1
<p>Proposed anti-software attack ear recognition system.</p>
Full article ">Figure 2
<p>Architectural details of the proposed deep-feature-learning-based ear feature extraction model.</p>
Full article ">Figure 3
<p>(<b>a</b>) Image of the input network; (<b>b</b>) Dropout; (<b>c</b>) DropBlock; and (<b>d</b>) the proposed AccDrop. The regions marked with purple squares are activation units carrying semantic information, and black pentagram markers represent dropped units. Elements at adjacent locations in the ear feature map share semantic information spatially, so elements adjacent to a dropped activation unit still retain the semantic information at that location, and standard dropout therefore tends to ignore spatial features. DropBlock is a structured regularization method that drops units in contiguous regions of the feature map together; however, this can hurt the network’s feature learning ability, since some meaningful feature information is lost. The proposed AccDrop generates drop masks with adaptive shapes, which makes the model pay more attention to spatial information and allows it to learn discriminative ear features effectively on unconstrained ear databases collected in realistic scenarios.</p>
Full article ">Figure 4
<p>(<b>a</b>) We sample the mask <math display="inline"><semantics> <mi>M</mi> </semantics></math> on each feature map, in an operation similar to DropBlock. (<b>b</b>) The sampled elements are expanded into square blocks of size block_size × block_size. (<b>c</b>) The elements in the top z-th percentile of each square block are dropped.</p>
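Read together with Figure 3, this caption outlines AccDrop in enough detail to sketch it: sample seed positions as DropBlock does, expand each seed into a square block, and then drop only the strongest activations inside the blocked region so the dropped area takes an adaptive shape. The PyTorch code below is a best-guess reconstruction from these captions; the seed rate, per-map quantile rule, and rescaling are assumptions, not the authors' released implementation.

```python
# Hedged PyTorch reconstruction of AccDrop based only on the Figure 3/4 captions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AccDrop(nn.Module):
    def __init__(self, drop_prob: float = 0.1, block_size: int = 5, z: float = 0.5):
        super().__init__()
        self.drop_prob = drop_prob    # DropBlock-style drop probability used to sample seeds
        self.block_size = block_size  # side length of the expanded square block
        self.z = z                    # fraction of the strongest activations dropped inside each block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.drop_prob == 0.0:
            return x
        n, c, h, w = x.shape
        # (a) Sample seed positions on each feature map, as in DropBlock.
        gamma = self.drop_prob / (self.block_size ** 2)
        seeds = (torch.rand_like(x) < gamma).float()
        # (b) Expand every seed into a block_size x block_size square region.
        region = F.max_pool2d(seeds, kernel_size=self.block_size, stride=1,
                              padding=self.block_size // 2)[..., :h, :w] > 0
        # (c) Inside the blocked regions, drop only the top-z fraction of activations,
        #     so the dropped area takes an adaptive shape (per-map quantile is an assumption).
        vals = x.abs()
        masked = torch.where(region, vals, torch.full_like(vals, float("nan")))
        thresh = torch.nanquantile(masked.flatten(2), 1.0 - self.z, dim=2, keepdim=True)
        drop = region & (vals >= thresh.unsqueeze(-1))
        keep = (~drop).float()
        # Rescale so the expected activation magnitude is preserved, as in Dropout/DropBlock.
        return x * keep * keep.numel() / keep.sum().clamp(min=1.0)

# Usage: inserted into the backbone during training in place of DropBlock.
out = AccDrop(drop_prob=0.1, block_size=5, z=0.5).train()(torch.randn(2, 64, 32, 32))
```

In this sketch, z = 1.0 drops every element of each block, which recovers DropBlock-like behavior; this matches the role of the z parameter swept in Figures 12 and 13.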
Full article ">Figure 5
<p>This attention mechanism directly estimates 3D weights to further refine the ear features. The same color indicates that a single scalar is used for each point on the ear feature. X is the input feature, H is the height of the input data, C is the number of channels, and W is the width of the input data.</p>
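The description of a parameter-free module that assigns full 3D (per-channel, per-position) weights, together with the λ sweep in Figures 15 and 16, is consistent with a SimAM-style energy function; the sketch below follows that reading and is an assumption, not a statement of the paper's exact module.

```python
# SimAM-style, parameter-free 3D attention (a plausible reading of the caption, not
# necessarily the paper's exact module). Each position in each channel receives its own
# scalar weight; e_lambda plays the role of the λ that Figures 15-16 appear to sweep.
import torch
import torch.nn as nn

class Simam3DAttention(nn.Module):
    def __init__(self, e_lambda: float = 1e-4):
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) ear feature map output by a convolutional layer
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)    # squared deviation per position
        v = d.sum(dim=(2, 3), keepdim=True) / n              # channel-wise variance estimate
        energy_inv = d / (4 * (v + self.e_lambda)) + 0.5     # inverse-energy score per position
        return x * torch.sigmoid(energy_inv)                 # refine features with 3D weights

# Usage: placed after a backbone convolution block.
refined = Simam3DAttention(e_lambda=1e-4)(torch.randn(2, 256, 14, 14))
```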
Full article ">Figure 6
<p>Ear images of three subjects, randomly selected from the AWE and EARVN1.0 ear databases, with ten images shown per subject. These ear images exhibit large variations in angle, resolution, and brightness and are frequently occluded by jewelry and hair.</p>
Full article ">Figure 7
<p>Data augmentation applied to the two ear databases; the image processing techniques used were vertical cropping, brightness increase, Gaussian blur, Gaussian noise, horizontal flipping, vertical flipping, rotation, contrast-limited adaptive histogram equalization (CLAHE), and color histogram equalization.</p>
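The augmentation list in this caption can be expressed with an off-the-shelf augmentation library. The sketch below uses Albumentations as an assumed tool (the paper does not name its tooling), and all probabilities, limits, crop sizes, and the file name are placeholder values.

```python
# Illustrative Albumentations pipeline covering the augmentations listed in the caption.
# Library choice and every parameter value are assumptions for demonstration only.
import albumentations as A
import cv2

augment = A.Compose([
    A.RandomCrop(height=200, width=160, p=0.3),            # stands in for "vertical crop"; must not exceed image size
    A.RandomBrightnessContrast(brightness_limit=0.2,       # brightness increase
                               contrast_limit=0.0, p=0.3),
    A.GaussianBlur(blur_limit=(3, 7), p=0.3),
    A.GaussNoise(p=0.3),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.2),
    A.Rotate(limit=25, p=0.3),
    A.CLAHE(p=0.3),                                         # contrast-limited adaptive histogram equalization
    A.Equalize(p=0.3),                                      # color/histogram equalization
])

image = cv2.cvtColor(cv2.imread("ear_0001.png"), cv2.COLOR_BGR2RGB)   # hypothetical file name
augmented = augment(image=image)["image"]
```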
Full article ">Figure 8
<p>Relationship between R1, R5 recognition rate, and block_size of the AWE database.</p>
Full article ">Figure 9
<p>Relationship between R1, R5 recognition rate, and block_size of the EARVN1.0 database.</p>
Full article ">Figure 10
<p>Relationship between R1, R5 recognition rate, and drop_prob of the AWE database.</p>
Full article ">Figure 11
<p>Relationship between R1, R5 recognition rate, and drop_prob of the EARVN1.0 database.</p>
Full article ">Figure 12
<p>Relationship between R1, R5 recognition rate, and z of the AWE database.</p>
Full article ">Figure 13
<p>Relationship between R1, R5 recognition rate, and z of the EARVN1.0 database.</p>
Full article ">Figure 14
<p>The CMC curves compare the effects of different regularization methods on the recognition performance of the system.</p>
Full article ">Figure 15
<p>Relationship between R1, R5 recognition rate, and λ of the AWE database.</p>
Full article ">Figure 16
<p>Relationship between R1, R5 recognition rate, and λ of the EARVN1.0 database.</p>
Full article ">Figure 17
<p>CMC curves comparing the identification performance of different models in ablation experiments.</p>
Full article ">Figure 18
<p>CMC curves of the AWE and EARVN1.0 databases before and after template tampering with two ear identification systems (divided into whether they are protected by blockchain or not).</p>
Full article ">Figure 19
<p>GradCAM visually interprets the ear category distinctions of different models on the AWE database to localize the region of interest and further help us understand the predictions made by different models. The visualization results of different models are shown in each subplot from left to right as the original image, EARNet model, Proposed model, Attacked model, and Proposed (secured) model. (<b>a</b>) EARNet only focuses on the middle part of the ear contour and ignores the top and bottom parts of the ear contour. Attacked only focuses on the earlobe and stud, which leads to incorrect predictions. (<b>b</b>) EARNet does not pay enough attention to the ear contour features in the upper part. Attacked focuses on earlobes and ear ornaments. (<b>c</b>) EARNet ignores the bulk ear contour features and only focuses on local ear features. Attacked focuses on local hair interference features. (<b>d</b>) Both EARNet and Attacked focus on hair interference features. (<b>e</b>) EARNet pays excessive attention to hair interference features. Attacked pays attention to hair and facial features. (<b>f</b>) EARNet focuses only on earplug and earlobe features. Attacked ignores the ear features in the middle and lower parts.</p>
Full article ">Figure 20
<p>GradCAM visually interprets the ear category distinction of the different models on the EARVN1.0 database to localize the region of interest and further help us understand the predictions made by the different models. The visualization results of different models are shown in each subplot from left to right as the original image, EARNet model, Proposed model, Attacked model, and Proposed (secured) model. (<b>a</b>) EARNet focuses only on local ear contour features. Attacked focuses on invalid features such as earpieces and backgrounds. (<b>b</b>) EARNet focuses on earphones and face features. Attacked only focuses on background interference features. (<b>c</b>) EARNet is disturbed by earphone features. Attacked is disturbed by background features. (<b>d</b>) EARNet ignores the upper and lower ear contour features. Attacked is disturbed by features such as hair as well as background. (<b>e</b>) Both EARNet and Attacked are disturbed by invalid features such as the background. EARNet even ignores the upper and lower parts of ear features. (<b>f</b>) Both EARNet and Attacked ignore some ear features.</p>
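Overlays like those in Figures 19 and 20 can be generated for any of the compared models with the open-source pytorch-grad-cam package. The sketch below uses a generic ResNet-50 backbone and placeholder paths, since the paper's EARNet weights and exact visualization settings are not available.

```python
# Hedged sketch of producing Grad-CAM overlays similar to Figures 19-20 with pytorch-grad-cam.
# The backbone, target layer, class index, and file paths are placeholders.
import numpy as np
import torch
from torchvision import models, transforms
from PIL import Image
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

model = models.resnet50(weights="IMAGENET1K_V1").eval()   # stand-in backbone, not the paper's EARNet
target_layers = [model.layer4[-1]]                        # last conv block is a common choice

rgb = np.asarray(Image.open("ear_sample.png").convert("RGB").resize((224, 224))) / 255.0
# ImageNet normalization omitted for brevity; add it when using a real classifier.
input_tensor = transforms.functional.to_tensor(rgb.astype(np.float32)).unsqueeze(0)

cam = GradCAM(model=model, target_layers=target_layers)
grayscale_cam = cam(input_tensor=input_tensor,
                    targets=[ClassifierOutputTarget(0)])   # 0 = hypothetical subject class index
overlay = show_cam_on_image(rgb.astype(np.float32), grayscale_cam[0], use_rgb=True)
Image.fromarray(overlay).save("ear_sample_cam.png")
```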
Full article ">