Search Results (22)

Search Parameters:
Keywords = multiscale key frames

21 pages, 4811 KiB  
Article
YOLO-AMM: A Real-Time Classroom Behavior Detection Algorithm Based on Multi-Dimensional Feature Optimization
by Yi Cao, Qian Cao, Chengshan Qian and Deji Chen
Sensors 2025, 25(4), 1142; https://doi.org/10.3390/s25041142 - 13 Feb 2025
Viewed by 322
Abstract
Classroom behavior detection is a key task in constructing intelligent educational environments. However, the existing models are still deficient in detail feature capture capability, multi-layer feature correlation, and multi-scale target adaptability, making it challenging to realize high-precision real-time detection in complex scenes. This paper proposes an improved classroom behavior detection algorithm, YOLO-AMM, to solve these problems. Firstly, we constructed the Adaptive Efficient Feature Fusion (AEFF) module to enhance the fusion of semantic information between different features and improve the model’s ability to capture detailed features. Then, we designed a Multi-dimensional Feature Flow Network (MFFN), which fuses multi-dimensional features and enhances the correlation information between features through the multi-scale feature aggregation module and contextual information diffusion mechanism. Finally, we proposed a Multi-Scale Perception and Fusion Detection Head (MSPF-Head), which significantly improves the adaptability of the head to different scale targets by introducing multi-scale feature perception, feature interaction, and fusion mechanisms. The experimental results showed that compared with the YOLOv8n model, YOLO-AMM improved the mAP0.5 and mAP0.5-0.95 by 3.1% and 4.0%, significantly improving the detection accuracy. Meanwhile, YOLO-AMM increased the detection speed (FPS) by 12.9 frames per second to 169.1 frames per second, which meets the requirement for real-time detection of classroom behavior. Full article
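The AEFF, MFFN, and MSPF-Head modules described above are specific to this paper and are not reproduced here. As a rough, hedged illustration of the general idea of adaptively fusing a detail branch with a semantic branch, the following PyTorch sketch learns a per-channel gate between two same-shaped feature maps; the module name and structure are assumptions for illustration only, not the authors' design.

```python
import torch
import torch.nn as nn

class AdaptiveFusionGate(nn.Module):
    """Toy adaptive fusion: learn a per-channel gate that blends two
    same-shaped feature maps (e.g., a shallow/detail branch and a
    deep/semantic branch). Illustrative only, not the paper's AEFF."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # global context per channel
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),                                 # gate values in (0, 1)
        )

    def forward(self, detail: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([detail, semantic], dim=1))
        return g * detail + (1.0 - g) * semantic          # per-channel blend

if __name__ == "__main__":
    fuse = AdaptiveFusionGate(channels=64)
    a = torch.randn(1, 64, 40, 40)
    b = torch.randn(1, 64, 40, 40)
    print(fuse(a, b).shape)  # torch.Size([1, 64, 40, 40])
```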
(This article belongs to the Special Issue Sensor-Based Behavioral Biometrics)
Figures:
Figure 1: Network structure diagram of YOLOv8.
Figure 2: Comparison of the feature fusion network structure before and after optimization: (a) YOLOv8; (b) YOLO-AMM.
Figure 3: Network structure diagram of YOLO-AMM.
Figure 4: AEFF structure diagram.
Figure 5: LDConv structure diagram.
Figure 6: ELA structure diagram.
Figure 7: Comparison of feature fusion network structures: (a) original feature fusion network structure; (b) MFFN structure.
Figure 8: Internal structure of MSFA.
Figure 9: MSPF-Head structure diagram.
Figure 10: Model training result curves: (a) precision curve; (b) recall curve; (c) mAP50 curve; (d) mAP50-95 curve.
Figure 11: Comparison of mAP values before and after improvement: (a) mAP50 curves; (b) mAP50-95 curves.
Figure 12: Comparison of sparse behavior detection based on the number of students: (a) student behaviors detected by YOLOv8 in a dense scene; (b) heat map generated by YOLOv8, highlighting the detected areas; (c) detection results of YOLO-AMM in the same classroom environment; (d) heat map generated by YOLO-AMM. The heat map colors range from blue (weak features) to red (strong features), reflecting the intensity of the features detected by the model.
Figure 13: Comparison of intensive behavior detection based on the number of students: (a) student behaviors detected by YOLOv8 in a dense scene; (b) heat map generated by YOLOv8, highlighting the detected areas; (c) detection results of YOLO-AMM in the same classroom environment; (d) heat map generated by YOLO-AMM. The heat map colors range from blue (weak features) to red (strong features), reflecting the intensity of the features detected by the model.
23 pages, 4874 KiB  
Article
Cross-Modal Transformer-Based Streaming Dense Video Captioning with Neural ODE Temporal Localization
by Shakhnoza Muksimova, Sabina Umirzakova, Murodjon Sultanov and Young Im Cho
Sensors 2025, 25(3), 707; https://doi.org/10.3390/s25030707 - 24 Jan 2025
Viewed by 819
Abstract
Dense video captioning is a critical task in video understanding, requiring precise temporal localization of events and the generation of detailed, contextually rich descriptions. However, the current state-of-the-art (SOTA) models face significant challenges in event boundary detection, contextual understanding, and real-time processing, limiting their applicability to complex, multi-event videos. In this paper, we introduce CMSTR-ODE, a novel Cross-Modal Streaming Transformer with Neural ODE Temporal Localization framework for dense video captioning. Our model incorporates three key innovations: (1) Neural ODE-based Temporal Localization for continuous and efficient event boundary prediction, improving the accuracy of temporal segmentation; (2) cross-modal memory retrieval, which enriches video features with external textual knowledge, enabling more context-aware and descriptive captioning; and (3) a Streaming Multi-Scale Transformer Decoder that generates captions in real time, handling objects and events of varying scales. We evaluate CMSTR-ODE on the YouCook2, Flickr30k, and ActivityNet Captions benchmark datasets, where it achieves SOTA performance, significantly outperforming existing models in terms of CIDEr, BLEU-4, and ROUGE scores. Our model also demonstrates superior computational efficiency, processing videos at 15 frames per second, making it suitable for real-time applications such as video surveillance and live video captioning. Ablation studies highlight the contributions of each component, confirming the effectiveness of our approach. By addressing the limitations of current methods, CMSTR-ODE sets a new benchmark for dense video captioning, offering a robust and scalable solution for both real-time and long-form video understanding tasks. Full article
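The exact Neural ODE temporal-localization formulation is not given in this listing. The sketch below only illustrates the generic Neural ODE idea that the abstract names: a learned derivative network integrated with a fixed-step Euler solver to evolve a temporal state before scoring event boundaries. The network shapes, step count, and boundary head are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """Learned derivative dh/dt for a temporal state vector h(t)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, t: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)

def odeint_euler(func: nn.Module, h0: torch.Tensor, t0: float, t1: float, steps: int = 20) -> torch.Tensor:
    """Fixed-step Euler integration of dh/dt = func(t, h) from t0 to t1."""
    h, t = h0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        h = h + dt * func(torch.tensor(t), h)
        t += dt
    return h

if __name__ == "__main__":
    func = ODEFunc(dim=32)
    h0 = torch.randn(4, 32)                    # e.g., pooled clip features (hypothetical)
    h1 = odeint_euler(func, h0, 0.0, 1.0)
    boundary_logit = nn.Linear(32, 1)(h1)      # score "is there an event boundary here?"
    print(boundary_logit.shape)                # torch.Size([4, 1])
```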
(This article belongs to the Section Sensing and Imaging)
Figures:
Figure 1: Architecture of a cross-modal neural network for event detection and captioning.
Figure 2: A variety of dynamic scenes featuring people, animals, and outdoor activities.
27 pages, 19274 KiB  
Article
Enhancing Underwater Video from Consecutive Frames While Preserving Temporal Consistency
by Kai Hu, Yuancheng Meng, Zichen Liao, Lei Tang and Xiaoling Ye
J. Mar. Sci. Eng. 2025, 13(1), 127; https://doi.org/10.3390/jmse13010127 - 12 Jan 2025
Viewed by 753
Abstract
Current methods for underwater image enhancement primarily focus on single-frame processing. While these approaches achieve impressive results for static images, they often fail to maintain temporal coherence across frames in underwater videos, which leads to temporal artifacts and frame flickering. Furthermore, existing enhancement methods struggle to accurately capture features in underwater scenes. This makes it difficult to handle challenges such as uneven lighting and edge blurring in complex underwater environments. To address these issues, this paper presents a dual-branch underwater video enhancement network. The network synthesizes short-range video sequences by learning and inferring optical flow from individual frames. It effectively enhances temporal consistency across video frames through predicted optical flow information, thereby mitigating temporal instability within frame sequences. In addition, to address the limitations of traditional U-Net models in handling complex multiscale feature fusion, this study proposes a novel underwater feature fusion module. By applying both max pooling and average pooling, this module separately extracts local and global features. It utilizes an attention mechanism to adaptively adjust the weights of different regions in the feature map, thereby effectively enhancing key regions within underwater video frames. Experimental results indicate that when compared with the existing underwater image enhancement baseline method and the consistency enhancement baseline method, the proposed model improves the consistency index by 30% and shows a marginal decrease of only 0.6% in enhancement quality index, demonstrating its superiority in underwater video enhancement tasks. Full article
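The dual-branch training described here enforces temporal consistency by warping along predicted optical flow. The sketch below shows one common form of such a consistency objective: backward-warp the previous enhanced frame with the flow and penalize its difference from the current enhanced frame. The exact loss used in the paper is not given in this listing, so the L1 form and function names are assumptions.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `frame` (B, C, H, W) with optical flow (B, 2, H, W), where
    flow[:, 0] is the horizontal and flow[:, 1] the vertical displacement in
    pixels. Uses grid_sample with a normalized sampling grid."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)       # (2, H, W)
    coords = grid.unsqueeze(0) + flow                                  # displaced pixel coords
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                      # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)            # (B, H, W, 2)
    return F.grid_sample(frame, sample_grid, align_corners=True)

def temporal_consistency_loss(enhanced_t: torch.Tensor,
                              enhanced_prev: torch.Tensor,
                              flow: torch.Tensor) -> torch.Tensor:
    """Penalize flicker: the current enhanced frame should match the previous
    enhanced frame warped along the predicted optical flow."""
    return F.l1_loss(enhanced_t, warp_with_flow(enhanced_prev, flow))

if __name__ == "__main__":
    f_t, f_prev = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
    flow = torch.zeros(1, 2, 64, 64)          # zero flow reduces to a plain L1 difference
    print(temporal_consistency_loss(f_t, f_prev, flow).item())
```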
(This article belongs to the Section Ocean Engineering)
Figures:
Figure 1: (a) Degraded simulated underwater video frame; (b) segmentation result; (c) optical flow prediction; (d) ground truth.
Figure 2: Underwater feature fusion module. This spatial attention mechanism is particularly important in underwater visual enhancement tasks, where it helps address issues such as uneven illumination, scattering effects, and color degradation, significantly enhancing the ability to focus on target information and improving overall image quality.
Figure 3: Underwater video enhancement network.
Figure 4: Overview of the full pipeline, consisting of two steps: (a) during optical flow prediction, Watermask is used to separate the object from the background, ten guided motion vectors are randomly sampled from each object area, and the guided motion vectors and the clear underwater video frame are input to the CMP optical flow prediction model to obtain the predicted optical flow; (b) during training and testing, the network consists of two branches. The upper branch functions in both the training and testing phases, while the lower branch serves as an auxiliary branch used only during training to enforce temporal consistency; images in the second branch are warped from those in the main branch using the same optical flow. During testing, the network directly takes the input and predicts the output without requiring optical flow prediction.
Figure 5: Comparison Experiment Scenario 1: (a) Input, (b) MSR, (c) UWCNN, (d) SGUIENet, (e) UGAN, (f) FunieGAN, (g) BLIND, (h) Ours, (i) GT. The six pictures are continuous video frames, shown together to demonstrate the enhancement results of the different models in terms of temporal consistency.
Figure 6: Histogram and standard deviation for Comparison Experiment Scenario 1: (a) MSR, (b) UWCNN, (c) SGUIENet, (d) UGAN, (e) FunieGAN, (f) BLIND, (g) Ours, (h) GT. The x-axis represents pixel values 0-255 and the y-axis the pixel count for each value; the semi-transparent RGB histograms show the pixel value distribution of the R, G, and B channels in each frame, and the red, green, and blue thick lines show the corresponding standard deviation curves.
Figure 7: Comparison Experiment Scenario 1 details: (a) MSR, (b) UWCNN, (c) SGUIENet, (d) UGAN, (e) FunieGAN, (f) BLIND, (g) Ours, (h) GT. The figure shows the edge details and edge artifacts of video frames generated by the different models.
Figure 8: Model performance as a function of the loss function weight λ: training a temporally stable image-based model is a compromise between visual quality and temporal stability, and the optimal result lies in their balance.
Figure 9: Ablation Experiment Scenario 1: (a) Input, (b) U-Net, (c) U-Net + TripleWFM, (d) U-Net + WFENet, (e) Ours, (f) GT. The six pictures are continuous video frames, presented together to show the enhancement results of the different models for temporal consistency.
Figure A1: Comparison Experiment Scenario 2: (a) Input, (b) MSR, (c) UWCNN, (d) SGUIENet, (e) UGAN, (f) FunieGAN, (g) BLIND, (h) Ours, (i) GT.
Figure A2: Comparison Experiment Scenario 3: (a) Input, (b) MSR, (c) UWCNN, (d) SGUIENet, (e) UGAN, (f) FunieGAN, (g) BLIND, (h) Ours, (i) GT.
Figure A3: Comparison Experiment Scenario 4: (a) Input, (b) MSR, (c) UWCNN, (d) SGUIENet, (e) UGAN, (f) FunieGAN, (g) BLIND, (h) Ours, (i) GT.
Figure A4: Ablation Experiment Scenario 2: (a) Input, (b) U-Net, (c) U-Net + TripleWFM, (d) U-Net + WFENet, (e) Ours, (f) GT.
Figure A5: Ablation Experiment Scenario 3: (a) Input, (b) U-Net, (c) U-Net + TripleWFM, (d) U-Net + WFENet, (e) Ours, (f) GT.
Figure A6: Ablation Experiment Scenario 4: (a) Input, (b) U-Net, (c) U-Net + TripleWFM, (d) U-Net + WFENet, (e) Ours, (f) GT.
20 pages, 2870 KiB  
Article
Research on Mine-Personnel Helmet Detection Based on Multi-Strategy-Improved YOLOv11
by Lei Zhang, Zhipeng Sun, Hongjing Tao, Meng Wang and Weixun Yi
Sensors 2025, 25(1), 170; https://doi.org/10.3390/s25010170 - 31 Dec 2024
Viewed by 907
Abstract
In the complex environment of fully mechanized mining faces, the current object detection algorithms face significant challenges in achieving optimal accuracy and real-time detection of mine personnel and safety helmets. This difficulty arises from factors such as uneven lighting conditions and equipment obstructions, which often lead to missed detections. Consequently, these limitations pose a considerable challenge to effective mine safety management. This article presents an enhanced algorithm based on YOLOv11n, referred to as GCB-YOLOv11. The proposed improvements are realized through three key aspects: Firstly, the traditional convolution is replaced with GSConv, which significantly enhances feature extraction capabilities while simultaneously reducing computational costs. Secondly, a novel C3K2_FE module was designed that integrates Faster_block and ECA attention mechanisms. This design aims to improve detection accuracy while also accelerating detection speed. Finally, the introduction of the Bi FPN mechanism in the Neck section optimizes the efficiency of multi-scale feature fusion and addresses issues related to feature loss and redundancy. The experimental results demonstrate that GCB-YOLOv11 exhibits strong performance on the dataset concerning mine personnel and safety helmets, achieving a mean average precision of 93.6%. Additionally, the frames per second reached 90.3 f·s−1, representing increases of 3.3% and 9.4%, respectively, compared to the baseline model. In addition, when compared to models such as YOLOv5s, YOLOv8s, YOLOv3 Tiny, Fast R-CNN, and RT-DETR, GCB-YOLOv11 demonstrates superior performance in both detection accuracy and model complexity. This highlights its advantages in mining environments and offers a viable technical solution for enhancing the safety of mine personnel. Full article
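The ECA attention mechanism named in the abstract is a published, standard component (global average pooling, a 1-D convolution across channels, and a sigmoid gate). The sketch below is a common reference implementation of ECA itself; the paper's C3K2_FE wiring around it is not reproduced, and the kernel size is an assumption.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: global average pooling followed by a 1-D
    convolution over the channel axis and a sigmoid gate that reweights
    channels. Reference sketch, not the paper's full C3K2_FE module."""
    def __init__(self, channels: int, k_size: int = 3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.avg_pool(x)                          # (B, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(1, 2))  # 1-D conv across channels
        y = self.sigmoid(y.transpose(1, 2).unsqueeze(-1))
        return x * y                                  # channel-wise reweighting

if __name__ == "__main__":
    x = torch.randn(2, 128, 20, 20)
    print(ECA(128)(x).shape)  # torch.Size([2, 128, 20, 20])
```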
(This article belongs to the Special Issue Recent Advances in Optical Sensor for Mining)
Figures:
Figure 1: YOLOv11 model structure.
Figure 2: Two types of residual structures: (a) C3 structure and (b) C3K2 structure.
Figure 3: C2PSA module structure.
Figure 4: Comparison of the YOLOv11 and YOLOv8 detection head structures.
Figure 5: GCB-YOLOv11 network structure.
Figure 6: GSConv structure.
Figure 7: Faster_block structure.
Figure 8: ECA attention mechanism structure.
Figure 9: C3K2_faster structure.
Figure 10: Feature pyramid structures: (a) FPN + PAN structure and (b) Bi FPN structure.
Figure 11: Sample images from the dataset.
Figure 12: Comparison of the training curves of the two models: (a) mAP@0.5 curve and (b) loss curve.
Figure 13: Comparison of the P-R curves of the two models on the validation set: (a) P-R curve of YOLOv11n and (b) P-R curve of GCB-YOLOv11.
Figure 14: Detection results of different models.
Figure 15: GCB-YOLOv11 heatmap.
23 pages, 3884 KiB  
Article
Cascaded Feature Fusion Grasping Network for Real-Time Robotic Systems
by Hao Li and Lixin Zheng
Sensors 2024, 24(24), 7958; https://doi.org/10.3390/s24247958 - 13 Dec 2024
Viewed by 735
Abstract
Grasping objects of irregular shapes and various sizes remains a key challenge in the field of robotic grasping. This paper proposes a novel RGB-D data-based grasping pose prediction network, termed Cascaded Feature Fusion Grasping Network (CFFGN), designed for high-efficiency, lightweight, and rapid grasping pose estimation. The network employs innovative structural designs, including depth-wise separable convolutions to reduce parameters and enhance computational efficiency; convolutional block attention modules to augment the model’s ability to focus on key features; multi-scale dilated convolution to expand the receptive field and capture multi-scale information; and bidirectional feature pyramid modules to achieve effective fusion and information flow of features at different levels. In tests on the Cornell dataset, our network achieved grasping pose prediction at a speed of 66.7 frames per second, with accuracy rates of 98.6% and 96.9% for image-wise and object-wise splits, respectively. The experimental results show that our method achieves high-speed processing while maintaining high accuracy. In real-world robotic grasping experiments, our method also proved to be effective, achieving an average grasping success rate of 95.6% on a robot equipped with parallel grippers. Full article
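The depth-wise separable convolutions cited for parameter reduction are a standard building block: a per-channel 3x3 convolution followed by a 1x1 point-wise convolution. The sketch below shows this factorization and the resulting parameter saving; the BN/ReLU placement is a common convention and an assumption, not necessarily the paper's exact layer.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise separable convolution: a per-channel (depth-wise) 3x3
    convolution followed by a 1x1 point-wise convolution. The factorization
    is what yields the parameter/FLOP savings mentioned in the abstract."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

if __name__ == "__main__":
    std_params = 64 * 128 * 3 * 3            # standard 3x3 conv, 64 -> 128 channels
    dws_params = 64 * 3 * 3 + 64 * 128       # depth-wise + point-wise weights
    print(std_params, dws_params)            # 73728 vs. 8768
    print(DepthwiseSeparableConv(64, 128)(torch.randn(1, 64, 56, 56)).shape)
```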
(This article belongs to the Section Sensors and Robotics)
Figures:
Figure 1: Grasping configuration representation.
Figure 2: Illustration of the complete grasp representation and angle-encoding pipeline. Left: input RGB-D images annotated with the grasp parameters: grasp center (u, v), grasp angle θ, and grasp width w. Middle: three parameterized grasp maps derived from the input: the grasp quality map Q (values from 0 to 1.0, indicating grasp success probability), the grasp angle map Φ (range [−π/2, π/2]), and the grasp width map W (in pixels). Right: angle encoding using the trigonometric transformations Φ_cos = cos(2Φ) and Φ_sin = sin(2Φ) to handle angle periodicity (see the sketch after this figure list). The color scales indicate the value range of each map: grasp quality 0-1.0, angle −π/2 to π/2, and width 0-100 pixels.
Figure 3: Network architecture of the Cascaded Feature Fusion Grasp Network (CFFGN).
Figure 4: Grasp parameter calculation process. The network takes RGB-D data as input and outputs four maps: Q, cos(2Φ), sin(2Φ), and W.
Figure 5: Left: standard convolution with BN and ReLU layers. Right: depth-wise separable convolution structure.
Figure 6: Schematic diagram of the CBAM module, comprising a channel attention module and a spatial attention module applied sequentially to the input features.
Figure 7: Channel attention module in the CBAM. The input feature F (H × W × C) undergoes global max pooling and average pooling, producing two 1 × 1 × C descriptors that are processed by a shared multi-layer perceptron; the outputs are combined into the final channel attention map M_c (1 × 1 × C).
Figure 8: Spatial attention module in the CBAM. The channel-refined feature F′ (H′ × W′ × C) undergoes max pooling and average pooling to produce H′ × W′ × 1 features, which are processed into the spatial attention map M_s (H′ × W′ × 1), capturing important spatial information in the input feature map.
Figure 9: Structure of the Multi-scale Dilated Convolution Module (MCDM).
Figure 10: BiFPN structure diagram. P3-P7 denote feature maps of different scales, from the shallow layer (P3) to the deep layer (P7). Red arrows: top-down path, fusing high-level semantic information into low-level features. Blue arrows: bottom-up path, propagating fine-grained information from low-level to high-level features. Purple arrows: same-level connections, integrating features of the same scale. Black arrows: flow paths of the initial features. The colored circles represent feature maps at the different scales.
Figure 11: Architecture of the baseline network: a 9 × 9 convolutional layer, followed by 5 × 5 and 2 × 2 max pooling layers, and then progressive upsampling layers.
Figure 12: Experimental platform for robotic grasping. The platform integrates an EPSON C4-A901S six-axis robot arm equipped with an electric parallel gripper as the end-effector; a RealSense D415 depth camera is mounted overhead in an eye-to-hand configuration, and the gripping area (marked with a red dashed box) is the workspace where objects are placed for grasping experiments.
Figure 13: Sequential demonstration of a successful umbrella grasping experiment. Left: the robotic arm approaches the target umbrella based on the predicted optimal grasping pose. Center: the gripper aligns with the detected grasping point on the umbrella body and adjusts to the appropriate width. Right: the gripper executes the grasp and lifts the umbrella, demonstrating the algorithm's capability to grasp the main body structure rather than conventional grasping points such as handles.
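Figure 2 and Figure 4 above encode the grasp angle as cos(2Φ) and sin(2Φ) to avoid the wrap-around discontinuity at ±π/2. A minimal encode/decode sketch of that step is given below; the decode via atan2 is a common choice and an assumption about how the maps are inverted.

```python
import numpy as np

def encode_angle(phi: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Encode grasp angles phi in [-pi/2, pi/2] as (cos 2phi, sin 2phi), which
    removes the wrap-around discontinuity at the interval ends."""
    return np.cos(2.0 * phi), np.sin(2.0 * phi)

def decode_angle(phi_cos: np.ndarray, phi_sin: np.ndarray) -> np.ndarray:
    """Recover phi from the two encoded maps via atan2 (result in [-pi/2, pi/2])."""
    return 0.5 * np.arctan2(phi_sin, phi_cos)

if __name__ == "__main__":
    phi = np.array([-np.pi / 2, -0.3, 0.0, 0.7, np.pi / 2])
    c, s = encode_angle(phi)
    # Angles are recovered up to the pi-periodic ambiguity of a parallel-jaw grasp.
    print(np.round(decode_angle(c, s), 4))
```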
20 pages, 5142 KiB  
Article
Adaptive Real-Time Tracking of Molten Metal Using Multi-Scale Features and Weighted Histograms
by Yifan Lei and Degang Xu
Electronics 2024, 13(15), 2905; https://doi.org/10.3390/electronics13152905 - 23 Jul 2024
Viewed by 701
Abstract
In this study, we address the tracking of the molten metal region in the dross removal process during metal ingot casting, and propose a real-time tracking method based on adaptive feature selection and a weighted histogram. This research is highly significant in metal smelting, as efficient molten metal tracking is crucial for effective dross removal and ensuring the quality of metal ingots. Due to the influence of illumination and temperature in the tracking environment, it is difficult to extract suitable features for tracking molten metal during the metal pouring process using industrial cameras. We transform the images captured by the camera into a multi-scale feature space and select the features with the maximum distinction between the molten metal region and its surrounding background for tracking. Furthermore, we introduce a weighted histogram based on the pixel values of the target region into the mean-shift tracking algorithm to improve tracking accuracy. During the tracking process, the target model is updated based on changes in the molten metal region across frames. Experimental tests confirm that this tracking method meets practical requirements, effectively addressing key challenges in molten metal tracking and providing reliable support for the dross removal process. Full article
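The abstract describes selecting, from a candidate feature space, the feature that best separates the molten metal region from its background (Figure 4 below ranks 49 candidate features by a variance ratio). The sketch shows one common way to score a single candidate feature with a log-likelihood variance ratio computed from target and background histograms; the exact candidate set and scoring used in the paper are assumptions.

```python
import numpy as np

def variance_ratio(target_vals: np.ndarray, background_vals: np.ndarray, bins: int = 32) -> float:
    """Score how well one scalar feature separates target from background.

    Build histograms p (target) and q (background), form the log-likelihood
    ratio L(i) = log(p_i / q_i), and return
        VR = Var(L; (p+q)/2) / (Var(L; p) + Var(L; q)),
    which is large when L spreads the two classes apart but stays tight
    within each class (Collins-style discriminative feature ranking)."""
    lo = min(target_vals.min(), background_vals.min())
    hi = max(target_vals.max(), background_vals.max())
    p, _ = np.histogram(target_vals, bins=bins, range=(lo, hi))
    q, _ = np.histogram(background_vals, bins=bins, range=(lo, hi))
    eps = 1e-6
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    L = np.log((p + eps) / (q + eps))

    def var_under(w):                     # variance of L under weights w
        mean = np.sum(w * L)
        return np.sum(w * (L - mean) ** 2)

    return var_under((p + q) / 2.0) / (var_under(p) + var_under(q) + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bright = rng.normal(200, 10, 2000)    # stand-in for molten metal pixels
    dark = rng.normal(80, 25, 2000)       # stand-in for background pixels
    mixed = rng.normal(140, 60, 2000)
    print(variance_ratio(bright, dark))   # high: feature separates the classes well
    print(variance_ratio(mixed, dark))    # lower: the classes overlap more
```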
(This article belongs to the Special Issue Machine Vision in Industrial Systems)
Figures:
Figure 1: Installation position of the dross removal robot on the casting production line and a schematic diagram of its dross-skimming operation.
Figure 2: The dross removal process and the challenges in tracking the molten metal area.
Figure 3: Evaluating the separability between target and background classes.
Figure 4: (a) A sample image with rectangular frames delineating molten metal and background samples. (b) Images produced by all 49 candidate features, rank-ordered by the variance ratio measure.
Figure 5: (a) The tracking frame in the selected feature space −2R+2G; frames 13 (b), 17 (c), 21 (d), 25 (e), and 29 (f) are shown.
Figure 6: (a) The tracking frame in the selected feature space 2R−G−2B; frames 155 (b), 159 (c), 163 (d), 167 (e), and 171 (f) are shown.
Figure 7: In frame 527, when the tracking target's upper edge meets the set y-value, a downward search along the y-axis locates the nearest new target region.
Figure 8: Qualitative comparison of molten metal region tracking. Our method (red) is compared with the state-of-the-art (SOTA) deep learning trackers DiMP (blue), KYS (yellow), and ToMP (green); our method demonstrates better accuracy and robustness in tracking the molten metal region.
Figure 9: Variation in Intersection over Union (IoU) values for the four tracking methods (our method, DiMP, KYS, and ToMP) over a series of frames.
21 pages, 5041 KiB  
Article
DDEYOLOv9: Network for Detecting and Counting Abnormal Fish Behaviors in Complex Water Environments
by Yinjia Li, Zeyuan Hu, Yixi Zhang, Jihang Liu, Wan Tu and Hong Yu
Fishes 2024, 9(6), 242; https://doi.org/10.3390/fishes9060242 - 20 Jun 2024
Cited by 7 | Viewed by 2441
Abstract
Accurately detecting and counting abnormal fish behaviors in aquaculture is essential. Timely detection allows farmers to take swift action to protect fish health and prevent economic losses. This paper proposes an enhanced high-precision detection algorithm based on YOLOv9, named DDEYOLOv9, to facilitate the detection and counting of abnormal fish behavior in industrial aquaculture environments. To address the lack of publicly available datasets on abnormal behavior in fish, we created the “Abnormal Behavior Dataset of Takifugu rubripes”, which includes five categories of fish behaviors. The detection algorithm was further enhanced in several key aspects. Firstly, the DRNELAN4 feature extraction module was introduced to replace the original RepNCSPELAN4 module. This change improves the model’s detection accuracy for high-density and occluded fish in complex water environments while reducing the computational cost. Secondly, the proposed DCNv4-Dyhead detection head enhances the model’s multi-scale feature learning capability, effectively recognizes various abnormal fish behaviors, and improves the computational speed. Lastly, to address the issue of sample imbalance in the abnormal fish behavior dataset, we propose EMA-SlideLoss, which enhances the model’s focus on hard samples, thereby improving the model’s robustness. The experimental results demonstrate that the DDEYOLOv9 model achieves high Precision, Recall, and mean Average Precision (mAP) on the “Abnormal Behavior Dataset of Takifugu rubripes”, with values of 91.7%, 90.4%, and 94.1%, respectively. Compared to the YOLOv9 model, these metrics are improved by 5.4%, 5.5%, and 5.4%, respectively. The model also achieves a running speed of 119 frames per second (FPS), which is 45 FPS faster than YOLOv9. Experimental results show that the DDEYOLOv9 algorithm can accurately and efficiently identify and quantify abnormal fish behaviors in specific complex environments. Full article
(This article belongs to the Special Issue AI and Fisheries)
Figures:
Figure 1: Image acquisition.
Figure 2: Abnormal behavior of Takifugu rubripes (fish with abnormal behavior are framed).
Figure 3: Sample distribution of the abnormal behavior dataset of Takifugu rubripes.
Figure 4: Structure diagram of the DDEYOLOv9 model. SPPELAN stands for Spatial Pyramid Pooling with Enhanced Local Attention Network; it enhances feature extraction and improves the accuracy of abnormal behavior detection. Through the cooperative work of multiple sub-modules, the DRNELAN4 module more effectively extracts fish characteristics from input images of complex water environments. ADown is the down-sampling convolutional block used to reduce the spatial dimension of the feature map, helping the model capture higher-level features while reducing computation.
Figure 5: Dilated Reparam Block. A dilated small-kernel conv layer is used to augment the non-dilated large-kernel conv layer. From a parametric point of view, the dilated layer is equivalent to a non-dilated conv layer with a larger sparse kernel, so the whole block can be equivalently transformed into a single large-kernel conv.
Figure 6: Comparison of the improved DRNELAN4 and RepNCSPELAN4 modules.
Figure 7: The core operation of spatial aggregation of query pixels at different locations in the same channel in DCNv4, which combines DCNv3's use of dynamic weights to aggregate spatial features with convolution's flexible unbounded aggregation weights.
Figure 8: Structure of DCNv4-Dyhead.
Figure 9: An illustration of the DCNv4-Dyhead approach.
Figure 10: Comparison of the training learning curves before and after improvement: (a) Epochs vs. Precision curves of the YOLOv9 and DDEYOLOv9 models; (b) Epochs vs. Recall curves; (c) Epochs vs. mAP curves.
Figure 11: Comparison of accuracy before and after improvement: (a) Precision bar chart for the six behavioral categories of the shoal; (b) Recall bar chart for the six behaviors; (c) mAP bar chart for the five behaviors.
Figure 12: Detection of abnormal fish behaviors in different environments ((a) YOLOv9 false detection; (b) YOLOv9 missed detection).
Figure 13: Performance comparisons: (a-c) show the Epochs vs. Precision, Epochs vs. Recall, and Epochs vs. mAP curves of the six models, respectively.
19 pages, 4123 KiB  
Article
Optimizing OCR Performance for Programming Videos: The Role of Image Super-Resolution and Large Language Models
by Mohammad D. Alahmadi and Moayad Alshangiti
Mathematics 2024, 12(7), 1036; https://doi.org/10.3390/math12071036 - 30 Mar 2024
Cited by 3 | Viewed by 1990
Abstract
The rapid evolution of video programming tutorials as a key educational resource has highlighted the need for effective code extraction methods. These tutorials, varying widely in video quality, present a challenge for accurately transcribing the embedded source code, crucial for learning and software development. This study investigates the impact of video quality on the performance of optical character recognition (OCR) engines and the potential of large language models (LLMs) to enhance code extraction accuracy. Our comprehensive empirical analysis utilizes a rich dataset of programming screencasts, involving manual transcription of source code and the application of both traditional OCR engines, like Tesseract and Google Vision, and advanced LLMs, including GPT-4V and Gemini. We investigate the efficacy of image super-resolution (SR) techniques, namely, enhanced deep super-resolution (EDSR) and multi-scale deep super-resolution (MDSR), in improving the quality of low-resolution video frames. The findings reveal significant improvements in OCR accuracy with the use of SR, particularly at lower resolutions such as 360p. LLMs demonstrate superior performance across all video qualities, indicating their robustness and advanced capabilities in diverse scenarios. This research contributes to the field of software engineering by offering a benchmark for code extraction from video tutorials and demonstrating the substantial impact of SR techniques and LLMs in enhancing the readability and reusability of code from these educational resources. Full article
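The figures for this study report OCR/LLM accuracy as NLD scores. The sketch below shows one common definition: Levenshtein edit distance normalized by the longer string's length and turned into a similarity, so 1.0 means a perfect transcription. The exact normalization and tokenization used by the authors are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def nld_similarity(ocr_text: str, ground_truth: str) -> float:
    """1 - normalized edit distance: 1.0 means a perfect transcription."""
    if not ocr_text and not ground_truth:
        return 1.0
    return 1.0 - levenshtein(ocr_text, ground_truth) / max(len(ocr_text), len(ground_truth))

if __name__ == "__main__":
    gt = "for i in range(10):"
    print(nld_similarity("for i in range(10):", gt))  # 1.0
    print(nld_similarity("for l in ranqe(1O):", gt))  # < 1.0, penalizes typical OCR confusions
```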
(This article belongs to the Special Issue AI-Augmented Software Engineering)
Figures:
Figure 1: An overview of our empirical study on OCR and LLM accuracy across different video programming qualities using super-resolution techniques.
Figure 2: Visual representation of images with varying resolutions (360p to 1080p) within our Python dataset, showcasing the diverse quality levels found in the dataset.
Figure 3: Boxplots of OCR and LLM performance on different image qualities, measured by NLD scores.
Figure 4: Boxplots of OCR and LLM performance on different image qualities, measured by NLD-Token scores across programming languages.
Figure 5: Boxplots of OCR and LLM performance on different image qualities, measured by NLD-Token scores on images pre-processed using super-resolution.
Figure 6: A sample of Python code images with a 360p resolution processed using EDSR-×2 and EDSR-×4 as part of our super-resolution techniques.
24 pages, 8939 KiB  
Article
YOLOv7-GCA: A Lightweight and High-Performance Model for Pepper Disease Detection
by Xuejun Yue, Haifeng Li, Qingkui Song, Fanguo Zeng, Jianyu Zheng, Ziyu Ding, Gaobi Kang, Yulin Cai, Yongda Lin, Xiaowan Xu and Chaoran Yu
Agronomy 2024, 14(3), 618; https://doi.org/10.3390/agronomy14030618 - 19 Mar 2024
Cited by 2 | Viewed by 1676
Abstract
Existing disease detection models for deep learning-based monitoring and prevention of pepper diseases face challenges in accurately identifying and preventing diseases due to inter-crop occlusion and various complex backgrounds. To address this issue, we propose a modified YOLOv7-GCA model based on YOLOv7 for pepper disease detection, which can effectively overcome these challenges. The model introduces three key enhancements: Firstly, lightweight GhostNetV2 is used as the feature extraction network of the model to improve the detection speed. Secondly, the Cascading fusion network (CFNet) replaces the original feature fusion network, which improves the expression ability of the model in complex backgrounds and realizes multi-scale feature extraction and fusion. Finally, the Convolutional Block Attention Module (CBAM) is introduced to focus on the important features in the images and improve the accuracy and robustness of the model. This study uses the collected dataset, which was processed to construct a dataset of 1259 images with four types of pepper diseases: anthracnose, bacterial diseases, umbilical rot, and viral diseases. We applied data augmentation to the collected dataset, and then experimental verification was carried out on this dataset. The experimental results demonstrate that the YOLOv7-GCA model reduces the parameter count by 34.3% compared to the YOLOv7 original model while improving 13.4% in mAP and 124 frames/s in detection speed. Additionally, the model size was reduced from 74.8 MB to 46.9 MB, which facilitates the deployment of the model on mobile devices. When compared to the other seven mainstream detection models, it was indicated that the YOLOv7-GCA model achieved a balance between speed, model size, and accuracy. This model proves to be a high-performance and lightweight pepper disease detection solution that can provide accurate and timely diagnosis results for farmers and researchers. Full article
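CBAM, named in the abstract, is a published attention block: channel attention (shared MLP over global max- and average-pooled descriptors) followed by spatial attention (channel-wise max/mean maps through a 7x7 convolution). The sketch below is a common reference implementation of CBAM itself; where it is inserted into YOLOv7-GCA is not shown here, and the reduction ratio is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """CBAM channel attention: shared MLP over global max- and avg-pooled
    descriptors, summed and passed through a sigmoid."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        return x * self.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """CBAM spatial attention: channel-wise max/mean maps -> 7x7 conv -> sigmoid."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        mx, _ = x.max(dim=1, keepdim=True)
        avg = x.mean(dim=1, keepdim=True)
        return x * self.sigmoid(self.conv(torch.cat([mx, avg], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, applied sequentially."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))

if __name__ == "__main__":
    print(CBAM(256)(torch.randn(1, 256, 20, 20)).shape)  # torch.Size([1, 256, 20, 20])
```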
Figures:
Figure 1: Examples of data augmentation: (a) original image; (b) contrast augmentation; (c) cutout augmentation; (d) rotation; (e) kernel filters; (f) salt-and-pepper noise; (g) scaling; (h) random cropping; (i) mosaic augmentation.
Figure 2: The network structure of the original YOLOv7.
Figure 3: DFC mechanism and GhostNetV2 module (Mul: feature map multiplication; Add: feature map addition).
Figure 4: GhostNetV2 information aggregation process diagram.
Figure 5: CBAM algorithm implementation flowchart.
Figure 6: The network structure of CFNet.
Figure 7: Illustration of the CIoU loss formula.
Figure 8: YOLOv7-GCA network architecture.
Figure 9: PR plots of the YOLOv7 (a) and YOLOv7-GCA (b) models.
Figure 10: Recognition effect analysis: (a,d,g) labeled results; (b,e,h) YOLOv7 detection results; (c,f,i) YOLOv7-GCA detection results.
Figure 11: The mAP (a) and training loss (b) of the ablation experiments.
Figure 12: Confusion matrices of the YOLOv7 (a) and YOLOv7-GCA (b) recognition results.
Figure 13: Flowchart of the deployment process on the Android terminal.
Figure 14: Examples of pepper disease detection: (a) anthracnose; (b) umbilical rot; (c) viral diseases.
15 pages, 4905 KiB  
Article
Transformer-Based Cascading Reconstruction Network for Video Snapshot Compressive Imaging
by Jiaxuan Wen, Junru Huang, Xunhao Chen, Kaixuan Huang and Yubao Sun
Appl. Sci. 2023, 13(10), 5922; https://doi.org/10.3390/app13105922 - 11 May 2023
Cited by 2 | Viewed by 1560
Abstract
Video Snapshot Compressive Imaging (SCI) is a new imaging method based on compressive sensing. It encodes image sequences into a single snapshot measurement and then recovers the original high-speed video through reconstruction algorithms, which has the advantages of a low hardware cost and high imaging efficiency. How to construct an efficient algorithm is the key problem of video SCI. Although the current mainstream deep convolution network reconstruction methods can directly learn the inverse reconstruction mapping, they still have shortcomings in the representation of the complex spatiotemporal content of video scenes and the modeling of long-range contextual correlation. The quality of reconstruction still needs to be improved. To solve this problem, we propose a Transformer-based Cascading Reconstruction Network for Video Snapshot Compressive Imaging. In terms of the long-range correlation matching in the Transformer, the proposed network can effectively capture the spatiotemporal correlation of video frames for reconstruction. Specifically, according to the residual measurement mechanism, the reconstruction network is configured as a cascade of two stages: overall structure reconstruction and incremental details reconstruction. In the first stage, a multi-scale Transformer module is designed to extract the long-range multi-scale spatiotemporal features and reconstruct the overall structure. The second stage takes the measurement of the first stage as the input and employs a dynamic fusion module to adaptively fuse the output features of the two stages so that the cascading network can effectively represent the content of complex video scenes and reconstruct more incremental details. Experiments on simulation and real datasets show that the proposed method can effectively improve the reconstruction accuracy, and ablation experiments also verify the validity of the constructed network modules. Full article
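The abstract states that SCI encodes an image sequence into a single snapshot measurement. The commonly used forward model, sketched below, modulates each frame with a per-frame coding mask and sums the result into one 2-D measurement; the reconstruction network then inverts this mapping. Binary masks and the noise-free form are simplifying assumptions.

```python
import numpy as np

def sci_measurement(frames: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Video snapshot compressive imaging forward model: B video frames x_t
    are modulated by coding masks C_t and summed into a single 2-D snapshot,
    y = sum_t C_t * x_t (noise is added in practice)."""
    assert frames.shape == masks.shape            # both (B, H, W)
    return np.sum(masks * frames, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B, H, W = 8, 256, 256
    frames = rng.random((B, H, W)).astype(np.float32)        # high-speed frames
    masks = (rng.random((B, H, W)) > 0.5).astype(np.float32)  # hypothetical binary masks
    y = sci_measurement(frames, masks)
    print(y.shape)                                # (256, 256): the single snapshot to invert
```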
Figures:
Figure 1: Schematic diagram of video snapshot compressive imaging.
Figure 2: Diagram of the Transformer-based Cascading Reconstruction Network for video snapshot compressive imaging.
Figure 3: Diagram of the multi-scale Transformer network for overall structure reconstruction.
Figure 4: Diagram of the dynamic fusion Transformer network for incremental details reconstruction.
Figure 5: Reconstruction results of the two stages of our network: (a) overall structure reconstruction; (b) incremental details reconstruction; (c) final reconstruction.
Figure 6: Reconstructed frames of six simulation datasets by different methods (Ground Truth on the left, each method's reconstruction on the right). The selected frames are Aerial #5, Crash #24, Drop #4, Kobe #6, Runner #1, and Traffic #18.
Figure 7: Reconstruction results of different methods on the real dataset Wheel (red boxes show enlarged details).
15 pages, 4694 KiB  
Article
CenterPNets: A Multi-Task Shared Network for Traffic Perception
by Guangqiu Chen, Tao Wu, Jin Duan, Qi Hu, Dandan Huang and Hao Li
Sensors 2023, 23(5), 2467; https://doi.org/10.3390/s23052467 - 23 Feb 2023
Cited by 2 | Viewed by 2015
Abstract
The importance of panoramic traffic perception tasks in autonomous driving is increasing, so shared networks with high accuracy are becoming increasingly important. In this paper, we propose a multi-task shared sensing network, called CenterPNets, that can perform the three major detection tasks of target detection, driving area segmentation, and lane detection in traffic sensing in one go, and we propose several key optimizations to improve the overall detection performance. First, this paper proposes an efficient detection head and segmentation head based on a shared path aggregation network to improve the overall reuse rate of CenterPNets and an efficient multi-task joint training loss function to optimize the model. Secondly, the detection head branch uses an anchor-free mechanism to automatically regress target location information to improve the inference speed of the model. Finally, the split-head branch fuses deep multi-scale features with shallow fine-grained features, ensuring that the extracted features are rich in detail. CenterPNets achieves an average detection accuracy of 75.8% on the publicly available large-scale Berkeley DeepDrive dataset, with an intersection ratio of 92.8% and 32.1% for driveable areas and lane areas, respectively. Therefore, CenterPNets is a precise and effective solution to the multi-tasking detection issue. Full article
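The anchor-free detection head regresses target locations directly instead of matching anchors. The sketch below shows a generic CenterNet-style decoding step (local maxima of a class heatmap taken as centers, with regressed width/height read at those locations); it illustrates the anchor-free mechanism in general and is not the CenterPNets head itself, so all shapes and names are assumptions.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap: torch.Tensor, wh: torch.Tensor, k: int = 10):
    """Decode an anchor-free head, CenterNet-style: keep local maxima of the
    class heatmap, take the top-k peaks as object centers, and read the
    regressed width/height at those locations. Generic sketch only."""
    b, c, h, w = heatmap.shape
    peaks = (heatmap == F.max_pool2d(heatmap, 3, stride=1, padding=1)).float() * heatmap
    scores, idx = peaks.view(b, -1).topk(k)          # flatten over classes and space
    classes = torch.div(idx, h * w, rounding_mode="floor")
    boxes = []
    for i in range(k):                               # (x1, y1, x2, y2) per detection
        y_i = int((idx[0, i] % (h * w)) // w)
        x_i = int(idx[0, i] % w)
        bw = float(wh[0, 0, y_i, x_i])
        bh = float(wh[0, 1, y_i, x_i])
        boxes.append([x_i - bw / 2, y_i - bh / 2, x_i + bw / 2, y_i + bh / 2])
    return scores[0], classes[0], torch.tensor(boxes)

if __name__ == "__main__":
    hm = torch.rand(1, 3, 96, 160).sigmoid()         # e.g., 3 traffic object classes
    wh = torch.rand(1, 2, 96, 160) * 30              # regressed box sizes in pixels
    scores, cls, boxes = decode_centers(hm, wh, k=5)
    print(scores.shape, cls.shape, boxes.shape)      # (5,) (5,) (5, 4)
```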
(This article belongs to the Section Vehicular Sensing)
Figures:
Figure 1: HybridNets architecture: one encoder (backbone network and neck network) and two decoders (detection head and segmentation head).
Figure 2: Illustration of the detection head branching process.
Figure 3: Illustration of the segmentation head branching process.
Figure 4: Target detection visualization comparison: (a) YOLOP; (b) HybridNets; (c) CenterPNets.
Figure 5: Driveable area segmentation visualization comparison: (a) YOLOP; (b) HybridNets; (c) CenterPNets.
Figure 6: Lane segmentation visualization comparison: (a) YOLOP; (b) HybridNets; (c) CenterPNets.
Figure 7: CenterPNets multi-task results.
13 pages, 2627 KiB  
Article
Method for Segmentation of Litchi Branches Based on the Improved DeepLabv3+
by Jiaxing Xie, Tingwei Jing, Binhan Chen, Jiajun Peng, Xiaowei Zhang, Peihua He, Huili Yin, Daozong Sun, Weixing Wang, Ao Xiao, Shilei Lyu and Jun Li
Agronomy 2022, 12(11), 2812; https://doi.org/10.3390/agronomy12112812 - 11 Nov 2022
Cited by 10 | Viewed by 2197
Abstract
It is necessary to develop automatic picking technology to improve the efficiency of litchi picking, and the accurate segmentation of litchi branches is the key that allows robots to complete the picking task. To solve the problem of inaccurate segmentation of litchi branches under natural conditions, this paper proposes a segmentation method for litchi branches based on the improved DeepLabv3+, which replaced the backbone network of DeepLabv3+ and used the Dilated Residual Networks as the backbone network to enhance the model’s feature extraction capability. During the training process, a combination of Cross-Entropy loss and the dice coefficient loss was used as the loss function to cause the model to pay more attention to the litchi branch area, which could alleviate the negative impact of the imbalance between the litchi branches and the background. In addition, the Coordinate Attention module is added to the atrous spatial pyramid pooling, and the channel and location information of the multi-scale semantic features acquired by the network are simultaneously considered. The experimental results show that the model’s mean intersection over union and mean pixel accuracy are 90.28% and 94.95%, respectively, and the frames per second (FPS) is 19.83. Compared with the classical DeepLabv3+ network, the model’s mean intersection over union and mean pixel accuracy are improved by 13.57% and 15.78%, respectively. This method can accurately segment litchi branches, which provides powerful technical support to help litchi-picking robots find branches. Full article
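The combination of cross-entropy and dice losses mentioned here is a standard recipe for imbalanced segmentation: the dice term directly rewards overlap with the small branch class. The sketch below shows one common formulation; the 0.5/0.5 weighting and the class layout are assumptions for illustration, not the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CEDiceLoss(nn.Module):
    """Cross-entropy combined with dice loss. The dice term counteracts the
    foreground/background imbalance by directly rewarding overlap with each
    class. Weighting of the two terms is an illustrative assumption."""
    def __init__(self, ce_weight: float = 0.5, dice_weight: float = 0.5, eps: float = 1e-6):
        super().__init__()
        self.ce_weight, self.dice_weight, self.eps = ce_weight, dice_weight, eps

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # logits: (B, num_classes, H, W); target: (B, H, W) with class indices
        ce = F.cross_entropy(logits, target)
        probs = F.softmax(logits, dim=1)
        one_hot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(0, 2, 3))
        union = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
        dice = 1.0 - ((2 * inter + self.eps) / (union + self.eps)).mean()
        return self.ce_weight * ce + self.dice_weight * dice

if __name__ == "__main__":
    logits = torch.randn(2, 3, 64, 64)               # hypothetical classes: background/branch/fruit
    target = torch.randint(0, 3, (2, 64, 64))
    print(CEDiceLoss()(logits, target).item())
```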
(This article belongs to the Special Issue Precision Operation Technology and Intelligent Equipment in Farmland)
Show Figures

Figure 1

Figure 1
<p>A comparison of ResNet18 and DRN-C-26. Each rectangle in the figure represents a Conv-BN-ReLU combination. The number in the rectangle indicates the size of the convolution kernel and the number of output channels. H × W indicates the height and width of the feature map.</p>
Full article ">Figure 2
<p>A comparison of DRN-D-22 and DRN-C-26. The DRN is divided into eight stages, and each stage outputs identically-sized feature maps and uses the same dilation coefficient. Each rectangle in the figure represents a Conv-BN-ReLU combination. The number in the rectangle indicates the size of the convolution kernel and the number of output channels. H × W is the height and width of the feature map, and the green lines represent downsampling by a stride of two.</p>
Full article ">Figure 3
<p>The coordinate attention mechanism.</p>
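For readers unfamiliar with the mechanism in Figure 3, the following is a condensed PyTorch sketch of a Coordinate Attention block, which factorises channel attention into two direction-aware pooled descriptors along height and width; the reduction ratio and layer choices are assumptions, not the exact configuration used in the paper.

```python
# Condensed sketch of a Coordinate Attention block.
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        # Directional pooling: average over W gives a column descriptor,
        # average over H gives a row descriptor.
        x_h = x.mean(dim=3, keepdim=True)                        # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)    # (n, c, w, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                            # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))        # (n, c, 1, w)
        return x * a_h * a_w   # position-aware channel reweighting
```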
Full article ">Figure 4
<p>The improved DeepLabv3+ network structure.</p>
Full article ">Figure 5
<p>A comparison of the <span class="html-italic">mIoU</span> curves for transfer learning.</p>
Full article ">Figure 6
<p><span class="html-italic">mIoU</span> curves of the ablation experiment.</p>
Full article ">Figure 7
<p>A comparison of the network prediction effects.</p>
Full article ">
17 pages, 4355 KiB  
Article
Research on the Symbolic 3D Route Scene Expression Method Based on the Importance of Objects
by Fulin Han, Liang Huo, Tao Shen, Xiaoyong Zhang, Tianjia Zhang and Na Ma
Appl. Sci. 2022, 12(20), 10532; https://doi.org/10.3390/app122010532 - 19 Oct 2022
Viewed by 1555
Abstract
In the study of 3D route scene construction, the expression of key targets needs to be highlighted. This is because compared with the 3D model, the abstract 3D symbols can reflect the number and spatial distribution characteristics of entities more intuitively. Therefore, this [...] Read more.
In the construction of 3D route scenes, key targets need to be expressed prominently, because abstract 3D symbols reflect the number and spatial distribution of entities more intuitively than detailed 3D models. This research therefore proposes a symbolic 3D route scene representation method based on object importance. Taking an object importance evaluation model as its theoretical basis, the method calculates the spatial importance of objects of the same type from the spatial characteristics of the geographical objects in the 3D route scene and combines semantic factors to build the evaluation model. The 3D symbols are then designed hierarchically on the basis of the importance evaluation results and the CityGML standard. Finally, an LOD0–LOD4 symbolic 3D railway scene was constructed from railway data to realise the multi-scale expression of symbolic 3D route scenes. Compared with the conventional loading method, the real-time frame rate improved by 20 fps and was more stable, and scene loading was 5–10 s faster. The results show that the method effectively improves the efficiency of 3D route scene construction and the prominence of key objects in the scene. Full article
(This article belongs to the Special Issue State-of-the-Art Earth Sciences and Geography in China)
Show Figures

Figure 1

Figure 1
<p>A technological roadmap for the construction of symbolic 3D route scenes according to the importance of objects.</p>
Full article ">Figure 2
<p>Road junction and corner importance target: Assuming that the importance of each road <span class="html-italic">CI</span> is 1, the <span class="html-italic">CIT</span> quantitative calculations are as follows. (<b>a</b>) Four roads intersect at a point, and each road is spatially connected to the junction, so the junction <span class="html-italic">CIT</span> = 2 + 2 + 2 + 2 = 8. (<b>b</b>) Two roads intersect at a single point, and each road crosses the junction in spatial relationship; therefore, the junction <span class="html-italic">CIT</span> = 3.5 + 3.5 = 7. (<b>c</b>) Two roads intersect at a point, one road crosses the junction, and the other road joins the junction, so the junction <span class="html-italic">CIT</span> = 3.5 + 2 = 5.5. (<b>d</b>) A corner of the road itself, so that the corner <span class="html-italic">CIT</span> = 2.</p>
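The worked examples in the caption can be reproduced with a few lines of Python; the contribution weights (3.5 for a road crossing the junction, 2 for a road merely connecting to it) follow the caption, while the function and data layout are purely illustrative.

```python
# Sketch of the junction-importance (CIT) tally illustrated in Figure 2,
# assuming each road has CI = 1 and the per-road contributions from the caption.
CONTRIBUTION = {"crosses": 3.5, "connects": 2.0}

def junction_cit(road_relations):
    """road_relations: list of 'crosses' / 'connects' entries, one per road."""
    return sum(CONTRIBUTION[rel] for rel in road_relations)

# Worked examples from the caption:
assert junction_cit(["connects"] * 4) == 8           # (a) four roads joined at a point
assert junction_cit(["crosses"] * 2) == 7            # (b) two roads crossing
assert junction_cit(["crosses", "connects"]) == 5.5  # (c) one crossing, one joining
assert junction_cit(["connects"]) == 2               # (d) a corner of a single road
```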
Full article ">Figure 3
<p>Flowchart: 3D symbol classification and hierarchy.</p>
Full article ">Figure 4
<p>Natural classification of geographic entities for 3D route scenes. The classification combines real-world route scenes with the relevant specifications and covers the geographical entities that may appear when constructing 3D route scenes.</p>
Full article ">Figure 5
<p>The five LODs of CityGML: (<b>a</b>) LOD0: 2D symbol; (<b>b</b>) LOD1: simple geometry; (<b>c</b>) LOD2: simple geometry combinations; (<b>d</b>) LOD3: geometry with realistic textures; (<b>e</b>) LOD4: real internal structure.</p>
Full article ">Figure 6
<p>Range of experimental data. It contains one junction and three corners across two roads.</p>
Full article ">Figure 7
<p>Three-dimensional symbol modelling: (<b>a</b>) bridge model; (<b>b</b>) tunnel model; (<b>c</b>) signal machine model.</p>
Full article ">Figure 8
<p>Multi-scale representation of symbolic 3D railway scenes: (<b>a</b>) LOD0: Only geographic entities with object importance level 5 were loaded, containing stations in the route and tunnels at some junctions. The symbolic accuracy was at the lowest level of detail. (<b>b</b>) LOD1: Three-dimensional symbols of tunnels and some bridges are present in the scene. The accuracy of the 3D symbols at this level was improved compared to the previous level. (<b>c</b>) LOD2: In addition to the previously loaded stations, tunnels, and bridges, at this level of detail, some of the roadbeds out of junctions and turning points were loaded. The roadbeds were generally connected to the bridges. The 3D symbols at this level of detail had a clearer outline and colour. (<b>d</b>) LOD3: At this level, all the roadbeds are shown. The whole road was fully loaded, consisting of the tunnel, bridge, and roadbed stitched together. The 3D symbol structure at this scale was complete and well textured, especially the station model. (<b>e</b>) LOD4: At the highest level of detail scenes, all 3D symbols were at their highest accuracy. The 3D symbols of the signalling machines on the route were also loaded. These symbols were reproduced to the greatest extent possible for the real scene.</p>
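The loading pattern in Figure 8 suggests a simple importance-threshold rule per LOD; the sketch below illustrates that idea, with the exact threshold values assumed for illustration rather than taken from the paper.

```python
# Hedged sketch of multi-scale loading by object importance: LOD0 keeps only the
# most important entities (level 5) and each finer LOD admits lower importance
# levels until LOD4 loads everything. Thresholds are illustrative assumptions.
LOD_MIN_IMPORTANCE = {0: 5, 1: 4, 2: 3, 3: 2, 4: 1}

def entities_for_lod(entities, lod):
    """entities: iterable of (name, importance_level) pairs."""
    threshold = LOD_MIN_IMPORTANCE[lod]
    return [name for name, level in entities if level >= threshold]

scene = [("station", 5), ("tunnel", 5), ("bridge", 4),
         ("roadbed", 3), ("signal machine", 1)]
print(entities_for_lod(scene, 0))  # ['station', 'tunnel']
print(entities_for_lod(scene, 4))  # all five entities
```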
Full article ">Figure 9
<p>Quantitative comparative analysis: (<b>a</b>) The symbolic 3D railway scene constructed with this method maintained a higher frame rate after loading, improving scene fluency and the user experience. (<b>b</b>) Under the method’s multi-scale constraints, geographic entities with key features loaded faster, and the advantage became more pronounced as the LOD level increased, improving the user’s sense of depth in the scene.</p>
Full article ">
18 pages, 4630 KiB  
Article
A Method for Obtaining the Number of Maize Seedlings Based on the Improved YOLOv4 Lightweight Neural Network
by Jiaxin Gao, Feng Tan, Jiapeng Cui and Bo Ma
Agriculture 2022, 12(10), 1679; https://doi.org/10.3390/agriculture12101679 - 12 Oct 2022
Cited by 10 | Viewed by 1851
Abstract
Obtaining the number of plants is the key to evaluating the effect of maize mechanical sowing, and is also a reference for subsequent statistics on the number of missing seedlings. When the existing model is used for plant number detection, the recognition accuracy [...] Read more.
Obtaining the number of plants is key to evaluating the effect of mechanical maize sowing and also serves as a reference for subsequent statistics on missing seedlings. When existing models are used for plant number detection, recognition accuracy is low, the number of model parameters is large, and the single recognition area is small. This study proposes a method for detecting the number of maize seedlings based on an improved You Only Look Once version 4 (YOLOv4) lightweight neural network. First, the method uses an improved Ghostnet as the feature extraction network and successively introduces an attention mechanism and a k-means clustering algorithm into the model, thereby improving the detection accuracy for the number of maize seedlings. Second, depthwise separable convolutions are used instead of ordinary convolutions to make the network more lightweight. Finally, the multi-scale feature fusion network structure is improved to further reduce the total number of model parameters, and the model is pre-trained with transfer learning to obtain the optimal model for prediction on the test set. The experimental results show that the harmonic mean, recall rate, average precision and accuracy rate of the model on all test sets are 0.95, 94.02%, 97.03% and 96.25%, respectively; the model has 18.793 M parameters, a size of 71.690 MB, and runs at 22.92 frames per second (FPS). The results show that the model has high recognition accuracy, fast recognition speed, and low model complexity, and can provide technical support for maize management at the seedling stage. Full article
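The abstract mentions a k-means step without detailing its target; in YOLO-style detectors this is commonly k-means clustering of ground-truth box sizes with a 1 − IoU distance to generate anchors, and the following NumPy sketch illustrates that assumption.

```python
# Hedged sketch of k-means anchor clustering on box widths/heights (1 - IoU distance).
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """wh: (N, 2) array of ground-truth box widths and heights, N >= k."""
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), size=k, replace=False)].copy()
    for _ in range(iters):
        # IoU between every box and every anchor, assuming aligned corners.
        inter = (np.minimum(wh[:, None, 0], anchors[None, :, 0])
                 * np.minimum(wh[:, None, 1], anchors[None, :, 1]))
        union = wh.prod(axis=1)[:, None] + anchors.prod(axis=1)[None, :] - inter
        assign = np.argmax(inter / union, axis=1)   # closest anchor by 1 - IoU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = np.median(wh[assign == j], axis=0)
    return anchors[np.argsort(anchors.prod(axis=1))]  # sorted by area
```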
(This article belongs to the Section Digital Agriculture)
Show Figures

Figure 1

Figure 1
<p>Study area: (<b>a</b>) experimental-area location; (<b>b</b>) splicing diagram of test field.</p>
Full article ">Figure 2
<p>Structure diagram of ghost bottlenecks.</p>
Full article ">Figure 3
<p>Improved ghost module.</p>
Full article ">Figure 4
<p>Depthwise convolution.</p>
Full article ">Figure 5
<p>Pointwise convolution.</p>
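Figures 4 and 5 together describe a depthwise separable convolution; a minimal PyTorch sketch is given below, with the BatchNorm and ReLU6 choices being assumptions rather than the paper's exact layers.

```python
# Minimal sketch of a depthwise separable convolution: a per-channel (depthwise)
# 3x3 convolution followed by a 1x1 pointwise convolution that mixes channels.
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))     # spatial filtering per channel
        return self.act(self.bn2(self.pointwise(x)))  # cross-channel mixing
```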
Full article ">Figure 6
<p>Structure of CBAM.</p>
Full article ">Figure 7
<p>Improved multi-scale feature fusion network structure.</p>
Full article ">Figure 8
<p>Flow chart of seedling number detection.</p>
Full article ">Figure 9
<p>The curve of the loss value changing with the number of iterations.</p>
Full article ">Figure 10
<p>Detection results of different models: (<b>a</b>) YOLOv4 test results; (<b>b</b>) improved YOLOv4 lightweight network test results; (<b>c</b>) Mobilenetv1-YOLOv4 test results; (<b>d</b>) Mobilenetv3-YOLOv4 test results; (<b>e</b>) Densenet121-YOLOv4 test results; (<b>f</b>) Vgg-YOLOv4 test results.</p>
Full article ">
16 pages, 2701 KiB  
Article
Muti-Frame Point Cloud Feature Fusion Based on Attention Mechanisms for 3D Object Detection
by Zhenyu Zhai, Qiantong Wang, Zongxu Pan, Zhentong Gao and Wenlong Hu
Sensors 2022, 22(19), 7473; https://doi.org/10.3390/s22197473 - 2 Oct 2022
Cited by 9 | Viewed by 3421
Abstract
Continuous frames of point-cloud-based object detection is a new research direction. Currently, most research studies fuse multi-frame point clouds using concatenation-based methods. The method aligns different frames by using information on GPS, IMU, etc. However, this fusion method can only align static objects [...] Read more.
Object detection from continuous frames of point clouds is a new research direction. Currently, most studies fuse multi-frame point clouds with concatenation-based methods, which align different frames using GPS, IMU, and similar information. However, this kind of fusion can only align static objects, not moving ones. In this paper, we propose a non-local-based multi-scale feature fusion method that can handle both moving and static objects without GPS- and IMU-based registration. Considering that non-local methods are resource-consuming, we propose a novel simplified non-local block that exploits the sparsity of the point cloud: by filtering out empty units, memory consumption decreases by 99.93%. In addition, triple attention is adopted to enhance the key information on the object and suppress background noise, further benefiting the non-local feature fusion. Finally, we verify the method on PointPillars and CenterPoint. Experimental results show that the proposed method improves mAP by 3.9% and 4.1% compared with the concatenation-based fusion baselines PointPillars-2 and CenterPoint-2, respectively. In addition, the proposed network outperforms the powerful 3D-VID by 1.2% in mAP. Full article
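To illustrate the "filter out empty units" idea behind the simplified non-local block, the following PyTorch sketch computes attention only over the non-empty cells of a single pillar pseudo-image and scatters the result back; the projection sizes, emptiness test, and single-sample layout are assumptions, not the paper's implementation.

```python
# Hedged sketch of a sparsity-aware (index) non-local block over non-empty BEV cells.
import torch
import torch.nn as nn

class SparseNonLocal(nn.Module):
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        self.theta = nn.Linear(channels, reduced)   # query projection
        self.phi = nn.Linear(channels, reduced)     # key projection
        self.g = nn.Linear(channels, reduced)       # value projection
        self.out = nn.Linear(reduced, channels)

    def forward(self, feat):
        """feat: (C, H, W) pillar pseudo-image for a single sample."""
        c, h, w = feat.shape
        flat = feat.reshape(c, h * w).t()           # (H*W, C)
        mask = flat.abs().sum(dim=1) > 0            # keep non-empty grid cells only
        x = flat[mask]                              # (M, C), M << H*W in sparse scenes
        q, k, v = self.theta(x), self.phi(x), self.g(x)
        attn = torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)  # (M, M)
        updated = flat.clone()
        updated[mask] = x + self.out(attn @ v)      # residual update of non-empty cells
        return updated.t().reshape(c, h, w)
```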
(This article belongs to the Special Issue Artificial Intelligence and Smart Sensors for Autonomous Driving)
Show Figures

Figure 1

Figure 1
<p>Multiple frames are concatenated into one frame by registration. The black dashed box marks the area where motion blur occurs.</p>
Full article ">Figure 2
<p>The overall framework of our proposed multi-frame fusion method.</p>
Full article ">Figure 3
<p>The overall process of grid-based point cloud encoder.</p>
Full article ">Figure 4
<p>Feature extraction and fusion network. The 0th layer is the pseudo-image, which is generated by the point cloud encoder.</p>
Full article ">Figure 5
<p>Non-local module. The blue symbols represent 1 × 1 convolutions, the orange symbols represent matrix multiplication, and the green symbols represent element-wise addition.</p>
Full article ">Figure 6
<p>The correlation matrix calculation of the index-nonlocal module. In the feature map and similarity calculation stages, colored grid cells represent non-empty units, and the color classes indicate the similarity of feature points. In the correlation matrix, the gray level represents the relevance among feature points.</p>
Full article ">Figure 7
<p>The position in which triple attention is applied.</p>
Full article ">Figure 8
<p>The relationship between keyframes and intermediate frames. The red line represents the point cloud frames used in this study.</p>
Full article ">Figure 9
<p>Comparison between MFFFNet and CenterPoint-2. Lines (<b>a</b>,<b>b</b>) show two different scenes. The first column is the ground truth; the second and third columns are the detection results of CenterPoint-2 and MFFFNet, respectively. The green boxes represent the ground truth and the red boxes indicate the detection results. The black dashed boxes mark the areas that need to be focused on. Blue and orange circles indicate false positive and false negative results.</p>
Full article ">Figure 10
<p>Comparison between MFFFNet and PointPillars-2. Lines (<b>a</b>,<b>b</b>) show two different scenes. The first column is the ground truth; the second and third columns are the detection results of PointPillars-2 and MFFFNet, respectively. The green boxes represent the ground truth and the red boxes indicate the detection results. The black dashed boxes mark the areas that need to be focused on. Blue and orange circles indicate false positive and false negative results.</p>
Full article ">