Search Results (22)

Search Parameters:
Keywords = multiscale key frames

21 pages, 4811 KiB  
Article
YOLO-AMM: A Real-Time Classroom Behavior Detection Algorithm Based on Multi-Dimensional Feature Optimization
by Yi Cao, Qian Cao, Chengshan Qian and Deji Chen
Sensors 2025, 25(4), 1142; https://doi.org/10.3390/s25041142 - 13 Feb 2025
Viewed by 322
Abstract
Classroom behavior detection is a key task in constructing intelligent educational environments. However, the existing models are still deficient in detail feature capture capability, multi-layer feature correlation, and multi-scale target adaptability, making it challenging to realize high-precision real-time detection in complex scenes. This paper proposes an improved classroom behavior detection algorithm, YOLO-AMM, to solve these problems. Firstly, we constructed the Adaptive Efficient Feature Fusion (AEFF) module to enhance the fusion of semantic information between different features and improve the model’s ability to capture detailed features. Then, we designed a Multi-dimensional Feature Flow Network (MFFN), which fuses multi-dimensional features and enhances the correlation information between features through the multi-scale feature aggregation module and contextual information diffusion mechanism. Finally, we proposed a Multi-Scale Perception and Fusion Detection Head (MSPF-Head), which significantly improves the adaptability of the head to different scale targets by introducing multi-scale feature perception, feature interaction, and fusion mechanisms. The experimental results showed that compared with the YOLOv8n model, YOLO-AMM improved the mAP0.5 and mAP0.5-0.95 by 3.1% and 4.0%, significantly improving the detection accuracy. Meanwhile, YOLO-AMM increased the detection speed (FPS) by 12.9 frames per second to 169.1 frames per second, which meets the requirement for real-time detection of classroom behavior. Full article
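The AEFF, MFFN, and MSPF-Head modules described above are specific to this paper and are not reproduced here. As a rough, hedged illustration of the general idea of adaptively fusing a detail branch with a semantic branch, the following PyTorch sketch learns a per-channel gate between two same-shaped feature maps; the module name and structure are assumptions for illustration only, not the authors' design.

```python
import torch
import torch.nn as nn

class AdaptiveFusionGate(nn.Module):
    """Toy adaptive fusion: learn a per-channel gate that blends two
    same-shaped feature maps (e.g., a shallow/detail branch and a
    deep/semantic branch). Illustrative only, not the paper's AEFF."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # global context per channel
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),                                 # gate values in (0, 1)
        )

    def forward(self, detail: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([detail, semantic], dim=1))
        return g * detail + (1.0 - g) * semantic          # per-channel blend

if __name__ == "__main__":
    fuse = AdaptiveFusionGate(channels=64)
    a = torch.randn(1, 64, 40, 40)
    b = torch.randn(1, 64, 40, 40)
    print(fuse(a, b).shape)  # torch.Size([1, 64, 40, 40])
```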
(This article belongs to the Special Issue Sensor-Based Behavioral Biometrics)
Figures:
Figure 1: Network structure diagram of YOLOv8.
Figure 2: Comparison of the feature fusion network structure before and after optimization: (a) YOLOv8; (b) YOLO-AMM.
Figure 3: Network structure diagram of YOLO-AMM.
Figure 4: AEFF structure diagram.
Figure 5: LDConv structure diagram.
Figure 6: ELA structure diagram.
Figure 7: Comparison of feature fusion network structures: (a) original feature fusion network structure; (b) MFFN structure.
Figure 8: Internal structure of MSFA.
Figure 9: MSPF-Head structure diagram.
Figure 10: Model training result curves: (a) precision curve; (b) recall curve; (c) mAP50 curve; (d) mAP50-95 curve.
Figure 11: Comparison of mAP values before and after improvement: (a) mAP50 curves; (b) mAP50-95 curves.
Figure 12: Comparison of sparse behavior detection based on the number of students: (a) student behaviors detected by YOLOv8 in a dense scene; (b) heat map generated by YOLOv8, highlighting the detected areas; (c) detection results of YOLO-AMM in the same classroom environment; (d) heat map generated by YOLO-AMM. The heat map colors range from blue (weak features) to red (strong features), reflecting the intensity of the features detected by the model.
Figure 13: Comparison of intensive behavior detection based on the number of students: (a) student behaviors detected by YOLOv8 in a dense scene; (b) heat map generated by YOLOv8, highlighting the detected areas; (c) detection results of YOLO-AMM in the same classroom environment; (d) heat map generated by YOLO-AMM. The heat map colors range from blue (weak features) to red (strong features), reflecting the intensity of the features detected by the model.
23 pages, 4874 KiB  
Article
Cross-Modal Transformer-Based Streaming Dense Video Captioning with Neural ODE Temporal Localization
by Shakhnoza Muksimova, Sabina Umirzakova, Murodjon Sultanov and Young Im Cho
Sensors 2025, 25(3), 707; https://doi.org/10.3390/s25030707 - 24 Jan 2025
Viewed by 819
Abstract
Dense video captioning is a critical task in video understanding, requiring precise temporal localization of events and the generation of detailed, contextually rich descriptions. However, the current state-of-the-art (SOTA) models face significant challenges in event boundary detection, contextual understanding, and real-time processing, limiting their applicability to complex, multi-event videos. In this paper, we introduce CMSTR-ODE, a novel Cross-Modal Streaming Transformer with Neural ODE Temporal Localization framework for dense video captioning. Our model incorporates three key innovations: (1) Neural ODE-based Temporal Localization for continuous and efficient event boundary prediction, improving the accuracy of temporal segmentation; (2) cross-modal memory retrieval, which enriches video features with external textual knowledge, enabling more context-aware and descriptive captioning; and (3) a Streaming Multi-Scale Transformer Decoder that generates captions in real time, handling objects and events of varying scales. We evaluate CMSTR-ODE on the YouCook2, Flickr30k, and ActivityNet Captions benchmark datasets, where it achieves SOTA performance, significantly outperforming existing models in terms of CIDEr, BLEU-4, and ROUGE scores. Our model also demonstrates superior computational efficiency, processing videos at 15 frames per second, making it suitable for real-time applications such as video surveillance and live video captioning. Ablation studies highlight the contributions of each component, confirming the effectiveness of our approach. By addressing the limitations of current methods, CMSTR-ODE sets a new benchmark for dense video captioning, offering a robust and scalable solution for both real-time and long-form video understanding tasks. Full article
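The exact Neural ODE temporal-localization formulation is not given in this listing. The sketch below only illustrates the generic Neural ODE idea that the abstract names: a learned derivative network integrated with a fixed-step Euler solver to evolve a temporal state before scoring event boundaries. The network shapes, step count, and boundary head are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """Learned derivative dh/dt for a temporal state vector h(t)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, t: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)

def odeint_euler(func: nn.Module, h0: torch.Tensor, t0: float, t1: float, steps: int = 20) -> torch.Tensor:
    """Fixed-step Euler integration of dh/dt = func(t, h) from t0 to t1."""
    h, t = h0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        h = h + dt * func(torch.tensor(t), h)
        t += dt
    return h

if __name__ == "__main__":
    func = ODEFunc(dim=32)
    h0 = torch.randn(4, 32)                    # e.g., pooled clip features (hypothetical)
    h1 = odeint_euler(func, h0, 0.0, 1.0)
    boundary_logit = nn.Linear(32, 1)(h1)      # score "is there an event boundary here?"
    print(boundary_logit.shape)                # torch.Size([4, 1])
```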
(This article belongs to the Section Sensing and Imaging)
Figures:
Figure 1: Architecture of a cross-modal neural network for event detection and captioning.
Figure 2: A variety of dynamic scenes featuring people, animals, and outdoor activities.
27 pages, 19274 KiB  
Article
Enhancing Underwater Video from Consecutive Frames While Preserving Temporal Consistency
by Kai Hu, Yuancheng Meng, Zichen Liao, Lei Tang and Xiaoling Ye
J. Mar. Sci. Eng. 2025, 13(1), 127; https://doi.org/10.3390/jmse13010127 - 12 Jan 2025
Viewed by 753
Abstract
Current methods for underwater image enhancement primarily focus on single-frame processing. While these approaches achieve impressive results for static images, they often fail to maintain temporal coherence across frames in underwater videos, which leads to temporal artifacts and frame flickering. Furthermore, existing enhancement methods struggle to accurately capture features in underwater scenes. This makes it difficult to handle challenges such as uneven lighting and edge blurring in complex underwater environments. To address these issues, this paper presents a dual-branch underwater video enhancement network. The network synthesizes short-range video sequences by learning and inferring optical flow from individual frames. It effectively enhances temporal consistency across video frames through predicted optical flow information, thereby mitigating temporal instability within frame sequences. In addition, to address the limitations of traditional U-Net models in handling complex multiscale feature fusion, this study proposes a novel underwater feature fusion module. By applying both max pooling and average pooling, this module separately extracts local and global features. It utilizes an attention mechanism to adaptively adjust the weights of different regions in the feature map, thereby effectively enhancing key regions within underwater video frames. Experimental results indicate that when compared with the existing underwater image enhancement baseline method and the consistency enhancement baseline method, the proposed model improves the consistency index by 30% and shows a marginal decrease of only 0.6% in enhancement quality index, demonstrating its superiority in underwater video enhancement tasks. Full article
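The dual-branch training described here enforces temporal consistency by warping along predicted optical flow. The sketch below shows one common form of such a consistency objective: backward-warp the previous enhanced frame with the flow and penalize its difference from the current enhanced frame. The exact loss used in the paper is not given in this listing, so the L1 form and function names are assumptions.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `frame` (B, C, H, W) with optical flow (B, 2, H, W), where
    flow[:, 0] is the horizontal and flow[:, 1] the vertical displacement in
    pixels. Uses grid_sample with a normalized sampling grid."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)       # (2, H, W)
    coords = grid.unsqueeze(0) + flow                                  # displaced pixel coords
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                      # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)            # (B, H, W, 2)
    return F.grid_sample(frame, sample_grid, align_corners=True)

def temporal_consistency_loss(enhanced_t: torch.Tensor,
                              enhanced_prev: torch.Tensor,
                              flow: torch.Tensor) -> torch.Tensor:
    """Penalize flicker: the current enhanced frame should match the previous
    enhanced frame warped along the predicted optical flow."""
    return F.l1_loss(enhanced_t, warp_with_flow(enhanced_prev, flow))

if __name__ == "__main__":
    f_t, f_prev = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
    flow = torch.zeros(1, 2, 64, 64)          # zero flow reduces to a plain L1 difference
    print(temporal_consistency_loss(f_t, f_prev, flow).item())
```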
(This article belongs to the Section Ocean Engineering)
Figures:
Figure 1: (a) Degraded simulated underwater video frame; (b) segmentation result; (c) optical flow prediction; (d) ground truth.
Figure 2: Underwater feature fusion module. This spatial attention mechanism is particularly important in underwater visual enhancement tasks, where it helps address issues such as uneven illumination, scattering effects, and color degradation, significantly enhancing the ability to focus on target information and improving overall image quality.
Figure 3: Underwater video enhancement network.
Figure 4: Overview of the full pipeline, consisting of two steps: (a) during optical flow prediction, Watermask is used to separate the object from the background, ten guided motion vectors are randomly sampled from each object area, and the guided motion vectors and the clear underwater video frame are input to the CMP optical flow prediction model to obtain the predicted optical flow; (b) during training and testing, the network consists of two branches. The upper branch functions in both the training and testing phases, while the lower branch serves as an auxiliary branch used only during training to enforce temporal consistency; images in the second branch are warped from those in the main branch using the same optical flow. During testing, the network directly takes the input and predicts the output without requiring optical flow prediction.
Figure 5: Comparison Experiment Scenario 1: (a) Input, (b) MSR, (c) UWCNN, (d) SGUIENet, (e) UGAN, (f) FunieGAN, (g) BLIND, (h) Ours, (i) GT. The six pictures are continuous video frames, shown together to demonstrate the enhancement results of the different models in terms of temporal consistency.
Figure 6: Histogram and standard deviation for Comparison Experiment Scenario 1: (a) MSR, (b) UWCNN, (c) SGUIENet, (d) UGAN, (e) FunieGAN, (f) BLIND, (g) Ours, (h) GT. The x-axis represents pixel values 0-255 and the y-axis the pixel count for each value; the semi-transparent RGB histograms show the pixel value distribution of the R, G, and B channels in each frame, and the red, green, and blue thick lines show the corresponding standard deviation curves.
Figure 7: Comparison Experiment Scenario 1 details: (a) MSR, (b) UWCNN, (c) SGUIENet, (d) UGAN, (e) FunieGAN, (f) BLIND, (g) Ours, (h) GT. The figure shows the edge details and edge artifacts of video frames generated by the different models.
Figure 8: Model performance as a function of the loss function weight λ: training a temporally stable image-based model is a compromise between visual quality and temporal stability, and the optimal result lies in their balance.
Figure 9: Ablation Experiment Scenario 1: (a) Input, (b) U-Net, (c) U-Net + TripleWFM, (d) U-Net + WFENet, (e) Ours, (f) GT. The six pictures are continuous video frames, presented together to show the enhancement results of the different models for temporal consistency.
Figure A1: Comparison Experiment Scenario 2: (a) Input, (b) MSR, (c) UWCNN, (d) SGUIENet, (e) UGAN, (f) FunieGAN, (g) BLIND, (h) Ours, (i) GT.
Figure A2: Comparison Experiment Scenario 3: (a) Input, (b) MSR, (c) UWCNN, (d) SGUIENet, (e) UGAN, (f) FunieGAN, (g) BLIND, (h) Ours, (i) GT.
Figure A3: Comparison Experiment Scenario 4: (a) Input, (b) MSR, (c) UWCNN, (d) SGUIENet, (e) UGAN, (f) FunieGAN, (g) BLIND, (h) Ours, (i) GT.
Figure A4: Ablation Experiment Scenario 2: (a) Input, (b) U-Net, (c) U-Net + TripleWFM, (d) U-Net + WFENet, (e) Ours, (f) GT.
Figure A5: Ablation Experiment Scenario 3: (a) Input, (b) U-Net, (c) U-Net + TripleWFM, (d) U-Net + WFENet, (e) Ours, (f) GT.
Figure A6: Ablation Experiment Scenario 4: (a) Input, (b) U-Net, (c) U-Net + TripleWFM, (d) U-Net + WFENet, (e) Ours, (f) GT.
20 pages, 2870 KiB  
Article
Research on Mine-Personnel Helmet Detection Based on Multi-Strategy-Improved YOLOv11
by Lei Zhang, Zhipeng Sun, Hongjing Tao, Meng Wang and Weixun Yi
Sensors 2025, 25(1), 170; https://doi.org/10.3390/s25010170 - 31 Dec 2024
Viewed by 907
Abstract
In the complex environment of fully mechanized mining faces, the current object detection algorithms face significant challenges in achieving optimal accuracy and real-time detection of mine personnel and safety helmets. This difficulty arises from factors such as uneven lighting conditions and equipment obstructions, which often lead to missed detections. Consequently, these limitations pose a considerable challenge to effective mine safety management. This article presents an enhanced algorithm based on YOLOv11n, referred to as GCB-YOLOv11. The proposed improvements are realized through three key aspects: Firstly, the traditional convolution is replaced with GSConv, which significantly enhances feature extraction capabilities while simultaneously reducing computational costs. Secondly, a novel C3K2_FE module was designed that integrates Faster_block and ECA attention mechanisms. This design aims to improve detection accuracy while also accelerating detection speed. Finally, the introduction of the Bi FPN mechanism in the Neck section optimizes the efficiency of multi-scale feature fusion and addresses issues related to feature loss and redundancy. The experimental results demonstrate that GCB-YOLOv11 exhibits strong performance on the dataset concerning mine personnel and safety helmets, achieving a mean average precision of 93.6%. Additionally, the frames per second reached 90.3 f·s−1, representing increases of 3.3% and 9.4%, respectively, compared to the baseline model. In addition, when compared to models such as YOLOv5s, YOLOv8s, YOLOv3 Tiny, Fast R-CNN, and RT-DETR, GCB-YOLOv11 demonstrates superior performance in both detection accuracy and model complexity. This highlights its advantages in mining environments and offers a viable technical solution for enhancing the safety of mine personnel. Full article
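The ECA attention mechanism named in the abstract is a published, standard component (global average pooling, a 1-D convolution across channels, and a sigmoid gate). The sketch below is a common reference implementation of ECA itself; the paper's C3K2_FE wiring around it is not reproduced, and the kernel size is an assumption.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: global average pooling followed by a 1-D
    convolution over the channel axis and a sigmoid gate that reweights
    channels. Reference sketch, not the paper's full C3K2_FE module."""
    def __init__(self, channels: int, k_size: int = 3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.avg_pool(x)                          # (B, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(1, 2))  # 1-D conv across channels
        y = self.sigmoid(y.transpose(1, 2).unsqueeze(-1))
        return x * y                                  # channel-wise reweighting

if __name__ == "__main__":
    x = torch.randn(2, 128, 20, 20)
    print(ECA(128)(x).shape)  # torch.Size([2, 128, 20, 20])
```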
(This article belongs to the Special Issue Recent Advances in Optical Sensor for Mining)
Figures:
Figure 1: YOLOv11 model structure.
Figure 2: Two types of residual structures: (a) C3 structure and (b) C3K2 structure.
Figure 3: C2PSA module structure.
Figure 4: Comparison of the YOLOv11 and YOLOv8 detection head structures.
Figure 5: GCB-YOLOv11 network structure.
Figure 6: GSConv structure.
Figure 7: Faster_block structure.
Figure 8: ECA attention mechanism structure.
Figure 9: C3K2_faster structure.
Figure 10: Feature pyramid structures: (a) FPN + PAN structure and (b) Bi FPN structure.
Figure 11: Sample images from the dataset.
Figure 12: Comparison of the training curves of the two models: (a) mAP@0.5 curve and (b) loss curve.
Figure 13: Comparison of the P-R curves of the two models on the validation set: (a) P-R curve of YOLOv11n and (b) P-R curve of GCB-YOLOv11.
Figure 14: Detection results of different models.
Figure 15: GCB-YOLOv11 heatmap.
23 pages, 3884 KiB  
Article
Cascaded Feature Fusion Grasping Network for Real-Time Robotic Systems
by Hao Li and Lixin Zheng
Sensors 2024, 24(24), 7958; https://doi.org/10.3390/s24247958 - 13 Dec 2024
Viewed by 735
Abstract
Grasping objects of irregular shapes and various sizes remains a key challenge in the field of robotic grasping. This paper proposes a novel RGB-D data-based grasping pose prediction network, termed Cascaded Feature Fusion Grasping Network (CFFGN), designed for high-efficiency, lightweight, and rapid grasping pose estimation. The network employs innovative structural designs, including depth-wise separable convolutions to reduce parameters and enhance computational efficiency; convolutional block attention modules to augment the model’s ability to focus on key features; multi-scale dilated convolution to expand the receptive field and capture multi-scale information; and bidirectional feature pyramid modules to achieve effective fusion and information flow of features at different levels. In tests on the Cornell dataset, our network achieved grasping pose prediction at a speed of 66.7 frames per second, with accuracy rates of 98.6% and 96.9% for image-wise and object-wise splits, respectively. The experimental results show that our method achieves high-speed processing while maintaining high accuracy. In real-world robotic grasping experiments, our method also proved to be effective, achieving an average grasping success rate of 95.6% on a robot equipped with parallel grippers. Full article
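The depth-wise separable convolutions cited for parameter reduction are a standard building block: a per-channel 3x3 convolution followed by a 1x1 point-wise convolution. The sketch below shows this factorization and the resulting parameter saving; the BN/ReLU placement is a common convention and an assumption, not necessarily the paper's exact layer.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise separable convolution: a per-channel (depth-wise) 3x3
    convolution followed by a 1x1 point-wise convolution. The factorization
    is what yields the parameter/FLOP savings mentioned in the abstract."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

if __name__ == "__main__":
    std_params = 64 * 128 * 3 * 3            # standard 3x3 conv, 64 -> 128 channels
    dws_params = 64 * 3 * 3 + 64 * 128       # depth-wise + point-wise weights
    print(std_params, dws_params)            # 73728 vs. 8768
    print(DepthwiseSeparableConv(64, 128)(torch.randn(1, 64, 56, 56)).shape)
```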
(This article belongs to the Section Sensors and Robotics)
Figures:
Figure 1: Grasping configuration representation.
Figure 2: Illustration of the complete grasp representation and angle-encoding pipeline. Left: input RGB-D images annotated with the grasp parameters: grasp center (u, v), grasp angle θ, and grasp width w. Middle: three parameterized grasp maps derived from the input: the grasp quality map Q (values from 0 to 1.0, indicating grasp success probability), the grasp angle map Φ (range [−π/2, π/2]), and the grasp width map W (in pixels). Right: angle encoding using the trigonometric transformations Φ_cos = cos(2Φ) and Φ_sin = sin(2Φ) to handle angle periodicity (see the sketch after this figure list). The color scales indicate the value range of each map: grasp quality 0-1.0, angle −π/2 to π/2, and width 0-100 pixels.
Figure 3: Network architecture of the Cascaded Feature Fusion Grasp Network (CFFGN).
Figure 4: Grasp parameter calculation process. The network takes RGB-D data as input and outputs four maps: Q, cos(2Φ), sin(2Φ), and W.
Figure 5: Left: standard convolution with BN and ReLU layers. Right: depth-wise separable convolution structure.
Figure 6: Schematic diagram of the CBAM module, comprising a channel attention module and a spatial attention module applied sequentially to the input features.
Figure 7: Channel attention module in the CBAM. The input feature F (H × W × C) undergoes global max pooling and average pooling, producing two 1 × 1 × C descriptors that are processed by a shared multi-layer perceptron; the outputs are combined into the final channel attention map M_c (1 × 1 × C).
Figure 8: Spatial attention module in the CBAM. The channel-refined feature F′ (H′ × W′ × C) undergoes max pooling and average pooling to produce H′ × W′ × 1 features, which are processed into the spatial attention map M_s (H′ × W′ × 1), capturing important spatial information in the input feature map.
Figure 9: Structure of the Multi-scale Dilated Convolution Module (MCDM).
Figure 10: BiFPN structure diagram. P3-P7 denote feature maps of different scales, from the shallow layer (P3) to the deep layer (P7). Red arrows: top-down path, fusing high-level semantic information into low-level features. Blue arrows: bottom-up path, propagating fine-grained information from low-level to high-level features. Purple arrows: same-level connections, integrating features of the same scale. Black arrows: flow paths of the initial features. The colored circles represent feature maps at the different scales.
Figure 11: Architecture of the baseline network: a 9 × 9 convolutional layer, followed by 5 × 5 and 2 × 2 max pooling layers, and then progressive upsampling layers.
Figure 12: Experimental platform for robotic grasping. The platform integrates an EPSON C4-A901S six-axis robot arm equipped with an electric parallel gripper as the end-effector; a RealSense D415 depth camera is mounted overhead in an eye-to-hand configuration, and the gripping area (marked with a red dashed box) is the workspace where objects are placed for grasping experiments.
Figure 13: Sequential demonstration of a successful umbrella grasping experiment. Left: the robotic arm approaches the target umbrella based on the predicted optimal grasping pose. Center: the gripper aligns with the detected grasping point on the umbrella body and adjusts to the appropriate width. Right: the gripper executes the grasp and lifts the umbrella, demonstrating the algorithm's capability to grasp the main body structure rather than conventional grasping points such as handles.
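Figure 2 and Figure 4 above encode the grasp angle as cos(2Φ) and sin(2Φ) to avoid the wrap-around discontinuity at ±π/2. A minimal encode/decode sketch of that step is given below; the decode via atan2 is a common choice and an assumption about how the maps are inverted.

```python
import numpy as np

def encode_angle(phi: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Encode grasp angles phi in [-pi/2, pi/2] as (cos 2phi, sin 2phi), which
    removes the wrap-around discontinuity at the interval ends."""
    return np.cos(2.0 * phi), np.sin(2.0 * phi)

def decode_angle(phi_cos: np.ndarray, phi_sin: np.ndarray) -> np.ndarray:
    """Recover phi from the two encoded maps via atan2 (result in [-pi/2, pi/2])."""
    return 0.5 * np.arctan2(phi_sin, phi_cos)

if __name__ == "__main__":
    phi = np.array([-np.pi / 2, -0.3, 0.0, 0.7, np.pi / 2])
    c, s = encode_angle(phi)
    # Angles are recovered up to the pi-periodic ambiguity of a parallel-jaw grasp.
    print(np.round(decode_angle(c, s), 4))
```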
20 pages, 5142 KiB  
Article
Adaptive Real-Time Tracking of Molten Metal Using Multi-Scale Features and Weighted Histograms
by Yifan Lei and Degang Xu
Electronics 2024, 13(15), 2905; https://doi.org/10.3390/electronics13152905 - 23 Jul 2024
Viewed by 701
Abstract
In this study, we address the tracking of the molten metal region in the dross removal process during metal ingot casting, and propose a real-time tracking method based on adaptive feature selection and a weighted histogram. This research is highly significant in metal smelting, as efficient molten metal tracking is crucial for effective dross removal and ensuring the quality of metal ingots. Due to the influence of illumination and temperature in the tracking environment, it is difficult to extract suitable features for tracking molten metal during the metal pouring process using industrial cameras. We transform the images captured by the camera into a multi-scale feature space and select the features with the maximum distinction between the molten metal region and its surrounding background for tracking. Furthermore, we introduce a weighted histogram based on the pixel values of the target region into the mean-shift tracking algorithm to improve tracking accuracy. During the tracking process, the target model is updated based on changes in the molten metal region across frames. Experimental tests confirm that this tracking method meets practical requirements, effectively addressing key challenges in molten metal tracking and providing reliable support for the dross removal process. Full article
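The abstract describes selecting, from a candidate feature space, the feature that best separates the molten metal region from its background (Figure 4 below ranks 49 candidate features by a variance ratio). The sketch shows one common way to score a single candidate feature with a log-likelihood variance ratio computed from target and background histograms; the exact candidate set and scoring used in the paper are assumptions.

```python
import numpy as np

def variance_ratio(target_vals: np.ndarray, background_vals: np.ndarray, bins: int = 32) -> float:
    """Score how well one scalar feature separates target from background.

    Build histograms p (target) and q (background), form the log-likelihood
    ratio L(i) = log(p_i / q_i), and return
        VR = Var(L; (p+q)/2) / (Var(L; p) + Var(L; q)),
    which is large when L spreads the two classes apart but stays tight
    within each class (Collins-style discriminative feature ranking)."""
    lo = min(target_vals.min(), background_vals.min())
    hi = max(target_vals.max(), background_vals.max())
    p, _ = np.histogram(target_vals, bins=bins, range=(lo, hi))
    q, _ = np.histogram(background_vals, bins=bins, range=(lo, hi))
    eps = 1e-6
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    L = np.log((p + eps) / (q + eps))

    def var_under(w):                     # variance of L under weights w
        mean = np.sum(w * L)
        return np.sum(w * (L - mean) ** 2)

    return var_under((p + q) / 2.0) / (var_under(p) + var_under(q) + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bright = rng.normal(200, 10, 2000)    # stand-in for molten metal pixels
    dark = rng.normal(80, 25, 2000)       # stand-in for background pixels
    mixed = rng.normal(140, 60, 2000)
    print(variance_ratio(bright, dark))   # high: feature separates the classes well
    print(variance_ratio(mixed, dark))    # lower: the classes overlap more
```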
(This article belongs to the Special Issue Machine Vision in Industrial Systems)
Figures:
Figure 1: Installation position of the dross removal robot on the casting production line and a schematic diagram of its dross-skimming operation.
Figure 2: The dross removal process and the challenges in tracking the molten metal area.
Figure 3: Evaluating the separability between target and background classes.
Figure 4: (a) A sample image with rectangular frames delineating molten metal and background samples. (b) Images produced by all 49 candidate features, rank-ordered by the variance ratio measure.
Figure 5: (a) The tracking frame in the selected feature space −2R+2G; frames 13 (b), 17 (c), 21 (d), 25 (e), and 29 (f) are shown.
Figure 6: (a) The tracking frame in the selected feature space 2R−G−2B; frames 155 (b), 159 (c), 163 (d), 167 (e), and 171 (f) are shown.
Figure 7: In frame 527, when the tracking target's upper edge meets the set y-value, a downward search along the y-axis locates the nearest new target region.
Figure 8: Qualitative comparison of molten metal region tracking. Our method (red) is compared with the state-of-the-art (SOTA) deep learning trackers DiMP (blue), KYS (yellow), and ToMP (green); our method demonstrates better accuracy and robustness in tracking the molten metal region.
Figure 9: Variation in Intersection over Union (IoU) values for the four tracking methods (our method, DiMP, KYS, and ToMP) over a series of frames.
21 pages, 5041 KiB  
Article
DDEYOLOv9: Network for Detecting and Counting Abnormal Fish Behaviors in Complex Water Environments
by Yinjia Li, Zeyuan Hu, Yixi Zhang, Jihang Liu, Wan Tu and Hong Yu
Fishes 2024, 9(6), 242; https://doi.org/10.3390/fishes9060242 - 20 Jun 2024
Cited by 7 | Viewed by 2441
Abstract
Accurately detecting and counting abnormal fish behaviors in aquaculture is essential. Timely detection allows farmers to take swift action to protect fish health and prevent economic losses. This paper proposes an enhanced high-precision detection algorithm based on YOLOv9, named DDEYOLOv9, to facilitate the detection and counting of abnormal fish behavior in industrial aquaculture environments. To address the lack of publicly available datasets on abnormal behavior in fish, we created the “Abnormal Behavior Dataset of Takifugu rubripes”, which includes five categories of fish behaviors. The detection algorithm was further enhanced in several key aspects. Firstly, the DRNELAN4 feature extraction module was introduced to replace the original RepNCSPELAN4 module. This change improves the model’s detection accuracy for high-density and occluded fish in complex water environments while reducing the computational cost. Secondly, the proposed DCNv4-Dyhead detection head enhances the model’s multi-scale feature learning capability, effectively recognizes various abnormal fish behaviors, and improves the computational speed. Lastly, to address the issue of sample imbalance in the abnormal fish behavior dataset, we propose EMA-SlideLoss, which enhances the model’s focus on hard samples, thereby improving the model’s robustness. The experimental results demonstrate that the DDEYOLOv9 model achieves high Precision, Recall, and mean Average Precision (mAP) on the “Abnormal Behavior Dataset of Takifugu rubripes”, with values of 91.7%, 90.4%, and 94.1%, respectively. Compared to the YOLOv9 model, these metrics are improved by 5.4%, 5.5%, and 5.4%, respectively. The model also achieves a running speed of 119 frames per second (FPS), which is 45 FPS faster than YOLOv9. Experimental results show that the DDEYOLOv9 algorithm can accurately and efficiently identify and quantify abnormal fish behaviors in specific complex environments. Full article
(This article belongs to the Special Issue AI and Fisheries)
Figures:
Figure 1: Image acquisition.
Figure 2: Abnormal behavior of Takifugu rubripes (fish with abnormal behavior are framed).
Figure 3: Sample distribution of the abnormal behavior dataset of Takifugu rubripes.
Figure 4: Structure diagram of the DDEYOLOv9 model. SPPELAN stands for Spatial Pyramid Pooling with Enhanced Local Attention Network; it enhances feature extraction and improves the accuracy of abnormal behavior detection. Through the cooperative work of multiple sub-modules, the DRNELAN4 module more effectively extracts fish characteristics from input images of complex water environments. ADown is the down-sampling convolutional block used to reduce the spatial dimension of the feature map, helping the model capture higher-level features while reducing computation.
Figure 5: Dilated Reparam Block. A dilated small-kernel conv layer is used to augment the non-dilated large-kernel conv layer. From a parametric point of view, the dilated layer is equivalent to a non-dilated conv layer with a larger sparse kernel, so the whole block can be equivalently transformed into a single large-kernel conv.
Figure 6: Comparison of the improved DRNELAN4 and RepNCSPELAN4 modules.
Figure 7: The core operation of spatial aggregation of query pixels at different locations in the same channel in DCNv4, which combines DCNv3's use of dynamic weights to aggregate spatial features with convolution's flexible unbounded aggregation weights.
Figure 8: Structure of DCNv4-Dyhead.
Figure 9: An illustration of the DCNv4-Dyhead approach.
Figure 10: Comparison of the training learning curves before and after improvement: (a) Epochs vs. Precision curves of the YOLOv9 and DDEYOLOv9 models; (b) Epochs vs. Recall curves; (c) Epochs vs. mAP curves.
Figure 11: Comparison of accuracy before and after improvement: (a) Precision bar chart for the six behavioral categories of the shoal; (b) Recall bar chart for the six behaviors; (c) mAP bar chart for the five behaviors.
Figure 12: Detection of abnormal fish behaviors in different environments ((a) YOLOv9 false detection; (b) YOLOv9 missed detection).
Figure 13: Performance comparisons: (a-c) show the Epochs vs. Precision, Epochs vs. Recall, and Epochs vs. mAP curves of the six models, respectively.
19 pages, 4123 KiB  
Article
Optimizing OCR Performance for Programming Videos: The Role of Image Super-Resolution and Large Language Models
by Mohammad D. Alahmadi and Moayad Alshangiti
Mathematics 2024, 12(7), 1036; https://doi.org/10.3390/math12071036 - 30 Mar 2024
Cited by 3 | Viewed by 1990
Abstract
The rapid evolution of video programming tutorials as a key educational resource has highlighted the need for effective code extraction methods. These tutorials, varying widely in video quality, present a challenge for accurately transcribing the embedded source code, crucial for learning and software development. This study investigates the impact of video quality on the performance of optical character recognition (OCR) engines and the potential of large language models (LLMs) to enhance code extraction accuracy. Our comprehensive empirical analysis utilizes a rich dataset of programming screencasts, involving manual transcription of source code and the application of both traditional OCR engines, like Tesseract and Google Vision, and advanced LLMs, including GPT-4V and Gemini. We investigate the efficacy of image super-resolution (SR) techniques, namely, enhanced deep super-resolution (EDSR) and multi-scale deep super-resolution (MDSR), in improving the quality of low-resolution video frames. The findings reveal significant improvements in OCR accuracy with the use of SR, particularly at lower resolutions such as 360p. LLMs demonstrate superior performance across all video qualities, indicating their robustness and advanced capabilities in diverse scenarios. This research contributes to the field of software engineering by offering a benchmark for code extraction from video tutorials and demonstrating the substantial impact of SR techniques and LLMs in enhancing the readability and reusability of code from these educational resources. Full article
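The figures for this study report OCR/LLM accuracy as NLD scores. The sketch below shows one common definition: Levenshtein edit distance normalized by the longer string's length and turned into a similarity, so 1.0 means a perfect transcription. The exact normalization and tokenization used by the authors are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def nld_similarity(ocr_text: str, ground_truth: str) -> float:
    """1 - normalized edit distance: 1.0 means a perfect transcription."""
    if not ocr_text and not ground_truth:
        return 1.0
    return 1.0 - levenshtein(ocr_text, ground_truth) / max(len(ocr_text), len(ground_truth))

if __name__ == "__main__":
    gt = "for i in range(10):"
    print(nld_similarity("for i in range(10):", gt))  # 1.0
    print(nld_similarity("for l in ranqe(1O):", gt))  # < 1.0, penalizes typical OCR confusions
```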
(This article belongs to the Special Issue AI-Augmented Software Engineering)
Figures:
Figure 1: An overview of our empirical study on OCR and LLM accuracy across different video programming qualities using super-resolution techniques.
Figure 2: Visual representation of images with varying resolutions (360p to 1080p) within our Python dataset, showcasing the diverse quality levels found in the dataset.
Figure 3: Boxplots of OCR and LLM performance on different image qualities, measured by NLD scores.
Figure 4: Boxplots of OCR and LLM performance on different image qualities, measured by NLD-Token scores across programming languages.
Figure 5: Boxplots of OCR and LLM performance on different image qualities, measured by NLD-Token scores on images pre-processed using super-resolution.
Figure 6: A sample of Python code images with a 360p resolution processed using EDSR-×2 and EDSR-×4 as part of our super-resolution techniques.
24 pages, 8939 KiB  
Article
YOLOv7-GCA: A Lightweight and High-Performance Model for Pepper Disease Detection
by Xuejun Yue, Haifeng Li, Qingkui Song, Fanguo Zeng, Jianyu Zheng, Ziyu Ding, Gaobi Kang, Yulin Cai, Yongda Lin, Xiaowan Xu and Chaoran Yu
Agronomy 2024, 14(3), 618; https://doi.org/10.3390/agronomy14030618 - 19 Mar 2024
Cited by 2 | Viewed by 1676
Abstract
Existing disease detection models for deep learning-based monitoring and prevention of pepper diseases face challenges in accurately identifying and preventing diseases due to inter-crop occlusion and various complex backgrounds. To address this issue, we propose a modified YOLOv7-GCA model based on YOLOv7 for pepper disease detection, which can effectively overcome these challenges. The model introduces three key enhancements: Firstly, lightweight GhostNetV2 is used as the feature extraction network of the model to improve the detection speed. Secondly, the Cascading fusion network (CFNet) replaces the original feature fusion network, which improves the expression ability of the model in complex backgrounds and realizes multi-scale feature extraction and fusion. Finally, the Convolutional Block Attention Module (CBAM) is introduced to focus on the important features in the images and improve the accuracy and robustness of the model. This study uses the collected dataset, which was processed to construct a dataset of 1259 images with four types of pepper diseases: anthracnose, bacterial diseases, umbilical rot, and viral diseases. We applied data augmentation to the collected dataset, and then experimental verification was carried out on this dataset. The experimental results demonstrate that the YOLOv7-GCA model reduces the parameter count by 34.3% compared to the YOLOv7 original model while improving 13.4% in mAP and 124 frames/s in detection speed. Additionally, the model size was reduced from 74.8 MB to 46.9 MB, which facilitates the deployment of the model on mobile devices. When compared to the other seven mainstream detection models, it was indicated that the YOLOv7-GCA model achieved a balance between speed, model size, and accuracy. This model proves to be a high-performance and lightweight pepper disease detection solution that can provide accurate and timely diagnosis results for farmers and researchers. Full article
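CBAM, named in the abstract, is a published attention block: channel attention (shared MLP over global max- and average-pooled descriptors) followed by spatial attention (channel-wise max/mean maps through a 7x7 convolution). The sketch below is a common reference implementation of CBAM itself; where it is inserted into YOLOv7-GCA is not shown here, and the reduction ratio is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """CBAM channel attention: shared MLP over global max- and avg-pooled
    descriptors, summed and passed through a sigmoid."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        return x * self.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """CBAM spatial attention: channel-wise max/mean maps -> 7x7 conv -> sigmoid."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        mx, _ = x.max(dim=1, keepdim=True)
        avg = x.mean(dim=1, keepdim=True)
        return x * self.sigmoid(self.conv(torch.cat([mx, avg], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, applied sequentially."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))

if __name__ == "__main__":
    print(CBAM(256)(torch.randn(1, 256, 20, 20)).shape)  # torch.Size([1, 256, 20, 20])
```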
Figures:
Figure 1: Examples of data augmentation: (a) original image; (b) contrast augmentation; (c) cutout augmentation; (d) rotation; (e) kernel filters; (f) salt-and-pepper noise; (g) scaling; (h) random cropping; (i) mosaic augmentation.
Figure 2: The network structure of the original YOLOv7.
Figure 3: DFC mechanism and GhostNetV2 module (Mul: feature map multiplication; Add: feature map addition).
Figure 4: GhostNetV2 information aggregation process diagram.
Figure 5: CBAM algorithm implementation flowchart.
Figure 6: The network structure of CFNet.
Figure 7: Illustration of the CIoU loss formula.
Figure 8: YOLOv7-GCA network architecture.
Figure 9: PR plots of the YOLOv7 (a) and YOLOv7-GCA (b) models.
Figure 10: Recognition effect analysis: (a,d,g) labeled results; (b,e,h) YOLOv7 detection results; (c,f,i) YOLOv7-GCA detection results.
Figure 11: The mAP (a) and training loss (b) of the ablation experiments.
Figure 12: Confusion matrices of the YOLOv7 (a) and YOLOv7-GCA (b) recognition results.
Figure 13: Flowchart of the deployment process on the Android terminal.
Figure 14: Examples of pepper disease detection: (a) anthracnose; (b) umbilical rot; (c) viral diseases.
15 pages, 4905 KiB  
Article
Transformer-Based Cascading Reconstruction Network for Video Snapshot Compressive Imaging
by Jiaxuan Wen, Junru Huang, Xunhao Chen, Kaixuan Huang and Yubao Sun
Appl. Sci. 2023, 13(10), 5922; https://doi.org/10.3390/app13105922 - 11 May 2023
Cited by 2 | Viewed by 1560
Abstract
Video Snapshot Compressive Imaging (SCI) is a new imaging method based on compressive sensing. It encodes image sequences into a single snapshot measurement and then recovers the original high-speed video through reconstruction algorithms, which has the advantages of a low hardware cost and high imaging efficiency. How to construct an efficient algorithm is the key problem of video SCI. Although the current mainstream deep convolution network reconstruction methods can directly learn the inverse reconstruction mapping, they still have shortcomings in the representation of the complex spatiotemporal content of video scenes and the modeling of long-range contextual correlation. The quality of reconstruction still needs to be improved. To solve this problem, we propose a Transformer-based Cascading Reconstruction Network for Video Snapshot Compressive Imaging. In terms of the long-range correlation matching in the Transformer, the proposed network can effectively capture the spatiotemporal correlation of video frames for reconstruction. Specifically, according to the residual measurement mechanism, the reconstruction network is configured as a cascade of two stages: overall structure reconstruction and incremental details reconstruction. In the first stage, a multi-scale Transformer module is designed to extract the long-range multi-scale spatiotemporal features and reconstruct the overall structure. The second stage takes the measurement of the first stage as the input and employs a dynamic fusion module to adaptively fuse the output features of the two stages so that the cascading network can effectively represent the content of complex video scenes and reconstruct more incremental details. Experiments on simulation and real datasets show that the proposed method can effectively improve the reconstruction accuracy, and ablation experiments also verify the validity of the constructed network modules. Full article
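The abstract states that SCI encodes an image sequence into a single snapshot measurement. The commonly used forward model, sketched below, modulates each frame with a per-frame coding mask and sums the result into one 2-D measurement; the reconstruction network then inverts this mapping. Binary masks and the noise-free form are simplifying assumptions.

```python
import numpy as np

def sci_measurement(frames: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Video snapshot compressive imaging forward model: B video frames x_t
    are modulated by coding masks C_t and summed into a single 2-D snapshot,
    y = sum_t C_t * x_t (noise is added in practice)."""
    assert frames.shape == masks.shape            # both (B, H, W)
    return np.sum(masks * frames, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B, H, W = 8, 256, 256
    frames = rng.random((B, H, W)).astype(np.float32)        # high-speed frames
    masks = (rng.random((B, H, W)) > 0.5).astype(np.float32)  # hypothetical binary masks
    y = sci_measurement(frames, masks)
    print(y.shape)                                # (256, 256): the single snapshot to invert
```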
Figures:
Figure 1: Schematic diagram of video snapshot compressive imaging.
Figure 2: Diagram of the Transformer-based Cascading Reconstruction Network for video snapshot compressive imaging.
Figure 3: Diagram of the multi-scale Transformer network for overall structure reconstruction.
Figure 4: Diagram of the dynamic fusion Transformer network for incremental details reconstruction.
Figure 5: Reconstruction results of the two stages of our network: (a) overall structure reconstruction; (b) incremental details reconstruction; (c) final reconstruction.
Figure 6: Reconstructed frames of six simulation datasets by different methods (Ground Truth on the left, each method's reconstruction on the right). The selected frames are Aerial #5, Crash #24, Drop #4, Kobe #6, Runner #1, and Traffic #18.
Figure 7: Reconstruction results of different methods on the real dataset Wheel (red boxes show enlarged details).
15 pages, 4694 KiB  
Article
CenterPNets: A Multi-Task Shared Network for Traffic Perception
by Guangqiu Chen, Tao Wu, Jin Duan, Qi Hu, Dandan Huang and Hao Li
Sensors 2023, 23(5), 2467; https://doi.org/10.3390/s23052467 - 23 Feb 2023
Cited by 2 | Viewed by 2015
Abstract
The importance of panoramic traffic perception tasks in autonomous driving is increasing, so shared networks with high accuracy are becoming increasingly important. In this paper, we propose a multi-task shared sensing network, called CenterPNets, that can perform the three major detection tasks of target detection, driving area segmentation, and lane detection in traffic sensing in one go, and we propose several key optimizations to improve the overall detection performance. First, this paper proposes an efficient detection head and segmentation head based on a shared path aggregation network to improve the overall reuse rate of CenterPNets and an efficient multi-task joint training loss function to optimize the model. Secondly, the detection head branch uses an anchor-free mechanism to automatically regress target location information to improve the inference speed of the model. Finally, the split-head branch fuses deep multi-scale features with shallow fine-grained features, ensuring that the extracted features are rich in detail. CenterPNets achieves an average detection accuracy of 75.8% on the publicly available large-scale Berkeley DeepDrive dataset, with an intersection ratio of 92.8% and 32.1% for driveable areas and lane areas, respectively. Therefore, CenterPNets is a precise and effective solution to the multi-tasking detection issue. Full article
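The anchor-free detection head regresses target locations directly instead of matching anchors. The sketch below shows a generic CenterNet-style decoding step (local maxima of a class heatmap taken as centers, with regressed width/height read at those locations); it illustrates the anchor-free mechanism in general and is not the CenterPNets head itself, so all shapes and names are assumptions.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap: torch.Tensor, wh: torch.Tensor, k: int = 10):
    """Decode an anchor-free head, CenterNet-style: keep local maxima of the
    class heatmap, take the top-k peaks as object centers, and read the
    regressed width/height at those locations. Generic sketch only."""
    b, c, h, w = heatmap.shape
    peaks = (heatmap == F.max_pool2d(heatmap, 3, stride=1, padding=1)).float() * heatmap
    scores, idx = peaks.view(b, -1).topk(k)          # flatten over classes and space
    classes = torch.div(idx, h * w, rounding_mode="floor")
    boxes = []
    for i in range(k):                               # (x1, y1, x2, y2) per detection
        y_i = int((idx[0, i] % (h * w)) // w)
        x_i = int(idx[0, i] % w)
        bw = float(wh[0, 0, y_i, x_i])
        bh = float(wh[0, 1, y_i, x_i])
        boxes.append([x_i - bw / 2, y_i - bh / 2, x_i + bw / 2, y_i + bh / 2])
    return scores[0], classes[0], torch.tensor(boxes)

if __name__ == "__main__":
    hm = torch.rand(1, 3, 96, 160).sigmoid()         # e.g., 3 traffic object classes
    wh = torch.rand(1, 2, 96, 160) * 30              # regressed box sizes in pixels
    scores, cls, boxes = decode_centers(hm, wh, k=5)
    print(scores.shape, cls.shape, boxes.shape)      # (5,) (5,) (5, 4)
```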
(This article belongs to the Section Vehicular Sensing)
Figures:
Figure 1: HybridNets architecture: one encoder (backbone network and neck network) and two decoders (detection head and segmentation head).
Figure 2: Illustration of the detection head branching process.
Figure 3: Illustration of the segmentation head branching process.
Figure 4: Target detection visualization comparison: (a) YOLOP; (b) HybridNets; (c) CenterPNets.
Figure 5: Driveable area segmentation visualization comparison: (a) YOLOP; (b) HybridNets; (c) CenterPNets.
Figure 6: Lane segmentation visualization comparison: (a) YOLOP; (b) HybridNets; (c) CenterPNets.
Figure 7: CenterPNets multi-task results.
13 pages, 2627 KiB  
Article
Method for Segmentation of Litchi Branches Based on the Improved DeepLabv3+
by Jiaxing Xie, Tingwei Jing, Binhan Chen, Jiajun Peng, Xiaowei Zhang, Peihua He, Huili Yin, Daozong Sun, Weixing Wang, Ao Xiao, Shilei Lyu and Jun Li
Agronomy 2022, 12(11), 2812; https://doi.org/10.3390/agronomy12112812 - 11 Nov 2022
Cited by 10 | Viewed by 2197
Abstract
It is necessary to develop automatic picking technology to improve the efficiency of litchi picking, and the accurate segmentation of litchi branches is the key that allows robots to complete the picking task. To solve the problem of inaccurate segmentation of litchi branches under natural conditions, this paper proposes a segmentation method for litchi branches based on the improved DeepLabv3+, which replaced the backbone network of DeepLabv3+ and used the Dilated Residual Networks as the backbone network to enhance the model’s feature extraction capability. During the training process, a combination of Cross-Entropy loss and the dice coefficient loss was used as the loss function to cause the model to pay more attention to the litchi branch area, which could alleviate the negative impact of the imbalance between the litchi branches and the background. In addition, the Coordinate Attention module is added to the atrous spatial pyramid pooling, and the channel and location information of the multi-scale semantic features acquired by the network are simultaneously considered. The experimental results show that the model’s mean intersection over union and mean pixel accuracy are 90.28% and 94.95%, respectively, and the frames per second (FPS) is 19.83. Compared with the classical DeepLabv3+ network, the model’s mean intersection over union and mean pixel accuracy are improved by 13.57% and 15.78%, respectively. This method can accurately segment litchi branches, which provides powerful technical support to help litchi-picking robots find branches. Full article
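The combination of cross-entropy and dice losses mentioned here is a standard recipe for imbalanced segmentation: the dice term directly rewards overlap with the small branch class. The sketch below shows one common formulation; the 0.5/0.5 weighting and the class layout are assumptions for illustration, not the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CEDiceLoss(nn.Module):
    """Cross-entropy combined with dice loss. The dice term counteracts the
    foreground/background imbalance by directly rewarding overlap with each
    class. Weighting of the two terms is an illustrative assumption."""
    def __init__(self, ce_weight: float = 0.5, dice_weight: float = 0.5, eps: float = 1e-6):
        super().__init__()
        self.ce_weight, self.dice_weight, self.eps = ce_weight, dice_weight, eps

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # logits: (B, num_classes, H, W); target: (B, H, W) with class indices
        ce = F.cross_entropy(logits, target)
        probs = F.softmax(logits, dim=1)
        one_hot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(0, 2, 3))
        union = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
        dice = 1.0 - ((2 * inter + self.eps) / (union + self.eps)).mean()
        return self.ce_weight * ce + self.dice_weight * dice

if __name__ == "__main__":
    logits = torch.randn(2, 3, 64, 64)               # hypothetical classes: background/branch/fruit
    target = torch.randint(0, 3, (2, 64, 64))
    print(CEDiceLoss()(logits, target).item())
```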
(This article belongs to the Special Issue Precision Operation Technology and Intelligent Equipment in Farmland)
Show Figures

Figure 1

Figure 1
<p>A comparison of ResNet18 and DRN-C-26. Each rectangle in the figure represents a Conv-BN-ReLU combination. The number in the rectangle indicates the size of the convolution kernel and the number of output channels. H × W indicates the height and width of the feature map.</p>
Full article ">Figure 2
<p>A comparison of DRN-D-22 and DRN-C-26. The DRN is divided into eight stages, and each stage outputs identically-sized feature maps and uses the same dilation coefficient. Each rectangle in the figure represents a Conv-BN-ReLU combination. The number in the rectangle indicates the size of the convolution kernel and the number of output channels. H × W is the height and width of the feature map, and the green lines represent downsampling by a stride of two.</p>
Full article ">Figure 3
<p>The coordinate attention mechanism.</p>
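For readers unfamiliar with the mechanism in Figure 3, the following is a condensed PyTorch sketch of a Coordinate Attention block, which factorises channel attention into two direction-aware pooled descriptors along height and width; the reduction ratio and layer choices are assumptions, not the exact configuration used in the paper.

```python
# Condensed sketch of a Coordinate Attention block.
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        # Directional pooling: average over W gives a column descriptor,
        # average over H gives a row descriptor.
        x_h = x.mean(dim=3, keepdim=True)                        # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)    # (n, c, w, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                            # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))        # (n, c, 1, w)
        return x * a_h * a_w   # position-aware channel reweighting
```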
Full article ">Figure 4
<p>The improved DeepLabv3+ network structure.</p>
Full article ">Figure 5
<p>A comparison of the <span class="html-italic">mIoU</span> curves for transfer learning.</p>
Full article ">Figure 6
<p><span class="html-italic">mIoU</span> curves of the ablation experiment.</p>
Full article ">Figure 7
<p>A comparison of the network prediction effects.</p>
Full article ">
17 pages, 4355 KiB  
Article
Research on the Symbolic 3D Route Scene Expression Method Based on the Importance of Objects
by Fulin Han, Liang Huo, Tao Shen, Xiaoyong Zhang, Tianjia Zhang and Na Ma
Appl. Sci. 2022, 12(20), 10532; https://doi.org/10.3390/app122010532 - 19 Oct 2022
Viewed by 1555
Abstract
In the study of 3D route scene construction, the expression of key targets needs to be highlighted. This is because compared with the 3D model, the abstract 3D symbols can reflect the number and spatial distribution characteristics of entities more intuitively. Therefore, this [...] Read more.
In the construction of 3D route scenes, key targets need to be expressed prominently, because abstract 3D symbols reflect the number and spatial distribution of entities more intuitively than detailed 3D models. This research therefore proposes a symbolic 3D route scene representation method based on object importance. Taking an object importance evaluation model as its theoretical basis, the method calculates the spatial importance of objects of the same type from the spatial characteristics of the geographical objects in the 3D route scene and combines semantic factors to build the evaluation model. The 3D symbols are then designed hierarchically on the basis of the importance evaluation results and the CityGML standard. Finally, an LOD0–LOD4 symbolic 3D railway scene was constructed from railway data to realise the multi-scale expression of symbolic 3D route scenes. Compared with the conventional loading method, the real-time frame rate improved by 20 fps and was more stable, and scene loading was 5–10 s faster. The results show that the method effectively improves the efficiency of 3D route scene construction and the prominence of key objects in the scene. Full article
(This article belongs to the Special Issue State-of-the-Art Earth Sciences and Geography in China)
Show Figures

Figure 1

Figure 1
<p>A technological roadmap for the construction of symbolic 3D route scenes according to the importance of objects.</p>
Full article ">Figure 2
<p>Road junction and corner importance target: Assuming that the importance of each road <span class="html-italic">CI</span> is 1, the <span class="html-italic">CIT</span> quantitative calculations are as follows. (<b>a</b>) Four roads intersect at a point, and each road is spatially connected to the junction, so the junction <span class="html-italic">CIT</span> = 2 + 2 + 2 + 2 = 8. (<b>b</b>) Two roads intersect at a single point, and each road crosses the junction in spatial relationship; therefore, the junction <span class="html-italic">CIT</span> = 3.5 + 3.5 = 7. (<b>c</b>) Two roads intersect at a point, one road crosses the junction, and the other road joins the junction, so the junction <span class="html-italic">CIT</span> = 3.5 + 2 = 5.5. (<b>d</b>) A corner of the road itself, so that the corner <span class="html-italic">CIT</span> = 2.</p>
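The worked examples in the caption can be reproduced with a few lines of Python; the contribution weights (3.5 for a road crossing the junction, 2 for a road merely connecting to it) follow the caption, while the function and data layout are purely illustrative.

```python
# Sketch of the junction-importance (CIT) tally illustrated in Figure 2,
# assuming each road has CI = 1 and the per-road contributions from the caption.
CONTRIBUTION = {"crosses": 3.5, "connects": 2.0}

def junction_cit(road_relations):
    """road_relations: list of 'crosses' / 'connects' entries, one per road."""
    return sum(CONTRIBUTION[rel] for rel in road_relations)

# Worked examples from the caption:
assert junction_cit(["connects"] * 4) == 8           # (a) four roads joined at a point
assert junction_cit(["crosses"] * 2) == 7            # (b) two roads crossing
assert junction_cit(["crosses", "connects"]) == 5.5  # (c) one crossing, one joining
assert junction_cit(["connects"]) == 2               # (d) a corner of a single road
```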
Full article ">Figure 3
<p>Flowchart: 3D symbol classification and hierarchy.</p>
Full article ">Figure 4
<p>Natural classification of geographic entities for 3D route scenes. The classification combines real-world route scenes with the relevant specifications and covers the geographical entities that may appear when constructing 3D route scenes.</p>
Full article ">Figure 5
<p>The five LODs of CityGML: (<b>a</b>) LOD0: 2D symbol; (<b>b</b>) LOD1: simple geometry; (<b>c</b>) LOD2: simple geometry combinations; (<b>d</b>) LOD3: geometry with realistic textures; (<b>e</b>) LOD4: real internal structure.</p>
Full article ">Figure 6
<p>Range of experimental data. It contains one junction and three corners across two roads.</p>
Full article ">Figure 7
<p>Three-dimensional symbol modelling: (<b>a</b>) bridge model; (<b>b</b>) tunnel model; (<b>c</b>) signal machine model.</p>
Full article ">Figure 8
<p>Multi-scale representation of symbolic 3D railway scenes: (<b>a</b>) LOD0: Only geographic entities with object importance level 5 were loaded, containing stations in the route and tunnels at some junctions. The symbolic accuracy was at the lowest level of detail. (<b>b</b>) LOD1: Three-dimensional symbols of tunnels and some bridges are present in the scene. The accuracy of the 3D symbols at this level was improved compared to the previous level. (<b>c</b>) LOD2: In addition to the previously loaded stations, tunnels, and bridges, at this level of detail, some of the roadbeds out of junctions and turning points were loaded. The roadbeds were generally connected to the bridges. The 3D symbols at this level of detail had a clearer outline and colour. (<b>d</b>) LOD3: At this level, all the roadbeds are shown. The whole road was fully loaded, consisting of the tunnel, bridge, and roadbed stitched together. The 3D symbol structure at this scale was complete and well textured, especially the station model. (<b>e</b>) LOD4: At the highest level of detail scenes, all 3D symbols were at their highest accuracy. The 3D symbols of the signalling machines on the route were also loaded. These symbols were reproduced to the greatest extent possible for the real scene.</p>
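The loading pattern in Figure 8 suggests a simple importance-threshold rule per LOD; the sketch below illustrates that idea, with the exact threshold values assumed for illustration rather than taken from the paper.

```python
# Hedged sketch of multi-scale loading by object importance: LOD0 keeps only the
# most important entities (level 5) and each finer LOD admits lower importance
# levels until LOD4 loads everything. Thresholds are illustrative assumptions.
LOD_MIN_IMPORTANCE = {0: 5, 1: 4, 2: 3, 3: 2, 4: 1}

def entities_for_lod(entities, lod):
    """entities: iterable of (name, importance_level) pairs."""
    threshold = LOD_MIN_IMPORTANCE[lod]
    return [name for name, level in entities if level >= threshold]

scene = [("station", 5), ("tunnel", 5), ("bridge", 4),
         ("roadbed", 3), ("signal machine", 1)]
print(entities_for_lod(scene, 0))  # ['station', 'tunnel']
print(entities_for_lod(scene, 4))  # all five entities
```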
Full article ">Figure 9
<p>Quantitative comparative analysis: (<b>a</b>) The symbolic 3D railway scene constructed with this method maintained a higher frame rate after loading, improving scene fluency and the user experience. (<b>b</b>) Under the method’s multi-scale constraints, geographic entities with key features loaded faster, and the advantage became more pronounced as the LOD level increased, improving the user’s sense of depth in the scene.</p>
Full article ">
18 pages, 4630 KiB  
Article
A Method for Obtaining the Number of Maize Seedlings Based on the Improved YOLOv4 Lightweight Neural Network
by Jiaxin Gao, Feng Tan, Jiapeng Cui and Bo Ma
Agriculture 2022, 12(10), 1679; https://doi.org/10.3390/agriculture12101679 - 12 Oct 2022
Cited by 10 | Viewed by 1851
Abstract
Obtaining the number of plants is the key to evaluating the effect of maize mechanical sowing, and is also a reference for subsequent statistics on the number of missing seedlings. When the existing model is used for plant number detection, the recognition accuracy [...] Read more.
Obtaining the number of plants is key to evaluating the effect of mechanical maize sowing and also serves as a reference for subsequent statistics on missing seedlings. When existing models are used for plant number detection, recognition accuracy is low, the number of model parameters is large, and the single recognition area is small. This study proposes a method for detecting the number of maize seedlings based on an improved You Only Look Once version 4 (YOLOv4) lightweight neural network. First, the method uses an improved Ghostnet as the feature extraction network and successively introduces an attention mechanism and a k-means clustering algorithm into the model, thereby improving the detection accuracy for the number of maize seedlings. Second, depthwise separable convolutions are used instead of ordinary convolutions to make the network more lightweight. Finally, the multi-scale feature fusion network structure is improved to further reduce the total number of model parameters, and the model is pre-trained with transfer learning to obtain the optimal model for prediction on the test set. The experimental results show that the harmonic mean, recall rate, average precision and accuracy rate of the model on all test sets are 0.95, 94.02%, 97.03% and 96.25%, respectively; the model has 18.793 M parameters, a size of 71.690 MB, and runs at 22.92 frames per second (FPS). The results show that the model has high recognition accuracy, fast recognition speed, and low model complexity, and can provide technical support for maize management at the seedling stage. Full article
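The abstract mentions a k-means step without detailing its target; in YOLO-style detectors this is commonly k-means clustering of ground-truth box sizes with a 1 − IoU distance to generate anchors, and the following NumPy sketch illustrates that assumption.

```python
# Hedged sketch of k-means anchor clustering on box widths/heights (1 - IoU distance).
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """wh: (N, 2) array of ground-truth box widths and heights, N >= k."""
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), size=k, replace=False)].copy()
    for _ in range(iters):
        # IoU between every box and every anchor, assuming aligned corners.
        inter = (np.minimum(wh[:, None, 0], anchors[None, :, 0])
                 * np.minimum(wh[:, None, 1], anchors[None, :, 1]))
        union = wh.prod(axis=1)[:, None] + anchors.prod(axis=1)[None, :] - inter
        assign = np.argmax(inter / union, axis=1)   # closest anchor by 1 - IoU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = np.median(wh[assign == j], axis=0)
    return anchors[np.argsort(anchors.prod(axis=1))]  # sorted by area
```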
(This article belongs to the Section Digital Agriculture)
Show Figures

Figure 1

Figure 1
<p>Study area: (<b>a</b>) experimental-area location; (<b>b</b>) splicing diagram of test field.</p>
Full article ">Figure 2
<p>Structure diagram of ghost bottlenecks.</p>
Full article ">Figure 3
<p>Improved ghost module.</p>
Full article ">Figure 4
<p>Depthwise convolution.</p>
Full article ">Figure 5
<p>Pointwise convolution.</p>
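Figures 4 and 5 together describe a depthwise separable convolution; a minimal PyTorch sketch is given below, with the BatchNorm and ReLU6 choices being assumptions rather than the paper's exact layers.

```python
# Minimal sketch of a depthwise separable convolution: a per-channel (depthwise)
# 3x3 convolution followed by a 1x1 pointwise convolution that mixes channels.
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))     # spatial filtering per channel
        return self.act(self.bn2(self.pointwise(x)))  # cross-channel mixing
```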
Full article ">Figure 6
<p>Structure of CBAM.</p>
Full article ">Figure 7
<p>Improved multi-scale feature fusion network structure.</p>
Full article ">Figure 8
<p>Flow chart of seedling number detection.</p>
Full article ">Figure 9
<p>The curve of the loss value changing with the number of iterations.</p>
Full article ">Figure 10
<p>Detection results of different models: (<b>a</b>) YOLOv4 test results; (<b>b</b>) improved YOLOv4 lightweight network test results; (<b>c</b>) Mobilenetv1-YOLOv4 test results; (<b>d</b>) Mobilenetv3-YOLOv4 test results; (<b>e</b>) Densenet121-YOLOv4 test results; (<b>f</b>) Vgg-YOLOv4 test results.</p>
Full article ">
16 pages, 2701 KiB  
Article
Muti-Frame Point Cloud Feature Fusion Based on Attention Mechanisms for 3D Object Detection
by Zhenyu Zhai, Qiantong Wang, Zongxu Pan, Zhentong Gao and Wenlong Hu
Sensors 2022, 22(19), 7473; https://doi.org/10.3390/s22197473 - 2 Oct 2022
Cited by 9 | Viewed by 3421
Abstract
Continuous frames of point-cloud-based object detection is a new research direction. Currently, most research studies fuse multi-frame point clouds using concatenation-based methods. The method aligns different frames by using information on GPS, IMU, etc. However, this fusion method can only align static objects [...] Read more.
Object detection from continuous frames of point clouds is a new research direction. Currently, most studies fuse multi-frame point clouds with concatenation-based methods, which align different frames using GPS, IMU, and similar information. However, this kind of fusion can only align static objects, not moving ones. In this paper, we propose a non-local-based multi-scale feature fusion method that can handle both moving and static objects without GPS- and IMU-based registration. Considering that non-local methods are resource-consuming, we propose a novel simplified non-local block that exploits the sparsity of the point cloud: by filtering out empty units, memory consumption decreases by 99.93%. In addition, triple attention is adopted to enhance the key information on the object and suppress background noise, further benefiting the non-local feature fusion. Finally, we verify the method on PointPillars and CenterPoint. Experimental results show that the proposed method improves mAP by 3.9% and 4.1% compared with the concatenation-based fusion baselines PointPillars-2 and CenterPoint-2, respectively. In addition, the proposed network outperforms the powerful 3D-VID by 1.2% in mAP. Full article
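To illustrate the "filter out empty units" idea behind the simplified non-local block, the following PyTorch sketch computes attention only over the non-empty cells of a single pillar pseudo-image and scatters the result back; the projection sizes, emptiness test, and single-sample layout are assumptions, not the paper's implementation.

```python
# Hedged sketch of a sparsity-aware (index) non-local block over non-empty BEV cells.
import torch
import torch.nn as nn

class SparseNonLocal(nn.Module):
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        self.theta = nn.Linear(channels, reduced)   # query projection
        self.phi = nn.Linear(channels, reduced)     # key projection
        self.g = nn.Linear(channels, reduced)       # value projection
        self.out = nn.Linear(reduced, channels)

    def forward(self, feat):
        """feat: (C, H, W) pillar pseudo-image for a single sample."""
        c, h, w = feat.shape
        flat = feat.reshape(c, h * w).t()           # (H*W, C)
        mask = flat.abs().sum(dim=1) > 0            # keep non-empty grid cells only
        x = flat[mask]                              # (M, C), M << H*W in sparse scenes
        q, k, v = self.theta(x), self.phi(x), self.g(x)
        attn = torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)  # (M, M)
        updated = flat.clone()
        updated[mask] = x + self.out(attn @ v)      # residual update of non-empty cells
        return updated.t().reshape(c, h, w)
```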
(This article belongs to the Special Issue Artificial Intelligence and Smart Sensors for Autonomous Driving)
Show Figures

Figure 1

Figure 1
<p>Multiple frames are concatenated into one frame by registration. The black dashed box marks the area where motion blur occurs.</p>
Full article ">Figure 2
<p>The overall framework of our proposed multi-frame fusion method.</p>
Full article ">Figure 3
<p>The overall process of grid-based point cloud encoder.</p>
Full article ">Figure 4
<p>Feature extraction and fusion network. The 0th layer is the pseudo-image, which is generated by the point cloud encoder.</p>
Full article ">Figure 5
<p>Non-local module. The blue symbols represent 1 × 1 convolutions, the orange symbols represent matrix multiplication, and the green symbols represent element-wise addition.</p>
Full article ">Figure 6
<p>The correlation matrix calculation of the index-nonlocal module. In the feature map and similarity calculation stages, colored grid cells represent non-empty units, and the color classes indicate the similarity of feature points. In the correlation matrix, the gray level represents the relevance among feature points.</p>
Full article ">Figure 7
<p>The position in which triple attention is applied.</p>
Full article ">Figure 8
<p>The relationship between keyframes and intermediate frames. The red line represents the point cloud frames used in this study.</p>
Full article ">Figure 9
<p>Comparison between MFFFNet and CenterPoint-2. Lines (<b>a</b>,<b>b</b>) show two different scenes. The first column is the ground truth; the second and third columns are the detection results of CenterPoint-2 and MFFFNet, respectively. The green boxes represent the ground truth and the red boxes indicate the detection results. The black dashed boxes mark the areas that need to be focused on. Blue and orange circles indicate false positive and false negative results.</p>
Full article ">Figure 10
<p>Comparison between MFFFNet and PointPillars-2. Lines (<b>a</b>,<b>b</b>) show two different scenes. The first column is the ground truth; the second and third columns are the detection results of PointPillars-2 and MFFFNet, respectively. The green boxes represent the ground truth and the red boxes indicate the detection results. The black dashed boxes mark the areas that need to be focused on. Blue and orange circles indicate false positive and false negative results.</p>
Full article ">