1. Introduction
Pedestrian detection has always been a prominent research direction in target detection. In densely populated public places such as railway stations and airports, accurate pedestrian detection is crucial for ensuring public safety. It can promptly detect potential safety hazards and provide an important basis for the reasonable allocation of traffic flow and corresponding security measures. The core task of pedestrian detection is to identify all the pedestrians in an image or video frame, regardless of their location and size, with the target annotation generally being a rectangular box. In addition to pedestrian detection, target detection also covers typical problems such as face detection, vehicle detection, and remote sensing detection.
Pedestrian detection technology has significant application value. It can be combined with pedestrian tracking [1], pedestrian re-identification [2], and other technologies, and applied to unmanned systems [3], intelligent transportation [4], intelligent robotics [5], intelligent video surveillance [6], human behavior analysis [7], and other fields. In particular, dense pedestrian detection is crucial for large public places such as railway stations and airports, where the flow of people in dense areas directly affects the distribution of traffic flow and the corresponding security measures.
Pedestrian detection faces several challenges due to the diversity of human postures and the vast differences in appearance at different angles, lighting conditions, and levels of occlusion. For example, occlusion between pedestrians can lead to difficulties in feature extraction, making it challenging for the model to accurately identify the location and number of pedestrians.
Based on the principles of algorithm implementation, pedestrian detection algorithms can be classified into two types: stationary detection-based algorithms and deep learning-based algorithms. Stationary detection-based algorithms assume that the camera is stationary. They utilize background modeling algorithms to extract moving foreground targets. A classifier is then used to classify these moving targets and determine whether they contain pedestrians. Classical foreground modeling algorithms include the Gaussian mixture model, the ViBe algorithm, the frame difference algorithm, and the sample consistency algorithm.
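To illustrate the foreground-modeling idea, the frame difference method can be sketched in a few lines of Python. This is a minimal, self-contained sketch: the tiny grayscale frames and the threshold value are illustrative, not from the paper, and a practical system would typically rely on a library implementation (e.g., OpenCV's background subtractors) instead.

```python
# Minimal sketch of the frame difference method for foreground extraction.
# Frames are grayscale images given as 2-D lists of 0-255 intensities;
# the threshold of 30 is an illustrative choice, not a value from the paper.

def frame_difference(prev_frame, curr_frame, threshold=30):
    """Return a binary mask: 1 where pixel change exceeds the threshold."""
    mask = []
    for prev_row, curr_row in zip(prev_frame, curr_frame):
        mask.append([1 if abs(c - p) > threshold else 0
                     for p, c in zip(prev_row, curr_row)])
    return mask

# Toy 3x3 frames: one pixel brightens sharply (a "moving" region).
prev = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
curr = [[10, 10, 10], [10, 200, 10], [10, 10, 10]]
print(frame_difference(prev, curr))  # → [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
```

The resulting mask would then be post-processed (e.g., morphological filtering) and each connected foreground region passed to a classifier to decide whether it contains a pedestrian.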
The development of target detection can be divided into two cycles. The first cycle is based on traditional hand-crafted feature extraction. Despite a long period of development, traditional target detection algorithms have not significantly improved in recognition effectiveness; they also require substantial computational resources and have gradually faded from the forefront of target detection research.
The second cycle is based on deep learning target detection algorithms, which are primarily categorized into two-stage and one-stage approaches. In the two-stage approach, candidate box regions likely to contain the target are first generated. Then, feature extraction is performed on these candidate regions, followed by sample classification using convolutional neural networks (CNNs). Common two-stage methods include R-CNN [8], which pioneered the application of deep learning to image recognition tasks, and its derivatives such as Faster R-CNN [9] and Fast R-CNN [10]. These algorithms offer advantages in detection accuracy but suffer from long training times and slow inference speeds. On the other hand, one-stage approaches directly treat target detection as a regression task across the entire image without generating candidate boxes. Representative one-stage algorithms include the YOLO series and the single-shot multi-box detector (SSD) [11]. One-stage algorithms reduce training time and accelerate inference speed but may sacrifice some accuracy.
Although target detection techniques have made significant progress through traditional and deep learning methods, there is still relatively little research on dense pedestrian detection in the presence of substantial occlusion. The accuracy and speed of dense pedestrian detection still require further improvement, leaving ample room for enhancement in multi-scene and dense-scene pedestrian detection. Deep learning methods have achieved certain results in pedestrian detection, yet they face significant challenges in dense pedestrian scenarios. The high data complexity of dense pedestrian scenes and the mutual occlusion between pedestrians make feature extraction exceptionally challenging. Datasets such as CrowdHuman, released by Megvii Research, and the WiderPerson dataset, curated by the Center for Biometrics and Security Research at the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, provide multi-scene images with varying degrees of pedestrian density. These datasets play a crucial role in advancing research on dense pedestrian detection. The WiderPerson dataset, in particular, extends beyond traffic scenes to encompass a wide range of scenarios, marking a significant step towards pedestrian detection technology for dense environments. This advancement holds substantial practical significance for fields such as autonomous driving and security monitoring. Moreover, issues such as multi-scale variation in images and high false positive rates also severely impact the accuracy and reliability of detection.
To address issues such as low detection efficiency, difficulty in feature extraction, multi-scale variation in images, and high false detection rates caused by pedestrian occlusion or dense crowd flow in crowded areas, this paper proposes a new detection model, GR-YOLO (Gold-Repc3 YOLO), based on YOLOv8n. The specific contributions of this paper are outlined in the following four aspects:
(1) To tackle the dense pedestrian detection challenges arising from pedestrian occlusion, where backgrounds are often misclassified as pedestrians, the Repc3 module is introduced to optimize the backbone network of YOLOv8. The Repc3 module enhances feature integration and information enhancement through convolution operations, thereby improving the model’s feature expression capability and addressing issues with feature extraction due to pedestrian occlusion.
(2) The paper introduces an aggregation–distribution mechanism that enhances the model’s perceptual discrimination through multi-scale feature fusion. This mechanism efficiently exchanges information in YOLOv8 by fusing features across multiple layers and injecting global information into higher layers. This significantly boosts the fusion capability of the network’s neck architecture, thereby improving the model’s final detection performance and addressing insufficient feature fusion for multi-scale image variations.
(3) In scenarios involving frequent occlusion of target objects, model convergence may be slower, and missed detections may occur. To mitigate these issues, the paper employs the GIoU loss function, which aids model convergence and is particularly sensitive to overlap between boxes. This approach enhances the model’s accuracy in predicting target locations and reduces missed detections when dealing with dense data.
(4) The effectiveness of the proposed algorithm is validated through comprehensive multi-group ablation experiments and comparisons with mainstream target detection algorithms. These experiments analyze the final results and demonstrate significant advancements in the field of dense pedestrian detection.
These contributions collectively propel advancements in dense pedestrian detection, addressing critical challenges in crowded environments and scenarios with mutual occlusion.
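To make the GIoU idea behind contribution (3) concrete, the generalized IoU between two axis-aligned boxes can be computed as below. This is a standalone sketch of the standard GIoU definition, not the paper's training code; boxes are given as (x1, y1, x2, y2) corner coordinates, and in training the loss is typically taken as 1 − GIoU.

```python
def giou(box_a, box_b):
    """Generalized IoU between two (x1, y1, x2, y2) boxes.

    GIoU = IoU - (enclosing-box area not covered by the union)
                 / (enclosing-box area).
    Unlike plain IoU, it stays informative (negative, not zero)
    for non-overlapping boxes, which helps convergence under occlusion.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection area.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest axis-aligned box enclosing both inputs.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    enclose = cw * ch
    return iou - (enclose - union) / enclose

print(giou((0, 0, 2, 2), (1, 1, 3, 3)))  # partially overlapping boxes
print(giou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes: negative GIoU
```

For identical boxes GIoU equals 1, and as two boxes move apart it decreases towards −1, so the gradient of 1 − GIoU still pulls a predicted box towards a non-overlapping ground-truth box.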
4. Experiments
4.1. Experimental Environment
During the experiments, we trained the model several times and recorded its performance metrics under different numbers of training epochs, as shown in
Figure 11, including precision, recall, F1 score, mean average precision, and so on. By analyzing these metrics, we found that the model’s performance gradually stabilized and reached its optimal state within 70–80 epochs. Specifically, the performance metrics continued to improve up to 70 epochs, while after 80 epochs they began to plateau. Therefore, instead of adopting the official YOLOv8 recommendation of 300 epochs, we set the number of epochs to 100. To further optimize performance, the hyperparameters were selected and adjusted based on the characteristics of the dataset and the experimental results. Experimental validation showed that the SGD optimizer outperformed the other optimizers tested. The learning rate was set to 0.01, and the momentum was set to 0.937. The batch size was set to 16 to fit the GPU memory.
YOLOv10 was trained on the PyTorch 1.10.1 framework with Python 3.9, and Python 3.8 was used to train the other models presented in this paper. All experiments were executed on NVIDIA GeForce RTX 3090 GPUs (NVIDIA, Santa Clara, CA, USA) with 24,268 MB of memory. These adjustments and optimizations ensure model performance while improving training efficiency and making the best use of hardware resources. The specific experimental parameter configuration is shown in
Table 1.
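The training setup described above can be summarized as a configuration sketch. The hyperparameter values are those reported in this section; the commented-out Ultralytics call is illustrative only, and the dataset YAML path is a placeholder, not a file from the paper.

```python
# Hyperparameters reported in Section 4.1 (values from the paper).
train_config = {
    "epochs": 100,        # instead of the official 300-epoch recommendation
    "optimizer": "SGD",   # outperformed other optimizers in validation
    "lr0": 0.01,          # initial learning rate
    "momentum": 0.937,
    "batch": 16,          # sized to fit the RTX 3090's GPU memory
}

# Illustrative Ultralytics training call (requires the `ultralytics`
# package; "pedestrians.yaml" is a placeholder dataset path):
# from ultralytics import YOLO
# model = YOLO("yolov8n.pt")
# model.train(data="pedestrians.yaml", **train_config)
```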
4.2. Model Analysis Index
In order to evaluate model performance more accurately, a full range of model evaluations was performed using multiple metrics, including precision, recall, F1 score, and mean average precision.
In pedestrian detection scenarios, true positive (TP) refers to pedestrians correctly identified, false positive (FP) denotes non-pedestrians incorrectly classified as pedestrians, and false negative (FN) signifies actual pedestrians that were not correctly detected.
Precision, also known as positive predictive value, refers to the proportion of correctly predicted positive observations to the total predicted positives. It is defined as follows:
Precision = TP / (TP + FP)
Recall, also known as sensitivity or true positive rate, measures the proportion of actual positives that the model correctly identifies. It is defined as follows:
Recall = TP / (TP + FN)
The F1 score is the harmonic mean of precision and recall, serving as a metric in statistical mathematics to evaluate the accuracy of binary or multi-class models. Ranging from 0 to 1, a value closer to 1 indicates a better balance between precision and recall, and vice versa. The F1 score is defined as follows:
F1 = 2 × Precision × Recall / (Precision + Recall)
Mean average precision, commonly referred to as mAP, is the mean of the average precision (AP) over all categories. It represents a combination of precision and recall: for a given category, the AP is the area under its precision–recall curve. mAP50 denotes the mAP at an intersection over union (IoU) threshold of 0.5. It is defined as follows:
mAP = (1/N) × Σᵢ APᵢ, where N is the number of categories.
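The metric definitions above can be checked with a small helper. The detection counts below are made-up illustrations, not results from the paper.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from detection counts.

    precision = TP / (TP + FP); recall = TP / (TP + FN);
    F1 is the harmonic mean of precision and recall.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 80 pedestrians found, 20 false alarms, 20 missed.
p, r, f1 = detection_metrics(tp=80, fp=20, fn=20)
print(p, r, f1)
```

With equal precision and recall the F1 score coincides with both, as the harmonic mean of equal values; when they diverge, F1 is pulled towards the smaller of the two.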
4.3. Ablation Experiment
In the ablation experiments of this study, we chose the WiderPerson dataset for training and validation, because it covers dense pedestrians in a wide range of scenarios and can better reflect the model’s performance when dealing with pedestrian occlusion and dense scenes. In addition, this dataset has been widely used for evaluating dense pedestrian detection models in previous studies and has high representative and reference value.
To test the usefulness of the improved GR-YOLO for dense pedestrian detection, YOLOv8, YOLOv8 + GIoU, YOLOv8 + Repc3, YOLOv8 + Gold, YOLOv8 + Repc3 + Gold, and GR-YOLO were trained on the WiderPerson dataset, setting up ablation experiments with different combinations of improvements to explore the effects of each module on the final model.
Unless otherwise stated, the same experimental environment was used for each group. GR-YOLO is the improved model proposed in this paper. YOLOv8 + Repc3 optimizes the backbone network with the Repc3 module on top of YOLOv8. YOLOv8 + GIoU replaces the bounding-box regression loss with the GIoU loss on top of YOLOv8. YOLOv8 + Gold adds the Gold module on top of YOLOv8. YOLOv8 + Repc3 + Gold combines the Repc3 and Gold modules on top of YOLOv8. All improvements are compared against the YOLOv8 baseline. The results of the ablation experiments are shown in
Table 2.
mAP50 represents the mean average precision over all images of the pedestrian class when the IoU threshold is set to 0.5, and mAP50–95 represents the mean average precision computed at IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05.
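The mAP50–95 convention can be sketched as averaging per-threshold AP values over the ten IoU thresholds 0.50, 0.55, …, 0.95. The AP curve below is purely illustrative, not data from the paper.

```python
def map50_95(ap_at_threshold):
    """Average AP over IoU thresholds 0.50:0.05:0.95 (ten thresholds).

    `ap_at_threshold` maps an IoU threshold to the AP measured there.
    """
    thresholds = [0.5 + 0.05 * i for i in range(10)]
    return sum(ap_at_threshold(t) for t in thresholds) / len(thresholds)

# Hypothetical detector whose AP falls off linearly as the IoU
# threshold tightens (illustrative numbers only).
print(map50_95(lambda t: 0.9 - 0.5 * (t - 0.5)))
```

Because tighter IoU thresholds demand more precise localization, mAP50–95 is always at most mAP50 and rewards models whose boxes fit the targets closely.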
The experimental results are shown in
Table 2. Comparing these results, it can be seen that the GR-YOLO model proposed in this paper performs well on the dense pedestrian WiderPerson dataset: mAP50 reaches 88.1%, 3.2% higher than YOLOv8, and mAP50–95 reaches 60%, 4% higher than YOLOv8. In addition, precision improves by 1.9% over YOLOv8, and recall improves markedly, by 4%. These results show that GR-YOLO has significant advantages in reducing the missed-detection rate and improving detection performance.
Improvements from the individual modules were also observed in dense pedestrian detection: YOLOv8 + GIoU reached 85% mAP50, 0.1% above YOLOv8, with mAP50–95 up 0.1%; YOLOv8 + Repc3 raised mAP50 to 85.2%, up 0.3%, with mAP50–95 up 0.2%; YOLOv8 + Gold reached 87.9% mAP50, up 3%, with mAP50–95 up 3.8%; and YOLOv8 + Repc3 + Gold reached 88.0% mAP50, up 3.1%, with mAP50–95 up 4%. All these experimental results show that the improvements proposed in this paper to enhance YOLOv8’s performance in dense pedestrian detection are fruitful.
Based on the results presented in the tables and figures, the GR-YOLO model proposed in this paper improves both precision and recall, leading to a significant improvement in overall detection performance over YOLOv8. More importantly, each of the modules that improve the YOLOv8 model is essential, and their interactions enable GR-YOLO to maximize its benefits in dense pedestrian detection. This also illustrates that optimizing model performance does not rely on a single module but requires synergy and interplay between the individual modules. With this synergy, the GR-YOLO model shows superior performance in all situations, especially when dealing with dense pedestrian detection.
Figure 12 shows the detection results of the original YOLOv8 model and our improved model, comparing results on the WiderPerson, CrowdHuman, and People Detection datasets. From the comparison, it can be seen that the original YOLOv8 model misses detections in all three datasets, while our proposed model better avoids missed detections. The three images were randomly selected from the three datasets, which further demonstrates the generalization ability and robustness of our model. The results show that the adopted Repc3, Gold, and GIoU modules can effectively enhance the model’s ability to detect dense pedestrians.
4.4. Comparative Experiment
Experimental validation was carried out using YOLOv5, YOLOv8, YOLOv9, YOLOv10, and the improved GR-YOLO on three datasets: WiderPerson, CrowdHuman, and People Detection. The following values are the mAP50 results on the validation sets. The results of the comparative experiments are shown in
Table 3.
From the data in
Table 3, it can be seen that the GR-YOLO proposed in this paper improves on all datasets, and the improvement is especially significant on the denser CrowdHuman dataset: GR-YOLO improved by 10.1% compared to YOLOv5, 7.2% compared to YOLOv8, and 4.6% compared to YOLOv10.
On the generally dense WiderPerson dataset, GR-YOLO improved by 3.7% compared to YOLOv5, 3.2% compared to YOLOv8, 0.1% compared to YOLOv9, and 4.2% compared to YOLOv10.
On the multi-scene, unevenly dense People Detection dataset, GR-YOLO showed the most significant improvement: 9.6% compared to YOLOv5, 2.2% compared to YOLOv9, 11.5% compared to YOLOv10, and, most importantly, 11.7% compared to YOLOv8. This result reinforces the importance of the improvements made to YOLOv8 in this paper; the performance gain of GR-YOLO is particularly pronounced when dealing with multi-scene, unevenly dense situations.
YOLOv5 benefits from mosaic data augmentation and its focus structure, performing well on all three datasets, but there is still room for improvement in the dense pedestrian detection task. By introducing the C2f module and adjusting the number of channels, YOLOv8 further improves detection performance over YOLOv5, performing better especially in complex scenes.
The YOLOv9 model is large in scale, and its larger number of layers enables it to learn more complex feature representations, which improves accuracy to some extent. However, the large scale of the model also brings problems, such as high computational resource requirements and long training times. Although YOLOv10 is a new-generation model, it may be affected by factors such as pedestrians occluding each other in dense pedestrian detection tasks, making it difficult to accurately detect occluded individuals and much less effective in dense scenes. In contrast, the GR-YOLO model, while remaining relatively small, effectively improves feature extraction, multi-scale fusion, and target-location prediction by introducing the Repc3 module to optimize the backbone network, adopting the aggregation–distribution mechanism to reconfigure the neck structure, and using the GIoU loss, thus demonstrating higher detection accuracy and a lower missed-detection rate.
The GR-YOLO proposed in this paper significantly improves performance in the dense pedestrian detection task. The WiderPerson, CrowdHuman, and People Detection Image datasets together cover a wide variety of scenarios, crowd densities, and occlusion situations. Experiments on these datasets show that the GR-YOLO model maintains good detection performance in different scenarios, demonstrating strong adaptability and robustness.
Through comparative experiments, we compared the GR-YOLO model with other state-of-the-art target detection models, such as YOLOv5, YOLOv8, YOLOv9, and YOLOv10. On different datasets, the GR-YOLO model outperformed the other models, especially in the dense pedestrian detection task, which fully demonstrates its robustness and superiority.
In addition, the ablation results of YOLOv8, YOLOv8 + Repc3, YOLOv8 + GIoU, YOLOv8 + Gold, YOLOv8 + Repc3 + Gold, and GR-YOLO on the WiderPerson dataset show that each improved module contributed positively to model performance, and their interactions enabled GR-YOLO to achieve higher detection accuracy and a lower missed-detection rate in dense pedestrian detection. This suggests that the various parts of the model work in concert to steadily improve its performance.
In summary, these results show that the improved GR-yolo has higher detection accuracy and robustness in dense pedestrian detection tasks, providing strong support for future research and applications.
5. Conclusions
Our work proposes a target detection model framework named GR-YOLO dedicated to dense pedestrian detection. We demonstrate the superiority of GR-YOLO by performing experimental validation on three datasets: the large dense CrowdHuman dataset, the large outdoor dense-crowd WiderPerson dataset, and the People Detection Images dataset officially released by Ultralytics on Roboflow. The experimental results show that GR-YOLO significantly outperforms the other comparison models on these datasets, verifying its effectiveness in the dense pedestrian detection task. Although the ablation experiments mainly focus on the WiderPerson dataset, we recognize the importance of conducting ablation experiments on other datasets. In future work, we will expand the scope of the ablation experiments and analyze the CrowdHuman and People Detection datasets in detail in order to more comprehensively evaluate the model’s performance in different scenarios and the roles of the various modules. We will also try to improve the design based on these datasets and the base model, with the goals of simplifying the model structure and improving inference speed. The following are some of the planned directions:
Model pruning techniques: We plan to employ state-of-the-art pruning techniques to reduce the number of parameters and computational load of the model, thereby accelerating inference speed while striving to maintain or enhance the model’s detection performance.
Innovative convolutional operations: We aim to incorporate cutting-edge convolutional operations, such as dynamic and deformable convolutions, to further improve the model’s feature extraction capabilities and detection accuracy.
Multi-task learning: By integrating other related tasks, such as pose estimation and action recognition, we seek to increase the model’s versatility and practicality.
With these improvements and optimizations, we anticipate that GR-YOLO will achieve even more outstanding performance in the field of dense pedestrian detection and provide stronger support for practical applications.