
1 Introduction

Traffic light recognition (TLR) is an important task in advanced driver-assistance systems (ADAS) and autonomous driving systems. Real application scenarios are complicated because traffic lights appear at varying distances and under varying illumination and distortion conditions. In addition, real-time recognition is important to ensure the system’s reliability. There are two main approaches to traffic light recognition: methods based on traditional features and methods based on deep learning.

Related work based on traditional methods suffers from two main drawbacks: poor adaptation and heavy time consumption. For instance, in [6], color segmentation was applied to detect traffic lights. However, the lights’ color varies under different illumination conditions, which degrades its performance. In [8], the model used morphological features to detect urban traffic lights. However, the complicated background in urban areas made it difficult to extract these features. Time consumption is also a problem for traditional methods. For instance, [5] and [6] applied traditional feature extraction such as HOG, which was time-consuming and made real-time performance difficult to achieve. There are also deep-learning methods. For example, in [3], YOLO was applied to detect traffic lights, and in [4], a network named PCANet was introduced. However, the structures of these networks are relatively complex, which limits their use in real-time applications. To conclude, while deep learning methods handle complicated scenarios better than traditional methods, the structure of the deep learning model should be simple to achieve real-time performance.

Fig. 1. Pipeline of the proposed deep-learning-based traffic light recognition system (DeTLR). It includes four steps: skip sampling, traffic light detector (TLD), preprocessing, and traffic light classifier (TLC)

Therefore, in this paper, we propose a deep-learning-based traffic light recognition (DeTLR) model with four main modules (see Fig. 1): a skip sampling system, a traffic light detector (TLD), preprocessing, and a traffic light classifier (TLC). The TLD module applies MobileNetV2-SSDLite, and the TLC is based on a self-designed small convolutional neural network (CNN) structure named smallNet. Both MobileNetV2 and SSDLite are lightweight, adopting techniques such as depthwise convolution, which ensures the real-time running speed of our TLD module. In addition, our smallNet-based TLC is efficient, taking only 1.6 ms on the RK3399 platform. Trained on a large dataset, our DeTLR system can deal with various application scenarios and generalizes well. Furthermore, to ensure real-time recognition on slower devices, we devise a skip sampling technique, which compensates for the time delay in the final decision-making process. To verify the model’s performance, we conduct several experiments on its running speed and generalization.

Fig. 2. A demonstration of the skip sampling system.

2 Method

2.1 Skip Sampling Module

Boosting the TLR’s efficiency is crucial for the ADAS’s reliability and real-time recognition. Some systems are limited by the computing power of their devices and need a practical method to compensate for the delay caused by the hardware. Therefore, we propose a skip sampling algorithm in the system’s input module (see Fig. 2).

Generally speaking, the system’s real-time recognition and reliability performance are related to two factors: the vehicle’s speed and its distance to the traffic light. For instance, when a red light is detected and classified, the current distance should be longer than the car’s braking distance (see Fig. 2). Some variables used in the analysis are shown in Table 1.

Table 1. Variables table of establishing real-time and reliability criterion

According to this description, we can establish a criterion for judging the system’s real-time recognition and reliability. Once a red light appears, if \(s_{initial}\) is large enough for the car to brake within the distance \(s_{brake}\), then the system is reliable and safe. From a practical perspective, we can establish more specific relations among these variables. Let the initial velocity be \(v_0\). From basic physics, the relationship between braking distance and velocity is as follows:

$$\begin{aligned} s_{brake}= \frac{v_{0}^{2}}{2g\mu } \end{aligned}$$
(1)

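During \(t_{algorithm}\) and \(t_{react}\), the vehicle still travels at \(v_0\) before braking starts. We can therefore obtain the criterion for real-time detection and reliability as follows: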

$$\begin{aligned} s_{initial} > v_0\times (t_{algorithm}+t_{react})+\frac{v_{0}^{2}}{2g\mu } \end{aligned}$$
(2)

In practice, however, the brakes are not fully locked during normal driving. Therefore, we add a modifying coefficient \(\eta \) to Eq. (2). Since our system can ensure safety during normal driving, emergency situations are sufficiently under control. Thus, Eq. (2) can be modified into Eq. (3):

$$\begin{aligned} s_{initial} > v_0\times (t_{algorithm}+t_{react})+{\eta }\times \frac{v_{0}^{2}}{2g\mu } \end{aligned}$$
(3)

Among these variables, only \(s_{initial}\) and \(t_{algorithm}\) are closely related to our TLR system’s performance; the rest depend on external conditions or the vehicle itself. With better detection performance, the vehicle can obtain a larger \(s_{initial}\) and a shorter \(t_{algorithm}\). Single-frame depth estimation can be applied to work out \(s_{initial}\), and we can measure \(t_{algorithm}\) on different devices (both CPU and GPU).
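As an illustration, the criterion in Eq. (3) can be evaluated with a short Python sketch. The function names and the default value g = 9.8 m/s² are assumptions made for this example; they are not taken from the original system. Setting eta to 1 recovers Eq. (2).

```python
def min_safe_distance(v0, t_algorithm, t_react, mu, eta=1.0, g=9.8):
    """Right-hand side of Eq. (3), in meters.

    v0: initial speed in m/s; t_algorithm, t_react: times in seconds;
    mu: road friction coefficient; eta: modifying coefficient;
    g: gravitational acceleration in m/s^2 (assumed value, not from the paper).
    """
    # Distance covered at constant speed before braking starts.
    travel_distance = v0 * (t_algorithm + t_react)
    # Braking distance from Eq. (1).
    braking_distance = v0 ** 2 / (2.0 * g * mu)
    return travel_distance + eta * braking_distance


def is_reliable(s_initial, v0, t_algorithm, t_react, mu, eta=1.0, g=9.8):
    """True when s_initial satisfies the strict inequality of Eq. (3)."""
    return s_initial > min_safe_distance(v0, t_algorithm, t_react, mu, eta, g)
```

All quantities are expected in SI units, so a running time measured in milliseconds has to be divided by 1000 before it is passed in.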

Fig. 3. The relationship between \(s_{initial}\) and \(t_{algorithm}\). When \(s_{initial}\) is above the line, the skip sampling system’s real-time performance is ensured

Here is an example of determining the number of frames to omit. Suppose that the velocity of the car is 60 km/h, equal to 16.7 m/s, and \(t_{react}\) is about 0.8 s. Additionally, for a normal road surface, \(\mu \) is 0.75, and the modifying coefficient \(\eta \) is 1.5. Substituting these values into Eq. (3), we obtain the quantitative relationship between \(s_{initial}\) and \(t_{algorithm}\):

$$\begin{aligned} s_{initial} > 0.0166\times t_{algorithm}+41.5 \end{aligned}$$
(4)

Here \(t_{algorithm}\) is in milliseconds and \(s_{initial}\) is in meters; the relationship is shown in Fig. 3. Suppose that the video runs at 30 FPS. We can then obtain the required \(s_{initial}\) from \(t_{algorithm}\). This result means that, under skip sampling recognition, a detectable frame must be obtained at a distance of \(s_{initial}\) or further. Otherwise, the vehicle is unsafe, as for the sample point in Fig. 3.
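The linear relation in Eq. (4) can also be inverted to get a time budget for a given detection distance. The sketch below is one possible reading of how this budget relates to skipped frames: it counts how many whole frame intervals fit into the budget. That counting rule and the function names are assumptions for illustration; the text does not spell out the exact rule.

```python
import math

SLOPE_M_PER_MS = 0.0166  # coefficient of t_algorithm in Eq. (4)
INTERCEPT_M = 41.5       # constant term in Eq. (4)


def max_algorithm_time_ms(s_initial_m):
    """Upper bound on t_algorithm (ms) implied by Eq. (4); 0 if there is none."""
    return max(0.0, (s_initial_m - INTERCEPT_M) / SLOPE_M_PER_MS)


def frames_within_budget(s_initial_m, fps=30.0):
    """Whole frame intervals that fit in the time budget (assumed counting rule)."""
    frame_interval_ms = 1000.0 / fps
    return math.floor(max_algorithm_time_ms(s_initial_m) / frame_interval_ms)
```

For example, with \(s_{initial}\) = 50 m the bound is about 512 ms, which spans 15 whole frame intervals at 30 FPS. Because Eq. (4) is a strict inequality, the actual running time must stay below this bound.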

2.2 Traffic Light Detector (TLD)

In our TLD model, the backbone network derives from MobileNetV2 (see Fig. 4) and is connected to SSDLite. From stage 3 to stage 8, every stage uses a 1 \(\times \) 1 kernel to produce the object detection outputs. The classification part judges each box’s category, where cls refers to the number of object states (in our DeTLR system, it equals 2: traffic light and no traffic light). The position part determines the object’s position. After this process, non-maximum suppression (NMS) is applied to confirm the detection result. According to [2], the network is memory-efficient, with a maximum number of channels/memory of 200K, compared to other networks such as ShuffleNet (600K) and MobileNetV1 (800K). MobileNetV2 also has fewer parameters (3.4M in [2]), leading to less detection time.

Fig. 4. Structure of our TLD module

SSDLite is a mobile-friendly variant of regular SSD, a model based on traditional convolutional neural networks such as VGG. In the SSDLite prediction layers, regular convolutions are replaced by separable convolutions (depthwise followed by a 1 \(\times \) 1 projection). SSDLite has 4.3M parameters, far fewer than a traditional object detection structure such as YOLO (50.7M).
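To see why separable convolutions reduce the parameter count, the weights of the two layer types can be counted directly. The sketch below ignores bias terms, and the channel sizes in the usage note are hypothetical values chosen for illustration, not layer sizes from SSDLite.

```python
def conv_params(c_in, c_out, k):
    """Weights in a regular k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 projection."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 projection that mixes the channels
    return depthwise + pointwise
```

With 256 input channels, 256 output channels, and a 3 \(\times \) 3 kernel, the regular convolution has 589,824 weights and the separable one has 67,840, which is roughly 8.7 times fewer.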

The input of TLD is resized into 288 \(\times \) 288, and the output is the traffic light’s bounding boxes.
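The NMS step mentioned above is commonly implemented as a greedy procedure. The following is a minimal sketch in plain Python, not the implementation used in the TLD. The box format (x1, y1, x2, y2) and the IoU threshold of 0.5 are assumptions, since the text states neither.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS; returns indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Each candidate is compared only against boxes that were already kept, so a lower-scoring box is dropped whenever it overlaps a higher-scoring one by more than the threshold.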

2.3 Preprocessing

Shapes and other morphological characteristics of traffic lights can differ across scenarios. The input of the TLC is square, but the bounding boxes obtained by the TLD are often long and narrow. To solve this problem, we introduce an intermediate preprocessing method. First, we take the larger of the box’s width and height as the scaling criterion and scale the box by the ratio of the expected size to this criterion. Since the shape of the box is not altered, blank gaps remain in the expected input template. We fill these blank pixels with the color detected at the edge of the box (see Fig. 5).

Fig. 5. Our preprocessing method (Color figure online)

After preprocessing, the crops meet the input requirement of our TLC. In our paper, the preprocessing resizes all bounding boxes into 32 \(\times \) 32 crops.
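The preprocessing described in this section can be sketched in plain Python as follows. This is one possible reading of the method, not the original implementation: nearest-neighbor scaling, centering of the scaled crop, and filling the gap by repeating the nearest edge pixel are all assumptions, and the function name is hypothetical.

```python
def letterbox_edge_fill(image, target=32):
    """Resize a crop to target x target while keeping its aspect ratio.

    image: list of rows, each row a list of pixels (e.g. (r, g, b) tuples).
    The longer side is scaled to `target`; the gap on the shorter side is
    filled by repeating the nearest edge pixel of the scaled crop.
    """
    h, w = len(image), len(image[0])
    scale = target / max(h, w)
    new_h = max(1, min(target, round(h * scale)))
    new_w = max(1, min(target, round(w * scale)))

    # Nearest-neighbor resize to (new_h, new_w).
    resized = [
        [image[min(h - 1, int(y / scale))][min(w - 1, int(x / scale))]
         for x in range(new_w)]
        for y in range(new_h)
    ]

    # Center the scaled crop; clamping the indices replicates edge pixels.
    top = (target - new_h) // 2
    left = (target - new_w) // 2
    return [
        [resized[min(new_h - 1, max(0, y - top))]
                [min(new_w - 1, max(0, x - left))]
         for x in range(target)]
        for y in range(target)
    ]
```

Because the fill values are copied from the crop’s own border, the padded regions continue the local background color instead of introducing a fixed color.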

2.4 Traffic Light Classifier (TLC)

As mentioned in the introduction, our TLC is a convolutional neural network named smallNet. Its structure is shown in Table 2. The smallNet consists of three main layers, which keeps the structure simple. The input is a 32 \(\times \) 32 crop, and the output is one of four traffic light categories: green, red, yellow, and none.

Table 2. The structure of smallNet

From Table 2, we can give a theoretical analysis of its efficiency by calculating its number of parameters and FLOPs (floating-point operations). The classifier has about 55K parameters in total, and its FLOPs are about 5.2M. These two theoretical results indicate that our TLC model is light and efficient.
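Such counts follow from standard per-layer formulas, sketched below. The layer sizes in the example are hypothetical and do not reproduce Table 2, so the totals are not expected to match the 55K parameters or 5.2M FLOPs reported above. The sketch counts multiply-accumulate operations (MACs); some conventions report one MAC as two FLOPs.

```python
def conv_layer_cost(c_in, c_out, k, h_out, w_out):
    """(parameters, MACs) of a k x k convolution with bias."""
    params = (k * k * c_in + 1) * c_out
    macs = k * k * c_in * c_out * h_out * w_out
    return params, macs


def fc_layer_cost(n_in, n_out):
    """(parameters, MACs) of a fully connected layer with bias."""
    params = (n_in + 1) * n_out
    macs = n_in * n_out
    return params, macs


# Hypothetical three-layer network on a 32 x 32 RGB crop (NOT Table 2).
layers = [
    conv_layer_cost(3, 16, 3, 32, 32),
    conv_layer_cost(16, 32, 3, 16, 16),
    fc_layer_cost(32 * 8 * 8, 4),
]
total_params = sum(p for p, _ in layers)
total_macs = sum(m for _, m in layers)
```

The spatial output size enters only the operation count, which is why early convolutions on large feature maps tend to dominate the computation while fully connected layers tend to dominate the parameter count.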

3 Experiments

3.1 Datasets

Our DeTLR model is trained and validated on the Berkeley Driving Dataset (BDD). The dataset contains 100,000 videos, each about 40 s long, with a resolution of 1280 \(\times \) 720 at 30 fps. The dataset has two advantages. First, it covers different weather conditions, times of day, and regions, meaning that it contains various scenarios with different illumination conditions, distances, and distortion degrees (see Fig. 6). Second, it offers a large number of images with precise labels, which makes it suitable as a training set for a deep-learning model.

Table 3. Information of datasets

Fig. 6. Frames of different scenarios in the BDD dataset.

Besides the BDD dataset, other datasets are also used in this paper, including WPI [4], LaRA [5], and LISA [10, 11]. The comparison in Table 3 supports the advantages of BDD mentioned above: its varied scenarios and large size make it suitable for training a deep-learning model.

3.2 Experiment Setup

In the experiment, we train our model on NVIDIA’s GPUs and evaluate the running time on the TITAN Black GPU and RK3399 platforms. To gain the best performance, our TLD and TLC modules use different setups.

During TLD training, the input is resized to 288 \(\times \) 288. We train the TLD on two GPUs with a batch size of 16 per GPU. We use SGD with a multistep learning-rate schedule; the learning rate is initialized to 0.0001, and the weight decay is 0.00001.

During TLC training, the input is resized to 32 \(\times \) 32 crops. We train the model with a batch size of 16. The optimizer is Adam, with the learning rate initialized to 0.01 and the weight decay set to 0.9. The iteration number is 15. The testing batch size is also 16.

Thanks to the precise ground truth provided by the BDD dataset, we first extract the boxes from the original images to train our TLC. We have extracted more than 100,000 boxes labeled with their bounding boxes and categories (see Table 4). The distribution of types is not balanced, with relatively more examples of green and red lights and fewer of yellow lights. However, as the samples in Fig. 7 show, yellow lights look similar to red ones, and classifying yellow lights as red can actually increase the DeTLR’s reliability. Furthermore, in reality, yellow lights are encountered much less often than green and red lights. Therefore, the dataset’s unbalanced distribution has little negative effect.

Table 4. Distribution of training and testing sets

Fig. 7. Traffic light samples: (a) green, (b) red, (c) yellow, (d) none (Color figure online)

3.3 TLC Performance

First, we test the performance of the TLC. The TLC is first applied to ground-truth boxes extracted from the BDD dataset. We collect recall, precision, and running time (see Table 5). The running time of the TLC is about 0.7 ms on Nvidia’s Titan Black GPU.

Table 5. Our TLC’s performance

The results show that recognition of the green, red, and none types reaches high recall and precision. Although the performance on yellow lights is weaker, about 73% of the misclassified yellow lights are classified as red. Since classifying a yellow light as a red one does no harm in practical driving, this error can be tolerated.
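Per-class precision and recall, as well as the share of one class’s errors that land in another class, can be derived from a confusion matrix. The sketch below shows the computation; the function names are hypothetical, and any counts passed in are the caller’s own data rather than the values behind Table 5.

```python
def precision_recall(confusion, labels):
    """Per-class (precision, recall).

    confusion[i][j] = number of samples with true class i predicted as class j.
    """
    n = len(labels)
    result = {}
    for k, name in enumerate(labels):
        tp = confusion[k][k]
        predicted = sum(confusion[i][k] for i in range(n))  # column sum
        actual = sum(confusion[k])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        result[name] = (precision, recall)
    return result


def error_share(confusion, true_idx, pred_idx):
    """Fraction of the misclassified samples of one class that go to pred_idx."""
    errors = sum(confusion[true_idx]) - confusion[true_idx][true_idx]
    return confusion[true_idx][pred_idx] / errors if errors else 0.0
```

With the labels ordered as green, red, yellow, none, calling `error_share(confusion, 2, 1)` gives the fraction of misclassified yellow lights that were predicted as red, which is the kind of quantity quoted above.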

3.4 Comparison to the One-Step Detection Framework

In this experiment, we compare our DeTLR model (two-step) to a one-step variant, which combines detection and classification into a single step. In our DeTLR model, the parameter cls in the TLD remains 2 (judging whether the box is a traffic light or not). The one-step method not only judges whether the box is a traffic light but also provides its color category, so cls is changed to 5. Both methods are tested on Nvidia’s Titan Black GPU (see Table 6).

We also compare the two methods’ mean average precision (MAP), an important performance criterion. The one-step method’s MAP is 26.27%, while the two-step method reaches 33.84%. This also indicates that our TLD performs better than the one-step method.

Table 6. The comparison of DeTLR (two-step) to the one-step method

The one-step model takes 100 ms on average, and the two-step method takes 100 ms (TLD) plus 0.7 ms (TLC). The results in Table 6 show that the average performance of our two-step DeTLR model is better than that of the one-step model, and the difference in average running time is small enough to be ignored. This is because distinguishing two categories is much easier than distinguishing five, and our TLC model can be trained to high precision independently. In conclusion, our DeTLR model performs better than the one-step model. Figure 8 shows some examples of our model’s results.

Fig. 8. Our DeTLR’s TLR results on the BDD dataset.

3.5 Generalization

Generalization is an important property of a traffic light recognition model because real scenarios vary widely. To verify this property, we fine-tune the model on different datasets and analyze its performance. These datasets include WPI, LaRA, and LISA. First, we fine-tune our model on the WPI dataset and obtain the performance on every sequence (see Fig. 9).

Fig. 9. Our DeTLR model’s generalization on the WPI dataset. For the HOG and PCANet results, refer to [4]

Our DeTLR model has a better average performance than both HOG and PCANet. Its performance is more stable and remains at a relatively high level. The average precision of our model is 96.7%, compared with 93.1% for PCANet and 80.5% for the HOG feature.

Table 7. Our DeTLR model’s generalization on the LaRA and LISA dataset.

We then apply a similar method to the LaRA and LISA datasets (see Table 7). As Table 3 shows, the LISA dataset is separated by illumination condition (day and night), and the result of this test verifies that our model has reliable TLR performance under different illumination conditions. To conclude, our DeTLR model performs stably on different datasets, indicating good generalization.

4 Conclusion

In this paper, we propose a deep-learning-based traffic light recognition (DeTLR) model that achieves reliable recognition and real-time running speed. The model consists of four main modules: a skip sampling system, a traffic light detector (TLD), preprocessing, and a traffic light classifier (TLC). We use MobileNetV2 and the Single Shot MultiBox Detector (SSD) framework to construct the TLD, and design a small convolutional neural network for the TLC. The skip sampling system is developed to compensate for the time delay in the response system. We train our models on the BDD dataset, which includes plenty of real scenarios. We get a precision of 96.7% for green lights and 94.6% for red lights. Our TLD and TLC modules are separate, which makes our model a two-step model. A comparison shows that the two-step model performs better than the one-step model because it has better precision. The experiments on other traffic light recognition datasets also show that our model generalizes well.