Real-Time Traffic Accidents Post-Impact Prediction - Based On Crowdsourcing Data
ARTICLE INFO

Keywords:
Crowdsourcing data
Traffic accidents post-impact
Machine learning
Sequential prediction

ABSTRACT

Traffic accident management is a critical issue for advanced intelligent traffic management. The increasingly abundant crowdsourcing data and floating car data provide new support for improving traffic accident management. This paper investigates the methods to predict the complicated behavior of traffic flow evolution after traffic accidents using crowdsourcing data. Based on the available data source, the traffic condition is divided into four levels by congestion delay index: severely congested, congested, slow moving and uncongested. Four types of accidents are consequently defined based on the occurrence of each level. A hierarchical scheme is designed for identifying the most congested level and sequentially predicting duration of each level. The proposed model is validated using traffic accident data in 2017 from an anonymous source in Beijing, China by embedding three machine learning algorithms, random forest (RF), support vector machine (SVM) and neural network (NN), in the scheme. The results show NN outperforms the other two models when the assessment is conducted in absolute differences. Meanwhile, RF has a slightly better performance than SVM, especially when predicting the short-period congestion of severely congested level at the first time. By continuously updating the traffic condition information, significant improvement in accuracy can be acquired regardless of the exact model used. This study shows that emerging crowdsourcing data can be used in a real-time analysis of traffic accidents and the proposed model is effective to analyze such data.
⁎ Corresponding author.
E-mail address: lrmin@tsinghua.edu.cn (R. Li).
https://doi.org/10.1016/j.aap.2020.105696
Received 20 March 2020; Received in revised form 1 June 2020; Accepted 14 July 2020
0001-4575/ © 2020 Elsevier Ltd. All rights reserved.
Y. Lin and R. Li Accident Analysis and Prevention 145 (2020) 105696
prediction.

In contrast, crowdsourcing data (CD) from mobile applications (APPs) has become an emerging data source for transportation systems due to its abundance. Crowdsourcing, which converts all participants to potential supervisors of the transportation system, can gather information from users spread over the entire road network. Users on the platform can share information immediately once they observe any changes on roads, while others who may be affected later can make decisions beforehand based on the continuously updated traffic conditions. Obviously, the exploitation of CD comes at the cost of infidelity and uncertainty when interpreting it. Supplemented with credit systems, crowdsourcing is becoming a comprehensive but cheap way to collect traffic related data (Yang et al., 2015; Hasan and Ukkusuri, 2014; Zhang et al., 2018). Rashidi et al. (2017) and Chaniotakis et al. (2016) provided a more detailed discussion about how CD could be utilized in studying transportation issues.

Recent studies about the application of CD in transportation fields mainly lie in the usage of social media, such as Twitter and Facebook. By extracting the traffic event from textual data, locating the event and associating it with auxiliary data, traffic conditions can be estimated more accurately (Wang et al., 2013). However, we note that User-Generated Crowdsourcing Data (UGCD), which has a more direct connection with traffic conditions, remains unexplored. Several navigation systems such as Google Maps, Waze, Inrix as well as Autonavi and Baidu Maps in China provide users with an interface for reporting various traffic incidents in real time, along with their navigation services. With this feature, accident information as well as surrounding traffic conditions can be obtained simultaneously in real time. It is therefore feasible to develop a novel method to analyze TAPI using UGCD, especially to predict when the nearby road segment has resumed normal operation rather than just the clearance time of an accident.

In this study, we introduce a novel use of UGCD in real-time TAPI prediction and propose a hierarchical scheme to perform sequential prediction. Our work utilizes the power of open crowdsourcing data in traffic accident analysis, so that the public is able to capture more detailed perturbation of traffic conditions in a cost-efficient way without inquiring the government department. UGCD can also provide additional coverage to existing sources of the traffic management system (Amin-Naseri et al., 2018), which benefits more traffic participants. Moreover, this work explores the potential of using advanced Artificial Intelligence models in comprehensively predicting accident impacts. The remainder of this paper is organized as follows: Section 2 is a brief literature review on traffic accident impact prediction and the emerging usage of crowdsourcing data in the transportation field. Section 3 gives a detailed introduction and explanation of the data used. Section 4 presents the hierarchical model and embedded machine learning algorithms in this study, and Section 5 shows the numerical results with analysis. Finally, Section 6 summarizes the major findings of this study.

2. Literature review

2.1. Traffic accident impact prediction model

Traffic accident impact is usually measured by traffic accident duration, which is typically subdivided into 4 sections (Nam and Mannering, 2000), including detecting/reporting time, preparing/dispatching time, travel time and clearance time. Studies that focus on traffic accident duration, to a large degree, only consider specific components within overall duration, especially the reported time difference between occurrence and clearance.

Since the life cycle of an accident is a good indicator of its impact, probabilistic distribution analysis was used for decades to characterize the evolution of accidents solely based on accident information. Jones et al. (1991), Qi and Teng (2008) and Chung (2010) used a log-logistic distribution to fit the distribution of traffic accident duration. However, there were also studies such as Golob et al. (1987) and Chung and Yoon (2012), revealing that the distribution can be much better described by a log-normal distribution, while Hojati et al. (2013) and Alkaabi et al. (2011) found that a Weibull distribution fits best. However, the scalability of such analysis is greatly limited by the adopted dataset. Researchers then turned to statistical models to describe the relationship between accident duration and other related factors. Among all the proposed statistical models, the regression model is the most basic one, starting with linear regression (Cohen and Nouveliere, 1997; Khattak et al., 2012), which considered the duration as a linear combination of different factors. Later, Wang et al. (2013) and Agarwal et al. (2016) expanded regression models and combined the merits of several different models. The other representative class of models is the survival analysis/hazard-based model (Qi and Teng, 2008; Chung, 2010; Li et al., 2015): a parametric accelerated failure time (AFT) model that is widely used on different traffic duration time phases to figure out the impact factors; however, different results are achieved due to the differences in datasets and regions (Li et al., 2018). Zou et al. (2018) used a copula approach to jointly analyze incident clearance and response time. The results showed that the proposed copula model can better estimate conditional survival probability of clearance time than AFT models.

As the amount and variety of data generated in transportation systems grow explosively, not only is the impact of each factor appealing to researchers, but the relationship among all variables as well as the structure of the model itself also remains uninvestigated. To harness the massive data with unknown patterns, machine learning algorithms realize the data fusion and bridge the gap when mapping the inputs to outputs without exogenous assumptions. Thus, many machine learning algorithms have been implemented to simulate human learning activities and have solved many problems with high accuracy. The typical machine learning methods used in traffic accident duration prediction include the following: (1) Tree models: tree models characterize the nonlinear structure of the model and output the average duration of accidents with similar characteristics, which can give good accuracy; however, outliers in the input dataset will largely influence the results. Ma et al. (2017) proposed an efficient gradient-boosting decision tree model for prediction by using a threshold of 15 min. In comparison to traditional models and other methods including RF, SVM and back-propagation neural network, this model is superior in both long-lasting and short-period incident prediction. (2) Artificial neural network (ANN): the ANN approach is a data-driven, self-adaptive and nonlinear methodology. Wei and Lee (2007) built a data fusion model using ANN techniques with 1 hidden layer and obtained MAPE mostly under 40%. Yu et al. (2016) compared the performance of ANN and SVM and concluded that the SVM model has a comprehensively better performance despite some long duration cases. (3) Hybrid models: deviation within these simple models has motivated researchers to employ hybrid models for a more exhaustive prediction in recent years. Kim and Chang (2012) developed a hybrid model, which combines a tree model, a logit model and a Bayesian classifier. Lin et al. (2016) proposed a combined M5P tree and hazard-based duration model. This hybrid model achieves a lower MAPE and identifies the significant variables much more easily than a single model. Shang et al. (2019) used the Bayesian Optimization Algorithm to optimize parameters of RF while the relevant features are calculated by Neighborhood Components Analysis. Kuang et al. (2019) modeled the relationship of different features by a cost-sensitive Bayesian network and classified accidents according to their severity. A weighted K-nearest neighbor was then applied for duration prediction. Fu et al. (2019) considered traffic incident duration as a multi-task learning problem and proposed a spatiotemporal feature learning framework.
2.2. Real-time measurements and time series analysis used in duration prediction

As a traffic accident evolves, more information changes from unknown to known and thus could be added into prior consideration; that is, an up-to-date replenishment of information in the model can result in a better prediction. To better encode the changing environment, either a real-time indicator or a time series model can be beneficial. Khattak et al. (1995) split the whole process into 10 phases. Only basic factors such as time, location and weather were obtained from reports in the first phase, which led to a blurry classification of incidents. Details about how one incident took place, how the operational conditions were around that spot and other descriptions could be added into the analysis in the following phases. Accuracy would increase after accumulation and correction of all data from different phases. Wei and Lee (2007) created two adaptive ANN-based models. The first one is used to forecast the duration at the first time of detection or report, while the other one includes multiperiod updates after the incident notification. Pereira et al. (2013) developed a sequential model which can consistently generate prediction updates whenever new text information is received while taking into account the elapsed time. Li et al. (2015) proposed a time-dependent mixture model which performs better than a model only with initial information. A reasonable structure in time series combined with timely detection of new information can make sequential prediction powerful. Shi and Abdel-Aty (2015) provided a real-time congestion measurement based on Big Data which could demonstrate the temporo-spatial change of congestion patterns. Both direct and indirect congestion indicators were found to have significant impact on rear-end crashes. Since different real-time indicators might be correlated with each other, Shi et al. (2016) took multicollinearity among independent variables into consideration and the performance of the model was further improved by using Bayesian ridge regression to deal with the issue. Ghosh et al. (2018) provided updated prediction based on real-time streaming data by creating adaptive feature subsets based on the availability.

2.3. Crowdsourcing data used in transportation study

The data in most previous studies is from traffic accident reporting systems in local traffic accident monitoring centers or emergency response agencies. Because the traffic conditions of different cities and regions have changed greatly over time, researchers cannot increase the sample size of the research objects by simply mixing a large amount of historical data with diverse backgrounds. In response to this situation, CD provides a significant advantage in that a large amount of data can be collected in a short time without a specific gathering environment being required. Rashidi et al. (2017) presented a bibliometric analysis with a focus on applications of social media data in modelling travel behavior, including travel demand modelling, mobility behavior, individuals’ activity pattern, assessing public transport, traffic condition and incidents. They pointed out that the low acquisition cost and increasing amount made these data sources appealing. Nonetheless, special caution is required in using such data due to high extraction cost and sampling bias. Beheshti-Kashi et al. (2018) identified a list of textual sources in transportation which divides highly valuable user opinions into 3 categories, including social media based sources, traditional report sources and intra-organizational sources. Regarding applications related to traffic accidents, Nguyen et al. (2016) carried out a detailed analysis of using Twitter to monitor traffic flow, which could be an ideal method of traffic incident detection. 5000 filtered tweets are labeled as either relevant or nonrelevant when training the machine learning models. Their model can not only advance the incident detecting time in comparison to Transport Management Centre (TMC) log time, but also discover some incidents that are not reported to TMC. Gu et al. (2016) primarily used natural language processing and several classifiers to extract and filter useful information from original text posted by users. Applications in different regions reveal great potential by covering existing reported incidents with just a small sample of tweets. UGCD has also been explored in accident analysis in recent years. Amin-Naseri et al. (2018) discovered that UGCD can provide additional coverage of accidents with a low false alarm rate to conventional traffic management systems. Timely reporting has also been found compared to probe-based alternatives. Perez et al. (2018) extracted accident reports from Waze. They identified the repetitive reports according to road safety theory and obtained the patterns using a clustering algorithm.

Comprehensively, the internet, associated with a social platform, can provide first-hand data without a chain of trivial reporting processes. Most of the previous work focuses on incident detection from social media; the subsequent impact, which extends to a sequential detection, is rarely mentioned, probably due to the lack of corresponding real-time traffic flow status data, or possibly for other reasons. To the best of our knowledge, traffic duration prediction has not been performed using UGCD up to now.

3. Data preprocessing and problem definition

3.1. Data description

The data used in this paper was collected from an anonymous navigation system. Compared to social media, such as Twitter, a navigation system is highly related to the transportation system so that it provides a perfect interface for CD and traffic conditions. In this study, real-time accident information and surrounding traffic conditions are obtained from the accident report and congestion level estimation functions of the navigation system, respectively.

The procedure of reporting an accident to the navigation system is quite simple. Users can depict a traffic accident by selecting the accident type and lane location from a preset framework, while the time and location are detected automatically. This kind of preset framework makes the report simple and efficient and provides data in a unified format, but textual descriptions and photos can also be attached for details. The reported information will appear on the map as an icon, so that other users are informed about the change on the road and can get further information by clicking on the icon. After a certain period, the icon expires and disappears from the map.

In this dataset, traffic conditions are obtained from the feedback provided by floating cars every 5 min at the granularity of road segments. However, relative congestion level usually counts more than absolute speed in reality when depicting the traffic status. In typical navigation systems, a 4-level traffic status defined by congestion delay index with color indicators is used. Congestion delay index I is given by Eq. (1):

$$I = \frac{v_{\mathrm{free}}}{v} \qquad (1)$$

where $v$ is the current average speed of the investigated road segment, and $v_{\mathrm{free}}$ is the corresponding free flow speed. In this study, the boundaries for different levels are defined as,

Level 1: when 1 ≤ I < 1.5, the segment is considered as uncongested, indicated by a green color (G);
Level 2: when 1.5 ≤ I < 2, the segment is considered as slow moving, indicated by a yellow color (Y);
Level 3: when 2 ≤ I < 4, the segment is considered as congested, indicated by a red color (R);
Level 4: when I ≥ 4, the segment is considered as severely congested, indicated by a crimson color (C).

Since different users may report the same accident to the platform, which is encouraged for credibility purposes, repetitive records are stored. After filtering the information of all accident reports in Beijing,
including both urban and suburban areas during 2017, we finally screen out a list of 13,338 unique accident reports. Matching the reports with the traffic condition data by geocoding, we obtain a comprehensive description of an accident and the corresponding traffic conditions. To further capture other potential factors that may affect TAPI, we gathered more data about weather and air quality during the selected traffic accident period as well as the temporospatial properties of road segments.

3.2. A comprehensive definition for TAPI and accident types

Conventionally regarded as accident duration, TAPI, in fact, has a more sophisticated evolution. Not only does the accident itself matter; the entire recovery process should also be taken into account. More specifically, the congestion level will show a unimodal trend, first experiencing a growing stage after an accident occurs, followed by a diminishing stage until traffic flow returns to normal. The accident may be cleared in either stage, but TAPI will last much longer and cause lasting impact on roads. Moreover, congestion of different levels needs to be considered separately and differently. Based on the subdivision of traffic status, TAPI can be depicted comprehensively by the following 2 sets of values: (1) the most congested level that the traffic condition will reach at the end of the growing stage; (2) the duration from the reported start time to the end of each congestion level within the diminishing stage. Combining these two factors, the accidents can be classified into 4 types according to the number of levels reached consequently, as shown in Fig. 1. Type 4 is the worst case, which means that after the accident happens, the traffic becomes increasingly congested until it reaches Level 4 and then congestion dissipates gradually. In contrast, a Type 1 accident has almost no impact on the road conditions.

Tstart stands for the start time of an accident. Ideally, congestion status may reach the peak after a period of fluctuation and slowly return to uncongested after the clearance of the traffic accident. N4 = 1 if the congestion reaches the level of severely congested, and N4 = 0 otherwise. Similarly, Ni (i = 1, 2, 3) indicates whether the traffic ever reaches level i. During the diminishing stage, Ti (i = 1, 2, 3) is the first time back to level i. Although road conditions may fluctuate back and forth between different levels during the clearance of a traffic accident, we use the first occurrence of each congestion level in the diminishing stage to represent the change of status in this study. This is because the first occurrence time shows the least time required for recovery, while the following fluctuation has a higher probability of being caused by reasons other than the accident. Therefore, we can define Ttoi = Ti − Tstart (i = 1, 2, 3) to measure the exact duration of recovery to a specific level.

Taking the actual data as an example, all 4 types of accident congestion progress can be observed and examples are shown in Fig. 2.

4. Methodology

4.1. Hierarchical TAPI predicting scheme

Based on the 4 accident types, we propose a hierarchical TAPI predicting model by combining the prediction of the most congested condition and the duration of consequent congestion levels. At the very beginning of an accident, only basic information is known, so a qualitative prediction about the most congested level is performed in the order of severity. When a level is predicted to happen, the duration prediction is activated for all levels below. As time goes by, more information about the accident and its consequent traffic conditions will be known. If the most congested level in reality is consistent with the former prediction, new information such as when the former congestion level ends is added into the model, so that a more accurate prediction for the following TAPI can be performed. Otherwise, the prediction of the most congested level is revised and the following prediction is performed accordingly. The prediction process is illustrated in Fig. 3.

Taking the accident shown in Fig. 2d as an example, the hierarchical scheme will ideally make predictions in the following manner: (1) given the initial information of the accident and associated environmental factors, use a binary classification to predict whether Level 4 will occur; (2) Level 4 is predicted to happen. Thus, this accident is considered to be a Type 4 accident and duration prediction for Ttoi (i = 1, 2, 3) is activated. The predicted values can be used as a reference for traffic guidance; (3) 65 min later, N4 = 1 is detected, which indicates a right classification at step 1. No additional adjustment is needed with a correct classification; (4) 85 min later, T3 is detected. Add the known value Tto3 into the model and calibrate the predicted duration of lower levels T̂toi (i = 1, 2); (5) 110 min later, Tto2 is detected. Further add the known value Tto2 into the model and calibrate the predicted duration T̂to1; (6) 120 min later, Tto1 is detected. Traffic flow returns to uncongested and the TAPI of this accident is fully depicted in the procedure.

4.2. Embedded algorithm

Since the scheme does not depend on any prior assumption or structure, it is totally data-driven and we can embed any algorithm for prediction. For comparison and illustration purposes, we use three common algorithms, random forest, support vector machine and neural network, in the following content.

RF is an ensemble model with each tree model catching part of the nonlinearity between TAPI and a subset of factors. Since the inner relationship among all variables and their different categories is difficult to interpret in real life, RF acts as a combination of multiple decision trees to find the most probable result without a large set of assumptions. It is also effective in preventing overfitting to training data.

SVM constructs a hyperplane that can be used for classification. If linear inseparability occurs when dividing the space, segmentation
could be finished by mapping all points to a higher-dimensional space. By using a kernel function to map data from lower-dimensional space to higher-dimensional space, simplified calculation can be operated directly in the mapped space, thus making the application of the algorithm feasible. In this study, a radial basis kernel function shown as Eq. (2) is chosen:

$$K(x_i, x_j) = e^{-\gamma \lVert x_i - x_j \rVert^2} \qquad (2)$$

NN is composed of an input layer, a hidden layer and an output layer, and the weights between layers are calculated. Generally, the output variable can be written as Eq. (3):

probability to have high congestion levels increases as the road class becomes higher, and severe congestion after an accident is twice as likely on urban roads as on rural roads. However, average duration is not much different in both cases. (2) Workdays have a significantly larger probability of congestion than holidays, but a similar duration. (3) Peak hours increase the probability of congestion, while the morning peak hour has a further impact on the duration. (4) The middle lane has a higher probability of causing congestion compared to side lanes. (5) Fog and sleet have a positive impact on probability and duration, respectively. (6) Congestion level before the accident shows the largest variation among all factors.
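The per-category figures discussed here (the share of accidents that reach a given congestion level, and the mean recovery time) are group-wise statistics. A minimal sketch of how such a summary could be computed follows. The record fields `lane`, `reached_level4` and `t_to1`, the helper name `summarize` and the sample values are all hypothetical and are not taken from the study's dataset.

```python
from collections import defaultdict

def summarize(records, key, flag, duration):
    """Group records by `key` and return {group: (probability in %, mean duration)}.

    `flag` names a boolean field (e.g. whether Level 4 was reached) and
    `duration` names a numeric field in minutes. Both are assumed names.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    summary = {}
    for name, rows in groups.items():
        # Share of accidents in this group for which the flag is set.
        probability = 100.0 * sum(1 for r in rows if r[flag]) / len(rows)
        # Mean duration over the rows that actually carry a duration.
        durations = [r[duration] for r in rows if r[duration] is not None]
        mean_duration = sum(durations) / len(durations) if durations else None
        summary[name] = (probability, mean_duration)
    return summary

# Made-up records for illustration only.
records = [
    {"lane": "middle", "reached_level4": True, "t_to1": 80.0},
    {"lane": "middle", "reached_level4": False, "t_to1": 60.0},
    {"lane": "left side", "reached_level4": False, "t_to1": 70.0},
]
result = summarize(records, "lane", "reached_level4", "t_to1")
```

Comparing the resulting pairs across groups is the kind of reading applied to Table 1: one column answers how often a level is reached, the other how long recovery takes when it is.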
0.73 among all levels while NN slightly outperforms the other two ML models. However, the skewed class distribution should be noticed when predicting N4 and N2. According to the dataset, the severely congested level rarely happens. If we adopt a naïve algorithm that keeps predicting 0 for all N4, the model already achieves a high ACC. Similarly, always predicting 1 for N2 also achieves a high ACC. But this kind of algorithm is not applicable since it does not serve the classification purpose. Thus, more indicators such as PPV are needed as a supplement. PPV is over 0.81 among all models, where SVM has slightly better performance. The high PPV implies a high probability of congestion occurrence if the model gives a positive prediction and thus is quite informative to traffic participants. Note that, as the congestion level increases, PPV also increases and even approaches 1 when predicting the severely congested level. Overall, our model achieves good prediction with high ACC and PPV, which makes a good start for TAPI prediction.

For the duration prediction of each level, we start by relying only on the initial information of accidents. The predicted results without updates are shown in Fig. 4.

All the models have a concentrated prediction, that is, more accurate prediction can be achieved at moderate duration while obvious deviation is shown at larger and smaller values. As the degree of congestion decreases, the predicted values show a positive bias. In addition, it is worth noting that the NN model shows extremely high accuracy in Tto3 prediction and a positive bias in other cases when congestion does not last long.

Considering the sequential prediction, we obtain more information as the accident evolves, which can also be added into the model. Specifically, if Tto3 is detected in reality, it can be considered as an independent variable when predicting Tto2 and Tto1 in Type 4 accidents. Likewise, actual Tto2 can be utilized to predict Tto1 in both Type 4 and Type 3 accidents. The results of models with updates are shown in Fig. 5.

Comparing the results without updates in Fig. 4 and with updates in Fig. 5, we can see that sequential prediction can effectively improve all three models.

From the perspective of numerical criteria, mean absolute percentage error (MAPE) and root mean square error (RMSE) are used to measure the accuracy of these quantitative models as shown in Tables 3 and 4.

MAPE, shown in Eq. (6), is chosen to assess the overall performance of models:

$$\mathrm{MAPE} = \frac{1}{m} \sum_{i=1}^{m} \left| \frac{\hat{t}_i - t_i}{t_i} \right| \times 100\% \qquad (6)$$

RMSE, shown in Eq. (7), is chosen to capture the absolute magnitude of the deviation.
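As a minimal illustration of the two criteria, the sketch below computes MAPE following Eq. (6) and RMSE under its standard definition, the square root of the mean squared difference between predicted and observed durations. That definition is an assumption here, because Eq. (7) is not reproduced in this text. The function names and the sample durations are hypothetical and are not taken from the study's data.

```python
import math

def mape(actual, predicted):
    """Mean absolute percentage error in percent, following Eq. (6)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Each term is the absolute error relative to the observed duration.
    total = sum(abs(p - a) / a for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)

def rmse(actual, predicted):
    """Root mean square error (standard definition, assumed for Eq. (7))."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

# Made-up recovery durations in minutes, for illustration only.
actual = [50.0, 100.0, 80.0]
predicted = [55.0, 90.0, 80.0]
mape_value = mape(actual, predicted)
rmse_value = rmse(actual, predicted)
```

MAPE is scale-free, so short durations with small absolute errors can still dominate it, whereas RMSE is expressed in minutes and weights large misses more heavily. Reporting both, as in Table 3, shows these two views side by side.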
Table 1
Summary of occurrence probability and duration for each congestion level.
Category                 Variable                          Dummy variable      Average probability (%)   Average time (min)
Temporo-spatial feature  Classification of roads           a1 highway          18.59   60.22   80.67     55.60   71.60   80.30
                                                           a2 1st class road   12.26   51.42   79.25     58.27   74.86   77.20
                                                           a3 2nd class road    9.58   39.59   68.32     50.77   63.42   70.50
                                                           a4 3rd class road   10.31   40.36   62.78     35.00   62.33   68.50
                                                           a5 4th class road    6.91   31.48   58.93     49.58   64.97   69.50
                                                           a6 expressway       29.79   73.65   88.73     50.53   68.95   78.19
                                                           a7 arterial         25.71   64.88   82.42     49.23   65.92   75.46
                                                           a8 2nd trunk        21.34   57.10   76.57     51.05   68.08   75.46
                                                           a9 branch           17.11   53.22   74.71     54.87   68.91   76.66
                                                           a10 street          18.32   52.01   75.46     41.70   61.44   68.52
                         Workday or not                    b1 workday          23.92   61.71   80.32     50.64   67.98   76.08
                                                           b2 holiday          18.59   57.55   78.06     49.36   65.70   74.97
                         Peak hour or not                  c1 morning peak     25.18   64.60   81.89     55.36   73.07   80.14
                                                           c2 evening peak     26.82   65.98   83.42     48.03   67.44   76.49
                                                           c3 off peak         20.10   57.16   77.57     48.78   64.69   73.55
                         Lane location                     d1 left side        20.45   57.85   78.05     49.28   66.60   74.57
                                                           d2 middle           25.24   64.07   81.82     49.70   67.23   76.21
                                                           d3 right side       22.75   60.81   79.83     51.75   68.34   76.56
Weather condition        Weather                           e1 sunny/cloudy     22.27   60.79   79.61     50.35   67.17   75.66
                                                           e2 light rain       23.39   60.49   80.57     50.79   67.72   75.94
                                                           e3 moderate rain    24.00   60.00   80.00     47.33   70.42   76.13
                                                           e4 storm            23.72   63.46   77.56     50.14   68.69   80.25
                                                           e5 foggy            28.24   70.59   87.06     47.92   69.67   76.89
                                                           e6 sleet            36.36   50.91   63.64     58.50   83.57   83.71
                         Pollution level                   f1 good             22.37   60.40   79.62     50.63   67.47   75.75
                                                           f2 lightly          24.06   62.05   80.50     49.29   67.21   76.09
                                                           f3 moderately       23.22   61.48   79.51     53.76   69.96   76.67
                                                           f4 severely         22.50   62.50   81.88     45.14   66.75   75.19
Accident information     Accident type                     g1 breakdown        23.11   60.71   79.97     49.03   69.01   77.26
                                                           g2 scratch          24.49   62.30   82.03     48.43   65.79   75.32
                                                           g3 pileup           21.24   60.08   78.98     52.27   67.96   75.83
                                                           g4 other            22.19   55.01   75.39     53.00   67.93   74.19
                         Congestion level before accident  h1 uncongested      10.83   38.57   63.07     58.80   68.11   70.19
                                                           h2 slow moving      14.04   68.21   99.71     55.15   63.17   70.61
                                                           h3 congested        35.90   99.71   99.89     46.73   65.96   85.78
                                                           h4 severely         99.03   99.46   99.68     45.39   77.77   87.64
Table 2
Results of predicting the most congested level.
Actual value RF SVM NN
Table 3
Results of predicting congestion duration.
                RF                       SVM                      NN
                MAPE (%)   RMSE (min)    MAPE (%)   RMSE (min)    MAPE (%)   RMSE (min)
Type 4   Tto3   13.43      21.06         48.77      33.53          3.10       8.09
         Tto2   32.78      34.06         33.04      34.74         17.40      10.86
         Tto1   30.51      30.59         30.29      30.93         13.59      10.04
Type 3   Tto2   39.82      25.26         39.27      25.56         42.86      14.85
         Tto1   27.95      21.41         27.48      20.97         21.35      13.42
Type 2   Tto1   39.22      24.08         36.44      24.23         53.80      15.68

Table 6
Performance comparison with previous study.
Literature               MAPE range
Wei and Lee (2007)       35–45%
Pereira et al. (2013)    40–100%
Li et al. (2015)         45.4–185.7%
Ghosh et al. (2018)      20–100.9%
Fu et al. (2019)         37.16–96.38%
Our work                 5.5–53.8%
Table 7
Top 5 feature importance of different RF models.
work; this paper applies the machine learning method to predict the duration of different traffic condition states after traffic accidents and considers the updating effect by adding newly acquired data to the prediction. On a real-time basis, it ultimately aims to depict the evolution of surrounding traffic conditions after the occurrence of a traffic accident. On an offline basis, it can be used to assess the robustness of the transportation system when traffic accidents happen. Specifically, we first extract UGCD provided by the users of navigation apps. After filtering and calibrating UGCD into a set of traffic accidents, we can consecutively generate the features for every accident. Based on the geographical information of the accidents, we map each accident to the corresponding traffic flow data with the same spatiotemporal features, which can also be obtained from navigation apps. Additional related information can also be extracted in order to explore other factors that might affect TAPI. Preprocessing steps such as screening of raw data, selection of independent variables and definition of dependent variables are carried out to describe accident information and congestion. Using the congestion delay index, which is commonly used in current traffic guidance systems, the TAPI is divided into four levels: severely congested, congested, slow moving and uncongested. Based on the occurrence of each traffic condition level, accidents can be divided into four types. Predicting the recovery duration of each level for every type of accident can give an overall description of the persistence of congestion. RF, SVM and NN algorithms are applied to model and predict the congestion state and duration. The NN model performs better in most cases, especially when considering the absolute difference between predicted and actual duration. Moreover, the precision improves significantly over time as updated information is incorporated, regardless of which algorithm is embedded. This result has great reference value for the prediction of real-time traffic conditions, because it can effectively guide traffic participants to avoid the affected congested sections.
There are several limitations in this study, which can be improved in the future: (1) The model in this study only considers a small subset of factors that may impact TAPI and thus has a limited degree of accuracy. The accident information recorded by crowdsourcing data is not as comprehensive as data obtained by traditional methods. For example, the specific types of accidents and the number of lanes and vehicles involved, which may affect the congestion time, are not normally recorded in a crowdsourcing dataset. In the future, it is possible to consider enriching the information collected by the network reporting system so that the state of the accident can be described more accurately. (2) Due to the large differences in traffic conditions at different times in different regions, models built on a particular data sample are not applicable to all road conditions. Portability of the enhanced model can be considered in the future. From this study, we can see the feasibility of using crowdsourcing data to predict congestion caused by traffic accidents. With further development and enrichment in the future, crowdsourcing data can provide more accurate support for post-accident congestion prediction by integrating data from diverse sources.

Author statement

Yunduan Lin: conceptualization, methodology, software, investigation, writing-original draft preparation, writing-reviewing and editing. Ruimin Li: funding acquisition, supervision, investigation, writing-reviewing and editing, data curation, investigation, validation.

Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors gratefully acknowledge assistance with accident data and floating car data from the concerned institution. The research reported in this paper is part of the Project supported by the National Natural Science Foundation of China (71871123). The financial support is highly appreciated.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.aap.2020.105696.

References

Agarwal, S., Kachroo, P., Regentova, E., 2016. A hybrid model using logistic regression and wavelet transformation to detect traffic incidents. IATSS Res. 40 (1), 56–63.
Alkaabi, A.M.S., Dissanayake, D., Bird, R., 2011. Analyzing clearance time of urban traffic accidents in Abu Dhabi, United Arab Emirates, with hazard-based duration modeling method. Transp. Res. Rec. (2229), 46–54.
Al-Najada, H., Mahgoub, I., 2017. Real-time incident clearance time prediction using traffic data from internet of mobility sensors. Proceedings of the 2017 IEEE 15th Intl. Conf. on Dependable, Autonomic and Secure Computing, 15th Intl. Conf. on Pervasive Intelligence and Computing, 3rd Intl. Conf. on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech) 728–735.
Amin-Naseri, M., Chakraborty, P., Sharma, A., Gilbert, S.B., Hong, M., 2018. Evaluating the reliability, coverage, and added value of crowdsourced traffic incident reports from Waze. Transp. Res. Rec. 2672 (43), 34–43.
Beheshti-Kashi, S., Buch, R., Lachaize, M., Kinra, A., 2018. Big textual data in transportation: an exploration of relevant text sources. Proceedings of the International Conference on Dynamics in Logistics 395–399.
Chaniotakis, E., Antoniou, C., Pereira, F., 2016. Mapping social media for transportation studies. IEEE Intell. Syst. 31 (6), 64–70.
Chung, Y., 2010. Development of an accident duration prediction model on the Korean freeway systems. Accid. Anal. Prev. 42 (1), 282–289.
Chung, Y., Yoon, B.J., 2012. Analytical method to estimate accident duration using archived speed profile and its statistical analysis. KSCE J. Civil Eng. 16 (6), 1064–1070.
Cohen, S., Nouveliere, C., 1997. Modelling incident duration on an urban expressway. In: Papageorgiou, M., Pouliezos, A. (Eds.), Transportation Systems 1997, Vols. 1–3. Pergamon Press Ltd, Oxford, pp. 297–301.
Fu, K., Ji, T., Zhao, L., Lu, C.-T., 2019. Titan: a spatiotemporal feature learning framework for traffic incident duration prediction. Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems 329–338.
Ghosh, B., Asif, M.T., Dauwels, J., Fastenrath, U., Guo, H., 2018. Dynamic prediction of the incident duration using adaptive feature set. IEEE Trans. Intell. Transp. Syst. 20 (11), 4019–4031.
Giuliano, G., 1989. Incident characteristics, frequency, and duration on a high volume urban freeway. Transp. Res. Part A Policy Pract. 23 (5), 387–396.
Golob, T.F., Recker, W.W., Leonard, J.D., 1987. An analysis of the severity and incident duration of truck-involved freeway accidents. Accid. Anal. Prev. 19 (5), 375–395.
Gu, Y.M., Qian, Z., Chen, F., 2016. From twitter to detector: real-time traffic incident detection using social media data. Transp. Res. Part C Emerg. Technol. 67, 321–342.
Hasan, S., Ukkusuri, S.V., 2014. Urban activity pattern classification using topic models from online geo-location data. Transp. Res. Part C Emerg. Technol. 44, 363–381.
Hojati, A.T., Ferreira, L., Washington, S., Charles, P., 2013. Hazard based models for freeway traffic incident duration. Accid. Anal. Prev. 52, 171–181.
Jones, B., Janssen, L., Mannering, F., 1991. Analysis of the frequency and duration of freeway accidents in Seattle. Accid. Anal. Prev. 23 (4), 239–255.
Khattak, A., Wang, X., Zhang, H., 2012. Incident management integration tool: dynamically predicting incident durations, secondary incident occurrence and incident delays. IET Intell. Transp. Syst. 6 (2), 204–214.
Khattak, A.J., Schofer, J.L., Wang, M.H., 1995. A simple time-sequential procedure for predicting freeway incident duration. IVHS J. 2 (2), 113–138.
Kim, W., Chang, G.L., 2012. Development of a hybrid prediction model for freeway incident duration: a case study in Maryland. Int. J. Intell. Transp. Syst. Res. 10 (1), 22–33.
Kuang, L., Yan, H., Zhu, Y., Tu, S., Fan, X., 2019. Predicting duration of traffic accidents based on cost-sensitive Bayesian network and weighted k-nearest neighbor. J. Intell. Transp. Syst. 23 (2), 161–174.
Li, R.M., Pereira, F.C., Ben-Akiva, M.E., 2015. Competing risks mixture model for traffic incident duration prediction. Accid. Anal. Prev. 75, 192–201.
Li, R.M., Pereira, F.C., Ben-Akiva, M.E., 2018. Overview of traffic incident duration analysis and prediction. Eur. Transp. Res. Rev. 10 (2), 13.
Lin, L., Wang, Q., Sadek, A.W., 2016. A combined m5p tree and hazard-based duration model for predicting urban freeway traffic accident durations. Accid. Anal. Prev. 91, 114–126.
Ma, X.L., Ding, C., Luan, S., Wang, Y., Wang, Y.P., 2017. Prioritizing influential factors for freeway incident clearance time prediction using the gradient boosting decision trees method. IEEE Trans. Intell. Transp. Syst. 18 (9), 2303–2310.
Nam, D., Mannering, F., 2000. An exploratory hazard-based analysis of highway incident duration. Transp. Res. Part A Policy Pract. 34 (2), 85–102.
Nguyen, H., Liu, W., Rivera, P., Chen, F., 2016. Trafficwatch: real-time traffic incident detection and monitoring using social media. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J.Z., Wang, R. (Eds.), Advances in Knowledge Discovery and Data Mining, PAKDD 2016, Pt I. Springer-Verlag Berlin, Berlin, pp. 540–551.
Pereira, F.C., Rodrigues, F., Ben-Akiva, M., 2013. Text analysis in incident duration prediction. Transp. Res. Part C Emerg. Technol. 37, 177–192.
Perez, G.V.A., Lopez, J.C., Cabello, A.L.R., Grajales, E.B., Espinosa, A.P., Fabian, J.L.Q., 2018. Road traffic accidents analysis in Mexico city through crowdsourcing data and data mining techniques. Int. J. Comput. Inform. Eng. 12 (8), 604–608.
Qi, Y., Teng, H.L., 2008. An information-based time sequential approach to online incident duration prediction. J. Intell. Transp. Syst. 12 (1), 1–12.
Rashidi, T.H., Abbasi, A., Maghrebi, M., Hasan, S., Waller, T.S., 2017. Exploring the capacity of social media data for modelling travel behaviour: Opportunities and challenges. Transp. Res. Part C Emerg. Technol. 75, 197–211.
Shang, Q., Tan, D., Gao, S., Feng, L., 2019. A hybrid method for traffic incident duration prediction using boa-optimized random forest combined with neighborhood components analysis. J. Adv. Transp. 2019.
Shi, Q., Abdel-Aty, M., 2015. Big data applications in real-time traffic operation and safety monitoring and improvement on urban expressways. Transp. Res. Part C Emerg. Technol. 58, 380–394.
Shi, Q., Abdel-Aty, M., Lee, J., 2016. A Bayesian ridge regression analysis of congestion's impact on urban expressway safety. Accid. Anal. Prev. 88, 124–137.
Skabardonis, A., Varaiya, P., Petty, K.F., Trb, 2003. Measuring Recurrent and Nonrecurrent Traffic Congestion. Freeways, High-Occupancy Vehicle Systems, and Traffic Signal Systems 2003: Highway Operations, Capacity, and Traffic Control. Transportation Research Board Natl Research Council, Washington, pp. 118–124.
Wang, S., He, L., Stenneth, L., Philip, S.Y., Li, Z., Huang, Z., 2013. Estimating urban traffic congestions with multi-sourced data. Proceedings of the 2016 17th IEEE International Conference on Mobile Data Management (MDM) 82–91.
Wei, C.H., Lee, Y., 2007. Sequential forecast of incident duration using artificial neural network models. Accid. Anal. Prev. 39 (5), 944–954.
Yang, F., Jin, P.J., Cheng, Y., Zhang, J., Ran, B., 2015. Origin-destination estimation for non-commuting trips using location-based social networking data. Int. J. Sustain. Transp. 9 (8), 551–564.
Yu, B., Wang, Y.T., Yao, J.B., Wang, J.Y., 2016. A comparison of the performance of ANN and SVM for the prediction of traffic accident duration. Neural Network World 26 (3), 271–287.
Zhang, Z.H., He, Q., Gao, J., Ni, M., 2018. A deep learning approach for detecting traffic accidents from social media data. Transp. Res. Part C Emerg. Technol. 86, 580–596.
Zou, Y., Henrickson, K., Lord, D., Wang, Y., Xu, K., 2016. Application of finite mixture models for analysing freeway incident clearance time. Transportmetrica A: Transp. Sci. 12 (2), 99–115.
Zou, Y., Ye, X., Henrickson, K., Tang, J., Wang, Y., 2018. Jointly analyzing freeway traffic incident clearance and response time using a copula-based approach. Transp. Res. Part C Emerg. Technol. 86, 171–182.