
Real-Time Traffic Accidents Post-Impact Prediction - Based On Crowdsourcing Data


Accident Analysis and Prevention 145 (2020) 105696

Contents lists available at ScienceDirect

Accident Analysis and Prevention


journal homepage: www.elsevier.com/locate/aap

Real-time traffic accidents post-impact prediction: Based on crowdsourcing data
Yunduan Lin (a, b), Ruimin Li (a, *)
(a) Department of Civil Engineering, Tsinghua University, Beijing 100084, China
(b) Department of Civil and Environment Engineering, University of California, Berkeley, CA 94720, United States

ARTICLE INFO

Keywords:
Crowdsourcing data
Traffic accidents post-impact
Machine learning
Sequential prediction

ABSTRACT

Traffic accident management is a critical issue for advanced intelligent traffic management. The increasingly abundant crowdsourcing data and floating car data provide new support for improving traffic accident management. This paper investigates the methods to predict the complicated behavior of traffic flow evolution after traffic accidents using crowdsourcing data. Based on the available data source, the traffic condition is divided into four levels by congestion delay index: severely congested, congested, slow moving and uncongested. Four types of accidents are consequently defined based on the occurrence of each level. A hierarchical scheme is designed for identifying the most congested level and sequentially predicting duration of each level. The proposed model is validated using traffic accident data in 2017 from an anonymous source in Beijing, China by embedding three machine learning algorithms, random forest (RF), support vector machine (SVM) and neural network (NN), in the scheme. The results show NN outperforms the other two models when the assessment is conducted in absolute differences. Meanwhile, RF has a slightly better performance than SVM, especially when predicting the short-period congestion of severely congested level at the first time. By continuously updating the traffic condition information, significant improvement in accuracy can be acquired regardless of the exact model used. This study shows that emerging crowdsourcing data can be used in a real-time analysis of traffic accidents and the proposed model is effective to analyze such data.

* Corresponding author.
E-mail address: lrmin@tsinghua.edu.cn (R. Li).

https://doi.org/10.1016/j.aap.2020.105696
Received 20 March 2020; Received in revised form 1 June 2020; Accepted 14 July 2020
0001-4575/ © 2020 Elsevier Ltd. All rights reserved.

1. Introduction

Owing to various factors, such as improper driving behavior and bad weather, traffic accidents are to some extent inevitable. They usually have a broad-scale impact on traffic conditions, especially during peak hours. Typically, accidents may generate congestion to some extent in part of the road network or even cause a chain breakdown in the entire system when the occurrence overlaps with a road bottleneck. This type of congestion is characterized as nonrecurrent congestion, along with congestion caused by large events, work zones and extreme weather. Nonrecurrent congestion, of which 72% is caused by accidents (Skabardonis et al., 2003), accounts for half to three-quarters of total congestion (Giuliano, 1989). To cope with this mobility challenge, accurately predicting Traffic Accidents Post-Impact (TAPI) is of great significance for better guiding the traffic participants involved and for operating integrated transportation systems more efficiently.

One key aspect of traffic accident analysis and prediction is to obtain both accident information and surrounding traffic conditions simultaneously in real time. Accident information in previous research came mainly from a single source, the Traffic Incident Management System (Li et al., 2018), which is created and operated by government departments or research institutions. Live traffic conditions can be acquired from various road sensors such as cameras, loop detectors and GPS-based on-board units. However, a spatiotemporal sparsity issue arises in these conventional approaches due to the limited number of traffic detectors, especially when accidents must be identified and the surrounding traffic conditions recorded simultaneously for real-time prediction. Due to the limited data availability, two major deficiencies exist in the previous studies. One is that the model is usually constructed only for a small region, such as a specific highway (Zou et al., 2016; Al-Najada and Mahgoub, 2017) or urban freeway (Hojati et al., 2013). The other is that the information on past accidents stored in the offline database is always complete, while only a small part can be acquired in time for a real-time prediction. For example, whether an accident involves a fatality or injury affects its duration. Usually, such information is collected afterwards and cannot be used in real-time analysis. Thus, more effort is needed to retrieve timely information and conduct online prediction.

In contrast, crowdsourcing data (CD) from mobile applications (APPs) has become an emerging data source for transportation systems due to its abundance. Crowdsourcing, which turns all participants into potential supervisors of the transportation system, can gather information from users spread over the entire road network. Users on the platform can share information immediately once they observe any changes on roads, while others who may be affected later can make decisions beforehand based on the continuously updated traffic conditions. Obviously, the exploitation of CD comes at the cost of infidelity and uncertainty when interpreting it. Supplemented with credit systems, crowdsourcing is becoming a comprehensive but cheap way to collect traffic related data (Yang et al., 2015; Hasan and Ukkusuri, 2014; Zhang et al., 2018). Rashidi et al. (2017) and Chaniotakis et al. (2016) provided a more detailed discussion about how CD could be utilized in studying transportation issues.

Recent studies about the application of CD in transportation fields mainly concern the usage of social media, such as Twitter and Facebook. By extracting the traffic event from textual data, locating the event and associating it with auxiliary data, traffic conditions can be estimated more accurately (Wang et al., 2013). However, we note that User-Generated Crowdsourcing Data (UGCD), which has a more direct connection with traffic conditions, remains unexplored. Several navigation systems such as Google Maps, Waze and Inrix, as well as Autonavi and Baidu Maps in China, provide users with an interface for reporting various traffic incidents in real time, along with their navigation services. With this feature, accident information as well as surrounding traffic conditions can be obtained simultaneously in real time. It is therefore feasible to develop a novel method to analyze TAPI using UGCD, especially to predict when the nearby road segment has resumed normal operation rather than just the clearance time of an accident.

In this study, we introduce a novel use of UGCD in real-time TAPI prediction and propose a hierarchical scheme to perform sequential prediction. Our work utilizes the power of open crowdsourcing data in traffic accident analysis, so that the public is able to capture more detailed perturbations of traffic conditions in a cost-efficient way without querying government departments. UGCD can also provide additional coverage to existing sources of the traffic management system (Amin-Naseri et al., 2018), which benefits more traffic participants. Moreover, this work explores the potential of using advanced Artificial Intelligence models in comprehensively predicting accident impacts. The remainder of this paper is organized as follows: Section 2 is a brief literature review on traffic accident impact prediction and the emerging usage of crowdsourcing data in the transportation field. Section 3 gives a detailed introduction and explanation of the data used. Section 4 presents the hierarchical model and the embedded machine learning algorithms, and Section 5 shows the numerical results with analysis. Finally, Section 6 summarizes the major findings of this study.

2. Literature review

2.1. Traffic accident impact prediction model

Traffic accident impact is usually measured by traffic accident duration, which is typically subdivided into 4 sections (Nam and Mannering, 2000): detecting/reporting time, preparing/dispatching time, travel time and clearance time. Studies that focus on traffic accident duration, to a large degree, only consider specific components within the overall duration, especially the reported time difference between occurrence and clearance.

Since the life cycle of an accident is a good indicator of its impact, probabilistic distribution analysis based solely on accident information was used for decades to characterize the evolution of accidents. Jones et al. (1991), Qi and Teng (2008) and Chung (2010) used a log-logistic distribution to fit the distribution of traffic accident duration. However, there were also studies such as Golob et al. (1987) and Chung and Yoon (2012) revealing that the distribution can be much better described by a log-normal distribution, while Hojati et al. (2013) and Alkaabi et al. (2011) found that a Weibull distribution fits best. However, the scalability of such analysis is greatly limited by the adopted dataset. Researchers then turned to statistical models to describe the relationship between accident duration and other related factors. Among all the proposed statistical models, the regression model is the most basic one, starting with linear regression (Cohen and Nouveliere, 1997; Khattak et al., 2012), which considered the duration as a linear combination of different factors. Later, Wang et al. (2013) and Agarwal et al. (2016) expanded regression models and combined the merits of several different models. The other representative class of models is the survival analysis/hazard-based model (Qi and Teng, 2008; Chung, 2010; Li et al., 2015): a parametric accelerated failure time (AFT) model that is widely used on different traffic duration time phases to identify the impact factors; however, different results are obtained due to the differences in datasets and regions (Li et al., 2018). Zou et al. (2018) used a copula approach to jointly analyze incident clearance and response time. The results showed that the proposed copula model can better estimate the conditional survival probability of clearance time than AFT models.

As the amount and variety of data generated in transportation systems grow explosively, researchers are interested not only in the impact of each factor but also in the relationships among all variables and in the structure of the model itself, which remain uninvestigated. To harness massive data with unknown patterns, machine learning algorithms realize the data fusion and bridge the gap when mapping the inputs to the output without exogenous assumptions. Thus, many machine learning algorithms have been implemented to simulate human learning activities and have solved many problems with high accuracy. The typical machine learning methods used in traffic accident duration prediction include the following: (1) Tree models: tree models characterize the nonlinear structure of the model and output the average duration of accidents with similar characteristics, which can give good accuracy; however, outliers in the input dataset will largely influence the results. Ma et al. (2017) proposed an efficient gradient-boosting decision tree model for prediction by using a threshold of 15 min. In comparison to traditional models and other methods including RF, SVM and back-propagation neural network, this model is superior in both long-lasting and short-period incident prediction. (2) Artificial neural network (ANN): the ANN approach is a data-driven, self-adaptive and nonlinear methodology. Wei and Lee (2007) built a data fusion model using ANN techniques with 1 hidden layer and obtained MAPE mostly under 40%. Yu et al. (2016) compared the performance of ANN and SVM and concluded that the SVM model has a comprehensively better performance despite some long duration cases. (3) Hybrid models: deviation within these simple models has motivated researchers in recent years to employ hybrid models for a more exhaustive prediction. Kim and Chang (2012) developed a hybrid model which combines a tree model, a logit model and a Bayesian classifier. Lin et al. (2016) proposed a combined M5P tree and hazard-based duration model. This hybrid model achieves a lower MAPE and identifies the significant variables much more easily than a single model. Shang et al. (2019) used the Bayesian Optimization Algorithm to optimize the parameters of RF, while the relevant features are calculated by Neighborhood Components Analysis. Kuang et al. (2019) modeled the relationship of different features by a cost-sensitive Bayesian network and classified accidents according to their severity. A weighted K-nearest neighbor was then applied for duration prediction. Fu et al. (2019) considered traffic incident duration as a multi-task learning problem and proposed a spatiotemporal feature learning framework.


2.2. Real-time measurements and time series analysis used in duration prediction

As a traffic accident evolves, more information changes from unknown to known and thus could be added into prior consideration; that is, an up-to-date replenishment of information in the model can result in a better prediction. To better encode the changing environment, either a real-time indicator or a time series model can be beneficial. Khattak et al. (1995) split the whole process into 10 phases. Only basic factors such as time, location and weather were obtained from reports in the first phase, which led to a blurry classification of incidents. Details about how an incident took place, how the operational conditions were around that spot and other descriptions could be added into the analysis in the following phases. Accuracy would increase after accumulation and correction of all data from different phases. Wei and Lee (2007) created two adaptive ANN-based models. The first one is used to forecast the duration at the first time of detection or report, while the other one includes multiperiod updates after the incident notification. Pereira et al. (2013) developed a sequential model which can consistently generate prediction updates whenever new text information is received while taking into account the elapsed time. Li et al. (2015) proposed a time-dependent mixture model which performs better than a model with initial information only. A reasonable time series structure combined with timely detection of new information can make sequential prediction powerful. Shi and Abdel-Aty (2015) provided a real-time congestion measurement based on Big Data which could demonstrate the temporo-spatial change of congestion patterns. Both direct and indirect congestion indicators were found to have a significant impact on rear-end crashes. Since different real-time indicators might be correlated with each other, Shi et al. (2016) took multicollinearity among independent variables into consideration, and the performance of the model was further improved by using Bayesian ridge regression to deal with the issue. Ghosh et al. (2018) provided updated predictions based on real-time streaming data by creating adaptive feature subsets based on availability.

2.3. Crowdsourcing data used in transportation study

The data in most previous studies is from traffic accident reporting systems in local traffic accident monitoring centers or emergency response agencies. Because the traffic conditions of different cities and regions have changed greatly over time, researchers cannot increase the sample size of the research objects by simply mixing a large amount of historical data with diverse backgrounds. In response to this situation, CD provides a significant advantage in that a large amount of data can be collected in a short time without requiring a specific gathering environment. Rashidi et al. (2017) presented a bibliometric analysis with a focus on applications of social media data in modelling travel behavior, including travel demand modelling, mobility behavior, individuals' activity pattern, assessing public transport, traffic condition and incidents. They pointed out that the low acquisition cost and increasing amount made these data sources appealing. Nonetheless, special caution is required in using such data due to high extraction cost and sampling bias. Beheshti-Kashi et al. (2018) identified a list of textual sources in transportation which divides highly valuable user opinions into 3 categories: social media based sources, traditional report sources and intra-organizational sources. Regarding applications related to traffic accidents, Nguyen et al. (2016) carried out a detailed analysis of using Twitter to monitor traffic flow, which could be an ideal method of traffic incident detection. 5000 filtered tweets were labeled as either relevant or nonrelevant when training the machine learning models. Their model can not only advance the incident detecting time in comparison to the Transport Management Centre (TMC) log time, but also discover some incidents that are not reported to TMC. Gu et al. (2016) primarily used natural language processing and several classifiers to extract and filter useful information from original text posted by users. Applications in different regions reveal great potential by covering existing reported incidents with just a small sample of tweets. UGCD has also been explored in accident analysis in recent years. Amin-Naseri et al. (2018) discovered that UGCD can provide additional coverage of accidents, with a low false alarm rate, to a conventional traffic management system. Its reporting was also found to be timely compared with the probe-based alternative. Perez et al. (2018) extracted accident reports from Waze. They identified the repetitive reports according to road safety theory and obtained the patterns using a clustering algorithm.

Overall, the internet, associated with a social platform, can provide first-hand data without a chain of trivial reporting processes. Most of the previous work focuses on incident detection from social media; the subsequent impact, which extends to a sequential detection, is rarely mentioned, probably due to the lack of corresponding real-time traffic flow status data, or possibly for other reasons. To the best of our knowledge, traffic duration prediction has not been performed using UGCD up to now.

3. Data preprocessing and problem definition

3.1. Data description

The data used in this paper was collected from an anonymous navigation system. Compared to social media, such as Twitter, a navigation system is closely tied to the transportation system, so it provides a perfect interface for CD and traffic conditions. In this study, real-time accident information and surrounding traffic conditions are obtained from the accident report and congestion level estimation functions of the navigation system, respectively.

The procedure for reporting an accident to the navigation system is quite simple. Users can describe a traffic accident by selecting the accident type and lane location from a preset framework, while the time and location are detected automatically. This kind of preset framework makes the report simple and efficient and provides data in a unified format, but textual descriptions and photos can also be attached for details. The reported information appears on the map as an icon, so that other users are informed about the change on the road and can get further information by clicking on the icon. After a certain period, the icon expires and disappears from the map.

In this dataset, traffic conditions are obtained from the feedback provided by floating cars every 5 min at the granularity of a road segment. However, the relative congestion level usually matters more than the absolute speed when depicting the traffic status. In typical navigation systems, a 4-level traffic status defined by the congestion delay index with color indicators is used. The congestion delay index $I$ is given by Eq. (1):

$$ I = \frac{v_{free}}{v} \tag{1} $$

where $v$ is the current average speed of the investigated road segment, and $v_{free}$ is the corresponding free flow speed. In this study, the boundaries for the different levels are defined as:

Level 1: when 1 ≤ I < 1.5, the segment is considered as uncongested, indicated by a green color (G);
Level 2: when 1.5 ≤ I < 2, the segment is considered as slow moving, indicated by a yellow color (Y);
Level 3: when 2 ≤ I < 4, the segment is considered as congested, indicated by a red color (R);
Level 4: when I ≥ 4, the segment is considered as severely congested, indicated by a crimson color (C).

Since different users may report the same accident to the platform, which is encouraged for credential purposes, repetitive records are stored. After filtering the information of all accident reports in Beijing, including both urban and suburban areas during 2017, we finally screen out a list of 13,338 unique accident reports. Matching the reports with the traffic condition data by geocoding, we obtain a comprehensive description of an accident and the corresponding traffic conditions. To further capture other potential factors that may affect TAPI, we gathered more data about weather and air quality during the selected traffic accident period as well as the temporospatial properties of road segments.

3.2. A comprehensive definition for TAPI and accident types

Conventionally regarded as accident duration, TAPI in fact has a more sophisticated evolution. Not only the accident itself matters; the entire recovery process should also be taken into account. More specifically, the congestion level shows a unimodal trend, first experiencing a growing stage after an accident occurs, followed by a diminishing stage until traffic flow returns to normal. The accident may be cleared in either stage, but TAPI will last much longer and cause a lasting impact on roads. Moreover, congestion of different levels needs to be considered separately and differently. Based on the subdivision of traffic status, TAPI can be depicted comprehensively by the following 2 sets of values: (1) the most congested level that the traffic condition will reach at the end of the growing stage; (2) the duration from the reported start time to the end of each congestion level within the diminishing stage. Combining these two factors, the accidents can be classified into 4 types according to the number of levels reached consequently, as shown in Fig. 1. Type 4 is the worst case, which means that after the accident happens, the traffic becomes increasingly congested until it reaches Level 4 and then congestion dissipates gradually. In contrast, a Type 1 accident has almost no impact on the road conditions.

$T_{start}$ stands for the start time of an accident. Ideally, the congestion status may reach the peak after a period of fluctuation and slowly return to uncongested after the clearance of the traffic accident. $N_4 = 1$ if the congestion reaches the level of severely congested, and $N_4 = 0$ otherwise. Similarly, $N_i$ ($i = 1, 2, 3$) indicates whether the traffic ever reaches level $i$. During the diminishing stage, $T_i$ ($i = 1, 2, 3$) is the first time back to level $i$. Although road conditions may fluctuate back and forth between different levels during the clearance of a traffic accident, we use the first occurrence of each congestion level in the diminishing stage to represent the change of status in this study. This is because the first occurrence time shows the least time required for recovery, while the following fluctuation has a higher probability of being caused by reasons other than the accident. Therefore, we can define $T_{toi} = T_i - T_{start}$ ($i = 1, 2, 3$) to measure the exact duration of recovery to a specific level.

Taking the actual data as an example, all 4 types of accident congestion progress can be observed, and examples are shown in Fig. 2.

4. Methodology

4.1. Hierarchical TAPI predicting scheme

Based on the 4 accident types, we propose a hierarchical TAPI predicting model by combining the prediction of the most congested condition and the duration of the consequent congestion levels. At the very beginning of an accident, when only basic information is known, a qualitative prediction about the most congested level is performed in the order of severity. When a level is predicted to happen, the duration prediction is activated for all levels below. As time goes by, more information about the accident and its consequent traffic conditions becomes known. If the most congested level in reality is consistent with the former prediction, new information such as when the former congestion level ends is added into the model, and a more accurate prediction for the following TAPI can be performed. Otherwise, the prediction of the most congested level is revised and the following prediction is performed accordingly. The prediction process is illustrated in Fig. 3.

Taking the accident shown in Fig. 2d as an example, the hierarchical scheme will ideally make predictions in the following manner: (1) given the initial information of the accident and the associated environment factors, use a binary classification to predict whether Level 4 will occur; (2) Level 4 is predicted to happen. Thus, this accident is considered to be a Type 4 accident and duration prediction for $T_{toi}$ ($i = 1, 2, 3$) is activated. The predicted values can be used as a reference for traffic guidance; (3) 65 min later, $N_4 = 1$ is detected, which indicates a right classification at step 1. No additional adjustment is needed with a correct classification; (4) 85 min later, $T_3$ is detected. Add the known value $T_{to3}$ into the model and calibrate the predicted duration of the lower levels $\hat{T}_{toi}$ ($i = 1, 2$); (5) 110 min later, $T_{to2}$ is detected. Further add the known value $T_{to2}$ into the model and calibrate the predicted duration $\hat{T}_{to1}$; (6) 120 min later, $T_{to1}$ is detected. Traffic flow returns to uncongested and the TAPI of this accident is fully depicted in the procedure.

4.2. Embedded algorithm

Since the scheme does not depend on any prior assumption or structure, it is totally data-driven and we can embed any algorithm for prediction. For comparison and illustration purposes, we use three common algorithms, random forest, support vector machine and neural network, in the following content.

RF is an ensemble model with each tree model catching part of the nonlinearity between TAPI and a subset of factors. Since the inner relationship among all variables and their different categories is difficult to interpret in real life, RF acts as a combination of multiple decision trees to find the most likely result without a bunch of assumptions. It is also effective in preventing overfitting to training data.
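
The level boundaries of Section 3.1, the TAPI quantities of Section 3.2 and the scheme of Section 4.1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names, the stub models standing in for RF, SVM or NN, the handling of zero speed and of I < 1, the choice of the first peak sample as the end of the growing stage, and the example series (made up so that it reproduces the 85, 110 and 120 min values quoted for the Fig. 2d example) are hypothetical.

```python
from typing import Callable, Dict, Mapping, Optional, Sequence, Tuple

Features = Mapping[str, float]
Classifier = Callable[[Features], bool]                    # predicts N_i for one level
Regressor = Callable[[Features, Mapping[int, float]], float]  # predicts T_toi in minutes


def congestion_level(v: float, v_free: float) -> int:
    """Return level 1..4 from the congestion delay index I = v_free / v (Eq. 1)."""
    if v <= 0:            # assumption: a stopped segment counts as severely congested
        return 4
    index = v_free / v
    if index < 1.5:       # assumption: I < 1 is also treated as uncongested
        return 1
    if index < 2:
        return 2
    if index < 4:
        return 3
    return 4


def tapi_labels(levels: Sequence[int], step_min: int = 5) -> Tuple[int, Dict[int, int]]:
    """Derive the accident type and the T_toi values from a post-accident series.

    levels[k] is the congestion level k * step_min minutes after T_start.
    Assumptions: the growing stage ends at the first sample of the peak level,
    and "back to level i" means the first later sample whose level is <= i.
    """
    levels = list(levels)
    peak = max(levels)
    peak_idx = levels.index(peak)
    t_to: Dict[int, int] = {}
    for i in range(peak - 1, 0, -1):
        for k in range(peak_idx + 1, len(levels)):
            if levels[k] <= i:
                t_to[i] = k * step_min
                break
    return peak, t_to


def predict_tapi(
    features: Features,
    classifiers: Mapping[int, Classifier],
    regressors: Mapping[int, Regressor],
    observed: Optional[Mapping[int, float]] = None,
    observed_peak: Optional[int] = None,
) -> Tuple[int, Dict[int, float]]:
    """One pass of the hierarchical scheme; call it again whenever new data arrives."""
    observed = dict(observed or {})
    if observed_peak is not None:
        peak = observed_peak              # revise the classification with reality
    else:
        peak = 1
        for level in (4, 3, 2):           # qualitative prediction in order of severity
            if classifiers[level](features):
                peak = level
                break
    durations: Dict[int, float] = {}
    for i in range(peak - 1, 0, -1):      # durations for all levels below the peak
        if i in observed:
            durations[i] = observed[i]    # already detected, no prediction needed
        else:
            durations[i] = regressors[i](features, observed)
    return peak, durations


# Hypothetical 5-min series: level 4 is first seen at 65 min, recovery at 85/110/120 min.
series = [2] * 5 + [3] * 8 + [4] * 4 + [3] * 5 + [2] * 2 + [1]
accident_type, t_to = tapi_labels(series)

# Stub models (placeholders only): a regressor never predicts a recovery time
# earlier than the latest observed T_to value.
stub_classifiers = {4: lambda f: f["level_before"] >= 3, 3: lambda f: True, 2: lambda f: True}


def stub_regressor(base: float) -> Regressor:
    return lambda f, obs: max([base, *obs.values()])


stub_regressors = {3: stub_regressor(70.0), 2: stub_regressor(80.0), 1: stub_regressor(115.0)}
first = predict_tapi({"level_before": 3}, stub_classifiers, stub_regressors)
updated = predict_tapi({"level_before": 3}, stub_classifiers, stub_regressors, observed={3: 85})
```

In this sketch the sequential update is simply a second call with the detected values passed in `observed`, so any trained model that accepts the observed durations as extra inputs could replace the stubs.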
Fig. 1. Traffic conditions after accidents.

Fig. 2. Illustration of different types of accident.

Fig. 3. A schematic diagram for hierarchical TAPI predicting model.

SVM constructs a hyperplane that can be used for classification. If linear inseparability occurs when dividing the space, segmentation can be achieved by mapping all points to a higher-dimensional space. By using a kernel function to map data from the lower-dimensional space to the higher-dimensional space, a simplified calculation can be carried out directly in the mapped space, thus making the application of the algorithm feasible. In this study, a radial basis kernel function, shown as Eq. (2), is chosen:

$$ K(x_i, x_j) = e^{-\gamma \lVert x_i - x_j \rVert^2} \tag{2} $$

NN is composed of an input layer, a hidden layer and an output layer, and the weights between adjacent layers are calculated. Generally, the output variable can be written as Eq. (3):

$$ y = g_o\left(w_o^T \, g_h\left(w_h^T x + b_h\right) + b_o\right) \tag{3} $$

where $x$ is the vector of input variables, $w_h$ and $b_h$ represent the weights and bias between the input layer and the hidden layer, and $w_o$ and $b_o$ represent the weights and bias between the hidden layer and the output layer. $g_h$ and $g_o$ are activation functions which introduce nonlinearity into NN. More complexity can be captured by increasing the size or the depth of the layers.

5. Numerical results

5.1. Model inputs

We extract 3 kinds of crucial factors that may potentially impact TAPI: temporospatial features, weather conditions and reported accident information. Furthermore, 8 independent variables are split into 36 dummy variables, and the corresponding statistics of the occurrence probability and duration of each congestion level are shown in Table 1.

From Table 1, we can draw some primary conclusions. (1) The probability of high congestion levels increases as the road class becomes higher, and the severely congested level occurs after an accident twice as often on urban roads as on rural roads. However, the average duration is not much different in both cases. (2) Workdays have a significantly larger probability of congestion than holidays, but a similar duration. (3) Peak hours increase the probability of congestion, while the morning peak hour has a further impact on the duration. (4) The middle lane has a higher probability of causing congestion compared to side lanes. (5) Fog and sleet have a positive impact on probability and duration, respectively. (6) The congestion level before the accident shows the largest variation among all factors.

5.2. Model evaluation

To evaluate our models, the whole dataset is split into 2 parts: 70% as the training set and 30% as the test set. We use binary classifiers based on RF, SVM and NN, respectively, to predict which level is the most congested level after a specific accident. To investigate the performance of these qualitative prediction models, two criteria, accuracy (ACC) and precision (PPV), are computed. The results are shown in Table 2.

ACC, shown in Eq. (4), is chosen to evaluate the overall predictive power of our models,

$$ \mathrm{ACC} = \frac{TP + TN}{P + N} \tag{4} $$

PPV, shown in Eq. (5), is chosen to evaluate the credibility of the predicted positive events, which represent the occurrence of congestion,

$$ \mathrm{PPV} = \frac{TP}{TP + FP} \tag{5} $$

Here TP, TN and FP are the numbers of true positives, true negatives and false positives, and P and N are the numbers of actual positive and negative samples.

On one hand, the three models have similar performance, with ACC over 0.73 among all levels, while NN slightly outperforms the other two ML models. However, the class imbalance should be noted when predicting $N_4$ and $N_2$. According to the dataset, the severely congested level rarely happens. If we adopt a naïve algorithm that keeps predicting 0 for all $N_4$, the model already achieves a high ACC. Similarly, always predicting 1 for $N_2$ also achieves a high ACC. But this kind of algorithm is not applicable since it does not serve the classification purpose. Thus, on the other hand, more indicators such as PPV are needed as a supplement. PPV is over 0.81 among all models, where SVM has slightly better performance. The high PPV implies a high probability of congestion occurrence if the model gives a positive prediction and thus is quite informative to traffic participants. Note that, as the congestion level increases, PPV also increases and even approaches 1 when predicting the severely congested level. Overall, our model gives a good prediction with high ACC and PPV, which makes a good start for TAPI prediction.

For the duration prediction of each level, we start by relying only on the initial information of accidents. The predicted results without updates are shown in Fig. 4.

All the models have a concentrated prediction, that is, more accurate prediction can be achieved at moderate durations, while obvious deviation is shown at larger and smaller values. As the degree of congestion decreases, the predicted values show a positive bias. In addition, it is worth noting that the NN model shows extremely high accuracy in $T_{to3}$ prediction and a positive bias in other cases when congestion does not last long.

Considering the sequential prediction, we obtain more information as the accident evolves, which can also be added into the model. Specifically, if $T_{to3}$ is detected in reality, it can be considered as an independent variable when predicting $T_{to2}$ and $T_{to1}$ in Type 4 accidents. Likewise, the actual $T_{to2}$ can be utilized to predict $T_{to1}$ in both Type 4 and Type 3 accidents. The results of the models with updates are shown in Fig. 5.

Comparing the results without updates in Fig. 4 and with updates in Fig. 5, we can see that sequential prediction can effectively improve all three models.

From the perspective of numerical criteria, mean absolute percentage error (MAPE) and root mean square error (RMSE) are used to measure the accuracy of these quantitative models, as shown in Tables 3 and 4.

MAPE, shown in Eq. (6), is chosen to assess the overall performance of the models,

$$ \mathrm{MAPE} = \frac{1}{m} \sum_{i=1}^{m} \left| \frac{\hat{t}_i - t_i}{t_i} \right| \times 100\% \tag{6} $$

RMSE, shown in Eq. (7), is chosen to understand the absolute value of deviation,


Table 1
Summary of occurrence probability and duration for each congestion level.
Category  Variable  Dummy variable  Average probability (%): N4  N3  N2  Average time (min): Tto3  Tto2  Tto1

Temporo-spatial feature Classification of roads a1 highway 18.59 60.22 80.67 55.60 71.60 80.30
a2 1st class road 12.26 51.42 79.25 58.27 74.86 77.20
a3 2nd class road 9.58 39.59 68.32 50.77 63.42 70.50
a4 3rd class road 10.31 40.36 62.78 35.00 62.33 68.50
a5 4th class road 6.91 31.48 58.93 49.58 64.97 69.50
a6 expressway 29.79 73.65 88.73 50.53 68.95 78.19
a7 arterial 25.71 64.88 82.42 49.23 65.92 75.46
a8 2nd trunk 21.34 57.10 76.57 51.05 68.08 75.46
a9 branch 17.11 53.22 74.71 54.87 68.91 76.66
a10 street 18.32 52.01 75.46 41.70 61.44 68.52
Workday or not b1 workday 23.92 61.71 80.32 50.64 67.98 76.08
b2 holiday 18.59 57.55 78.06 49.36 65.70 74.97
Peak hour or not c1 morning peak 25.18 64.60 81.89 55.36 73.07 80.14
c2 evening peak 26.82 65.98 83.42 48.03 67.44 76.49
c3 off peak 20.10 57.16 77.57 48.78 64.69 73.55
Lane location d1 left side 20.45 57.85 78.05 49.28 66.60 74.57
d2 middle 25.24 64.07 81.82 49.70 67.23 76.21
d3 right side 22.75 60.81 79.83 51.75 68.34 76.56
Weather condition Weather e1 sunny/cloudy 22.27 60.79 79.61 50.35 67.17 75.66
e2 light rain 23.39 60.49 80.57 50.79 67.72 75.94
e3 moderate rain 24.00 60.00 80.00 47.33 70.42 76.13
e4 storm 23.72 63.46 77.56 50.14 68.69 80.25
e5 foggy 28.24 70.59 87.06 47.92 69.67 76.89
e6 sleet 36.36 50.91 63.64 58.50 83.57 83.71
Pollution level f1 good 22.37 60.40 79.62 50.63 67.47 75.75
f2 lightly 24.06 62.05 80.50 49.29 67.21 76.09
f3 moderately 23.22 61.48 79.51 53.76 69.96 76.67
f4 severely 22.50 62.50 81.88 45.14 66.75 75.19
Accident information Accident type g1 breakdown 23.11 60.71 79.97 49.03 69.01 77.26
g2 scratch 24.49 62.30 82.03 48.43 65.79 75.32
g3 pileup 21.24 60.08 78.98 52.27 67.96 75.83
g4 other 22.19 55.01 75.39 53.00 67.93 74.19
Congestion level before accident h1 uncongested 10.83 38.57 63.07 58.80 68.11 70.19
h2 slow moving 14.04 68.21 99.71 55.15 63.17 70.61
h3 congested 35.90 99.71 99.89 46.73 65.96 85.78
h4 severely 99.03 99.46 99.68 45.39 77.77 87.64
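
The split of the 8 independent variables into 36 dummy variables summarized in Table 1 can be illustrated with a minimal one-hot encoding sketch. The category counts are read from Table 1; the dictionary keys, the `encode` helper and the example accident are hypothetical names introduced only for illustration, and this is not necessarily how the authors built their inputs.

```python
# Number of categories (dummy variables) per independent variable, from Table 1.
# The dictionary keys are illustrative labels, not names used by the authors.
CATEGORY_COUNTS = {
    "a_road_class": 10,
    "b_workday": 2,
    "c_peak_hour": 3,
    "d_lane_location": 3,
    "e_weather": 6,
    "f_pollution": 4,
    "g_accident_type": 4,
    "h_prior_congestion": 4,
}


def encode(accident):
    """One-hot encode an accident given a 1-based category index per variable."""
    vector = []
    for name, count in CATEGORY_COUNTS.items():
        index = accident[name]
        if not 1 <= index <= count:
            raise ValueError(f"{name} index {index} outside 1..{count}")
        vector.extend(1 if k == index else 0 for k in range(1, count + 1))
    return vector


# Hypothetical accident: expressway (a6), workday (b1), evening peak (c2),
# middle lane (d2), sunny/cloudy (e1), good air (f1), scratch (g2),
# slow moving before the accident (h2).
example = {
    "a_road_class": 6, "b_workday": 1, "c_peak_hour": 2, "d_lane_location": 2,
    "e_weather": 1, "f_pollution": 1, "g_accident_type": 2, "h_prior_congestion": 2,
}
features = encode(example)
print(len(features), sum(features))
```

Each accident activates exactly one dummy per variable, so the encoded vector has 36 entries of which 8 are set.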

$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (\hat{t}_i - t_i)^2} \tag{7}$$

As illustrated in Figs. 4 and 5, the RF and SVM models without updates have a MAPE over 27%, except for the prediction of Tto3 by the RF model, while the NN model reaches below 20% in some cases. The accuracy is limited for two main reasons: (1) the granularity of the detected traffic condition is 5 min, which restricts the accuracy and continuity of the original data; (2) MAPE is also influenced by the scale of the duration itself, since it describes the error relative to the actual value. A small denominator magnifies the error between the actual value and the predicted value. The prediction of Tto3 by the RF model avoids the large MAPE described in (2) by giving good estimates for small durations. Adding the actual hitting time of the previous congestion level brings the MAPE of the models below 10% in most cases (Table 4). In the version with updates, RF has better predicting power than SVM and NN.
To further show the prediction performance on specific accidents rather than on the average level, we define the assessment criteria on absolute prediction difference in Table 5, and the corresponding proportions are shown in Fig. 6.
Although the results on the average criteria are not good enough, we find that about 50% of accidents can be predicted with high accuracy even with only the initial information. More interestingly, NN shows a great advantage over the other two models when the absolute difference is considered. Meanwhile, Fig. 6a shows that inaccurate predictions also account for a large share, which further implies that the initial information about an accident is not adequate to characterize the entire TAPI. The improvement from Fig. 6a to Fig. 6b comes from the increasing number of highly accurate predictions. Thus, it is reasonable to assume that with a more detailed description of the accident and its surrounding traffic condition, the model can eventually give an ideal prediction.
Since no previous study has used the same kind of data as our work,

Table 2
Results of predicting the most congested level.
                     RF                            SVM                           NN
       Actual value  0     1     ACC     PPV       0     1     ACC     PPV       0     1     ACC     PPV
N4     0             3091  6     0.8391  0.9560    3095  2     0.8386  0.9924    3077  24    0.8423  0.9245
       1             638   267                     644   261                     607   294
N3     0             1226  335   0.7311  0.8354    1283  278   0.7309  0.8552    1128  409   0.7406  0.8178
       1             741   1700                    799   1642                    629   1836
N2     0             71    735   0.8051  0.8109    91    715   0.8048  0.8140    89    702   0.8096  0.8178
       1             45    3151                    66    3130                    60    3151
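
As a check on how ACC (Eq. (4)) and PPV (Eq. (5)) relate to the counts in Table 2, the sketch below recomputes both for the SVM model at level N4 and compares them with a naive rule that always predicts 0. It assumes that the 0/1 columns under each model are the predicted values, a reading that is consistent with the reported PPV; the function names are illustrative.

```python
def acc(tp, tn, fp, fn):
    """Accuracy, Eq. (4): (TP + TN) / (P + N), where P + N is all samples."""
    return (tp + tn) / (tp + tn + fp + fn)


def ppv(tp, fp):
    """Positive predictive value, Eq. (5): TP / (TP + FP)."""
    if tp + fp == 0:
        return None  # undefined when the model never predicts the positive class
    return tp / (tp + fp)


# SVM counts for N4 in Table 2 (rows: actual value, columns read as predictions).
tn, fp = 3095, 2      # actual 0: predicted 0, predicted 1
fn, tp = 644, 261     # actual 1: predicted 0, predicted 1

svm_acc = acc(tp, tn, fp, fn)
svm_ppv = ppv(tp, fp)

# Naive baseline that always predicts 0 on the same accidents.
naive_acc = acc(tp=0, tn=tn + fp, fp=0, fn=fn + tp)
naive_ppv = ppv(tp=0, fp=0)

print(round(svm_acc, 4), round(svm_ppv, 4), round(naive_acc, 4), naive_ppv)
```

The recomputed SVM values match the 0.8386 and 0.9924 reported in Table 2. The naive rule still reaches an ACC of about 0.77 while its PPV is undefined, which is why ACC alone is not a sufficient indicator on this skewed level.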


Fig. 4. Results of two models without updates.
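
The two error measures used to score the duration predictions, MAPE in Eq. (6) and RMSE in Eq. (7), can be sketched as follows. The durations are hypothetical values chosen only to illustrate the scale effect discussed in the text; they are not taken from the dataset.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error in percent, following Eq. (6)."""
    m = len(actual)
    return sum(abs(p - t) / t for t, p in zip(actual, predicted)) / m * 100


def rmse(actual, predicted):
    """Root mean square error in the unit of the inputs, following Eq. (7)."""
    m = len(actual)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(actual, predicted)) / m)


# Hypothetical durations in minutes; every prediction is off by exactly 5 min.
short_actual, short_pred = [10, 10, 20], [15, 5, 25]
long_actual, long_pred = [50, 100, 100], [55, 95, 105]

print(mape(short_actual, short_pred), rmse(short_actual, short_pred))
print(mape(long_actual, long_pred), rmse(long_actual, long_pred))
```

Both sets have the same RMSE of 5 min, but the MAPE of the short durations is about 41.7% against about 6.7% for the long ones, which illustrates how a small denominator magnifies the relative error.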

Fig. 5. Results of two models with updates.
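
The updating step behind the results with updates can be sketched as a rule for which already observed durations may be appended to the initial features. The stage order per accident type is read from the rows of Tables 3 and 4; the `update_features` helper and the numeric values are hypothetical and only illustrate the idea, not the authors' code.

```python
# Durations that each accident type goes through, in the order they are observed
# (read from the rows of Tables 3 and 4).
STAGES = {
    "Type 4": ["Tto3", "Tto2", "Tto1"],
    "Type 3": ["Tto2", "Tto1"],
    "Type 2": ["Tto1"],
}


def update_features(accident_type, target, observed):
    """Return the observed earlier durations usable as extra inputs for `target`."""
    stages = STAGES[accident_type]
    earlier = stages[: stages.index(target)]
    return {name: observed[name] for name in earlier if name in observed}


# Hypothetical observations in minutes.
print(update_features("Type 4", "Tto2", {"Tto3": 45.0}))
print(update_features("Type 4", "Tto1", {"Tto3": 45.0, "Tto2": 70.0}))
print(update_features("Type 3", "Tto1", {"Tto2": 60.0}))
print(update_features("Type 2", "Tto1", {}))
```

A Type 2 accident has no earlier stage to observe, so it receives no update, which matches the absence of a Type 2 row in Table 4.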


we compare our model performance with some of the works that considered dynamic prediction in Table 6. The lower bound of MAPE is greatly improved by our work, which shows the great potential predictive power of CD for accident duration.

Table 3
Results of predicting congestion duration.
              RF                    SVM                   NN
              MAPE (%)  RMSE (min)  MAPE (%)  RMSE (min)  MAPE (%)  RMSE (min)
Type 4  Tto3  13.43     21.06       48.77     33.53       3.10      8.09
        Tto2  32.78     34.06       33.04     34.74       17.40     10.86
        Tto1  30.51     30.59       30.29     30.93       13.59     10.04
Type 3  Tto2  39.82     25.26       39.27     25.56       42.86     14.85
        Tto1  27.95     21.41       27.48     20.97       21.35     13.42
Type 2  Tto1  39.22     24.08       36.44     24.23       53.80     15.68

Table 4
Results of predicting congestion duration with updating information.
              RF                    SVM                   NN
              MAPE (%)  RMSE (min)  MAPE (%)  RMSE (min)  MAPE (%)  RMSE (min)
Type 4  Tto2  7.62      9.55        11.60     13.26       9.33      7.62
        Tto1  3.05      5.50        8.74      12.79       5.59      6.05
Type 3  Tto1  5.43      8.51        6.42      10.62       8.70      7.39

Table 5
Predicting assessment criteria.
Absolute difference (min)  Assessment
≤10                        Highly accurate predicting
10–20                      Good predicting
20–30                      Reasonable predicting
> 30                       Inaccurate predicting

Table 6
Performance comparison with previous study.
Literature             MAPE range
Wei and Lee (2007)     35–45%
Pereira et al. (2013)  40–100%
Li et al. (2015)       45.4–185.7%
Ghosh et al. (2018)    20–100.9%
Fu et al. (2019)       37.16–96.38%
Our work               5.5–53.8%

5.3. Influencing factor analysis

Since different independent variables have different effects on the prediction, it is meaningful to identify the factors that have great impact for further research. Taking the RF models in the previous section for analysis, the top 5 features and their corresponding importance scores are shown in Table 7.

Table 7
Top 5 feature importance of different RF models.
Random forest model           Feature 1    Feature 2    Feature 3    Feature 4    Feature 5
Type 4 – N4                   h4 0.6733    h3 0.1500    c3 0.0235    b2 0.0182    d2 0.0147
Type 4 – N3                   h3 0.5163    h4 0.2253    c3 0.1049    b2 0.0311    d2 0.0147
Type 4 – N2                   h2 0.3777    h3 0.2823    h4 0.1764    a6 0.0433    a5 0.0237
Type 4 – Tto3                 h4 0.1269    c3 0.1059    d3 0.1035    g2 0.0926    a7 0.0815
Type 4 – Tto2                 a7 0.1014    h3 0.0971    g2 0.0967    c3 0.0846    g3 0.0775
Type 4 – Tto2 – with updates  Tto3 0.2906  a7 0.0730    a6 0.0643    c3 0.0582    c2 0.0497
Type 4 – Tto1                 h4 0.2492    a8 0.1593    a7 0.1391    c2 0.0913    h3 0.0646
Type 4 – Tto1 – with updates  Tto2 0.8939  Tto3 0.0902  c3 0.0135    a7 0.0024
Type 3 – Tto2                 c3 0.1263    g3 0.0906    d2 0.0869    g2 0.0866    h3 0.0778
Type 3 – Tto1                 h3 0.1181    h2 0.1047    g3 0.0968    c3 0.0951    b2 0.0767
Type 3 – Tto1 – with updates  Tto2 0.3613  d3 0.0779    a7 0.0615    c3 0.0524    g3 0.0497
Type 2 – Tto1                 d3 0.1034    c3 0.1012    g2 0.0865    d2 0.0779    g3 0.0759

For qualitative models, we find that h (congestion level before accident) dominates the prediction. This is not surprising, since the most congested level that the accident may reach depends on the former traffic condition. As a supplement, c3 (whether it is peak hour) and a6 (whether the accident occurs on an expressway) also have an effect. h (congestion level before accident) is still important in duration prediction without updates, but other factors also contribute to the prediction. As can be clearly seen in Table 7, a (classification of roads), c (peak hour or not), d (lane location) and g (accident type) make prominent contributions to the model, while weather condition has almost no impact. When it comes to the sequential prediction, the newly added variables dominate the prediction instead.

6. Conclusion

Emerging crowdsourcing data provides a new source for the analysis of traffic accidents, and data fusion with real-time traffic condition information derived from floating car data provides a basis for this

Fig. 6. Absolute difference assessment
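
The assessment criteria of Table 5 and the proportions plotted in Fig. 6 can be sketched as a simple binning of the absolute prediction difference. Table 5 does not say which bin a boundary value such as exactly 20 min belongs to, so the upper-inclusive bins below are an assumption; the function names and the sample durations are hypothetical.

```python
from collections import Counter


def assess(predicted, actual):
    """Map an absolute prediction difference in minutes to a Table 5 category."""
    diff = abs(predicted - actual)
    if diff <= 10:
        return "Highly accurate predicting"
    if diff <= 20:  # assumption: boundaries belong to the lower bin
        return "Good predicting"
    if diff <= 30:
        return "Reasonable predicting"
    return "Inaccurate predicting"


def proportions(predicted, actual):
    """Share of predictions falling in each category."""
    counts = Counter(assess(p, t) for p, t in zip(predicted, actual))
    total = len(actual)
    return {label: count / total for label, count in counts.items()}


# Hypothetical durations in minutes, for illustration only.
actual = [40, 60, 75, 90]
predicted = [45, 78, 50, 125]
shares = proportions(predicted, actual)
print(shares)
```

With these sample values the absolute differences are 5, 18, 25 and 35 min, so each of the four categories receives one quarter of the predictions.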


work; this paper applies the machine learning method to predict the duration of different traffic condition states after traffic accidents and considers the updating effect by adding newly acquired data to the prediction. On a real-time basis, it ultimately aims to depict the evolution of surrounding traffic conditions after the occurrence of a traffic accident. On an offline basis, it can be used to assess the robustness of the transportation system when traffic accidents happen. Specifically, we first extract UGCD provided by the users of navigation apps. After filtering and calibrating UGCD into a set of traffic accidents, we can consecutively generate the features for every accident. Based on the geographical information of the accidents, we map each accident to the corresponding traffic flow data with the same spatiotemporal feature, which can also be obtained from navigation apps. Additional related information can also be extracted in order to explore other factors that might affect TAPI. Preprocessing steps such as screening of raw data, selection of independent variables, definition of dependent variables, etc. are carried out to describe accident information and congestion.
Using the congestion delay index, which is commonly used in current traffic guidance systems, the TAPI is divided into four levels: severely congested, congested, slow moving and uncongested. Based on the occurrence of each traffic condition level, the accident can be divided into four types. Predicting the recovering duration of each level for every type of accident can give an overall description of the persistence of congestion. RF, SVM and NN algorithms are applied to model and predict the congestion state and duration. The NN model has a better performance in most cases, especially when considering the absolute difference between the predicted duration and the actual duration. Moreover, the precision improves significantly over time as updated information is involved, regardless of the exact algorithm that is embedded. This result has great reference value for the prediction of real-time traffic conditions, for it can effectively guide the traffic participants to avoid relevant congestion sections.
There are several limitations in this study, which can be improved in the future: (1) the model in this study only considers a small subset of factors that may impact TAPI and thus has a limited degree of accuracy. The accident information recorded by crowdsourcing data is not as comprehensive as data obtained by traditional methods. For example, the specific types of accidents, the number of lanes and vehicles involved, which may affect the congestion time, are not normally recorded in a crowdsourcing dataset. In the future, it is possible to consider increasing the information of the network reporting system so that the state of the accident can be more accurately described. (2) Due to the large differences in traffic conditions at different times in different regions, models made by using a certain data sample are not applicable to all road conditions. Portability of the enhanced model can be considered in the future. From this study, we can see the feasibility of using crowdsourcing data to predict congestion caused by traffic accidents. With further development and enrichment in the future, crowdsourcing data can provide more accurate support for post-accident congestion prediction by integrating data from diverse sources.

Author statement

Yunduan Lin: conceptualization, methodology, software, investigation, writing-original draft preparation, writing-reviewing and editing. Ruimin Li: funding acquisition, supervision, investigation, writing-reviewing and editing, data curation, investigation, validation.

Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors gratefully acknowledge assistance with accident data and floating car data from the concerned institution. The research reported in this paper is part of the Project supported by the National Natural Science Foundation of China (71871123). The financial support is highly appreciated.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.aap.2020.105696.

References

Agarwal, S., Kachroo, P., Regentova, E., 2016. A hybrid model using logistic regression and wavelet transformation to detect traffic incidents. IATSS Res. 40 (1), 56–63.
Alkaabi, A.M.S., Dissanayake, D., Bird, R., 2011. Analyzing clearance time of urban traffic accidents in Abu Dhabi, United Arab Emirates, with hazard-based duration modeling method. Transp. Res. Rec. (2229), 46–54.
Al-Najada, H., Mahgoub, I., 2017. Real-time incident clearance time prediction using traffic data from internet of mobility sensors. Proceedings of the 2017 IEEE 15th Intl. Conf. on Dependable, Autonomic and Secure Computing, 15th Intl. Conf. on Pervasive Intelligence and Computing, 3rd Intl. Conf. on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech) 728–735.
Amin-Naseri, M., Chakraborty, P., Sharma, A., Gilbert, S.B., Hong, M., 2018. Evaluating the reliability, coverage, and added value of crowdsourced traffic incident reports from Waze. Transp. Res. Rec. 2672 (43), 34–43.
Beheshti-Kashi, S., Buch, R., Lachaize, M., Kinra, A., 2018. Big textual data in transportation: an exploration of relevant text sources. Proceedings of the International Conference on Dynamics in Logistics 395–399.
Chaniotakis, E., Antoniou, C., Pereira, F., 2016. Mapping social media for transportation studies. IEEE Intell. Syst. 31 (6), 64–70.
Chung, Y., 2010. Development of an accident duration prediction model on the Korean freeway systems. Accid. Anal. Prev. 42 (1), 282–289.
Chung, Y., Yoon, B.J., 2012. Analytical method to estimate accident duration using archived speed profile and its statistical analysis. KSCE J. Civil Eng. 16 (6), 1064–1070.
Cohen, S., Nouveliere, C., 1997. Modelling incident duration on an urban expressway. In: Papageorgiou, M., Pouliezos, A. (Eds.), Transportation Systems 1997, Vols. 1–3. Pergamon Press Ltd, Oxford, pp. 297–301.
Fu, K., Ji, T., Zhao, L., Lu, C.-T., 2019. Titan: a spatiotemporal feature learning framework for traffic incident duration prediction. Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems 329–338.
Ghosh, B., Asif, M.T., Dauwels, J., Fastenrath, U., Guo, H., 2018. Dynamic prediction of the incident duration using adaptive feature set. IEEE Trans. Intell. Transp. Syst. 20 (11), 4019–4031.
Giuliano, G., 1989. Incident characteristics, frequency, and duration on a high volume urban freeway. Transp. Res. Part A – Policy Pract. 23 (5), 387–396.
Golob, T.F., Recker, W.W., Leonard, J.D., 1987. An analysis of the severity and incident duration of truck-involved freeway accidents. Accid. Anal. Prev. 19 (5), 375–395.
Gu, Y.M., Qian, Z., Chen, F., 2016. From twitter to detector: real-time traffic incident detection using social media data. Transp. Res. Part C Emerg. Technol. 67, 321–342.
Hasan, S., Ukkusuri, S.V., 2014. Urban activity pattern classification using topic models from online geo-location data. Transp. Res. Part C Emerg. Technol. 44, 363–381.
Hojati, A.T., Ferreira, L., Washington, S., Charles, P., 2013. Hazard based models for freeway traffic incident duration. Accid. Anal. Prev. 52, 171–181.
Jones, B., Janssen, L., Mannering, F., 1991. Analysis of the frequency and duration of freeway accidents in Seattle. Accid. Anal. Prev. 23 (4), 239–255.
Khattak, A., Wang, X., Zhang, H., 2012. Incident management integration tool: dynamically predicting incident durations, secondary incident occurrence and incident delays. IET Intell. Transp. Syst. 6 (2), 204–214.
Khattak, A.J., Schofer, J.L., Wang, M.H., 1995. A simple time-sequential procedure for predicting freeway incident duration. IVHS J. 2 (2), 113–138.
Kim, W., Chang, G.L., 2012. Development of a hybrid prediction model for freeway incident duration: a case study in Maryland. Int. J. Intell. Transp. Syst. Res. 10 (1), 22–33.
Kuang, L., Yan, H., Zhu, Y., Tu, S., Fan, X., 2019. Predicting duration of traffic accidents based on cost-sensitive Bayesian network and weighted k-nearest neighbor. J. Intell. Transp. Syst. 23 (2), 161–174.
Li, R.M., Pereira, F.C., Ben-Akiva, M.E., 2015. Competing risks mixture model for traffic incident duration prediction. Accid. Anal. Prev. 75, 192–201.
Li, R.M., Pereira, F.C., Ben-Akiva, M.E., 2018. Overview of traffic incident duration analysis and prediction. Eur. Transp. Res. Rev. 10 (2), 13.
Lin, L., Wang, Q., Sadek, A.W., 2016. A combined m5p tree and hazard-based duration model for predicting urban freeway traffic accident durations. Accid. Anal. Prev. 91, 114–126.
Ma, X.L., Ding, C., Luan, S., Wang, Y., Wang, Y.P., 2017. Prioritizing influential factors for freeway incident clearance time prediction using the gradient boosting decision trees method. IEEE Trans. Intell. Transp. Syst. 18 (9), 2303–2310.
Nam, D., Mannering, F., 2000. An exploratory hazard-based analysis of highway incident duration. Transp. Res. Part A Policy Pract. 34 (2), 85–102.
Nguyen, H., Liu, W., Rivera, P., Chen, F., 2016. Trafficwatch: real-time traffic incident detection and monitoring using social media. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J.Z., Wang, R. (Eds.), Advances in Knowledge Discovery and Data Mining, PAKDD 2016, Pt I. Springer-Verlag Berlin, Berlin, pp. 540–551.
Pereira, F.C., Rodrigues, F., Ben-Akiva, M., 2013. Text analysis in incident duration prediction. Transp. Res. Part C Emerg. Technol. 37, 177–192.
Perez, G.V.A., Lopez, J.C., Cabello, A.L.R., Grajales, E.B., Espinosa, A.P., Fabian, J.L.Q., 2018. Road traffic accidents analysis in Mexico city through crowdsourcing data and data mining techniques. Int. J. Comput. Inform. Eng. 12 (8), 604–608.
Qi, Y., Teng, H.L., 2008. An information-based time sequential approach to online incident duration prediction. J. Intell. Transp. Syst. 12 (1), 1–12.
Rashidi, T.H., Abbasi, A., Maghrebi, M., Hasan, S., Waller, T.S., 2017. Exploring the capacity of social media data for modelling travel behaviour: Opportunities and challenges. Transp. Res. Part C Emerg. Technol. 75, 197–211.
Shang, Q., Tan, D., Gao, S., Feng, L., 2019. A hybrid method for traffic incident duration prediction using boa-optimized random forest combined with neighborhood components analysis. J. Adv. Transp. 2019.
Shi, Q., Abdel-Aty, M., 2015. Big data applications in real-time traffic operation and safety monitoring and improvement on urban expressways. Transp. Res. Part C: Emerg. Technol. 58, 380–394.
Shi, Q., Abdel-Aty, M., Lee, J., 2016. A Bayesian ridge regression analysis of congestion's impact on urban expressway safety. Accid. Anal. Prev. 88, 124–137.
Skabardonis, A., Varaiya, P., Petty, K.F., Trb, 2003. Measuring Recurrent and Nonrecurrent Traffic Congestion. Freeways, High-Occupancy Vehicle Systems, and Traffic Signal Systems 2003: Highway Operations, Capacity, and Traffic Control. Transportation Research Board Natl Research Council, Washington, pp. 118–124.
Wang, S., He, L., Stenneth, L., Philip, S.Y., Li, Z., Huang, Z., 2013. Estimating urban traffic congestions with multi-sourced data. Proceedings of the 2016 17th IEEE International Conference on Mobile Data Management (MDM) 82–91.
Wei, C.H., Lee, Y., 2007. Sequential forecast of incident duration using artificial neural network models. Accid. Anal. Prev. 39 (5), 944–954.
Yang, F., Jin, P.J., Cheng, Y., Zhang, J., Ran, B., 2015. Origin-destination estimation for non-commuting trips using location-based social networking data. Int. J. Sustain. Transp. 9 (8), 551–564.
Yu, B., Wang, Y.T., Yao, J.B., Wang, J.Y., 2016. A comparison of the performance of ANN and SVM for the prediction of traffic accident duration. Neural Network World 26 (3), 271–287.
Zhang, Z.H., He, Q., Gao, J., Ni, M., 2018. A deep learning approach for detecting traffic accidents from social media data. Transp. Res. Part C Emerg. Technol. 86, 580–596.
Zou, Y., Henrickson, K., Lord, D., Wang, Y., Xu, K., 2016. Application of finite mixture models for analysing freeway incident clearance time. Transportmetrica A: Transp. Sci. 12 (2), 99–115.
Zou, Y., Ye, X., Henrickson, K., Tang, J., Wang, Y., 2018. Jointly analyzing freeway traffic incident clearance and response time using a copula-based approach. Transp. Res. Part C Emerg. Technol. 86, 171–182.
