
Analyzing Autonomous Vehicle Collision Types to Support Sustainable Transportation Systems: A Machine Learning and Association Rules Approach

Ehsan Kohanpour, Seyed Rasoul Davoodi ^1,* and Khaled Shaaban

Department of Civil Engineering, Golestan University, Gorgan 49361-79142, Iran

Department of Mechanical and Civil Engineering, Utah Valley University Orem, Orem, UT 84058, USA

* Author to whom correspondence should be addressed.

Sustainability 2024, 16(22), 9893; https://doi.org/10.3390/su16229893

Submission received: 3 September 2024 / Revised: 7 November 2024 / Accepted: 8 November 2024 / Published: 13 November 2024

(This article belongs to the Special Issue Innovations and Policies Shaping Sustainable Transportation Engineering)

Figure 1. Conceptual framework. Process of crash data extraction to modeling.
Figure 2. The heat map of AV crashes in the test areas.
Figure 3. The sample OL-316 form for the AV collision report provided by the CA DMV. (a) First page of form OL-316; (b) Second page of form OL-316; (c) Third page of form OL-316.
Figure 4. Word cloud of points of interest with the highest number of crashes.
Figure 5. Descriptive statistics of CA DMV data as of 31 December 2023.
Figure 6. Descriptive statistics of CA DMV data. (a) Types of ADS disengagement; (b) Type of intersection at the collision site; (c) Intersection with traffic signals; (d) Types of AV collisions; (e) AV driving mode; (f) Collision severity.
Figure 7. Decision tree for classification and regression for the variable of collision type.
Figure 8. Association rules bubble chart.
Figure 9. Variable importance for collision type using XGB, CART, and RF algorithms.
Figure 10. Feature importance with SHAP. (a) Impact on model output; (b) Average impact on model output.

Abstract

The increasing presence of autonomous vehicles (AVs) in transportation, driven by advances in AI and robotics, requires a strong focus on safety in mixed-traffic environments to promote sustainable transportation systems. This study analyzes AV crashes in California using advanced machine learning to identify patterns among various crash factors. The main objective is to explore AV crash mechanisms by extracting association rules and developing a decision tree model to understand interactions between pre-crash conditions, driving states, crash types, severity, locations, and other variables. A multi-faceted approach, including statistical analysis, data mining, and machine learning, was used to model crash types. The SMOTE method addressed data imbalance, with models like CART, Apriori, RF, XGB, SHAP, and Pearson’s test applied for analysis. Findings reveal that rear-end crashes are the most common, making up over 50% of incidents. Side crashes at night are also frequent, while angular and head-on crashes tend to be more severe. The study identifies high-risk locations, such as complex unsignalized intersections, and highlights the need for improved AV sensor technology, AV–infrastructure coordination, and driver training. Technological advancements like V2V and V2I communication are suggested to significantly reduce the number and severity of specific types of crashes, thereby enhancing the overall safety and sustainability of transportation systems.

Keywords: autonomous vehicles; sustainable transportation; transportation safety; collision types

1. Introduction

Transportation automation is increasingly recognized as a crucial solution to enhancing traffic safety, mitigating environmental impacts, alleviating congestion, and improving mobility. Particularly, when roads are fully equipped with highly automated vehicles (SAE level 4 and above) [1], a significant reduction in crash frequency is expected due to the elimination of human-related avoidable errors [2]. Additionally, it is anticipated that the traffic flow with autonomous vehicles will become more stable [1]. As autonomous vehicles will be capable of transporting individuals with minimal intervention or without the need for a driver, it is expected that they will provide more mobility options for individuals with physical disabilities and elderly individuals [3,4]. As part of the European Union’s initiatives, the world’s first technical regulation was enacted in 2022, allowing member countries to approve the registration and sale of a limited number of highly automated vehicles at SAE level 4 [5]. According to the law, car manufacturers must equip vehicles with advanced emergency braking systems capable of detecting other vehicles. This requirement is expected to be extended to include cyclists and pedestrians starting in 2024 (as of 15 October 2024, no information has been found regarding the implementation or non-implementation of this regulation). Since safety is a primary concern in the deployment of AVs in complex traffic environments, acquiring knowledge and learning from reported AV crashes is essential.

Since 2014, the California Department of Motor Vehicles (CA DMV) has permitted AV testing on public roads through its Autonomous Vehicle Testing (AVT) program, on the condition that a safety driver is present and that crashes are reported via the standard form OL316. Since 2018, to encourage innovation and to gain a better understanding of the capabilities and limitations of autonomous vehicles, permits have also been issued for testing AVs without a human driver in the driver’s seat. The AVs tested in California are mostly passenger vehicles operating under conditional automation (SAE level 3) and, more recently, high automation (level 4) [6]. Under this program, all crashes (any collision resulting in financial damage, bodily injury, or death) must be reported to the DMV within 10 days of the incident, together with disengagements of automated driving systems (ADSs) and vehicle miles traveled (VMT). As of February 2024, 676 crashes involving AVs being tested on public roads in California had been reported [7].

One of the critical challenges in transportation automation is the interaction between autonomous vehicles (AVs) and human-driven vehicles in mixed traffic. This challenge causes uncertainty regarding traffic safety until a 100% penetration rate of AVs is achieved in the market [8]. A previous study indicated that the deployment of AVs may slightly increase crashes in roundabouts [9], and another study showed that the low penetration rate of AVs with adaptive cruise control (ACC) technologies may slightly jeopardize traffic stability at intersections [10].

With the increase in casualties from autonomous vehicle crashes, AVs are at risk of losing their deployment potential in mixed traffic [11]. The impact of autonomous vehicles on the type and severity of crashes remains ambiguous [10], and the factors affecting the type and severity of AV crashes, especially at night and in adverse weather conditions, are not clearly defined [12,13]. Moreover, most existing autonomous vehicle control systems lack an effective mechanism to cope with severe weather conditions [14,15]. Previous studies on autonomous vehicles have mainly focused on routing [16], technology acceptance and adoption [17], the impact of AVs on pedestrian safety [18], and traffic impact [19], but only a limited number of studies have examined the crash mechanisms of autonomous vehicles [12,20], the relationship between land use and AV crashes [12], crashes involving driverless vehicles, and collisions between two AVs. The lack of crash data is one of the main obstacles [11].

To fill this gap, this study employs a multi-pronged methodological approach. First, the researchers transformed and structured AV crash report forms into a format that can be processed and analyzed. This involved extracting key data elements such as crash location, vehicle characteristics, environmental conditions, and collision types from the standardized forms.

Next, the study conducted extensive data mining, statistical analysis, and machine learning to uncover the key interrelationships between various factors and the different types of crashes involving autonomous vehicles. Specifically, the researchers utilized association rule mining (Apriori algorithm), decision tree analysis (CART), and cross-tabulation methods to identify significant patterns and relationships in the data.

By evaluating the current status of AVs in terms of safety and identifying the factors necessary to ensure acceptable levels of safe performance, especially in mixed-traffic environments, this study aims to contribute to the growing body of knowledge on autonomous vehicle crash dynamics. Moreover, identifying the existing trends and patterns in injury severity can significantly inform the implementation of safety measures and traffic control strategies to enhance the safety of mixed-traffic environments with both human-driven vehicles and AVs.

2. Literature Review

Research in the AV field is growing rapidly, as this technology has become a part of public transportation and contributes significantly to the shaping of driving regulations and the enhancement of road safety. Studies on safety reveal that human error is responsible for more than 90% of transportation crashes [21]. The annual cost of injury and fatality crashes in the United States alone amounts to approximately $900 billion. Both the government and private industries are dedicated to implementing measures aimed at significantly reducing casualties, serious injuries, and damages caused by vehicles and road systems in the United States. Many believe that the elimination of drivers through vehicle automation is a viable approach to substantially decreasing casualties and injuries [22,23] and can propel society toward realizing the vision of zero crashes. This study investigates crashes involving self-driving cars operating under partial to high automation (SAE levels 2 to 4). In this section, a concise explanation of the automation levels of AVs is provided based on SAE definitions.

SAE defines six levels of driving automation, from level 0 (“no automation”) to level 5 (“full automation”) [24,25], with the intermediate stages “driver assistance” (level 1), “partial automation” (level 2), “conditional automation” (level 3), and “high automation” (level 4). This study focuses on crashes involving AVs operating from partial automation (level 2) to high automation (level 4) according to SAE International (2018). At the lower of these levels, the vehicle can handle some driving tasks, but the driver needs to be prepared to intervene when needed; at high automation (level 4), an AV performs driving functions without a driver present in the seat, with oversight provided remotely by an operator [6].

The prediction and estimation of traffic crashes and the identification of factors influencing such events have been a prominent research topic for decades. The traditional approach to modeling non-AV crashes involves the development of statistical models to identify the primary causes of crashes [26]. In addition to statistical modeling approaches [23,27,28], researchers have also utilized various machine learning (ML) methods and neural networks for analyzing traffic crashes [2,20,29]. Furthermore, some studies have employed non-parametric methods such as decision trees (DTs) or association rules to identify influential factors in crashes. For instance, Zhang and Zhu utilized association rule analysis to identify patterns of various types of crashes, including side collisions, angle collisions, and rear-end collisions [30]. Ashraf et al. in 2021 used decision tree classification and regression (CART) models and association rules to create rules for different types of AV crashes [2], while Wang et al. in 2016 used association rules to extract patterns of ordinary car crashes [31]. De et al. in 2013 used tree-based methods to analyze crash severity and types of ordinary vehicle crashes [32].

Several studies have analyzed various types of AV-related collisions using crash data from California. AVs are being tested on various road types, including freeways/highways, arterials, collectors, and local roads. AV crashes have occurred across all functional road classes. Arterial roads have the highest (60%) crash rates [33], and intersections are hotspots for AV crashes [2,34,35]. It has been revealed that rear-end collisions are the most common (60%) type of AV crashes [36,37,38]. Favaro et al. in 2017 examined reports of 26 crashes that occurred in California between 2014 and 2017. The study reported that the majority of AV-involved crashes (62%) were rear-end collisions, where an AV was struck by another vehicle. Approximately 60% of these cases involved AVs experiencing low-speed impacts (below 16 km per hour) [39]. Additionally, Wang and Li in 2019 analyzed 133 AV crash records from California and demonstrated that the driving mode of AVs significantly influences the type of collision. Among the collision types, rear-end collisions accounted for the highest proportion (69%), followed by side collisions (22%), angle collisions (9%), and off-road collisions (8%) [40]. Boggs et al. in 2020 evaluated influential factors in AV crashes using text analysis and hierarchical Bayesian logistic models on AV crash data collected between 2014 and 2018. The study indicated that driving an AV in conventional mode in mixed-use areas increases the likelihood of rear-end collisions, while the same type of crash is less likely near public/private schools. In another study [10], Ashraf et al. utilized decision tree models and Apriori to extract association rules for various types of collisions and found that 62% of all crashes were rear-end collisions, which primarily occurred at intersections. The generated rules suggested that rear-end collisions were more prevalent when the autonomous vehicle (AV) was operating in autonomous mode. 
Furthermore, most rear-end collisions happened when the AV was stationary at intersections, and non-AVs were either proceeding straight or decelerating behind the AV at intersections [2]. Lee et al. in 2023 used statistical models and found that AV performance in dealing with following vehicles to prevent rear-end collisions, especially in autonomous driving mode, needs improvement. When comparing manual disengagement and conventional mode, there is no significant difference in the likelihood of rear-end collisions. AVs should interact better with vehicles changing lanes around them when the AVs are proceeding straight. Collisions involving a left turn by an AV combined with straight movement of the second vehicle have a positive relationship with injury crashes [8]. Wu et al. in 2023 used a joint modeling approach based on Bayesian networks (BNs) to describe factors related to three safety performance indices of AVs, namely (1) AV fault, (2) collision types, and (3) consequences of AV crashes, and identified rear-end collisions and unexpected behaviors of other road users as the most significant risk factors in operational ADS mode [41].

Research utilizing California DMV autonomous vehicle (AV) crash data has also focused on comparing the safety performance of AVs and conventional vehicles. A study by Alambeigi et al. in 2020 employed a thematic modeling analysis of 167 AV crashes and identified five themes, or types of crashes: transfer crashes initiated by the driver, side collisions while turning, and rear-end collisions while the vehicle was stopped at an intersection, in a traffic lane, or facing oncoming traffic [35]. Previous research on AV crashes using CA DMV data has mostly focused on exploratory data analysis to find relationships between various contributing factors with statistical models and descriptive statistics. To more accurately identify potential risks for full deployment programs in mixed traffic, previous research has utilized a wide range of approaches in AV crash analysis, such as descriptive statistics and Bayesian inference [36,39], data mining methods such as machine learning models [20,42], classification and regression trees (CART), and deep neural networks capable of detecting patterns within observations. Table 1 presents some of the most important relevant studies, along with the methods and data used.

The existing research on autonomous vehicle (AV) collisions has predominantly relied on relatively small sample sizes and methodologies that may yield misleading results, such as descriptive statistics and basic correlation analyses [2,8]. These limitations can obscure the complex interrelationships among various factors influencing AV crashes, leading to conclusions that may not accurately reflect real-world dynamics. In contrast, the current study addresses this gap by utilizing a significantly larger dataset, encompassing 606 AV crashes from California and employing a comprehensive methodological framework that integrates statistical methods, data mining, and advanced machine learning techniques. Furthermore, to obtain more accurate conclusions, research in this area should be conducted at regular intervals.

This multi-faceted approach not only enhances the reliability of the findings but also minimizes the potential for researcher bias in the interpretation of results. By systematically analyzing the data through models such as CART, Apriori, and XGBoost v2.1, this study provides a more nuanced understanding of the mechanisms behind AV collisions, thereby contributing valuable insights to the field. Ultimately, the methodology introduced in this research serves as a robust model for future studies, paving the way for more accurate assessments of AV safety and performance in mixed-traffic environments.

3. Materials and Methods

3.1. Conceptual Framework

This research involves two primary objectives, namely (1) transforming and structuring AV crash report forms into a format that can be processed and (2) conducting data mining for association rules, statistical analysis, and machine learning, as depicted in Figure 1.

As depicted in the conceptual framework of Figure 1, the models utilized in this study encompass machine learning models, data mining models, and statistical models. Each AV crash results from a combination of various influencing factors, and each crash can be represented as a logical combination of factors in the form of conditional statements, known as crash rules. Given the relatively small sample size of the AV crash database, this study employs non-parametric models such as decision trees and association rule mining with Apriori to extract crash rules [2,12]. The results obtained from non-parametric models offer a clearer and more direct view of the relationships among factors, facilitating the analysis of self-driving car crash mechanisms, the identification of variable patterns and trends, and the development of suitable mitigation measures to enhance AV safety performance during the critical evaluation and early deployment phase [2]. Imbalanced data are addressed using SMOTE (version 0.9.1), and machine learning models including random forests, XGBoost (version 1.7.0), and decision trees are employed to assess variable importance in modeling. Moreover, SHapley Additive exPlanations (SHAP) (version 0.41.0), a model explanation method, is utilized to evaluate and interpret the importance of variables in the machine learning models. In all algorithms, the training set constitutes 70% of the dataset, while the test set comprises the remaining 30%. The models and results are obtained using the mlxtend (version 0.19.0), Scikit-learn (version 1.0.2), pandas (version 1.3.5), numpy (version 1.21.6), and apriori libraries, and the programming language used is Python 3.9. Further explanation of the theoretical foundations of these models and evaluation criteria is provided in this section.
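To make the 70/30 split and the SMOTE step concrete, the following is a minimal sketch using only the Python standard library. It is not the imbalanced-learn implementation used in the study: `stratified_split` and `smote_like` are hypothetical helper names, the toy data are invented, and oversampling only the training portion is an assumption made here for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(y, test_frac=0.3, seed=0):
    """Return (train_idx, test_idx), keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        test_idx += idxs[:n_test]
        train_idx += idxs[n_test:]
    return train_idx, test_idx

def smote_like(X, y, k=3, seed=0):
    """Oversample each minority class up to the majority-class count by
    interpolating between a sample and one of its k nearest same-class rows."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        if len(rows) < 2:
            continue  # a single row has no neighbour to interpolate towards
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            neighbours = sorted(
                (r for r in rows if r is not base),
                key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
            )[:k]
            other = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> other
            X_out.append([a + gap * (b - a) for a, b in zip(base, other)])
            y_out.append(label)
    return X_out, y_out

# Invented toy data: 14 rear-end and 6 sideswipe records with two numeric features.
X = [[float(i), float(i % 3)] for i in range(20)]
y = ["rear_end"] * 14 + ["sideswipe"] * 6

train_idx, test_idx = stratified_split(y, test_frac=0.3)
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
X_bal, y_bal = smote_like(X_train, y_train)
print(Counter(y_train), Counter(y_bal))
```

Keeping the test portion untouched means that synthetic rows never leak into evaluation; whether the study applied SMOTE before or after splitting is not stated in the text.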

3.2. Data Collection

3.2.1. Study Area

This research utilizes AV crash reports sourced from the California Department of Motor Vehicles (CA DMV) [7]. The crashes took place in 28 cities within California; however, based on the data distribution, three main areas for investigation can be identified, namely San Francisco, Mountain View, and Los Angeles, as illustrated in Figure 2. The majority of AV technology companies in California operate in areas such as San Francisco, Mountain View, Santa Clara, San Jose, and Los Angeles [43].

3.2.2. Population and Sample

Given the information presented thus far, the study population comprises all self-driving cars registered in California. The study sample encompasses all data pertaining to self-driving car crashes from September 2014 to September 2023. In the context of self-driving car crash data, we treat the population and sample sizes as equivalent, signifying that all accessible data in the population are utilized for analysis. This approach is suitable when precise data analysis is necessary.

3.2.3. Data Sources

Even though several states in the United States are conducting experiments with autonomous vehicles on public roads [40], the most trustworthy and comprehensive source of data on AV crashes is the OL316 report forms provided by the California Department of Motor Vehicles, which are publicly accessible [7]. These report forms contain information regarding the type of collision, crash location, injury severity, and scene details such as weather, lighting conditions, road surface conditions, pre-crash conditions, and other relevant factors. The crash reports can be obtained upon request. The complete dataset, designed to serve as a research resource for future scholars, is available for download in xlsx and csv formats via our GitHub repository at https://github.com/kohanpour1/AVs (accessed on 1 November 2024).

Crash Data

The current analysis includes 606 AV crash data reports from September 2014 to September 2023 [7], which were extracted and converted into a processable format. The frequency of crash variables is presented in the Results section.

Dependent Variable: Collision Types

This study considers six classes of collision type. Previous studies have used four types of collisions for analysis [2,8]: rear-end collisions with AVs, sideswipe collisions, head-on collisions with AVs, and broadside collisions. In our data, because of the increase in the number of crashes involving hitting an object, this type of collision has been added as a fifth class. Additionally, collisions involving humans and rollovers are grouped into a sixth class, “Other” [7].

In this study, PDF forms are first converted into a processable format, and then crash location information, points of interest, and other variables are extracted based on the conceptual model in Figure 1. Panels (a), (b), and (c) of Figure 3 show the three pages of the AV collision report form from the California Department of Motor Vehicles (CA DMV), which provides essential data regarding the performance and safety of autonomous vehicles. These reports serve as a critical source of information for analyzing trends and patterns related to AV incidents.

Extraction of Land Use Data and Secondary Variables

Next, land use information, represented by points of interest (POIs) such as restaurants, clinics, and banks, was obtained from OpenStreetMap using the requests library in the Python programming language. Subsequently, the nearest point of interest to the crash coordinates was extracted and included in the data as the POI variable. Alongside POIs, additional variables such as season, day of the week, morning and evening rush hours, nighttime (9 p.m. to 5 a.m.), weekends, and public holidays were incorporated into the data to better detect underlying variations. The points of interest were categorized into four groups: public use, transportation use, administrative use, and residential use [12].
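The nearest-POI lookup and the derived time variables can be sketched as follows. This is only an illustration under stated assumptions: the POI list is assumed to have been downloaded already (the OpenStreetMap request itself is omitted), the coordinates and POI entries are invented, great-circle distance is an assumed choice of metric, and `haversine_m`, `nearest_poi`, and `time_flags` are hypothetical names. Only the nighttime window (9 p.m. to 5 a.m.) is taken from the text.

```python
import math
from datetime import datetime

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_poi(crash_lat, crash_lon, pois):
    """Return the POI dict closest to the crash coordinates."""
    return min(pois, key=lambda p: haversine_m(crash_lat, crash_lon, p["lat"], p["lon"]))

def time_flags(ts):
    """Derive secondary variables from the crash timestamp."""
    return {
        "night": ts.hour >= 21 or ts.hour < 5,  # 9 p.m. to 5 a.m., as in the text
        "weekend": ts.weekday() >= 5,           # Saturday = 5, Sunday = 6
        "day_of_week": ts.strftime("%A"),
    }

# Invented example values, for illustration only.
pois = [
    {"kind": "restaurant", "lat": 37.7750, "lon": -122.4195},
    {"kind": "parking", "lat": 37.7800, "lon": -122.4300},
]
poi = nearest_poi(37.7749, -122.4194, pois)
flags = time_flags(datetime(2023, 9, 16, 22, 30))
print(poi["kind"], flags)
```

Rush-hour, season, and public-holiday flags would follow the same pattern, but their exact definitions are not given in the text, so they are left out here.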

Figure 4 illustrates the frequency of points of interest in the POI data. Parking, parking spaces, restaurants, places of worship, and schools are locations with high frequencies in the data.

3.2.4. Data Preprocessing

In order to obtain accurate and reliable results in any research, it is essential to start by cleaning the data, adjusting it to the specific requirements of the study, and performing necessary checks. This section focuses on data preprocessing, which involves identifying and addressing missing data, outliers, improperly scaled data, and examining variables for correlation and independence.

In the initial step, following the conversion of PDF reports to CSV format, reports with missing data and those that did not identify the key variables of this study were excluded, resulting in the removal of two reports. The remaining reports were supplemented based on the provided descriptions and an analysis of the accident locations, which were retrievable using the street view features of OpenStreetMap. Overall, due to the manual conversion process, the data were clean and did not require further processing.

The dataset should be prepared in a manner suitable for each algorithm utilized in the study, as certain algorithms require binary data, while others can handle categorical data. To assess multicollinearity, the variance inflation factor (VIF) is calculated for the independent variables [12]. Subsequently, the data are transformed into binary format to be compatible with machine learning algorithms. Various methods such as oversampling, undersampling, combined methods, and artificial data generation can be employed to balance the data. To balance the proportion of samples across the classes of the dependent variable, the SMOTE (version 0.9.1) method is utilized to increase the sample size of the minority classes by generating synthetic data. These methods are implemented using Python version 3.9 and libraries such as statsmodels, matplotlib, imbalanced-learn, and scipy.stats.

Analysis Methods

3.2.5. Theoretical Foundations of Classification and Regression Trees (CARTs)

The classification and regression tree (CART) model is an important non-parametric model in data mining and machine learning. The CART model is a popular decision tree method that does not require prior probabilistic knowledge and can be used for classification and regression modeling [47]. The main difference between decision tree construction methods lies in the splitting criteria during tree formation [32]. The CART model, developed by Breiman et al., is the most widely used approach for analyzing traffic crash data [48]. In this model, tree nodes (parent) are divided into sub-nodes (child) based on the threshold value of a variable. The CART model performs this by searching for the best homogeneity for the sub-nodes using the Gini index criterion [2]. The Gini index measures the impurity level in each node, and splitting continues until further division no longer increases node purity (homogeneity). The Gini index $\mathrm{Gini}(C_j)$ for variable $C_j$ (e.g., operational states of AVs: C) can be defined as Equation (1).

$$\mathrm{Gini}(C_j) = 1 - \sum_{j=1}^{k} p_{C_j}^{2}$$

(1)

Here, $p_{C_j}$ is the proportion of observations in the node that belong to category $j$, and $k$ is the number of categories.
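As a worked illustration of Equation (1), the sketch below computes the Gini index of a node and the size-weighted Gini index of a candidate split. The label counts are invented, and `gini` and `split_gini` are hypothetical helper names rather than functions from the libraries used in the study.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Gini index of a binary split, weighted by the size of each child node."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Invented parent node: 6 rear-end and 4 sideswipe crashes.
parent = ["rear_end"] * 6 + ["sideswipe"] * 4
# Candidate split, e.g. "AV stopped" (left) versus "AV moving" (right).
left = ["rear_end"] * 5 + ["sideswipe"] * 1
right = ["rear_end"] * 1 + ["sideswipe"] * 3

parent_gini = gini(parent)              # 1 - (0.6^2 + 0.4^2) = 0.48
children_gini = split_gini(left, right)
print(round(parent_gini, 4), round(children_gini, 4))
```

A split is worth making when the weighted Gini index of the children is lower than that of the parent, as it is in this toy example.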

The CART model presents a simple and hierarchical graphical structure, which is used to extract rules and identify the effects of different explanatory variables on the dependent variable (crash severity) [2,47]. Each node represents a variable, and each branch represents one of its conditions. The terminal node, known as the tree leaf, represents the expected value of the target variable. Each tree leaf, with all previous nodes connected to it, represents a rule in the form of “if A then B”, where A is the antecedent or independent variable and B is the consequent, representing the classes of the dependent variable. These rules and their associated probabilities can indicate trends and patterns leading to crashes, including property loss or bodily harm.

To filter out important rules, three parameters are used, namely support (S), population (Po), and probability (P) [2,47]. Due to data imbalance, the class weight parameter (w) can be used, with a typical value of 30% for binary decision tree crash severity. The minimum value of this parameter depends on the nature of the dataset [2,32].

To prevent overfitting, parameters such as the maximum number of leaves, minimum sample size for splitting, and maximum tree depth are used as pruning and stopping criteria.

3.2.6. Random Forest

The random forest (RF) model is a machine learning model used for classification and regression. In this model, several decision trees are randomly constructed, and their outputs are combined to form the final output of the model [49]. Random forests have hyperparameters similar to those of decision trees and bagging classifiers, and their splits are also evaluated using the Gini criterion [29]. In this study, the RF model is used to assess the importance of variables. In this approach, the importance of each variable is determined based on the amount of error reduction it produces in the decision trees [12].
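The idea of building several randomized trees and combining their outputs can be sketched with a deliberately simplified forest. The assumptions here are significant: each "tree" is only a depth-1 stump on one randomly chosen categorical feature, the data are invented, and `fit_stump`, `fit_forest`, and `predict` are hypothetical names. It is not the Scikit-learn random forest used in the study.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Depth-1 tree: predict the majority label seen for each feature value."""
    votes = {}
    for row, label in zip(rows, labels):
        votes.setdefault(row[feature], Counter())[label] += 1
    default = Counter(labels).most_common(1)[0][0]
    table = {value: c.most_common(1)[0][0] for value, c in votes.items()}
    return lambda row: table.get(row[feature], default)

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and a randomly chosen feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feature = rng.randrange(n_features)
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest

def predict(forest, row):
    """Combine the trees by majority vote."""
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]

# Invented records: (AV movement, location) -> collision type.
rows = [("stopped", "intersection"), ("stopped", "street"), ("moving", "street"),
        ("moving", "intersection"), ("stopped", "intersection"), ("moving", "street")]
labels = ["rear_end", "rear_end", "sideswipe", "sideswipe", "rear_end", "sideswipe"]

forest = fit_forest(rows, labels)
prediction = predict(forest, ("stopped", "intersection"))
print(prediction)
```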

3.2.7. XGBoost

The XGB model is an ensemble model: it combines several weaker models using the boosting method to improve the accuracy and performance of the algorithm. XGB is regarded as one of the strongest and most effective machine learning models. It is used for classification and regression problems and can create high-precision and overfitting-resistant predictive models [50]. Chen and Guestrin made some improvements based on gradient boosting and introduced XGBoost in 2016. XGB is used in the classification and prediction of dependent variables, such as crash severity, to assess the importance of variables [12]. The important parameters of this algorithm include the number of trees in the ensemble, i.e., the number of boosting iterations (n_estimators); the maximum depth of the trees (max_depth); the learning rate (learning_rate), which affects the training process; and the minimum loss reduction required to make a further split (gamma). XGB adds a regularization term to the cost function to suppress model complexity and prevent overfitting. This cost function consists of two parts: the first part is a loss function, usually an error function such as a squared error or logistic error, and the second part is a penalty function, which is used to control model complexity and prevent overfitting.

The new cost function $L_k'$ for the $k$-th iteration of the XGB algorithm can be expressed as Equation (2).

$$L_k' = \sum_{i=1}^{m} l\left(y^{(i)}, \hat{y}^{(i)}_{k-1} + f_k(x^{(i)})\right) + \Omega(f_k) + \sum_{j=1}^{T_{k-1}} \Omega\left(f_j^{(k-1)}\right)$$

(2)

Here, we have three main parts. The first part, $\sum_{i=1}^{m} l\left(y^{(i)}, \hat{y}^{(i)}_{k-1} + f_k(x^{(i)})\right)$, represents the total loss over the $m$ samples in the dataset. For each sample, it measures the difference between the actual label $y^{(i)}$ and the sum of the previous prediction $\hat{y}^{(i)}_{k-1}$ and the new prediction $f_k(x^{(i)})$. The second part, $\Omega(f_k)$, represents the penalty function for the new tree $f_k$. This expression is used to control the complexity of the new tree and prevent overfitting. The last part, $\sum_{j=1}^{T_{k-1}} \Omega\left(f_j^{(k-1)}\right)$, represents the total penalty function for all previous trees up to step $k-1$. This expression is used to maintain the overall penalty across all trees in the model. The goal of this updated cost function is to guide the XGB algorithm in constructing a new tree $f_k$ that minimizes the overall cost function, taking into account both the prediction errors and the complexity of the new tree.
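The three parts of Equation (2) can be evaluated numerically with a short sketch. Two assumptions go beyond the text: the loss $l$ is taken to be the squared error, and the penalty is taken in the form given by Chen and Guestrin, $\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum w^2$, where $T$ is the number of leaves and $w$ are the leaf weights. All numbers and function names below are invented for illustration.

```python
def squared_error(y_true, y_pred):
    """Loss l for a single sample (assumed here to be the squared error)."""
    return (y_true - y_pred) ** 2

def omega(leaf_weights, gamma=1.0, lam=1.0):
    """Tree penalty: gamma per leaf plus an L2 penalty on the leaf weights."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)

def objective(y, prev_pred, new_tree_out, new_leaves, prev_trees_leaves,
              gamma=1.0, lam=1.0):
    """Loss of the updated predictions + penalty of the new tree
    + penalties of all previously built trees, as in Equation (2)."""
    loss = sum(squared_error(yi, pi + fi)
               for yi, pi, fi in zip(y, prev_pred, new_tree_out))
    new_penalty = omega(new_leaves, gamma, lam)
    old_penalty = sum(omega(leaves, gamma, lam) for leaves in prev_trees_leaves)
    return loss + new_penalty + old_penalty

# Invented toy values: three samples, one earlier tree with a single leaf.
y = [1.0, 0.0, 1.0]
prev_pred = [0.5, 0.5, 0.5]        # predictions after iteration k-1
new_tree_out = [0.3, -0.3, 0.3]    # f_k(x) for each sample
new_leaves = [0.3, -0.3]           # leaf weights of the new tree f_k
prev_trees_leaves = [[0.5]]        # leaf weights of the earlier trees

total = objective(y, prev_pred, new_tree_out, new_leaves, prev_trees_leaves)
print(round(total, 4))
```

Because the penalties of the earlier trees are fixed at iteration $k$, only the first two parts change when different candidate trees $f_k$ are compared.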

Other relationships in the XGB model include relationships related to optimizing the cost function, weighting the samples, and optimizing the structure of decision trees. These relationships improve the performance and accuracy of the algorithm in predicting data [49].

3.2.8. Association Rule Mining

Association rule mining, or frequent itemset mining, is a popular non-parametric data mining technique that can be used in analyzing causality in traffic safety research [51]. The main advantage of this approach is that the dependent variable for rule generation does not need to follow a specific distribution. The generated rules can be used to develop countermeasures that break the relationship between factors affecting crashes, thereby reducing crash risk [31]. The discovery of association rules is typically carried out in two stages. In the first stage, frequent factors (e.g., AV movements, locations, AV operational states, etc.) with a support value greater than the minimum threshold are selected using the Apriori model. The variables are divided into antecedents and consequents: crash severity is considered the consequence, and the other factors serve as antecedents. The resulting conditional rules are called crash rules. In the second stage, the rules generated in the first stage are ranked by their lift values using the Apriori model and then filtered based on the minimum confidence and support [2,12,52].

3.2.9. Apriori Algorithm

The Apriori algorithm was proposed by Agrawal and his colleagues at the IBM research center and can be used to generate all frequent itemsets or association rules [12]. In this study, the Apriori algorithm is used to interpret the interrelationships between factors and to examine the mechanism of AV-induced crashes for the dependent variable of crash severity. Association rules are extracted from the crash database obtained from CA DMV reports using Python with the mlxtend and apriori libraries. The working principles of this algorithm are discussed below. Let $I = \{g_{1}, g_{2}, g_{3}, \ldots, g_{n}\}$ be a set of variables such as driving condition, lighting condition, etc., and let $D = \{c_{1}, c_{2}, c_{3}, \ldots, c_{n}\}$ be a set of crash events. Each crash event in $D$ occurs due to a unique combination of variables in $I$, which we call its crash rule. Therefore, each crash event in $D$ consists of a subset of the items in $I$. A rule is defined as an implication of the form $X \Rightarrow Y$ (if X, then Y), where $X, Y \subseteq I$ and $X \cap Y = \emptyset$. The first part of this rule, $X$, is the antecedent(s), while the second part, $Y$, is the consequence.
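The level-wise search that Apriori performs over the item set I can be sketched in a few lines of plain Python. This is a simplified illustration rather than the mlxtend implementation used in the study; the function name, the item labels, and the four toy crash records are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search; returns {itemset: support} for frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every single item that appears in the data.
    current = [frozenset([item]) for item in {i for t in transactions for i in t}]
    result = {}
    k = 1
    while current:
        # Keep only candidates whose support reaches the minimum threshold.
        level = {}
        for candidate in current:
            support = sum(1 for t in transactions if candidate <= t) / n
            if support >= min_support:
                level[candidate] = support
        result.update(level)
        # Join frequent k-itemsets into (k + 1)-candidates and prune any
        # candidate that has an infrequent k-subset (the Apriori property).
        k += 1
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                candidates.add(union)
        current = list(candidates)
    return result


# Hypothetical crash records: each record is a set of binary items.
crashes = [
    {"AV_stopped", "intersection", "rear_end"},
    {"AV_stopped", "intersection", "rear_end"},
    {"AV_stopped", "road", "sideswipe"},
    {"AV_moving", "intersection", "broadside"},
]
itemsets = frequent_itemsets(crashes, min_support=0.5)
for itemset, support in sorted(itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

The pruning step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are frequent, so most combinations of crash factors are never counted.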

In this study, the dependent variable, crash severity (bodily injury and property damage), is considered as the consequence or outcome, and various factors affecting crashes such as crash location, driving condition, etc., are considered as antecedents. To separate important rules from the entire set of rules generated by the Apriori algorithm, various criteria can be used. In the Apriori algorithm, three essential indices are necessary for discovering association rules. The most recognized criteria for determining the minimum threshold are lift (L), support (S), and confidence (C) [53,54]. The support of a rule is the percentage of the total dataset covered by the rule and is expressed as Equation (3).

\mathrm{support}(X \Rightarrow Y) = P(XY) = \frac{\text{number of records containing both } X \text{ and } Y}{\text{total number of records } N}

(3)

The confidence of an association rule is the conditional probability of the occurrence of the consequence, given that the antecedent(s) have occurred, and is expressed as Equation (4).

\mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X) = \frac{P(XY)}{P(X)} = \frac{\mathrm{support}(X \Rightarrow Y)}{\mathrm{support}(X)}

(4)

Additionally, the lift value is used to measure the mutual dependence between the antecedents and the consequence of a rule. Lift measures the ratio of the confidence of a rule to the expected confidence of a rule, given the occurrence of the antecedent(s), and is expressed as Equation (5).

\mathrm{lift}(X \Rightarrow Y) = \frac{P(Y \mid X)}{P(Y)} = \frac{\mathrm{confidence}(X \Rightarrow Y)}{P(Y)} = \frac{P(XY)}{P(X) \, P(Y)}

(5)

Equation (5) shows that when the lift value is greater than one, the occurrence of X increases the likelihood of Y occurring. Otherwise, the rule is invalid [31]. Lift is more important than support and confidence in determining the strength of an association rule [2]. In this study, a minimum lift value of 1 is considered, and the support and confidence thresholds are obtained through trial and error [2]. In summary, the rule extraction procedure follows Figure 1.

The random forest model has similar hyperparameters to the decision tree and bagging classifier, and its evaluation is also conducted using the Gini criterion [29]. In this study, the RF model is used to assess the importance of variables. In this method, the importance of each variable is based on the amount of error reduction in the decision trees.
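As a worked illustration of Equations (3)–(5), the three association rule measures can be computed directly from a list of transactions. The sketch below uses plain Python; the function names and the four toy crash records are hypothetical and serve only to show the arithmetic.

```python
def support(transactions, itemset):
    """Share of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent => consequent."""
    s_xy = support(transactions, set(antecedent) | set(consequent))  # Equation (3)
    s_x = support(transactions, antecedent)
    s_y = support(transactions, consequent)
    confidence = s_xy / s_x                                          # Equation (4)
    lift = confidence / s_y                                          # Equation (5)
    return {"support": s_xy, "confidence": confidence, "lift": lift}


# Hypothetical crash records: each record is a set of binary items.
crashes = [
    {"AV_stopped", "intersection", "rear_end"},
    {"AV_stopped", "intersection", "rear_end"},
    {"AV_stopped", "road", "sideswipe"},
    {"AV_moving", "intersection", "broadside"},
]
metrics = rule_metrics(crashes, {"AV_stopped"}, {"rear_end"})
print(metrics)
# support = 2/4 = 0.5, confidence = 0.5 / 0.75 = 0.667, lift = 0.667 / 0.5 = 1.333

# A rule is kept only if it clears all three thresholds.
keep = metrics["support"] >= 0.03 and metrics["confidence"] >= 0.4 and metrics["lift"] > 1
```

In this toy example the lift of 1.33 means that a rear-end collision is about a third more likely among records where the AV was stopped than in the data overall.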

3.2.10. Statistical Algorithms

In this study, two statistical tools are also employed: the contingency table and Pearson’s chi-square test, both of which are described below.

Contingency Table

A contingency table is a two-dimensional table in which we simultaneously examine two or more variables. This table is used to investigate the relationship between a dependent variable and other variables. The dependent variable is placed in the columns, and the independent variables are placed in the rows. Each cell in the table shows the percentage of observations in which the classes of each variable were involved. This table is built using the pandas library in Python. Information is extracted from a contingency table by computing the chi-square test statistic, the p-value, and the degrees of freedom. This method is also known as Pearson’s chi-square test [55]. The $\chi^{2}$ value of the chi-square test statistic is calculated as shown in Equation (6).

\chi^{2} = \sum \frac{(O - E)^{2}}{E}

(6)

where $O$ is the observed frequency and $E$ is the expected frequency under the null hypothesis. The degrees of freedom ($df$) are calculated by the following formula:

df = (r - 1) \times (c - 1)

(7)

In the above formula, $r$ is the number of rows and $c$ is the number of columns. In this study, a contingency table is calculated between the dependent variable (crash severity) and each of the other variables.
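Equations (6) and (7) can be illustrated with a short plain-Python sketch that computes the expected frequencies from the row and column totals. The function name and the 2 × 2 table of counts are hypothetical; in practice, a library routine such as `scipy.stats.chi2_contingency` would also return the p-value.

```python
def chi_square(observed):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence of the row and column variables.
            e = row_totals[i] * col_totals[j] / n
            chi2 += (o - e) ** 2 / e                      # Equation (6)
    dof = (len(observed) - 1) * (len(col_totals) - 1)     # Equation (7)
    return chi2, dof


# Hypothetical counts: rows = driving mode (ADS, manual),
# columns = collision type (rear-end, other).
table = [
    [30, 10],
    [20, 40],
]
chi2, dof = chi_square(table)
print(round(chi2, 3), dof)  # 16.667 1
```

Here the expected counts are 20, 20, 30, and 30, so each cell deviates by 10 and the statistic is 100/20 + 100/20 + 100/30 + 100/30 = 16.667 with one degree of freedom.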

3.2.11. Default Values for Model Parameters

Table 2 shows the parameters used for running each of the machine learning models in the Python programming language.

3.2.12. Evaluation Metrics

This section examines the evaluation metrics utilized in this study.

Confusion Matrix

The confusion matrix, or error matrix, is used as an evaluation tool in classification problems to measure the accuracy of different models. This matrix mainly helps us compare the classification result with the actual measured value to understand how well our model has succeeded in detecting severity classes of crashes. This matrix consists of four main values, namely TP (true positive), TN (true negative), FN (false negative), and FP (false positive). Using these values, various evaluation metrics such as accuracy, sensitivity or recall, precision, and the G-mean can be calculated.

Accuracy: the ratio of the number of correctly classified samples ($TP + TN$) to the total number of samples ($TP + TN + FP + FN$).

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP}

(8)

Precision: the ratio of the number of positive samples correctly identified ($TP$) to the total number of samples that the model identified as positive ($TP + FP$).

\mathrm{Precision} = \frac{TP}{TP + FP}

(9)

G-Mean: a reasonable measure for evaluating imbalanced data, as it balances the classification accuracy of minority and majority cases [50]. To calculate the G-mean, we first need to calculate the sensitivity (recall) and the specificity.

\mathrm{G\text{-}Mean} = \sqrt{\mathrm{Sensitivity} \times \mathrm{Specificity}} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}

(10)
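Equations (8)–(10) can be checked with a small plain-Python sketch. The function name and the confusion matrix counts are hypothetical. The sketch treats the problem as binary; for a multi-class target such as collision type, one would presumably compute these values per class in a one-vs-rest fashion, which is an assumption not stated in this section.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, sensitivity, specificity, and G-mean from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)         # Equation (8)
    precision = tp / (tp + fp)                         # Equation (9)
    sensitivity = tp / (tp + fn)                       # recall
    specificity = tn / (tn + fp)
    g_mean = math.sqrt(sensitivity * specificity)      # Equation (10)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": g_mean,
    }


# Hypothetical counts for one class versus the rest.
scores = binary_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these counts, the accuracy is 0.85, the precision is 0.889, and the G-mean is √(0.8 × 0.9) ≈ 0.849. The G-mean drops sharply if either class is poorly detected, which is why it suits imbalanced data.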

Shapley Additive Explanations (SHAP)

In data mining and data analysis, interpreting the results of classification models to justify model predictions and understand their causes is very important. In fact, interpretability is as important as accuracy and precision. In this study, we use the SHAP method to interpret our machine learning models. SHAP is a valid method for interpreting predictive models based on cooperative game theory. It was introduced by Lundberg and Lee in 2017 [29]. The purpose of this part of the research is to investigate in detail the effect of input variables on the predictions of a machine learning model for the dependent variable of crash severity. Through this analysis, it is possible to determine which variables have the greatest impact on the model’s predictions and how these variables contribute to the output of the crash severity and collision type models. These results can help inform decisions and enhance the interpretability of machine learning models [12]. Equation (11) calculates the SHAP values.

\phi_{i}(f) = \frac{1}{K} \sum_{k = 1}^{K} \left[f(z_{k}) - f(z_{k \setminus i})\right]

(11)

Here, $\phi_{i}(f)$ is the SHAP value for variable $i$; $K$ is the number of samples in the dataset; $f(z_{k})$ is the model prediction for sample $k$; and $f(z_{k \setminus i})$ is the model prediction for sample $k$ without the presence of variable $i$.
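Equation (11) averages, over all samples, the change in the model output when variable i is removed. A minimal sketch of that calculation is given below. It assumes that "removing" a variable means replacing it with a baseline value, and it uses a hypothetical linear scoring function in place of a trained model; the full SHAP method additionally averages over coalitions of the other variables, which this simplified sketch does not do.

```python
def mean_marginal_contribution(f, samples, i, baseline=0.0):
    """Average of f(z_k) - f(z_k without variable i) over all samples, as in Equation (11)."""
    total = 0.0
    for z in samples:
        z_without_i = list(z)
        z_without_i[i] = baseline  # assumption: removal = reset to a baseline value
        total += f(z) - f(z_without_i)
    return total / len(samples)


def toy_model(z):
    """Hypothetical model: variable 0 matters four times as much as variable 1."""
    return 2.0 * z[0] + 0.5 * z[1]


# Hypothetical binary inputs, e.g. z[0] = AV stopped, z[1] = intersection.
samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
phi_0 = mean_marginal_contribution(toy_model, samples, 0)
phi_1 = mean_marginal_contribution(toy_model, samples, 1)
print(phi_0, phi_1)  # 1.0 0.25
```

Ranking the variables by the absolute value of these averages gives the kind of importance ordering shown in a SHAP summary plot.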

4. Results

In this section, the results of the descriptive statistics, classification and regression tree models, association rules, and the Pearson test are presented.

4.1. Descriptive Statistics of Crash Data

The bar chart in Figure 5 shows the frequency distribution of crashes by driving mode (autonomous driving system (ADS), manual driving, and driverless) between September 2014 and September 2023. The highest number of crashes with active ADSs occurred in 2022, and from 2022 onward, this mode accounts for the most crashes. From 2019 to 2021, the testing of autonomous vehicles was limited by COVID-19 restrictions [56], which is evident in the chart. Crashes involving driverless vehicles have been reported since 2022. An important point in the crash statistics of 2021 and 2022 is that the number of crashes in manual driving mode was the same in both years, while the number of crashes with active ADSs grew considerably in 2022. This could indicate users’ confidence in the ADS: possibly due to the improvement of AV systems, users prefer to hand over control to the autonomous system [43].

The descriptive statistics in panels (a), (b), and (c) of the pie charts in Figure 6 show that 66.2% of crashes occur at intersections, with 53% of them happening at signalized intersections. A total of 82% of crashes occur without disengagement, and 14.5% involve disengagement of the ADS before the crash occurs.

According to panels (d), (e), and (f), AVs have recorded 50.8% rear-end collisions, 23.8% side collisions, 7.92% head-on collisions, and 7.43% angular collisions, while non-AVs have mostly collided from the front and sides. A total of 51% of crashes involve active ADSs, and about 4% of crashes involve driverless vehicles. A total of 84.3% of AV crashes result in property damage, while 15.7% involve bodily injury.

4.2. CART Model Results

The CART model was developed using a decision tree, with a random 70% of the data used for training and the remaining 30% for model validation. The accuracy of the CART model was 79%, with a precision of 68% and a recall of 79%. Despite using a greater number of variables for model training, the accuracy of our model was higher than that of previous studies that modeled the type of collision. Ashraf et al. (2021) utilized the CART model for modeling collision types and extracting rules, achieving a model accuracy of 65% [2]. Wang and Li (2019) achieved an accuracy of 60% in modeling collision types using the CART model [48]. The classification and regression tree developed for examining collision types in AV crashes is shown in Figure 7. The root node of the CART model was “Non-AV”, which was then divided into two branch nodes (node 1 and node 2) based on the data. Each subsequent node was further split into additional branch nodes based on factors such as AV and non-AV movement, driving conditions, crash locations, and other variables. In total, the CART model resulted in 26 nodes. The terminal (leaf) nodes represent the probability of each crash type, which is a combination of the factors indicated in the preceding branch nodes. Using the Gini criterion, the division continues until the CART recognizes that further tree growth will not lead to greater benefits or until predefined stopping criteria halt the division. Ultimately, each class of the dependent variable is placed in the terminal (leaf) nodes. In the Python implementation, a maximum depth of six and a leaf count of 17 were used to prevent overfitting of the model. Additionally, class weight parameters were set to address class imbalance in the data. Rules can be expressed as conditional statements at different depths in the CART. In total, 20 rules (in terms of collision probability) were generated from the CART using functions in Python, as listed in Table 3. Each rule estimated the probability of collision types for a combination of influential factors in crashes.
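The Gini criterion that drives the splits described above can be sketched in plain Python. The function names and the eight toy labels are hypothetical; the sketch shows only how the impurity reduction of one candidate split is scored, not the full tree-growing procedure.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gain(parent, left, right):
    """Impurity reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


# Hypothetical node with eight crashes, split on "AV stopped: yes / no".
left = ["rear_end", "rear_end", "rear_end", "sideswipe"]    # AV stopped
right = ["rear_end", "sideswipe", "sideswipe", "sideswipe"]  # AV moving
parent = left + right

print(gini(parent))                     # 0.5
print(split_gain(parent, left, right))  # 0.5 - 0.375 = 0.125
```

At each node, the candidate split with the largest gain is chosen, and growth stops when no split yields a worthwhile gain or when a stopping criterion such as the maximum depth is reached. The class proportions in a leaf (here 75% rear-end in the left child) are what the extracted rules report as collision probabilities.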

4.2.1. Rear-End Collision

Rules 4, 5, 6, 7, and 8 (five of the 20 rules) in Table 3 indicate a higher probability of rear-end collisions. Three of these five rules (rules 4, 5, and 7) had intersections as the location of crashes, meaning that rear-end collisions are more likely to occur at intersections. According to the pre-collision factors in rule 8 of the CART, when an AV was stopped at an intersection, a non-AV was changing lanes behind the AV, and peripheral parking spaces were present, the probability of a rear-end collision was 62%. At an intersection without traffic lights, in manual driving mode, and with the non-AV performing movements other than driving straight, the probability of a rear-end AV collision was 45% according to rule 7 of the CART. According to rule 6 of the CART, when the AV was stopped in ADS mode, the probability of collision increased to 56%. According to the extracted rules, AVs perform worse at intersections because their driving behavior differs from that of human-driven vehicles, leading to an increased probability of collision. According to rule 5 of the CART, when an AV stopped at an intersection with traffic lights in manual driving mode and non-AVs were directly behind the AV, the probability of a rear-end collision was 50%. A comparison of rules 5 and 7 of the CART indicates that if there are traffic lights at the intersection, the probability of a rear-end AV collision increases. The highest probability of a rear-end collision occurred when the AV was stopped at an intersection and non-AVs were directly behind the AV. According to rule 4 of the CART, in this scenario, there was a 75% probability of a rear-end AV collision.

The factors before the collision in the CART rules indicate that rear-end collisions are more likely to occur when AVs are stopped at an intersection with active ADSs and non-AVs are moving directly. The presence of traffic lights and the operational state of ADSs increase the probability of rear-end collisions. These findings are consistent with previous studies [2,8].

Most rear-end collisions occur due to unsafe movements of human-driven vehicles or their failure to maintain a safe following distance: they do not stop in time and strike the back of the AV, because AVs strictly adhere to traffic rules and maintain a specific following distance from the vehicle in front [57]. Non-AV drivers do not expect AVs to stop at that distance, so they are not prepared for sudden braking by AVs, resulting in collisions [2]. Crash reports show that some collisions with active ADSs occur due to delayed movement of AVs after the traffic light turns green. In this scenario, non-AVs attempt to pass the AV by changing lanes or collide with the AV from behind, resulting in low-impact collisions for the vehicle and its occupants.

4.2.2. Sideswipe Collisions

The second most common type of crash related to AVs is sideswipe collisions. Rules 9, 10, 11, and 12 of the CART indicate a higher probability of sideswipe collisions. According to the factors before the collision, rule 9 of the CART shows that when the location of the collision was on a road and the AV was stopped in manual mode while the non-AV was performing unsafe maneuvers such as unsafe circular movements, overtaking, or lane changes, the probability of a sideswipe collision was 60%. Rule 12 of the CART shows that if instead of the mentioned maneuvers, the non-AV had direct movement in the adjacent lane, the probability of a sideswipe collision decreased to 46%. According to rule 10 of the CART, during peak traffic hours when the non-AV was performing unsafe maneuvers such as unsafe circular movements, merging, overtaking, or lane changes and the AV was stopped at an intersection in ADS mode, the probability of a sideswipe collision was 45%. According to rule 11 of the CART, when the AV was moving straight at an intersection with an active ADS and the non-AV was parked beside the street or performing overtaking maneuvers, the probability of a sideswipe collision was 57%.

Two rules were related to roads or straight paths, and two rules were related to intersections as the location of crashes, indicating that sideswipe collisions are likely to occur both on roads and near or at intersections. A comparison of the pre-collision factors in rules 9 and 11 of the CART shows that the probability of a sideswipe collision was 60% when the AV was stopped in manual mode on a road (rule 9), compared with 57% when the AV moved straight through an intersection with an active ADS (rule 11). In other words, when the AV is stopped in manual operation on a road, the probability of a sideswipe collision is higher than when it moves straight at an intersection with an active ADS. The results are consistent with previous studies [2,8,35]. According to crash narratives, 32 out of 144 sideswipe collisions occurred in darkness; of these, 20 involved AVs moving straight and 18 involved active ADSs. Seven of the eighteen nighttime collisions with active ADSs were related to driverless vehicles. Sideswipe collisions were the most frequent collision type among driverless vehicles, accounting for 38% of the 21 reported collisions. A total of 81% of driverless vehicle collisions occurred at night, indicating that highly automated Level 4 Cruise vehicles do not perform well at night (from 21:00 to 5:00) and perform best in daylight.

4.2.3. Head-On Collisions

Another type of crash related to AVs is a head-on collision, which involves any collision that damages the front part of the AV, regardless of whether another vehicle collided with the rear or front of the AV. According to rule 17 of the CART, when the AV was maneuvering to park, entering traffic, or overtaking, the non-AV had movements other than moving straight, and the POI included transportation uses, there was a 52% probability of a head-on collision. Further examination reveals that this type of collision occurs more frequently on a straight road. On the other hand, rule 18 shows that there was a 78% probability of a head-on collision in clear weather and adjacent to transportation uses when the non-AV was moving straight ahead while the AV was in manual operation. Further investigation reveals that this type of collision mostly occurred at intersections and when the AV was maneuvering, especially turning left or right. A comparison of rules 18 and 14 of the CART indicates that in public use with active ADSs (rule 14), the probability of a head-on collision is lower than in transportation use with manual driving (rule 18). According to rules 13 and 19, when the AV was in a parking space and the non-AV was exiting a parking space or performing other movements, there was an 88% probability of a head-on collision. Most head-on collisions occurred on roads. Those at intersections occurred at intersections without traffic lights and in manual driving mode, and non-AVs were mostly responsible for the collisions, as they did not observe the right of way or engaged in aggressive maneuvers, reversing, or driving in the wrong lane. In 5.61% of collisions, the AV was also at fault and collided with a parked vehicle. Ashraf et al. found that most head-on collisions occurred on roads while the AV was moving straight on a two-way road in manual driving mode, whereas the present findings indicate that AVs were mostly stopped or maneuvering to park, entering traffic, or overtaking, and non-AVs were changing lanes or driving in the wrong lane [2].

4.2.4. Broadside Collisions

Another common type of crash related to AVs is a broadside (angular) collision, which occurs more frequently when the AV is at an intersection with traffic lights and is turning left/right or continuing straight. According to rule 1 of the CART, when the AV was in motion at an intersection, the non-AV was continuing straight, and no ADS disengagement occurred, there was a 74% probability of an angular collision. According to rules 2 and 3 of the CART, when the AV was performing movements other than stopping at an intersection with traffic lights and the non-AV was moving straight, the probability of an angular collision was 79% when peripheral parking spaces were available (rule 2) and 71% when they were not (rule 3). According to crash reports, most angular collisions occurred at T and Y intersections, mixed intersections, and on roads, and the rest at intersections with traffic lights. Angular collisions occurred in both manual and ADS driving modes, and no single determining factor was found for this type of collision. Further examination reveals that in most of these crashes, the AV was moving straight or turning left/right, and the non-AV initiated the collision by running a red light, failing to observe the right of way, maneuvering, or moving straight. These findings are consistent with previous studies [2,8,35].

4.2.5. Hit-Object Collisions

Another type of crash related to AVs is a collision with an object, where no second vehicle is involved and the AV collides with a fixed object such as a street-side table or a garbage bin. According to rules 20 and 15 of the CART, when there is no non-AV and the weather is clear, there is a 90% probability of a collision with an object while the ADS is active (rule 20) and a 100% probability when the vehicle is in manual operation (rule 15). These two rules indicate that when the ADS is active, the probability of a collision with an object is 10 percentage points lower, suggesting that the ADS performs better than humans in preventing collisions with objects. According to crash reports, most collisions in manual operation occurred during maneuvering or reversing, in daylight, on a straight path and at intersections, and the POI included transportation locations. A total of 23% of the 35 collisions involving objects were related to an active ADS, one of which involved a driverless Cruise vehicle.

In general, the rules indicate that the most critical collision scenario, and the one that most needs to be addressed, is when the AV is in front of non-AVs. For angular collisions, intersections with traffic lights were critical; for head-on collisions, parking spaces and parking maneuvers were critical. AVs cannot avoid side collisions well, and this applies to highly automated driverless AVs as well. Locations with transportation uses and intersections were identified as critical locations for AVs. Intersections provided a challenging environment for AVs in mixed-traffic conditions (i.e., a combination of AVs and non-AVs). Overcautious AV behavior, for example prolonged startup delays at intersections, during maneuvering, or when moving forward, led non-AV drivers (who have shorter startup delays) [2] to collide with AVs, resulting in rear-end or angular collisions.

4.3. Evaluation of the CART Model

The confusion matrix results show an accuracy of 79% on the training data (70% of the dataset) and a precision of 68% on the test data (30% of the dataset). Other evaluation parameters are shown in Table 4. The model’s error is calculated to be 19%. Given the large number of variables and classes of the dependent variable, this error value is acceptable. The Gini coefficient value indicates that the tree could still be expanded; however, to prevent overfitting and a consequent reduction in the model’s accuracy, the tree was pruned using stopping criteria.

4.4. Model of Association Rules Results: Apriori

To extract association rules, the Apriori, Eclat, and FP-Growth models are commonly used. In this study, we utilized the Apriori model, implemented through the mlxtend library in Python, because of its previous use in modeling traffic safety issues [2,12]. This section presents the results of extracting association rules with the Apriori model to discover patterns in AV crash data. These results include the important association rules and the combinations of variables used in the analysis of autonomous vehicle crash mechanisms. The rules have frequent variables as antecedents and class variables as consequences.

To generate association rules using the Apriori model, 14 crash features (AV accident location, road type, driving mode, ADS disengagement, AV movements during accidents, non-AV movements during accidents, lighting conditions, weather conditions, POI, peripheral parking space, AV damage severity, peak traffic hour, accident severity, and collision types) were used in binary form (41 binary variables). Based on the literature review [2,12] and trial and error, a minimum support value of 0.03 and a minimum confidence value of 0.4 were selected. Initially, frequent combinations were extracted using the Apriori model, and then frequent patterns were filtered using the association rule thresholds. The lift value for all rules generated in the first step was greater than one, indicating a positive association between antecedents and consequences in all rules. In the next stage, collision types were separated as consequences, and factors influencing different collision types were filtered. The extracted collision rules (antecedents) from the association rule approach using different collision types as consequences are presented in Table 5.
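The conversion of categorical crash features into binary form can be sketched as follows. The function name, the feature names, and the three toy records are hypothetical; the sketch only illustrates how each feature–value pair becomes one 0/1 column, which is the kind of table a frequent itemset miner expects as input.

```python
def one_hot(records):
    """Turn categorical records into a 0/1 matrix with one column per feature=value pair."""
    columns = sorted({f"{feat}={val}" for rec in records for feat, val in rec.items()})
    matrix = []
    for rec in records:
        present = {f"{feat}={val}" for feat, val in rec.items()}
        matrix.append([int(col in present) for col in columns])
    return columns, matrix


# Hypothetical crash records with two of the categorical features.
records = [
    {"driving_mode": "ADS", "location": "intersection"},
    {"driving_mode": "manual", "location": "road"},
    {"driving_mode": "ADS", "location": "road"},
]
columns, matrix = one_hot(records)
print(columns)
for row in matrix:
    print(row)
```

Applied to all 14 features, this kind of encoding yields one binary column per category, which is consistent with the 41 binary variables reported above. The column means of the matrix are the single-item supports that are compared with the minimum support threshold.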

4.4.1. Association Rules for AV Rear-End Collisions

Most rules produced by the Apriori algorithm are supported by CART model rules and provide further insights into AV crashes. The top four rules 1, 2, 3, and 4 shown in Table 5 indicate a combination of factors related to rear-end AV collisions, similar to the findings of the CART model.

These four rules identify the autonomous driving mode as one of the factors influencing rear-end collisions with AVs. This finding indicates that AVs pose a higher risk of rear-end crashes in the ADS driving mode compared to the manual driving mode in mixed-traffic conditions. Additionally, intersections were identified as locations where most rear-end AV crashes occur. Rules combining ADS disengagement before the collision with intersections without traffic lights had lower support for the rear-end collision consequence, suggesting that if the driver takes control of the vehicle before the collision at an intersection without traffic lights, the likelihood of a rear-end collision decreases. The higher rear-end collision rate in autonomous mode could be attributed to potential failures of AV technologies in performing appropriate driving maneuvers. Additionally, AV drivers may recognize the perceived safety issue of AVs and attempt to take control of the vehicle, leading to these types of collisions [2]. Most rules indicated that the AV was stationary and the non-AV was moving straight or performing other maneuvers, but rule 4 indicated that the AV was also moving without ADS disengagement at intersections without traffic lights. This rule had less support compared to other rules, but its lift value indicates frequent occurrences of these variables together. Rear-end collision rules did not include land use, suggesting that this type of collision could occur in any location.

4.4.2. Association Rules for Sideswipe Collisions

Rules 5, 6, 7, and 8 (Table 5) were related to sideswipe AV crashes. These four rules indicate similar findings as those revealed by rules 9, 10, 11, and 12 from the CART model (Table 3). The rules indicate that in manual and autonomous driving modes, these crashes occurred at intersections and straight paths, but in most cases involving intersections, the intersections lacked traffic lights and ADS disengagement. In most rules, the AV was directly moving, but in rule 8, at an intersection with traffic lights, the non-AV had a direct forward movement and the AV was in manual driving mode, performing maneuvers such as reversing, overtaking, merging into traffic, or changing lanes, with less support compared to other rules. Additionally, according to rule 7, sideswipe collisions were more common in POIs including public facilities.

4.4.3. Association Rules for Head-On Collisions

Rules 9, 10, 11, and 12 (Table 5) were related to head-on collisions with AVs. These four rules indicate similar findings as those revealed by rules 13, 14, 17, and 18 from the CART model (Table 3). These rules also indicate that most collisions occurred on roads and straight paths while in manual and autonomous driving modes, and in most cases, the AV was stationary at the moment of collision. Additionally, according to rule 11, head-on collisions were more common in POIs including public locations. Non-AV movements at the moment of collision included parking maneuvers, reversing, overtaking, lane changes, or entering traffic. According to rule 12, collisions were less supported when the AV was moving directly on a street without disengagement, indicating a lower level of support compared to when the AV was stationary. The scenario where the AV was stationary in manual driving mode and the non-AV was performing circular maneuvers, reversing, driving in the wrong direction, or merging into traffic, as well as overtaking on two-way roads, had the highest confidence and support.

4.4.4. Association Rules for Broadside Collisions with AVs

Rules 13, 14, 15, and 16 (Table 5) were related to angular collisions with AVs. These four rules indicate similar findings as those revealed by rules 1, 2, and 3 from the CART model (Table 3). These rules indicate that most crashes occur at intersections with traffic lights in both manual and autonomous driving modes. AV and non-AV movements were mostly direct (rules 13, 14, and 15). Rule 13 suggests the most probable scenario for angular collisions, where both vehicles continue straight at an intersection with traffic lights without ADS disengagement, indicating the highest support and confidence. According to rule 14, maintaining the conditions of rule 13, if the AV performs circular maneuvers such as turning left/right, the confidence in this scenario is also high, and based on crash reports, angular collisions during left turns are more frequent. Rules 14, 15, and 16 indicate that angular collisions occurred more frequently in clear weather and POIs including public locations.

4.4.5. Association Rules for AV Collisions with Objects

Rules 17, 18, 19, and 20 (Table 5) were related to collisions of AVs with objects. These four rules indicate similar findings as those revealed by rules 1, 2, and 13 from the CART model (Table 3). These rules are more comprehensive compared to the CART rules and indicate that most crashes occurred at intersections without traffic lights, where the AV was directly moving in manual driving mode or performing circular maneuvers such as parking, reversing, entering or exiting parking, changing lanes, or driving in the wrong direction. Most collisions occurred without disengagement and in proximity to transportation facilities.

4.5. Evaluation of the Apriori Model for Collision Types

In addition to the evaluation metrics mentioned in Table 5 for each scenario, the number of extracted rules for different levels of support and confidence can be observed in Figure 8. This chart provides insights into the number of filtered rules obtained with increasing levels of metrics, which may vary based on the required precision and model conditions [52].
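The relationship between the thresholds and the number of retained rules can be illustrated with a minimal sketch. The crash records, item names, and threshold grid below are hypothetical and are not taken from the study data; the sketch also enumerates candidate rules by brute force rather than using the level-wise pruning of the Apriori algorithm, so it only shows how support, confidence, and lift filter the rule set.

```python
from itertools import combinations

# Hypothetical toy crash records; each set lists the conditions present in one crash.
crashes = [
    {"signalized_intersection", "av_stopped", "rear_end"},
    {"signalized_intersection", "av_stopped", "rear_end"},
    {"signalized_intersection", "av_moving", "broadside"},
    {"street", "av_stopped", "sideswipe"},
    {"street", "av_stopped", "rear_end"},
    {"street", "av_moving", "sideswipe"},
]
COLLISION_TYPES = {"rear_end", "broadside", "sideswipe"}


def support(itemset, records):
    """Fraction of records that contain every item in itemset."""
    return sum(itemset <= r for r in records) / len(records)


def mine_rules(records, min_support, min_confidence):
    """Enumerate rules {conditions} -> collision type that pass both thresholds."""
    conditions = sorted(set().union(*records) - COLLISION_TYPES)
    rules = []
    for size in (1, 2):  # antecedents with one or two conditions
        for lhs in combinations(conditions, size):
            lhs = frozenset(lhs)
            lhs_support = support(lhs, records)
            if lhs_support == 0:
                continue
            for rhs in sorted(COLLISION_TYPES):
                rule_support = support(lhs | {rhs}, records)
                confidence = rule_support / lhs_support
                lift = confidence / support({rhs}, records)
                if rule_support >= min_support and confidence >= min_confidence:
                    rules.append((sorted(lhs), rhs, rule_support, confidence, lift))
    return rules


# Count the surviving rules on a small grid of (support, confidence) thresholds.
rule_counts = {
    (s, c): len(mine_rules(crashes, s, c))
    for s in (0.1, 0.3)
    for c in (0.5, 0.9)
}
```

Raising either threshold can only shrink the rule set, which is the pattern a chart of rule counts against support and confidence levels is expected to show.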

Comparison of Apriori Model Results with CART Model

The results of the decision tree algorithm and the community rules support each other and yield similar findings. The most significant difference between the two is that community rules, which are free of the restrictions of the CART method, can draw on a wider range of variables for rule extraction; however, they are only suitable for examining frequent scenarios. Conversely, CART rules can be used to examine observations with fewer repetitions in addition to frequent factors. Overall, the CART model and community rules complement each other, as each provides new information not captured by the other, and their combined results can help develop joint actions to reduce AV-related crashes in the future and improve mixed-traffic safety.

4.6. Statistical Model Results

4.6.1. Cross-Tabulation Results (Pearson’s Test)

In this section, the results of the cross-tabulation table are presented. Before conducting statistical analyses, variables with a variance inflation factor (VIF) greater than 10 were removed due to their strong correlation with other variables. Then, the independence of variables was examined and independent variables were selected for statistical analysis. The table was generated using the pandas library in Python. The applications of this method include examining relationships between variables, assessing data distribution, and evaluating differences and patterns between variables, with assessment using the chi-square (Pearson’s) test and the $H_{0}$ statistical hypothesis. The results from this table can be utilized for a better understanding of patterns and results from other models in the results analysis and discussion chapter. The results of the Pearson’s test, descriptive statistics summary, and the frequency of variables are presented in Table 6.
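As an illustration of the test applied to a cross-tabulation, the following minimal sketch computes Pearson’s chi-square statistic for a 2 x 2 table in plain Python. The counts are hypothetical and are not taken from Table 6, and the study itself reports using pandas; the function name `chi_square` is introduced here only for illustration. The decision uses the standard critical value of 3.841 for one degree of freedom at a significance level of 0.05.

```python
# Hypothetical 2x2 cross-tabulation (counts are illustrative, not taken from Table 6).
# Rows: driving mode (autonomous, manual); columns: collision type (rear-end, other).
observed = [
    [60, 30],
    [40, 70],
]


def chi_square(table):
    """Return Pearson's chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # Expected count under H0 (row and column variables are independent).
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


statistic, dof = chi_square(observed)

# Critical value of the chi-square distribution for 1 degree of freedom at alpha = 0.05.
CRITICAL_VALUE_DOF1 = 3.841
reject_h0 = dof == 1 and statistic > CRITICAL_VALUE_DOF1
```

For larger tables the degrees of freedom change, so the critical value (or a p-value from a statistics library) would need to be looked up accordingly.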

4.6.2. Analysis of Cross-Tabulation Results for Collision Type Variables

The following are results of a cross-tabulation table (Table 6) created to examine the mutual relationships and perform Pearson’s test. This section includes the results, the statistical hypothesis test, the test statistics, and the analysis of the extracted results for different types of collisions.

Null Hypothesis Testing

The null hypothesis ($H_{0}$), positing no significant relationship between the independent variables and collision types, was rejected. A chi-squared test with a p-value < 0.05 indicated a significant relationship between the variables and collision types. In terms of collision type distribution, the data show that around half of the crashes involve rear-end collisions with AVs, while side collisions, head-on collisions, and collisions with objects account for smaller but still substantial shares. Interestingly, Apple vehicles seem to perform better than others in preventing rear-end collisions with AVs, and this type of collision has trended downward from 2014 to 2023.

Looking closer at rear-end collisions with AVs, these are most common on freeways, boulevards, and regular roads, especially at intersections. They account for 80% of crashes at Y-shaped intersections and 60% at complex intersections. Rear-end crashes are more likely to occur when the AV is in automated driving mode rather than under manual control, but driverless AVs have the best record in avoiding these types of collisions.

Side collisions, on the other hand, tend to happen more at night, often with the AV hitting a parked or stopped non-AV. Lighting conditions play a big role, with 50% of side collisions occurring in total darkness versus only 24% when streetlights are present.

Angular and head-on collisions involving AVs result in more severe vehicle damage and injuries, with 15% of these crashes causing physical harm. They are more common with two-wheeled vehicles and other AVs as the secondary vehicle. Intersections, especially complex ones without traffic lights, see a higher share of these collisions compared to straight road segments.

Finally, collisions with objects peaked at 15% of all crashes in 2023, with Woven, Drive.AI, and Apple reporting the highest rates. These types of crashes are more prevalent on highways, freeways, and in parking areas and are slightly less likely to occur on straight road sections versus intersections.

Overall, the cross-tabulation analysis reveals the intricate relationships between factors like driving mode, road infrastructure, lighting conditions, and vehicle types in influencing the frequency and severity of different collision scenarios involving autonomous vehicles.

4.7. Feature Importance

In this study, the importance of variables for the balanced collision data was evaluated and interpreted using machine learning models, including decision trees, random forests, and gradient boosting, together with the SHAP explainer for variable importance interpretation and a three-layer neural network. All modeling and results were produced using the Python programming language. Assessing variable importance shows which variables contribute most to the models and which contribute least. These results support a better analysis and interpretation and are used to evaluate the results of the CART and Apriori models.

4.8. Feature Importance with Machine Learning Models

A combined variable importance chart was created using decision trees, random forests, and gradient boosting. The assessment of variable importance using machine learning models, as shown in Figure 9, indicated that the absence of non-AVs, non-AV direct movement, intersections with traffic lights, stopped AVs, intersections and streets as collision locations, ADS non-disengagement, and manual driving are important for all three models. These results made it clear that the presence of traffic lights, manual driving compared to ADSs, non-disengagement of ADSs, and the absence of marginal parking space relative to its presence are more important for the models. Intersections have greater importance than streets, and other non-AV movements (including reverse gear, overtaking, lane changes, entering traffic, driving in the wrong direction, etc.) are more important for the occurrence of various collisions compared to direct and circular AV movements. The results of all models for examining the mechanism of various collisions are consistent. It was also established that the balanced data results using SMOTE are consistent with the balanced data results with class weight parameters and p = 0.3 parameters in the CART and Apriori models.
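Tree-based importance scores of this kind are commonly derived from how much a split on a variable reduces class impurity. The following is a minimal sketch of that idea for a single binary variable, assuming Gini impurity as the criterion; the sample, the variable, and the function names are hypothetical, and the importance values in Figure 9 come from the fitted models rather than from this calculation.

```python
from collections import Counter

# Hypothetical toy sample: (traffic_light_present, collision_type).
samples = [
    (1, "rear_end"), (1, "rear_end"), (1, "rear_end"), (1, "broadside"),
    (0, "sideswipe"), (0, "sideswipe"), (0, "rear_end"), (0, "head_on"),
]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(rows):
    """Weighted drop in Gini impurity from splitting on the binary variable."""
    parent = [label for _, label in rows]
    left = [label for value, label in rows if value == 0]
    right = [label for value, label in rows if value == 1]
    weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
    return gini(parent) - weighted_children


decrease = impurity_decrease(samples)
```

In a full tree or ensemble, such decreases would be accumulated over every split that uses the variable and then normalized across variables.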

Feature Importance for Collision Type with SHAP

In examining the models, the interpretability of the model results is also of great importance in addition to accuracy and precision. The SHAP explanatory model was built on the XGB model. This algorithm can display the importance of variables and the importance of each sample on a graph. SHAP is one of the best methods for analyzing important variables in machine learning models, and it provides a better understanding of the importance of variables due to the creation of graphical charts. To understand the importance of variables in predicting collision types, the summary SHAP plot was examined. The SHAP values indicate the unique contribution of each variable to each prediction. Variables are ranked based on their global importance, and each point on the plot represents a SHAP value for the samples of each variable in the test data. This plot has four characteristics, including (1) color indicating the value of that variable from low (blue) to high (red), (2) horizontal position indicating the small or large effect of the variable on the prediction, (3) vertical position indicating the importance of a specific variable, and (4) density displaying the distribution of the variable in the dataset. In Figure 10a, since the variables are binary, for each variable, the blue color indicates the number of samples with less importance and the red color indicates the number of samples with greater importance. For each variable, the right side of the origin line shows 1 (presence), and the left side shows 0 (absence of conditions of that variable in the data).

According to the SHAP model, based on sections a and b in Figure 10, the absence of non-AV and other AV movements (including parking maneuvers, reverse gear, overtaking, lane changes, entering or exiting traffic, etc.) were identified as the most important factors for predicting collision types. Other non-AV movements, manual driving, cloudy and rainy weather, non-disengagement of ADSs, non-peak traffic hours, and parking location as the crash site are variables that have greater importance. For predicting collision type classes, the XGB model prioritized the absence of non-AV and other AV movements, manual driving, cloudy and rainy weather, non-disengagement of ADSs, non-peak traffic hours, and parking location as the crash site. It is possible that these priorities may differ for different algorithms.
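The notion of a per-variable contribution to a single prediction can be made concrete with an exact Shapley value calculation on a toy model. This is a sketch under stated assumptions: the scoring function, its coefficients, and the three binary variables are hypothetical, and absent variables are treated as the baseline. SHAP libraries use faster, model-specific estimators and a data-driven baseline, so this is not the procedure used in the study.

```python
from itertools import combinations
from math import factorial

FEATURES = ("manual_driving", "rainy_weather", "parking_location")


def toy_model(active):
    """Hypothetical score for one collision class given the set of variables equal to 1."""
    score = 0.10
    if "manual_driving" in active:
        score += 0.30
    if "rainy_weather" in active:
        score += 0.10
    if "manual_driving" in active and "parking_location" in active:
        score += 0.20  # interaction term shared between the two variables
    return score


def shapley_values(present):
    """Exact Shapley value of each present variable, with absent variables as the baseline."""
    players = [f for f in FEATURES if f in present]
    n = len(players)
    values = {}
    for player in players:
        others = [p for p in players if p != player]
        total = 0.0
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                with_player = toy_model(set(coalition) | {player})
                without_player = toy_model(set(coalition))
                total += weight * (with_player - without_player)
        values[player] = total
    return values


sample = {"manual_driving", "rainy_weather", "parking_location"}
phi = shapley_values(sample)
```

The contributions sum to the difference between the score for the sample and the baseline score, which is the additivity property that allows SHAP values to be stacked and ranked in a summary plot.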

4.9. Evaluation Metrics for Collision Models

Machine learning models were developed to select a model with the highest accuracy, precision, and minimal error for the creation of the SHAP explanatory model. These metrics and charts were created in Python 3.9. A total of 70% of the data was used for training and 30% for evaluating the models. According to Table 7, among the machine learning models, XGB had the highest accuracy and precision at 89%, the lowest error rate of 0.1, and the lowest TN value or mispredicted true negatives. Therefore, due to its high accuracy and resistance to overfitting, XGB was used to develop the SHAP model.
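The hold-out evaluation described above can be sketched as follows. The split helper, the labels, and the predictions are hypothetical; the sample size of 606 is used only because it is the crash count reported in the study, and the balanced dataset actually used for training may differ in size. Precision is computed here for one class treated as positive, which is one of several possible averaging conventions for a multi-class problem.

```python
import random


def split_indices(n_samples, test_fraction=0.3, seed=0):
    """Shuffle indices and return (train, test) lists for a hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, and error rate with one class treated as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {"accuracy": accuracy, "precision": precision, "error_rate": 1.0 - accuracy}


# Illustrative 70/30 split of 606 records.
train_idx, test_idx = split_indices(606)

# Hypothetical labels and predictions for ten held-out crashes.
y_true = ["rear_end"] * 5 + ["sideswipe"] * 3 + ["head_on"] * 2
y_pred = ["rear_end"] * 4 + ["sideswipe"] + ["sideswipe"] * 3 + ["rear_end", "head_on"]
metrics = binary_metrics(y_true, y_pred, positive="rear_end")
```

Because the error rate is defined as one minus accuracy, an accuracy of 89% corresponds to an error rate of about 0.11, in line with the rounded value of 0.1 reported for XGB.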

5. Discussion

In this section, the results are analyzed and interpreted, with each of the research hypotheses examined and verified using the obtained results and compared with previous studies. Additionally, the relationship between the results and the overall research objectives, the application of the research findings, and proposed solutions are discussed. At the end of this chapter, the limitations, suggestions, and general conclusions of this research are presented.

5.1. Examination of the Mechanism of AV Crashes

To investigate the mechanisms of crashes involving an AV and another party, three approaches, namely statistical analysis, data mining, and machine learning, were employed to model the dependent variable, the type of collision. Due to the imbalance in the data, balancing was performed using the SMOTE method. Models such as the CART, Apriori, RF, XGB, SHAP, and Pearson’s test were utilized for modeling and a detailed examination of the importance of variables for the created models. The results of this research have several practical implications, identifying gaps and providing suggestions that can be used for the development of joint efforts in the advancement of AV technology among multiple stakeholders, such as AV developers, traffic safety authorities, researchers, and law enforcement agencies. The findings from this study can serve as a foundation for collaborative initiatives aimed at enhancing the safety and reliability of autonomous vehicles, ultimately leading to a safer transportation ecosystem.
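The balancing step can be illustrated with a minimal sketch of the core SMOTE idea: a synthetic minority sample is placed at a random position on the segment between a minority sample and one of its nearest minority neighbors. The function name, the data points, and the parameter values are hypothetical, the features are assumed to be numeric, and the library implementation and settings used in the study are not reproduced here.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating minority samples toward near neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (squared Euclidean distance).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment between base and neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


# Hypothetical numeric encodings of a minority class (for example, pedestrian collisions).
minority_points = [(0.0, 1.0), (0.2, 0.8), (0.1, 0.9), (0.3, 1.0)]
new_points = smote(minority_points, n_new=6)
```

Every synthetic point lies between two existing minority samples, so the method adds minority observations without duplicating records exactly.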

The crash reports of AVs have been manually converted into a machine-readable format, tripling the number of reports compared to previous studies and providing a reliable source for future researchers. The outcomes of the research are directly applicable to various aspects of AV development and regulation. For instance, the identified gaps in AV technology can guide developers in prioritizing areas for improvement, such as enhancing sensor capabilities or refining decision-making algorithms to better handle complex traffic situations. Traffic safety authorities can use the suggestions to inform policy decisions and implement measures that promote safer interactions between AVs and other road users. Researchers can build upon the findings to explore new methodologies and technologies that address the identified weaknesses in current AV systems. Law enforcement agencies can benefit from the insights to develop training programs that prepare officers to handle AV-related incidents and enforce regulations effectively.

Moreover, the research provides a comprehensive understanding of the factors contributing to AV crashes, which can be instrumental in designing more robust testing protocols and safety standards. By leveraging the results, stakeholders can work together to create a more cohesive and proactive approach to AV safety, ensuring that the integration of autonomous vehicles into our transportation system is achieved in a manner that maximizes benefits while minimizing risks.

5.2. Analysis of the Results of Collision Models

The results show that 50.8% of crashes involve a rear-end collision with the AV, 23.8% involve a sideswipe collision, 7.5% involve an angular collision, 7.9% involve a head-on collision, 5.7% involve a collision with objects, 0.5% involve a collision with pedestrians, and 3.8% involve other types of collisions. Subsequently, the modeling results of the dependent variable types of collisions using the CART, Apriori, Pearson’s test, and a detailed examination of variable importance using SHAP are analyzed and interpreted.

5.2.1. Rear-End Collision with the AV

The most common type of collision is a rear-end collision with the AV, consistent with previous findings [8,39]. These crashes occur when the stopped AV in ADS mode is hit by a non-AV in motion [2]. Rear-end crashes are a key challenge, as AVs strictly follow rules, while non-AV drivers lack AV awareness [8].

Driverless AVs have lower rear-end collision rates than AVs in ADS mode, indicating their greater success in preventing these collisions. Improved sensor integration of cameras, radar, LiDAR, and night vision can help ADSs respond reliably in adverse conditions [14]. Traction control can also prevent wheel slippage on wet roads, increasing safety [37].

Rear-end collisions are more likely at signalized intersections, aligning with prior research [58]. Better education for non-AV drivers on AV capabilities and infrastructure solutions like dedicated lanes or AV-non-AV disengagement at intersections could mitigate these collisions [8].

Unfavorable weather like fog, clouds, and rain also contributes to increased rear-end crashes, consistent with [37,46,59]. Addressing sensor disruptions from moisture or poor visibility through design improvements can help AVs maintain reliable perception in adverse conditions.

Rear-end collisions with the AV happen more often during peak evening hours, on holidays, and at locations with points of interest. Anticipating and managing traffic flow in these high-risk scenarios could prevent incidents. Minor rear-end crashes still comprise 51% of all crashes [36], so continued development to enhance AV safety systems and their integration with human drivers is crucial.

5.2.2. Sideswipe Collision

Sideswipe collisions are the second most common collision type involving AVs. They are most likely when the stopped manual AV on a street or straight path is struck by a non-AV performing maneuvers like entering/exiting traffic, unsafe turns, or merging, a scenario with a 60% probability [49]. Even when the non-AV is in direct motion, the chance remains high at 46%, highlighting the importance of non-AV actions [8]. Common AV maneuvers in these crashes include direct movement, stopping, and circular movements.

In ADS mode, collisions are more likely if the non-AV is parked or overtaking at an intersection. AVs can also sideswipe parked vehicles in both manual and ADS modes. AV-to-AV sideswipes account for the highest percentage (43%) of the seven such incidents [35].

5.2.3. Broadside Collision

Broadside (angular) collisions are a significant concern for AVs, particularly at signalized intersections. They are more likely when the manual AV and non-AV are directly moving through the intersection, even with ADS disengagement [2]. The CART model indicates broadside crashes are more probable when the AV is moving through an intersection while the non-AV continues a different approach, without the ADS engaged [2].

In total, 81% of angular collisions occurred at intersections, mostly with traffic lights. These severe crashes appear to stem from non-compliance by non-AV drivers with red lights or right-of-way rules [2,45]. Broadside incidents involving AVs had a higher share of two-wheelers and other self-driving cars [8].

Consistent with prior findings, the data confirm a strong relationship between crash type and severity: rear-end and angular collisions result in greater bodily injury, while sideswipe and object crashes tend to cause more financial damage [2,12].

5.2.4. Head-On Collision

Head-on crashes with AVs are more prevalent on streets and direct paths and in clear weather when the AV is in manual mode without disengagement [8,10]. A total of 40% of these occurred when the stopped AV was hit, and they are also common in parking lots. The data show that the most frequent AV maneuvers before head-on crashes include overtaking, entering traffic, parking, left turns, stopping, and lane changes. In contrast, the non-AVs were often engaged in lane changes, driving in the opposite lane or wrong direction, stopping, entering traffic, and making left turns [8,10]. Head-on collisions are more likely on highways, in parking lots, and at intersections, especially T-junctions without signals, and less common on direct paths. Disengagement between the AV and other vehicles decreases the likelihood of head-on crashes. Environmental factors also play a role, as these collisions occur more frequently in rain, in cloudy conditions, at sunrise/sunset, and at night with adequate lighting. Locations with points of interest also see a higher incidence of head-on crashes with AVs. Fortunately, these crashes tend to result in minor or no damage, though they still account for 9% of financial costs and 4% of injuries [2].

5.2.5. Collision with Object

Collisions with objects are less studied but show an increased frequency. They are more likely on highways, freeways, roads, and in parking spaces rather than on direct paths, and they are more prevalent at intersections—particularly T-junctions without signals [10]. Compared to other types, object crashes account for 10% in manual mode and 5% in ADS mode. Notably, the probability decreases when the AV’s ADS is active, contradicting some prior findings that the ADS can increase object collision risks [10]. The most common AV maneuvers before object collisions include driving in the wrong direction, U-turns, parking, lane changes, turns, and direct movement. This suggests opportunities to enhance AV perception, prediction, and decision-making to better navigate complex environments and avoid these crashes.

The study’s hypotheses regarding the impact of environmental factors on AV collisions have been confirmed. The results show that weather conditions, ambient light, road types, and other environmental variables significantly influence the occurrence of different collision types. For instance, the likelihood of collisions with objects is higher on highways, while sideswipe crashes are more common on streets and highways, and head-on collisions are more prevalent in parking areas [12,20,60]. Poor visibility conditions like fog and snow have been linked to a higher probability of severe crashes [3,14], potentially due to sensor limitations in such environments. To mitigate these issues, enhancing camera, sensor, and LiDAR capabilities has been recommended [61,62].

The hypothesis regarding the impact of intersection types on collision patterns is also supported. Rear-end and angular crashes are more likely at intersections with traffic lights, while frontal, sideswipe, and AV front-end collisions are more common at priority intersections without signals [2,8,57]. This may stem from the complex mixed-traffic environments at intersections. Infrastructure improvements and V2X communication can help address these challenges [61,63].

Regarding the influence of autonomous driving modes, the data confirm that AVs perform better than manual driving in most collision types, except for rear-end incidents, which accounted for 65% of crashes in autonomous mode [11,41]. This could be due to drivers’ diverted attention and overreliance on automation. However, the hypothesis that disengagement of the automated driving system (ADS) before a collision can reduce severity is rejected, as disengagement often leads to more severe outcomes in common crash types [44,64].

The study also found a correlation between land use characteristics and collision patterns, with the highest occurrence near attractions like restaurants, bars, and bike parking [60,65]. This aligns with prior research on the impact of peripheral parking and limited maneuvering space [23,66], though the specific findings differ from some previous studies [33].

Overall, this comprehensive exploratory analysis provides valuable insights to help developers, safety agencies, and researchers improve mixed-traffic safety and develop enhanced AV models and scenarios.

5.3. Key Interrelationships

The interrelationships identified in this study shed light on the complex dynamics of crashes involving AVs. The findings have revealed significant interrelationships between environmental factors, driving modes, collision types, disengagement events, and land use, providing valuable insights for various stakeholders involved in traffic safety and autonomous vehicle development. These interrelationships underscore the need for tailored solutions and interventions to address the multifaceted challenges associated with the integration of AVs into mixed-traffic environments.

Highlighting the significant interrelationships between pre-collision conditions, collision types, collision severity, and disengagement distance provides us with a better understanding of the outcomes of various models.

The following are some of the key interrelationships presented:

Intersection locations are more conducive to rear-end collisions and angular collisions with AVs, while direct paths are more likely to experience other types of collisions.
ADS disengagement has been able to reduce rear-end collisions and collisions with objects, but it has increased the occurrence of other collision types.
Unlit roads at night are prone to sideswipe and rear-end collisions, but angular collisions do not occur there. In illuminated areas at night and during sunrise/sunset, all types of collisions except rear-end collisions are more likely to occur.
The most common types of collisions, namely sideswipe and rear-end collisions, are more prevalent on dark unlit roads without the presence of traffic lights and during sunrise/sunset, where visibility is limited. This may be due to the poor performance of AV cameras and sensors in darkness and glaring light during sunrise and sunset. Improving these conditions can undoubtedly contribute significantly to reducing crashes. Limited visibility conditions such as fog and snow have a greater impact on collisions with objects and collisions with the rear of the AV, while sideswipe collisions are common in clear weather.
Collisions with the rear of the AV have the highest frequency, but angular collisions have the highest likelihood of resulting in bodily injury. Most angular collisions have resulted in bodily injury to vulnerable road users such as motorcyclists, cyclists, and scooter users.

In conclusion, the comprehensive analysis conducted in this study has uncovered intricate interrelationships and critical insights into the factors influencing crashes involving AVs. By examining the impact of environmental conditions, driving modes, collision types, disengagement events, and land use, this study has provided valuable knowledge that can inform the development of targeted interventions and strategies to enhance traffic safety in the context of autonomous vehicles. The findings and insights presented in this study provide a valuable foundation for further research, policies, and technological advancements to address the risks and challenges of widespread AV adoption, including through the extraction of processable AV crash data and the demonstration of AV strengths and weaknesses in various crash scenarios.

5.4. Limitations and Recommendations

5.4.1. Limitations

While this study has generated valuable insights into the safety of autonomous vehicles, there are some important limitations to acknowledge.

First and foremost, the sample size and range of variables included in the analysis could be expanded in future research to increase the reliability and robustness of the findings. Collecting additional data on driver characteristics, pre-crash traffic conditions, vehicle speeds, and sensor performance would provide a more comprehensive understanding of the underlying mechanisms behind AV-involved collisions. Additionally, this study relied solely on crash reports from California. To gain a more holistic view, it would be beneficial to gather data from other regions and countries, as driving behaviors and traffic regulations can vary significantly across different contexts. Another limitation relates to the data collection process. The researchers had to manually extract and process the crash reports, which is a time-consuming task. It would be more efficient if manufacturers submitted the required data in a standardized machine-readable format directly to the relevant authorities. While the analytical approach used in this study is well established, the researchers acknowledge that as the sample size grows, the models can be further refined and extended to explore new areas, such as collisions between fully autonomous vehicles or the differences between driverless vehicles and conventional cars.

Finally, the researchers note that some of the relationships identified in this study contradict previous findings, underscoring the need for continued investigation and validation as the body of research on AV safety evolves. Despite these limitations, the insights gained from this study offer a valuable foundation for future research and the development of strategies to enhance the safety of autonomous vehicles as they become more prevalent on our roads.

5.4.2. Recommendations for Future Research

The key findings from this study provide a foundation for future research to improve AV safety in mixed-traffic environments. The recommended research directions include the following:

1. Enhancing sensor capabilities and algorithms:

Investigate AV sensor performance in challenging conditions;
Develop advanced sensor fusion and data processing algorithms;
Explore emerging sensor technologies to improve perception.

2. Improving AV–infrastructure coordination:

Research V2X communication for real-time information sharing;
Investigate dedicated lanes, intersection control, and other infrastructure solutions;
Analyze the impact of infrastructure design on AV safety.

3. Addressing human factors and behavioral interactions:

Conduct studies on non-AV driver awareness and acceptance of AVs;
Explore methods for effective driver education and training programs;
Investigate the impact of different AV control modes on driver behavior.

4. Validating and testing AV safety systems:

Develop advanced simulation environments and testing scenarios;
Establish standardized metrics and testing protocols;
Leverage real-world crash data to refine safety algorithms.

5. Adopting a holistic approach to traffic safety:

Explore the integration of AVs with other transportation technologies;
Investigate the impact of societal and policy factors on AV safety;
Conduct multidisciplinary research to address the challenges.

Additional recommendations:

Study the impact of increasing AV intelligence on safety;
Investigate the need for new standards and laws for intelligent transportation;
Examine the impact of V2V interactions on crash mechanisms;
Assess the safety implications of self-driving two-wheelers in mixed traffic.

By addressing these research directions, the scientific community can contribute to the development of safer and more reliable autonomous vehicles that can be successfully integrated into mixed-traffic environments, enhancing overall road safety.

6. Conclusions

The development of autonomous vehicles (AVs) in future transportation is not only inevitable but also critical for enhancing road safety, particularly in mixed-traffic environments where conventional vehicles coexist. This study has established a comprehensive dataset of 606 AV crashes in California from September 2014 to September 2023, utilizing crash narrative PDFs provided by the California Department of Motor Vehicles (CA DMV). By employing advanced methodologies, including the Apriori model for association rule extraction and classification and regression tree (CART) modeling, this research has uncovered significant patterns and interrelationships among various factors influencing AV crashes, such as pre-collision conditions, driving modes, collision types, and severity.

Key findings indicate that rear-end collisions are the most prevalent, highlighting the necessity for AVs, especially in autonomous mode, to maintain safe distances from other vehicles. This high incidence may stem from non-AVs’ inability to recognize AV driving patterns. Immediate recommendations include the manual operation of AVs in hazardous conditions and the implementation of warning systems for non-AV drivers regarding the operational state of AVs. Furthermore, enhancing driver training for conventional vehicle operators to better understand AV behaviors is essential.

The study also identified sideswipe collisions as a significant concern for driverless vehicles, necessitating improvements in vehicle-to-vehicle (V2V) communication and perception systems. The observed AV–AV crashes underscore the need for a common communication protocol to prevent misunderstandings that could lead to accidents. Additionally, the integration of advanced sensor technologies and artificial intelligence algorithms is crucial for improving AV interaction with both human-driven and other autonomous vehicles.

To mitigate various types of collisions, this study proposes several actionable solutions as follows:

Rear-end collisions: implement adaptive cruise control and advanced emergency braking systems, develop V2V communication capabilities, and conduct public education campaigns to raise awareness among non-AV drivers.

Sideswipe collisions: enhance V2V communication, improve sensor suites for better detection in low visibility, and consider dedicated lanes to separate AVs from human drivers.

Broadside collisions: improve sensor algorithms for detecting traffic violations, develop V2X communication systems, and implement dedicated lanes at high-risk intersections.

Head-on collisions: enhance detection systems for wrong-way maneuvers, improve AV communication at unsignalized intersections, and conduct targeted driver education campaigns.

Object collisions: develop robust path planning algorithms, enhance communication with infrastructure, and conduct extensive testing to identify vulnerabilities.

Addressing these technical, infrastructural, and human factors is vital for making AVs more resilient against prevalent collision types in mixed-traffic environments. As AV technology continues to evolve, a collaborative effort among researchers, manufacturers, and policymakers is essential to ensure a safe and effective integration of AVs into transportation systems. This study lays a foundational understanding that can guide future research and the development of strategies aimed at enhancing AV safety and reliability, ultimately contributing to a safer transportation ecosystem.

Author Contributions

Conceptualization, E.K. and S.R.D.; methodology, E.K.; software, E.K.; validation, E.K., S.R.D. and K.S.; formal analysis, E.K.; investigation, E.K.; resources, E.K.; data curation, E.K.; writing—original draft preparation, E.K.; writing—review and editing, E.K.; visualization, E.K.; supervision, K.S.; project administration, K.S. and S.R.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data extracted for this study are available through our GitHub repository (kohanpour1/AVs: AV CRASH, github.com/) to assist future researchers and promote research in the field of AVs. Additionally, raw crash reports in PDF format can be obtained via https://www.dmv.ca.gov/.

Conflicts of Interest

The authors declare that no conflicts of interest are associated with this work.

Abbreviations

AVs	Autonomous Vehicles
ADS	Automated Driving Systems
AVT	Autonomous Vehicle Testing
CA DMV	California Department of Motor Vehicles
CART	Classification and Regression Trees
DL	Deep Learning
ML	Machine Learning
RF	Random Forest
SAE	Society of Automotive Engineers
SHAP	SHapley Additive exPlanations
SMOTE	Synthetic Minority Over-Sampling Technique
V2I	Vehicle-to-Infrastructure
V2V	Vehicle-to-Vehicle
V2X	Vehicle-to-Everything
VMT	Vehicle Miles Traveled
XGB	eXtreme Gradient Boosting

References

NHTSA. Federal Automated Vehicles Policy: Accelerating the Next Revolution in Roadway Safety; U.S. Department of Transportation: Washington, DC, USA, 2016; p. 116. [Google Scholar]
Ashraf, M.T.; Dey, K.; Mishra, S.; Rahman, M.T. Extracting Rules from Autonomous-Vehicle-Involved Crashes by Applying Decision Tree and Association Rule Methods. Transp. Res. Rec. J. Transp. Res. Board. 2021, 2675, 522–533. [Google Scholar] [CrossRef]
Sivakanthan, S.; Cooper, R.; Lopes, C.; Kulich, H.; Deepak, N.; Lee, C.D.; Wang, H.; Candiotti, J.L.; Dicianno, B.E.; Koontz, A.; et al. Accessible Autonomous Transportation and Services: A Focus Group Study. Disabil. Rehabil. Assist. Technol. 2023, 19, 1992–1999. [Google Scholar] [CrossRef] [PubMed]
Lutin, J.M. Not If, but When: Autonomous Driving and the Future of Transit. J. Public Trans. 2018, 21, 92–103. [Google Scholar] [CrossRef]
Pokorny, P.; Høye, A.K. Descriptive Analysis of Reports on Autonomous Vehicle Collisions in California: January 2021–June 2022. Traffic Saf. Res. 2022, 2, 1–8. [Google Scholar] [CrossRef]
SAE International. Surface Vehicle. SAE Int. 2018, 4970, 1–5. [Google Scholar]
California DMV. Autonomous Vehicle Collision Reports. Available online: https://www.dmv.ca.gov/portal/vehicle-industry-services/autonomous-vehicles/autonomous-vehicle-collision-reports/ (accessed on 22 January 2024).
Lee, S.; Arvin, R.; Khattak, A.J. Advancing Investigation of Automated Vehicle Crashes Using Text Analytics of Crash Narratives and Bayesian Analysis. Accid. Anal. Prev. 2023, 181, 106932. [Google Scholar] [CrossRef]
Tibljaš, A.D.; Giuffrè, T.; Surdonja, S.; Trubia, S. Introduction of Autonomous Vehicles: Roundabouts Design and Safety Performance Evaluation. Sustainability 2018, 10, 1060. [Google Scholar] [CrossRef]
Boggs, A.M.; Wali, B.; Khattak, A.J. Exploratory Analysis of Automated Vehicle Crashes in California: A Text Analytics & Hierarchical Bayesian Heterogeneity-Based Approach. Accid. Anal. Prev. 2020, 135, 105354. [Google Scholar] [CrossRef]
Song, Y.; Chitturi, M.V.; Noyce, D.A. Automated Vehicle Crash Sequences: Patterns and Potential Uses in Safety Testing. Accid. Anal. Prev. 2021, 153, 106017. [Google Scholar] [CrossRef]
Chen, H.; Chen, H.; Zhou, R.; Liu, Z.; Sun, X. Exploring the Mechanism of Crashes with Autonomous Vehicles Using Machine Learning. Math. Probl. Eng. 2021, 2021, 1–10. [Google Scholar] [CrossRef]
Liu, Q.; Wang, X.; Wu, X.; Glaser, Y.; He, L. Crash Comparison of Autonomous and Conventional Vehicles Using Pre-Crash Scenario Typology. Accid. Anal. Prev. 2021, 159, 106281. [Google Scholar] [CrossRef] [PubMed]
Saez-Perez, J.; Wang, Q.; Alcaraz-Calero, J.M.; Garcia-Rodriguez, J. Design, Implementation, and Empirical Validation of a Framework for Remote Car Driving Using a Commercial Mobile Network. Sensors 2023, 23, 1671. [Google Scholar] [CrossRef] [PubMed]
Dai, S. Prioritize Winter Crash Severity Influencing Factors in US Midwestern for Autonomous Vehicle. 2020. Available online: http://digital.library.wisc.edu/1793/79895 (accessed on 22 January 2024).
Madadi, B.; Van Nes, R.; Snelder, M.; Van Arem, B. Optimizing Road Networks for Automated Vehicles with Dedicated Links, Dedicated Lanes, and Mixed-Traffic Subnetworks. J. Adv. Transp. 2021, 2021, 1–17. [Google Scholar] [CrossRef]
Xiao, J.; Goulias, K.G. How Public Interest and Concerns about Autonomous Vehicles Change over Time: A Study of Repeated Cross-Sectional Travel Survey Data of the Puget Sound Region in the Northwest United States. Transp. Res. Part C Emerg. Technol. 2021, 133, 103446. [Google Scholar] [CrossRef]
Rahman, M.T.; Dey, K.; Dimitra Pyrialakou, V.; Das, S. Factors Influencing Safety Perceptions of Sharing Roadways with Autonomous Vehicles Among Vulnerable Roadway Users. J. Saf. Res. 2023, 85, 266–277. [Google Scholar] [CrossRef]
Jin, W.; Islam, M.; Chowdhury, M. Risk-Based Merging Decisions for Autonomous Vehicles. J. Saf. Res. 2022, 83, 45–56. [Google Scholar] [CrossRef]
Wang, S.; Li, Z. Exploring the Mechanism of Crashes with Automated Vehicles Using Statistical Modeling Approaches. PLoS ONE 2019, 14, e0214550. [Google Scholar] [CrossRef]
Ren, W.; Yu, B.; Chen, Y.; Gao, K. Divergent Effects of Factors on Crash Severity under Autonomous and Conventional Driving Modes Using a Hierarchical Bayesian Approach. Int. J. Environ. Res. Public Health 2022, 19, 11358. [Google Scholar] [CrossRef]
Hemenway, D.; Lee, L.K. Lesson from the Continuing 21st Century Motor Vehicle Success. Inj. Prev. 2022, 28, 480–482. [Google Scholar] [CrossRef]
Xu, C.; Ding, Z.; Wang, C.; Li, Z. Statistical Analysis of the Patterns and Characteristics of Connected and Autonomous Vehicle Involved Crashes. J. Saf. Res. 2019, 71, 41–47. [Google Scholar] [CrossRef]
Mallory, A.; Ramachandra, R.; Valek, A.; Suntay, B.; Stammen, J. Pedestrian Injuries in the United States: Shifting Injury Patterns with the Introduction of Pedestrian Protection into the Passenger Vehicle Fleet. Traffic Inj. Prev. 2024, 25, 463–471. [Google Scholar] [CrossRef] [PubMed]
Soori, M.; Arezoo, B.; Dastres, R. Artificial Intelligence, Machine Learning and Deep Learning in Advanced Robotics, a Review. Cogn. Robot. 2023, 3, 54–70. [Google Scholar] [CrossRef]
Khattak, A.J.; Wali, B. Analysis of Volatility in Driving Regimes Extracted from Basic Safety Messages Transmitted Between Connected Vehicles. Transp. Res. Part C Emerg. Technol. 2017, 84, 48–73. [Google Scholar] [CrossRef]
Favarò, F.; Eurich, S.; Nader, N. Autonomous Vehicles’ Disengagements: Trends, Triggers, and Regulatory Limitations. Accid. Anal. Prev. 2018, 110, 136–148. [Google Scholar] [CrossRef]
Yoon, Y.; Kim, T.; Lee, H.; Park, J. Road-Aware Trajectory Prediction for Autonomous Driving on Highways. Sensors 2020, 20, 4703. [Google Scholar] [CrossRef]
Hasan, A.S.; Jalayer, M.; Das, S.; Asif Bin Kabir, M. Application of Machine Learning Models and SHAP to Examine Crashes Involving Young Drivers in New Jersey. Int. J. Transp. Sci. Technol. 2023, 14, 156–170. [Google Scholar] [CrossRef]
Zhang, Y.; Yang, X.J.; Zhou, F. Disengagement Cause-and-Effect Relationships Extraction Using an NLP Pipeline. IEEE Trans. Intell. Transp. Syst. 2022, 23, 21430–21439. [Google Scholar] [CrossRef]
Weng, J.; Zhu, J.-Z.; Yan, X.; Liu, Z. Investigation of Work Zone Crash Casualty Patterns Using Association Rules. Accid. Anal. Prev. 2016, 92, 43–52. [Google Scholar] [CrossRef]
De Oña, J.; López, G.; Abellán, J. Extracting Decision Rules from Police Accident Reports Through Decision Trees. Accid. Anal. Prev. 2013, 50, 1151–1160. [Google Scholar] [CrossRef]
Boggs, A.M.; Arvin, R.; Khattak, A.J. Exploring the Who, What, When, Where, and Why of Automated Vehicle Disengagements. Accid. Anal. Prev. 2020, 136, 105406. [Google Scholar] [CrossRef]
Banerjee, S.S.; Jha, S.; Cyriac, J.; Kalbarczyk, Z.T.; Iyer, R.K. Hands off the Wheel in Autonomous Vehicles?: A Systems Perspective on over a Million Miles of Field Data. In Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018, Luxembourg, 25–28 June 2018; pp. 586–597. [Google Scholar] [CrossRef]
Alambeigi, H.; McDonald, A.D.; Tankasala, S.R. Crash Themes in Automated Vehicles: A Topic Modeling Analysis of the California Department of Motor Vehicles Automated Vehicle Crash Database. arXiv 2020, arXiv:2001.11087. [Google Scholar] [CrossRef]
Das, S.; Dutta, A.; Tsapakis, I. Automated Vehicle Collisions in California: Applying Bayesian Latent Class Model. IATSS Res. 2020, 44, 300–308. [Google Scholar] [CrossRef]
Ding, S.; Abdel-Aty, M.; Wang, D.; Barbour, N.; Wang, Z.; Zheng, O. Exploratory Analysis of Injury Severity Under Different Levels of Driving Automation (SAE Level 2–5) Using Multi-Source Data. Accid. Anal. Prev. 2024, 206, 107692. [Google Scholar] [CrossRef]
Kutela, B.; Avelar, R.E.; Bansal, P. Modeling Automated Vehicle Crashes with a Focus on Vehicle At-Fault, Collision Type, and Injury Outcome. J. Transp. Eng. A Syst. 2022, 148. [Google Scholar] [CrossRef]
Favaro, F.M.; Nader, N.; Eurich, S.O.; Tripp, M.; Varadaraju, N. Examining Accident Reports Involving Autonomous Vehicles in California. PLoS ONE 2017, 12, e0184952. [Google Scholar] [CrossRef]
Sinha, A.; Vu, V.; Chand, S.; Wijayaratna, K.; Dixit, V. A Crash Injury Model Involving Autonomous Vehicle: Investigating of Crash and Disengagement Reports. Sustainability 2021, 13, 7938. [Google Scholar] [CrossRef]
Wu, K.W.; Wu, W.F.; Liao, C.C.; Lin, W.A. Risk Assessment and Enhancement Suggestions for Automated Driving Systems through Examining Testing Collision and Disengagement Reports. J. Adv. Transp. 2023, 1–18. [Google Scholar] [CrossRef]
Sinha, A.; Chand, S.; Wijayaratna, K.P.; Virdi, N.; Dixit, V. Comprehensive Safety Assessment in Mixed Fleets with Connected and Automated Vehicles: A Crash Severity and Rate Evaluation of Conventional Vehicles. Accid. Anal. Prev. 2020, 142, 105567. [Google Scholar] [CrossRef]
Dixit, V.V.; Chand, S.; Nair, D.J. Autonomous Vehicles: Disengagements, Accidents and Reaction Times. PLoS ONE 2016, 11, e0168054. [Google Scholar] [CrossRef]
Petrovic, D.; Mijailović, R.; Pešić, D. Traffic Accidents with Autonomous Vehicles: Type of Collisions, Manoeuvres and Errors of Conventional Vehicles’ Drivers. Transp. Res. Procedia 2020, 45, 161–168. [Google Scholar] [CrossRef]
Kutela, B.; Das, S.; Dadashova, B. Mining Patterns of Autonomous Vehicle Crashes Involving Vulnerable Road Users to Understand the Associated Factors. Accid. Anal. Prev. 2022, 165, 106473. [Google Scholar] [CrossRef] [PubMed]
Novat, N.; Kidando, E.; Kutela, B.; Kitali, A.E. A Comparative Study of Collision Types between Automated and Conventional Vehicles Using Bayesian Probabilistic Inferences. J. Saf. Res. 2023, 84, 251–260. [Google Scholar] [CrossRef] [PubMed]
Zhu, S.; Meng, Q. What Can We Learn from Autonomous Vehicle Collision Data on Crash Severity? A Cost-Sensitive CART Approach. Accid. Anal. Prev. 2022, 174, 106769. [Google Scholar] [CrossRef] [PubMed]
Wang, S.; Li, Z. Exploring Causes and Effects of Automated Vehicle Disengagement Using Statistical Modeling and Classification Tree Based on Field Test Data. Accid. Anal. Prev. 2019, 129, 44–54. [Google Scholar] [CrossRef] [PubMed]
Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A Comparative Analysis of Gradient Boosting Algorithms. Artif. Intell. Rev. 2021, 54, 1937–1967. [Google Scholar] [CrossRef]
Li, Y.; Fan, W.; Song, L.; Liu, S. Combining Emerging Hotspots Analysis with XGBoost for Modeling Pedestrian Injuries in Pedestrian-Vehicle Crashes: A Case Study of North Carolina. J. Transp. Saf. Secur. 2023, 15, 1203–1225. [Google Scholar] [CrossRef]
Othman, K. Public Acceptance and Perception of Autonomous Vehicles: A Comprehensive Review. AI Ethics 2021, 1, 355–387. [Google Scholar] [CrossRef]
Hong, J.; Tamakloe, R.; Park, D. Application of Association Rules Mining Algorithm for Hazardous Materials Transportation Crashes on Expressway. Accid. Anal. Prev. 2020, 142, 105497. [Google Scholar] [CrossRef]
Wu, K.F.; Wang, L. Exploring the Combined Effects of Driving Situations on Freeway Rear-End Crash Risk Using Naturalistic Driving Study Data. Accid. Anal. Prev. 2021, 150, 105866. [Google Scholar] [CrossRef]
Qu, Y.; Li, Z.; Liu, Q.; Pan, M.; Zhang, Z. Crash/Near-Crash Analysis of Naturalistic Driving Data Using Association Rule Mining. J. Adv. Transp. 2022, 1–19. [Google Scholar] [CrossRef]
Reiser, M.; Cagnone, S.; Zhu, J. An Extended GFfit Statistic Defined on Orthogonal Components of Pearson’s Chi-Square. Psychometrika 2023, 88, 208–240. [Google Scholar] [CrossRef] [PubMed]
Chen, A.; Tan, Y. Pandemic Effects to Autonomous Vehicles Test Operations in California. PLoS ONE 2022, 17, e0264484. [Google Scholar] [CrossRef]
Goodall, N.J. Comparison of Automated Vehicle Struck-from-Behind Crash Rates with National Rates Using Naturalistic Data. Accid. Anal. Prev. 2021, 154, 106056. [Google Scholar] [CrossRef] [PubMed]
Xu, Z.; Jiang, Z.; Wang, G.; Wang, R.; Li, T.; Liu, J.; Zhang, Y.; Liu, P. When the Automated Driving System Fails: Dynamics of Public Responses to Automated Vehicles. Transp. Res. Part C Emerg. Technol. 2021, 129, 103271. [Google Scholar] [CrossRef]
Chen, H.; Chen, H.; Liu, Z.; Sun, X.; Zhou, R. Analysis of Factors Affecting the Severity of Automated Vehicle Crashes Using XGBoost Model Combining POI Data. J. Adv. Transp. 2020, 1–13. [Google Scholar] [CrossRef]
Hu, L.; Song, Y.; Wang, F.; Lin, M. Exploring the Differences in Rider Injury Severity in Vehicle-Two-Wheelers Accidents with Dissimilar Fault Parties. Traffic Inj. Prev. 2024, 25, 78–84. [Google Scholar] [CrossRef]
Yu, X.; Marinov, M. A Study on Recent Developments and Issues with Obstacle Detection Systems for Automated Vehicles. Sustainability 2020, 12, 3281. [Google Scholar] [CrossRef]
Wang, K.; Li, G.; Chen, J.; Long, Y.; Chen, T.; Chen, L.; Xia, Q. The Adaptability and Challenges of Autonomous Vehicles to Pedestrians in Urban China. Accid. Anal. Prev. 2020, 145, 105692. [Google Scholar] [CrossRef]
Ahangar, M.N.; Ahmed, Q.Z.; Khan, F.A.; Hafeez, M. A Survey of Autonomous Vehicles: Enabling Communication Technologies and Challenges. Sensors 2021, 21, 706. [Google Scholar] [CrossRef]
Pendleton, S.D.; Andersen, H.; Du, X.; Shen, X.; Meghjani, M.; Eng, Y.H.; Rus, D.; Ang, M.H. Perception, Planning, Control, and Coordination for Autonomous Vehicles. Machines 2017, 5, 6. [Google Scholar] [CrossRef]
Shaaban, K.; Ghanim, M.S. Modeling of Severity in Red-Light-Running Crashes Using Deep Learning Recognition. In Proceedings of the 2023 Intermountain Engineering, Technology and Computing (IETC), Provo, UT, USA, 12–13 May 2023; pp. 181–186. [Google Scholar]
Ghanim, M.S.; Shaaban, K. Investigating the Impact of Autonomous Vehicles on Roundabout Performance Using Microsimulation. In Proceedings of the 2024 Intermountain Engineering, Technology and Computing (IETC), Logan, UT, USA, 13–14 May 2024; pp. 341–346. [Google Scholar]

Figure 1. Conceptual framework. Process of crash data extraction to modeling.

Figure 2. The heat map of AV crashes in the test areas.

Figure 3. Sample OL-316 form for AV collision reports provided by the CA DMV. (a) First page of form OL-316; (b) Second page of form OL-316; (c) Third page of form OL-316.

Figure 4. Word cloud of points of interest with the highest number of crashes.

Figure 5. Descriptive statistics of CA DMV data as of 31 December 2023.

Figure 6. Descriptive statistics of CA DMV data. (a) Types of ADS disengagement; (b) Type of intersection at the collision site; (c) Intersection with traffic signals; (d) Types of AV collisions; (e) AV driving mode; (f) Collision severity.

Figure 7. Decision tree for classification and regression for the variable of collision type.

Figure 8. Association rules bubble chart.

Figure 9. Variable importance for collision type using XGB, CART, and RF algorithms.

Figure 10. Feature importance with SHAP. (a) Impact on model output; (b) Average impact on model output.

Table 1. Research background summary.

Reference	Data and Dependent Variables	Research Method
[43]	12 crash reports from 9/2014 to 11/2015; correlation analysis of disengagement with various collisions	Descriptive statistics
[42]	26 crash reports from 9/2014 to 3/2017; analysis of crashes and collision types	Descriptive statistics
[20]	107 crash reports; examination of crash severity, collision type	Ordinal logistic regression and classification tree regression
[44]	53 crash reports from 2015 to 2017; examination of collision types	Statistical methods (chi-square)
[2]	198 crash reports from 2016 to 2020; examination of collision types	Decision tree and association rules
[38]	333 manually extracted records; examination of AV culpability, collision type, and injury outcome	Bayesian network model
[45]	252 crash reports (35 vulnerable user crashes) from 2017 to 2020; vulnerable user collisions with AV	Simple Bayesian models, random forests, neural networks, and support vector machines
[46]	127 self-driving car crashes and 865 regular car crashes	Bayesian network model (BN)
[8]	260 reports from January 2019 to December 2021; examination of pre-crash conditions, AV driving modes, crash types, and crash outcomes	Path analysis and repeated and Bayesian methods

Table 2. Parameters of machine learning models.

Algorithm	Parameter Values	Algorithm	Parameter Values
Apriori	min_Lift = 1, min_support = 0.03, min_confidence = 0.4	XGBOOST	Gamma = 0.005, max_depth = 3, learning_rate = 0.1, n_estimators = 100
Association Rule	min_Lift = 1, min_support = 0.03, min_confidence = 0.4	SHAP	model_output = ‘margin’
CART	criterion = ‘gini’, max_depth = 6, min_samples_split = 5, min_samples_leaf = 2, max_Leaf = 17, class_weight = ‘balanced’	SMOTE	sampling_strategy = ‘auto’, k_neighbors = 3
RF	n_estimators = 100, criterion = ‘gini’, max_depth = 6, min_samples_split = 5, min_samples_leaf = 2	Data	Train = 0.7, Test = 0.3
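
Table 2 lists SMOTE with k_neighbors = 3. The following is a minimal sketch of the interpolation idea behind SMOTE, written in plain Python under stated assumptions: the function name `smote_sketch`, the toy points, and the seed are illustrative only, and this is not the implementation used in the study (the source does not name the library).

```python
import random

def smote_sketch(minority, n_new, k_neighbors=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k_neighbors])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class points (two numeric features), not study data.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
new_points = smote_sketch(minority, n_new=4)
```

Each synthetic point lies on the line segment between two existing minority samples, so oversampling in this way adds no values outside the range already observed for the minority class.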

Table 3. Classification and regression tree results for collision type.

		Crash Type Probability
Rule No.	Leaf Label	Head-On	Sideswipe	Rear-End	Broadside	Hit Object
1	(M_V2-P S > 0.5) and (M_AV-STOPPED ≤ 0.5) and (Location_Intersection > 0.5) and (Disengagement(no) > 0.5)	0	0.129	0.062	0.741	0
2	(M_V2-P S > 0.5) and (M_AV-Other ≤ 0.5) and (M_AV-STOPPED ≤ 0.5) and (Location_Intersection > 0.5) and (Parking_provision > 0.5)	0	0.214	0	0.785	0
3	(M_V2-P S > 0.5) and (M_AV-Other ≤ 0.5) and (M_AV-STOPPED ≤ 0.5) and (signal(no) ≤ 0.5) and (Parking_provision(no) ≤ 0.5)	0	0.288	0	0.711	0
4	(M_V2-Nan ≤ 0.5) and (M_V2-P S > 0.5) and (Location_Intersection > 0.5) and (M_AV-STOPPED > 0.5)	0	0.145	0.745	0.076	0
5	(M_V2-Nan ≤ 0.5) and (M_AV-STOPPED > 0.5) and (M_V2-P S > 0.5) and (Location_Intersection > 0.5) and (Mode(Conventional) < 0.5) and (signal(no) ≤ 0.5)	0.17	0.203	0.495	0.081	0
6	(M_V2-Nan ≤ 0.5) and (M_AV-P S > 0.5) and (Location_Intersection ≤ 0.5) and (Mode(Conventional) ≤ 0.5) and (signal(no) > 0.5)	0	0.262	0.555	0.12	0
7	(M_V2-Nan ≤ 0.5) and (M_AV-Other ≤ 0.5) and (M_V2-P S ≤ 0.5) and (Location_Intersection > 0.5) and (Mode(Conventional) > 0.5) and (signal(no) > 0.5)	0	0.263	0.454	0.12	0
8	(M_V2-Nan ≤ 0.5) and (Location_Intersection ≤ 0.5) and (M_AV-STOPPED > 0.5) and (M_V2-Other > 0.5) and (Parking_provision(no) > 0.5)	0	0.245	0.615	0.075	0
9	(M_V2-Other > 0.5) and (M_AV-STOPPED > 0.5) and (Location_Intersection ≤ 0.5) and (Mode(ADS) ≤ 0.5)	0	0.596	0.155	0.248	0
10	(M_V2-Nan ≤ 0.5) and (M_AV-Other ≤ 0.5) and (M_V2-Other < 0.5) and (Location_Intersection > 0.5) and (Mode(Conventional) ≤ 0.5) and (Traffic_Peak > 0.5)	0.293	0.449	0	0.257	0
11	(M_V2-Other > 0.5) and (M_AV-P S > 0.5) and (Location_Intersection < 0.5) and (Mode(ADS) < 0.5)	0	0.567	0.148	0.283	0
12	(M_V2-Nan ≤ 0.5) and (M_V2-P S < 0.5) and (Location_Intersection ≤ 0.5) and (M_AV-STOPPED > 0.5) and (Mode(ADS) ≥ 0.5)	0	0.46	0.368	0.171	0
13	(M_V2-Nan ≤ 0.5) and (M_AV-Other > 0.5) and (Location_Parking > 0.5)	0.877	0.019	0.053	0.05	0
14	(M_V2-Nan > 0.5) and (Mode(ADS) > 0.5) and (Weather_Clear > 0.5) and (POI_C > 0.5)	0.564	0	0	0	0.435
15	(M_V2-Nan > 0.5) and (Mode(ADS) ≤ 0.5) and (Weather_Clear ≤ 0.5)	0	0	0	0	1
16	(M_V2-P S > 0.5) and (M_AV-STOPPED > 0.5) and (Location_Intersection ≤ 0.5) and (Mode(ADS) ≤ 0.5)	0	0	0	1	0
17	(M_V2-Nan ≤ 0.5) and (M_AV-Other > 0.5) and (Location_Parking lot ≤ 0.5) and (Parking_provision(no) ≤ 0.5) and (M_V2-P S ≤ 0.5) and (POI_T > 0.5)	0.517	0.049	0.142	0.09	0.199
18	(M_V2-P S > 0.5) and (Mode(ADS) ≤ 0.5) and (Weather_Cloudy_Rainy ≤ 0.5) and (POI_C ≤ 0.5)	0.781	0.149	0	0.068	0
19	(M_V2-Nan ≤ 0.5) and (M_AV-Other > 0.5) and (Location_Parking lot > 0.5)	0.876	0.019	0.053	0.051	0
20	(M_V2-Nan > 0.5) and (Mode(ADS) > 0.5) and (POI_C ≤ 0.5)	0.105	0	0	0	0.895

M_AV = MOVEMENT_AV; M_V2 = MOVEMENT_Non_AV; POI_T = POI_Transportation Facility; POI_C = POI_Commercial. For each rule, the remaining probability corresponds to other types of collisions.
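
Each leaf in Table 3 is a conjunction of threshold tests on one-hot indicators. As a rough illustration of how such rules can be applied to a crash record, the sketch below transcribes rules 13 and 15 from the table and checks which of them a record satisfies. The function name `matching_rules`, the dictionary layout, and the example record are assumptions made for this sketch, not part of the study.

```python
import operator

OPS = {">": operator.gt, "<=": operator.le}

# Two rules transcribed from Table 3; "<=" stands for the table's "≤".
RULES = {
    13: [("M_V2-Nan", "<=", 0.5), ("M_AV-Other", ">", 0.5), ("Location_Parking", ">", 0.5)],
    15: [("M_V2-Nan", ">", 0.5), ("Mode(ADS)", "<=", 0.5), ("Weather_Clear", "<=", 0.5)],
}

def matching_rules(record, rules=RULES):
    """Return the rule numbers whose conditions all hold for a record.
    Indicators missing from the record are treated as 0."""
    return [
        number
        for number, conditions in rules.items()
        if all(OPS[op](record.get(name, 0), threshold) for name, op, threshold in conditions)
    ]

# Hypothetical record: no second vehicle, ADS mode not active, weather not clear.
record = {"M_V2-Nan": 1, "Mode(ADS)": 0, "Weather_Clear": 0}
print(matching_rules(record))
```

For this record only rule 15 matches, which Table 3 associates with a hit-object probability of 1; a record with M_AV-Other and Location_Parking set to 1 would instead match rule 13 (head-on probability 0.877).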

Table 4. Evaluation metrics for the CART model.

Metric	Value
accuracy	0.79
precision	0.68
recall	0.79
f1-score	0.77
support	555.00
G-mean	0.39
MSE	0.19

Table 5. Association rules for collision type.

Rule No.	Antecedents	Consequents	Support	Confidence	Lift
1	M_AV-Stopped And Mode(ADS) And Location_Intersection And Disengagement(no)	REAR END	0.21	0.68	1.33
2	Mode(ADS) And M_AV-Stopped And M_V2-P S And Location_Intersection	REAR END	0.26	0.78	1.54
3	Disengagement(no) And Mode(ADS) And signal And M_AV-Stopped	REAR END	0.11	0.8	1.58
4	M_V2-P S And Disengagement And Location_Intersection And M_AV-P S And signal(no)	REAR END	0.1	0.76	1.5
5	M_AV-P S And Disengagement(no) And Weather_Clear And signal(no)	SIDE SWIPE	0.09	0.42	1.82
6	POI_C And Location_Intersection And Disengagement(no) And Mode(Conventional)	SIDE SWIPE	0.07	0.41	1.76
7	M_AV-P S And M_V2-Other And Location_Street And Mode(ADS)	SIDE SWIPE	0.05	0.47	2
8	signal And POI_C And M_AV-Other And M_V2-P S And Mode(Conventional)	SIDE SWIPE	0.04	0.44	1.48
9	M_V2-Other And Weather_Clear And Location_Street And Disengagement(no)	HEAD-ON	0.04	0.41	1.06
10	M_AV-Stopped And M_V2-Other And Disengagement(no) And Mode(Conventional)	HEAD-ON	0.04	0.94	1.16
11	Day And Mode(Conventional) And signal(no) And POI_C And M_AV-Stopped	HEAD-ON	0.03	0.43	1.6
12	M_AV-P S And Disengagement(no) And M_V2-Other And Location_Street	HEAD-ON	0.03	0.91	1.16
13	Signal And Disengagement(no) And M_V2-P S And M_AV-P S	BROADSIDE	0.04	0.44	1.85
14	Location_Intersection And Weather_Clear And Mode(Conventional) And M_V2-P S	BROADSIDE	0.03	0.42	1.31
15	Mode(ADS) And signal And POI_C And M_V2-P S	BROADSIDE	0.03	0.42	1.68
16	Location_Intersection And M_AV-Other And Weather_Clear And signal	BROADSIDE	0.03	0.41	1.83
17	And M_V2-Nan And Mode(Conventional) And M_AV-P S	HIT OBJECT	0.03	0.9	15.74
18	Location_Intersection And Disengagement(no) And signal(no) And M_AV-TURN	HIT OBJECT	0.04	0.74	12.85
19	Mode(Conventional) And Weather_Clear And POI_T And M_AV-Other	HIT OBJECT	0.03	0.91	15.74
20	Disengagement(no) And Mode(Conventional) And signal(no) And POI_T	HIT OBJECT	0.03	0.46	2.74

M_AV = MOVEMENT_AV; M_V2 = MOVEMENT_Non_AV; POI_T = POI_Transportation Facility; POI_C = POI_Commercial.
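
The support, confidence, and lift columns in Table 5 follow the usual association-rule definitions: support is the share of records containing both antecedent and consequent, confidence is that count divided by the number of records containing the antecedent, and lift is confidence divided by the overall share of the consequent. The sketch below computes the three measures on a small set of hypothetical records; the function name `rule_metrics` and the toy data are assumptions for illustration and do not reproduce the study's dataset.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent.
    Assumes the antecedent and consequent each occur at least once."""
    n = len(transactions)
    n_a = sum(antecedent <= t for t in transactions)   # records with antecedent
    n_c = sum(consequent <= t for t in transactions)   # records with consequent
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    support = n_both / n
    confidence = n_both / n_a
    lift = confidence / (n_c / n)
    return support, confidence, lift

# Hypothetical toy records, NOT the study's data.
transactions = [
    {"M_AV-Stopped", "Mode(ADS)", "REAR END"},
    {"M_AV-Stopped", "Mode(ADS)", "REAR END"},
    {"M_AV-Stopped", "Mode(ADS)", "SIDE SWIPE"},
    {"M_AV-P S", "Mode(Conventional)", "REAR END"},
    {"M_AV-P S", "Mode(ADS)", "SIDE SWIPE"},
    {"M_AV-P S", "Mode(Conventional)", "BROADSIDE"},
    {"M_AV-Stopped", "Mode(Conventional)", "HEAD-ON"},
    {"M_AV-P S", "Mode(ADS)", "SIDE SWIPE"},
]
support, confidence, lift = rule_metrics(
    transactions, {"M_AV-Stopped", "Mode(ADS)"}, {"REAR END"}
)
```

Because lift equals confidence divided by the consequent's overall share, the four rear-end rules in Table 5 can be cross-checked by hand: 0.68/1.33, 0.78/1.54, 0.8/1.58, and 0.76/1.5 all give roughly 0.51, in line with rear-end crashes making up just over half of the incidents.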

Table 6. Cross-tabulation results and null hypothesis testing of collision type with other variables.

Pearson + Frequency + Cross_Table (Table Summaries)		Pearson’s Test Statistics				Frequency		Class Percentages for Collision_Type_AV
Feature		χ²	p	df	H₀	N *	Percentage	Broadside	Head-On	Hit Object	Rear-End	Sideswipe
Company		142	0.04	114	Rejecting
	Apple					17	2.81	0.0	17.7	29.4	35.3	11.8
	Cruise					221	36.47	10.0	8.6	4.1	46.6	25.8
	Lyft					8	1.32	0.0	25.0	0.0	50.0	25.0
	Mercedes-Benz					6	0.99	16.7	16.7	0.0	66.7	0.0
	Pony.AI					9	1.49	0.0	22.2	22.2	55.6	0.0
	Waymo					238	39.27	7.6	4.6	6.3	52.9	23.1
	Weride					5	0.83	0.0	0.0	0.0	80.0	20.0
	Zoox					81	13.37	6.2	12.4	1.2	53.1	24.7
Year		12.6	0.18	9.0	Accepting
	(September)2014					1	0.17	0.0	0.0	0.0	0.0	100.0
	2015					9	1.49	0.0	0.0	0.0	77.8	22.2
	2016					15	2.48	6.7	13.3	6.7	53.3	20.0
	2017					30	4.95	0.0	3.3	3.3	66.7	26.7
	2018					75	12.38	5.3	2.7	6.7	54.7	24.0
	2019					105	17.33	9.5	6.7	0.0	66.7	14.3
	2020					44	7.26	4.6	4.6	9.1	52.3	22.7
	2021					117	19.30	12.8	10.3	7.7	38.5	26.5
	2022					151	24.92	6.6	8.0	4.0	49.7	25.8
	(September) 2023					59	9.74	6.8	18.6	15.3	32.2	23.7
Vehicle2_type		59.1	0.29	54.0	Accepting
	AV					7	1.16	14.3	14.3	0.0	28.6	42.9
	Human					3	0.50	0.0	0.0	0.0	0.0	0.0
	mid_size cars					260	42.90	5.8	7.3	0.8	56.5	24.6
	Object					40	6.60	0.0	5.0	77.5	0.0	5.0
	Sub_compact					177	29.21	7.3	9.0	0.0	63.3	18.6
	Trucks/Buses					64	10.56	4.7	9.4	1.6	48.4	32.8
	two_wheeler					55	9.08	25.5	9.1	1.8	29.1	32.7
Location		146	0.00	54.0	Rejecting
	Avenue					102	16.83	8.75	8.75	6.25	51.25	21.25
	Boulevard					31	5.12	8.57	5.71	2.86	65.71	14.29
	Freeway					16	2.64	0.00	0.00	22.22	77.78	0.00
	Highway					4	0.66	0.00	25.00	12.50	37.50	25.00
	Road					29	4.79	0.00	9.38	12.50	65.63	12.50
	Street					399	65.84	8.22	7.75	4.69	48.59	25.82
	Parking lot					39	6.44	6.25	12.50	12.50	37.50	18.75
Intersection		1.8	0.18	1.0	Accepting
	No					205	33.83	3.4	13.2	8.3	39.5	26.8
	Yes					401	66.17	9.7	5.5	4.5	56.6	21.5
Intersection_Geometry		12.0	0.02	4.0	Rejecting
	Complex_Intersection					25	4.13	16.0	0.0	4.0	60.0	20.0
	Intersection					316	52.15	9.5	7.3	3.5	53.2	24.1
	Straight					144	23.76	2.8	16.0	10.4	31.3	28.5
	T_Intersection					77	12.71	9.1	3.9	7.8	58.4	16.9
	Y_Intersection					44	7.26	2.3	0.0	4.6	79.6	13.6
Signal		5.0	0.03	1.0	Rejecting
	Signal					284	46.86	9.9	5.6	2.1	57.4	22.2
	Yield					322	53.14	5.6	10.3	9.0	45.0	24.2
Parking_provision		0.0	1.00	1.0	Accepting
	Non_Parking provision					144	23.76	5.6	5.6	6.3	64.6	15.3
	Parking provision					462	76.24	8.2	8.9	5.6	46.5	25.8
Disengagement		2.7	0.26	2.0	Accepting
	Disengagement					88	14.52	13.6	11.4	5.7	38.6	28.4
	Driverless					21	3.46	9.5	23.8	4.8	23.8	38.1
	Non_Disengagement					497	82.01	6.4	6.8	5.8	54.1	21.7
Mode		2.7	0.26	2.0	Accepting
	AV Mode					306	50.50	6.5	4.5	2.3	65.4	18.1
	Driverless					21	3.47	5.6	27.8	5.6	22.2	38.9
	Manual Mode					279	46.03	9.0	10.8	9.7	36.6	28.0
AV_Status		0.3	0.61	1.0	Accepting
	Moving					346	57.10	9.8	8.1	10.1	39.9	28.0
	Stopped					260	42.90	4.6	8.1	0.0	65.4	16.9
Non_AV_Status		30.4	0.03	18.0	Rejecting
	Moving					514	84.81	8.8	7.2	0.4	57.6	23.2
	Nan					43	7.09	0.0	4.7	72.1	0.0	4.7
	Stopped					49	8.08	2.0	20.4	4.1	24.5	40.8
Weather		30.1	0.03	17.0	Rejecting
	Clear					538	88.78	7.8	7.6	4.5	50.7	24.5
	Cloudy					41	6.77	2.4	9.8	12.2	56.1	17.1
	Fog/Visibility					22	3.63	0.0	0.0	20.0	60.0	20.0
	Raining					5	0.83	13.6	18.2	22.7	40.9	4.6
Lighting		17.6	0.48	18.0	Accepting
	Dark—no streetlights					2	0.33	0.0	0.0	0.0	50.0	50.0
	Dark—streetlights					153	25.25	10.5	11.8	7.8	42.5	23.5
	Daylight					434	71.62	6.9	6.7	5.1	53.2	23.3
	Dusk–Dawn					17	2.81	0.0	11.8	5.9	64.7	17.7
Movement_AV		460	0.00	78.0	Rejecting
	Parked					11	1.82	0.0	0.0	0.0	0.0	0.0
	Parking maneuver					11	1.82	0.0	18.2	36.4	27.3	9.1
	Changing lanes					16	2.64	0.0	6.3	18.8	43.8	18.8
	Entering traffic					3	0.50	0.0	33.3	0.0	0.0	66.7
	Making left turn					33	5.45	21.2	18.2	6.1	27.3	27.3
	Making right turn					38	6.27	5.3	0.0	13.2	47.4	31.6
	Making U-turn					1	0.17	0.0	0.0	100.0	0.0	0.0
	Merging					3	0.50	0.0	0.0	0.0	66.7	33.3
	Passing other vehicle					3	0.50	0.0	33.3	0.0	33.3	33.3
	Proceeding straight					182	30.03	12.1	7.1	8.8	36.3	31.9
	Slowing/stopping					54	8.91	7.4	5.6	0.0	68.5	14.8
	Stopped					234	38.61	3.9	8.6	0.0	68.8	17.5
	Traveling wrong way					1	0.17	0.0	0.0	100.0	0.0	0.0
Movement_Non_AV		791	0.00	102	Rejecting
	Parked					34	5.61	0.0	11.8	5.9	29.4	44.1
	Parking maneuver					11	1.82	9.1	0.0	0.0	27.3	9.1
	Changing lanes					49	8.09	12.0	44.0	0.0	20.0	16.0
	Entering traffic					10	1.65	30.0	20.0	0.0	30.0	20.0
	Making left turn					26	4.29	15.4	7.7	0.0	30.8	46.2
	Making right turn					35	5.78	0.0	5.7	0.0	65.7	25.7
	Merging					7	1.16	0.0	0.0	0.0	57.1	42.9
	Nan					43	7.10	0.0	4.7	72.1	0.0	4.7
	Other unsafe turning					14	2.31	28.6	7.1	0.0	7.1	57.1
	Passing other vehicle					33	5.45	0.0	3.0	0.0	33.3	63.6
	Proceeding straight					262	43.23	10.3	3.1	0.4	75.2	9.5
	Ran off road					1	0.17	0.0	0.0	0.0	0.0	100.0
	Slowing/stopping					15	2.48	6.7	6.7	0.0	86.7	0.0
	Stopped					17	2.81	0.0	29.4	0.0	23.5	35.3
	Traveling wrong way					9	1.49	6.7	33.3	0.0	13.3	40.0
	Xing into opposing lane					9	1.49	0.00	33.33	0.00	22.22	44.44
Collision type_Non_AV		8.3	0.12	5.0	Accepting
	Broadside					17	1.98	50.0	33.3	0.0	0.0	8.3
	Head-on					351	57.92	10.5	3.1	0.0	81.5	3.7
	Hit object					43	7.10	0.0	4.7	72.1	0.0	4.7
	Rear-end					39	6.44	0.0	12.8	7.7	25.6	41.0
	Sideswipe					39	6.44	2.6	69.2	0.0	15.4	2.6
	Other					122	20.13	1.6	0.0	0.8	4.9	88.5
Injury_Level		606	5.05	3.0	Accepting
	Death					0	0.00	100.0	0.0	0.0	0.0	0.0
	Minor Damage					87	14.36	13.8	4.6	0.0	66.7	14.9
	No Damage					511	84.32	6.3	8.8	6.7	48.3	24.9
	Severe Damage					8	1.32	14.3	0.0	14.3	42.9	14.3
POI_tags		19.6	0.35	18.0	Accepting
	atm					10	1.65	0.0	0.0	10.0	60.0	30.0
	bank					15	2.48	6.7	0.0	6.7	53.3	33.3
	bar					23	3.80	4.4	8.7	0.0	47.8	39.1
	bicycle_parking					105	17.33	6.7	5.7	5.7	61.9	16.2
	bicycle_rental					18	2.97	11.1	5.6	0.0	44.4	33.3
	cafe					63	10.40	9.5	9.5	7.9	41.3	20.6
	car_sharing					18	2.97	5.6	16.7	5.6	50.0	22.2
	car_wash					7	1.16	28.6	14.3	0.0	28.6	28.6
	clinic					7	1.16	5.6	16.7	5.6	50.0	22.2
	doctors					7	1.16	28.6	14.3	0.0	28.6	28.6
	drinking_water					13	2.15	5.6	16.7	5.6	50.0	22.2
	fast_food					25	4.13	5.6	16.7	5.6	50.0	22.2
	fountain					5	0.83	28.6	14.3	0.0	28.6	28.6
	fuel					20	3.30	5.6	16.7	5.6	50.0	22.2
	parking					5	0.83	28.6	14.3	0.0	28.6	28.6
	parking_entrance					29	4.79	3.5	3.5	6.9	65.5	13.8
	parking_space					5	0.83	20.0	0.0	20.0	60.0	0.0
	place_of_worship					20	3.30	10.0	10.0	0.0	40.0	30.0
	post_office					6	0.99	3.5	3.5	6.9	65.5	13.8
	pub					16	2.64	20.0	0.0	20.0	60.0	0.0
	public_bookcase					8	1.32	0.0	0.0	0.0	80.0	0.0
	restaurant					115	18.98	10.0	10.0	0.0	40.0	30.0
	school					7	1.16	3.5	3.5	6.9	65.5	13.8
	taxi					6	0.99	0.0	16.7	16.7	16.7	33.3
	toilets					8	1.32	0.0	0.0	12.5	50.0	25.0
	vending_machine					15	2.48	0.0	16.7	16.7	16.7	33.3
Category_POI		3.5	0.39	4.0	Accepting
	Commercial buildings					340	56.11	9.4	8.8	4.4	47.7	25.0
	Office building					47	7.76	4.3	4.3	6.4	51.1	34.0
	Residential buildings					5	0.83	0.0	20.0	0.0	60.0	20.0
	Transportation facilities					214	35.31	5.6	7.5	7.9	55.6	18.2
Lighting condition		15.6	0.02	6.0	Rejecting
	Day					490	80.86	6.9	7.1	5.7	53.3	22.2
	Night					116	19.14	10.3	12.1	6.0	40.5	27.6
Traffic Category		9.6	0.65	12.0	Accepting
	Evening peak traffic					88	14.52	9.1	6.8	3.4	54.6	22.7
	Morning peak traffic					50	8.25	0.0	12.0	6.0	50.0	26.0
	Other hours					468	77.23	8.1	7.9	6.2	50.2	23.1
Weekend		12.7	0.05	6.0	Rejecting
	Weekday					467	77.06	6.0	7.7	6.6	52.7	22.3
	Weekend					139	22.94	13.0	9.4	2.9	44.6	26.6
Holiday		11.2	0.08	6.0	Accepting
	Holiday					13	2.15	7.7	30.8	7.7	23.1	23.1
	Non_Holiday					593	97.85	7.6	7.6	5.7	51.4	23.3
Severity
	Property damage					511	84.32	14.7	4.2	1.1	64.2	14.7
	Bodily injury					95	15.68	6.3	8.8	6.7	48.3	24.9

* N = count.
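
The χ² values in Table 6 come from Pearson's test of independence between collision type and each feature. As a minimal sketch of the calculation, assuming a plain contingency table of counts and no continuity correction, the statistic sums (observed − expected)² / expected over all cells, with expected counts derived from the row and column totals. The function name `pearson_chi_square` and the 2 × 2 counts below are hypothetical and are not taken from the study.

```python
def pearson_chi_square(table):
    """Return (chi2, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, df

# Hypothetical counts: rows = Day / Night, columns = rear-end / other.
chi2, df = pearson_chi_square([[30, 20], [10, 40]])

# Critical value of the chi-square distribution at the 0.05 level for df = 1.
CRITICAL_05_DF1 = 3.841
reject_independence = chi2 > CRITICAL_05_DF1
```

With these toy counts the statistic exceeds the critical value, so the null hypothesis of independence would be rejected, which is the sense in which Table 6 labels a feature "Rejecting".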

Table 7. Evaluation metrics for collision type models with balanced dataset by SMOTE.

Model	Overall Accuracy	TN	FN	FP	TP	F-Score	Precision	MSE	Recall	G-Mean
XGBoost	0.89	16	17	16	129	0.89	0.89	0.1	0.89	0.89
SVM	0.86	22	18	22	128	0.87	0.87	0.13	0.87	0.86
CART	0.85	26	20	26	126	0.85	0.84	0.14	0.85	0.85
RF	0.8	45	15	45	131	0.8	0.81	0.19	0.81	0.81
MLP	0.88	20	17	20	20	0.88	0.88	0.12	0.87	0.87
NB	0.73	63	21	63	125	0.72	0.74	0.27	0.73	0.74
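
The metrics reported in Table 7 can be related to confusion-matrix counts through their standard binary definitions. The source does not state how the multi-class results were aggregated into single TN, FN, FP, and TP values, so the sketch below is only an illustration under the assumption of a binary (one-vs-rest) setting, with G-mean taken as the geometric mean of recall and specificity. The function name `binary_metrics` and the counts are hypothetical and do not reproduce any row of the table.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also called sensitivity
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f_score": 2 * precision * recall / (precision + recall),
        "g_mean": math.sqrt(recall * specificity),
    }

# Hypothetical counts chosen for easy arithmetic, not taken from Table 7.
metrics = binary_metrics(tp=80, fp=20, fn=20, tn=80)
```

Unlike accuracy, G-mean drops sharply when either class is poorly recognized, which is why it is commonly reported alongside accuracy for imbalanced data.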

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Kohanpour, E.; Davoodi, S.R.; Shaaban, K. Analyzing Autonomous Vehicle Collision Types to Support Sustainable Transportation Systems: A Machine Learning and Association Rules Approach. Sustainability 2024, 16, 9893. https://doi.org/10.3390/su16229893