Example On Flight Delay Data

Hindawi
Journal of Advanced Transportation

Volume 2021, Article ID 4292778, 10 pages
https://doi.org/10.1155/2021/4292778
Research Article
Flight Delay Classification Prediction Based on
Stacking Algorithm
Jia Yi,1 Honghai Zhang,1 Hao Liu,2 Gang Zhong,1 and Guiyi Li1
1 College of Civil Aviation, Nanjing University of Aeronautics & Astronautics, Nanjing 211106, China
2 College of Science, Nanjing University of Aeronautics & Astronautics, Nanjing 211106, China
Correspondence should be addressed to Honghai Zhang; zhh0913@163.com
Received 2 June 2021; Revised 19 July 2021; Accepted 11 August 2021; Published 18 August 2021
Academic Editor: Chi-Hua Chen
Copyright © 2021 Jia Yi et al. This is an open access article distributed under the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
With the development of civil aviation, the number of flights keeps increasing, and flight delay has become a serious issue that is even becoming the norm. This paper aims to show that the Stacking algorithm has advantages in airport flight delay prediction, especially with respect to the algorithm selection problem in machine learning. In this research, the principle of the Stacking classification algorithm is introduced, the SMOTE algorithm is selected to process imbalanced datasets, and the Boruta algorithm is utilized for feature selection. The first-level learner of Stacking contains five supervised machine learning algorithms: KNN, Random Forest, Logistic Regression, Decision Tree, and Gaussian Naive Bayes. The second-level learner is Logistic Regression. To verify the effectiveness of the proposed method, comparative experiments are carried out on Boston Logan International Airport flight datasets from January to December 2019. Multiple indexes are used to comprehensively evaluate the prediction results: Accuracy, Precision, Recall, F1 Score, ROC curve, and AUC Score. The results show that the Stacking algorithm not only improves the prediction accuracy but also maintains great stability.
1. Introduction

Airports are significant nodes of air transportation. The number of airport flight delays has been increasing in recent years. The Federal Aviation Administration defines a flight as delayed when it arrives or departs more than 15 minutes later than scheduled. In 2019, the arrival delay rate was 19.2% and the departure delay rate was 18.18% in the United States [1]. Flight delays cause many negative effects, such as passenger inconvenience, increased airport pressure, and airline losses [2]. Effective flight delay prediction could support the formulation of flight plans and emergency plans, reduce economic loss, and alleviate the negative impact. The Bureau of Transportation Statistics has recorded nationwide flight operation data in the United States, which provides valuable and reliable datasets for studying flight delay issues. Meanwhile, with the development of artificial intelligence, machine learning technology has been widely used in airport flight delay prediction. Machine learning technology involves multiple disciplines, such as probability, statistics, and computer science [3]. Machine learning can break the limitations of mathematical formulas and improve the accuracy of flight delay prediction. In general, machine learning technology can be roughly divided into supervised learning, unsupervised learning, deep learning, reinforcement learning, and ensemble learning. Each of these learning methods has its own characteristics, and appropriate methods and algorithms should be selected for a given study. Poorly performing algorithms not only fail to give accurate results but also waste computing power. Therefore, algorithm selection is an important process in machine learning technology. This paper aims to provide an applicable flight delay classification prediction method, especially for solving algorithm selection problems.

Many scholars have studied the flight delay issue based on different machine learning methods. Esmaeilzadeh and Mokhtarimousavi used a support vector machine to mine the nonlinear relationship between flight delay and various features. Given the black-box nature of machine learning, the sensitivity analysis of corresponding variables and
independent variables was conducted, and weather factors, airport scene operation, demand, and other factors were comprehensively considered. This research provided a new idea for studying the causes of flight delay [3]. Kalyani et al. proposed a flight arrival delay classification model based on XGBoost and a flight arrival delay regression model based on linear regression. As one of the most widely used algorithms in the machine learning field, linear regression has the advantages of a simple principle and easy application, and XGBoost is an ensemble learning algorithm based on the Decision Tree, which can find the optimal result by constantly adjusting the hyperparameters [4]. Zhang and Ma established a flight delay prediction model based on the Catboost algorithm, and the prediction accuracy reached 0.77. The SHAP value was used to analyze each feature's degree of contribution [5]. Khaksar and Sheikholeslami developed a hybrid method combining the J48 Decision Tree with K-means to train flight datasets from the United States and Iran, respectively, compared it with four algorithms, and obtained the optimal results with the hybrid method [6].

When utilizing machine learning techniques, most scholars use multiple machine learning algorithms to train the same datasets and identify the optimal algorithm and the optimal prediction result by comparing evaluation indexes [7, 8]. Moreover, with the development of machine learning technology, the variety of algorithms is increasing, and most scholars tend to use at least three algorithms in one study. Henriques and Feiteira presented a classification model for Hartsfield-Jackson International Airport which utilized Decision Tree, Random Forest, and Multilayer Perceptron. The Multilayer Perceptron provided the highest accuracy [9]. Choi et al. attempted two supervised learning algorithms, Decision Tree and KNN, and two ensemble learning algorithms, Random Forest and Adaboost, and the results showed that the ensemble classifiers outperformed the single-algorithm classifiers [10]. Stefanovič et al. took Lithuania Airport flight delay datasets as the research object and selected seven machine learning algorithms, including probabilistic neural network, multilayer perceptron neural network, Gradient-Boosted Tree, and Decision Tree; the Gradient-Boosted Tree obtained the optimal results [11]. The above studies are inspirational, but most of them obtain one optimal model through model comparison while the other models are eliminated, which wastes computing power. In addition, flight datasets are enormous and versatile, and the stability of an algorithm is significant for real-world applications. However, most studies did not pay attention to algorithm stability, especially for some novel algorithms. In this study, we build a flight delay classification prediction model based on Stacking and design experiments to verify the stability of Stacking.

Flight delay prediction methods based on machine learning technology are gradually becoming mature. However, one core process that is often neglected in previous studies is feature selection [12]. Feature selection is an essential step in machine learning [13]. The main purpose of feature selection is to remove redundant features and improve model efficiency by calculating feature importance. Onan and Korukoglu presented a feature selection model based on the ensemble method. The experiment result shows that the proposed method not only effectively processed the complex features but also improved the classification accuracy [14]. In addition, considering weather information could effectively improve the prediction accuracy [15], but exact weather information might not be available until a few hours before the flight. Therefore, weather features are not included in this research for the time being. The rest of this paper is organized as follows. Section 2 elaborates the research methods and principles used in this study, including the Stacking classification algorithm, the SMOTE algorithm, the Boruta algorithm, and several indexes. Section 3 describes the data sources and the data preprocessing method. Section 4 discusses comparative experiments and comprehensively evaluates the prediction results through Accuracy, Precision, Recall, F1 Score, ROC curve, and AUC Score. In Section 5, the conclusions and expectations of this research are discussed.

2. Methodologies

2.1. Stacking Classification Methods. Stacking methods are derived from the idea of ensemble learning based on learners' combinations [16]. A Stacking learner usually contains two levels: the first-level learner consists of multiple base learners that are trained on the same datasets, and their predicted outputs become a new dataset that is passed to the second-level learner [17]. To avoid overfitting, cross-validation can be used when training the first-level learner, and we select the k-fold cross-validation method in this paper [18]. The main process of Stacking methods is shown in Figure 1.

The initial datasets are divided into training dataset Dta and testing dataset Dts, and then the training dataset Dta is divided into k subdatasets, Dta1, Dta2, ..., Dtak. In the k-fold cross-validation method, each of the i models is trained k times: each subdataset becomes the test dataset in turn, and the other subdatasets serve as training datasets. For each model i, the k prediction results Tir (r = 1, 2, ..., k) are combined, and together they form a new training dataset Nta, which is brought into the second-level learner.

When k-fold cross-validation is carried out in the first-level learner, every time Model i is trained on the training dataset Dta, the testing dataset Dts is predicted as well. Therefore, k prediction results Rik for the same testing dataset Dts are obtained. When solving a regression problem, the averaging method is usually adopted to process the k prediction results. In the classification problem, the processing of the prediction results is shown in Figure 2.

In machine learning, a binary classifier first outputs the probability values of the positive and negative classes. The category with the higher probability value is taken as the category of the data sample, and the probability values sum to 1. In Stacking classification, model i predicts that the
probability of the data sample belonging to positive p is P(p) = (p1 + p2 + ... + pk)/k and the probability of the data sample belonging to negative is P(n) = (n1 + n2 + ... + nk)/k. Thus, the prediction results Ri of each Model i on testing dataset Dts form a new testing dataset Nts, which is passed to the second-level learner. The second-level learner can be a relatively simple algorithm; it is trained with the new training dataset Nta and tested with the new testing dataset Nts.

Figure 1: Stacking methods framework.

Figure 2: Stacking classification method of the second-level learner framework.

2.2. Imbalanced Datasets Processing. Imbalanced datasets are one of the common problems in machine learning classification. This is mainly reflected in the fact that the number of samples belonging to a certain category in the datasets is far greater than that of other categories. To improve the accuracy, most classification algorithms tend to identify the minority class data samples as the majority class samples when training on imbalanced datasets. Although such a classifier can achieve a certain accuracy, it is of little practical use [19]. The flight delay datasets in this paper are typical imbalanced datasets, and the data volume of on-time flights is nearly four times that of delayed flights (3.78 : 1).

Oversampling and undersampling are the commonly used techniques to deal with imbalanced datasets [20]. The main idea of these two technologies is to reconstruct the sample size. Undersampling achieves balance by reducing the majority samples, while oversampling achieves balance by increasing the minority samples.

In this paper, the SMOTE (synthetic minority oversampling technique) algorithm is selected to process the imbalanced datasets [21]. The SMOTE algorithm is an oversampling technology based on the KNN algorithm. It improves on simple random oversampling, which increases the sample size by randomly copying minority samples; this helps avoid overfitting and effectively improves the generalization ability of the model. The main process of the SMOTE algorithm is as follows:

(1) The Euclidean distance is calculated from each minority sample x to the other minority samples.
(2) The sampling rate is set according to the difference between the minority sample size and the majority sample size, and the k nearest neighbors of the minority sample x are determined, from which neighbors xi are randomly selected.
(3) Between the minority sample x and a selected neighbor xi, according to the sampling rate set in Step (2), a new sample xn is calculated according to the following formula:
$$x_n = x + \mathrm{rand}(0, 1) \times \lvert x - x_i \rvert. \tag{1}$$
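The three SMOTE steps can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the paper: the function names `k_nearest` and `smote_sample` and the toy minority samples are hypothetical, and the new point is placed on the segment between x and its neighbor using the signed difference (xi − x), as in the usual SMOTE formulation, rather than the absolute value printed in formula (1).

```python
import math
import random

def k_nearest(x, minority, k):
    """Step (1)-(2): the k minority samples closest to x by Euclidean distance."""
    others = [m for m in minority if m is not x]
    return sorted(others, key=lambda m: math.dist(x, m))[:k]

def smote_sample(x, minority, k, rng):
    """Step (3): one synthetic sample between x and a randomly chosen neighbor."""
    neighbor = rng.choice(k_nearest(x, minority, k))
    lam = rng.random()  # plays the role of rand(0, 1) in formula (1)
    # Assumption: interpolate with the signed difference (neighbor - x).
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

# Toy minority-class samples (hypothetical, not flight data).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
rng = random.Random(0)
synthetic = smote_sample(minority[0], minority, k=2, rng=rng)
print(synthetic)
```

With k = 2 the distant sample [5.0, 5.0] is never used as a neighbor of [0.0, 0.0], so the synthetic point stays close to the local minority cluster.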
2.3. Features Selection. Feature selection is one of the core contents of machine learning, which aims to eliminate redundant features, improve model accuracy, and reduce operation time. The commonly used feature selection methods include Filter, Wrapper, and Embedded [22]. The Boruta algorithm is utilized in this research to select features. Boruta is a wrapper feature selection algorithm based on Random Forest. The importance of each feature to the dependent variable is calculated to determine whether the feature is retained. The main process of the Boruta algorithm is as follows:

(1) Establish shadow features: the values of the original features are randomly shuffled to form a shadow feature matrix, and the new feature matrix is obtained by splicing the shadow feature matrix with the original feature matrix.
(2) The new feature matrix is passed to a Random Forest classifier for training, which outputs the importances of features v.
(3) The Z score of each original feature and shadow feature is calculated, and the calculation formula is as follows:

$$z_{\mathrm{score}} = \frac{A_v}{S_v}, \tag{2}$$

where Av represents the average value of feature importance and Sv represents the standard deviation of feature importance.
(4) The maximum zscore is searched among the shadow features, denoted as Zmax.
(5) If the original feature zscore is greater than Zmax, the feature is recorded as “important.” On the contrary, if the original feature zscore is less than Zmax, the feature is marked as “unimportant” and deleted.
(6) Steps (1) to (5) are repeated until all features have been marked.

2.4. Evaluation Indexes. In this paper, Accuracy, Precision, Recall, and F1 Score are calculated from the output confusion matrix to evaluate the prediction results. The confusion matrix is shown in Figure 3 [23].

TP is True Positive, indicating that both the true value and the predicted value are positive, that is, the number of positive samples predicted correctly. FP is False Positive, indicating that the true value is negative, but the predicted value is positive, that is, the number of negative samples that are wrongly predicted to be positive. TN is True Negative, indicating that both the true value and the predicted value are negative, that is, the number of negative samples that are correctly predicted. FN is False Negative, indicating that the true value is positive, but the predicted value is negative, that is, the number of positive samples that are wrongly predicted to be negative.

True \ Predicted Positive Negative
Positive TP FN
Negative FP TN
Figure 3: Confusion matrix.

Accuracy is the ratio of correctly predicted samples to the total amount of samples, and its calculation formula is as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \cdot 100\%. \tag{3}$$

Accuracy is one of the most used evaluation indexes in classification. However, the flight delay data sample is an imbalanced dataset, that is, the sample size of on-time flights is much larger than that of delayed flights. To improve accuracy, the model tends to identify the minority samples as the majority; the model can then obtain higher accuracy, but the prediction of delayed samples is almost ineffective. Therefore, the predicted results also need to be evaluated by Precision, Recall, and F1 Score in the classification problem.

Precision indicates the percentage of correct predictions among the samples with a positive predicted value. The calculation formula is as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \cdot 100\%. \tag{4}$$

Recall indicates the percentage of correct predictions among the samples with a positive true value. The calculation formula is as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \cdot 100\%. \tag{5}$$

In practice, Precision and Recall usually trade off against each other: when the Precision increases, the Recall tends to decrease, and when the Recall increases, the Precision tends to decrease. In this paper, the Precision focuses on how many of the flights predicted as delayed were actually delayed, while the Recall focuses on how many of all delayed flights were successfully predicted. Moreover, the F1 Score, as the harmonic average of Precision and Recall, considers both. The calculation formula is as follows:

$$F1\ \mathrm{score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \tag{6}$$

3. Data Acquisition and Preprocessing

3.1. Data Sources. In this research, we collect flight data from January to December 2019 at Logan International Airport in Boston, Massachusetts, the United States. The total number of departure flight records is 149,576, and the total number of arrival flight records is 149,338. Logan Airport is one
of the busiest airports in the eastern United States, with 31,941 flights delayed in the departure dataset and 35,941 flights delayed in the arrival dataset. The departure delay rate is 21.35%, and the arrival delay rate is 24.07%. The monthly distribution of flight delays in 2019 is shown in Figure 4. Both datasets include 9 features, and the input features and descriptions are shown in Table 1.

Figure 4: Monthly number of delayed flights and on-time flights (panels: Departure and Arrival; x-axis: MONTH; series: Delay and On Time).

Table 1: The input features and descriptions.
Features Format Description
Quarter int64 Quarter (1–4)
Month int64 Month (1–12)
Day_of_month int64 Day of month (1–31)
Day_of_week int64 Day of week (1–7)
CRS_dep_time int64 CRS departure time (local time: hhmm)
CRS_arr_time int64 CRS arrival time (local time: hhmm)
CRS_elapsed_time int64 CRS elapsed time, in minutes
Distance int64 Miles
Diverted int64 diverted = 1, not diverted = 0

3.2. Uniformization Processing. To avoid the impact of dimensional differences among features in the dataset, the data are normalized in this paper. The aim is to center each feature at a mean of 0 and to scale it by its range. The calculation formula is as follows:

$$x' = \frac{X - X_{\mathrm{mean}}}{X_{\mathrm{max}} - X_{\mathrm{min}}}, \tag{7}$$

where Xmean is the mean value, Xmax is the maximum value, and Xmin is the minimum value.

4. Experiment and Analysis

4.1. Features Selection Results. In this research, the Boruta algorithm is utilized to select features for the departure delay dataset and arrival dataset, respectively, and the results are shown in Figure 5. In the departure dataset, all features are marked as important. CRS_DEP_TIME is the most important feature in the departure dataset. In the arrival dataset, 8 features are estimated as important features, and Diverted has been rejected. The departure dataset feature importance is shown in Table 2, and the arrival dataset feature importance is shown in Table 3.

To explore the influence of feature importance on the prediction results, the following experiment is carried out. At first, only the most important feature is input for training, and then one feature is added at a time according to the importance value until all the features are input. According to
Figure 5: Features selection results: (a) departure; (b) arrival.
Table 2: Departure dataset features importance.

meanImp medianImp minImp maxImp normHits Decision
Quarter 18.7022 18.38765 15.96021 24.59658 1 Confirmed
Month 39.15992 39.53052 35.61664 42.88205 1 Confirmed
Day_of_month 37.13472 36.92707 34.23672 40.63482 1 Confirmed
Day_of_week 34.41572 34.80809 29.75517 38.51781 1 Confirmed
CRS_dep_time 47.17696 46.49994 39.36456 57.88962 1 Confirmed
CRS_arr_time 35.90235 34.88594 32.14023 40.42103 1 Confirmed
CRS_elapsed_time 30.42818 31.15777 24.04919 35.14513 1 Confirmed
Diverted 8.957969 9.035397 7.912347 9.665486 1 Confirmed
Distance 28.66349 29.39897 22.9603 31.81813 1 Confirmed
Table 3: Arrival dataset features importance.

meanImp medianImp minImp maxImp normHits Decision
Quarter 21.59 21.61953 19.64103 24.85327 1 Confirmed
Month 48.19586 47.78515 43.47335 52.73112 1 Confirmed
Day_of_month 48.35522 48.61903 41.88918 55.16763 1 Confirmed
Day_of_week 46.23737 46.70293 43.08921 49.43334 1 Confirmed
CRS_dep_time 48.68234 49.67468 43.40899 52.34451 1 Confirmed
CRS_arr_time 42.41878 42.13526 40.19638 46.36798 1 Confirmed
CRS_elapsed_time 42.78774 43.67602 36.73169 47.83175 1 Confirmed
Diverted 0 0 0 0 0 Rejected
Distance 32.28774 33.28389 26.69778 36.87651 1 Confirmed
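The decision rule behind the "Confirmed" and "Rejected" labels in Tables 2 and 3 can be illustrated with a minimal sketch of Steps (3) to (5) of Section 2.3. The code below is an assumption-laden simplification rather than the Boruta package itself: the helper names `z_score` and `boruta_decisions` are hypothetical, the importance values are made up for illustration and are not the paper's data, and a feature whose Z score merely equals Zmax is treated as unimportant because the text does not specify that case.

```python
from statistics import mean, stdev

def z_score(importances):
    """Formula (2): average importance divided by its standard deviation."""
    return mean(importances) / stdev(importances)

def boruta_decisions(original, shadow):
    """Compare each original feature's Z score with the best shadow Z score (Zmax)."""
    z_max = max(z_score(values) for values in shadow.values())
    return {
        name: "important" if z_score(values) > z_max else "unimportant"
        for name, values in original.items()
    }

# Hypothetical importances from three Random Forest runs (illustrative only).
original = {"CRS_dep_time": [46.0, 47.0, 48.0], "Diverted": [0.2, 1.8, 1.0]}
shadow = {"shadow_1": [1.0, 2.0, 3.0], "shadow_2": [0.5, 1.0, 1.5]}
decisions = boruta_decisions(original, shadow)
print(decisions)
```

In this toy input the strongest shadow feature sets the bar, so only a feature whose importance is consistently higher than shuffled noise is kept.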
the feature selection results, Diverted is removed in arrival prediction model training. In this experiment, the first-level learner contains five algorithms: Decision Tree, KNN, Logistic Regression, Gaussian Naive Bayes, and Random Forest. The second-level learner is Logistic Regression. The experiment results are shown in Figure 6.

In the departure dataset, when the fifth most important feature is given as input, Accuracy, Precision, Recall, and F1 Score exceed 0.8. When the sixth most important feature is given as input, the indexes show a slight decrease, but the overall trend is stable without significant increase or decrease. In other words, the last four features contribute little to the prediction model, which is consistent with the Boruta feature selection results. In the arrival dataset, when the fourth important feature is given as input, the evaluation indexes have no significant change. In the
arrival dataset, when the fourth feature is given as input, the evaluation indexes exceed 0.8 and tend to be stable. It is worth mentioning that with the increase in features, Recall changes from the highest to the lowest among the four indexes, while Precision changes from the lowest to the highest.

Figure 6: The prediction results with different features (panels: Departure and Arrival; x-axis: number of input features; series: Accuracy, Precision, Recall, and F1_Score).

4.2. Comparison between Algorithms. There is no “multipurpose algorithm” or “the greatest algorithm” in machine learning. It is necessary to attempt multiple algorithms. In this research, six algorithms are selected, including KNN, Random Forest, Logistic Regression, Decision Tree, Gaussian Naive Bayes, and Stacking, to train the same dataset, respectively. The experiment results are shown in Figure 7. In addition to Stacking, Random Forest also shows a great prediction result, with all four evaluation indexes exceeding 0.8. The difference among the four indexes of KNN is larger than for the other algorithms, but all of them reach 0.7. Meanwhile, Gaussian Naive Bayes and Logistic Regression have relatively poor performance, and their four indexes are around 0.6.

The ROC (receiver operating characteristic) curve can measure the generalization ability of an algorithm. The AUC (area under curve) is the area under the ROC curve [24]. The closer the AUC is to 1, the better the algorithm is. We output the ROC for each algorithm and calculate the AUC Score, and the results are shown in Figure 8. Stacking reaches 0.823 in the departure dataset and 0.821 in the arrival dataset. The result of Random Forest is similar to that of Stacking. With this result, we consider that Random Forest contributes more to Stacking compared with the other algorithms. However, if we remove Random Forest from the Stacking algorithm, will the performance of Stacking decrease? Conversely, if we remove the weakly performing algorithm Gaussian Naive Bayes, will the performance of Stacking increase? In section 4.3, we experiment to explore the impact of strong and weak algorithms on the performance of Stacking.

4.3. First-Level Learners Analyses. In the single algorithm comparison, we find that Random Forest has great performance, and Gaussian Naive Bayes and Logistic Regression perform poorly. In this section, one algorithm is removed in turn to figure out how strong or weak algorithms affect the Stacking prediction results. The results are shown in Tables 4 and 5. Overall, there is no significant difference between the six groups with different first-level learners. In both the departure dataset and the arrival dataset, the four evaluation indexes are similar among the six scenarios; only the Recall and F1 Score of the third scenario decrease below 0.8. The overall accuracy is shown in Figure 9. The prediction accuracy is around 0.8, which is close to the result of the full Stacking model. It can be concluded that the Stacking algorithm not only ensures the prediction accuracy but also maintains great stability. Random Forest has a strong performance, but when we remove Random Forest from the first-level learner, the model still achieves great prediction results. As we mentioned before, there is no “multipurpose algorithm” or “the greatest algorithm” in machine learning. Therefore, the Stacking algorithm could be a great solution to deal with algorithm selection, especially for enormous and complex datasets like flight datasets.
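The six scenarios of this experiment (the full set of first-level learners, then each learner removed in turn) can be enumerated as follows. This is a small sketch for orientation only: the function name `ablation_scenarios` is hypothetical, and the learner order is taken from the first row of Tables 4 and 5 so that the scenario numbers line up with those tables.

```python
FIRST_LEVEL = ["GNB", "RF", "KNN", "LR", "DT"]

def ablation_scenarios(learners):
    """Return the full learner set followed by each leave-one-out subset."""
    scenarios = [list(learners)]
    for removed in learners:
        scenarios.append([m for m in learners if m != removed])
    return scenarios

scenarios = ablation_scenarios(FIRST_LEVEL)
for number, subset in enumerate(scenarios, start=1):
    # Each subset would be used as the first-level learner of one Stacking model.
    print(number, ", ".join(subset))
```

Scenario 3 is the one without Random Forest, which is the only scenario whose Recall and F1 Score fall below 0.8.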
Figure 7: The prediction results of different algorithms (panels: Departure and Arrival; algorithms: Stacking, KNN, RF, GNB, LR, and DT; series: Accuracy, Precision, Recall, and F1_Score).
Figure 8: Receiver operating characteristic curve: (a) departure; (b) arrival (x-axis: False Positive Rate; y-axis: True Positive Rate).
(a) Departure AUC: Stacking 0.823, RF 0.816, DT 0.774, KNN 0.752, LR 0.649, GNB 0.644.
(b) Arrival AUC: Stacking 0.821, RF 0.815, DT 0.771, KNN 0.753, GNB 0.627, LR 0.625.
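As a reminder of what the AUC Score measures, the sketch below computes it directly as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, which is equivalent to the area under the ROC curve. The function name `auc_score` and the four toy labels and scores are hypothetical; the paper does not state which implementation it used.

```python
def auc_score(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: 1 = delayed, 0 = on time (illustrative values only).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_score(labels, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

The pairwise loop is quadratic in the number of samples, so for datasets of the size used here a sort-based implementation would be preferable.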
Table 4: The departure prediction results of different first-level learners.

Departure First-level learner Accuracy Precision Recall F1 Score
1 GNB, RF, KNN, LR, DT 0.822 0.830 0.812 0.821
2 RF, KNN, LR, DT 0.821 0.8277 0.812 0.820
3 GNB, KNN, LR, DT 0.800 0.805 0.784 0.794
4 GNB, RF, LR, DT 0.819 0.823 0.812 0.817
5 GNB, RF, KNN, DT 0.822 0.828 0.811 0.819
6 GNB, RF, KNN, LR 0.82 0.827 0.811 0.819
Table 5: The arrival prediction results of different first-level learners.

Arrival First-level learner Accuracy Precision Recall F1 Score
1 GNB, RF, KNN, LR, DT 0.822 0.832 0.808 0.82
2 RF, KNN, LR, DT 0.82 0.827 0.806 0.816
3 GNB, KNN, LR, DT 0.80 0.811 0.78 0.793
4 GNB, RF, LR, DT 0.818 0.824 0.804 0.814
5 GNB, RF, KNN, DT 0.82 0.828 0.805 0.816
6 GNB, RF, KNN, LR 0.818 0.825 0.804 0.814
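As a quick consistency check of formula (6), the F1 Score of the first scenario in Tables 4 and 5 can be recomputed from the reported Precision and Recall. The helper name `f1_score` is introduced here for illustration; the numbers are copied from the first row of each table.

```python
def f1_score(precision, recall):
    """Formula (6): harmonic mean of Precision and Recall."""
    return 2 * precision * recall / (precision + recall)

# (Precision, Recall, reported F1 Score) for scenario 1 in Tables 4 and 5.
rows = {
    "departure": (0.830, 0.812, 0.821),
    "arrival": (0.832, 0.808, 0.82),
}
recomputed = {name: round(f1_score(p, r), 3) for name, (p, r, _) in rows.items()}
for name, (_, _, reported) in rows.items():
    print(name, recomputed[name], reported)
```

Both recomputed values agree with the reported F1 Scores to three decimal places.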
Figure 9: The accuracy of different first-level learners (x-axis: scenarios 1 to 6; series: Departure and Arrival).

5. Conclusion

In this research, we propose a flight delay classification prediction method based on the Stacking algorithm. The SMOTE algorithm is introduced to process the imbalanced datasets, and the Boruta algorithm is utilized to select input features. The Logan International Airport flight data in 2019 are collected to carry out comparative experiments, and the Accuracy, Precision, Recall, and F1 Score are above 0.8. The main contributions are as follows:

(1) The Boruta algorithm is used to select features. Feature selection is an essential process when utilizing machine learning technology. According to section 4.1, the comparison experimental results are consistent with the Boruta algorithm feature selection results, which verifies the effectiveness of the Boruta algorithm. Nine feature importances are obtained based on the Random Forest classifier, and the experiments are designed to input different features into the model in the order of their importance value. In the departure dataset, all features have been confirmed, while Diverted has been rejected in the arrival dataset.
(2) A flight delay classification prediction method based on Stacking is proposed in this study. The first-level learner includes KNN, Random Forest, Logistic Regression, Decision Tree, and Gaussian Naive Bayes, and the second-level learner utilizes Logistic Regression. To distinguish the contribution of the five first-level learners, the same dataset is trained on each of these five first-level learners separately. The result shows that Random Forest has the best performance, which is similar to that of Stacking.
(3) The main aim of this study is to explore the stability of the Stacking algorithm. Stacking is a combination of different algorithms with different performances. In section 4.3, we design an experiment to verify how strong or weak learners affect the Stacking performance. The experiment result shows that whether strong learners or weak learners are removed, the overall accuracy of Stacking shows no obvious difference. Therefore, we believe that Stacking provides a reliable solution for algorithm selection in machine learning applications, especially for enormous and complex datasets like flight datasets.

In future research, other machine learning technologies can be utilized to study flight delay prediction. Moreover, close attention can be paid to the influence of weather on flight delay. In this research, we do not add exact weather-related features to the prediction model, but that does not mean weather influence is unimportant. On the contrary, we believe that studying the influence of weather on flight delays is a significant and complex issue. We will focus more on establishing reasonable features to measure the impact of weather on flight delays, especially for high-impact weather, and use machine learning correlation analysis technology to explore the relatedness between weather and flight delay.

Data Availability

The flight dataset used in this paper is from the Bureau of Transportation Statistics website (https://www.transtats.bts.gov/homepage.asp).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key R&D Program of China (No. 2018YFE0208700) and the National Natural Science Foundation of China (No. 52002177).

References

[1] Bureau of Transportation Statistics, “Bureau of Transportation Statistics,”.
[2] M. Ball, C. Barnhart, M. Dresner et al., “Total delay impact study,” 2010.
[3] E. Esmaeilzadeh and S. Mokhtarimousavi, “Machine learning approach for flight departure delay prediction and analysis,” Transportation Research Record: Journal of the Transportation Research Board, vol. 2674, no. 8, pp. 145–159, 2020.
[4] N. L. Kalyani, G. Jeshmitha, U. Bindu Sri Sai, M. Samanvitha, J. Mahesh, and B. V. Kiranmayee, “Machine learning model-based prediction of flight delay,” in Proceedings of the 2020 Fourth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Palladam, India, November 2020.
[5] B. Zhang and D. Ma, “Flight delay prediction at an airport using machine learning,” in Proceedings of the 2020 5th International Conference on Electromechanical Control Technology and Transportation, Nanchang, China, May 2020.
[6] H. Khaksar and A. Sheikholeslami, “Airline delay prediction by machine learning algorithms,” Scientia Iranica, vol. 1, p. 12, 2017.
[7] G. Rebala, A. Ravi, and S. Churiwala, An Introduction to Machine Learning, Springer International Publishing, New York, NY, USA, 2019.
[8] A. Onan, “On the performance of ensemble learning for automated diagnosis of breast cancer,” Advances in Intelligent Systems and Computing, vol. 347, pp. 119–129, 2015.
[9] R. Henriques and I. Feiteira, “Predictive modelling: flight delays and associated factors, Hartsfield-Jackson Atlanta International Airport,” Procedia Computer Science, vol. 138, pp. 638–645, 2018.
[10] S. Choi, Y. J. Kim, B. Simon, and D. Mavris, “Prediction of weather-induced airline delays based on machine learning algorithms,” in Proceedings of the 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), Sacramento, CA, USA, December 2016.
[11] P. Stefanovič, R. Štrimaitis, and O. Kurasova, “Prediction of flight time deviation for Lithuanian airports using supervised machine learning model,” Computational Intelligence and Neuroscience, vol. 2020, Article ID 8878681, 10 pages, 2020.
[12] G. Gui, F. Liu, J. Sun, J. Yang, Z. Zhou, and D. Zhao, “Flight delay prediction based on aviation big data and machine learning,” IEEE Transactions on Vehicular Technology, vol. 69, no. 1, pp. 140–150, 2020.
[13] N. Chakrabarty, “A data mining approach to flight arrival delay prediction for American Airlines,” 2019, https://arxiv.org/abs/1903.06740.
[14] A. Onan and S. Korukoglu, “A feature selection model based on genetic rank aggregation for text sentiment classification,” Journal of Information Science, vol. 99, pp. 1103–1107, 2015.
[15] Y. J. Kim, C. Sun, S. Briceno, and D. Mavris, “A deep learning approach to flight delay prediction,” in Proceedings of the Digital Avionics Systems Conference 2016, Sacramento, CA, USA, December 2016.
[16] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms, Taylor & Francis, Oxfordshire, UK, 2012.
[17] G. Zhong, T. Yin, L. Li, J. Zhang, H. Zhang, and B. Ran, “IEEE intelligent transportation systems magazine,” IEEE Intelligent Transportation Systems Magazine, vol. 99, 2020.
[18] J. D. Rodriguez, A. Perez, and J. A. Lozano, “Sensitivity analysis of K-fold cross validation in prediction error estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 3, pp. 569–575, 2010.
[19] H. Patel, D. S. Rajput, G. T. Reddy, C. Iwendi, and O. Jo, “A review on classification of imbalanced data for wireless sensor networks,” International Journal of Distributed Sensor Networks, vol. 16, no. 4, Article ID 812444408, 2020.
[20] A. Behzad Mirzaei, A. B. Bahareh Nikpour, and A. Nezamabadi Pour, “A clustering and density-based hybrid approach for imbalanced data classification,” Expert Systems with Applications, vol. 164, 2020.
[21] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, no. 1, pp. 321–357, 2002.
[22] Q. Al-Tashi, H. Md Rais, S. Mirjalili, and H. Alhussian, A Review of Grey Wolf Optimizer-Based Feature Selection Methods for Classification, UTP Universiti Teknologi PETRONAS, Seri Iskandar, Malaysia, 2020.
[23] S. A. Alvarez, An Exact Analytical Relation Among Recall, Precision, and Classification Accuracy in Information Retrieval, Boston College, Newton, MA, USA, 2002.
[24] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2005.

