Example On Flight Delay Data
Research Article
Flight Delay Classification Prediction Based on
Stacking Algorithm
Jia Yi,1 Honghai Zhang,1 Hao Liu,2 Gang Zhong,1 and Guiyi Li1
1 College of Civil Aviation, Nanjing University of Aeronautics & Astronautics, Nanjing 211106, China
2 College of Science, Nanjing University of Aeronautics & Astronautics, Nanjing 211106, China
Received 2 June 2021; Revised 19 July 2021; Accepted 11 August 2021; Published 18 August 2021
Copyright © 2021 Jia Yi et al. This is an open access article distributed under the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
With the development of civil aviation, the number of flights keeps increasing, and flight delay has become a serious and increasingly routine problem. This paper aims to show that the Stacking algorithm has advantages in airport flight delay prediction, especially with respect to the algorithm selection problem in machine learning. In this research, the principle of the Stacking classification algorithm is introduced, the SMOTE algorithm is selected to process imbalanced datasets, and the Boruta algorithm is utilized for feature selection. The first-level learner of Stacking contains five supervised machine learning algorithms: KNN, Random Forest, Logistic Regression, Decision Tree, and Gaussian Naive Bayes. The second-level learner is Logistic Regression. To verify the effectiveness of the proposed method, comparative experiments are carried out on Boston Logan International Airport flight datasets from January to December 2019. Multiple indexes are used to comprehensively evaluate the prediction results, namely Accuracy, Precision, Recall, F1 Score, ROC curve, and AUC Score. The results show that the Stacking algorithm not only improves the prediction accuracy but also maintains great stability.
independent variables was conducted, and weather factors, airport scene operation, demand, and other factors were comprehensively considered. This research provided a new idea for studying the causes of flight delay [3]. Kalyani et al. proposed a flight arrival delay classification model based on XGBoost and a flight arrival delay regression model based on linear regression. As one of the most widely used algorithms in the machine learning field, linear regression has the advantages of a simple principle and easy application, and XGBoost is an ensemble learning algorithm based on the Decision Tree, which can find the optimal result by constantly adjusting the hyperparameters [4]. Zhang and Ma established a flight delay prediction model based on the CatBoost algorithm, and the prediction accuracy reached 0.77. The SHAP value was used to analyze the contribution of each feature [5]. Khaksar and Sheikholeslami developed a hybrid method combining the J48 Decision Tree with K-means to train flight datasets from the United States and Iran, respectively, compared it with four algorithms, and obtained the optimal results with the hybrid method [6].

When utilizing machine learning techniques, most scholars use multiple machine learning algorithms to train the same datasets and identify the optimal algorithm and the optimal prediction result by comparing evaluation indexes [7, 8]. Moreover, with the development of machine learning technology, the variety of algorithms is increasing, and most scholars tend to use at least three algorithms in one study. Henriques and Feiteira presented a classification model for Hartsfield-Jackson International Airport which utilized Decision Tree, Random Forest, and Multilayer Perceptron. The Multilayer Perceptron provided the highest accuracy [9]. Choi et al. tried two supervised learning algorithms, Decision Tree and KNN, and two ensemble learning algorithms, Random Forest and AdaBoost, and the results showed that the ensemble classifiers outperformed the single-algorithm classifiers [10]. Stefanovič et al. took Lithuanian airport flight delay datasets as the research object and selected seven machine learning algorithms including probabilistic neural network, multilayer perceptron neural network, Gradient-Boosted Tree, and Decision Tree; the Gradient-Boosted Tree obtained the optimal results [11]. The above studies are inspirational, but most of them obtain one optimal model through model comparison while the other models are discarded, which wastes computing power. In addition, flight datasets are enormous and versatile, and the stability of the algorithm is significant for real-world applications. However, most studies did not pay attention to algorithm stability, especially for some novel algorithms. In this study, we build a flight delay prediction classification model based on Stacking and design experiments to verify the stability of Stacking.

Flight delay prediction methods based on machine learning technology are gradually maturing. However, one core process that is often neglected in previous studies is feature selection [12]. Feature selection is an essential step in machine learning [13]. The main purpose of feature selection is to remove redundant features and improve model efficiency by calculating feature importance. Onan and Korukoglu presented a feature selection model based on the ensemble method. The experimental results show that the proposed method not only effectively processed the complex features but also improved the classification accuracy [14]. In addition, considering weather information could effectively improve the prediction accuracy [15], but the exact weather information might not be available until a few hours before the flight. Therefore, weather features are not included in this research for the time being. The rest of this paper is organized as follows. Section 2 elaborates the research methods and principles used in this study, including the Stacking classification algorithm, the SMOTE algorithm, the Boruta algorithm, and several evaluation indexes. Section 3 describes the data sources and the data preprocessing method. Section 4 discusses comparative experiments and comprehensively evaluates the prediction results through Accuracy, Precision, Recall, F1 Score, ROC curve, and AUC Score. In Section 5, the conclusions and expectations of this research are discussed.

2. Methodologies

2.1. Stacking Classification Methods. Stacking methods are derived from the idea of ensemble learning based on combinations of learners [16]. A Stacking learner usually contains two levels: the first-level learner consists of multiple base learners trained on the same datasets, and their predicted outputs become a new dataset that is passed to the second-level learner [17]. To avoid overfitting, cross-validation can be used when training the first-level learner, and we select the k-fold cross-validation method in this paper [18]. The main process of Stacking methods is shown in Figure 1.

The initial datasets are divided into a training dataset Dta and a testing dataset Dts, and then the training dataset Dta is divided into k subdatasets, Dta1, Dta2, ..., Dtak. In the k-fold cross-validation method, each of the i models is trained k times; each subdataset becomes the test dataset in turn, and the other subdatasets serve as training datasets. In each model, the k prediction results are combined to form new training subdatasets Tir (r = 1, 2, ..., k), and the Tir (r = 1, 2, ..., k) together form a new training dataset Nta, which is brought into the second-level learner.

When k-fold cross-validation is carried out in the first-level learner, every time Model i is trained on the training dataset Dta, the testing dataset Dts is predicted as well. Therefore, k prediction results Rik for the same testing dataset Dts are obtained. When solving a regression problem, the averaging method is usually adopted to process the k prediction results. In a classification problem, the processing of the prediction results is shown in Figure 2.

In machine learning, a binary classifier first outputs the probability values of the positive and negative classes. The category corresponding to the higher probability value is the category of the data sample, and the probability values sum to 1. In Stacking classification, model i predicts that the
Journal of Advanced Transportation 3
[Figure 1. Diagram of the Stacking process. Level 1: Model 1, Model 2, ..., Model i, each trained and tested on Data Subsets 1 to k (Dta1, ..., Dtak); the testing results T11, ..., Tik form Nta, and the results Ri1, Ri2, ..., Rik give R1, R2, ..., Ri, which form Nts. Level 2: training model on Nta and testing model on Nts.]
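As an illustration of the procedure in Section 2.1, the following Python sketch builds the second-level datasets Nta and Nts from out-of-fold predictions. It is a minimal sketch under stated assumptions, not the implementation used in this paper: the function names (`k_fold_indices`, `build_level2_data`) and the two toy learners are hypothetical, each learner is assumed to return positive-class probabilities, and averaging the k test-set predictions Ri1, ..., Rik is an assumption borrowed from the regression case, since the classification rule of Figure 2 is not reproduced here.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[r::k] for r in range(k)]

def build_level2_data(learners, X_ta, y_ta, X_ts, k=5):
    """Return (N_ta, N_ts) with one column of probabilities per learner.

    learners: list of fit functions; fit(X, y) returns predict(rows) -> list of floats.
    """
    folds = k_fold_indices(len(X_ta), k)
    n_ta = [[0.0] * len(learners) for _ in X_ta]
    n_ts = [[0.0] * len(learners) for _ in X_ts]
    for i, fit in enumerate(learners):
        for held_out in folds:
            held = set(held_out)
            train = [j for j in range(len(X_ta)) if j not in held]
            predict = fit([X_ta[j] for j in train], [y_ta[j] for j in train])
            # T_ir: predictions on the held-out subset fill column i of N_ta.
            for j, p in zip(held_out, predict([X_ta[j] for j in held_out])):
                n_ta[j][i] = p
            # R_ir: the same fold model predicts D_ts; average over the k folds (assumption).
            for row, p in zip(n_ts, predict(X_ts)):
                row[i] += p / k
    return n_ta, n_ts

def fit_centroid(X, y):
    """Toy learner (hypothetical): relative closeness to the positive-class centroid."""
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    c0, c1 = centroid(0), centroid(1)
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    return lambda rows: [dist(r, c0) / (dist(r, c0) + dist(r, c1) + 1e-12) for r in rows]

def fit_prior(X, y):
    """Toy learner (hypothetical): always predicts the training share of positives."""
    rate = sum(y) / len(y)
    return lambda rows: [rate for _ in rows]

# Small synthetic example: 10 negative rows followed by 10 positive rows.
X = [[float(v), float(v % 3)] for v in range(20)]
y = [0] * 10 + [1] * 10
X_ts = [[2.0, 1.0], [17.0, 0.0]]
N_ta, N_ts = build_level2_data([fit_centroid, fit_prior], X, y, X_ts, k=5)
```

The columns of `N_ta`, together with the training labels, would then train the second-level learner (Logistic Regression in this paper), and `N_ts` would be passed to it for the final prediction.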
$$x_n = x + \mathrm{rand}(0, 1) \times \left| x - x_i \right|. \tag{1}$$
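Equation (1) can be illustrated with a short Python sketch. It is written under stated assumptions rather than taken from this paper: x_i is drawn from the k nearest minority-class neighbours of x, as in the original SMOTE description [21], and the difference term is taken as (x_i − x) so that each new sample x_n lies on the segment between x and x_i. The function name `smote_samples` and its parameters are hypothetical.

```python
import random

def smote_samples(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic rows from a list of minority-class rows."""
    rng = random.Random(seed)

    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x, excluding x itself.
        others = [m for m in minority if m is not x]
        neighbours = sorted(others, key=lambda m: dist(x, m))[:k]
        x_i = rng.choice(neighbours)
        gap = rng.random()  # rand(0, 1)
        # Assumed interpolation form: x_n = x + rand(0, 1) * (x_i - x).
        synthetic.append([u + gap * (v - u) for u, v in zip(x, x_i)])
    return synthetic

# Five minority-class rows with two features each (made-up values).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 1.9]]
new_rows = smote_samples(minority, n_new=4)
```

Because every synthetic row is a convex combination of two existing minority rows, it stays inside the range already covered by the minority class.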
2.3. Features Selection. Feature selection is one of the core contents of machine learning, which aims to eliminate redundant features, improve model accuracy, and reduce operation time. The commonly used feature selection methods include Filter, Wrapper, and Embedded [22]. The Boruta algorithm is utilized in this research to select features. Boruta is a wrapper feature selection algorithm based on Random Forest. The importance of each feature to the dependent variable is calculated to determine whether the feature is retained. The main process of the Boruta algorithm is as follows:

(1) Establish shadow features: the values of the original features are randomly shuffled to form a shadow feature matrix, and the new feature matrix is obtained by splicing the shadow feature matrix with the original feature matrix.
(2) The new feature matrix is fed into a Random Forest classifier for training, which outputs the feature importances v.
(3) The Z score of each original feature and shadow feature is calculated as follows:

$$z_{\mathrm{score}} = \frac{A_v}{S_v}, \tag{2}$$

where A_v represents the average value of feature importance and S_v represents the standard deviation of feature importance.
(4) The maximum zscore among the shadow features is found and denoted as Zmax.
(5) If the zscore of an original feature is greater than Zmax, the feature is recorded as “important.” On the contrary, if the zscore of an original feature is less than Zmax, the feature is marked as “unimportant” and deleted.
(6) Steps (1) to (5) are repeated until all features have been marked.

2.4. Evaluation Indexes. In this paper, Accuracy, Precision, Recall, and F1 Score are calculated from the output confusion matrix to evaluate the prediction results. The confusion matrix is shown in Figure 3 [23].

                    Predicted Positive    Predicted Negative
True Positive               TP                    FN
True Negative               FP                    TN

Figure 3: Confusion matrix.

TP is True Positive, indicating that both the true value and the predicted value are positive, that is, the number of positive samples predicted correctly. FP is False Positive, indicating that the true value is negative, but the predicted value is positive, that is, the number of negative samples that are wrongly predicted to be positive. TN is True Negative, indicating that both the true value and the predicted value are negative, that is, the number of negative samples that are correctly predicted. FN is False Negative, indicating that the true value is positive, but the predicted value is negative, that is, the number of positive samples that are wrongly predicted to be negative.

Accuracy is the ratio of correctly predicted samples to the total number of samples, and its calculation formula is as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \cdot 100\%. \tag{3}$$

Accuracy is one of the most used evaluation indexes in classification. However, the flight delay data are imbalanced; that is, the number of on-time flights is much larger than the number of delayed flights. To improve accuracy, the model tends to identify minority samples as the majority class, so it can obtain higher accuracy while its prediction of delayed samples is almost ineffective. Therefore, the predicted results also need to be evaluated by Precision, Recall, and F1 Score in the classification problem.

Precision indicates the percentage of correct predictions among the samples with a positive predicted value. The calculation formula is as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \cdot 100\%. \tag{4}$$

Recall indicates the percentage of correct predictions among the samples with a positive true value. The calculation formula is as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \cdot 100\%. \tag{5}$$

Precision and Recall tend to constrain each other: in general, when Precision increases, Recall decreases, and when Recall increases, Precision decreases. In this paper, Precision focuses on how many of the flights predicted as delayed were actually delayed, while Recall focuses on how many of all delayed flights were successfully predicted. Moreover, the F1 Score, as the harmonic average of Precision and Recall, considers both. The calculation formula is as follows:

$$F1\ \mathrm{score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \tag{6}$$

3. Data Acquisition and Preprocessing

3.1. Data Sources. In this research, we collect flight data from January to December 2019 at Logan International Airport in Boston, Massachusetts, the United States. The total number of departure flight records is 149,576, and the total number of arrival flight records is 149,338. The Logan Airport is one
[Figure 4 plot: two bar-chart panels, Departure and Arrival; x-axis: MONTH (1 to 12); y-axis ticks from 0 to 20000; series: Delay, On Time. Bar labels range from 1,772 to 3,498 and from 10,439 to 13,605 in the Departure panel, and from 1,830 to 3,731 and from 10,447 to 13,605 in the Arrival panel.]
Figure 4: Monthly number of delayed flights and on-time flights.
[Feature importance plot: panels (a) Departure and (b) Arrival; y-axis: Importance (ticks from -10 to 60). x-axis labels as extracted, (a): shadowMin, shadowMean, DIVERTED, DISTANCE, CRS_ELAPSED_TIME, DAY_OF_WEEK, CRS_ARR_TIME, DAY_OF_MONTH, MONTH, CRS_DEP_TIME; (b): shadowMin, shadowMean, DIVERTED, DISTANCE, CRS_ARR_TIME, CRS_ELAPSED_TIME, DAY_OF_WEEK, MONTH, DAY_OF_MONTH, CRS_DEP_TIME. The labels shadowMax and QUARTER also appear in both panels.]
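The shadow-feature comparison behind these importance plots (Section 2.3, steps (1) to (5)) can be sketched in a few lines of Python. This is only an illustration under loudly stated assumptions: the absolute Pearson correlation computed on bootstrap resamples stands in for the Random Forest importance that Boruta actually uses, only one pass is performed (step (6) is omitted), and the names `abs_corr` and `boruta_pass` as well as the synthetic data are hypothetical.

```python
import random
import statistics

def abs_corr(xs, ys):
    """Absolute Pearson correlation, a stand-in for Random Forest importance."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return abs(sxy) / ((sxx * syy) ** 0.5 + 1e-12)

def boruta_pass(columns, y, runs=30, seed=0):
    """One pass of steps (1) to (5); returns {feature: 'important' | 'unimportant'}."""
    rng = random.Random(seed)
    n = len(y)
    # Step (1): shadow features are shuffled copies of the original columns.
    shadows = {"shadow_" + name: rng.sample(col, n) for name, col in columns.items()}
    extended = {**columns, **shadows}
    # Step (2): collect importances v over repeated bootstrap resamples.
    scores = {name: [] for name in extended}
    for _ in range(runs):
        rows = [rng.randrange(n) for _ in range(n)]
        y_b = [y[r] for r in rows]
        for name, col in extended.items():
            scores[name].append(abs_corr([col[r] for r in rows], y_b))
    # Step (3): zscore = A_v / S_v (mean over standard deviation of importance).
    z = {name: statistics.fmean(v) / (statistics.stdev(v) + 1e-12)
         for name, v in scores.items()}
    # Steps (4) and (5): compare each original feature with Z_max of the shadows.
    z_max = max(z[name] for name in shadows)
    return {name: "important" if z[name] > z_max else "unimportant" for name in columns}

# Synthetic data: "signal" tracks the label closely, "noise" is unrelated to it.
data_rng = random.Random(1)
y = [i % 2 for i in range(200)]
columns = {
    "signal": [t + data_rng.gauss(0, 0.1) for t in y],
    "noise": [data_rng.random() for _ in y],
}
marks = boruta_pass(columns, y)
```

In this toy setting the `signal` column is marked "important" because its Z score is far above that of any shuffled copy; the mark of the `noise` column depends on the random draw.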
the feature selection results, the DIVERTED feature is removed in arrival prediction model training. In this experiment, the first-level learner contains five algorithms: Decision Tree, KNN, Logistic Regression, Gaussian Naive Bayes, and Random Forest. The second-level learner is Logistic Regression. The experiment results are shown in Figure 6.

In the departure dataset, when the fifth important feature is given as input, Accuracy, Precision, Recall, and F1 Score exceed 0.8. When the sixth important feature is given as input, the indexes show a slight decrease, but the overall trend is stable without significant increase or decrease. In other words, the last four features contribute little to the prediction model, which is consistent with the Boruta feature selection results. In the arrival dataset, when the fourth important feature is given as input, the evaluation indexes have no significant change. In the arrival dataset, when the fourth feature is given as input, the evaluation indexes exceed 0.8 and tend to be stable. It is worth mentioning that with the increase in features, Recall changes from the highest to the lowest among the four indexes, while Precision changes from the lowest to the highest.

[Figure 6 plot: two line-chart panels, Departure (x-axis ticks 1 to 9, y-axis ticks 0.60 to 0.90) and Arrival (x-axis ticks 1 to 8, y-axis ticks 0.50 to 0.90); series: Accuracy, Precision, Recall, F1_Score.]
Figure 6: The prediction results with different features.

algorithms. However, if we remove Random Forest from the Stacking algorithm, will the performance of Stacking decrease? Conversely, if we remove the weakly performing algorithm Gaussian Naive Bayes, will the performance of Stacking increase? In Section 4.3, we experiment to explore the impact of strong and weak algorithms on the performance of Stacking.
[Figure 7 plot: two bar-chart panels, Departure and Arrival; x-axis: Stacking, KNN, RF, GNB, LR, DT; y-axis ticks 0.4 to 1.0; series: Accuracy, Precision, Recall, F1_Score. Bar labels range from 0.638 to 0.829 in the Departure panel and from 0.625 to 0.832 in the Arrival panel.]
Figure 7: The prediction results of different algorithms.
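The four indexes compared in Figure 7 follow directly from the confusion matrix counts defined in Section 2.4. The Python sketch below shows the calculation on made-up labels; the function name `classification_indexes` and the example data are hypothetical, and the indexes are returned as fractions between 0 and 1 (as plotted in the figures) rather than multiplied by 100%.

```python
def classification_indexes(y_true, y_pred, positive=1):
    """Return Accuracy, Precision, Recall, and F1 Score from two label lists."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1  # delayed flight predicted as delayed
            else:
                fp += 1  # on-time flight predicted as delayed
        else:
            if t == positive:
                fn += 1  # delayed flight predicted as on time
            else:
                tn += 1  # on-time flight predicted as on time
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Imbalanced toy labels: 3 delayed flights (1) and 7 on-time flights (0).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# A model that always predicts "on time" still reaches 70% accuracy.
always_on_time = classification_indexes(y_true, [0] * 10)

# A model that finds two of the three delays and raises one false alarm.
mixed = classification_indexes(y_true, [1, 1, 0, 1, 0, 0, 0, 0, 0, 0])
```

The first case shows why Accuracy alone is misleading on imbalanced data: Recall and F1 Score are both zero even though Accuracy is 0.7.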
[ROC curve plot: panels (a) Departure and (b) Arrival; x-axis: False Positive Rate; y-axis: True Positive Rate. AUC values from the legends:]

Algorithm    Departure AUC    Arrival AUC
Stacking         0.823           0.821
KNN              0.752           0.753
GNB              0.644           0.627
LR               0.649           0.625
DT               0.774           0.771
RF               0.816           0.815
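For readers who want to relate the AUC values to the ROC curves, the following Python sketch computes both from predicted scores. It is a generic illustration rather than the procedure used in this paper: the function names `roc_points` and `auc` are hypothetical, the scores are assumed to be positive-class probabilities, and the area is obtained with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """Return (False Positive Rate, True Positive Rate) points, one per distinct score."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    pairs = sorted(zip(scores, y_true), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(pairs):
        tp += label
        fp += 1 - label
        # Emit a point only after the last sample that shares this score.
        if i + 1 == len(pairs) or pairs[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Four made-up samples: labels (1 = delayed) and predicted delay probabilities.
example_auc = auc(roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
perfect_auc = auc(roc_points([0, 1], [0.2, 0.9]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of delayed and on-time flights.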
(3) The main aim of this study is to explore the stability of the Stacking algorithm. Stacking is a combination of different algorithms with different performances.

[Bar chart: y-axis: Accuracy; bar labels range from 0.800 to 0.822.]
[3] E. Esmaeilzadeh and S. Mokhtarimousavi, “Machine learning approach for flight departure delay prediction and analysis,” Transportation Research Record: Journal of the Transportation Research Board, vol. 2674, no. 8, pp. 145–159, 2020.
[4] N. L. Kalyani, G. Jeshmitha, U. Bindu Sri Sai, M. Samanvitha, J. Mahesh, and B. V. Kiranmayee, “Machine learning model-based prediction of flight delay,” in Proceedings of the 2020 Fourth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Palladam, India, November 2020.
[5] B. Zhang and D. Ma, “Flight delay prediciton at an airport using maching learning,” in Proceedings of the 2020 5th International Conference on Electromechanical Control Technology and Transportation, Nanchang, China, May 2020.
[6] H. Khaksar and A. Sheikholeslami, “Airline delay prediction by machine learning algorithms,” Scientia Iranica, vol. 1, p. 12, 2017.
[7] G. Rebala, A. Ravi, and S. Churiwala, An Introduction to Machine Learning, Springer International Publishing, New York, NY, USA, 2019.
[8] A. Onan, “On the performance of ensemble learning for automated diagnosis of breast cancer,” Advances in Intelligent Systems and Computing, vol. 347, pp. 119–129, 2015.
[9] R. Henriques and I. Feiteira, “Predictive modelling: flight delays and associated factors, hartsfield-jackson atlanta international airport,” Procedia Computer Science, vol. 138, pp. 638–645, 2018.
[10] S. Choi, Y. J. Kim, B. Simon, and D. Mavris, “Prediction of weather-induced airline delays based on machine learning algorithms,” in Proceedings of the 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), Sacramento, CA, USA, December 2016.
[11] P. Stefanovič, R. Štrimaitis, and O. Kurasova, “Prediction of flight time deviation for Lithuanian airports using supervised machine learning model,” Computational Intelligence and Neuroscience, vol. 2020, Article ID 8878681, 10 pages, 2020.
[12] G. Gui, F. Liu, J. Sun, J. Yang, Z. Zhou, and D. Zhao, “Flight delay prediction based on aviation big data and machine learning,” IEEE Transactions on Vehicular Technology, vol. 69, no. 1, pp. 140–150, 2020.
[13] N. Chakrabarty, “A data mining approach to flight arrival delay prediction for american airlines,” 2019, https://arxiv.org/abs/1903.06740.
[14] A. Onan and S. Korukoglu, “A feature selection model based on genetic rank aggregation for text sentiment classification,” Journal of Information Science, vol. 99, pp. 1103–1107, 2015.
[15] Y. J. Kim, C. Sun, S. Briceno, and D. Mavris, “A deep learning approach to flight delay prediction,” in Proceedings of the Digital Avionics Systems Conference 2016, Sacramento, CA, USA, December 2016.
[16] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms, Taylor & Francis, Oxfordshire, UK, 2012.
[17] G. Zhong, T. Yin, L. Li, J. Zhang, H. Zhang, and B. Ran, “IEEE intelligent transportation systems magazine,” IEEE Intelligent Transportation Systems Magazine, vol. 99, 2020.
[18] J. D. Rodriguez, A. Perez, and J. A. Lozano, “Sensitivity analysis of K-fold cross validation in prediction error estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 3, pp. 569–575, 2010.
[19] H. Patel, D. S. Rajput, G. T. Reddy, C. Iwendi, and O. Jo, “A review on classification of imbalanced data for wireless sensor networks,” International Journal of Distributed Sensor Networks, vol. 16, no. 4, Article ID 812444408, 2020.
[20] A. Behzad Mirzaei, A. B. Bahareh Nikpour, and A. Nezamabadi Pour, “A clustering and density-based hybrid approach for imbalanced data classification,” Expert Systems with Applications, vol. 164, 2020.
[21] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, no. 1, pp. 321–357, 2002.
[22] Q. Al-Tashi, H. Md Rais, S. Mirjalili, and H. Alhussian, A Review of Grey Wolf Optimizer-Based Feature Selection Methods for Classification, UTP Universiti Teknologi PETRONAS, Seri Iskandar, Malaysia, 2020.
[23] S. A. Alvarez, An Exact Analytical Relation Among Recall, Precision, and Classification Accuracy in Information Retrieval, Boston College, Newton, MA, USA, 2002.
[24] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2005.