1 Introduction

The exponential growth of personal vehicles (cars and two-wheelers), combined with an increase in the number of trips and in trip lengths, has resulted in acute traffic congestion in most metropolitan cities around the world. In recent years, the focus of congestion reduction has shifted from infrastructure- and capital-intensive transportation strategies to more balanced and sustainable solutions such as Intelligent Transportation Systems (ITS). Traffic forecasting, the process of predicting traffic conditions in the short-term or near-term future from current and past traffic observations, is an important component of any ITS application. Short-term traffic flow forecasting, which involves predicting the traffic volume in the next time interval, usually in the range of 5 min to 1 h, is one of the important research problems in the field of ITS and has been addressed by many researchers over the last two decades. Traffic flow, the number of vehicles crossing a particular point per unit time, is a point process; in other words, it is a type of random process that consists of a set of isolated points collected over time [1]. To model such point processes, data-driven approaches based on statistical techniques are usually employed to identify the stochasticity in the observed data [2]. In general, the statistical techniques used for traffic flow prediction can be classified as nonparametric or parametric [3]. The nonparametric techniques include nonparametric regression [4] and neural networks [5–16]. The parametric techniques include linear and nonlinear regression, historical average algorithms [6], smoothing techniques [6, 11, 17], and autoregressive linear processes [3, 7, 11, 17–26]. 
It is reported that time series techniques such as the autoregressive integrated moving average (ARIMA) model are among the most precise methods for predicting traffic flow when compared with the other techniques mentioned above [27]. Time series models try to identify the pattern in past data by decomposing it into long-term trends and seasonal patterns, and then extrapolate that pattern into the future. Since traffic flow exhibits a strong seasonal pattern, owing to peak and off-peak conditions that repeat at more or less the same time every day, seasonal ARIMA (SARIMA) models are considered particularly relevant for modelling traffic flow behavior [3, 23, 24, 26, 27]. In many studies, the SARIMA model was found to perform better than models based on random walk, linear regression, support vector regression (SVR), historical average, and simple ARIMA [23, 24, 26, 28]. Smith et al. [29] reported that the best-performing k-NN forecast models (nonparametric) did not reach the predictive performance of SARIMA (parametric).

Reported studies on the use of SARIMA models for flow prediction mainly suffer from one drawback: they used a huge historical database for model development. For example, Smith et al. [29] used the previous 45 days of 15 min flow observations to forecast the next day’s traffic flow. More than 2 months of traffic volume observations were used by Williams and Hoel [23], and around 60,000 flow observations, aggregated over 3 min intervals and spanning a period of 106 days, were used by Stathopoulos and Karlaftis [30]. Ghosh et al. [24] used 20 days of 15 min flow data, a total of 1920 observations. Mai et al. [27] used 15 min aggregated traffic volume observations over a period of 26 days to fit a SARIMA-based traffic flow prediction model. Dong et al. [25] used 2 months of flow observations aggregated to 5 min intervals as input to an ARIMA model for predicting the flow on the test day of interest. Lippi et al. [26] used 4 months of flow data from loop detectors placed around nine districts of California for model development using SARIMA. Tan et al. [11] used a time series of traffic flow collected over several years for model development using ARIMA. The use of such a huge database for model building may restrict the application of these models in places where data availability is an issue. The storage and maintenance of historical databases can also be difficult. Thus, it would be ideal if a SARIMA model could be developed for predicting flow that needs only limited input data for model development. The present study is an attempt in this direction: only the previous 3 days of flow observations, aggregated to the required time interval, are used in a SARIMA-based prediction scheme to predict the next day’s flow values (24 h ahead forecast) with the desired accuracy. 
Using the previous 3 days of flow data as input can capture the peak and off-peak traffic conditions, which repeat at more or less the same time every day. Short-term prediction of traffic flow during the morning and evening peak periods was also attempted using both historic data (the previous 3 days of flow) and real-time data on the day of interest.

The following section gives the details of the selected study stretch and of the data collection and extraction techniques used for developing and corroborating the prediction scheme. Section 3 explains the step-by-step development of the proposed SARIMA-based traffic flow prediction scheme, which uses only the previous 3 days of flow data as input. The corroboration of the prediction scheme using actual field data is explained in Section 4. Short-term prediction of traffic flow using both historic and real-time data is presented in Section 5, followed by concluding remarks in Section 6.

2 Data collection and extraction

The study stretch considered for the present study is on Rajiv Gandhi Road in Chennai, India. The selected road is one of the busy arterial roads in Chennai and is also known as the IT corridor or Old Mahabalipuram Road. More than 30,000 vehicles use this road daily. It is a six-lane roadway with three lanes in each direction. For the present study, only one direction of traffic was considered. An automated traffic sensor, the Collect-R camera [31], permanently fixed at one location on the selected study stretch, was used to obtain the vehicular flow data required for model development and corroboration of the prediction scheme. Flow data from three consecutive days (September 20, 21 and 22, 2012) were collected from the Collect-R camera and used for model development. The flow data for September 23, 2012 were used for model validation. The raw data from the automated traffic sensor contained class-wise traffic flow for each 1 min interval over the entire 24 h from 12 midnight to 12 midnight. Since the prediction scheme is based on time series analysis, which requires a series of discrete observations collected over time, the input could be either class-wise traffic flow or total vehicular flow aggregated into any desired uniform time interval. For the present study, the total number of vehicles aggregated into 10 min intervals was considered as input. However, the proposed prediction scheme could be extended to any desired time interval, and to class-wise vehicular flow as input. Hence, the data extraction involved summing the class-wise traffic flow in each 1 min interval and then aggregating the totals into 10 min intervals. The observed flow in each 10 min interval was then converted to vehicles per hour. Thus, for each day, 144 flow values were available (24 h × 6 data points/hr) as input to the prediction scheme. 
The same process was repeated for all 4 days (three consecutive days for model development and the next consecutive day for model validation) to obtain the total number of vehicles in each 10 min interval.
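The extraction steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (one dictionary of class-wise counts per minute) and the class names are hypothetical, and the counts are synthetic rather than Collect-R output.

```python
from typing import Dict, List

MINUTES_PER_DAY = 24 * 60
BIN_MINUTES = 10  # aggregation interval used in the study


def aggregate_day(classwise_counts: List[Dict[str, int]],
                  bin_minutes: int = BIN_MINUTES) -> List[float]:
    """Turn one day of 1 min class-wise counts into flow (veh/hr) per bin."""
    if len(classwise_counts) != MINUTES_PER_DAY:
        raise ValueError("expected 1440 one-minute records for a full day")
    # Step 1: total vehicles in each minute = sum over vehicle classes.
    per_minute = [sum(record.values()) for record in classwise_counts]
    # Step 2: total vehicles in each bin; Step 3: scale the count to veh/hr.
    scale = 60 / bin_minutes
    flows = []
    for start in range(0, MINUTES_PER_DAY, bin_minutes):
        flows.append(sum(per_minute[start:start + bin_minutes]) * scale)
    return flows


# Synthetic day: the class names below are hypothetical placeholders.
day = [{"car": 10, "two_wheeler": 15, "bus": 1} for _ in range(MINUTES_PER_DAY)]
flows = aggregate_day(day)  # 144 values, one per 10 min interval
```

With a 10 min bin, the scaling factor is 6, so a bin containing 260 vehicles corresponds to 1560 veh/hr.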

3 Development of prediction scheme using SARIMA

The development of the proposed scheme for traffic flow prediction using SARIMA involved four steps: model identification, model estimation, diagnostic checking, and forecasting/validation of the developed model. The first three steps are explained in this section. The last step, model validation, is explained in Section 4.

3.1 Model identification

The first step in model identification is to plot the time series data and examine it for features such as trend and seasonality. If there is an upward or downward linear trend and no obvious seasonality, a first-order difference is needed to make the series stationary. If there is a curved trend, a logarithmic transformation may be required before differencing. If there is seasonality and no trend, differencing at the lag specified by the seasonal period ‘S’ is required. For instance, a difference at lag 12 (x_t − x_{t−12}) is required for monthly data with seasonality. If the series contains trend as well as seasonality, both non-seasonal and seasonal differencing need to be applied as two successive operations, in either order. If there is neither obvious trend nor seasonality, the series can be modelled by AR, MA or ARMA models. It is not advisable to difference more than twice, as over-differencing can introduce unnecessary levels of dependency in the time series data. The time series plot of the observed 10 min flow (in veh/hr) for three consecutive days is shown in Fig. 1. There is a clear seasonal pattern in the observed traffic flow, with a seasonality of 24 h, which shows that the data can be modelled using SARIMA. Fig. 1 also shows that the morning and evening peak hours were clearly repetitive and showed similar variation across the days. Inspection of the plot further suggests that there is no increasing or decreasing long-term trend in the data.

Fig. 1 Time series data of observed traffic flow in three consecutive days

The next step is to apply the differencing needed to make the input time series stationary. As there is no trend in the data and only a seasonal effect is visible, a single differencing at the lag corresponding to the 24 h seasonal period is sufficient. The 24 h flow data aggregated to 10 min intervals gave 144 data points per day (24 h × 6 data points/hr), so the seasonal period ‘S’ is 144. Hence, differencing at lag 144 (x_t − x_{t−144}) was adopted. The ACF and PACF of the differenced series are shown in Fig. 2. The ACF tapers gradually towards zero, which suggests an AR process for the non-seasonal part. The order of the AR model can be read from the PACF. There are three significant non-zero partial autocorrelations at early lags in the PACF, which indicates a possible 3rd-order AR model for the non-seasonal part. However, there is a sharp cut-off after lag 2 in the PACF, which suggests a possible AR(2) process. On the other hand, the PACF at lag 1 is comparatively higher than at lags 2 and 3, which points to a possible AR(1) for the non-seasonal part. The ACF and PACF at the seasonal lag of 144 in Fig. 2 indicate a possible MA(1) process for the seasonal part, as there is a significant spike in the ACF at lag 144. Hence, the candidate models to be tried are ARIMA (3,0,0) × (0,1,1)_144, ARIMA (2,0,0) × (0,1,1)_144 and ARIMA (1,0,0) × (0,1,1)_144. Once the possible models and their orders were identified, model estimation was performed as explained below.
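For reference, the second candidate, ARIMA (2,0,0) × (0,1,1)_144, can be written out in the standard multiplicative SARIMA form. The notation here is an assumption rather than the paper's own: B is the backshift operator (B x_t = x_{t−1}), w_t is white noise, ϑ_1 is taken to denote the seasonal MA coefficient, and the plus sign on the MA term is one common convention (some texts and software use a minus sign).

$$ (1 - \phi_1 B - \phi_2 B^2)\,(1 - B^{144})\, x_t = (1 + \vartheta_1 B^{144})\, w_t $$

Writing y_t = x_t − x_{t−144} for the seasonally differenced series, this is equivalent to

$$ y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + w_t + \vartheta_1 w_{t-144} $$

The AR(1) and AR(3) candidates differ only in the number of φ terms retained on the right-hand side.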

Fig. 2 ACF and PACF plot for the previous 3 days flow data after seasonal differencing
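The quantities plotted in Fig. 2 can be computed along the following lines. This is a minimal sketch, not the procedure used in the study: the input series is synthetic (a daily pattern plus seeded pseudo-random noise), the function names are illustrative, and the PACF is obtained with the Durbin-Levinson recursion, which is one standard way of computing it.

```python
import math
import random
from typing import List


def seasonal_difference(x: List[float], period: int) -> List[float]:
    """Return y_t = x_t - x_{t-period}; the first `period` values are lost."""
    return [x[t] - x[t - period] for t in range(period, len(x))]


def sample_acf(y: List[float], max_lag: int) -> List[float]:
    """Sample autocorrelations for lags 0..max_lag (divisor n at every lag)."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    c0 = sum(d * d for d in dev) / n
    if c0 == 0:
        raise ValueError("series is constant; ACF is undefined")
    acf = [1.0]
    for h in range(1, max_lag + 1):
        ch = sum(dev[t] * dev[t - h] for t in range(h, n)) / n
        acf.append(ch / c0)
    return acf


def sample_pacf(acf: List[float]) -> List[float]:
    """Partial autocorrelations from an ACF via the Durbin-Levinson recursion."""
    pacf = [1.0]
    phi_prev: List[float] = []
    for k in range(1, len(acf)):
        if k == 1:
            phi_kk = acf[1]
            phi = [phi_kk]
        else:
            num = acf[k] - sum(phi_prev[j] * acf[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(phi_prev[j] * acf[j + 1] for j in range(k - 1))
            phi_kk = num / den
            phi = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
            phi.append(phi_kk)
        pacf.append(phi_kk)
        phi_prev = phi
    return pacf


# Synthetic 3-day series (not field data): daily pattern plus seeded noise.
S = 144
rng = random.Random(0)
pattern = [1500 + 1000 * math.sin(2 * math.pi * i / S) for i in range(S)]
x = [pattern[t % S] + rng.gauss(0, 100) for t in range(3 * S)]

y = seasonal_difference(x, S)      # 288 values remain after differencing
acf = sample_acf(y, max_lag=20)
pacf = sample_pacf(acf)
band = 1.96 / math.sqrt(len(y))    # approximate 95 % significance band
```

Lags at which the PACF falls outside ±band are the ones that would be read as significant when choosing the non-seasonal AR order. Note that only 288 differenced values are available from 3 days of data, so the sample ACF at the seasonal lag of 144 rests on relatively few pairs.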

3.2 Model estimation and diagnostic checking

Model estimation involves estimating the model parameters, i.e., the φ’s, θ’s, Φ’s and ϑ’s. In the present study, one of the most widely used estimation methods, the maximum likelihood method, was adopted using the R software. The estimation procedures are not covered in this paper; details can be found in Brockwell and Davis [32]. The generally accepted principle is that the model with the fewest parameters that can adequately describe the process should be selected [33]. If two different models fit a series equally well, the model with fewer parameters should be preferred, because the parameters of a smaller model can be estimated more precisely. From the feasible models, the most suitable one is selected on the basis of goodness of fit. The present study uses Akaike’s Information Criterion (AIC), given by Eq. (1), to select the best model. The model with the lowest AIC is taken as the best one.

$$ \mathrm{AIC} = \ln \sigma_k^2 + \frac{n + 2k}{n}, \tag{1} $$

where σ_k^2 is the estimate of the variance, n is the number of samples and k is the number of parameters. The results of model estimation are shown in Table 1. The usual procedure is to choose the model with the lowest AIC. Since the ARIMA (2,0,0) × (0,1,1)_144 model showed an AIC of 4218.34, which is lower than that of the other two models, it was finally selected; corroboration of the chosen model is detailed in the following section.
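A short sketch of model selection with Eq. (1) is given below. The variance estimates are placeholders chosen purely for illustration, not the fitted values from this study; n = 288 assumes that the differenced series is the one being scored, and k counts only the AR and seasonal MA coefficients, which is also an assumption.

```python
import math
from typing import Dict, Tuple


def aic_eq1(sigma2_hat: float, n: int, k: int) -> float:
    """AIC as written in Eq. (1): ln(sigma_k^2) + (n + 2k) / n."""
    return math.log(sigma2_hat) + (n + 2 * k) / n


def select_model(candidates: Dict[str, Tuple[float, int]],
                 n: int) -> Tuple[str, Dict[str, float]]:
    """Score each candidate (variance estimate, parameter count); lowest wins."""
    scores = {name: aic_eq1(s2, n, k) for name, (s2, k) in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Placeholder variance estimates (hypothetical, for illustration only).
candidates = {
    "ARIMA(1,0,0)x(0,1,1)_144": (26000.0, 2),
    "ARIMA(2,0,0)x(0,1,1)_144": (24000.0, 3),
    "ARIMA(3,0,0)x(0,1,1)_144": (23950.0, 4),
}
best, scores = select_model(candidates, n=288)
```

In this made-up example the third model has a marginally smaller variance, but the extra parameter outweighs the gain, so the AR(2) candidate is selected. Statistical software typically reports AIC on the −2 log-likelihood scale, so absolute values produced by a package need not match Eq. (1) numerically even when the ranking of models agrees.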

Table 1 Parameters of the SARIMA model
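Once coefficients of the kind listed in Table 1 are available, a 24 h ahead forecast can be generated recursively. The sketch below is one possible implementation under stated assumptions, not the R routine used in the study: residuals are obtained by a simple conditional recursion with pre-sample terms set to zero, the seasonal MA term enters with a plus sign, and the coefficient values in the example are hypothetical placeholders rather than the fitted ones.

```python
import math
from typing import List, Sequence


def sarima_200_011_forecast(x: Sequence[float], phi1: float, phi2: float,
                            theta_s: float, period: int,
                            horizon: int) -> List[float]:
    """Forecast ARIMA(2,0,0)x(0,1,1)_period up to one seasonal cycle ahead."""
    if period < 2 or not 1 <= horizon <= period:
        raise ValueError("need period >= 2 and 1 <= horizon <= period")
    if len(x) < 2 * period:
        raise ValueError("need at least two seasonal cycles of data")
    # Seasonally differenced series y_t = x_t - x_{t-period}.
    y = [x[t] - x[t - period] for t in range(period, len(x))]
    n = len(y)
    # In-sample residuals w_t; unavailable pre-sample terms are taken as zero.
    w = [0.0] * n
    for t in range(n):
        ar = (phi1 * y[t - 1] if t >= 1 else 0.0) \
            + (phi2 * y[t - 2] if t >= 2 else 0.0)
        ma = theta_s * w[t - period] if t >= period else 0.0
        w[t] = y[t] - ar - ma
    # Recursive forecasts: future shocks are replaced by their mean (zero).
    y_ext = list(y)
    forecasts = []
    for h in range(1, horizon + 1):
        t = n + h - 1
        y_hat = phi1 * y_ext[t - 1] + phi2 * y_ext[t - 2] + theta_s * w[t - period]
        y_ext.append(y_hat)
        # Undo the seasonal differencing: x_hat = y_hat + x one cycle earlier.
        forecasts.append(y_hat + x[len(x) + h - 1 - period])
    return forecasts


# Synthetic, perfectly periodic 3-day input (not field data).
PERIOD = 144
pattern = [1500 + 1000 * math.sin(2 * math.pi * i / PERIOD) for i in range(PERIOD)]
history = pattern * 3
next_day = sarima_200_011_forecast(history, phi1=0.5, phi2=0.2, theta_s=-0.6,
                                   period=PERIOD, horizon=PERIOD)
```

For a perfectly periodic input the differenced series is zero, so the forecast simply reproduces the daily pattern; with real data the AR and seasonal MA terms adjust the previous day's profile according to recent deviations.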

4 Corroboration of the prediction scheme

The model developed using the time series flow data of the previous 3 days was validated against the observed flow data of September 23, 2012. The validation step involved predicting the flow for September 23, 2012 using the previous 3 days of flow data (September 20, 21 and 22, 2012) as input and comparing the predicted flows with the observed values. The plot of the predicted 10 min flow (in veh/hr) against the observed values for September 23, 2012 is shown in Fig. 3. The resulting mean absolute percentage error (MAPE) between observed and predicted flow was 9.22 %. According to Lewis’ scale for interpreting estimation accuracy [34], a forecast with a MAPE of less than 10 % can be considered highly accurate, 11–20 % good, 21–50 % reasonable, and 51 % or more inaccurate. Most studies on flow prediction [23, 24, 29, 30, 35] reported a MAPE in the range of 10–20 %. Since traffic flow observations vary from a few hundred vehicles per hour in off-peak periods to several thousand vehicles per hour during peak periods, a MAPE in the range of 10–20 % is generally acceptable. On this basis, the results are highly accurate, with a MAPE below 10 %, and well within acceptable limits. Fig. 3 shows that the predicted flow values closely match the observed flows during both peak and off-peak hours, indicating good performance of the model even though it was developed using only limited input data. Based on the MAPE and the plot, it can be concluded that the previous 3 days of flow is adequate input for predicting the flow 24 h ahead with an accuracy that is acceptable in most ITS applications. For weekends (Saturday and Sunday), the previous days may not be suitable as input, since traffic on weekends is considerably lower than on normal weekdays. In such cases, flow data from the same day of the previous week could be used to capture the traffic flow pattern. 
The proposed method was also compared with the historic average method (the predicted flow in a given time interval is the average of the flows in the same interval over the preceding 3 days) and the naive method (the flow today equals the flow yesterday in the same time interval); the results are shown in Fig. 4. The corresponding MAPE values were 10.53 % and 10.42 % for the historic average and naive methods respectively, both higher than the 9.22 % obtained with the proposed SARIMA model.
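The error measure and the two baseline methods described above are simple enough to sketch directly. The code below is illustrative only: skipping intervals with zero observed flow in the MAPE and the exact handling of the boundaries between Lewis' bands are assumptions, and the small arrays in the tests stand in for real daily series of 144 values.

```python
from typing import List, Sequence


def mape(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error in percent; zero observations skipped."""
    if len(observed) != len(predicted):
        raise ValueError("series must have the same length")
    pairs = [(o, p) for o, p in zip(observed, predicted) if o != 0]
    if not pairs:
        raise ValueError("no non-zero observations")
    return 100.0 * sum(abs(o - p) / abs(o) for o, p in pairs) / len(pairs)


def historic_average(days: Sequence[Sequence[float]]) -> List[float]:
    """Forecast each interval as the mean of that interval over previous days."""
    return [sum(vals) / len(vals) for vals in zip(*days)]


def naive(days: Sequence[Sequence[float]]) -> List[float]:
    """Forecast each interval as yesterday's flow in the same interval."""
    return list(days[-1])


def lewis_label(mape_percent: float) -> str:
    """Accuracy band on Lewis' scale (boundary handling is an assumption)."""
    if mape_percent < 10:
        return "highly accurate"
    if mape_percent <= 20:
        return "good"
    if mape_percent <= 50:
        return "reasonable"
    return "inaccurate"
```

Applied to the figures reported here, `lewis_label(9.22)` falls in the highly accurate band, while the historic average (10.53 %) and naive (10.42 %) results fall in the good band.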

Fig. 3 Comparison of observed and predicted flow for September 23, 2012

Fig. 4 Comparison of observed and predicted flow for September 23, 2012 using historic average, naive and SARIMA methods

To check whether the prediction improves with more input, 4, 5, 6, 7, 8 and 9 previous days were also considered as input to the proposed SARIMA model instead of 3 days. For this, 10 days of traffic data from May 26 to June 06, 2014 (excluding Saturdays and Sundays) were taken from the automated traffic sensor installed at Perungudi, located about 3 km from the study location. The total vehicular flow in each 10 min interval from 12 midnight to 12 midnight was extracted for all 10 days. In Scenario 1, the flow data from the three previous days (June 03, 04 and 05) were used as input to predict the traffic flow on the next consecutive day, June 06, 2014. The remaining scenarios were constructed in a similar way; the details are given in Table 2.

Table 2 Details of the number of days considered in the SARIMA model for various scenarios

The input time series of nine previous days for Scenario 7 is shown in Fig. 5. The MAPE between observed and predicted flows on June 06, 2014 for the various scenarios is shown in Fig. 6. The MAPE initially increases, but then drops suddenly in Scenario 3, when five previous days were considered as input. This shows that the prediction improves when the same day of the previous week (a Friday in this case) is also included in the input time series. Beyond Scenario 3, the reduction in MAPE was not significant. The plot of observed and predicted flows on June 06, 2014 for Scenario 3 is shown in Fig. 7; the predicted flows closely follow the observed values.
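The composition of the input window in each scenario can be reconstructed from the dates given above. The sketch below assumes that scenario s uses the s + 2 weekdays immediately preceding the target day, which is inferred from the text (3 days in Scenario 1, 5 in Scenario 3, 9 in Scenario 7); the function names are illustrative.

```python
from datetime import date, timedelta
from typing import Dict, List


def weekdays_between(start: date, end: date) -> List[date]:
    """All Monday-to-Friday dates from start to end, inclusive."""
    days = []
    d = start
    while d <= end:
        if d.weekday() < 5:  # 0 = Monday ... 4 = Friday
            days.append(d)
        d += timedelta(days=1)
    return days


def scenario_input_days(all_days: List[date], target: date,
                        n_prev: int) -> List[date]:
    """The n_prev available days immediately before the target day."""
    idx = all_days.index(target)
    if n_prev > idx:
        raise ValueError("not enough previous days available")
    return all_days[idx - n_prev:idx]


all_days = weekdays_between(date(2014, 5, 26), date(2014, 6, 6))
target = date(2014, 6, 6)  # a Friday
# Assumed mapping: scenario s uses s + 2 previous weekdays (3 to 9 days).
scenarios: Dict[int, List[date]] = {
    s: scenario_input_days(all_days, target, s + 2) for s in range(1, 8)
}
```

Under this mapping, Scenario 3 is the first whose window reaches back to May 30, 2014, the Friday of the previous week, which is consistent with the drop in MAPE discussed above.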

Fig. 5 Input time series consisting of nine previous days flow for scenario 7

Fig. 6 MAPE between observed and predicted flows on June 06, 2014 for various scenarios analyzed

Fig. 7 Plot of observed versus predicted flows on June 06, 2014 using previous 5 days of flows as input

5 Real time short term traffic prediction

Short-term traffic prediction (a maximum of 1 h ahead) that takes real-time data into account was also attempted. In this case, the historic data (the five previous days of flow) were used together with the real-time data available up to the time of prediction on June 06, 2014. Both the morning and evening peak periods of June 06, 2014 were considered to check the model accuracy. For the morning peak, the real-time traffic flow data until 8.30 am were used to predict the flow for the next 1 h (8.30–9.30 am). Similarly, for the evening peak, the flow data until 5 pm were used to predict the traffic flow from 5 to 6 pm. The predicted values are plotted against the observed values in Figs. 8 and 9 respectively. The MAPE between observed and predicted flow was 4.37 % for the morning peak and 3.83 % for the evening peak. The results show that the developed model performs better in short-term prediction of traffic flow when real-time data are also taken into account.
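The way the input series grows in the real-time case can be illustrated as follows. This sketch only assembles the input and the forecast horizon; whether the model was re-estimated at each prediction time is not stated, so the forecasting step itself is left out. The flow values are synthetic placeholders and the function names are hypothetical.

```python
from typing import List, Sequence

BIN_MINUTES = 10
BINS_PER_DAY = 24 * 60 // BIN_MINUTES  # 144 intervals per day


def bins_elapsed(hour: int, minute: int = 0) -> int:
    """Number of complete 10 min intervals between midnight and a clock time."""
    total = hour * 60 + minute
    if total % BIN_MINUTES:
        raise ValueError("time must fall on a 10 min boundary")
    return total // BIN_MINUTES


def build_realtime_input(history_days: Sequence[Sequence[float]],
                         today_flows: Sequence[float],
                         cutoff_bins: int) -> List[float]:
    """Historic days followed by today's observations up to the cutoff."""
    series = [v for day in history_days for v in day]
    series.extend(today_flows[:cutoff_bins])
    return series


# Synthetic placeholders: five historic days and the day of interest.
history = [[1000.0] * BINS_PER_DAY for _ in range(5)]
today = [1000.0] * BINS_PER_DAY

morning_input = build_realtime_input(history, today, bins_elapsed(8, 30))
evening_input = build_realtime_input(history, today, bins_elapsed(17))
horizon = 60 // BIN_MINUTES  # six 10 min steps cover the next 1 h
```

For the morning peak the input therefore contains 5 × 144 + 51 = 771 observations and six steps are forecast; for the evening peak it contains 822 observations.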

Fig. 8 Comparison of observed and predicted flow during morning peak hour

Fig. 9 Comparison of observed and predicted flow during evening peak hour

6 Concluding remarks

Timely and accurate prediction of traffic flow is essential for proactive traffic management and control in Advanced Traffic Management Systems (ATMS) and for real-time route guidance in Advanced Traveler Information Systems (ATIS). Among the techniques available for traffic flow prediction, time series analysis using ARIMA models is one of the most precise, and SARIMA in particular is relevant for modelling traffic flow behavior. However, the main drawback of data-driven approaches is that they require a huge historical database for model development. For example, the reported studies on the use of SARIMA for flow prediction used flow data of the order of several months to develop the prediction scheme. The use of such a huge database may restrict application in places where data availability is an issue. The storage and maintenance of historical databases can also become difficult, and running the SARIMA model on a long input time series may require more computational time and resources. The present study tries to overcome these issues by proposing a SARIMA-based scheme for short-term prediction of traffic flow that uses only limited input data. In the prediction scheme, only the previous 3 days of flow observations were considered as input for predicting the next day’s flow values (24 h ahead forecast). Short-term prediction of traffic flow during the morning and evening peak periods was also attempted using both historic and real-time data. The results were promising, and the proposed prediction scheme could be considered in situations where the availability of data is a major constraint on model development using ARIMA.