1. Introduction
Strawberries are widely cultivated around the world for their rich vitamin content, antioxidant properties, and health benefits [1]. They are also highly valued by consumers for their attractive red color, distinct aroma, and sweet taste [2]. Strawberries contain high levels of vitamin C, minerals, flavonoids, and phytochemicals, and are recognized as a functional food with numerous health-enhancing properties [3]. Their phenolic compounds and vitamins contribute to their antioxidant capacity and health benefits, which has led to a significant increase in consumer demand over the past 20 years [4]. Research has shown that strawberries, rich in antioxidants and bioactive compounds, can significantly lower fasting blood glucose levels in individuals diagnosed with type 2 diabetes [5]. For these reasons, strawberries are among the most extensively studied fruits from agronomic, genetic, and nutritional perspectives [6].
Monitoring and managing CO2 concentration is therefore vital for ensuring optimal conditions for strawberry growth and productivity in greenhouse environments [7]. The intervals for collecting CO2 concentration data vary significantly from study to study, ranging from weekly to hourly, and even minute, intervals [8]. Sokolov, S.V. [9] emphasized the importance of long-term measurement of CO2 concentrations in greenhouses, highlighting the critical role of controlling these levels. This body of research underlines the importance of accurately collecting and analyzing environmental data such as CO2 concentration, and the diversity of sampling intervals and methodologies emphasizes the complexity and variability of atmospheric CO2 monitoring. Accurate and timely environmental data are essential for smart farm managers to make informed decisions regarding crop management [10].
Despite extensive research on CO2, studies aimed at determining the optimal intervals for collecting CO2 concentration data remain scarce, and despite technological advancements, the data collection intervals in existing smart farm systems still require significant improvement. If data are collected too infrequently, critical environmental changes may go undetected, preventing timely responses. Conversely, collecting data too frequently accumulates unnecessary data, consuming substantial storage and processing resources. Setting appropriate data collection intervals is therefore crucial for efficient resource use and accurate detection of environmental changes. However, the data collection cycles of existing smart farm systems range from infrequent to excessively frequent, leading either to resource wastage or to significant environmental changes being overlooked. Small- to medium-sized farms often employ customized data collection methods, focusing on key observations rather than adopting formal systems, which hinders the adoption of Farm Management Information Systems (FMISs) [11]. The choice of data collection interval plays a crucial role in optimizing data collection strategies. For example, the time of day when temperatures change the most is typically the transition between day and night [12,13]. Furthermore, elucidating the relationship between temperature and CO2 levels is imperative for optimal farm management. Fluctuations in temperature can markedly affect atmospheric CO2 concentrations, which, in turn, influence plant growth and soil health [14]. Elevated temperatures during daylight hours can enhance the rate of CO2 uptake by plants via photosynthesis, whereas reduced temperatures at night decelerate this process [15]. By synchronizing data collection with these pivotal environmental interactions, farmers can more precisely evaluate the effects of climatic conditions on crop productivity and make more informed decisions regarding resource allocation. This necessitates studies that determine efficient data collection intervals. Therefore, predicting CO2 concentrations at different collection intervals is crucial for guaranteeing the optimal growth of plants.
Additionally, in this study, the ARIMA model and the PFM were used to predict the CO2 concentration within the greenhouse, and the predictive performance of each model was derived to determine the optimal data collection interval [16]. Given their excellent ability to analyze time series data, these models have become increasingly indispensable in the agricultural sector over the past 20 years. They are primarily used to meet the critical need for precise forecasting of crop yields, market prices, and environmental conditions, which are essential for effective farm management and planning [17]. The PFM was proposed by Desai and Shingala [18] as a forecasting model for wheat yield predictions, achieving high accuracy through the FB PFM algorithm. These models predict CO2 concentration data to identify patterns, trends, and seasonality. Previous studies [19] have evaluated the performance of ARIMA and the PFM against other time series forecasting models, demonstrating that ARIMA can achieve higher accuracy in short-term predictions for agricultural data than deep learning models such as LSTM. Similarly, M’barek et al. [20] compared the PFM with LSTM-based models and found that the PFM performs better on shorter time series. The results of such predictions provide crucial criteria for scheduling subsequent data collections. For instance, during periods with strong patterns or seasonality, data collection can be increased to capture fluctuations more accurately; conversely, when trends are stable, the collection intervals can be extended to use resources more efficiently. Furthermore, the ARIMA model or the PFM can be applied to the data to accurately predict future CO2 concentrations in the greenhouse, providing essential information for greenhouse management. It is ideal to perform time series analysis on collected data and to continuously adjust and optimize the sampling strategy based on the analysis results in an iterative process.
According to the existing literature, there is a knowledge gap in utilizing ML models to predict CO2 concentrations at various collection intervals [21,22,23]. Consequently, the current study aims to construct an optimal model that is suitable for short-term predictions and has low complexity, using the ARIMA model and the PFM to forecast CO2 concentrations. Statistical time series forecasting models increase efficiency in data processing and variable selection through the use of appropriate input parameters, which is crucial for enhancing prediction accuracy; selecting suitable parameters maximizes model performance and ensures an efficient learning process. The principal aims of this research are as follows:
- (1)
By building time series forecasting models for CO2 concentration predictions, optimal hyperparameters with reduced complexity are selected to accommodate various collection intervals (1-min, 5-min, 10-min, 30-min, and 60-min intervals);
- (2)
Comparing the accuracy of all models using 5 different datasets: (1) dataset collected at 1-min intervals, (2) dataset collected at 5-min intervals, (3) dataset collected at 10-min intervals, (4) dataset collected at 30-min intervals, and (5) dataset collected at 60-min intervals;
- (3)
Comparing the performance of the two time series forecasting models for predicting CO2 concentrations.
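The five interval-based datasets listed in aim (2) can be derived from a single high-frequency sensor log by down-sampling. The sketch below illustrates this with pandas on a synthetic 1-min CO2 series; the column and dataset names are illustrative, not those of the study’s actual data pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical one-day, 1-min CO2 sensor log (synthetic values).
rng = pd.date_range("2023-01-01", periods=1440, freq="min")
co2 = pd.Series(400 + 30 * np.sin(np.arange(1440) / 240)
                + np.random.default_rng(0).normal(0, 2, 1440),
                index=rng, name="co2_ppm")

# Derive the five datasets by resampling the 1-min series
# to 1-, 5-, 10-, 30-, and 60-min mean values.
datasets = {f"CO2_{k}": co2.resample(f"{k}min").mean()
            for k in (1, 5, 10, 30, 60)}

for name, series in datasets.items():
    print(name, len(series))
```

Each wider interval shrinks the dataset proportionally (e.g., 1440 points at 1 min vs. 24 points at 60 min over one day), which is the trade-off the study evaluates.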
3. Results
3.1. Microclimate of Experimental Greenhouse
The variations in temperature, relative humidity, and CO2 concentration inside the greenhouse were analyzed during the experimental period. The ranges of temperature, relative humidity, and CO2 concentration were 4.29–35.50 °C, 7.6–98.1%, and 356.92–596 ppm, respectively. The relations between temperature, CO2 concentration, and humidity displayed distinct patterns. Specifically, a negative correlation was observed between temperature and humidity (r = −0.550, p < 0.01), indicating that as temperature increased, humidity tended to decrease. Additionally, a weak negative correlation was observed between temperature and CO2 concentration (r = −0.169, p < 0.01), suggesting a slight decrease in CO2 concentration with rising temperature. Humidity, however, showed a very weak positive correlation with CO2 concentration (r = 0.006, p = 0.049).
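The pairwise relations above are Pearson correlation coefficients. As a minimal, self-contained sketch (synthetic toy values, not the study’s measurements), the coefficient can be computed directly from its definition:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    return float((xm * ym).sum() / np.sqrt((xm ** 2).sum() * (ym ** 2).sum()))

# Toy example: temperature rising while humidity falls gives a negative r,
# qualitatively mirroring the reported r = -0.550.
temp = [10, 15, 20, 25, 30]
hum = [90, 75, 60, 50, 35]
print(round(pearson_r(temp, hum), 3))
```

In practice the same result is obtained with `numpy.corrcoef` or `scipy.stats.pearsonr`, the latter also returning the p-value used in the text.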
3.2. The Results of the ARIMA Model
In this study, the concentration of CO2 was predicted using the ARIMA model. When analyzing time series data with an ARIMA model, it is essential to verify whether the data exhibit stationarity [43]. The stationarity of the time series data was verified through the Augmented Dickey-Fuller (ADF) test. The ADF test statistic for the CO2 concentration data was −9.225884, which is below the critical values at the 1%, 5%, and 10% significance levels (−3.958454, −3.410526, and −3.127071, respectively). Additionally, the p-value was extremely low at 1.126898 × 10⁻¹³, providing strong evidence to reject the null hypothesis of non-stationarity. Therefore, the data meet the conditions of stationarity, indicating that no further differencing is necessary and that the data are suitable for ARIMA time series analysis.
Figure 4 illustrates that the CO2 concentration dataset is stationary. To sum up, the data are in a stationary state and are ready for processing with the ARIMA model.
To obtain reliable results, it is crucial to determine the appropriate parameters for the ARIMA model. The parameters (p, d, q) are defined as follows: p is the order of the autoregressive (AR) term, d is the degree of differencing, and q is the order of the moving average (MA) term.
In this study, the autocorrelation structure of the five datasets with varying collection intervals was analyzed using Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots. These analyses were essential for identifying the autocorrelation patterns of each dataset and developing suitable predictive models. The number of lags is displayed on the x-axis, while the correlation coefficient is plotted on the y-axis [35]. The ACF plot for the CO2_1 dataset shows a positive correlation beyond the first lag, suggesting the presence of a Moving Average (MA) component. The MA(1) model, the first-order moving average model, explains how the error term from a previous point in time predicts the current value in time series data; it is represented by Equation (8) [44] and is used to analyze the impact of random shocks on future values. The slow decline in the correlation coefficients, which remain relatively high after the first lag, suggests that an MA component, particularly MA(1), is necessary. Additionally, the dataset exhibits a strong autoregressive effect in the first two lags, which requires an AR(2) model. The AR(2) model, the second-order autoregressive model, relates the current value to the values of the two preceding points in time series data; it is represented by Equation (9) [44] and is useful for analyzing patterns and trends and for predicting future values. The PACF plot shows relatively high partial autocorrelation coefficients at the first two lags, followed by a sharp decline to near-zero values, suggesting that the order of the AR component should be two; the sharp decrease after the first two lags indicates that additional AR effects are not significant. The MA(1) term can then effectively capture the high correlation coefficient at the first lag. Therefore, an ARIMA (2,0,1) model was selected for the CO2_1 dataset.
In the CO2_5, CO2_10, CO2_30, and CO2_60 datasets, the ACF plots exhibit high positive correlation coefficients at the first lag, which gradually decrease. This trend suggests the possible presence of Moving Average (MA) components, particularly indicated by the sustained positive correlations at the first two lags, which makes an MA(2) model suitable for explaining the data’s correlation pattern; the persistence of positive values beyond the first two lags provides a basis for setting the moving average order at q = 2. Additionally, the PACF plots show a sharp truncation of the correlation coefficients after the initial two lags, with subsequent lags converging toward zero, indicating that an AR(2) model can adequately explain the autoregressive structure of the data. The significant partial autocorrelations at the first two lags justify an autoregressive order of p = 2, and the near-zero values beyond these lags indicate that further AR effects are not significant, meaning two AR terms suffice to model the autoregressive characteristics of the time series. Based on these common patterns in the ACF and PACF, an ARIMA (2,0,2) model was chosen for the CO2_5, CO2_10, CO2_30, and CO2_60 datasets. This model captures both the autoregressive and moving average properties of each dataset, effectively reflecting the effects up to the second lag: the two AR terms explain the autoregressive dynamics found in the initial two lags, while the two MA terms model the moving average effects observed at the initial lags. The ACF and PACF plots are depicted in Figure 5.
In this study, we considered both model fit and complexity using the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). These two indices were crucial in selecting optimal ARIMA models that are advantageous for both interpretation and prediction.
For the CO2_1 dataset, the ARIMA (2,0,1) model displayed the lowest AIC and BIC values, calculated at −123.45 and −118.90, respectively. Likewise, for the CO2_5, CO2_10, CO2_30, and CO2_60 datasets, the ARIMA (2,0,2) model showed the lowest AIC and BIC values, recorded at −123.45 and −118.90, respectively. These minimum values indicate that the model provides the best balance between fitting the data and maintaining model simplicity compared to other parameter combinations. The (p, d, q) values of the ARIMA model were determined through a thorough trial-and-error process, identifying the model parameters that maximize efficiency and effectiveness (Figure 6).
3.3. Dataset Performance
In this section, the results of the CO2_1, CO2_5, CO2_10, CO2_30, and CO2_60 datasets are compared to conduct the performance analysis. The best test-set performance was obtained from the ARIMA model on the CO2_1 dataset (MAE = 2.832, RMSE = 7.359, R2 = 0.928). For CO2 concentration prediction, all models performed better on the CO2_1 dataset than on the other datasets. The lowest test-set performance was obtained from the PFM on the CO2_60 dataset (MAE = 19.158, RMSE = 25.04, R2 = 0.753). The performance outcomes of all models for predicting CO2 concentrations are presented in Table 2. Overall, the results suggest that the ARIMA model outperformed the PFM across the five datasets.
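The three evaluation metrics used throughout this section follow their standard definitions. A minimal self-contained implementation (toy values only, not the study’s data) is:

```python
import numpy as np

def mae(y, yhat):
    """Mean absolute error."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.mean(np.abs(y - yhat)))

def rmse(y, yhat):
    """Root mean squared error."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def r2(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy check with made-up CO2 values (ppm).
y_true = [400, 410, 420, 430]
y_pred = [402, 408, 421, 428]
print(mae(y_true, y_pred), rmse(y_true, y_pred), round(r2(y_true, y_pred), 4))
# -> 1.75 1.8027756377319946 0.974
```

Lower MAE/RMSE and higher R2 indicate better predictions, which is how the dataset and model comparisons below should be read.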
In the CO2_1 dataset, the ARIMA model exhibits a 2.78% higher RMSE in the test data compared to the training data, with a 10.80% increase in MAE and a 3.83% decrease in R2. This indicates that the model demonstrates slightly higher errors while maintaining consistent predictive capabilities in the test data, and generally, the ARIMA model has shown high performance with this dataset, though a slight performance degradation is observed in the test data. In the PFM, the RMSE in the test data is 4.54% higher than in the training data, the MAE has increased by 5.93%, and the R2 has decreased by 3.06%. This shows a drop in prediction accuracy in the test data.
In the CO2_5 dataset, the ARIMA model shows a 5.98% higher RMSE in the test data compared to the training data, an 8.45% increase in MAE, and a 10.69% decrease in R2. This indicates a decline in model consistency, particularly evident in the significant drop in R2, highlighting more pronounced performance degradation in the test data compared to the training data. In the PFM, the RMSE in the test data is 0.26% higher than in the training data, with MAE increasing by 5.40% and R2 decreasing by 4.66%. The prediction accuracy for the test data was not consistently maintained. In the CO2_10 dataset, the ARIMA model shows a 5.82% higher RMSE in the test data, a 60.25% increase in MAE, and a 12.24% decrease in R2. This indicates a significant degradation in model performance. As the dataset intervals increase, the decline in performance becomes more apparent, with the increase in MAE in the test data being particularly noteworthy. The PFM exhibits a 3.92% higher RMSE in the test data compared to the training data, a 4.84% increase in MAE, and a 3.03% decrease in R2. This suggests that the model struggles to maintain consistent performance compared to the training data.
In the CO2_30 dataset, the ARIMA model exhibits a 30.44% higher RMSE in the test data compared to the training data, a 6.01% increase in MAE, and a 7.07% decrease in R2. This highlights a pronounced performance degradation as the data intervals lengthen, indicating a decline in the model’s generalization ability, particularly evident in the significant difference in RMSE. The PFM also shows deterioration in consistent predictive capabilities, with a 7.35% higher RMSE in the test data, a 5.17% increase in MAE, and a 3.89% decrease in R2, further confirming the challenges in maintaining model performance with longer data intervals.
In the CO2_60 dataset, the ARIMA model shows a significant degradation in performance at the longest data interval, with RMSE in the test data being 21.41% higher than in the training data, MAE increasing by 17.60%, and R2 decreasing by 5.22%. This indicates the largest decline in generalization ability at the longest data interval. The PFM also demonstrates reduced generalization capability in this dataset, with a 7.93% higher RMSE in the test data, a 3.84% increase in MAE, and a 1.34% decrease in R2, confirming that the model struggles to maintain performance relative to the training data in the face of extended data intervals.
3.4. Model Performance
Overall, the ARIMA model demonstrates satisfactory prediction results during both training and testing periods. The performance of the two models for CO2_1, CO2_5, CO2_10, CO2_30, and CO2_60 is shown in Figure 7 and Figure 8.
For the CO2_1 prediction training period, the ARIMA model results are comparatively better (MAE = 2.556, RMSE = 7.160, and R2 = 0.965), followed by the PFM (MAE = 17.601, RMSE = 21.417, and R2 = 0.981). Since validation assessment is crucial, as previously mentioned, this study considered the testing outcomes as the indicator of model performance. During the testing phase, the ARIMA model outperformed the PFM (MAE = 2.832, RMSE = 7.359, and R2 = 0.928). The PFM struggled to produce a competitive outcome for the CO2_1 predictions, as reflected in the results (558% higher MAE, 204% higher RMSE, and 2.477% lower R2).
The CO2_5 prediction results exhibited a performance pattern similar to the CO2_1 results: the ARIMA model outperformed the PFM during both training and testing. The top performance was achieved by the ARIMA model for the CO2_5 predictions during training (MAE = 3.798, RMSE = 8.691, and R2 = 0.945) and testing (MAE = 4.119, RMSE = 9.21, and R2 = 0.844). According to the testing results, the PFM ranked second (358.5% higher MAE, 144.5% higher RMSE, and 9.116% lower R2) (refer to Table 2). As noted earlier, the ARIMA model also demonstrated better performance in the CO2_10 predictions. The training and testing results for the CO2_10 predictions are comprehensively presented in Table 2. According to these, the ARIMA model surpassed the PFM during training, with the PFM showing a 158% higher MAE, 126.5% higher RMSE, and 8.080% lower R2 than the ARIMA model.
When evaluating the outcomes of the CO2_30 predictions, the ARIMA model results for training (MAE = 6.957, RMSE = 12.735, and R2 = 0.877) and testing (MAE = 7.375, RMSE = 16.614, and R2 = 0.815) were superior, whereas the PFM ranked second. During testing, the PFM showed a 158.960% higher MAE, 49.540% higher RMSE, and 7.840% lower R2.
Similar to the other findings, the prediction results from the CO2_60 dataset demonstrate that the models behave in a similar manner. The ARIMA model achieved better results than the PFM, which showed an 88.870% higher MAE, 44.690% higher RMSE, and 7.580% lower R2 during training. Furthermore, the comparison of the evaluation metrics between the training and testing periods is presented in Figure 9.
3.4.1. Performance of the ARIMA Model
The performance of the ARIMA model was evaluated using the RMSE, MAE, and R2 metrics. When comparing the CO2_1 and CO2_60 datasets, the CO2_1 dataset exhibited the least error and maximum performance, showing a 62.20% lower RMSE, 75.33% lower MAE, and a 21.78% higher R2 than the CO2_60 dataset. When comparing the CO2_5 and CO2_60 datasets, the ARIMA training results for RMSE were 20.10% lower, MAE was 31.25% lower, and R2 was 9.95% higher. Additionally, when comparing the CO2_10 and CO2_60 datasets, the ARIMA training RMSE was 29.03% lower, MAE was 61.35% lower, and R2 was 13.59% higher. Finally, when comparing the CO2_30 and CO2_60 datasets, the ARIMA training results showed a 55.71% reduction in RMSE, a 61.6% reduction in MAE, and a 13.87% increase in R2. In summary, data collection at a 1-min interval demonstrated the highest model performance, with an R2 of 0.928, RMSE of 7.359, and MAE of 2.832. The performance of the model in predicting CO2 concentration decreased as the data collection interval increased, showing the lowest performance on the CO2_60 dataset (R2 = 0.762, RMSE = 19.469, MAE = 11.48).
The comparison results between the actual values and predicted values are presented in Figure 7. According to the ARIMA prediction outcomes, the CO2_1, CO2_5, CO2_10, CO2_30, and CO2_60 datasets followed a similar performance pattern. The training and testing prediction performance demonstrated that the CO2_1 dataset achieved the best performance in predicting CO2 concentrations compared to the other datasets.
3.4.2. Performance of PFM
The results of the PFM are presented in Table 2. The CO2_1 dataset provided the best performance compared to the other datasets (RMSE = 22.388, MAE = 18.645, R2 = 0.951). Conversely, the lowest performance was observed in the CO2_60 dataset (RMSE = 25.04, MAE = 19.158, R2 = 0.753). Comparing these two datasets, the CO2_1 dataset exhibited the least error and maximum performance, showing a 10.59% lower RMSE, 2.68% lower MAE, and a 26.29% higher R2 than the PFM testing results for the 60-min dataset. When comparing the CO2_5 and CO2_60 datasets, the PFM testing results for RMSE were 0.42% lower, MAE was 1.42% lower, and R2 was 3.26% higher. Additionally, when comparing the CO2_10 and CO2_60 datasets, the PFM testing RMSE was 4.70% lower, MAE was 1.66% lower, and R2 was 7.70% higher. Finally, when comparing the CO2_30 and CO2_60 datasets, the PFM testing results were 0.77% lower in RMSE, 0.56% lower in MAE, and 16.73% higher in R2. Consequently, data collection at a 1-min interval demonstrated the highest model performance. Similar to the ARIMA model results, the performance of the PFM in predicting CO2 concentration decreased as the data collection interval increased, showing the lowest performance at the 60-min interval.
The comparison results between the actual values and predicted values are presented in Figure 8. According to the PFM prediction outcomes, the CO2_1, CO2_5, CO2_10, CO2_30, and CO2_60 datasets followed a similar performance pattern. The training and testing prediction performance demonstrated that the CO2_1 dataset achieved the best performance in predicting CO2 concentrations.
3.5. Model’s Performance Comparison and the Proposed Model
For the CO2_1 dataset, the ARIMA model presents a 67.13% lower RMSE and an 84.81% lower MAE than the PFM, indicating greater accuracy in terms of error metrics. Conversely, the R2, which indicates the proportion of variance the model explains, is 2.48% higher for the PFM, suggesting that it may slightly better reflect the variability in the data. For the CO2_60 dataset, the ARIMA model continues to outperform, with a 22.25% lower RMSE and a 40.08% lower MAE, while also achieving a 1.20% higher R2. This consistency suggests that the ARIMA model is generally superior to the PFM, making it the preferable choice for predicting CO2 concentrations overall.
4. Discussion
4.1. Comparative Analysis of Models in CO2 Concentration Prediction
CO2 and plant growth are closely linked due to the direct influence of CO2 on photosynthesis, nutrient uptake, biomass, and chloroplast diversity [45,46,47]. Moreover, key factors influencing the final quality and quantity of production, including seed germination, growth of roots and shoots, stem length, flower growth, and leaf development, depend largely on CO2 concentrations [48,49]. Plant growth is significantly affected by CO2 concentrations, but direct measurement of CO2 can be time-consuming, costly, and labor-intensive. Consequently, this study evaluated the accuracy of CO2 concentration predictions within a greenhouse using the ARIMA model and the PFM developed by Facebook, across various data collection intervals. The research analyzed the performance of the prediction models using five different datasets. The results indicate that the CO2_1 dataset was more effective in accurately modeling CO2 concentrations than the CO2_5, CO2_10, CO2_30, and CO2_60 datasets. This comparison of the two models’ predictive performance across the five datasets helped identify an appropriate method for predicting greenhouse CO2 concentrations.
A previous study [50] predicted greenhouse environmental variables over a period of 45 days, utilizing the ARIMA model to predict temperature and humidity and achieving a minimum error rate of 0.4% and a forecasting accuracy of 95%. In the current study, the same ARIMA model was applied to predict CO2 concentrations in a greenhouse over a period of 121 days, achieving a predictive accuracy of 92.8% on test data. These results are consistent with the previous findings and further reaffirm the reliability of the ARIMA model. The PFM effectively processes daily, weekly, and annual seasonal data and accurately identifies complex patterns of CO2 concentration that reflect changes in both internal and external greenhouse conditions [51].
This study reveals that PFM is somewhat limited compared to the ARIMA model. The ARIMA model excels at capturing rapid changes in CO2 concentrations in data collected at 1-min intervals, whereas PFM effectively identifies major trends in CO2 concentrations from data gathered every 30 min. While ARIMA requires careful tuning of its p, d, and q parameters to prevent overfitting, PFM offers a more flexible approach with less intensive parameter adjustments. Such optimization strikes an efficient balance between computational load and predictive accuracy, which is essential for real-time greenhouse management systems. Considering the overall predictive performance and minimal error metrics, the research concludes that the ARIMA model is more suitable for predicting CO2 concentrations.
4.2. Model Accomplishment
The performance of the ARIMA model is negatively affected as the data collection interval increases. Firstly, expanding the data collection interval from 1 min to 60 min leads to significant information loss. Data exhibiting high volatility, such as carbon dioxide concentrations, can change rapidly over time, and shorter intervals are more effective at capturing these fine variations. Secondly, as intervals widen, the responsiveness and temporal resolution of the model decrease, which challenges the model’s ability to capture recent changes and natural patterns, including periodicities. This directly undermines the accuracy of predictions. Since ARIMA models rely heavily on the autocorrelation of time series data, a reduced temporal resolution significantly hampers the model’s ability to learn data autocorrelation and periodicity. Furthermore, a decrease in the number of data points diminishes the amount of information available for the model to learn, particularly exacerbating performance degradation in sparse data conditions. Therefore, for effective prediction, it is ideal to collect data at as short an interval as possible, though this comes with increased costs and effort in data processing and storage. Considering these factors, the performance of the ARIMA model is superior when data collection intervals are shorter and deteriorates as intervals lengthen.
In contrast, the performance decline of the PFM is also notable as the interval widens. Specialized in analyzing various temporal elements such as trends, seasonality, and holiday effects in time series data, the PFM is particularly affected by changes in the data collection interval [37]. Firstly, increasing the interval leads to a loss of detailed data and a decrease in the accuracy of estimating seasonality and trends [38]. Data collected at 1-min intervals can capture subtle changes in CO2 concentrations, aiding the model in learning more accurate trends and patterns; data at 60-min intervals may miss important fluctuations or trends. Secondly, a wider interval can lead to less accurate identification of fine-grained seasonal patterns, such as hourly or daily variations, which diminishes the model’s predictive power in environments where short-term changes are critical [40]. Thirdly, a wider gap between data points lowers the resolution of the time series, thereby increasing the variability in statistical estimates. Generally, predictive models tend to perform better when trained on a larger number of data points [36]; therefore, as the interval widens and the number of available data points decreases, model performance is negatively affected. Finally, the resolution of the data also affects the model’s propensity for overfitting and its ability to generalize. High-resolution data can increase the risk of overfitting, but this risk can be managed through proper data handling and model tuning; conversely, low-resolution data may lead to underfitting, degrading the model’s generalization capabilities.
Consequently, while the ARIMA model enables more precise predictions with high-resolution data collected at short intervals, suggesting its suitability for rapidly changing environmental conditions, the PFM, while useful for analyzing seasonal variability and trends in time series data, sees its ability to capture fine changes diminish as the data collection interval widens. Considering the impact of data collection intervals, the choice of model and data collection strategy should be carefully determined based on the research objectives and available resources. For instance, the ARIMA model may be more appropriate in situations where real-time detection of environmental changes is required, whereas the PFM might be advantageous for long-term seasonal variability. Such decisions will vary depending on costs, data storage and processing capabilities, and the required accuracy of predictions.
4.3. Influence of Input Variables and Models on CO2 Concentration Prediction
In this study, the data collection interval had a substantial impact on the prediction accuracy of the ARIMA model and PFM for estimating CO
2 concentration. Considering the various input combinations, the dataset with CO
2_1 produced the most accurate results across all models. This suggests that shorter data collection intervals offer better accuracy compared to datasets gathered over longer intervals. Generally, the R
2 values increased as the data collection intervals decreased, while RMSE and MAE values showed a downward trend. As more input parameters were incorporated, there was a noticeable improvement in model accuracy. However, incorporating multiple input variables raised computational costs and added complexity to the model, potentially limiting its practical application [
52]. As shown in numerous studies, the amount and relevance of input parameters have a substantial impact on prediction accuracy [
53].
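For reference, the three accuracy measures used throughout this comparison (R2, RMSE, and MAE) follow their standard definitions. The sketch below is a generic implementation of those definitions, not code from the study itself.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Standard R^2, RMSE, and MAE for comparing predicted vs. observed CO2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean(resid ** 2))
    mae = np.mean(np.abs(resid))
    return float(r2), float(rmse), float(mae)

# A perfect prediction gives R^2 = 1 and zero error:
print(regression_metrics([400, 410, 420], [400, 410, 420]))  # (1.0, 0.0, 0.0)
```

Higher R2 and lower RMSE/MAE indicate better fit, which is why shorter intervals (higher R2, lower errors) are reported as more accurate above.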
Comparing the two time series forecasting models, the ARIMA model demonstrated relatively higher stability and less sensitivity to variations in the input data in prediction accuracy [
54]. Considering the increased rates of RMSE and MAE between the training and testing phases, the PFM also exhibited high stability, but ARIMA generally performed better overall. This stability may be attributed to ARIMA’s ability to effectively capture complex relationships and patterns in time series data and to generalize well to unseen data [
55]. The stability of the two models varied with different combinations of input data, with ARIMA showing less sensitivity to these variations. Primandari et al. [
56] utilized the PFM to predict CO
2 concentrations. In that study, the PFM achieved high predictive accuracy and low error values, effectively handling the seasonality and change points in CO
2 levels, which showed a continuing upward trend without any reduction in recent levels. However, the performance and stability of a model can differ based on the specific datasets used and the challenges faced, making it essential to evaluate model performance across diverse datasets [
57]. This remains a valuable consideration when selecting time series forecasting models.
The adaptability of ARIMA to various data intervals and its robust performance suggest its suitability for environments where rapid real-time data processing is crucial. For example, in urban air quality monitoring or industrial environmental management systems, the ARIMA model’s quick response to environmental changes is highly beneficial. On the other hand, the PFM’s capability to analyze long-term seasonal variations makes it more appropriate for applications like agricultural planning where long-term trends are more relevant.
We also analyzed how the data collection interval affects the size of the data files used as input to the CO2 concentration prediction model. The collection interval influences not only the performance of the prediction model but also data management, storage costs, and processing times. The CO2_1 data file is the largest, at 59.82 MB, reflecting the highest collection frequency. This high granularity captures short-term variations in detail, enabling more accurate predictions but also requiring substantial resources for data processing and storage. The CO2_5 data file shrinks to 11.96 MB, an approximately 80% decrease relative to CO2_1; this reduction sacrifices some detailed information but eases data processing and management. The CO2_10 data file is 5.98 MB, a further 50% reduction from CO2_5, which decreases storage needs and speeds up processing but also increases the potential loss in prediction accuracy. The CO2_30 data file is much smaller at 1.99 MB; such a sparse collection schedule risks missing important environmental changes, but it significantly cuts data management and processing costs. The CO2_60 data file is the smallest at 0.997 MB, offering minimal data and a high likelihood of missing critical periods, while using the least storage space and processing time. The variation in collection intervals and file sizes illustrates the trade-off between data quality and quantity on one side and processing costs on the other: high-resolution data enable more accurate analyses but incur higher costs and resource usage, as larger files demand more time and money to process and store.
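The reported file sizes are consistent with a simple inverse-proportional relationship: a file collected at an n-minute interval is roughly 1/n the size of the 1-min file. A quick check against the 59.82 MB figure reported above:

```python
SIZE_1MIN_MB = 59.82  # reported size of the CO2_1 data file

# Size at an n-minute interval is ~1/n of the 1-min file; rounding matches
# the reported 11.96, 5.98, 1.99, and 0.997 MB figures.
sizes = {n: round(SIZE_1MIN_MB / n, 2 if n < 60 else 3) for n in (1, 5, 10, 30, 60)}
print(sizes)  # {1: 59.82, 5: 11.96, 10: 5.98, 30: 1.99, 60: 0.997}
```

This proportionality makes the storage-versus-accuracy trade-off easy to project for intervals beyond the five tested here.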
Thus, assessing the model’s effectiveness and the suitability of data intervals, considering budget and infrastructure constraints, is crucial in choosing the most efficient data collection strategy. Especially for large facilities or those with limited budgets, selecting an appropriate model and data collection frequency is vital for optimizing resources and efficiency.
This study confirmed that both the ARIMA model and PFM exhibit higher prediction accuracy at shorter data collection intervals, suggesting their utility in applications where real-time or high-frequency data monitoring is crucial. In urban air quality monitoring or industrial environmental management systems, for instance, short collection intervals enable the rapid responses that real-time operation demands. However, model complexity and computational costs tend to increase as the data collection interval decreases, which can be particularly challenging with large-scale data and necessitates a cost-effectiveness analysis.
Overall, this research proposes a methodology to determine the optimal data collection frequency for regulating optimal CO2 concentrations in greenhouse crops and enhancing the efficiency of smart farm operations. This suggests an appropriate collection frequency for efficiently utilizing vast amounts of data in the agricultural sector. These findings underscore the importance of high-frequency data collection in accurately monitoring and controlling CO2 concentration within greenhouses.
5. Conclusions
In this study, we utilized two time series models to predict CO2 concentrations in strawberry greenhouses. The primary objective was to evaluate the optimal data collection intervals needed to achieve high-accuracy predictions of CO2 concentrations within the greenhouse environment. The results demonstrated that the ARIMA model outperforms the Prophet model (PFM) in predicting CO2 concentrations across all data collection intervals. Moreover, among the five datasets (CO2_1, CO2_5, CO2_10, CO2_30, CO2_60), both the ARIMA model and the PFM performed best on the CO2_1 dataset. Overall, the performance of both models improved as the data collection interval shortened. Specifically, with the CO2_1 dataset, the ARIMA model's R2 increased by 21.78%, and its RMSE and MAE decreased by 62.20% and 75.33%, respectively, compared to the CO2_60 dataset. Likewise, with the CO2_1 dataset, the PFM's R2 increased by 26.29%, and its RMSE and MAE decreased by 10.59% and 2.68%, respectively, compared to the CO2_60 dataset. This research clearly highlights the effectiveness of time series models, especially the ARIMA model, in forecasting CO2 concentrations in a greenhouse. The results offer valuable insights into CO2 concentration patterns, supporting data-driven decision-making in plant production and environmental management through real-time CO2 monitoring. However, modeling CO2 concentrations has limitations because the concentration depends on other variables such as ventilation conditions, temperature, humidity, and seasonal variations. Therefore, future studies may focus on developing predictive models that incorporate these variables to predict CO2 concentrations in plant production.
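The percentage improvements quoted above follow the usual relative-change formula, (new − old) / old × 100. As a worked example with illustrative numbers only (not the study's actual metric values):

```python
def percent_change(old, new):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0

# Illustrative values: an RMSE that falls from 50.0 to 18.9 is a 62.2%
# reduction, which is how the CO2_60-to-CO2_1 improvements are expressed above.
print(round(percent_change(50.0, 18.9), 1))  # -62.2
```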