1. Introduction
Energy plays a pivotal role in the economic development of nations. In this context, photovoltaic (PV) energy has emerged as a crucial component, contributing significantly to global energy dynamics. PV energy, derived from solar radiation through solar cells, has gained immense global significance as a renewable and sustainable source of power. The technology makes use of the sunlight and converts it into electricity, offering a clean and environmentally friendly alternative to traditional energy sources [
1]. With its potential to reduce dependence on fossil fuels, mitigate environmental impacts, and foster energy security, PV energy has become a key player in global efforts toward sustainable energy production and addressing climate change. The widespread adoption of PV technology worldwide reflects its importance in diversifying energy portfolios and contributing to a more sustainable future.
In the Spanish context, PV energy holds particular relevance as the nation strives to diversify its energy mix and transition toward cleaner and more sustainable sources [
2,
3]. Spain has an ideal environment for utilizing solar energy through PV technology, thanks to its abundant sunlight. This adoption of PV energy supports Spain’s goals of meeting renewable energy targets and addressing climate change issues. By taking advantage of its solar resources, Spain reduces its reliance on conventional energy sources and contributes to a greener and more secure energy infrastructure. The integration of PV energy in Spain’s energy landscape highlights its strategic importance in the nation’s pursuit of a sustainable and environmentally conscious future.
Understanding and forecasting PV electricity demand in Spain is increasingly critical as the nation actively embraces renewable energy to meet its growing power needs [
4]. With a rising focus on sustainability and a commitment to reducing carbon emissions, Spain’s energy landscape is experiencing a transformative shift toward renewable sources, particularly solar energy. Thus, the accurate forecasting of PV electricity demand is vital for efficient energy planning, ensuring a satisfactory integration of solar power into the grid and optimizing the resource allocation [
5]. As Spain expands its PV infrastructure, the ability to anticipate and meet the demand for solar electricity becomes paramount for maintaining grid stability, minimizing wastage, and achieving energy efficiency goals. Moreover, precise forecasting supports strategic decision-making, helping policymakers and energy authorities adapt to the dynamic nature of renewable energy sources and contribute to Spain’s broader objectives.
The accurate prediction and management of the PV energy demand depends on deploying precise models that play an essential role in the efficiency of energy systems [
6]. These models are important tools to anticipate variations in PV energy demand, ensure optimal grid integration, and facilitate effective energy management strategies. Precise forecasting helps align energy production with demand, minimizing the risk of grid underutilization or overloading [
7]. Additionally, it enables proactive planning for energy storage and distribution, contributing to enhanced grid stability and resilience. In the context of PV energy, where generation is subject to weather conditions, the reliability of models becomes critical in examining the variability inherent in solar power production. Therefore, using accurate models is fundamental to successfully predicting and managing PV energy demand.
As a consequence, the importance of Big Data techniques in enhancing the accuracy of energy demand predictions cannot be ignored. In the PV energy field, where vast and diverse datasets are commonly processed, Big Data techniques offer a powerful solution to extract meaningful information. The ability to process large volumes of data in real time allows for a more comprehensive understanding of dynamic factors influencing energy demand, including weather patterns, economic indicators, and consumer behavior [
1,
7,
8,
9]. Big Data approaches facilitate the identification of complex relationships and trends that may go unnoticed with traditional methods. This enhanced understanding, in turn, leads to more accurate and responsive energy demand predictions.
Big Data approaches offer significant advantages in terms of scalability and efficiency when it comes to handling large datasets. These techniques are highly proficient in managing vast amounts of information and provide a flexible infrastructure that can easily accommodate the increasing datasets that are characteristic in energy demand prediction. This scalability allows for the analytical capabilities of the system to adapt flawlessly as datasets expand, without affecting performance.
Moreover, the efficiency of Big Data approaches results from their distributed processing abilities that enable parallel computation across multiple nodes. This parallelization accelerates data processing times [
10,
11,
12]. The efficiency gains are especially critical in the context of energy demand forecasting, where real-time or near-real-time analysis is essential for effective decision-making.
Our study aims to propose and implement predictive models to improve the accuracy of PV energy demand predictions in Spain, considering the dynamic nature of renewable energy sources and the critical need for reliable forecasts in energy planning. To achieve this, our proposed methodology employs advanced predictive modeling techniques, leveraging the power of Big Data. We focus on the Linear Regression (LR), Random Forest (RF), Light Gradient Boosting Machine (LGBM), and Recurrent Neural Network (RNN) models, each designed to handle large datasets. By utilizing these models and the capabilities of Big Data, we seek to provide a robust framework for forecasting PV energy demand, contributing to more informed decision-making in the field of sustainable energy management in Spain.
2. Methodology
This section introduces the proposed methodology followed for predicting PV energy demand in Spain. The following subsections detail the dataset used and the four models applied in this study.
2.1. Dataset
The dataset utilized in this study was sourced from the Spanish Electricity System (SES). The SES plays a pivotal role in electric storage and the management of high-tension energy transportation. Its responsibility includes real-time adjustments to energy production to ensure a balance between scheduled production and demand.
The data are publicly accessible on the SES website [
13], providing users with the ability to retrieve various types of electric energy information. Employing scraping techniques, we collected the data. The resulting dataset comprises columns for the hour, date, and solar PV energy spanning from 2017 to 2022. The data granularity is set at 10 min intervals. The dataset has a period of 5 years and a couple of months, encompassing data up to 1 May 2022. In total, we collected 280.368 samples.
In the data preprocessing stage, we addressed issues such as repeated values and missing data by cleaning the dataset. Additionally, we normalized the data to ensure consistency. To prepare the PV energy demand data [
9], we employed a sliding window approach, a method that involves analyzing the data in sequential segments. This technique allows for a systematic and continuous analysis of the dataset. Once the data were appropriately prepared, we proceeded to set up four models for learning and predicting the PV energy demand, as detailed in the subsequent sections.
2.2. Linear Regression
The initial model implemented was Linear Regression (LR), utilizing the Scikit-Learn implementation. LR served as the baseline model for the comparative analysis with the subsequent models.
LR is a statistical technique used for modeling the relationship between a dependent variable and one or more independent variables. In this context, LR aims to establish a linear relationship between the features of the dataset and the solar PV energy demand [
14]. LR is a simple and widely used model that aims to establish a linear relationship between variables. Despite its simplicity, LR has demonstrated effectiveness in various scenarios, often yielding good results, though there was room for improvement. This basic model serves as a valuable benchmark, offering insights into the complexity of the problem at hand and providing an initial assessment of whether linear solutions are sufficient to address the intricacies of the dataset.
2.3. Random Forest
Random Forest (RF) is a versatile and widely employed ensemble learning technique. In our study, we utilized the RF version from MLlib, specifically designed for Big Data applications.
RF is more complex than LR, employing an ensemble of decision trees to enhance predictive accuracy. RF introduces parameters such as the number of estimators (trees) and depth, providing flexibility to adapt to various data complexities [
8]. This complexity allows for RF to capture nonlinear relationships in the data, making it a suitable choice for scenarios where linear solutions may fall short. The decision to incorporate RF into our analysis derives from its capacity to handle hidden relationships within the data, making it an appropriate choice for predicting the PV energy demand in our study.
2.4. Light Gradient Boosting Machine
Light Gradient Boosting Machine (LGBM) is a high-performance gradient-boosting framework designed for efficiency and scalability [
15]. In our study, we employed the LGBM version from SynapseML, developed for Big Data applications, as well as MLlib. SynapseML is a machine learning library that provides optimized algorithms for distributed computing environments, enhancing the efficiency of model training and prediction.
Compared to RF, LGBM introduces parameters such as the number of estimators (trees) and leaves. These parameters contribute to its adaptability and effectiveness in handling large datasets, making LGBM a favorable choice for scenarios where computational efficiency is crucial. The efficiency of LGBM in handling large datasets justifies its use in predicting PV energy demand while ensuring computational speed and scalability.
2.5. Recurrent Neural Network
Lastly, a Recurrent Neural Network (RNN) is a type of neural network designed to process sequential data by maintaining a memory of previous inputs [
16]. In our study, we utilized the Big Data-adapted version of an RNN from Keras. Keras is an open-source deep-learning library that facilitates the construction of neural networks in a user-friendly and modular manner.
In comparison to RF and LGBM, RNN introduces parameters such as the number of layers and neurons. These parameters enable RNN to capture temporal dependencies within sequential data, making it suitable for time-series prediction tasks.
We incorporated the RNN into our analysis because of its ability to model sequential dependencies in data, making it well suited for our problem as it addresses data over time.
3. Experiments
We carried out a series of experiments to evaluate the feasibility of Big Data-based solutions. To establish a baseline model, we selected LR for comparison with other models, even though its performance is not expected to be very robust. However, LR’s simplicity provides insights into the behavior of other models.
LR, which has only one parameter for training, namely, the number of iterations, was used as our foundation model. We tested RF from the MLlib library with different numbers of trees, ranging from 25 to 100, and different max depths of the trees, ranging from 4 to 8. For LGBM, which was implemented using the SynapseML library, we conducted experiments with different numbers of trees, ranging from 25 to 100, and numbers of leaves, ranging from 20 to 40. We subjected the RNN to experimentation involving 1 to 3 layers and 10 to 50 neurons. We designed these comprehensive experiments to explore the performance and behavior of each model in terms of both the accuracy and time cost.
4. Results
Table 1 presents the results of the four distinct models applied in our study: LR using MLlib, RF using MLlib, LGBM using SynapseML, and using Keras. For each model, the table displays the information in five rows. The first column denotes the model’s name, the second column signifies the window size utilized in the model, and the third and fourth columns represent the specific parameters relevant to each model, as detailed in the preceding section. The fifth and sixth columns provide the errors in terms of the RMSE and MAE, respectively. Finally, the time required to train each model is presented in the last column.
We can now evaluate the models’ performance, taking into account both the predictive accuracy and computational cost. The top-performing models, derived from Linear Regression, exhibited closely comparable results. LR was tested with different numbers of iterations, as it does not involve additional parameters. The most favorable results, though not good enough, were observed with a window size of 288 and 150 iterations. These outcomes closely resembled those obtained with a smaller window size and a reduced number of iterations, showing only a marginal improvement of approximately half a unit.
Note that we tested other linear regressors on Scikit-Learn with alternative gradients and settings in order to confirm whether there was an execution mistake or error. However, the consistency across these models points toward the conclusion that the LR model may not be optimally suited for this problem. In this study, the LR model was employed as a baseline reference in our analysis for the rest of the models.
RF achieved ten times better accuracy than LR. The time cost of RF was slightly higher than LR, as anticipated. The optimal error for RF was achieved with 75 trees, a max depth of eight, and a window size of 288. While not as accurate as with 288 predictors, RF with 144 predictors demonstrated the fastest performance.
LGBM demonstrated stable behavior, with consistently similar results across various configurations. The optimal performance was achieved with a window size of 144, 100 estimators, and 20 leaves, resulting in an RMSE of 323.295 and an MAE of 109.411. Interestingly, this configuration also generated the fastest model.
Finally, we can analyze the RNN. It is noteworthy that the RNN exhibited comparatively higher errors, with the performance varying considerably across different configurations. Discrepancies of up to 200 units in the RMSE and MAE suggest potential overfitting. Additionally, there is an almost double time difference between some configurations. It may be caused by the complexity of the settings, involving more neurons and layers. The optimal results were achieved with a window size of 36, two layers, and 10 neurons per layer. However, it did not prove to be the fastest configuration.
Table 2 collects the optimal errors attained by each model. The ranking order is as follows: LGBM, RF, the RNN, and LR. LGBM achieved the top positions for both the RMSE and MAE, with values of 323.295 and 324.635, respectively. RF, while securing the second position, exhibited an approximately 24% higher error compared to LGBM. Furthermore, the RNN demonstrated a performance approximately 22% worse than RF. It is impressive to see a consistent improvement trend between the models until reaching LR, which displayed an astonishing 760% higher error than the third-worst model, the RNN.
Table 3 below presents a comparison of the time required to train the best of each model. As expected, the simplicity of LR results in the shortest training time. LGBM took a bit longer, but it achieved the highest predictive accuracy among all the models. RF and the RNN are slower than LR and LGBM. However, LGBM is distinguished for its ability to balance high accuracy with a relatively swift training time, making it the best choice overall.
After evaluating the performance of our photovoltaic demand prediction models, we now turn to some visualizations to gain a more intuitive understanding of their behavior.
Figure 1 presents the actual demand alongside the predictions from each model: LGBM, RF, and the RNN. Bear in mind that, to improve clarity and better focus, we excluded the LR model visualization as its values would make the overall trend harder to interpret. Having said that, we can first examine the long-term trend and observe seasonal variations. Apparently, all four models seem to capture the overall trend of the data, though none perfectly match the actual demand curve. LGBM appears to track the actual demand a bit closer and the RNN deviates the most from the actual values.
Figure 2 zooms in on a specific cycle of the PV demand in order to highlight the cyclical nature of this series. As mentioned before, the actual demand follows a cyclical pattern with a clear peak. While the models align well with the actual demand during the flatter, i.e., lower, portion of the cycle, their performance weakens as the demand approaches the peak. This is evident by the increasing spread in the prediction compared to the tighter grouping at the bottom. As the demand increases, the models struggle to maintain the same level of accuracy.
Finally,
Figure 3 splits the previous cycle into two parts: the flatter
Figure 3a and the peak
Figure 3b. The first figure has a period with the lowest range of demand values compared to the entire series. While the overall trend appears flat, a closer look reveals some fluctuations. Here, the RNN exhibits the most variation in its predictions compared to the other models, which seem to better match the flatter pattern of the actual demand. If we focus on the second figure, the RNN consistently overestimates the demand. In contrast, the other models underestimate it. LGBM stands out as the closest predictor in this timeframe, with its line tracking the actual demand more closely than its counterparts.
5. Conclusions
In conclusion, our study has effectively met its objectives by successfully acquiring photovoltaic solar energy production data from the Spanish Electric Network. Through the implementation of various predictive models using Big Data-oriented techniques, we achieved generally acceptable outcomes. This integration of Big Data techniques played a crucial role in enhancing both the time efficiency and accuracy of our predictions.
Among the models employed, LGBM emerged as the best performer, demonstrating superior accuracy and efficiency. This highlights its effectiveness in handling the complexities of our dataset. On the other hand, the LR model faced challenges in delivering accurate results, and its time cost was notably prolonged. This emphasizes the limitations of LR in capturing the patterns present in the data. Our findings stress the importance of leveraging advanced predictive modeling techniques, such as LGBM, for accurate and timely predictions in the realm of photovoltaic energy demand forecasting in Spain.
By incorporating Big Data-oriented techniques, our study not only contributes a valuable understanding of forecasted trends but also sets a precedent for utilizing innovative approaches in energy demand prediction.