1. Introduction
As computational power and processing capability continue to increase, new technologies are being introduced into the most diverse industry segments, and viticulture and the wine industry are no exception. With producers and companies demanding cheaper and faster ways to produce higher-quality wine from their vineyards, and with massive increases in the information available to the market, the search for a competitive advantage is intensifying. One such advantage lies in developing new methodologies that reduce the cost of gathering information about grapes in an environmentally friendly and timely manner, allowing winemakers to obtain more frequent insights into their wine grapes, to harvest them at the optimal point of maturity and to select them according to specific quality features.
Laboratory-based hyperspectral imaging collects information on how objects reflect and absorb light as a function of wavelength, providing both spatial and spectral information [1,2]. The main advantage of this kind of image-based system over traditional chemical analysis is that it is non-destructive and reduces the overall cost of acquiring quality information. However, it produces complex data that require powerful analysis tools to extract the relevant information from the underlying patterns in the spectra. In recent years, hyperspectral techniques combined with appropriate data analysis tools have become an important means of measuring the different oenological parameters involved in evaluating the ripeness stage of grapes. The equipment most commonly used in the literature operates simultaneously in the visible and near-infrared range, between 400 and 1000 nm, although the near-infrared region above 1000 nm may also contain important absorption bands. A possible reason for the extensive use of that range, in addition to the good results obtained, is that equipment operating at longer wavelengths tends to be significantly more expensive.
In this process of analysis and evaluation, the sugar content, anthocyanin concentration and pH index are highly researched parameters because they are correlated with flavour and colour and are good indicators of the grapes' ripeness. At this stage of the maturation process, the levels of anthocyanins and sugars increase, while the acidity diminishes sharply, leaving a signature in the spectral activity of the grapes. Nevertheless, due to several factors such as climate variations, soil quality, sun exposure, water availability, altitude and harvest time, a large variability may be present in the vineyards and consequently in the quality of the grapes and in the profile of the different oenological parameters. In recent years, hyperspectral image-based methods have been proposed for the prediction of such parameters, using (a) transmittance mode [3,4], (b) interactance mode [5,6,7], and (c) reflectance mode. Since, for the same illumination scenario, the intensity of light reflected from the grape is stronger, which facilitates measurements, we chose reflectance mode spectroscopy for our methodology, making previous results obtained in the reflectance mode the most relevant for comparison. The works on the reflectance mode can be further divided into (1) reflectance mode with a small number of berries in each sample [8,9,10,11,12,13,14,15,16], using partial least squares (PLS), neural networks (NN) and least-squares support vector machines (LSSVM); and (2) reflectance mode with a large number of berries in each sample [17,18,19,20,21,22,23,24,25], using PLS or modified partial least squares (MPLS). Using a small number of berries per sample is a more difficult problem, since the samples' variability is higher than when using samples with a larger number of berries, where the reference measurements result from the mixture of the individual berries.
The use and effectiveness of support vector machines combined with hyperspectral imaging [26] has already been tested and widely employed in classification problems [27,28,29], but approaches using regression are still uncommon. Other works are available that measure different chemical compounds (for example, phenolic compounds, solid sugar compounds or aroma compounds) [30,31,32,33,34].
Table 1 provides the detailed results published in the literature for the determination of sugar content, anthocyanin concentration and pH index, with the hyperspectral imaging performed in the reflectance mode.
The main purpose of this work is to present a different approach capable of dealing with hyperspectral data of wine grape berries from several different years, and to evaluate the ability of the model to assess varieties of wine grapes that were not used in the model's training. Building robust models with a good generalization ability for new vintages and/or varieties of wine grapes is becoming a topic of major importance, since it avoids creating a new model for each new application. The existing scientific literature is sparse, apart from works previously published by the present authors [14,15] and the few exceptions [19,23] that tested new vintages on homogenised samples or on samples containing a large number of berries. Compared with the authors' previous works, which provided a less in-depth study of the model's generalization capacity, in this work we use a greater number of samples in the training phase, with more harvest years and also a different capture location, to obtain a greater description of the data variability.
In the present work, we introduce new models combining hyperspectral imaging and support vector regression [35] with a Gaussian kernel, implemented in MATLAB using the Parallel Computing Toolbox with a Compute Unified Device Architecture (CUDA)-enabled NVIDIA Graphics Processing Unit (GPU), to predict oenological parameters in grapes, namely anthocyanin concentration, pH index and sugar content. The study focuses on the generalization ability of the method and on the use of a small number of whole berries per sample, which is applicable to the selection of the best berries for precision viticulture and to the mapping of areas in order to improve ripening. The results obtained reveal better performance than previously reported solutions in the literature and, unlike the majority of previous works, the study includes different varieties and harvest years of wine grape berries, making it a more detailed study with a capacity for generalization.
Section 2 provides a brief description of the varieties of wine grape berries used, an overview of the hyperspectral imaging setup for image acquisition, and details of the procedures followed for the reflectance measurements and of the techniques applied to build the prediction model, with emphasis on the principal component analysis, support vector regression and cross-validation algorithms. Section 3 and Section 4 provide a detailed analysis of the results obtained for the different oenological parameters under study, including a comparison with current state-of-the-art approaches.
Section 5 provides general conclusions about the work conducted, alongside future directions for improvement.
2. Materials and Methods
2.1. Grape Sampling
The subject of our study was grape bunches of the Touriga Franca (TF) variety, considered one of the most important varieties for the production of port wine in the Douro region due to its resilience to plant diseases, fruity flavour and intense colour. Samples were harvested in the years 2012, 2013 and 2014 from the vineyards of Quinta do Bonfim in Pinhão, Portugal, a property of Symington Family Estates; in 2015, the samples were collected from the vineyards of the Universidade de Trás-os-Montes e Alto Douro, Vila Real, Portugal. In order to achieve the best possible model, it is important to test grapes from the beginning of veraison until maturity, and from areas of the same vineyard under different conditions (sun exposure, water availability, soil quality, among others). A total of 240 samples were collected in 2012 (24 per day), 84 in 2013 (12 per day), 120 in 2014 (12 per day) and 108 in 2015 (12 per day). The samples were collected from three different regions of the vineyard, from vines with small, medium and large vigour. Laboratory-based, line-scan, hyperspectral image acquisition was performed on fresh grape berries, as described in Section 2.2. Each sample evaluated by hyperspectral imaging was composed of six grape berries randomly collected from a single bunch and removed with their pedicel still attached. All the samples were then kept frozen at −18 °C. For the chemical analysis, the six grape berries were defrosted at room temperature, crushed and macerated in a buffer solution of tartaric acid (pH 3.2) and ethanol (95%), and the resulting mixture was kept overnight at 25 °C [36]. The samples were centrifuged (SIGMA 3K18 centrifuge, 20 min, 4 °C, 7155 g) and a clear extract was collected and mixed with acidified ethanol (0.1% hydrochloric acid). The total anthocyanin concentration was determined photometrically by the SO2 bleaching method [37]. An Ultraviolet-Visible (UV/VIS) spectrophotometer (Shimadzu) and 1 cm path length disposable cells were used for spectral measurements at 520 nm, and the pigment content, expressed in mg·L−1, was calculated from a calibration curve of malvidin-glucoside. All determinations were performed in duplicate, and the released juice was analysed for pH and Brix contents according to validated standard methods [38].
In order to test the generalization capacity of the model, 84 and 60 samples (12 per day) were collected in the year 2013 for the Tinta Barroca (TB) and Touriga Nacional (TN) varieties, which are also important varieties for the production of port wine in the Douro region. The sample collection, number of berries per sample, hyperspectral image acquisition and chemical analyses were performed as mentioned previously.
2.2. Experimental Setup for Hyperspectral Imaging
We chose the reflectance mode over the transmittance mode for hyperspectral imaging since, for the same illumination scenario, the intensity of light reflected from the grape is stronger, which facilitates measurements.
The experimental setup assembled for image collection used a hyperspectral camera, composed of a JAI Pulnix (JAI, Yokohama, Japan) black and white camera and a Specim Imspector V10E spectrograph (Specim, Oulu, Finland), and a lighting system based on a lamp holder of 300 × 300 × 175 mm (length × width × height) that held four 20 W, 12 V halogen lamps and two 40 W, 220 V blue reflector lamps (Spotline, Philips, Eindhoven, The Netherlands). Both types of lamps were powered by continuous-current power supplies to avoid light flickering, and the reflector lamps were powered at only 110 V to reduce lighting and prevent camera saturation. The resulting hyperspectral images correspond to a single line over the sample and have 1040 wavelengths (ranging from 380 to 1028 nm, with approximately 0.6 nm width in each channel) × 1392 pixels. The 1392 pixels correspond to the spatial dimension over the samples, covering approximately 110 mm of width. The distance between the camera and the sample base was 420 mm, and the camera was controlled with JAI's Coyote software inside a dark room, at room temperature.
Figure 1 illustrates the experimental setup assembled for the hyperspectral image acquisition. After the image acquisition, it is possible to identify the grape berries using image segmentation methods.
Figure 2 shows a hyperspectral image taken by the aforementioned setup, for samples of the TF 2012 variety.
Observing Figure 2, it is possible to see patterns of light absorption in three main regions, corresponding to the three wine grape berries that were placed simultaneously for imaging. A threshold-based segmentation method was then applied to each hyperspectral image to obtain an individual image for each wine grape berry, leading to the reflectance measurement step.
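A minimal Python/NumPy sketch of such a threshold-based segmentation of one line-scan image is given below; the original processing was done in MATLAB, so the function name, the threshold value and the array layout are illustrative assumptions only.

```python
import numpy as np

def segment_berries(line_image, threshold=0.2):
    """Split a hyperspectral line scan (wavelengths x pixels) into per-berry
    pixel ranges with a simple intensity threshold (illustrative sketch).

    line_image : 2-D array of shape (n_wavelengths, n_pixels), e.g. (1040, 1392).
    threshold  : fraction of the maximum mean intensity below which a pixel is
                 treated as background (value chosen for illustration only).
    """
    # Mean intensity over all wavelengths for each spatial pixel.
    profile = line_image.mean(axis=0)
    mask = profile > threshold * profile.max()

    # Group consecutive foreground pixels into berry segments.
    segments, start = [], None
    for i, fg in enumerate(mask):
        if fg and start is None:
            start = i
        elif not fg and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(mask)))
    return segments  # e.g. three (start, end) ranges, one per berry
```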
2.3. Reflectance Measurements
Reflectance is the quotient between the intensity of the light reflected by an object and the intensity of the light that illuminates that object, and it is a function of the wavelength of the light. We chose reflectance as input because the patterns of reflectance and absorption across wavelengths can uniquely identify chemical compounds and because, in contrast to the transmittance and interactance modes, the imaging can be performed without requiring contact between the spectrometer/camera and the sample. For spatial positions $x$ and wavelengths $\lambda$, the reflectance $R(x,\lambda)$ can be expressed as:

$$R(x,\lambda) = \frac{I_{grape}(x,\lambda) - I_{dark}(x,\lambda)}{I_{spectralon}(x,\lambda) - I_{dark}(x,\lambda)},$$

where $I_{grape}(x,\lambda)$ is the intensity of light reflected from the grape; $I_{spectralon}(x,\lambda)$ is the intensity of light reflected from the Spectralon (a total reflectance target); and $I_{dark}(x,\lambda)$ is the dark current signal, which is electronic noise. The dark current signal is measured with the hyperspectral camera lens covered and must be subtracted from both the grape and the Spectralon signals because it is independent of the object being imaged and would otherwise distort the calculated reflectance values.
To achieve noise reduction, an accumulation of 32 hyperspectral images of each grape berry was acquired for the $I_{grape}$, $I_{spectralon}$ and $I_{dark}$ records. All reflectance measurements for the six grape berries were carried out along the berry “equator”, and for three different berry rotations. In order to create a single reflectance spectrum, all grape point reflectance values were averaged over the spatial dimension and rotations. The resulting spectrum was then normalized (by subtracting the minimum value of each spectrum and dividing by the difference between the maximum and minimum values), in order to eliminate fluctuations in the measured light intensities due to the grape berry size and curvature.
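For illustration, the calibration and normalization just described can be sketched as follows in Python/NumPy; the array shapes and the function name are assumptions made for the example, not the MATLAB routines actually used.

```python
import numpy as np

def reflectance_spectrum(i_grape, i_spectralon, i_dark):
    """Compute one normalized reflectance spectrum for a berry (sketch).

    Each input is a stack of accumulated line scans with shape
    (n_images, n_wavelengths, n_pixels); averaging over the accumulated
    images and over the spatial pixels reduces noise.
    """
    grape = i_grape.mean(axis=0)
    white = i_spectralon.mean(axis=0)
    dark = i_dark.mean(axis=0)

    # Dark-current-corrected reflectance, pixel by pixel.
    reflectance = (grape - dark) / (white - dark)

    # Average over the spatial dimension to obtain a single spectrum.
    spectrum = reflectance.mean(axis=1)

    # Min-max normalization to reduce berry size and curvature effects.
    return (spectrum - spectrum.min()) / (spectrum.max() - spectrum.min())
```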
Figure 3 presents an example of the reflectance measurements for samples of the TF 2012 variety.
2.4. Principal Component Analysis
Due to the complexity of the data, with the dimensionality of the resulting matrix equal to the number of spectral channels measured by the hyperspectral camera (namely 1040), processing such a large, multivariate dataset is difficult. In order to obtain the maximum performance from the machine learning algorithm employed, the size of its inputs must be significantly reduced with some kind of data compression method. Consequently, a principal component analysis (PCA) was performed on the data matrix, extracting the dominant patterns present in the spectra as a complementary set of scores and loadings, and providing the means to significantly reduce the size of a dataset without losing the variability in the data. The PCA was implemented by eigenvalue decomposition of the data covariance matrix and applied to each dataset composed of a different variety and vintage of wine grape berries, together with mean-centering and auto-scaling to normalize the spectra, which subtract the mean from the original dataset and then divide by its standard deviation.
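A minimal Python/NumPy sketch of this pre-processing step (mean-centering, auto-scaling and eigenvalue decomposition of the covariance matrix) is shown below; the function name and returned values are illustrative assumptions, not the MATLAB code used in the study.

```python
import numpy as np

def pca_scores(spectra, n_components):
    """PCA by eigendecomposition of the covariance matrix (sketch).

    spectra : (n_samples, n_channels) matrix of normalized reflectance spectra.
    Returns the scores for the leading components and the fitted statistics,
    so that new samples can be projected consistently.
    """
    # Mean-centering and auto-scaling (subtract the mean, divide by the std).
    mean = spectra.mean(axis=0)
    std = spectra.std(axis=0)
    std[std == 0] = 1.0
    z = (spectra - mean) / std

    # Eigendecomposition of the covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(np.cov(z, rowvar=False))

    # Keep the eigenvectors (loadings) with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1]
    loadings = eigvecs[:, order[:n_components]]

    scores = z @ loadings
    return scores, loadings, mean, std
```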
Figure 4 shows the loadings plot for the six first Principal Components (PC) extracted from the TF 2012 variety data matrix.
Observing Figure 4, it is noticeable that there are a large number of peaks throughout the spectrum, with emphasis on the 400–700 nm region, which are usually related to chemical compounds such as chlorophyll, carotenoids and anthocyanins, while the peaks between 700 and 800 nm are commonly associated with sugars [38].
A basic assumption in the use of PCA is that the score and loading vectors corresponding to the largest eigenvalues contain the most useful information relating to the specific problem, and that the remaining ones mainly comprise noise [39]. In general, the number of principal components used is chosen as the number of factors with eigenvalues above one, resulting in a cumulative percentage of explained variance (that is, the variability accounted for in the data) usually between 95% and 99%. However, in this case the aforementioned assumption might not hold, due to the highly complex chemical interactions present in the samples that have an impact on the reflectance measurements. There is no clear answer as to how many factors should be retained for analysis (only general rules of thumb, such as a scree plot analysis) and, since we use PCA as a pre-processing step for a supervised learning task, in this paper we decided to treat the number of PCs as another parameter to be optimized during cross-validation, trying different numbers of PCs to choose the ideal one. Thus, every model was tested using between one and 50 principal components, saving the best result. We chose 50 PCs as the upper limit since the results never improved above that number of PCs and the computational cost increases as more PCs are retained for analysis.
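The selection of the number of PCs can be sketched as follows, here using scikit-learn's StandardScaler, PCA and SVR in a pipeline as stand-ins for the MATLAB implementation; the SVR settings, fold count and cross-validated RMSE criterion are placeholders for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def select_n_components(spectra, targets, max_pcs=50, folds=10):
    """Treat the number of retained PCs as a hyperparameter: try 1..max_pcs
    and keep the count with the lowest cross-validated RMSE (sketch)."""
    best_n, best_rmse = None, np.inf
    for n in range(1, max_pcs + 1):
        # Scaling and PCA are re-fitted inside each training fold by the
        # pipeline, so the validation folds remain untouched.
        model = make_pipeline(StandardScaler(), PCA(n_components=n),
                              SVR(kernel="rbf"))
        mse = -cross_val_score(model, spectra, targets, cv=folds,
                               scoring="neg_mean_squared_error").mean()
        if np.sqrt(mse) < best_rmse:
            best_n, best_rmse = n, np.sqrt(mse)
    return best_n, best_rmse
```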
2.5. Support Vector Regression
The advantages of using a support vector machine methodology (for both classification and regression) are the following: it has a regularization parameter, making it easier to avoid overfitting; it maps the input vectors to a high-dimensional feature space, so that expert knowledge about a problem can be built in by engineering the kernel function; and, most importantly, the algorithm is defined as a convex optimization problem, for which there are no local minima (unlike neural networks or partial least squares regression), and it uses only a subset of the training points, called support vectors, in the decision function, making the computational cost significantly smaller. In order to introduce our choice of regression type and kernel within the support vector regression framework, we provide below a brief description of the Support Vector (SV) algorithm; a complete mathematical formulation can be found in [40,41].
Given a set of training data $\{(x_1, y_1), \ldots, (x_\ell, y_\ell)\} \subset \mathcal{X} \times \mathbb{R}$, where $\mathcal{X}$ denotes the space of the input patterns, the objective is to find a function $f(x)$ that has at most a deviation of $\varepsilon$ from the actually measured targets $y_i$ for all the training data and, simultaneously, is as flat as possible (that is, as close to a straight line as possible).
In the case of a linear function $f(x) = \langle w, x \rangle + b$, we can express the convex optimization problem as the minimization of the Euclidean norm $\frac{1}{2}\|w\|^2$; however, we must also guarantee that infeasible constraints are dealt with by introducing slack variables $\xi_i, \xi_i^{*}$, reaching the formulation stated in [35]:

$$\begin{aligned} \text{minimize} \quad & \frac{1}{2}\|w\|^{2} + C \sum_{i=1}^{\ell}\left(\xi_i + \xi_i^{*}\right)\\ \text{subject to} \quad & y_i - \langle w, x_i\rangle - b \le \varepsilon + \xi_i,\\ & \langle w, x_i\rangle + b - y_i \le \varepsilon + \xi_i^{*},\\ & \xi_i, \xi_i^{*} \ge 0, \end{aligned}$$

where the constant $C > 0$ determines the trade-off between the flatness of $f$ and the amount up to which deviations larger than $\varepsilon$ are tolerated. An alternative to Vapnik's $\varepsilon$-SV regression was introduced by [42], named $\nu$-SV regression, where $\varepsilon$ is not defined a priori but is itself a variable, whose value is traded off against model complexity and slack variables by means of a constant $\nu$.
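As a concrete illustration of the two formulations, scikit-learn exposes both variants: SVR implements ε-SV regression and NuSVR implements ν-SV regression. The parameter values below are arbitrary examples and not the settings used in this work.

```python
from sklearn.svm import SVR, NuSVR

# epsilon-SV regression: the tube width epsilon is fixed a priori, and C
# trades off flatness against deviations larger than epsilon.
eps_svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)

# nu-SV regression: epsilon is left free; nu (an upper bound on the fraction
# of errors and a lower bound on the fraction of support vectors) controls
# the trade-off instead.
nu_svr = NuSVR(kernel="rbf", C=10.0, nu=0.5)
```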
In order to extend the SV machine to nonlinear functions, one must start by writing the optimization problem in its dual formulation via a Lagrange function built from both the objective function and the corresponding constraints, by introducing a dual set of variables $\alpha_i, \alpha_i^{*}$. Satisfying all the constraints and optimizing the equation, we reach the so-called SV expansion, stated below:

$$w = \sum_{i=1}^{\ell}\left(\alpha_i - \alpha_i^{*}\right) x_i, \qquad f(x) = \sum_{i=1}^{\ell}\left(\alpha_i - \alpha_i^{*}\right) \langle x_i, x\rangle + b,$$

where $f$ is completely described as a linear combination of the training patterns $x_i$, and the examples with $\alpha_i - \alpha_i^{*} \neq 0$ are the support vectors.
The chemical processes we are modelling are nonlinear, so the final step is to adapt the SV algorithm to deal with such processes. A computationally cheap way is to map the input vectors into a high-dimensional feature space through some nonlinear mapping, and then to solve the optimization problem in that feature space. With the use of a suitable nonlinear function $k(x_i, x)$ (called the kernel), we obtain nonlinear regression functions of the form:

$$f(x) = \sum_{i=1}^{\ell}\left(\alpha_i - \alpha_i^{*}\right) k(x_i, x) + b.$$
Two major types of kernel can be defined: local kernels (based on a distance), for which only data points that are near each other have an influence on the kernel value; and global kernels (based on the dot product), for which samples that are far away from each other still have an influence on the kernel value [43].
In the present work, a model with Vapnik's $\varepsilon$-SV algorithm and a Gaussian radial basis (local) kernel, $k(x_i, x) = \exp\!\left(-\|x_i - x\|^2 / (2\sigma^2)\right)$, was chosen because it obtained the lowest root mean square error among all the models tested (linear, sigmoid, polynomial and Gaussian kernels with Vapnik's $\varepsilon$-SV regression and Chalimourda's $\nu$-SV regression, as seen in Table 2 and Table 3). The intervals of values for the parameter $C$ of the algorithm and the parameter $\sigma$ of the kernel (the hyperparameters) were determined via a genetic algorithm, and a random search algorithm was then implemented to search these intervals for the first combination of values producing a cross-validation error below a threshold (mean squared error lower than 0.1), with 10 attempts performed in each different configuration to ensure unbiased predictions.
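A minimal sketch of this hyperparameter search is given below, assuming the (C, σ) intervals are supplied by a prior genetic-algorithm step; it uses scikit-learn's SVR instead of the paper's MATLAB implementation, and the function name, draw budget, seed and fold count are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

def random_search_svr(X, y, c_interval, sigma_interval,
                      mse_threshold=0.1, max_draws=200, folds=10, seed=0):
    """Randomly sample (C, sigma) inside the given intervals and return the
    first pair whose cross-validated MSE falls below the threshold (sketch)."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(max_draws):
        c = rng.uniform(*c_interval)
        sigma = rng.uniform(*sigma_interval)
        gamma = 1.0 / (2.0 * sigma ** 2)  # Gaussian kernel width parameter
        model = SVR(kernel="rbf", C=c, gamma=gamma)
        mse = -cross_val_score(model, X, y, cv=folds,
                               scoring="neg_mean_squared_error").mean()
        if best is None or mse < best[2]:
            best = (c, sigma, mse)
        if mse < mse_threshold:
            break
    return best  # (C, sigma, cross-validated MSE)
```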
2.6. N-Fold Cross-Validation with Test Set
To avoid overfitting, a cross-validation approach was employed in the model presented, allowing the observations to be separated into a training set (to estimate the calibration) and a validation set (to validate the calibration and to correct parameters). In the more advanced experiments on the model's capacity, an independent external test set was also used (to evaluate the model performance on an independent set of samples). To further test the model's generalization capacity, some of the independent test sets were composed of new samples of different varieties of wine grape berries not seen by the model in the training and validation sets. The cross-validation approach chosen was the n-fold cross-validation method, where the data is split into $n$ folds: $n-1$ folds are used for training and one for validation, and the procedure is repeated for every fold with a different validation set each time. The advantage of this kind of approach is that it guarantees that each sample in the dataset is used by the model at least once, contrary to methods that draw data randomly with replacement. The final validation results presented are the average of the results over all the validation sets in each experiment. The training, validation and independent test sets are chosen at random for each run, and the fine-tuning of the model's hyperparameters only occurs in the cross-validation stage to guarantee an unbiased model.
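A minimal sketch of this evaluation protocol, using scikit-learn utilities rather than the MATLAB implementation, is shown below; the fold count, test fraction and SVR settings are placeholders, and X and y are assumed to be NumPy arrays of PCA scores and reference values.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.svm import SVR

def nfold_with_test(X, y, n_folds=10, test_fraction=0.10, seed=0):
    """Hold out an independent test set, then run n-fold cross-validation on
    the remaining samples; validation figures are averaged over folds (sketch)."""
    X_dev, X_test, y_dev, y_test = train_test_split(
        X, y, test_size=test_fraction, random_state=seed)

    fold_rmse = []
    for train_idx, val_idx in KFold(n_splits=n_folds, shuffle=True,
                                    random_state=seed).split(X_dev):
        model = SVR(kernel="rbf")  # hyperparameters tuned elsewhere
        model.fit(X_dev[train_idx], y_dev[train_idx])
        pred = model.predict(X_dev[val_idx])
        fold_rmse.append(np.sqrt(np.mean((pred - y_dev[val_idx]) ** 2)))

    # Refit on all development samples before scoring the untouched test set.
    final = SVR(kernel="rbf").fit(X_dev, y_dev)
    test_rmse = np.sqrt(np.mean((final.predict(X_test) - y_test) ** 2))
    return np.mean(fold_rmse), test_rmse
```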
In order to proceed to a comparison between the present results and state-of-the-art publications, both the root mean square error of the cross-validation and test sets were used (RMSE), alongside the squared correlation coefficient (R2); these values are defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}, \qquad R^2 = \left(\frac{\mathrm{cov}(y, \hat{y})}{\sigma_{y}\,\sigma_{\hat{y}}}\right)^2,$$

where $y_i$ is the reference value, $\hat{y}_i$ is the model estimate, $\mathrm{cov}(y, \hat{y})$ is the covariance between $y$ and $\hat{y}$, and $\sigma_{y}$, $\sigma_{\hat{y}}$ are the respective standard deviations.
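These two figures of merit can be computed directly, for example with the small NumPy helper below; the use of sample (n − 1) covariance and standard deviations is an assumption about the exact convention.

```python
import numpy as np

def rmse_and_r2(y_ref, y_pred):
    """RMSE and squared correlation coefficient as defined above (sketch)."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_pred - y_ref) ** 2))
    r2 = (np.cov(y_ref, y_pred)[0, 1]
          / (y_ref.std(ddof=1) * y_pred.std(ddof=1))) ** 2
    return rmse, r2
```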
Each wine grape variety dataset had its own training and validation sets (Section 3.1), except for the 2015 vintage, which was only used in the more advanced experiments. The fine-tuning of the model's hyperparameters occurred for all varieties. An independent external test set was used in the study of the TF 2012 variety and in the study of a mixture of samples from different vintages of the TF variety (Section 3.2), since these are the only cases that, in our opinion, presented a sufficient number of samples to use a test set and obtain significant results. To further test the model's generalization capacity, an independent external test set composed of different varieties of wine grapes was also studied, to evaluate the model's performance with varieties not previously used (Section 3.3).
Table 4 provides detailed information about the experiments conducted and described in each section.
A descriptive statistical analysis was performed to study the datasets used (see Appendix A, Appendix B and Appendix C). ANOVA (one-way analysis of variance) tests were also employed to verify any significant differences between the means of the different sets (see Appendix D, Appendix E and Appendix F), and it was found that there are significant differences in the means between the datasets used in this work.
3. Results
3.1. Model Training and Validation
The validation set results obtained by the models (one for each vintage) for the prediction of the anthocyanin concentration, pH index and sugar content in the different vintages are presented in Table 5. For the 2014 vintage of the TF variety there are no laboratory results available for the anthocyanin concentration, preventing the development of a model for that particular vintage.
For each dataset, the training and validation were conducted with 10-fold cross-validation for the TF 2012 samples, and with five-fold cross-validation for the remaining vintages (because these have a smaller number of samples, and reducing the number of folds helps to avoid overfitting in these scenarios).
3.2. Model Behaviour Using Test Sets
The study of the model's behaviour using test sets was performed with the TF 2012, TF 2013, TF 2014 and TF 2015 datasets (except for the prediction of the anthocyanin concentration, for which the 2014 vintage results were not available), with 10% of the samples used as an independent test set: the first experiment used the 2012 vintage (test set: 30 samples); the second was composed of samples from the 2012 and 2013 vintages (test set: 36 samples); the third used samples from the 2012, 2013 and 2014 vintages (test set: 50 samples for sugar content and pH index); and the fourth was composed of the 2012, 2013, 2014 and 2015 vintages (test set: 63 samples for sugar content and pH index, 48 samples for anthocyanin concentration).
For each dataset, the training and validation were conducted with the remaining samples, using 10-fold cross-validation, with the fine-tuning of the hyperparameters in the Support Vector Regression (SVR) models occurring only on the training and validation sets.
Table 6 shows the results obtained for the prediction of the anthocyanin concentration, pH index and sugar content on the external test sets, while referencing the best result found in the literature for the same experiment, for a direct comparison. For more information regarding these test sets, see Appendix G, Appendix H and Appendix I.
The overall results for the different test sets and for all the oenological parameters under study represent an improvement in the R2 and RMSE values in comparison to the ones obtained in the validation sets. Since this is somewhat unusual in machine learning problems, Table 7 shows a set of experiments that were employed to rule out a possible inherent bias of the SVR algorithm to perform well on the randomly chosen data for the external test sets (that is, a natural predisposition of the algorithm to obtain better results for a certain set of samples).
For each experiment, the samples were independent and chosen at random, which guarantees that the model’s response is unbiased. A total of 10% of the samples were left out to comprise the independent test set, while the remaining samples were used to train and calibrate the model using 10-fold cross-validation.
A possible explanation for the difference in performance in the independent testing phase is that, throughout the pipeline of our model, there are various steps where overfitting is a concern [44,45,46]. The application of methods to counter this effect at each of these steps might, taken together, lead the model away from overfitting and into an underfitting situation. This is not a traditional underfitting scenario, where the model cannot capture the relationships in the data during the training stage, but a scenario where the inductive bias of the algorithm is predisposed to lead to an increased generalization capacity. Another possibility is that, because the optimization method for the hyperparameters in the SVM tries to optimize the results for the respective test sets over the many different random cross-validation runs, it sacrifices training performance for enhanced generalization capacity [47]. Either way, further study on this topic should be conducted.
The enhanced generalization capacity of the model is noticeable by the differences in the results between the validation and test sets present in Table 7: a model with strong inductive biases is likely to benefit when these biases are well suited to the data, which seems to be the case for this work.
3.3. Model Generalization: Different Varieties and Vintages
The last test aimed to further analyse the model's generalization capacity and involved the use of a mixture of samples from different vintages and varieties of wine grape berries. This is a very important configuration because, if the prediction algorithm cannot handle the variations in the grapes' oenological parameters that are known to occur between years and varieties, the application of the models becomes more complex, since a new model will be required for every different year or variety.
The test of the model’s generalization capacity was performed with two datasets: the first using the samples from all the vintages of the TF variety employed in the training and validation phases, with 10-fold cross validation, and 25% of the samples of the TB variety (23 samples) comprising the independent test set; the second also using all the vintages of the TF variety used in the training and validation stages using 10-fold cross-validation, alongside 25% of the samples of the TN variety (17 samples) used for the external test set.
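A minimal sketch of this cross-variety configuration (train on all TF samples, test on a random 25% of a variety never seen in training) is given below; the scikit-learn SVR settings and the function name are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

def cross_variety_test(X_tf, y_tf, X_new, y_new, test_fraction=0.25, seed=0):
    """Train on the TF-variety samples and score on 25% of an unseen variety
    (TB or TN), mirroring the generalization experiments (sketch)."""
    _, X_test, _, y_test = train_test_split(
        X_new, y_new, test_size=test_fraction, random_state=seed)
    model = SVR(kernel="rbf").fit(X_tf, y_tf)  # settings illustrative
    pred = model.predict(X_test)
    return np.sqrt(np.mean((pred - y_test) ** 2))  # test-set RMSE
```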
Figure 5 shows the results obtained for the determination of the anthocyanin concentration in both cases. The numbers of principal components used were 47 and 3, respectively, for each of the datasets. For additional information regarding the samples used, please check Appendix J.
Figure 6 illustrates the outcome for the prediction of the pH index in both cases. The numbers of principal components used were 11 and 50, respectively, for each of the experiments. More information on the samples used can be seen in Appendix K.
Figure 7 presents the results for the estimation of the sugar content in both cases. The numbers of principal components used were 45 and 17, respectively, for each of the datasets. Further information regarding the samples is presented in Appendix L.
5. Conclusions
A hyperspectral imaging technique was combined with a machine learning algorithm (support vector regression) to build a framework capable of estimating oenological parameters for different varieties and vintages of wine grape berries. The present paper thus describes a fast, inexpensive and non-destructive type of analysis that provides an alternative to traditional methods for studying wine grape berries during ripening.
The results obtained are competitive with current state-of-the-art publications in the prediction of sugar content, anthocyanin concentration and pH index, maintaining a high performance across different varieties and vintages of wine grapes. This represents a step forward in the study of generalization capacity, which is very important in order to achieve a model capable of predicting values for a wide variety of wine grapes without the need to capture more and more samples over the years to tune the predictor. Moreover, the hyperspectral imaging was conducted in the reflectance mode and with a small number of whole berries, a setup rarely found in the literature that is relevant for mapping areas in order to improve ripening and for selecting the best berries inside each bunch for the production of high-quality wines.
Further work should include the study of different pre-processing, dimensionality reduction and feature selection methods, which may improve the models' capacity to capture different patterns in the spectra, especially for the estimation of the pH index, which obtained slightly inferior results. In addition, datasets composed of different varieties for the training and test phases should be investigated, to further develop the generalization capacity while examining the models' behaviour when tested without supplementary training.