1. Introduction
The soil sphere is an important component of the earth’s ecosystem and a major site for human subsistence and social production activities. The soil carbon pool is the largest terrestrial organic carbon pool, and it plays an important role in the global carbon cycle and maintaining the ecological balance in the context of climate warming and land use change [
1]. Soil surveying and mapping are prerequisites for better utilization and management of limited soil resources, and they are an important foundation for other soil science research. Traditional soil mapping is carried out by experts through field surveys to understand soil–landscape relationships and manual mapping based on topographic maps and aerial or satellite imagery [
2]. The traditional method is not only time-consuming and laborious but also relies heavily on expert knowledge and is difficult to apply to soil investigations on a large scale or outside the study area. With the development and application of soil genesis theory and the law of geographical similarity, digital soil mapping (DSM) has emerged [
3,
4,
5]. DSM predicts soil properties or types by mining the relationship between soil and environmental covariates through machine learning (ML) and other methods [
6]. It can save a lot of labor and time costs and is free from the limitation of expert knowledge limited to a certain area.
Soil genesis theory is the theoretical basis for DSM, which states that soils are formed through certain physical, chemical, and biological processes under the action of five soil-forming factors: climate, topography, parent material, biology, and time [
7]. Soil is an object with specific properties formed under specific environmental conditions, and the spatial distribution of soil has a synergistic relationship with the spatial distribution of environmental factors [
8]. Soil and environmental factors are geographically similar, and the more similar the combination of environmental factors is, the more similar the corresponding soil is. The environmental factors and soil properties are mostly continuous and gradual, and the closer the distance is, the closer the soil properties are, that is, the first law of geography [
9]. This geographic similarity is another theoretical basis for DSM. Based on the above theories, McBratney [
8] proposed a soil spatial prediction function with spatially autocorrelated errors (SSPFe), which is particularly relevant for those places where soil resource information is limited.
Kriging interpolation is an early method for DSM [
10]; the geostatistical approach treats soil properties and environmental covariates as regionalized variables and expresses the spatial correlation of variables as differences in distance and direction. However, the geostatistical method is based on spatial point data for interpolation and cannot fully utilize the surface data of environmental covariates (e.g., remotely sensed images, DEM, and their derived variables). ML and data mining methods, on the other hand, can solve this problem by bringing the spatial surface raster data of environmental variables into the fitted model for soil prediction mapping. In recent years, ML methods have been used to conduct many DSM studies, such as multiple linear regression (MLR), random forest (RF), regression trees (RT), artificial neural network (ANN), and Cubist [
11,
12,
13,
14,
15,
16,
17,
18]. However, most of these methods use a single model or an ensemble model of the same structure, which is limited by the single-model structure and has a low prediction accuracy in the complex areas with undulating terrain. In addition, there are also scholars who combine ML with expert knowledge for soil mapping, such as the soil land inference model (SoLIM) [
19] and high-accuracy surface modeling (HASM) [
20,
21]. HASM assumes that soil properties are constant within the same type of area, and it was first used in DSM to predict soil pH [
22], but this method may not be applicable in areas where soil properties are often asymptotic due to less human influence, and a large number of samples are required to calculate the average value for each partition.
Soil is formed under the combined action of multiple environmental factors, and it is difficult for the modeling process to take all the environmental factors into account. Limitations on the type of input environmental covariates, as well as the influence of the model structure and parameter selection, can prevent the model from fully exploiting the relationship between soil and environmental factors and can lead to a certain amount of model error [
23]. The error is manifested as the residuals between the predicted and actual measured values of the fitted model, and the residuals have some structural and spatial autocorrelation as environmental variables. Therefore, the residuals can be interpolated by geostatistical methods, and the spatial variation pattern of the residuals is attributed to a function of distance and direction. Combining regression models and geostatistics, the regression kriging (RK) method can not only make full use of the spatial surface data of environmental covariates to improve the prediction accuracy but also reduce the error of the model through residual interpolation with spatial autocorrelation. Although the RK method has the advantages of both ML and geostatistics, they have been less studied in DSM [
4].
The main purpose of this study was to compare the accuracy and potential of the ML and the RK models for SOC mapping in undulating complex terrain areas. Four ML methods—RF, Cubist, stochastic gradient boosting (SGB), and Bayesian regularized neural networks (BRNNs)—were used to fit the trend between SOC and environmental covariates, and simple kriging (SK) was used to spatially interpolate the residuals of each sampling point. We then performed residual correction on the prediction results of the ML models to obtain the final SOC prediction map.
3. Results and Analysis
3.1. Data Preprocessing and Exploratory Analysis
The feature variables with a high correlation with SOC were selected according to the correlation test, the linear correlation between the variables was tested by Pearson correlation coefficient, and the features were ranked in importance according to the principal component analysis (PCA) method (
Figure 2). The scatter plot in the lower left panel of the correlation matrix shows the trend of the fit between the variables. The nonlinear correlation was tested by the distance correlation coefficient (
Table 2). Although the correlation between the topographic index and SOC is lower than that of the vegetation index and SOC, the topographic index is mainly affected by DEM accuracy and raster scale, whereas it is less affected by human factors and environmental factors, and it has a long time stability which can effectively improve the model accuracy and stability, so it is necessary to use the topographic index with a certain correlation with SOC as the input variable. The variables with a correlation greater than 0.2 with SOC were selected in the image-derived variables, the variables with a correlation greater than 0.15 with SOC were selected in the DEM-derived variables, and the feature variables with covariance were excluded according to the threshold value of a correlation greater than 0.85 in absolute value. In the end, five vegetation and soil factors (RVI, NDVI, SATV, SCI, and SRI) and five topographic factors (Slope, TWI, RSP, CI, and PrC) were selected as predictor variables.
The 115-sample data set was partitioned into a calibration set and a validation set with 95 and 20 sample points each at an 80% ratio. Descriptive statistics (
Table 3) were performed for the calibration and validation sets, and the
p-values > 0.05 by the K-W normality test were consistent with a normal distribution, which is also evident from the overall SOC distribution density curve on the diagonal of
Figure 2. The coefficient of variation, skewness, and kurtosis values of the calibration set is very close to those of the total set data, which indicates that the calibration set is well representative of the overall sample. The independent variables in the calibration and validation sets were preprocessed in R, and the ten environmental covariates were tested for covariance and approximate zero variance, and then centralized and standardized to prevent the covariance of the features and the different data outline from affecting the instability of the model.
3.2. Performance Analysis of Trend Models
In order to evaluate the performance of the trend-fit models, explanatory analyses were performed for each of the four models after determining the best-fit parameters, and the results are shown in
Figure 3.
From the analysis in the box plot of model residuals (
Figure 3a), we can conclude that the average absolute residuals of ranger and Cubist are low, and the overall distribution is at a low level, but the maximum value of the absolute residuals of Cubist is slightly higher, indicating that the difference between its predicted maximum or minimum value and the actual value of SOC is large. While
BRNN has the largest absolute residual,
GBM is second, and the average absolute residuals of both
GBM and
BRNN are high, indicating that all the predicted values of these two models differ from the real values of SOC. In the plot of the reverse cumulative distribution of absolute residuals (
Figure 3b), it can be concluded that the cumulative values of ranger and Cubist residuals increase slowly with the increase in modeling sample data, and the model performance is more stable, while the cumulative curves of
GBM and
BRNN residuals increase faster after the input sample reaches 45%, and the model performance is not stable enough. Overall, the
ranger model has the highest accuracy and better stability in the model performance of the calibration set, followed by the Cubist model, while the
BRNN model has the lowest accuracy and the worst stability, followed by the
GBM model.
3.3. Residual Analysis and Geostatistical Modeling
Before applying geostatistics to the residual, we should check whether there is an obvious trend between the predictor and target variables in the ML models according to the residual–predicted value fit plot (
Figure 4). The analysis showed that the four ML models have almost no trend, and the fitted curves are close to horizontal, achieving very good fitting results. The soil samples corresponding to the extreme values of residuals were all from natural woodland. The number of samples from woodland was much smaller than that from agricultural land due to the small area of natural woodland in the study area, while the SOC content of natural woodland was generally much higher than that of agricultural land, which might cause the model to fit most of the sample points of agricultural land and result in larger residuals of woodland sample points. Overall, the
residuals and SOC predicted values in the residual–predicted scatter plot showed a random distribution with no obvious trend, indicating that the relationship between the target and predicted variables was almost fully expressed by the model.
The
four models tapped most of the correlations between the target variables and the environmental covariates, but the residuals still have some spatial autocorrelation due to the influence of other structural environmental factors. SK was selected for geostatistical analysis in ArcGIS, and we set the appropriate step size and number of steps by observing the change trend of the residual semi-variogram value with distance. The semi-variogram function of the residuals of the four models is shown in
Figure 5. The spatial autocorrelation of all four model residuals is low, and the block gold value is high, close to the pure block gold effect, which reflects the success of the trend models’ fitting.
3.4. Evaluation and Comparison of Trend and RK Models
In order to evaluate the performance of the residual kriging interpolation in different trend models, the residuals of the four trend models were interpolated separately, and the trend model predictions and residual kriging predictions at the same location were extracted and superimposed based on the sample points. At the same time, the model accuracy was tested with the test set to assess the generalization ability of the models, and the results are shown in
Table 4 (the models with the highest accuracy are marked in bold, and the models with the second highest accuracy are marked in bold italics).
By comparing the results of the four trend models, it was concluded that the ranger model had the highest accuracy and the lowest error in the calibration set (R2c = 0.834, RMSEc = 1.669 g/kg) and slightly lower accuracy in the test set (R2v = 0.362, RMSEv = 2.493 g/kg), but the Cubist model performed well in both the calibration and validation sets (R2c = 0.693, RMSEc = 1.880 g/kg and R2v = 0.445, RMSEv = 2.283 g/kg). However, the BRNN model performed the worst in both the calibration and validation sets, and the GBM model had the next highest accuracy, with the BRNN model explaining only about 30% of the variation (R2c = 0.323, RMSEc = 2.712 g/kg and R2v = 0.282, RMSEv = 2.589 g/kg) and the GBM model explaining about 40% of the variation (R2c = 0.430, RMSEc = 2.551 g/kg and R2v = 0.336, RMSEv = 2.522 g/kg). On the other hand, by comparing the performance of the four RK models, it can be seen that the trend models superimposed with residual kriging interpolation can greatly improve the prediction ability of the model, especially in the test set. The prediction accuracy of the four RK models reached a high level, with the prediction accuracy R2 greater than 0.56 in the calibration set and R2 greater than 0.62 in the test set. Moreover, in the comparison with the trend model without residual interpolation, the prediction accuracy of the four RK models improved by 0.033, 0.101, 0.304, and 0.246 in the calibration set, and by 0.356, 0.229, 0.388, and 0.343 in the test set, respectively. Due to the fact that the Cubist model predictions are closer to the true values (the predicted–observed fit line is close to 1:1), the R2 of the Cubist model is not the highest, but the RMSE is the smallest, and the smaller residual value domain of the trend model makes the overall residuals of the RK-Cubist model small.
The performance of several fitted models can be visualized in the fit plots of predicted values against observed values of the validation samples (
Figure 6). The predicted–observed fit curve of the four RK models are closer to 1:1 than that of the trend models, and the scatter plot distribution is more concentrated near the fit curve. In addition, the slope of the predicted–observed fit curve of the RK-GBM and RK-BRNN models increases significantly, indicating that the prediction accuracy has been greatly improved; the slope of the predicted–observed fit curve of the RK-ranger and RK-Cubist models changes less, mainly reflecting that the scatter distribution becomes more concentrated and the scatter points are closer to the fit curve.
3.5. Variable Importance Analysis of Different Models
We chose the RMSE as the model performance index to rank the importance of the variables, and, considering the uncertainty of the permutations, we computed the mean values over a set of 10 permutations (
Figure 7). For the ranger model, the importance of each predictor variable did not differ significantly, except for the most important, B11; the lack of one of the other variables would increase the RMSE of the model by no more than 0.3 g/kg, and the absolute RMSE was not greater than 2 g/kg. For the Cubist model, the first three variables were more important, namely, SATVI, B11, and SRI, and the remaining variables had low importance. For the GBM model, the importance of the B11 was largest, the importance of the remaining variables was relatively small, and the importance of the last six variables was weak. For the BRNN model, the first five variables had high importance, and the last six variables had a weak impact on the model. In general, the ranger model has the lowest dependence on variables, and the absence of a particular variable does not have a significant impact, whereas the GBM model is too dependent on one variable, which may bring greater instability to the model. Although the importance of the variables varied among the models, B11, RVI, and SATVI were the main explanatory variables for SOC prediction, indicating that the image derivatives were generally more important than the DEM derivatives. This indicates the high relevance and good prediction ability of vegetation and soil indices regarding SOC content.
3.6. Spatial Distribution of SOC Content Maps
The optimal parameters of the model were used to predict SOC content, and the spatial distribution maps of SOC content predicted by four trend models and four RK models are presented in
Figure 8. The SOC content predicted by the eight models has similar spatial distribution characteristics, with low SOC content at high terrain and high SOC content at low terrain, which indicates an obvious correlation between SOC content and terrain. Topography is not determined by absolute elevation but by relative elevation, so that SOC is highly correlated with topographic indices, such as slope position and slope gradient, but not with absolute elevation. When the topography changes in a complex way, there may be a situation where the elevation is the same but the slope is different, and then using the topographic index instead of elevation as a predictor variable can effectively explain the nonlinear relationship. Moreover, topography type integrates topographic factors such as slope position, slope gradient, slope length, etc., which can be used as a predictive variable for DSM. In addition, areas with high SOC content are mainly distributed in areas with a high vegetation index and low soil index, which indicate that SOC has obvious correlation with soil color, soil mineral composition, and surface vegetation cover.
For the ranger and BRNN models, the SOC content prediction value range is smaller than the observed value range (4.29~19.79 g/kg); the ranger spatial distribution boundary is obvious, but the BRNN spatial distribution boundary is fuzzy. However, the SOC content prediction value range of the Cubist model is the largest (3.74~19.71 g/kg), slightly larger than the observed value range. The predicted SOC content map of the GBM model has obvious distribution boundaries in space, which is not consistent with the asymptotic nature of spatial distribution, and the predicted value range is the smallest (9.28~14.10 g/kg), which is larger than the observed value range of SOC content and also leads to its lower prediction accuracy in the calibration and test sets, which is consistent with the model performance in
Table 4.
Due to the inclusion of the effect of residuals, which extends the upper and lower bounds of the predicted values, the predicted value domain of the RK models increases over both the predicted value domains of the corresponding trend models. While the transition boundary of the trend model is more obvious, the RK model weakens this trend, and SOC has better asymptotic variability in space. In the central and eastern parts of the study area, the values became lower in places with high SOC content and higher in places with low SOC content, which reduced the differences between land use types in a small area and was more consistent with the actual distribution. In the natural forest and orchard areas in the northwest, the trend models, in order to fit the whole samples, made the predicted values lower in this area, and the residuals were larger than the observed values. The predicted values of the RK models in the northwestern part of the study area were significantly higher after superimposing the residuals, which improved the fitting accuracy in this area to a large extent.
4. Discussion
The reasons affecting the DSM prediction accuracy are mainly composed of three parts: model structure, tuning parameters, and input predictor variables [
23].
For the trend models: The RF and Cubist models both have higher model accuracy, and SGB and BRNN have worse accuracy, with high accuracy in the RF calibration set and the Cubist test set. In addition, the prediction value domain of the Cubist model is larger than other models and larger than the upper and lower limits of the observed value domain. This may be due to the fact that RF uses a tree-based learner and the splitting criteria of the tree focus only on the range of relevant features and useful values, which cannot infer higher or lower values than those in the calibration data, and it is almost impossible to obtain values outside the observed interval. In contrast, classical linear regression is less influenced by the range of data values and is good at inferring trends, and the nodes of the Cubist tree model are linear models rather than specific values. The Cubist model combines the advantages of the tree model and the linear model, which make the predicted value domain potentially larger than the observed value domain and has better generalization ability in the unsampled region. In this study, RF as a traditional ensemble model has high prediction accuracy but performs slightly worse in the validation set. Cubist is not only the second most accurate in the calibration set but also has the highest accuracy in the validation set, so we can conclude that Cubist has good prospects for application in SOC prediction in hilly farming areas with complex topography. In addition, all four trend models require user-defined parameters, which can lead to modeling instability and affect their generalization ability and practical applications. To balance the prediction accuracy and stability of the models while ensuring the prediction accuracy, models with the same algorithm but without tuning parameters or with fewer parameters can be selected. For instance, both RF and bagged CART methods are integration models of the tree model and bagging algorithms, but bagged CART can be used with the “treebag“ function without tuning parameters for modeling, which will improve the stability of the model to some extent.
For the RK models: The influence of environmental covariates not incorporated in the trend modeling and the limitations of the model’s structure and parameters can lead to incomplete mining of the relationship between the predictor and the response variables. This relationship, which is not expressed by the trend model, is passed through the model to the residuals of the target variable, causing the residuals to still have a certain trend and exhibit some spatial autocorrelation in space. We can use the kriging method to analyze the residuals of the model with geostatistics, and the remaining trend of the target variable can be tapped to improve the prediction accuracy. It is difficult to achieve a good fitting accuracy with the default parameters of geostatistics, and they often need to be set artificially according to the semi-variance function. The RK method combines the advantages of geostatistics and ML models, which not only exploits the relationship between target variables and known environmental covariates but also makes full use of the spatial autocorrelation of target variables and can effectively reduce the errors caused by uninvolved predictor variables or model instability. The four RK models all improve the prediction accuracy and stability to a great extent, which indicates that the RK method has good scientific research value and application prospects in DSM.
Regarding the improvement of predictor variables and study content: Residuals brought by structural factors with spatial correlation can be well predicted by kriging interpolation, but residuals brought by random factors such as human activities (e.g., land use cover) cannot be expressed. In addition, the residual–predicted value fit plot (
Figure 4) analysis shows that the residuals are larger for all sample points in natural woodland, partly because the large area of agricultural land makes the trend model fit the natural woodland sample points less well, and partly because of the non-asymptotic variability between natural woodland and agricultural land, and this abrupt variability can lead to unstable fits. Land use cover has a certain randomness in spatial distribution due to the dual effect of natural and human factors over a long period of time, but it also has a certain dominance and stability in the soil formation process, so it is necessary to fit this change trend of SOC by using land use type as a category factor. From the predicted distribution map of SOC content (
Figure 8), the correlation between the spatial distribution of SOC content and terrain height is generally obvious, and the correlation with indices reflecting topographic features such as slope position, slope degree, and moisture index is high, but the correlation coefficient with Jedi elevation is not high. This indicates that there is no obvious relationship between soil properties and elevation in the undulating hilly areas, and topographic indices or relative elevation should be used as predictor variables. In addition, the landscape types related to soil mapping can be studied based on the theory of soil genesis and the law of similarity of the geographic environment, integrating land use types and topographic features, as a way to express the topography, vegetation, and human activities among soil genesis factors, and using the landscape types as key factors to predict soil attributes or types, effectively improving the prediction accuracy and interpretability of DSM.
In general, the machine learning method can make full use of environmental covariate surface data and predict soil properties outside the sampling area, but it needs to be restricted to areas with similar environmental conditions to the study area. In order to improve the generalization ability of the model, it is also necessary to expand the area covered by the study area and increase the environmental complexity of the study. The RK method can offset the effect of residuals with spatial correlation by overlaying residual interpolation and trend model prediction, which can greatly improve the prediction accuracy. However, the kriging method relies on soil sampling point data for interpolation, which cannot be used in unsampled areas, and kriging interpolation results are strongly influenced by the extreme value points, which require a good representation and uniform distribution of sample points. In the case of limited funds or time, or the situation of a large-scale soil survey, RK mapping can be carried out using the sample points of historical soil data.
5. Conclusions
In this study, four ensemble models and four RK models were used for SOC mapping. In areas with complex topography, the indices which reflect the topographical features should be used to predict SOC instead of the absolute elevation. Vegetation and soil indices play an important role in predicting SOC, and land use and geomorphic types also have a significant impact on the spatial distribution of soil attributes or types, which should be considered as a key variable for soil mapping.
The Cubist model, due to combining the advantages of linear and tree models, has both high prediction accuracy and generalization ability, and it is more advantageous in studying nonlinear relationships in areas with complex topography. The RK model can make full use of spatial autocorrelation not explained by ML models through geostatistics modeling, so it can largely improve the prediction accuracy of DSM. On the other hand, the kriging method requires that the samples should be evenly distributed and sufficient in number, and the samples of historical legacy soil data can be considered as a supplement in situations of limited research conditions or large-scale soil surveys.