1 Introduction

The loss of a firm’s production capacity due to accidents or natural disasters can propagate through supplier–customer relationships to both upstream and downstream firms [1,2,3]. Many firms were compelled to suspend operations and reduce production due to supply chain disruption by the Great East Japan Earthquake in 2011 (Nikkei Newspaper, May 3, 2011). In the supply chain networks, individual firm shocks are widely diffused through hub firms with a very large number of links. The diffused shocks will circle back around and return to the original firm. It was pointed out that such a swinging back phenomenon prolongs the deterioration of the original firm’s performance [5]. In addition to such natural disasters, the increase in tariffs in the US–China trade war [4, 6, 7] and the suspension of economic activities caused by the worldwide COVID-19 pandemic [8,9,10,11] caused economic losses to both upstream and downstream sectors. Several recent studies have substantiated the shock propagation in supply chains using big data of the supplier–customer relationships and calculating the coefficients of ordinary least squares (OLS) regression as a measure [12,13,14]. These studies used the firms’ sales growth as the objective variable and the presence (or number) of damaged firms among their suppliers/customers as the main explanatory variable.

This approach, however, does not quantitatively reveal how much a firm’s sales growth rate decreases in accordance with the decrease in that of its suppliers/customers. One reason is that the explanatory variable is not the suppliers’/customers’ sales growth but the presence of damaged firms among the suppliers/customers. Another reason is that OLS focuses not on the accuracy of the regression but only on the value of the regression coefficients. Here, supervised machine learning methods can construct regression models that ensure generalized performance, even for out-of-sample data. By analyzing these regression models, we can clarify how much the sales growth of suppliers/customers contributes to that of a focused firm.

Our study aims to clarify the magnitude of shock propagation quantitatively through the supplier–customer relationships. Since corporate sales are determined by a complex combination of several macroeconomic factors and individual corporate factors, it is difficult to predict fluctuation in sales with high accuracy using a simple prediction model with a small number of parameters. We use supervised machine learning methods to construct models that predict the firm’s sales growth rate based on various parameters, such as corporate attributes and sales information of the firm and its suppliers/customers. The prediction models indicate that not only macroeconomic factors such as the year and country but also a firm’s fluctuation in sales are important to predict that of its transaction firms. We then plot the change in the predicted sales growth rate in accordance with that of the suppliers/customers using partial dependence plots (PDP). Thus, we quantify how much a firm’s sales growth rate changes in accordance with the change in that of its suppliers/customers, namely, the magnitude of shock propagation.

Our manuscript is organized as follows. Section 2 defines shock propagation through supplier–customer relationships in this study and describes how to construct the prediction models using supervised machine learning methods. We also explain how to verify the magnitude of shock propagation by comparing it with the sales growth rate of firms that have suppliers/customers damaged by an actual disaster event. Section 3 describes the dataset we used and how to make data subsets for investigating shock propagation by industry type and by direction of propagation. Section 4 shows the prediction performance of the constructed models and the contribution of suppliers’/customers’ information in the models. Finally, we show the magnitude of shock propagation derived from the models and demonstrate that we can simulate the magnitude of shock propagation owing to a disaster event. We make our concluding remarks in Sect. 5.

2 Method

2.1 What is Shock Propagation Through Supplier–Customer Relationships?

In the manufacturing industry, accidents originate from human operation errors, and disasters such as typhoons, floods, and earthquakes result in damage to buildings and facilities, electric outage, and gas supply interruption. This leads to the termination of the supply of materials and parts to customers, which then affects the production of firms that are not directly damaged. Similarly, the degradation of customers’ production activities also degrades the suppliers’ production activities. We define such marked degradation of corporate activities as “shock.”

When considering supply chain networks, suppliers and customers can be described as being connected by links. Shock propagates from the supplier’s node to the customer’s node and vice versa when the production activity of either the supplier or the customer is degraded. Previous studies [12,13,14] have proved that both upstream and downstream propagation is important for understanding shock in supply chain networks. Therefore, we quantify the shock propagation as downstream and upstream (i.e., from suppliers and customers) in this study.

Degradation of corporate activities does not always occur as a result of damaged buildings and facilities when the firm belongs to the non-manufacturing industry and previous studies often excluded non-manufacturing firms from their analyses [12, 14]. Provided that we do not restrict the target events to physical and exogenous events such as accidents and natural disasters, the degradation of corporate activities occurs in the non-manufacturing industry and the shock can also propagate to connected firms. Therefore, our study quantifies the shock propagation for both firms belonging to the manufacturing industry and firms belonging to the non-manufacturing industry.

2.2 Prediction Models Using Supervised Machine Learning Methods

Let \(x_i^{(n)}\) be the annual sales of the ith firm \((i=1,2,...N)\) in the year n and let \(\ln x_i^{(n)}\) be the natural logarithm of \(x_i^{(n)}\) and \(\Delta \ln x_i^{(n)} :=\ln x_i^{(n)} - \ln x_i^{(n-1)}\). We consider the following approximation of \(\Delta \ln x_i^{(n)}\) with the function f as

$$\begin{aligned} \Delta \ln x_{i}^{(n)}= & {} f\Bigg( x_{i}^{(n-1)}, \Delta \ln x_{i}^{(n-1)}, \max \left( x_{i_{1}}^{(n-1)}, \ldots , x_{i_{k}}^{(n-1)}, \ldots , x_{i_{M}}^{(n-1)}\right) , \nonumber \\& \frac{1}{M} \sum _{k=1}^{M} \Delta \ln x_{i_{k}}^{(n)}, \frac{1}{M} \sum _{k=1}^{M} \Delta \ln x_{i_{k}}^{(n-1)}, M, {\varvec{\alpha }}_{i}, {\varvec{\alpha }}_{i^{\prime }}, n\Bigg) , \end{aligned}$$
(1)

where \(i_k \in (1,2,...,N)\) \((k=1,2,..M)\) is the firm that is linked with the i-th firm as its Tier 1 supplier/customer, M is the number of suppliers/customers of the i-th firm, and \({\varvec{\alpha }}_i \in \mathbb {R}^2\) is the vector whose components correspond to the characteristics of the i-th firm: the country and category of industry. Similarly, \({\varvec{\alpha }}_{i^\prime }\) is the vector of the characteristics of the \(i^\prime \)-th firm, which has the highest sales in the year \(n-1\) among the suppliers/customers of i-th firm, namely, \(i^\prime :=\max _k \bigcup _{k=1,2,\cdots ,M} x_{i_k}^{(n-1)}\). n is the year of the sales growth rate that we approximate, from 2012 to 2016. For example, when we approximate the sales growth rate in 2016, n is 2016 and \(n-1\) is 2015. We call the variables that function f has the explanatory variables. We also call \(\Delta \ln x_i^{(n)}\) the objective variable and the ith firm the target firm.

The main explanatory variable in our study is \((1/M)\sum _{k=1}^M \Delta \ln x_{i_k}^{(n)}\), which represents the fluctuation in sales (including the shock) of suppliers/customers in year n. \(x_i^{(n-1)}\) and \(\max (x_{i_1}^{(n-1)},x_{i_2}^{(n-1)},\ldots ,x_{i_M}^{(n-1)})\) are intended to reflect the size of the target firm and its suppliers/customers. \(\Delta \ln x_i^{(n-1)}\) and \((1/M)\sum _{k=1}^{M} \Delta \ln x_{i_{k}}^{(n-1)}\) are intended to reflect the recent growth trend of the target firm and its suppliers/customers. M is expected to reflect the difference in the shock propagation depending on the number of suppliers/customers that the target firm has. \({\varvec{\alpha }}_i\), \({\varvec{\alpha }}_{i^\prime }\), and n are intended to reflect macroeconomic growth trends according to a combination of the country, category of industry, and year. Figure 1 shows a schematic of the equation to approximate the objective variable \(\Delta \ln x_i^{(n)}\).

Fig. 1
figure 1

Schematic of the equation to approximate the sales growth rate of the ith firm (which we call the target firm) in year n. One information used in the approximation is the target firm’s own information in year \(n-1\). The other information is of its Tier 1 suppliers/customers in the years n and \(n-1\)

f is determined by the supervised machine learning method. The first is the least absolute shrinkage and selection operator (LASSO) regression [15], which is a linear regression method in machine learning. The second is CatBoost [16], which is a Gradient Boosting Decision Tree method. Either machine learning method optimizes its parameters to minimize the error between the left and right sides of Eq. (1); we call this function the prediction model. The number of training data is set as four times the validation data and we implement the cross-validation five times.

Furthermore, we use a Partial Dependence Plot (PDP) [17] to visualize how much the sales growth rate of the target firm changes in accordance with that of its suppliers/customers in the prediction models. PDP visualizes high-dimensional functions and can illustrate how the objective variable depends on one explanatory variable in the machine learning model. Let \(f(x_1, x_2,\ldots ,x_p)\) be a machine learning model with p explanatory variables. We consider visualizing the values of the objective variables according to the \(i\in [1,p]\)th explanatory variable \(x_i\). \(F(x_i)\), which represents the PDP, is defined as

$$\begin{aligned} F_{i}(x)=\frac{1}{N} \sum _{j=1}^{N} f\left( x_{1}^{j}, \ldots , x_{i-1}^{j}, x, x_{i+1}^{j}, \ldots , x_{p}^{j}\right) , \end{aligned}$$
(2)

where \(x_i^j\) denotes the value of the ith explanatory variable in the data \(j \in [1,N]\).

2.3 Verification of Magnitude of Shock Propagation

We derive the relationship between the sales growth rate of a target firm and that of its suppliers/customers using the supervised machine learning methods and PDP. The relationship is the magnitude of shock propagation. To verify the derived magnitude of shock propagation, we compare it with the shock propagation observed in an actual disaster event. The disaster event we used for comparison is Hurricane Sandy, which struck the East Coast of the United States in October 2012 and caused widespread damage, mainly in New York and New Jersey, by strong winds and storm surges. The total economic loss is estimated at about 77.4 billion USD, which is the fourth-largest loss in the history of meteorological disasters in the United States as of July 2021 [18].

The procedure of this comparison is as follows. First, we define the area considered directly affected by Hurricane Sandy and, conversely, the area that was free from damage by Hurricane Sandy. Following the method of Kashiwagi et al. [14], we define the disaster-affected area as the counties that are categorized as “Very High” by the Federal Emergency Management Agency (FEMA) [19]. Meanwhile, we define counties that are not categorized as “Very High,” “High,” or “Moderate” as non-affected areas (see the appendix for a map of the defined disaster-affected areas and non-affected areas). We then define the firms whose headquarters are in the disaster-affected area and whose sales growth rate is negative as “disaster-affected firms.” We extract the i-th target firm whose headquarters are in the non-affected area and whose suppliers/customers include the disaster-affected firms. In accordance with Eq. (1), we calculate \(\Delta \ln x_i^{(n)}\) and \((1/M)\sum _{k=1}^M \Delta \ln x_{i_k}^{(n)}\) and obtain the relationship between the sales growth rates of the target firm and its suppliers/customers. Finally, we compare this relationship with the magnitude of shock propagation derived from the machine learning models.

3 Data

This study mainly uses the Supply Chain Relationships data provided by FactSet Research Systems Inc.Footnote 1 This dataset includes a time series of information on pairs of firms that have business relationships, such as a supplier and a customer or a supplier and a distributor (in the following, the distributor is identified with the customer in our study). The dataset maximizes the coverage of the business relationships by referring to direct relationships (relationships disclosed by the reporting firm) and inverse relationships (disclosed not by the reporting firm but by firms that do business with the reporting firm). The dataset’s source is publicly available information, such as securities reports and websites, and it covers only listed firms and firms that have business relationships with them. The business relationships are recorded in units of a parent firm and all transactions between subsidiaries are grouped in their parent firms.

This study also utilizes the dataset of annual sales of firms, names of countries in which the firms’ headquarters are located, and industrial codes, which correspond to levels 1–6, namely, the FactSet Revere Business Industry Classifications System (RBICS) [20]. We tag the firms in the dataset with annual sales in USD and country name. The firms are also tagged with level 2 of RBICS and categorized into “manufacturing” and “non-manufacturing” industries (see the appendix for the industry classification definition).

To investigate the shock propagation by industry type and by direction of propagation, we generate the data subsets by the following steps. We extract pairs of firms with business relationships as supplier–customer at least for some time between 2010 and 2016. This extraction is implemented because some relationships are missed by incorrect records even when the transactions are actually continuing. We prepare 20,773 target firms with suppliers and 18,935 target firms with customers from those pairs of firms. Provided that there is a mixture of manufacturing and non-manufacturing firms in the suppliers/customers of a target firm, we exclude the target firm from the following analysis. The generated four data subsets are

subset (1)::

target firms with their suppliers in the manufacturing industry,

subset (2)::

target firms with their customers in the manufacturing industry,

subset (3)::

target firms with their suppliers in the non-manufacturing industry, and

subset (4)::

target firms with their customers in the non-manufacturing industry.

In addition, we generated another data subset as

subset (5)::

target firms with their suppliers and customers in the manufacturing or non-manufacturing industry.

Data subset (5) includes supplier—target firm—customer linkages and both manufacturing and non-manufacturing firms (data with a mixture of manufacturing and non-manufacturing firms in one hierarchy are excluded). The number of target firms in data subsets (1)–(5) is 17,625, 18,506, 24,405, 21,974, and 26,992, respectively. The objective variable and explanatory variables are prepared for each data subset as described in Sect. 2.2. We compute the cumulative relative frequencies of the countries for target firms and their suppliers/customers, leave the top 80% of the countries as they are, and aggregate the remaining countries as “others.” See the appendix for the summary of statistics and the number of target firms by the country, the industry, and the year in each data subset.

Finally, we remove outliers and oversample the data to improve the imbalances in each data subset. Sales growth rates of firms concentrate around zero and the number of firms is extremely small at the positive and negative tails of the distribution of the sales growth rates. Using such imbalanced data, machine learning models have good performance only in the range of the sales growth rate with high data volume. To address this problem of imbalanced data, we first remove the target firms and their suppliers/customers whose sales growth rate is out of the quartiles as outliers. Second, we apply the synthetic minority over-sampling technique (SMOTE) [21]. For a datum belonging to a minority group, SMOTE randomly selects its neighbor from the k-nearest neighbors and interpolates a new datum between the two selected data. The procedure for applying SMOTE to each data subset is as follows. Each data subset is divided into training and validation data vs. test data = 8:2. Then, the training and validation data are divided into the training data and validation data for fivefold cross-validation. Next, we divide each data into eight ranges, such as \(\Delta \ln x_i^{(n)} \in [\pm 0.5k\cdot \sigma , \pm 0.5(k-1)\cdot \sigma ]\) \((k=1,2,3)\) (double-sign corresponds) and outside of \(\pm 1.5\sigma \). SMOTE oversamples the data in each range to align the number of data with the group that has the most data.

4 Results

4.1 Prediction Performance of Machine Learning Models

Table 1 shows the prediction performance of the machine learning models, which are obtained by applying LASSO regression and CatBoost to data subsets (1)–(4). The performance is shown in Root Mean Square Error (RMSE) and the correlation coefficient between the target firms’ true and predicted sales growth rate.

Table 1 The performance of machine learning models obtained using LASSO regression and CatBoost

“Baseline” shows the RMSE value when all predicted values are set to zero. The LASSO regression and CatBoost show better RMSE values than the Baseline, and it is concluded that the learnings are successful. We can confirm that the models avoid over-learning because the RMSE values of the prediction with the validation data are similar to that with test data. The correlation coefficients obtained by the LASSO regression and CatBoost with test data are in the range of (0.429, 0.495) and (0.506, 0.591), respectively, and these values also indicate that the learning is successful. These accuracies also show that the models obtained using CatBoost have higher prediction performance than that obtained using LASSO regression. Linear regression has been used to investigate the significance of regression coefficients in the previous studies mentioned above. On the other hand, our machine learning models indicate that Catboost is better than linear regression in prediction accuracy.

Next, Fig. 2 shows the prediction result obtained by the CatBoost models when they are applied to the test data.

Fig. 2
figure 2

Prediction results obtained by the CatBoost models when they are applied to the test data

This figure shows that there is a correlation between the target firms’ true and predicted sales growth rates, as shown in Table 1. Meanwhile, the larger the absolute values of the true sales growth rate become, the larger the deviation of the predicted sales growth rate from the true one becomes. In the case of data subset (1), the RMSE value in intervals \((-0.1, 0.0]\) and (0.0, 0.1] are 0.069 and 0.071. On the other hand, the RMSE value for intervals \((-0.2, -0.1]\) and (0.1, 0.2] are 0.116 and 0.122, and for intervals \((-\infty , -0.2]\) and \((0.2, +\infty )\) are 0.189 and 0.187, respectively.

Figure 3 shows the regression coefficients of the model obtained using LASSO regression on data subset (1).

Fig. 3
figure 3

Regression coefficients of the model obtained by using LASSO regression on data subset (1)

The explanatory variable of interest is the suppliers’/customers’ sales growth rate in year n (“Salesgrowth_supplier(n)” or “Salesgrowth_customer(n)”), and its regression coefficients show relatively large values. This result indicates that the sales growth rate of target firms decreases in accordance with the decrease in that of their suppliers/customers; this is consistent with the shock propagation through the supplier–customer relationship. We also confirmed that this tendency similarly holds true for other data subsets. Table 1 shows the performance of the model obtained using CatBoost when the explanatory variables related to suppliers/customers are excluded from the explanatory variables. The prediction performance decreases without explanatory variables related to the suppliers/customers (i.e., the third through sixth and eighth arguments in the function f of Eq. 1), as shown in Table 1. This result indicates that the information about suppliers/customers is significant for predicting the sales growth rate of target firms.

Finally, Fig. 4 shows the importance of explanatory variables of the models obtained using CatBoost on each data subset.

Fig. 4
figure 4

The importance of explanatory variables of the models obtained using CatBoost on each data subset

Regarding data subsets (1)–(4), the ranking of the contribution of the explanatory variables tends to be similar to that of the regression coefficients in the LASSO regression models. In the case of data subset (5), the sales growth rate of customers in year n is slightly more significant than that of suppliers, but there is no marked difference. The significant variables together with the sales growth rate of suppliers/customers are the year n (“n(predicted_year)”), the country in which the target firm is located (“Country_target”), the sales growth rate of the target firm in the previous year (“Salesgrowth_target\((n-1)\)”), and the category of industry of the target firm (“Industry_target”). This tendency indicates that the macroeconomic trend, expressed as a combination of a year, a country, an industry, and the firm’s own recent sales growth, also contribute as much as or more significantly than the suppliers’/customers’ sales growth rate.

4.2 Magnitude of Shock Propagation Derived from Machine Learning Models

This section describes the magnitude of shock propagation calculated using the machine learning models. Figure 5 shows the PDPs drawn from the machine learning models obtained using CatBoost on each data subset.

Fig. 5
figure 5

PDPs drawn from the machine learning models obtained using CatBoost on each data subset. “Average rate of change” indicates the average rate of change in the sales growth rate of the target firms when their suppliers’/customers’ sales growth rate varies from \(-\) 0.1 to 0.1 in steps of 0.02

The x-axis is the suppliers’/customers’ sales growth rate. It varies from \(-\) 0.1 to 0.1 in steps of 0.02 and the average value of objective variables is calculated at each step. We set the number of samples as 500 and randomly extracted 500 data from all the training, validation, and test data of each data subset. There is a positive correlation between the target firm’s sales growth rate and that of its suppliers/customers in each data subset, although the slope differs in each interval. This correlation indicates that the sales growth rate of target firms decreases in accordance with the decrease in that of their suppliers/customers through supplier–customer relationships. In data subsets (1)–(4), the average changes in the sales growth rate of target firms are 0.173, 0.192, 0.163, and 0.227, respectively. This result indicates that, for example, in data subset (1), our machine learning model predicts that the sales growth rate of a target firm decreases by 1.73% on average when that of its suppliers decreases by 10%. Provided that these PDPs are decomposed into the category of industry, there are no marked differences among the inclination of the PDPs for each category of industry.

The lower panel of Fig. 5 shows the PDPs drawn from the machine learning model obtained using CatBoost on data subset (5). The left half of these panels shows the PDP with the sales growth rate of suppliers as the x-axis and the right half shows the PDP with the sales growth rate of customers as the x-axis. The average rates of change are 0.177 and 0.224, respectively. The inclination of customers’ sales growth rate is slightly larger than that of suppliers, but the difference is not marked.

4.3 Verification of the Magnitude of Shock Propagation Using Disaster Event

This section explains the result of comparing the magnitude of shock propagation obtained in the previous section with that in an actual disaster event. First, Table 2 shows the sales growth rates of the suppliers/customers averaged over the disaster-affected area and those averaged over the non-disaster-affected area in each year n.

Table 2 Sales growth rates of suppliers/customers averaged over the disaster-affected area and those averaged over the non-disaster-affected area in each year n.

Since Hurricane Sandy struck in Oct. 2012, we measured its impact on the sales growth rate in 2013. Table 2 shows that the sales growth rate of the firms averaged over the disaster-affected area is certainly lower than that averaged over the non-affected area in 2013. In particular, in data subsets (1)–(3), the sales growth rate averaged over the disaster-affected area is statistically significantly lower than that in the non-affected area at the 5% significance level. This result indicates that Sandy indeed affected the sales growth rate of the firms in the disaster-affected area. We conclude that Sandy is an appropriate subject for investigating the magnitude of shock propagation. In subset (4), the impact of hurricane appears to be more significant in the sales growth rate of firms a year earlier than in the other subsets.

Next, we obtain the relationship between a target firm’s sales growth rate and that of its suppliers/customers, including the disaster-affected firms. Figure 6 shows the regression lines of the relationship superimposed on the PDP shown in Fig. 5.

Fig. 6
figure 6

The regression lines represent the relationship between the sales growth rate of the target firms and that of their suppliers/customers, including the disaster-affected firms. The regression line is superimposed on the PDP shown in Fig. 5

The slopes of the regression lines in each data subset (1)–(4) are 0.252, 0.095, 0.361, and 0.708, respectively. Provided these slopes are not markedly different from the average rate of change of the PDP, the magnitude of shock propagation derived from our machine learning models is reasonable. To confirm this, we first check that both the slope of the regression line and the average rate of change of PDP are not zero by a t test. As a result, only data subsets (3) and (4) contradict the hypothesis that they are zero at the 5% level of significance. For these two data subsets, we implement the t-test on the hypothesis that the slope of the regression line is equal to the slope at the points plotted in the PDP. As a result, this hypothesis is correct in either data subset at the 5% level of significance. Thus, we conclude that the magnitude of shock propagation derived from our machine learning models is reasonable compared with an actual disaster event. This conclusion indicates that we can simulate how much the shock that occurred in the disaster-affected firms propagates to their transaction firms.

5 Conclusion

This study quantified the magnitude of shock that propagates individual firms through direct supplier–customer relationships. To achieve this aim, we constructed supervised machine learning models that predict the sales growth rate of individual firms by using firm-level big data, which includes supplier–customer relationships, sales, and several corporate attributes. Then, we plotted the change in the predicted sales growth rates in accordance with those of suppliers/customers, which indicated that 16.3–22.7% of the shock on suppliers/customers propagates to their transaction firms. Meanwhile, the models also show that the performance of suppliers/customers is only one of the factors that affect their transaction firm’s performance; the macro growth trends depend on the year, country, and category of industry and the firm’s own recent growth trend has a strong impact. This finding suggests that firms do not need to overly fear the risk of shock propagation from their suppliers/customers compared to the macroeconomic environment surrounding them or their own growth trend. We also found that these machine learning models do not depend on the firm’s industry and direction of shock propagation. Finally, we compared the magnitude of shock propagation derived from the machine learning models with that observed in an actual disaster event. This comparison showed that our models accurately evaluate the magnitude of shock propagation in accordance with the scale of the natural disaster, indicating that these models work as simulators of shock propagation.

The recent trade war between the U.S. and China and the various economic impacts of COVID-19 are likely to increase risks in the global supply chain. The contribution of suppliers/customers performance in our machine learning models probably increases year by year and thus investigation of the annual change in the contribution is a future task.