1 Introduction

Households are becoming increasingly heterogeneous, due to increasing wealth inequalities (Atkinson et al., 2011; Piketty, 2013), financial crisis (Krueger & Perri, 2006), or the COVID-19 pandemic (Blundell et al., 2020; Dizioli & Pinheiro, 2021). Krueger et al. (2016) found that households in different segments of the wealth distribution had different reactions to the 2007–2008 Global Financial Crisis, and Eichenbaum et al. (2021) reported that households have different COVID-19 pandemic mortality rates depending on their income levels. Consequently, many researchers have investigated the heterogeneity of household finances in various aspects. For example, heterogeneity in portfolio composition (Mankiw & Zeldes, 1991; Heaton & Lucas, 1997; Krusell & Smith, 1997; Case et al., 2005, 2011), income level (Constantinides & Duffie, 1996; Krueger et al., 2016; Lucas, 1994; Ahn et al., 2018), wealth level (Bricker et al., 2021; Case et al., 2005, 2011; Krueger et al., 2016), and demographics (Campbell, 2006; Berton et al., 2018; Calvet et al., 2021; Das et al., 2020) have been identified and analyzed.

However, Jappelli and Pistaferri (2014) and Krueger et al. (2016) pointed out the limitations of existing studies that investigate household heterogeneity separately in each dimension (e.g., income and wealth). That is, considering only a few variables is not enough to gain a detailed understanding of complex household heterogeneity. Krueger et al. (2016, p. 67) further noted that additional dimensions of household heterogeneity should be introduced to “better capture the joint distribution of wealth, income, and expenditure we observe in the data.”

Figure 1 illustrates the average asset allocation of Korean households with respect to their wealth percentile from 2017 to 2020. Panel a of Fig. 1 shows the results for the entire dataset. The proportions of deposit savings and long-term rental deposits decrease almost monotonically as households become wealthier. The proportion of residential housing increases up to middle-class households and then drops sharply, while the proportion of nonresidential real estate increases instead. It is clear that the relationship between households’ asset allocation and wealth level is nonlinear. Panels b and c of Fig. 1 show the results for the bottom 20% and the top 20% of households by income, respectively. The relationship does not become simpler even when we look at subgroups partitioned by income level. This shows why conventional approaches have difficulty investigating heterogeneity in household finance, which involves nonlinear relationships that are entangled in a multi-dimensional space.

Fig. 1 Average portfolio weights of Korean households in 2017–2020

Consequently, in this study, we perform a comprehensive analysis of household finance heterogeneity in various dimensions using an advanced clustering method. Since household wealth, income, and consumption are known to have skewed marginal distributions (Campbell, 2006), it would be difficult to fit such data using standard probability distributions. We believe that clustering methods can be helpful because these methods are specifically designed to find representative clusters based on the multidimensional joint distribution of data points. Because household financial data would have a complex dependence structure between a large number of items, deep learning-based and manifold learning-based dimension reduction techniques are employed along with conventional clustering methods. Many studies have shown that deep learning and manifold learning methods are helpful for handling complex nonlinear dependent structures (Bengio et al., 2012).

While we use only the financial aspects (as reported in the balance sheets) of households to identify the representative clusters, the clusters are analyzed in terms of multiple criteria. That is, the clusters are analyzed in terms of household demographics (age, gender, education, family size, and employment) as well as households’ balance sheets (income, expenditure, assets, and debt). Our analysis shows that financial heterogeneity is closely related to demographic heterogeneity.

Korean household finance and living condition survey data were used in this study. Annual data from 2017 to 2019 consist of balance sheets (including income, expenditure, assets, and debt) and demographics (including age, gender, education, and employment status of householder, family size) of around 20,000 households each year. The Republic of Korea has shown remarkable growth since the Korean War in the 1950s to become the world’s 10th largest economy in 2020 according to the World Bank (2021). However, such rapid growth has been accompanied by various social issues. Currently, Korea has the world’s lowest fertility rate (OECD, 2021) and severe inter- and intra-generational wealth inequality compared to other developed countries (OECD, 2018). Hence, Korea offers a good example of a clearer heterogeneity in household finance.

The remainder of this paper is organized as follows. Section 2 introduces the clustering method employed in this study, Sect. 3 discusses the data and experimental setting, and Sect. 4 presents findings from the numerical experiments. Finally, Sect. 5 concludes the study.

2 Deep clustering

Consider a household \(i\)’s balance sheet data \({{\varvec{x}}}^{{\varvec{i}}}\in {\mathbb{R}}^{d}\), which consists of asset variables \({{\varvec{x}}}_{{\varvec{A}}}^{{\varvec{i}}}\in {\mathbb{R}}^{{d}_{A}}\), debt variables \({{\varvec{x}}}_{{\varvec{D}}}^{{\varvec{i}}}\in {\mathbb{R}}^{{d}_{D}}\), and expenditure variables \({{\varvec{x}}}_{{\varvec{E}}}^{{\varvec{i}}}\in {\mathbb{R}}^{{d}_{E}}\). Hence, \({{\varvec{x}}}^{{\varvec{i}}}=[{{\varvec{x}}}_{{\varvec{A}}}^{{\varvec{i}}};{{\varvec{x}}}_{{\varvec{D}}}^{{\varvec{i}}};{{\varvec{x}}}_{{\varvec{E}}}^{{\varvec{i}}}]\in {\mathbb{R}}^{d}\) with \(d={d}_{A}+{d}_{D}+{d}_{E}\). Our purpose is to apply clustering algorithms to the balance sheet data \(\mathbf{X}\in {\mathbb{R}}^{N\times d}\) of \(N\) households to find \(k\) clusters such that each cluster contains households that are similar in terms of their financial status.

Clustering is one of the most popular unsupervised machine learning tasks: it groups data points by their similarity without using any label information (i.e., it uses unlabeled data). The objective of clustering is to maximize intra-group similarity and minimize inter-group similarity. Clustering methods have been shown to be useful in various domains, such as image analysis, medicine, and finance (Ahmad & Khan, 2019).

Well-known clustering methods such as k-means, DBSCAN, hierarchical clustering, and the Gaussian mixture model (GMM) have been successfully employed in various fields. However, such conventional methods are not well suited to handling high-dimensional data.

Recently, many studies have shown that deep learning methods can enhance clustering methods so that they handle high-dimensional datasets effectively, and several so-called “deep clustering” methods have been proposed. Ghasedi Dizaji et al. (2017) and Caron et al. (2018) proposed clustering neural network models that extract important features from high-dimensional image data using a convolutional neural network and an autoencoder, respectively, which are learned jointly with conventional clustering methods (e.g., k-means). Guo et al. (2017) and Mukherjee et al. (2019) proposed clustering methods based on latent modeling using an autoencoder and generative adversarial networks, respectively, and tested them on tabular data and image data. However, there was no significant performance improvement compared to conventional clustering methods.

McConville et al. (2021) proposed a simple deep clustering framework called N2D that directly uses conventional clustering algorithms (e.g., GMM) in a latent space found by deep learning and manifold learning techniques (see Fig. 2). Unlike other deep clustering methods mentioned earlier, the clustering step is separated from the dimension-reduction step. The N2D approach has been shown to achieve similar or even better performance compared to other deep clustering methods as well as conventional approaches. The key trick was to combine deep learning and manifold learning techniques to reduce the dimensionality of data by capturing complex nonlinear dependency structures. Therefore, we follow the N2D framework proposed by McConville et al. (2021) to find representative clusters of household balance sheet data.

Fig. 2 N2D framework for deep clustering by McConville et al. (2021) (Created by the authors)

For a household \(i\)’s balance sheet data \({x}^{i}\), we first find its \(k\)-dimensional embedding \({z}^{i}\in {\mathbb{R}}^{k}\) via an autoencoder, and we further reduce it into a two-dimensional embedding \({z^{\prime i}}\in {\mathbb{R}}^{2}\) via UMAP. Then, clustering is performed with the two-dimensional embeddings \({z^{\prime i}}\) of all households (i.e., for all \(i\)). The following subsections will explain in detail the two steps: (1) dimension reduction (autoencoder and UMAP) and (2) clustering (GMM).

2.1 Dimension reduction for clustering

Dimension reduction techniques are incorporated in most deep clustering methods to effectively handle high-dimensional data. The key to dimension reduction is to find low-dimensional representations (or features) lying in a high-dimensional space, which is often called latent modeling or feature extraction (Bengio et al., 2013). While other deep clustering algorithms jointly optimize latent modeling (or feature extraction) and clustering iteratively, McConville et al. (2021) separate the two tasks to simplify the overall process. However, to retain (or even improve) the performance of other deep clustering methods, they further divided the dimension reduction part into two. First, an autoencoder is used to find mid-dimensional embeddings to capture the global features. Second, manifold learning techniques, such as t-SNE and UMAP, are used to find low-dimensional manifolds to better capture local features. McConville et al. (2021) argue that such an approach can find more clusterable embeddings because both global and local features are crucial for clustering tasks.

2.1.1 Autoencoder

An autoencoder (AE) is a dimension reduction technique based on artificial neural networks and is often referred to as a deep learning version of principal component analysis (PCA), one of the most popular dimension reduction methods. While PCA is only able to capture linear dependence structures within data, AE is known to capture complex non-linear dependencies well (Bengio et al., 2013; Burges, 2010; Xie et al., 2016).

The AE is composed of an encoder function \({f}_{ENC}: {\mathbb{R}}^{d}\to {\mathbb{R}}^{k}\) and a decoder function \({f}_{DEC}: {\mathbb{R}}^{k}\to {\mathbb{R}}^{d}\). The encoder function \({f}_{ENC}\) is a mapping from high-dimensional data \(\mathbf{X}\in {\mathbb{R}}^{N\times d}\) with \(N\) samples and \(d\) features to corresponding embeddings \(\mathbf{Z}\in {\mathbb{R}}^{N\times k}\) in a \(k\)-dimensional latent space with \(k\ll d\). The decoder function \({f}_{DEC}\) is a mapping from the embeddings \(\mathbf{Z}\in {\mathbb{R}}^{N\times k}\) back to the original data \(\mathbf{X}\in {\mathbb{R}}^{N\times d}\). AE is trained to minimize the following reconstruction loss:

\(\ell_{\text{AE}}= \parallel \mathbf{X}-{f}_{DEC}\left({f}_{ENC}\left(\mathbf{X}\right)\right){\parallel }_{{\varvec{F}}}^{2},\)

where \({\Vert \bullet \Vert }_{\mathrm{F}}^{2}\) is the squared Frobenius norm. While various neural network structures (e.g., convolutional neural networks and recurrent neural networks) can be used for both encoder and decoder functions, we use fully connected layers with a rectified linear unit (ReLU) for both functions. More details regarding the architectural choices are discussed in Appendix A.
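The reconstruction loss can be illustrated with a minimal NumPy sketch. The linear encoder and decoder below (`f_enc`, `f_dec`, built from a hypothetical orthonormal matrix `W`) are stand-ins chosen only to exercise the formula; they are not the ReLU networks used in this study, and the data are synthetic.

```python
import numpy as np

def reconstruction_loss(X, f_enc, f_dec):
    """Squared Frobenius norm of X - f_dec(f_enc(X))."""
    residual = X - f_dec(f_enc(X))
    return float(np.sum(residual ** 2))

# Hypothetical linear encoder/decoder pair, used only to exercise the loss.
rng = np.random.default_rng(0)
N, d, k = 50, 6, 2
X = rng.normal(size=(N, d))                    # synthetic data, not survey data
W = np.linalg.qr(rng.normal(size=(d, k)))[0]   # d x k matrix with orthonormal columns

f_enc = lambda A: A @ W        # maps R^d -> R^k
f_dec = lambda Z: Z @ W.T      # maps R^k -> R^d

loss = reconstruction_loss(X, f_enc, f_dec)
```

A perfect reconstruction gives a loss of zero; any information discarded by the \(k\)-dimensional bottleneck shows up as a positive loss.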

Hence, the entire household balance sheet data \(\mathbf{X}\in {\mathbb{R}}^{N\times d}\) is reduced to \(\mathbf{Z}\in {\mathbb{R}}^{N\times k}\). Note that the embeddings are not separately found for asset, debt, and expenditure variables. Instead, each embedding incorporates all balance sheet variables so that the final clustering is done based on the entire balance sheet, not just subsets.

However, embeddings \(\mathbf{Z}\in {\mathbb{R}}^{N\times k}\) found by AE do not necessarily preserve distances between data points \(\mathbf{X}\in {\mathbb{R}}^{N\times d}\) in the original space, because AE is trained only to minimize the reconstruction loss. For any two data points \({\mathbf{x}}_{{\varvec{i}}},{\mathbf{x}}_{{\varvec{j}}}\in {\mathbb{R}}^{d}\) and their autoencoded embeddings \({\mathbf{z}}_{{\varvec{i}}}={f}_{ENC}({\mathbf{x}}_{{\varvec{i}}}),{\mathbf{z}}_{{\varvec{j}}}={f}_{ENC}({\mathbf{x}}_{{\varvec{j}}})\in {\mathbb{R}}^{k}\), there is no guaranteed relationship between \(d({\mathbf{x}}_{{\varvec{i}}},{\mathbf{x}}_{{\varvec{j}}})\) and \(d({\mathbf{z}}_{{\varvec{i}}},{\mathbf{z}}_{{\varvec{j}}})\), where \(d(\cdot ,\cdot )\) is an arbitrary distance measure. Autoencoded embeddings are therefore not necessarily appropriate for clustering, because clustering relies on distances to identify similar data points.

Therefore, in the N2D framework, clustering is not performed on the auto-encoded embeddings. Instead, AE is used to find intermediate embeddings with its dimension \(k\) not being too small, so that the distances in the original space are not fully lost. McConville et al. (2021) recommend using the dimension of autoencoded embeddings \(k\) as the desired number of clusters.

2.1.2 UMAP: uniform manifold approximation and projection

The manifold assumption in machine learning is that the observed data lie approximately on a low-dimensional manifold, and manifold learning refers to non-linear dimension reduction techniques based on such an assumption. Because a manifold is a topological concept in which every point is locally connected, manifold learning techniques are known to capture local features well. Many different models have been proposed, including isometric mapping (Tenenbaum et al., 2000), locally linear embedding (Roweis & Saul, 2000), modified locally linear embedding (Zhang & Wang, 2007), Hessian eigenmapping (Donoho & Grimes, 2003), and t-distributed stochastic neighbor embedding (Van der Maaten & Hinton, 2008). While the last one (t-SNE) showed promising performance for complex datasets, it is often criticized for being too locally focused and for lacking scalability (McConville et al., 2021).

In this regard, uniform manifold approximation and projection (UMAP) was recently proposed by McInnes et al. (2018), which is known to preserve the global structure as well as the local structure of data through a cross-entropy cost function. Let us consider a dimension-reduction task from \(\mathbf{Z}\in {\mathbb{R}}^{N\times k}\) to \(\mathbf{Z}\mathbf{^{\prime}}\in {\mathbb{R}}^{N\times 2}\). In other words, we wish to reduce a \(k\)-dimensional dataset to two-dimensional embeddings. UMAP consists of three steps. The first step is graph construction, in which a graph representation of \(\mathbf{Z}\in {\mathbb{R}}^{N\times k}\) is built. The relationship between two data points \({z}_{i},{z}_{j}\in {\mathbb{R}}^{k}\) is represented as a probability

$$ {\text{p}}_{{i|j}} = \exp \left( { - \frac{{d\left( {z_{i} ,z_{j} } \right) - \rho _{i} }}{{\sigma _{i} }}} \right), $$

where \(d\) is a distance measure, \({\rho }_{i}\) is a local connectivity parameter, and \({\sigma }_{i}\) is a normalization factor. Here, \({\rho }_{i}\) is set as the average distance from \({z}_{i}\) to its \(u\) nearest neighbors, where \(u\) controls the balance between local and global structure. If \(u\) is low, the UMAP model focuses on more detailed local structure, while a high \(u\) ignores small details in order to represent the global structure. Then, the global probability between the two data points is computed as:

$${\mathrm{p}}_{ij}=\left({\mathrm{p}}_{i\mid j}+ {\mathrm{p}}_{j\mid i}\right)-{\mathrm{p}}_{i\mid j}{\mathrm{p}}_{j\mid i}$$
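The graph-construction step can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the reference UMAP implementation: \({\rho }_{i}\) is computed as described in the text, \({\sigma }_{i}\) is fixed to 1 as a placeholder, the distance is Euclidean, and the exponent is clipped at zero so that weights stay in \([0,1]\). The function name `umap_graph_probabilities` is hypothetical.

```python
import numpy as np

def umap_graph_probabilities(Z, rho, sigma):
    """Directed weights exp(-(d(z_i, z_j) - rho_i) / sigma_i), then symmetrized."""
    diff = Z[:, None, :] - Z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))          # Euclidean d(z_i, z_j)
    # Clip at zero so that weights never exceed 1 for points closer than rho_i
    # (an assumption of this sketch; the formula in the text has no explicit clip).
    directed = np.exp(-np.maximum(dist - rho[:, None], 0.0) / sigma[:, None])
    np.fill_diagonal(directed, 0.0)                   # no self-edges
    # p_ij = p_i|j + p_j|i - p_i|j * p_j|i
    return directed + directed.T - directed * directed.T

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 3))                           # six synthetic embeddings
u = 2                                                 # number of nearest neighbors
dist = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
rho = np.sort(dist, axis=1)[:, 1:u + 1].mean(axis=1)  # mean distance to u nearest neighbors
sigma = np.ones(len(Z))                               # placeholder normalization factors
P = umap_graph_probabilities(Z, rho, sigma)
```

The symmetrization is a probabilistic "or": an edge between \(i\) and \(j\) is present if it exists in at least one of the two directed views.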

Second, graph embedding. For the corresponding embeddings \({z}_{i}^{\prime},{z}_{j}^{\prime}\in {\mathbb{R}}^{2}\), the pairwise probability \({q}_{ij}\) is computed as:

$${q}_{ij}=\frac{1}{1+a\parallel {z}_{i}^{\prime}-{z}_{j}^{\prime}{\parallel }^{2b}} ,$$

where \(a\) and \(b\) are hyper-parameters, and \(\parallel \bullet \parallel \) is a norm function. Finally, cross-entropy is used as a loss function to find the optimal mapping \({f}_{UMAP}: {\mathbb{R}}^{k}\to {\mathbb{R}}^{2}\) from \(\mathbf{Z}\in {\mathbb{R}}^{N\times k}\) to \(\mathbf{Z}\mathbf{^{\prime}}\in {\mathbb{R}}^{N\times 2}\) from a fuzzy topological point of view. The cross-entropy loss function can be expressed as follows:

$$\ell_{\mathrm{UMAP}}= \sum_{i \ne j}\left[{\mathrm{p}}_{ij}\mathrm{log}\left(\frac{{\mathrm{p}}_{ij}}{{q}_{ij}}\right)+\left(1-{\mathrm{p}}_{ij}\right)\mathrm{log}\left(\frac{1-{\mathrm{p}}_{ij}}{1-{q}_{ij}}\right)\right]$$
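The low-dimensional similarity \({q}_{ij}\) and the cross-entropy loss can be evaluated with a short NumPy sketch. The values `a=1.0` and `b=1.0` are placeholders, the `eps` clipping is added only for numerical safety, and the function names are hypothetical; the sketch evaluates the loss for given embeddings and does not perform the optimization.

```python
import numpy as np

def low_dim_similarity(Z2, a=1.0, b=1.0):
    """q_ij = 1 / (1 + a * ||z'_i - z'_j||^(2b))."""
    sq = ((Z2[:, None, :] - Z2[None, :, :]) ** 2).sum(axis=-1)
    return 1.0 / (1.0 + a * sq ** b)

def umap_cross_entropy(P, Q, eps=1e-12):
    """Sum over i != j of p log(p/q) + (1-p) log((1-p)/(1-q))."""
    off_diag = ~np.eye(len(P), dtype=bool)
    p = np.clip(P[off_diag], eps, 1.0 - eps)
    q = np.clip(Q[off_diag], eps, 1.0 - eps)
    return float(np.sum(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))))

rng = np.random.default_rng(0)
Z2 = rng.normal(size=(5, 2))          # synthetic two-dimensional embeddings
Q = low_dim_similarity(Z2)
P = np.full((5, 5), 0.5)              # hypothetical high-dimensional probabilities
loss = umap_cross_entropy(P, Q)
```

Each summand is a Kullback-Leibler divergence between two Bernoulli distributions, so the loss is non-negative and vanishes only when \({q}_{ij}\) matches \({\mathrm{p}}_{ij}\) for every pair.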

McConville et al. (2021) tested various manifold learning techniques (Isomap, t-SNE, and UMAP) for their N2D framework, and N2D with UMAP demonstrated the best performance. Therefore, we use UMAP to find the final two-dimensional embeddings \(\mathbf{Z}\mathbf{^{\prime}}\in {\mathbb{R}}^{N\times 2}\) from the intermediate embeddings \(\mathbf{Z}\in {\mathbb{R}}^{N\times k}\) found by AE.

2.2 Clustering via Gaussian mixture model

Finally, the Gaussian mixture model (GMM) is employed to find clusters for the two-dimensional embeddings \(\mathbf{Z}\mathbf{^{\prime}}\in {\mathbb{R}}^{N\times 2}\) found by AE and UMAP. Consider a mixture of \(k\) Gaussian distributions

$$\mathrm{p}\left(\mathrm{z}\right)= \sum_{i=1}^{k}{\pi }_{i}\mathcal{N}(\mathrm{z}\mid {\mu }_{i},{\Sigma }_{i}),$$

where \( {\mathcal{N}}\left( {{\text{z}}\mid \mu _{i} ,\Sigma _{i} } \right) \) is a multi-dimensional Gaussian distribution with mean \( {\upmu }_{\rm i} \) and covariance matrix \({\Sigma }_{i}\), and \({\uppi }_{\mathrm{i}}\) is a weight coefficient with \({\uppi }_{\mathrm{i}}\ge 0\) and \({\sum }_{i=1}^{k}{\pi }_{i}=1\). GMM finds the optimal parameters of the above Gaussian mixture that are most likely for the given data. That is, a log-likelihood given parameter \({\uptheta }_{\mathrm{GMM}}\)

$$ \ell _{{{\text{GMM}}}} = \ln {\text{p}}\left( {{\mathbf{Z}}'\mid{{\uptheta }}_{{{\text{GMM}}}} } \right) = ~\mathop \sum \limits_{{j = 1}}^{N} \ln \left\{ {\mathop \sum \limits_{{i = 1}}^{k} \pi _{i} {\mathcal{N}}\left( {{\text{z}}_{j} '|\mu _{i} ,\Sigma _{i} } \right)} \right\} $$

is maximized with respect to \({\uptheta }_{\mathrm{GMM}}\). The resulting \(k\) Gaussian components are then taken as the optimal clusters.
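The log-likelihood above can be evaluated for given parameters with a minimal NumPy sketch. It only computes \(\ell_{\text{GMM}}\); the maximization itself (typically done with the EM algorithm) is not shown. The function name and the synthetic two-blob data are illustrative assumptions.

```python
import numpy as np

def gmm_log_likelihood(Z2, weights, means, covs):
    """Sum over points of ln sum_i pi_i N(z | mu_i, Sigma_i)."""
    N, dim = Z2.shape
    log_terms = np.empty((N, len(weights)))
    for i, (pi, mu, cov) in enumerate(zip(weights, means, covs)):
        diff = Z2 - mu
        # Squared Mahalanobis distance of every point to component i
        maha = np.einsum("nd,de,ne->n", diff, np.linalg.inv(cov), diff)
        log_norm = -0.5 * (dim * np.log(2 * np.pi) + np.linalg.slogdet(cov)[1])
        log_terms[:, i] = np.log(pi) + log_norm - 0.5 * maha
    # log-sum-exp over components for numerical stability
    m = log_terms.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(log_terms - m).sum(axis=1))))

rng = np.random.default_rng(0)
Z2 = np.vstack([rng.normal([0, 0], 0.5, size=(20, 2)),
                rng.normal([4, 4], 0.5, size=(20, 2))])   # two synthetic blobs
total = gmm_log_likelihood(
    Z2,
    weights=[0.5, 0.5],
    means=[np.array([0.0, 0.0]), np.array([4.0, 4.0])],
    covs=[0.25 * np.eye(2)] * 2,
)
```

A hard cluster label for each household can then be obtained by assigning it to the component with the largest weighted density.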

Conventional clustering methods are, of course, sensitive to their initial points. Since both k-means and GMM start from random initial points and are not guaranteed to converge to a global optimum, such algorithms are commonly run multiple times with different random initial points, and the best result is selected. We use the same approach to obtain more robust results.
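The multiple-restart strategy can be sketched as follows. For brevity the sketch uses a plain NumPy k-means (Lloyd's algorithm) rather than a GMM, and selects the run with the lowest within-cluster sum of squares; for a GMM the analogous criterion would be the highest log-likelihood. The function names, `n_init=10`, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def kmeans_once(X, k, rng, n_iter=50):
    """One Lloyd's-algorithm run from a random initialization."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # keep the old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    labels = d2.argmin(axis=1)
    inertia = float(d2[np.arange(len(X)), labels].sum())
    return labels, centers, inertia

def best_of_restarts(X, k, n_init=10, seed=0):
    """Run several random starts and keep the one with the lowest inertia."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(30, 2))
               for c in ([0, 0], [5, 5], [0, 5])])   # three synthetic blobs
labels, centers, inertia = best_of_restarts(X, k=3)
```

Because the first of the restarts is itself a single run, the selected solution can never be worse than what one run from the same seed would give.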

3 Data and model

In this section, we describe our data and models (Sect. 3.1) and perform a simple analysis to determine the appropriate number of clusters (Sect. 3.2). We also compare the clustering performance of the deep clustering method with that of other popular clustering algorithms (Sect. 3.3).

3.1 Data and experimental settings

The Korean household finances and living conditions survey data were used in this study. This survey is conducted annually by the National Statistical Office of Korea, the Bank of Korea, and the Financial Supervisory Service of Korea to provide a solid ground for policymakers to account for households’ financial soundness in terms of their level of income, assets, liabilities, and expenditures. Since the survey instrument was revised in 2017, we used data from 2017 onward. The main analysis was done using survey data from 2017 to 2020. The total number of respondent households during that period was 54,920, and the number of unique households, counting households that participated in several years only once, was 26,907. In addition, the 2021 survey data of 18,187 households was used for out-of-sample analysis in Sect. 4.4. Note that the annual survey is conducted around every March. Hence, for example, the survey in 2020 is mostly based on households’ financial activities in 2019. This means that our main analysis in Sects. 4.1–4.3 reflects the period prior to COVID-19, and the out-of-sample analysis in Sect. 4.4 shows the changes after COVID-19.

For clustering purposes, we chose six asset-related variables, 12 debt-related variables, and seven expenditure-related variables for household balance sheets. The asset variables include deposit savings, other savings, long-term rental deposits, residential housing, non-residential real estate, and other real assets. The debt variables include:

  • Mortgage loans: Residential housing, nonresidential real estate, long-term rental deposit, living expenses, business, refinance

  • Credit loans: Residential housing, nonresidential real estate, long-term rental deposit, living expenses, business, refinance

Expenditure variables include foodstuffs, housing, education, medical expenses, transportation, communication, and other consumption expenditures. Other real assets include automobiles and valuables, and other consumption expenditures include spending on cultural life, clothing, alcohol, and tobacco. All variables are winsorized at the upper and lower 1% to handle extremely skewed distributions. In addition, they are divided by the total consumption expenditure to mitigate scale differences between households.
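The preprocessing described above can be sketched in NumPy. The sketch assumes that winsorization is applied per variable before the division, that the seven expenditure variables occupy the last seven columns, and that every household has positive total expenditure; the function names and the synthetic log-normal data are illustrative only.

```python
import numpy as np

def winsorize(X, lower=0.01, upper=0.99):
    """Clip each column to its [lower, upper] empirical quantiles."""
    lo = np.quantile(X, lower, axis=0)
    hi = np.quantile(X, upper, axis=0)
    return np.clip(X, lo, hi)

def scale_by_total_expenditure(X, expenditure_cols):
    """Divide every variable by the household's total consumption expenditure."""
    total = X[:, expenditure_cols].sum(axis=1, keepdims=True)
    return X / total

rng = np.random.default_rng(0)
X = rng.lognormal(mean=5.0, sigma=1.5, size=(1000, 25))  # synthetic skewed data, 6 + 12 + 7 columns
expenditure_cols = np.arange(18, 25)                     # assumed position of the 7 expenditure variables

X_w = winsorize(X)
X_scaled = scale_by_total_expenditure(X_w, expenditure_cols)
```

After scaling, the expenditure columns of each household sum to one, while asset and debt columns are expressed as multiples of total consumption expenditure.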

For demographic analysis, householder information (age, gender, education level, and employment status), number of household members, residential type, and location were used.

The specifications of the models are as follows: Both the encoder and decoder of the AE are fully connected multi-layer perceptrons (MLPs) with three hidden layers. All layers have rectified linear unit (ReLU) activation. The encoder MLP dimensions are d-100-100-200-k, where d is the dimensionality of the clustering variables and k is the number of clusters. That is, it receives a d-dimensional input, which goes through three hidden layers with 100, 100, and 200 neurons, respectively, and produces a k-dimensional output. The decoder has the mirrored structure (k-200-100-100-d). Both networks are optimized using the Adam optimizer (Kingma & Ba, 2014). In Appendix A, we provide more detailed parameter settings and check the robustness of model outputs with respect to parameter choices. We confirm that our analysis is not affected by small changes in parameters.
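The layer dimensions can be made concrete with a forward-pass sketch in NumPy. It assumes \(d=25\) (the 6 + 12 + 7 clustering variables) and uses \(k=8\) as an illustrative value; the He-style random initialization and the helper names `init_mlp` and `forward` are assumptions, and training with Adam is not shown.

```python
import numpy as np

def init_mlp(sizes, rng):
    """One (weights, bias) pair per consecutive pair of layer sizes."""
    return [(rng.normal(scale=np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, H):
    """Apply each affine layer followed by ReLU, as stated in the text."""
    for W, b in params:
        H = np.maximum(H @ W + b, 0.0)
    return H

d, k = 25, 8                      # 6 + 12 + 7 clustering variables; illustrative k
enc_sizes = [d, 100, 100, 200, k]
dec_sizes = enc_sizes[::-1]       # mirrored decoder: k-200-100-100-d

rng = np.random.default_rng(0)
encoder = init_mlp(enc_sizes, rng)
decoder = init_mlp(dec_sizes, rng)

X = rng.random(size=(4, d))       # four synthetic households
Z = forward(encoder, X)           # embeddings, shape (4, k)
X_hat = forward(decoder, Z)       # reconstructions, shape (4, d)
```

Each network consists of four affine maps: three into the hidden layers and one into the output layer.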

3.2 Number of clusters

We varied the number of clusters k from 4 to 12 to see how households are clustered as the number of clusters increases, and to determine the appropriate number of clusters for a more detailed analysis. Figure 3 shows the optimal clusters of household balance sheets obtained with different k, which is a hyperparameter that we should set before running the model. That is, circles with black color (label 4) represent optimal clusters when we set \(k=4\). Similarly, circles with light grey color (label 12) represent optimal clusters when we set \(k=12\). The location of a circle represents the median of total assets and total debt of households within each cluster, and the size of a circle indicates the average of the total expenditure of households within each cluster. Due to large scale differences in the total asset values of households, the asset axis is represented on a log-scale. The unit of all variables is KRW 10,000 (≈ USD 10).

Fig. 3 Optimal household clusters with different numbers of clusters

It can be seen from Fig. 3 that clusters are created along similar increasing curves of debt with respect to log(asset). In addition, there are a couple of clusters with very small total expenditures, while other clusters tend to have similar total spending. Hence, we would expect that there are more dimensions to household heterogeneity than total assets, total debt, and total expenditure. That is, we should investigate more detailed compositions of assets, debt, and expenditure to further understand household heterogeneity.

Next, we determined the most appropriate k (number of clusters) for further analysis. There are households that appear in multiple years of the survey (17,887 out of 26,907). If they are assigned to different clusters in different years, it would result from either a significant change in the household balance sheet or unstable clustering. Thus, we keep track of these households and calculate the average of absolute changes in asset, debt, and expenditure variables. If the changes in the variables are small, it would mean that clustering is unstable. On the other hand, if the changes in the variables are large, it would imply that a household’s cluster would change mostly when they had a significant change in their financial status, and thus, clustering would be stable. While the asset, debt, and expenditure variables are all used together for clustering, we calculated the changes in variables separately so that we may see more detailed aspects of the clustering results.
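The stability check described above can be sketched as follows. The sketch assumes that changes are measured between consecutive survey appearances of the same household and that the result is reported per variable; the function name `cluster_change_summary` and the toy data are hypothetical.

```python
import numpy as np

def cluster_change_summary(household_id, year, label, X):
    """Mean absolute change in X over consecutive observations of a household
    whose cluster label changed, plus the number of such changes."""
    order = np.lexsort((year, household_id))       # sort by household, then year
    hid, lab, Xs = household_id[order], label[order], X[order]
    same_household = hid[1:] == hid[:-1]
    changed = same_household & (lab[1:] != lab[:-1])
    n_changes = int(changed.sum())
    if n_changes == 0:
        return np.zeros(X.shape[1]), 0
    abs_diff = np.abs(Xs[1:] - Xs[:-1])[changed]
    return abs_diff.mean(axis=0), n_changes

# Toy example: household 1 changes cluster, household 2 does not,
# household 3 appears only once.
household_id = np.array([2, 1, 3, 1, 2])
year = np.array([2018, 2017, 2017, 2018, 2017])
label = np.array([2, 0, 0, 1, 2])
X = np.array([[9.0, 9.0], [1.0, 0.0], [0.0, 0.0], [3.0, 1.0], [5.0, 5.0]])

mean_abs_change, n_changes = cluster_change_summary(household_id, year, label, X)
```

Averaging the returned vector over the asset, debt, and expenditure columns separately would give one summary value per variable group and per choice of \(k\).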

Table 1 shows, for each number of clusters, the average absolute changes in the asset, debt, and expenditure variables of households whose cluster changed, together with the total count of cluster changes. The average absolute differences indicate that cluster changes are caused by significant changes in debt and asset variables, while the effect of expenditure variables is relatively small. In terms of the number of clusters, note that the average absolute change of variables naturally decreases as the number of clusters increases, because with more clusters a smaller change is enough to move a household to another cluster. For a similar reason, the total count of cluster changes tends to increase as the number of clusters increases. In this regard, the case of \(k=8\) (represented in bold) is particularly interesting because all variable changes are larger than in the case of \(k=7\), while the increase in the total count relative to \(k=7\) is marginal. That is, we achieve relatively robust clusters when \(k=8\). Thus, we fixed \(k=8\) for further analyses.

Table 1 Variable deviations and total count of cluster label changes

3.3 Model comparisons

Although we explained in Sect. 2 why we use a deep clustering method, this choice should be backed up by performance comparisons. We compare our method (deep clustering via N2D) with four popular clustering methods: k-means, DBSCAN, hierarchical clustering (Ward’s method), and hierarchical DBSCAN. k-means is arguably the best-known clustering algorithm; it separates data samples into k groups by choosing centroids that minimize the within-cluster variances. DBSCAN (Ester et al., 1996) stands for density-based spatial clustering of applications with noise, which sums up its characteristics: it gathers points that are close to each other while leaving out outliers. Hierarchical clustering methods aim to find clusters by building a hierarchy of clusters. There are various approaches depending on the linkage criterion that determines the dissimilarity between clusters. We use Ward’s method, which can be seen as the hierarchical version of the k-means method. Lastly, hierarchical DBSCAN is a hierarchical version of DBSCAN proposed by Schubert et al. (2017).

Clustering is a typical unsupervised learning task, and thus, evaluating the performance of clustering algorithms is not as straightforward as it is for regression and classification models. The two most popular metrics are the Silhouette index and the Davies-Bouldin index. The Silhouette index, proposed by Rousseeuw (1987), measures how similar each data point is to its own cluster compared to other clusters. The Davies-Bouldin index (Davies & Bouldin, 1979) represents the average similarity between each cluster and its closest cluster. Hence, good clusters would have a high Silhouette index but a low Davies-Bouldin index.
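Both indexes can be computed directly from their standard definitions, as in the following NumPy sketch. It assumes Euclidean distances and at least two clusters, and it is meant for small illustrative datasets because it builds the full pairwise distance matrix; the function names and the synthetic two-blob data are assumptions.

```python
import numpy as np

def silhouette_index(X, labels):
    """Mean of (b - a) / max(a, b) over points, with a the mean distance to the
    point's own cluster and b the smallest mean distance to another cluster."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue                      # singleton clusters score 0 by convention
        a = dist[i, own].sum() / (n_own - 1)
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

def davies_bouldin_index(X, labels):
    """Mean over clusters of the worst-case ratio (s_i + s_j) / d(c_i, c_j)."""
    clusters = np.unique(labels)
    centers = np.array([X[labels == c].mean(axis=0) for c in clusters])
    scatter = np.array([np.linalg.norm(X[labels == c] - centers[m], axis=1).mean()
                        for m, c in enumerate(clusters)])
    worst = []
    for m in range(len(clusters)):
        ratios = [(scatter[m] + scatter[n]) / np.linalg.norm(centers[m] - centers[n])
                  for n in range(len(clusters)) if n != m]
        worst.append(max(ratios))
    return float(np.mean(worst))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.1, size=(30, 2)),
               rng.normal([10, 10], 0.1, size=(30, 2))])   # two well-separated blobs
good_labels = np.repeat([0, 1], 30)
random_labels = rng.integers(0, 2, size=60)

sil_good = silhouette_index(X, good_labels)
dbi_good = davies_bouldin_index(X, good_labels)
sil_random = silhouette_index(X, random_labels)
```

On this synthetic example, the correct partition yields a Silhouette index close to 1 and a Davies-Bouldin index close to 0, whereas random labels give a much lower Silhouette index.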

Table 2 summarizes the clustering performance of the different methods. For each method, the number of clusters \(k\) is chosen to maximize the Silhouette index and minimize the Davies-Bouldin index. On our dataset, the deep clustering method clearly outperforms the other popular clustering methods in terms of both indexes.

Table 2 Clustering performance comparison

4 Analysis of household heterogeneity via deep clustering

In this section, we find representative clusters of household balance sheets via deep clustering and analyze them. The optimal clusters are analyzed in detail in terms of financial (Sect. 4.1) and demographic (Sect. 4.2) perspectives. The inter-cluster mobility is discussed in Sect. 4.3. Finally, we present an out-of-sample analysis in Sect. 4.4.

4.1 Household heterogeneity in balance sheets

As we have seen from Fig. 1, the relationship between asset allocation and wealth level is highly nonlinear, and dividing households by income level was not helpful in simplifying the relationship. We present the same results for all five income quintiles, four age groups (under 40, 40 to 50, 50 to 60, above 60), and 20 income-age groups in Appendix B. While households are often classified in terms of their income or age, these results indicate that such groupings do little to reduce within-group heterogeneity.

Figure 4 represents the average portfolio weights with respect to wealth level of different household clusters found by the deep clustering method. We can clearly see that the relationship has become much simpler. In particular, asset allocations seem almost constant within Clusters 1 to 4. This shows what deep learning can do in analyzing complex household finance data. Deep learning has been exceptional in capturing nonlinear dependencies within data. Hence, it was able to group households accounting for complex relationships, and thus, groups have much higher within-group homogeneity.

Fig. 4 Average portfolio weights of different household clusters

We now investigate the financial heterogeneity of households in more detail. Table 3 summarizes the financial variables of eight clusters with units of KRW 10,000 (≈ USD 10). Clusters are sorted with respect to the average total asset value in descending order. Hence, Cluster 1 is the wealthiest group and Cluster 8 is the poorest group. The numbers in parentheses are proportions of each variable within the asset, debt, and expenditure categories. Values with relatively large proportions compared to other clusters are highlighted in bold.

Table 3 Average values (proportions) of asset, debt, expenditure variables of different household clusters

For assets shown in Panel A of Table 3, there is a clear tendency that the wealthy-half (Clusters 1, 2, 3, 4) hold more than 50% of their assets in real estate (residential and non-residential), while non-wealthy-half (Clusters 5, 6, 7, 8) hold more than 50% of their assets in financial assets (deposit savings, other savings, long-term rental deposits). Among the wealthy-half, the wealthiest two (Clusters 1 and 2) have a significant amount of nonresidential real estate, but the other two (Clusters 3 and 4) do not. As for the non-wealthy-half, Cluster 5 has more than 60% of their assets in long-term rental deposits, whereas Clusters 6 and 7 are more concentrated in savings and other real assets. Cluster 8 seems to be the poorest group with a very small amount of assets. Overall, the major asset classes of different household groups are summarized in Fig. 5.

Fig. 5 Major asset classes of households with different levels of wealth

It is widely known that Korean household wealth is excessively concentrated in real estate compared to other developed countries (Fredriksen, 2012; Park, 2020). However, our analysis reveals that this statement is true only for the wealthy-half groups. This shows the importance of analyzing heterogeneous household groups, because aggregated values would be naturally biased towards wealthy groups that possess large amounts of assets.

A similar tendency can be found for the debt variables (Panel B of Table 3). More than 30% of loans in Clusters 1 and 2 are for nonresidential real estate, and more than 60% of loans in Clusters 3 and 4 are for residential housing. Approximately 70% of the loans for Cluster 5 are for long-term rental deposits, and more than 70% of loans in Clusters 7 and 8 are for living expenses, business funds, and refinances. Hence, the purpose of loans changes from urgent financial liquidity to purchasing real estate as the household wealth level increases. In addition, more than 70% of the loans in Clusters 1 to 5 are mortgage loans, but the other clusters have more credit loans. Clusters 7 and 8 rarely have mortgage loans (\(\le 10\%\)), probably due to a lack of underlying assets. Figure 6 summarizes the findings.

Fig. 6 Major loan types of households with different levels of wealth

Panel C of Table 3 shows the expenditure variables for different household clusters. While the overall proportions are not as heterogeneous as those of the asset and debt variables, a few interesting observations can be made. First, the poorest two clusters (7 and 8) spend a relatively large share on housing (\(\ge 20\%\)) compared to the others. Second, Clusters 2 to 5 tend to spend more on education (\(\ge 10\%\)). Third, more than 10% of the expenditure of the poorest group (Cluster 8) is for medical purposes. Fourth, wealthy groups (Clusters 1 to 5) tend to spend slightly more (around 25%) on cultural life, clothing, alcohol, tobacco, etc. (categorized as ‘others’).

A rough decision tree is shown in Fig. 7 to summarize the multidimensional heterogeneity of household finance. We can see that asset and debt variables are more crucial for representing the heterogeneity of households than expenditure variables. For more detailed classifications, asset compositions (especially real estate) are important for wealthy groups, whereas the purpose and type of debt are important for non-wealthy groups.

Fig. 7 Decision tree for household clusters

4.1.1 Clustering quality and variable importance

Here we further check the quality of the clustering results and the importance of each variable by investigating how variables are distributed within and between groups. Recall that the objective of clustering is to find clusters with high within-cluster similarities and low between-cluster similarities. We believe that the Gini coefficient and its decomposition can be useful in this regard. The Gini coefficient is a popular measure of inequality in the distribution of income or wealth, and some researchers have decomposed the Gini coefficient to investigate the causes of disparity in income distributions with different populations and educational backgrounds (Deaton & Paxson, 1994, 1997). There are two popular approaches to decomposition: Pyatt (1976) and Shorrocks (1982). While the former directly compares the Gini coefficients of different groups, the latter linearly decomposes the Gini coefficient into within-group, between-group, and overlapping inequalities. We use the latter approach because it quantifies within-group and between-group inequalities that are exactly in line with the clustering objective.

Let us consider \(\mathrm{k}\) groups (or clusters) and a variable \(\mathrm{Y}\). \({\mathrm{Y}}_{\mathrm{i}}\) represents the variable within group \(\mathrm{i}\) with mean \({\upmu }_{\mathrm{i}}\) and cumulative distribution \({\mathrm{F}}_{\mathrm{i}}\left({\mathrm{Y}}_{\mathrm{i}}\right)\). Then, the overall population \({\mathrm{Y}}_{\mathrm{u}}={\mathrm{Y}}_{1}\cup {\mathrm{Y}}_{2}\ldots \cup {\mathrm{Y}}_{\mathrm{k}}\) is the union of all groups, with mean \({\upmu }_{\mathrm{u}}\) and cumulative distribution \({\mathrm{F}}_{\mathrm{u}}\left({\mathrm{Y}}_{\mathrm{u}}\right)={\sum }_{\mathrm{i}}{\mathrm{p}}_{\mathrm{i}}{\mathrm{F}}_{\mathrm{i}}\left({\mathrm{Y}}_{\mathrm{i}}\right)\), where \({\mathrm{p}}_{\mathrm{i}}\) is the population share of group \(\mathrm{i}\). The Gini coefficient of the overall population is defined as

$$\mathrm{G}=\frac{2\mathrm{cov}\left({\mathrm{Y}}_{\mathrm{u}}, {\mathrm{F}}_{\mathrm{u}}\left({\mathrm{Y}}_{\mathrm{u}}\right)\right)}{{\upmu }_{\mathrm{u}}},$$

and Mookherjee and Shorrocks (1982) decomposed it into

$$\mathrm{G}={\mathrm{G}}_{\mathrm{W}}+{\mathrm{G}}_{\mathrm{B}}+{\mathrm{G}}_{\mathrm{O}}.$$

Here, within-group inequality \({\mathrm{G}}_{\mathrm{W}}\) is defined as \({\mathrm{G}}_{\mathrm{W}}= {\sum }_{\mathrm{i}}{\mathrm{p}}_{\mathrm{i}}{\mathrm{q}}_{\mathrm{i}}{\mathrm{G}}_{\mathrm{i}}\), where \({\mathrm{q}}_{\mathrm{i}}\) is the variable share of group \(\mathrm{i}\) (its share of the total of \(\mathrm{Y}\)) and \({\mathrm{G}}_{\mathrm{i}}= \frac{2\mathrm{cov}\left({\mathrm{Y}}_{\mathrm{i}}, {\mathrm{F}}_{\mathrm{i}}\left({\mathrm{Y}}_{\mathrm{i}}\right)\right)}{{\upmu }_{\mathrm{i}}}\) is the Gini coefficient within group \(\mathrm{i}\). Between-group inequality \({\mathrm{G}}_{\mathrm{B}}\) is defined as \({\mathrm{G}}_{\mathrm{B}}={\sum }_{\mathrm{i}}{\sum }_{\mathrm{j}}\frac{{\mathrm{p}}_{\mathrm{i}}{\mathrm{p}}_{\mathrm{j}}\mid {\upmu }_{\mathrm{i}}-{\upmu }_{\mathrm{j}}\mid }{2{\upmu }_{\mathrm{u}}}\), and overlapping inequality \({\mathrm{G}}_{\mathrm{O}}\) is the remainder.
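The decomposition above can be illustrated with a minimal sketch in plain Python. It is not the code used in this study. The function names `gini` and `decompose_gini` are hypothetical, the empirical distribution function is approximated by rank divided by sample size, and the variable is assumed to have a positive mean.

```python
def gini(values):
    """Gini coefficient as 2 * cov(Y, F(Y)) / mean(Y).

    Assumption: F is approximated by the empirical rank i/n of the sorted
    values, and the mean of the values is positive.
    """
    ys = sorted(values)
    n = len(ys)
    mu = sum(ys) / n
    ranks = [(i + 1) / n for i in range(n)]
    mean_rank = sum(ranks) / n
    # Population covariance between the values and their empirical CDF.
    cov = sum((y - mu) * (f - mean_rank) for y, f in zip(ys, ranks)) / n
    return 2 * cov / mu


def decompose_gini(groups):
    """Split the overall Gini into within, between, and overlap parts.

    `groups` is a list of lists, one list of observations per cluster.
    """
    pooled = [y for g in groups for y in g]
    n = len(pooled)
    mu_u = sum(pooled) / n
    p = [len(g) / n for g in groups]            # population shares p_i
    mu = [sum(g) / len(g) for g in groups]      # group means mu_i
    q = [p_i * mu_i / mu_u for p_i, mu_i in zip(p, mu)]  # variable shares q_i

    g_total = gini(pooled)
    g_within = sum(p_i * q_i * gini(g) for p_i, q_i, g in zip(p, q, groups))
    g_between = sum(
        p[i] * p[j] * abs(mu[i] - mu[j])
        for i in range(len(groups))
        for j in range(len(groups))
    ) / (2 * mu_u)
    g_overlap = g_total - g_within - g_between  # remainder term
    return {"total": g_total, "within": g_within,
            "between": g_between, "overlap": g_overlap}


# Toy data (not from the survey): two groups whose ranges do not overlap.
parts = decompose_gini([[1, 2], [3, 4]])
```

In this toy case the two groups do not overlap, so the remainder term vanishes and the overall coefficient splits entirely into within-group and between-group parts.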

We calculated within-group inequality (\({\mathrm{G}}_{\mathrm{W}}\)), between-group inequality (\({\mathrm{G}}_{\mathrm{B}}\)), and overlapping inequality (\({\mathrm{G}}_{\mathrm{O}}\)) for all cluster variables; the proportions of the three components are shown in Fig. 8. Three observations stand out. First, all within-group inequalities are less than 20% and are mostly much smaller than the between-group inequalities. This indicates that the quality of clustering is good, because all variables tend to have high within-group similarities and low between-group similarities. Second, there are some variables for which between-group inequality accounts for more than 60% of the Gini coefficient: long-term rental deposits, residential housing, and nonresidential real estate among the assets; mortgage loans for nonresidential housing, long-term rental deposits, and business funds; and credit loans for long-term rental deposits. All these variables were shown to be very important in interpreting the clustering results. Third, for all expenditure variables, overlapping inequality accounts for more than 60% of the Gini coefficient. That is, these variables do not contribute much to clustering, which is consistent with our previous discussion.

Fig. 8 Decomposition of Gini coefficients into between-group, within-group, and overlapping inequalities

4.2 Sociodemographic characteristics of clusters

Although the optimal clusters were found from a financial perspective only, there is no doubt that household finance is closely related to sociodemographics, such as the householder’s age, education level, and so on. Therefore, we conducted a logistic regression for each cluster to investigate its sociodemographic characteristics. For a given cluster, the dependent variable \({y}_{i}\) indicates whether a household belongs to that cluster. The independent variables are presented in Table 4. (For detailed statistics with percentages, see Appendix A.1.)

Table 4 List of independent variables for logistic regression

Table 5 summarizes the results of the logistic regressions. Statistically significant regression coefficients and the corresponding odds ratios are shown. Notable variables are highlighted with shadows: positive (italic) and negative (bold) relationships. Most variables are statistically significant, with both positive and negative coefficients. This shows a strong relationship between the multidimensional heterogeneity of household finance and sociodemographics.

Table 5 Logistic regression results of clusters with respect to socio-demographic variables
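As a reading aid for Table 5, the sketch below shows how a one-vs-rest dependent variable can be built for a single cluster and how a logistic regression coefficient maps to an odds ratio. The function names and the example inputs are hypothetical and are not taken from the study's data or code.

```python
import math


def one_vs_rest_labels(cluster_ids, target_cluster):
    """Return 1 for households in the target cluster and 0 otherwise."""
    return [1 if c == target_cluster else 0 for c in cluster_ids]


def odds_ratio(coefficient):
    """Odds ratio implied by a logistic regression coefficient.

    A one-unit increase in the regressor multiplies the odds of belonging
    to the cluster by exp(coefficient).
    """
    return math.exp(coefficient)


# Hypothetical cluster assignments for four households.
labels_for_cluster_3 = one_vs_rest_labels([1, 3, 3, 8], target_cluster=3)

# Hypothetical coefficients: positive values give odds ratios above 1,
# negative values give odds ratios below 1.
ratio_positive = odds_ratio(0.5)
ratio_negative = odds_ratio(-0.5)
```

An odds ratio above 1 corresponds to a positive relationship between the variable and cluster membership, and an odds ratio below 1 to a negative one.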

Clusters 1 and 2, the wealthiest two groups, were shown to consist of older households compared to the others. Both tend to have a highly educated male householder, live in their own houses, and have a high income. While Cluster 2 households live outside the Seoul metropolitan area and are employed, Cluster 1 households live in or near Seoul and have a small number of family members with mixed employment status. Cluster 2 is also more likely to have more family members.

Cluster 3 is quite unique in that it is one of the wealthiest groups with their own houses in metropolitan areas, but its households are likely to be unemployed (including freelance or helping family business) and have low income. They can also be characterized as highly educated young households. Perhaps this peculiar cluster represents young households who inherited houses early.

Clusters 4 and 5 can be regarded as two middle-class groups. Cluster 4 can be characterized as living outside metropolitan areas, large families, homeowners, and low education, while Cluster 5 can be characterized as living in metropolitan areas, small families, long-term rental housing, high education, and high income. These reflect typical rural–urban differences in family size (Key, 1961), income (Lipton, 1977), education (van Maarseveen, 2020), and housing affordability (Lee & Jun, 2018).

Clusters 6 and 7 both consist of poor households who are relatively young, under temporary housing (mostly monthly rent), with no higher education. However, the former is likely to be employed, whereas the latter is not.

Cluster 8 clearly represents the most vulnerable households, with very small families (high probability of living alone), low education, low income, no employment, and temporary housing, regardless of their age. This cluster had the smallest number of constituents.

Let us summarize the findings with respect to variables.

Age

Old clusters are likely to be wealthy, which is natural in the sense that households would accumulate wealth during working ages. However, there were also two strong exceptions (Clusters 3 and 8)

Education

The three most wealthy clusters are highly educated while the three most poor clusters are poorly educated. For the two middle class groups, one in metropolitan area (Cluster 5) is highly educated and the other outside metropolitan area (Cluster 4) is poorly educated. Also, Cluster 3 is highly educated but has low income

Income

The two most wealthy clusters have high income, and the three most poor clusters have low income. However, three clusters in the middle exhibit mixed results (especially Cluster 3)

Number of family members

While there is no clear linear relationship between family size and wealth, it is interesting to note that the wealthiest and the poorest clusters are highly likely to consist of small families

Area of residence

No overall trend is found, but typical rural–urban differences can be seen between the two middle class groups (Clusters 4 and 5)

Previous studies have focused on finding a linear relationship between two variables. For example, researchers have reported the existence of a linear relationship between income and wealth (Lee et al., 2020), between education and wealth (Brückner & Gradstein, 2013; Boshara et al., 2015), and the absence of a linear relationship between income and wealth (Mueller, Buchholz, & Blossfeld, 2011). However, our results show that even if there is an overall trend between two variables, there is always a strong exception, making the relationship non-linear. Hence, considering multiple variables is crucial for understanding the complex relationship between financial and sociodemographic variables.

4.3 Mobility between clusters

We analyze the mobility between clusters by tracking the cluster movements of households who participated in the survey multiple times. From 2017 to 2020, clusters of 12,272 households out of 52,920 total respondent households changed. Figure 9 shows the transition matrix of the clusters. The number in cell \((i, j)\) represents the probability of a household moving from cluster \(i\) to cluster \(j\) in the next survey.

Fig. 9 Transition matrix between household clusters
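A transition matrix of this kind can be estimated by counting cluster changes between consecutive survey waves and normalizing each row. The sketch below is illustrative only: the function name, the input layout (one list of cluster labels per household, ordered by survey year), and the treatment of rows with no observations are assumptions, not details taken from the study.

```python
def transition_matrix(histories, n_clusters):
    """Estimate row-normalized transition probabilities between clusters.

    `histories` maps a household identifier to its cluster labels
    (1..n_clusters) in consecutive survey waves. Assumption: households
    with a single observation contribute nothing, and a cluster with no
    observed departures gets a row of zeros.
    """
    counts = [[0] * n_clusters for _ in range(n_clusters)]
    for labels in histories.values():
        # Pair each wave with the next one for the same household.
        for current, following in zip(labels, labels[1:]):
            counts[current - 1][following - 1] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy example with three clusters and three hypothetical households.
toy_histories = {"h1": [1, 1, 2], "h2": [2, 2, 2], "h3": [1, 2, 3]}
toy_matrix = transition_matrix(toy_histories, n_clusters=3)
```

Entry `toy_matrix[i - 1][j - 1]` then plays the role of cell \((i, j)\) in the figure, that is, the estimated probability of moving from cluster \(i\) to cluster \(j\) in the next survey.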

Some block-diagonal shapes can be observed. Two large blocks can be seen within Clusters 1, 2, 3, 4 and within Clusters 5, 6, 7, 8. That is, not many households move from the wealthy groups to the non-wealthy groups and vice versa, which indicates that there are two separate classes that are not reachable from each other within a few years. It is interesting to note that Clusters 1, 2, 3, 4 mostly own their houses and Clusters 5, 6, 7, 8 do not.

In addition, there are small blocks between most adjacent clusters (e.g., between Clusters 1–2, 3–4, 5–6, 6–7, 7–8). However, the link between Clusters 2 and 3 is weak. Recall that the key difference between the two clusters was that Cluster 2 had a substantial amount of nonresidential real estate, but Cluster 3 had almost none. Therefore, real estate is not only a crucial factor for classifying households, but also a huge hurdle for households who wish to climb up the class ladder.

4.4 Out-of-sample analysis after COVID-19

Lastly, we show the out-of-sample results using the survey data from 2021, which are mostly based on the financial activities of households in 2020. Hence, they allow us to see the changes after the COVID-19 pandemic.

Figure 10 shows the variable importance weights of between-group inequalities of Gini coefficients of asset and debt (mortgage and credit loans) variables. We can see from the figure that after COVID-19, between-group inequalities decreased in asset variables but increased in debt variables. This means that the changes in debt after COVID-19 were quite different across clusters, while the changes in assets were not. Hence, the impact of COVID-19 was quite asymmetric for household debt, but relatively even for household assets. This makes sense because COVID-19 caused immediate damage to the income of households who have their own business (e.g., restaurants or coffee shops), and many of them had to obtain additional loans.

Fig. 10 Variable importance weights of between-group inequalities of Gini coefficients in different years

Next, we investigate the change in mobility between clusters. Figure 11 compares the transition matrices before and after COVID-19. It is clear that mobility increased after COVID-19, because every diagonal term became smaller (i.e., the probabilities of staying in the same cluster were reduced).

Fig. 11 Transition matrix between household clusters before (left) and after (right) COVID-19

$$\frac{\mathrm{Average \,\, probability \,\, of \,\, moving \,\, into \,\, a \,\, poorer \,\, cluster}}{\mathrm{Average \,\, probability \,\, of \,\, moving \,\, into \,\, a \,\, wealthier \,\, cluster}}$$

However, if we look into the above ratio (see Footnote 5) for the transition matrix, there is a significant difference between the wealthy half (Clusters 1, 2, 3, 4) and the non-wealthy half (Clusters 5, 6, 7, 8). Before COVID-19, the average ratio was 0.621 for the wealthy half and 0.653 for the non-wealthy half. After COVID-19, however, the values became 0.639 and 1.151, respectively. While the direction of cluster mobility for the wealthy half was hardly affected by COVID-19, the probability of the non-wealthy half moving into poorer clusters became much higher. Hence, COVID-19 had a much greater adverse impact on the non-wealthy half than on the wealthy half.
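One possible reading of the above ratio is sketched below. For each cluster in a group, the probabilities of moving to any poorer cluster and to any wealthier cluster are summed, each sum is averaged over the group, and the two averages are divided. This is an assumption made for illustration. The exact definition is given in Footnote 5 and may differ, and the function name and toy matrix are hypothetical.

```python
def mobility_ratio(matrix, clusters):
    """Ratio of average downward to average upward transition probability.

    `matrix` is a row-normalized transition matrix whose rows and columns
    are ordered from the wealthiest cluster (index 0) to the poorest.
    `clusters` lists the zero-based indices of the clusters to average over.
    Assumption: downward means any column to the right of the diagonal and
    upward means any column to the left of it.
    """
    downward = [sum(matrix[i][i + 1:]) for i in clusters]
    upward = [sum(matrix[i][:i]) for i in clusters]
    mean_down = sum(downward) / len(downward)
    mean_up = sum(upward) / len(upward)
    # Undefined if no upward movement is observed in the group.
    return mean_down / mean_up


# Toy three-cluster matrix, not taken from Fig. 11.
toy = [
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
]
toy_ratio = mobility_ratio(toy, clusters=[0, 1, 2])
```

Under this reading, a ratio below 1 means that upward moves are more likely than downward moves on average, and a ratio above 1 means the opposite.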

5 Conclusion

This study has shown how advanced clustering techniques, especially those that involve deep learning models, can be useful for understanding the complex heterogeneity of household finance. By utilizing the deep learning-based clustering framework N2D proposed by McConville et al. (2021), we were able to efficiently handle high-dimensional data to find representative clusters. More specifically, we could capture and decompose the nonlinear relationships in data through deep clustering, whereas conventional age or income groups could not.

The key implication of this study is that various variables should be considered together to analyze household heterogeneity. For example, real estate ownership was shown to be critical for the broad classification of wealthy and non-wealthy Korean households. Within the wealthy group, nonresidential real estate was shown to be the next key factor, while credit loans were found to be important explanatory variables for further classifications within the non-wealthy group. We used the Gini coefficients and their decompositions to further verify the quality of clustering and the relative importance of the variables. In addition, the multidimensional heterogeneity of households was shown to be closely related to sociodemographic variables, and the relationships were non-linear.

Since this study was conducted based on Korean household data, detailed findings should be interpreted carefully and might not be directly applicable to households in different countries. Hopefully, however, our study will encourage other researchers to search for more multidimensional aspects of household heterogeneity. Such findings are crucial for developing more accurate macroeconomic models with heterogeneous agents and deriving appropriate economic policies.