1 Introduction

Modern artificial intelligence (AI) and machine learning (ML) algorithms can classify high-dimensional data sets with great accuracy and efficiency. These are data-driven systems that, after a learning phase on training data, are evaluated on previously unseen test data in order to estimate their generalization ability. Particularly successful in supervised ML are artificial neural networks, which consist of simple processing units (neurons) organized into connected layers (e.g., “deep learning”). Within AI, these algorithms are referred to as subsymbolic ML systems [1, 2]. Subsymbolic systems perform a task such as the classification of a case within a so-called black-box model. Due to its inherent complexity, it is infeasible for humans to query the black box directly for an explanation or rationale of its decisions [3]. In medicine in particular, there is a pressing need for comprehensibility and transparency regarding algorithmic decisions about patients’ health or treatment options.

Two fundamentally different approaches exist for explaining the decision-making process of ML systems. The first approach generates explanations, often in the form of rules, by taking the classification produced separately by a black-box subsymbolic system and interpreting it post hoc with explainers such as LIME [4] or SHAP [5]. The advantage of these post-hoc explainers is that they can be combined with arbitrary subsymbolic ML systems. The disadvantage is that the explanations are mostly extracted locally and thus learned from examples, which may rest on non-causal correlations. Moreover, a large number of locally extracted explanations is often required to cover the structures in the data [6, 7].
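
As a minimal sketch of this first, post-hoc approach (assuming the shap and scikit-learn packages are available; the random forest and the breast-cancer dataset are stand-ins, not the models or data of the cited works), a black-box classifier is explained locally, case by case:

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# A black-box subsymbolic model trained on a stand-in dataset
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Post-hoc explainer: one local feature-attribution vector per explained case;
# many such local explanations are needed to cover the structures in the data
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:10])
```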

The second approach involves inherently interpretable ML procedures, which are referred to as symbolic because their algorithms make the decision processes comprehensible [8]. In this work, the second approach is referred to as data-driven explainable AI (XAI). Here, the notion of explanation can be specified more precisely by defining levels of explanation with designated procedures that are based on cognitive processes and, in theory, allow conversational interaction between different levels of XAIs and a human [9].

Data-driven XAI methods have the advantage that higher-level structures in the data can be recognized and explained. The disadvantage is that such methods, due to their internal metrics, may have a bias and require already classified data. Such classifications or annotations might not reflect the structures in the data.

The challenge of both approaches is that the generated explanations are typically not meaningful to the domain expert but rather tailored to the data scientist [10]. Optimally, meaningful explanations should be developed in the language of the domain expert and, in particular, be contrastive [11]. Further, these XAI approaches pose the risk of modeling artifacts when end-users’ understanding and control are diminished, i.e., when structures in the data are detected without human intervention [12].

The same challenges arise for one-dimensional data in the estimation and visualization of probability density functions (PDFs), for which many kernel density estimators and visualization methods are available. Still, all of these methods regularly produce erroneous representations of the distributions [13]. Only the integration of a human-in-the-loop (HIL) into the algorithms presented here at critical decision points leads to significantly better detection of distance-based structures in data (DSD) and, consequently, to understandable and relevant explanations [14].
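
As a minimal one-dimensional illustration (assuming NumPy and SciPy; the sample is synthetic), a conventional kernel density estimate with a default bandwidth can blur a sharp mode and assign probability mass across an abrupt distribution boundary:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Synthetic sample: a sharp mode near 1 plus a broad mode, bounded below by 0
sample = np.concatenate([rng.normal(1.0, 0.1, 500), rng.normal(3.0, 1.0, 500)])
sample = sample[sample > 0]

grid = np.linspace(-1.0, 7.0, 400)
density = gaussian_kde(sample)(grid)   # default (Scott) bandwidth

# The default bandwidth can blur the sharp mode and leak probability mass
# below the true boundary at zero -- the kind of error a HIL can spot visually
mass_below_zero = density[grid < 0].sum() * (grid[1] - grid[0])
print(f"estimated mass below the boundary: {mass_below_zero:.4f}")
```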

This habilitation thesis proposes an alternative solution path for creating a data-driven XAI: higher-level structures in the data are recognized by enabling a HIL to identify them at critical decision points. The author thus follows the reasoning of Holzinger in that the integration of a HIL’s knowledge, intuition and experience may be indispensable, and that the interaction of a HIL with the data can significantly improve the overall ML pipeline [15]. The HIL is an agent that interacts with algorithms, allowing the algorithms to optimize their learning behavior [16]. This perspective fundamentally integrates humans into the algorithmic loop to opportunistically and repeatedly use human knowledge and skills to improve the quality of ML systems [16,17,18].

2 Data-driven XAI Using Human-in-the-Loop

The new data-driven XAI approach is summarized in the following two steps. First, distance-based structures in data (DSD) are identified, with a HIL required at three critical decision points. Second, meaningful and relevant explanations are extracted by the XAI based on these structures, with a HIL involved at one further critical decision point.

The first step works as follows. First, the empirical probability density functions (PDFs) of the variables are identified; for this purpose, a basic method called the Mirrored Density Plot (MD Plot) is proposed [13]. The MD Plot can detect and display skewness, multimodality, abrupt distribution boundaries, and quantized data-generation processes. Multimodality in particular can be seen more sensitively by a HIL than it is detected by statistical tests [13]. By allowing the HIL to identify the distributions visually at the first decision point, appropriate transformations can be selected that normalize the variance of the variables prior to the distance choice, with the objective of avoiding undesirable weighting of variables with large ranges of values and variances.
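
The idea of this first decision point can be sketched as follows (a rough stand-in only: matplotlib’s standard violin plot replaces the published MD Plot, which relies on Pareto density estimation, and the variables are synthetic):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Synthetic stand-ins: a bimodal, a skewed, and a quantized variable
data = {
    "bimodal":   np.concatenate([rng.normal(-2, 0.5, 400), rng.normal(2, 0.5, 400)]),
    "skewed":    rng.lognormal(0.0, 0.8, 800),
    "quantized": rng.integers(0, 5, 800).astype(float),
}

# Mirrored densities side by side so a HIL can compare distribution shapes
fig, ax = plt.subplots()
ax.violinplot(list(data.values()), showextrema=False)
ax.set_xticks(range(1, len(data) + 1))
ax.set_xticklabels(list(data.keys()))
ax.set_ylabel("value")
plt.show()
```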

At the second critical decision point, the distance measure for the high-dimensional data is selected based on the theoretical concept of multimodality [19]; a practical example is given by four gene sets causally associated with pain and the chronification of pain, hearing loss, cancer, and drug addiction [20]. Multimodality in the distance distribution indicates separate modes of intrapartition and interpartition distances [19]. Third, an automatic structure-detection algorithm called projection-based clustering (PBC) is proposed [21]; its extension to an identification process through a HIL at the third critical decision point [22, 23] is described in Sect. 3.
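
A minimal sketch of the second decision point (assuming NumPy, SciPy, and matplotlib; the data are synthetic, and in the cited work the distance distributions would be inspected with the MD Plot rather than histograms):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
# Synthetic stand-in: two well-separated partitions in ten dimensions
X = np.vstack([rng.normal(0, 1, (150, 10)), rng.normal(5, 1, (150, 10))])

# The HIL inspects each candidate metric's distance distribution for
# multimodality: intra- and inter-partition distances form separate modes
fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for ax, metric in zip(axes, ["euclidean", "cityblock", "cosine"]):
    ax.hist(pdist(X, metric=metric), bins=60)
    ax.set_title(metric)
plt.show()
```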

In the second step, the identified structures in the data guide a supervised decision tree. The appropriate decision tree is chosen by the HIL according to Grice’s maxims [14, 24]. At this decision point, the HIL thereby implicitly selects the splitting criterion, which is based on a metric defined by the class information [25, 26], so that the decision tree represents the identified structures in the data. After the decision tree has been learned, the paths from the root to the leaves are defined as explanations.
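
The extraction of explanations from such a tree can be sketched as follows (assuming scikit-learn; the Iris data and a generic clustering are stand-ins for the real data and the HIL-verified structures, and the standard Gini criterion stands in for the class-information-based splitting metric of [25, 26]):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-ins: cluster labels on the Iris data replace the identified structures
iris = load_iris()
labels = AgglomerativeClustering(n_clusters=3).fit_predict(iris.data)

# A shallow tree trained on the identified structures; every root-to-leaf
# path is one human-readable explanation
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, labels)
print(export_text(tree, feature_names=iris.feature_names))
```

Each printed root-to-leaf path reads as a rule phrased in terms of the original variables.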

The data-driven XAI method was successfully applied to the problem of understanding water quality and its underlying processes. The data consist of multivariate time series of water quality measurements for different years. The approach leads to explanations that are both meaningful and relevant to the domain expert [14].

In addition, the data-driven XAI was applied to quarterly available fundamental data of companies traded on the German stock market [24, 27, 28]. In principle, company fundamentals can be used to select stocks with a high probability of rising or falling stock prices. Still, many of the commonly known rules and explanations used for such a stock-picking process are too vague to be applied in concrete cases. Using the explanations of the data-driven XAI, the future price trends of specific stocks can be predicted with a higher success rate than with comparable methods [24].

3 Human-in-the-Loop Projection-Based Clustering (HIL-PBC)

The practical problem of López-García et al. showed that combining principal component analysis with clustering is rather disadvantageous [29]. This finding served as the motivation to define and systematically investigate distance- and density-based structures in data. The focus lies on the automatic detection, with subsequent HIL recognition, of partitions of separable high-dimensional structures in data, which are often also referred to as “natural” and whose patterns in low-dimensional spaces are perceived by the human eye as separate objects (cf. [30]). Descriptions of and access to typical density- and distance-based structures [31] and algorithms [32] are provided. The subsequent work [33] highlights the pitfalls and challenges of automated cluster detection and cluster-analysis pipelines. This work shows that

  • Parameter optimization on datasets without distance-based structures,

  • Algorithm selection using unsupervised quality measures on biomedical data, and

  • Benchmarking detection algorithms with first-order statistics or box plots or a small number of repetitions of identical algorithm calls

are biased and often not recommended [33]. This motivates the investigation of HIL approaches that shift structure identification toward human pattern recognition rather than fully automatic algorithmic detection [22].

In order to integrate a HIL into the recognition of structures in data, the combination of non-linear projection methods and automatic detection of structures proved to be very useful [21]: Let d-dimensional data points \(i\in I\) lie in the input space \(I\subset {\mathbb{R}}^{d}\), and let \(o\in O\) be projected points in the output space \(O\subset {\mathbb{R}}^{b}\); then a mapping performed by a dimensionality reduction method \(proj: I\to O, i\mapsto o\) is called a projection onto a plane if \(b=2\). First, a non-linear projection (e.g., via NeRV [34], t-SNE [35], Pswarm [36]) is computed for the data points, and then the projection points are quantized into grid points \({g}_{i}\in {\mathbb{R}}^{2}\) within the finite two-dimensional space (plane). A grid point \({g}_{l}\) is connected to \({g}_{j}\) via an edge \(e\) if and only if there exists a point \(x\in {\mathbb{R}}^{2}\) that is equally close to \({g}_{l}\) and \({g}_{j}\) with respect to the metric \(D\) and closer to \({g}_{l}\) and \({g}_{j}\) than to any other grid point \({g}_{i}\), i.e., \(\exists x\in {\mathbb{R}}^{2}: D\left(x,{g}_{l}\right)=D\left(x,{g}_{j}\right) \wedge D\left(x,{g}_{l}\right)<D\left(x,{g}_{i}\right) \forall i\ne l,j\).
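
To make the construction concrete, the following sketch (assuming scikit-learn and SciPy; the quantization onto grid points is omitted, so the graph is built directly on the projected points, and the data are synthetic) computes a planar projection and the edges satisfying the neighborhood condition above:

```python
import numpy as np
from scipy.spatial import Delaunay
from sklearn.manifold import TSNE

rng = np.random.default_rng(3)
# Synthetic stand-in for the high-dimensional input space I
X = np.vstack([rng.normal(0, 1, (100, 8)), rng.normal(4, 1, (100, 8))])

# Non-linear projection onto the plane (t-SNE as one possible choice)
proj = TSNE(n_components=2, random_state=0).fit_transform(X)

# Delaunay triangulation of the projected points yields the edges satisfying
# the neighborhood condition stated above (for points in general position)
tri = Delaunay(proj)
edges = {tuple(sorted((int(s[a]), int(s[b]))))
         for s in tri.simplices for a in range(3) for b in range(a + 1, 3)}
print(len(edges), "graph edges")
```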

Let the graph \(\Gamma\) be a pair \((V,E)\) whose vertices \(v\in V\) are the grid points, let \(\left\{{e}_{1}(l,k), \dots , {e}_{n}(m,j)\right\}\subset E\) be a sequence of edges defining a walk from grid point \({g}_{l}\) to \({g}_{j}\), and let \(d(l,j)\) be the distance between the corresponding high-dimensional data points \(\{l,j\}\); then each edge is weighted with the high-dimensional distance of its endpoints, the weighted path is \({p}_{l,j} = d\left(l,k\right)\cdot {e}_{1}, \dots , d(m,j)\cdot {e}_{n}\), and its length \(\left|{p}_{l,j}\right|\in {P}_{l,j}\) is the sum of these weights. Paths are embedded in a two-dimensional toroidal plane, even if the projection is planar and not toroidal. In a toroidal plane, the four borders are cyclically connected, so that border effects of the projection process can be compensated. The shortest path between two grid points \({g}_{l},{g}_{j}\) in \((\Gamma , P)\) is then defined by \(\widetilde{d}\left({g}_{l},{g}_{j}\right)=\min \{{P}_{l,j}\}.\)
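
Continuing the previous sketch (and omitting the toroidal embedding and grid quantization), the graph distances \(\widetilde{d}\) can be obtained with SciPy’s shortest-path routine:

```python
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

# Continues the previous sketch: X, proj, and edges are reused.
# Each edge is weighted with the input-space distance of its endpoints.
d_high = cdist(X, X)
n = len(X)
W = lil_matrix((n, n))
for a, b in edges:
    W[a, b] = W[b, a] = d_high[a, b]

# Graph-geodesic distances d~ between all points (toroidal correction omitted)
d_tilde = shortest_path(W.tocsr(), method="D", directed=False)
```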

Let \({C}_{r} \subset I\) and \({C}_{q} \subset I\) be two partitions with \(r,q\in\left\{1,\dots,K\right\}\) and \({C}_{r} \cap {C}_{q} =\left\{\right\}\) for \(r \ne q\), let the data points in the partitions be \(l \in {C}_{q}\) and \(j\in {C}_{r}\) with cardinalities \(k = \left|{C}_{q}\right|\) and \(p = \left|{C}_{r}\right|\), and let \(\{{g}_{l}, {g}_{j}\}\) be the nearest neighbors of the two partitions \({C}_{r}\) and \({C}_{q}\); then, in each step, two partitions \(\{{C}_{r} , {C}_{q}\}\) are aggregated bottom-up, either with the minimum dispersion of \(\{{C}_{r} , {C}_{q} \}\):

$$S\left(C_r,C_q\right)=\frac{2k\cdot p}{k+p}\cdot\widetilde d(C_r,C_q)$$
(1)

or with the smallest distance between \(\{{C}_{r} , {C}_{q} \}\):

$$S\left(C_r,C_q\right)=\min_{l\in C_r,\,j\in C_q}\left[\widetilde d\left(g_l,g_j\right)\right]$$
(2)

The algorithm stops when the preset number of partitions is reached. Yet, PBC requires two parameters to be set and, for specific projection methods such as Pswarm, a distance measure to be selected.
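
The bottom-up aggregation can be sketched with standard hierarchical clustering on the graph distances \(\widetilde d\) (a simplification, assuming SciPy: single linkage matches Eq. (2), while Ward linkage on precomputed distances only approximates the dispersion criterion of Eq. (1)):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Continues the previous sketches: d_tilde holds the graph distances.
condensed = squareform(d_tilde, checks=False)

# Single linkage corresponds to Eq. (2); Ward linkage on precomputed
# distances only approximates the dispersion-based merge of Eq. (1)
labels_eq2 = fcluster(linkage(condensed, method="single"), t=2, criterion="maxclust")
labels_eq1 = fcluster(linkage(condensed, method="ward"), t=2, criterion="maxclust")
```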

Thus, HIL-PBC was proposed as an extension that integrates a HIL at critical decision points through an interactive topographic map for detecting separable structures [22, 23]. These separable high-dimensional structures in data are visualized with a topographic map [37] based on the U-matrix principle [38, 39] with a height-dependent color mapping (so-called regional colors or “hypsometric tints”), which can even be 3D printed [40].

The task of the HIL is to estimate, by inspecting the topographic map, whether separable high-dimensional structures tend to appear (cf. clusterability [41]). Moreover, the interaction of the HIL with the topographic map enables the estimation of the number of partitions in the data and the correct choice of the Boolean parameter selecting between Eqs. (1) and (2).
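
This visual inspection step can be roughly approximated as follows (a coarse stand-in only: the generalized U-matrix of [37,38,39] is replaced by a simple neighborhood-distance height, and the toroidal topographic map is rendered as a flat contour plot with a terrain colormap):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import griddata
from sklearn.neighbors import NearestNeighbors

# Continues the previous sketches: proj and d_high are reused.
# Simplified U-matrix-like heights: the mean input-space distance to the
# nearest projection neighbors (high ridges indicate partition borders)
_, idx = NearestNeighbors(n_neighbors=8).fit(proj).kneighbors(proj)
heights = np.array([d_high[i, idx[i, 1:]].mean() for i in range(len(proj))])

# Interpolate onto a regular grid and render with hypsometric tints
gx, gy = np.meshgrid(np.linspace(proj[:, 0].min(), proj[:, 0].max(), 200),
                     np.linspace(proj[:, 1].min(), proj[:, 1].max(), 200))
gz = griddata(proj, heights, (gx, gy), method="linear")
plt.contourf(gx, gy, gz, levels=20, cmap="terrain")
plt.scatter(proj[:, 0], proj[:, 1], s=5, c="black")
plt.show()
```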

4 Concluding Remarks

The recommendation is to integrate the human-in-the-loop (HIL) at critical decision points with the goal of identifying structures in high-dimensional data and exploiting them in data-driven XAIs [13, 22]. The HIL is necessary because the thesis shows that fully automatic ML pipelines are disadvantageous [33]. One exemplary critical decision point is the selection of the distance metric by recognizing multimodality in its distribution, even when statistical testing is not sensitive enough [19]. These distance-based structures in data (DSD) guide the decision tree, whose splitting criterion satisfies Grice’s maxims well [14, 24]. From the decision tree, the explanations are extracted without the need for previously labeled data, although HIL-PBC can verify that a given classification represents the structures in the data. The algorithms proposed in the two steps described in Sect. 2 are compared with a wide range of conventional algorithms in a variety of published works.

If no multimodal distance distribution can be found, or if the number of cases makes a distance computation impracticable, first results on biomedical data show the successful application of an XAI called algorithmic population description (ALPODS) [6, 7]. ALPODS identifies density-based partitions within the training data set that are relevant to the given classification. These partitions (“populations”) are generated recursively as a sequence of decisions for each of the variables in the data [6, 7].

One open issue worth considering in future work is the evaluation of the proposed data-driven XAI based on HIL-PBC with additional human experts.