
1 Introduction

We come across high dimensional datasets in various real-world problems related to finance, meteorology, computer vision, medical imaging, multimedia information processing, text mining, and many more [2, 15, 16, 18]. Handling high dimensional data is difficult as it demands more storage space, more processing time, and increased system complexity. For many applications, visualizing the structure in the data is important, which is difficult for high dimensional data. Dimensionality reduction (DR) addresses these problems by representing high dimensional data in a lower dimensional space while keeping the relevant information as intact as possible. DR schemes can be broadly categorized into two groups: feature selection (FS) and feature extraction (FE). FS chooses a small but important subset of the original features, whereas FE constructs a smaller number of new features from the original ones. FE methods are more appropriate when the physical meaning of the extracted features is not important and the principal focus is on preserving the local or global structure present in the original data.

We mathematically formulate FE as follows. Let the input data be represented by the set \(\mathcal {X}=\{\mathbf {x}_i \in \mathbb {R}^D: i \in \{1,2,\cdots ,n\}\}\). Here, n is the number of instances and D is the dimensionality of the data. The vector \(\mathbf {x}_i\) represents the ith instance of the data and \(x_{ik}\) represents the value of the kth feature associated with the ith instance. Then, \(\forall \mathbf {x}_i \in \mathcal {X}\), the corresponding lower dimensional representation is \(\mathbf {y}_i \in \mathbb {R}^{D^{\prime }}, D^{\prime } < D\). Let us represent the output dataset as \(\mathcal {Y}=\{\mathbf {y}_i \in \mathbb {R}^{D^{\prime }}: i \in \{1,2,\cdots ,n\}\}\). FE amounts to finding an implicit or explicit mapping, \(\phi : \mathbb {R}^{D} \rightarrow \mathbb {R}^{D^{\prime }}\), such that an objective \(\psi \) is minimized (or maximized). Usually, \(\psi \) is a function of \(\mathcal {X}\) and \(\mathcal {Y}\), where \(\mathbf {y}_i = \phi \left( \mathbf {x}_i \right) , \forall i\). The optimal mapping is then \(\hat{\phi } = \mathop {\mathrm {arg\,min}}\nolimits _{\phi } \left\{ \psi \left( \mathcal {X},\mathcal {Y}\right) \right\} \).

Sammon’s method [14] is a popular nonlinear FE technique that does not learn the mapping function \(\phi \) explicitly. It aims to preserve the geometric structure of the data by keeping the inter-point distances in the extracted lower dimensional space as close as possible to the inter-point distances in the original higher dimensional space. It minimizes the following error function E (i.e., \(\psi \)):

$$\begin{aligned} E=\frac{1}{ \sum _{i=1}^{n-1} \sum _{j=i+1}^{n} d^{*}_{ij}} \sum _{i=1}^{n-1} \sum _{j=i+1}^{n} \frac{\left( d^{*}_{ij}-d_{ij}\right) ^{2}}{d^{*}_{ij}} \end{aligned}$$
(1)

Here, \(d^{*}_{ij}\) and \(d_{ij}\) are the Euclidean distances between \(\mathbf {x}_i\) and \(\mathbf {x}_j\) in the input space, and between \(\mathbf {y}_i\) and \(\mathbf {y}_j\) in the extracted space, respectively. The error function, E, is called Sammon’s stress. To minimize E, Sammon’s method employs the steepest descent method. Note that, for a fixed pair of points i and j, Sammon’s stress weights \(\left( d^{*}_{ij}-d_{ij}\right) ^{2}\) by a factor \(\nicefrac {1}{d^{*}_{ij}}\). As a result, the method pays more attention to preserving short pairwise distances, and hence, preserves the local structure.
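For concreteness, Sammon’s stress in (1) can be evaluated with a few lines of NumPy. The following is an illustrative sketch (not the original implementation); the small eps guard against coincident input points is our addition.

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X, Y, eps=1e-12):
    """Sammon's stress (Eq. 1) between inputs X (n x D) and embedding Y (n x D')."""
    d_star = pdist(X)                    # pairwise Euclidean distances in the input space
    d = pdist(Y)                         # pairwise Euclidean distances in the output space
    d_star = np.maximum(d_star, eps)     # guard against coincident input points
    return np.sum((d_star - d) ** 2 / d_star) / np.sum(d_star)
```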

Sammon’s method is a cornerstone of structure preserving dimensionality reduction. However, it has a drawback: it does not learn the mapping function \(\phi (\cdot )\) explicitly. After learning a mapping from n points, when a new point arrives, Sammon’s method needs to repeat the whole optimization over the \((n+1)\) points. From (1) we observe that every iteration of the steepest descent search requires computation of \(\nicefrac {n(n\,-\,1)}{2}\) distances. When n is large, the computations involved may be prohibitive. Another disadvantage of Sammon’s method is that it optimizes a non-convex objective, so the method may get stuck at a local minimum of the error surface. To avoid such a situation and obtain a good solution, the method needs to try several initializations, which again adds to the computational overhead.

To reduce the computational cost associated with Sammon’s method, some attempts have been made to learn an explicit mapping with Sammon’s objective function. Jain and Mao [8, 9] used a multilayer feed-forward network to learn an explicit mapping based on Sammon’s projection. The numbers of input and output nodes of the neural network are set to D and \(D'\), respectively. A pair of inputs \(\mathbf {x}_i\) and \(\mathbf {x}_{j}\) is applied one after another, and the corresponding outputs \(\mathbf {y}_i\) and \(\mathbf {y}_{j}\) are used to define the error function E. A problem with neural network based methods is that they predict outputs even when the test points are far away from the training data. In [11], a supervised fuzzy rule-based model has been proposed to learn Sammon’s projection parametrically. This method first generates the Sammon’s projection of the input data; then, a fuzzy rule base is learnt using a supervised learning scheme, where the output obtained from Sammon’s projection is used as the target. An advantage of this method [11] over the neural network based methods [8, 9] is that it can detect test instances that are far away from the training data. However, it has the drawback that its computational cost comprises both the cost of obtaining Sammon’s projection and the cost of learning the fuzzy system.

The objective of this paper is to propose a method for structure preserving dimensionality reduction that avoids the additional computational overheads associated with the methods in [11, 14]. Although any stress function can be used, we use Sammon’s error here. Additionally, unlike [8, 9], the method should decline to produce outputs when the test points are far away from the training data. To achieve these objectives, we make the following contributions:

  1.

    We have proposed a fuzzy rule-based method that learns an explicit mapping function with the objective of Sammon’s method in an unsupervised fashion. It is unsupervised because no target values are provided for the inputs. The framework is quite general and other objectives can also be used.

  2.

    We have equipped Sammon’s method with predictability, and hence, have cut down the computational cost of projecting new points.

  3.

    The proposed system can indicate when its projection may not be reliable.

2 Proposed Method

The proposed method has the following major steps.

  1.

    Form an initial fuzzy rule base \(\mathcal {R}\).

  2.

    Project input dataset \(\mathcal {X}\) with \(\mathcal {R}\) to obtain corresponding output \(\mathcal {Y}\).

  3.

    Modify \(\mathcal {R}\) to reduce the Sammon’s stress defined by \(\mathcal {X}\) and \(\mathcal {Y}\).

  4.

    Repeat steps 2 and 3 until convergence.

  5.

    Use \(\mathcal {R}\) to project new points.

Next, we discuss the rule-based system used in this work and then describe the above steps in further detail.

2.1 The Rule-Based System

We explore both Mamdani-Assilian (MA) and Takagi-Sugeno (TS) type fuzzy rule-based systems. Let there be r rules in the rule base. Let, for the ith sample \(\mathbf {x}_i\), the mth output feature produced by the kth rule be \(y_{im}^{k}\). Here, \(m \in \{1,2,\cdots ,D'\}\); \(k \in \{1,2,\cdots ,r\}\); and \(i \in \{1,2,\cdots ,n\}\). For MA and TS models, the rules are denoted by \(\mathcal {R}_{km}^{MA}\) and \(\mathcal {R}_{km}^{TS}\), respectively. They are defined as follows.

$$\begin{aligned} \mathcal {R}_{km}^{MA}: \text {If } x_{i1} \text { is } A_{k1} \text { and } x_{i2} \text { is } A_{k2} \text { and } \cdots \text { and } x_{iD} \text { is } A_{kD}, \text { then } y_{im}^{k} \text { is } B_{km} \end{aligned}$$
(2)
$$\begin{aligned} \mathcal {R}_{km}^{TS}: \text {If } x_{i1} \text { is } A_{k1} \text { and } x_{i2} \text { is } A_{k2} \text { and } \cdots \text { and } x_{iD} \text { is } A_{kD}, \text { then } y_{im}^{k} = u_{km}(\mathbf {x}_{i}) \end{aligned}$$
(3)

Here, \(A_{kl}\) is the kth fuzzy set (linguistic value) on the lth input feature and \(B_{km}\) is the kth output fuzzy set on the mth output variable; \(u_{km}(): \mathbb {R}^{D} \rightarrow \mathbb {R}\) is a mapping function associated with the kth rule for the mth output feature.

Let \(\alpha ^{k}_{i}\) be the firing strength of the kth fuzzy rule for the ith input \(\mathbf {x}_i\). The firing strength is computed as the conjunction of the D atomic clauses in (2)/(3), realized by a T-norm [7] over the fuzzy sets \(A_{k1},A_{k2},\cdots ,A_{kD}\); we use the product operator here. For both the MA and TS models, the output \(\mathbf {y}_{i}=(y_{i1},y_{i2},\cdots , y_{iD'})^{T}\) is given by

$$\begin{aligned} y_{im}=\frac{\sum _{k=1}^{r} \alpha _{i}^{k} y_{im}^{k}}{\sum _{k=1}^{r} \alpha _{i}^{k}}. \end{aligned}$$
(4)

We shall explain later how, for the MA model, we get the value of \(y_{im}^{k}\) from \(B_{km}\). We note here that, for MA models, more involved methods of defuzzification can be used [6].
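Equation (4) is a firing-strength-weighted convex combination of the per-rule outputs. A minimal NumPy sketch of this combination (our own illustration; the array names are ours) is:

```python
import numpy as np

def combine_rule_outputs(alpha, y_rule):
    """Eq. (4): combine per-rule outputs by normalized firing strengths.
    alpha  : (n, r) firing strengths alpha_i^k
    y_rule : (n, r, D') per-rule outputs y_im^k
    returns: (n, D') projected points y_i
    """
    w = alpha / alpha.sum(axis=1, keepdims=True)   # normalize firing strengths per sample
    return np.einsum('nk,nkm->nm', w, y_rule)      # convex combination of the rule outputs
```

For the MA model with the height method of defuzzification discussed below, y_rule[i, k, :] is simply the vector of heights \(h_{km}\); for the first order TS model it is computed from the linear consequents in (7).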

2.2 Initial Rule Base Formation

When input-output data are provided, there are many ways of generating an initial rule base and refining it [4, 5, 10,11,12, 17]. Some of the popular methods use clustering [4, 11, 12] of the input-output data. The philosophy behind the use of clustering is that if there is a cluster in the input space with centroid \(\mathbf {v}_{k} \in \mathbb {R}^D\), then, for a smooth input-output relationship, the points in the output space are likely to form a cluster around \(\mathbf {u}_{k}\), the lower dimensional representation of \(\mathbf {v}_{k}\). Here, however, we do not have any output data, so we apply the clustering approach to the input data only. We use the fuzzy c-means (FCM) clustering algorithm [3] to find r clusters in the input data \(\mathcal {X}\). The cluster centroids are denoted by \(\mathbf {v}_{k} = ({v}_{k1},{v}_{k2},\cdots ,{v}_{kD})^{T} \in \mathbb {R}^{D}; k \in \{1, 2, \cdots , r \}\). The kth cluster is translated into the antecedent of the kth rule as follows.

$$\begin{aligned} \mathcal {R}_{k}: \text {If } x_{i1} \text { is close to } v_{k1} \text { and } x_{i2} \text { is close to } v_{k2} \text { and } \cdots \text { and } x_{iD} \text { is close to } v_{kD} \end{aligned}$$
(5)

Here, the clause “\(x_{il}\) is close to \(v_{kl}\)” is modeled using a Gaussian membership function centered at \(v_{kl}\) with spread \(\sigma _{kl}\). The membership value for \(x_{il}\) is

$$\begin{aligned} \mu _{ikl} = \exp \left\{ -\dfrac{(x_{il}-v_{kl})^2}{2\sigma _{kl}^2}\right\} . \end{aligned}$$
(6)

Here, \(v_{kl}\)s and \(\sigma _{kl}\)s are the antecedent parameters of the rules.
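As an illustration of (6) and of the product T-norm used to compute the firing strengths, the following sketch (ours, with hypothetical array names) evaluates the memberships and firing strengths for all samples and rules at once:

```python
import numpy as np

def firing_strengths(X, V, sigma):
    """Gaussian memberships (Eq. 6) combined with the product T-norm.
    X     : (n, D) input data
    V     : (r, D) cluster centroids v_kl (antecedent centres)
    sigma : (r, D) spreads sigma_kl
    returns alpha : (n, r) rule firing strengths
    """
    diff = X[:, None, :] - V[None, :, :]                  # (n, r, D) differences x_il - v_kl
    mu = np.exp(-0.5 * (diff / sigma[None, :, :]) ** 2)   # memberships mu_ikl
    return mu.prod(axis=2)                                # product over the D atomic clauses
```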

Although we define the rule antecedents in the same fashion for both the MA and TS models, we use different types of consequents, as discussed next.

For the MA model, the rule consequents are fuzzy sets in the output feature space. For each rule the output is a fuzzy set, and the union of these fuzzy sets constitutes the fuzzy output, which can be defuzzified using several methods [6]. For simplicity, here we use the height method of defuzzification [6]. Let \(B_{km}\) be either a symmetric triangular or a Gaussian membership function with center at \(h_{km}\). According to the height method, the output of the rule \(\mathcal {R}_{km}^{MA}\) is taken as \(h_{km}\). Note that this makes the MA model equivalent to the zeroth order TS model.

For the TS model, a consequent denotes an input-output mapping. For the first order TS model, this mapping is linear and is described as follows.

$$\begin{aligned} y_{im}^{k}=u_{km}(\mathbf {x}_{i})=c_{km0}+ \sum _{l=1}^{D} c_{kml}x_{il} \end{aligned}$$
(7)

Thus, the kth rule for TS model becomes

$$\begin{aligned} \mathcal {R}_{km}^{TS}: \text {If } x_{i1} \text { is close to } v_{k1} \text { and } \cdots \text { and } x_{iD} \text { is close to } v_{kD}, \text { then } y_{im}^{k}=c_{km0}+ \sum _{l=1}^{D} c_{kml}x_{il} \end{aligned}$$
(8)

For formation of the initial rule base, we randomly initialize the consequent parameters (\(h_{km}\)s for the MA model and \(c_{kml}\)s for the TS model).

2.3 Training the Rule Base

We estimate the rule base parameters that minimize the Sammon’s stress E in (1) using gradient descent. The learning rules are as follows.

$$\begin{aligned} v_{kl}(t+1)&=v_{kl}(t)-\eta _{v}\frac{\partial E}{\partial v_{kl}} \end{aligned}$$
(9)
$$\begin{aligned} \sigma _{kl}(t+1)&=\sigma _{kl}(t)-\eta _{\sigma }\frac{\partial E}{\partial \sigma _{kl}} \end{aligned}$$
(10)
$$\begin{aligned} h_{km}(t+1)&=h_{km}(t)-\eta _{h}\frac{\partial E}{\partial h_{km}} \end{aligned}$$
(11)
$$\begin{aligned} c_{kml}(t+1)&=c_{kml}(t)-\eta _{c}\frac{\partial E}{\partial c_{kml}} \end{aligned}$$
(12)

Here, t indicates the iteration number. \(\eta _{v}, \eta _{\sigma }, \eta _{h}\) and \(\eta _{c}\) are learning rates for \(v, \sigma , h, \text { and } c,\) respectively.
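The paper optimizes (1) with TensorFlow’s train.GradientDescentOptimizer. Purely as an illustration of the updates (9)–(11) for the MA (height defuzzified) variant, one gradient step could be written with TensorFlow’s GradientTape as follows; this is a sketch under our own naming conventions, not the authors’ implementation, and the small constant inside the square roots is our numerical guard.

```python
import tensorflow as tf

def train_step(X, V, S, H, eta_v=2.0, eta_s=2.0, eta_h=2.0):
    """One gradient-descent step on Sammon's stress for the MA (height defuzzified) model.
    X     : (n, D) inputs (constant)
    V, S  : (r, D) tf.Variable antecedent centres v_kl and spreads sigma_kl
    H     : (r, D') tf.Variable rule heights h_km
    eta_* : learning rates corresponding to eta_v, eta_sigma, eta_h in (9)-(11)
    """
    X = tf.convert_to_tensor(X, dtype=V.dtype)
    n = tf.shape(X)[0]
    with tf.GradientTape() as tape:
        diff = X[:, None, :] - V[None, :, :]
        alpha = tf.reduce_prod(tf.exp(-0.5 * (diff / S) ** 2), axis=2)   # firing strengths
        w = alpha / tf.reduce_sum(alpha, axis=1, keepdims=True)
        Y = tf.matmul(w, H)                                              # Eq. (4) with MA heights
        d_star = tf.sqrt(tf.reduce_sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2) + 1e-12)
        d = tf.sqrt(tf.reduce_sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2) + 1e-12)
        mask = tf.linalg.band_part(tf.ones((n, n), dtype=V.dtype), 0, -1) - tf.eye(n, dtype=V.dtype)
        E = tf.reduce_sum(mask * (d_star - d) ** 2 / d_star) / tf.reduce_sum(mask * d_star)  # Eq. (1)
    gV, gS, gH = tape.gradient(E, [V, S, H])
    V.assign_sub(eta_v * gV)   # Eq. (9)
    S.assign_sub(eta_s * gS)   # Eq. (10)
    H.assign_sub(eta_h * gH)   # Eq. (11)
    return E
```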

2.4 Rejection of Outputs

An interesting feature of the proposed method is that it can reject test points for which it may not produce a reliable output. As shown in (4), the mth feature of the ith output, \(y_{im}\), is computed as a convex combination of the rule outputs \(y^{k}_{im}\). Each rule output is weighted by the firing strength (\(\alpha _{i}^{k}\)) of the corresponding rule, so the rule that fires with the maximum strength primarily characterizes the output. The profile of the maximum rule firing strength on the training data can therefore be used to reject test points in a desired manner. For test points far away from the training data, no rule is expected to fire strongly; possibly, none of the rules is appropriate to model such points. Thus, we reject a test point when the maximum firing strength associated with it is less than a given threshold. We choose the value of this threshold based only on the training data.
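A minimal sketch of this rejection rule follows, assuming the firing_strengths helper sketched earlier; the use of a low percentile of the maximum training firing strengths as the threshold is our illustrative choice, not prescribed by the method.

```python
import numpy as np

def rejection_threshold(alpha_train, q=5):
    """Threshold chosen from training data: q-th percentile of the maximum firing strength."""
    return np.percentile(alpha_train.max(axis=1), q)

def accept(alpha_test, threshold):
    """Accept a test point only if at least one rule fires strongly enough for it."""
    return alpha_test.max(axis=1) >= threshold
```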

3 Experimentation

3.1 Dataset Descriptions and Experimental Settings

We use two synthetic datasets, S-Curve and Swiss Roll, shown in Figs. 1a and 2a, respectively. Both are three dimensional datasets consisting of 1000 points each, generated using the scikit-learn [13] package (version 0.19.2) of Python. We also use a real-world dataset, Iris, which has 150 points from three classes (50 per class) in a four dimensional space.

Fig. 1. Visualization of the S Curve data: (a) original, (b) Sammon’s projection, (c) proposed method with MA model, (d) proposed method with TS model.

Fig. 2. Visualization of the Swiss Roll data: (a) original, (b) Sammon’s projection, (c) proposed method with MA model, (d) proposed method with TS model.

We choose the number of fuzzy rules, i.e., the number of clusters, as 0.05n, where n is the number of points. In the FCM clustering algorithm, the fuzzifier and the maximum number of iterations are set to 2 and 200, respectively. We initialize the consequent parameters randomly in \([-0.5, 0.5]\). The spreads of the Gaussian membership functions in (6) are initialized as \(0.2 \beta _l\) and \(0.3 \beta _l\), where \(\beta _l\) is the difference between the maximum and the minimum values of the lth feature. We repeat each experiment five times for each of these two initializations of the spreads. Out of these ten runs, the one with the minimum error (measured using (1)) is chosen as the output of the proposed algorithm; this is done automatically without any human intervention. To optimize the objective function in (1) using stochastic gradient descent, we use the optimizer train.GradientDescentOptimizer from the TensorFlow [1] machine learning framework. The parameter learning_rate is set to 2.0 and 0.1 for the MA and TS models, respectively. Thus, in the experiments with MA models, the three involved learning rates \(\eta _{v},\eta _{\sigma },\) and \(\eta _{h}\) are all equal to 2.0; similarly, in the experiments with TS models, the three involved learning rates \(\eta _{v},\eta _{\sigma },\) and \(\eta _{c}\) are all equal to 0.1.
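For reference, the data generation and initialization described above could be set up as in the following sketch. The scikit-fuzzy cmeans routine is one possible FCM implementation (the paper specifies scikit-learn only for data generation), and the error tolerance, random seed, and \(D'=2\) are our assumptions.

```python
import numpy as np
import skfuzzy as fuzz
from sklearn.datasets import make_s_curve, make_swiss_roll

X, _ = make_s_curve(n_samples=1000, random_state=0)   # S-Curve; use make_swiss_roll(...) analogously
n, D = X.shape
r = int(0.05 * n)                                      # number of rules = number of clusters

# Fuzzy c-means with fuzzifier 2 and at most 200 iterations (scikit-fuzzy expects features x samples).
V, _, _, _, _, _, _ = fuzz.cluster.cmeans(X.T, c=r, m=2.0, error=1e-5, maxiter=200)

beta = X.max(axis=0) - X.min(axis=0)                   # per-feature range beta_l
sigma = np.tile(0.2 * beta, (r, 1))                    # spreads 0.2*beta_l (0.3*beta_l in the other runs)
H = np.random.uniform(-0.5, 0.5, size=(r, 2))          # random consequents (MA heights), D' = 2
```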

3.2 Results and Discussions

Figures 1b, 2b and 3a show the results of Sammon’s method [14] on the S-Curve, Swiss Roll and Iris datasets, respectively. The results of the proposed method with the MA model on the three datasets are illustrated in Figs. 1c, 2c and 3b, respectively. Similarly, the outputs of the proposed method with the TS model are shown in Figs. 1d, 2d and 3c, respectively. From Figs. 1, 2 and 3, we observe that the proposed method, with both the MA and TS models, projects the three datasets faithfully to a two dimensional space. For the synthetic datasets, the proposed method successfully preserves the underlying geometries, as they are preserved in their Sammon’s projections. We note that for S Curve, the MA model (with the height method) produces a somewhat better outcome than the TS model, whereas for Swiss Roll the TS model generates a better projection than the MA model. In the case of the Iris dataset, separability of the classes in the projected space is important. As seen in Figs. 3b and c, the separability between the three classes is similar to that obtained with Sammon’s method (Fig. 3a). We note here that different initializations and different numbers of rules may produce different results. In addition to the graphical results, to assess the quality of the projection, we present in Table 1 the Pearson’s correlation coefficient between the pairwise distances in the input space and those in the output space. For all three datasets we obtained quite high values of the correlation coefficient.

This indicates that the relations among the pairwise distances of the inputs are mostly preserved in the obtained projections.
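For instance, the reported coefficients can be computed with a few lines (a sketch of the evaluation, not the authors’ script):

```python
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def pairwise_distance_correlation(X, Y):
    """Pearson correlation between pairwise distances in the input and output spaces."""
    rho, _ = pearsonr(pdist(X), pdist(Y))
    return rho
```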

Fig. 3. Visualization of the Iris data: (a) Sammon’s projection, (b) proposed method with MA model, (c) proposed method with TS model.

Table 1. Pearson’s correlation coefficient between the pairwise distances in the input and output spaces
Table 2. Pearson’s correlation coefficient between the pairwise distances in the input and output spaces, for the training set alone and for the training and test sets together
Fig. 4. Visualization in the original space: (a) training data, (b) full data. Visualization using the proposed method with the MA model: (c) training data, (d) full data with prediction. Visualization using the proposed method with the TS model: (e) training data, (f) full data with prediction.

To illustrate the predictability of the proposed model, we perform an experiment with the S-Curve dataset. A contiguous portion of the dataset containing 50 points is removed, and the remaining points form the training dataset (see Fig. 4a); the removed 50 points serve as the test dataset. The proposed schemes based on the MA model and the TS model are used to learn from the training data; the corresponding outputs are shown in Figs. 4c and e, respectively. The full dataset, containing both the training and the test points, is shown in Fig. 4b. To differentiate the training and the test points, we represent them by circle (o) and plus (\(+\)) signs, respectively. The predictions for the test points obtained by the MA model and the TS model based variants are shown in Figs. 4d and f, respectively. We observe that, irrespective of the type of model, the proposed method projects the test points in the expected region. However, the prediction by the MA model based method seems more appropriate than that of the TS model based method. Table 2 presents the values of the Pearson’s correlation coefficient between the pairwise distances in the input and output spaces, computed for the training set alone and for the training and test sets taken together. For both the MA and TS models, the correlations for the training data are similar to those obtained when the training and test data are taken together.

4 Conclusion

In this work, we have proposed a fuzzy rule-based scheme for learning Sammon’s map explicitly. We have explored the MA model along with the height method of defuzzification, and the TS model whose consequents are linear combinations of the input variables. For both models, the antecedent part is constructed by clustering the given input data. The antecedent and consequent parameters are then modified using gradient descent search to minimize the Sammon’s stress. The obtained results show that the proposed system produces structure-preserving projections comparable to those of the original Sammon’s method, and that it projects new (test) points in an expected manner. The framework is quite general, and objectives other than Sammon’s stress can also be used.

An issue with the proposed method is that its performance depends on the chosen number of clusters. Here, we have chosen the number of clusters based on our initial ad-hoc experiments; in the future, we plan to investigate how to address this issue. We also plan to introduce regularizers to improve the quality of predictability. Moreover, we have not yet investigated how the proposed method performs in situations where it should reject the outputs. All of these are left for our future work.