1 Introduction

Distance metric learning (DML) is an important subject in machine learning, and has found applications in many domains, including information retrieval (He et al. 2004), supervised classification (Weinberger and Saul 2009), clustering (Xing et al. 2002), and semi-supervised clustering (Chang and Yeung 2004). The objective of DML is to learn a distance metric consistent with a given set of constraints, namely minimizing the distances between pairs of data points from the same class and maximizing the distances between pairs of data points from different classes. The constraints are often specified in the form of must-links, where data points belong to the same class, and cannot-links, where data points belong to different classes. The constraints can also be specified in the form of triplets \(({\mathbf {x}}_i, {\mathbf {x}}_j, {\mathbf {x}}_k)\) (Weinberger and Saul 2009), in which \({\mathbf {x}}_i\) and \({\mathbf {x}}_j\) belong to a class different from that of \({\mathbf {x}}_k\) and therefore \({\mathbf {x}}_i\) and \({\mathbf {x}}_j\) should be separated by a distance smaller than that between \({\mathbf {x}}_i\) and \({\mathbf {x}}_k\). In this work, we focus on DML using triplet constraints due to its encouraging performance (Weinberger and Saul 2009; Chechik et al. 2010; Shaw et al. 2011).

The main computational challenge in DML arises from the restriction that the learned distance metric must be a positive semi-definite (PSD) matrix, which is often referred to as the PSD constraint. Early approaches (Xing et al. 2002) addressed the PSD constraint with semi-definite programming (SDP) (Boyd and Vandenberghe 2004), which unfortunately does not scale to large and high-dimensional datasets. More recent approaches (Jain et al. 2008; Shaw et al. 2011) addressed this challenge by exploiting techniques from online learning and stochastic optimization, particularly stochastic gradient descent (SGD), which only needs to deal with one constraint at each iteration. Although these approaches are significantly more efficient than the early approach, they share one common drawback: to ensure that the learned distance metric is PSD, they require, at each iteration, projecting the updated distance metric onto the PSD cone. The projection step requires performing an eigen-decomposition of the updated matrix and is therefore computationally expensive. As a result, the key challenge in developing efficient SGD algorithms for DML is how to reduce the number of projections without affecting the performance of DML.

A common approach for reducing the number of updates and projections in DML is to use a non-smooth loss function. A popular choice is the hinge loss, whose derivative becomes zero when the input exceeds a certain threshold. Many online learning algorithms for DML (Davis et al. 2007; Jain et al. 2008; Chechik et al. 2010) take advantage of non-smooth loss functions to reduce the number of updates and projections. In Shaw et al. (2011), the authors proposed a structure preserving metric learning algorithm (SPML) that combines a mini-batch strategy with the hinge loss to further reduce the number of updates for DML. It groups multiple constraints into a mini-batch and performs only one update of the distance metric per mini-batch. However, according to our empirical study, although SPML reduces the running time of the standard SGD algorithm, its deployment of the mini-batch strategy leads to significantly worse performance on several datasets.

In this work, we first develop a new mini-batch based SGD algorithm for DML, termed Mini-SGD. Unlike SPML, which relies on the hinge loss, the proposed Mini-SGD algorithm exploits a smooth loss function for DML. By using a smooth loss function, the proposed algorithm is able to effectively take advantage of the reduction in the variance of gradients achieved by the mini-batch, which in turn leads to a better regret bound for online learning (Cotter 2011) and consequently more accurate predictions with the learned distance metric. We show theoretically that, by using a smooth loss function, Mini-SGD achieves a convergence rate similar to that of the standard SGD algorithm but with significantly fewer updates. The second contribution of this work is a new strategy, termed adaptive sampling, for reducing the number of projections in DML. The key idea of adaptive sampling is to first measure the “difficulty” of classifying a constraint with the learned distance metric, and then perform stochastic updating based on that difficulty measure. Finally, we develop two hybrid approaches that combine adaptive sampling with mini-batch to further improve the computational efficiency of SGD for DML. We conduct an extensive empirical study to verify the effectiveness and efficiency of the proposed algorithms for DML. We summarize the main contributions of this work as follows:

  • To the best of our knowledge, this is the first work that exploits the combination of the mini-batch strategy with a smooth loss function for DML. We verify, both theoretically and empirically, the efficiency and effectiveness of the mini-batch strategy with a smooth loss for DML.

  • We propose an adaptive sampling approach for efficient DML. We verify, both theoretically and empirically, the efficiency and effectiveness of the adaptive sampling approach for DML.

  • We present two hybrid approaches that exploit the combination of the mini-batch strategy with adaptive sampling for DML. Our extensive empirical study verifies that the hybrid approaches are significantly more efficient than both the mini-batch strategy and the adaptive sampling approach.

The rest of the paper is organized as follows: Sect. 2 reviews the related work on distance metric learning and stochastic gradient descent with reduced number of projection steps. Section 3 describes the proposed SGD algorithms for DML based on mini-batch and adaptive sampling. Two hybrid approaches are presented that combine mini-batch and adaptive sampling for DML. The theoretical guarantees for both mini-batch based and adaptive sampling based SGD are also presented in Sect. 3. Section 4 summarizes the results of the empirical study, and Sect. 5 concludes this work with future directions.

2 Related work

Many algorithms have been developed to learn a linear distance metric from pairwise constraints, where must-links include pairs of data points from the same class and cannot-links include pairs of data points from different classes ((Yang and Jin 2006) and references therein). Besides pairwise constraints, an alternative strategy is to learn a distance metric from a set of triplet constraints \(({\mathbf {x}}_i^t, {\mathbf {x}}_j^t, {\mathbf {x}}_k^t), t=1, \ldots , N\), where \({\mathbf {x}}_i^t\) is expected to be closer to \({\mathbf {x}}_j^t\) than to \({\mathbf {x}}_k^t\). Previous studies (Weinberger and Saul 2009; Chechik et al. 2010; Shaw et al. 2011) showed that triplet constraints could be more effective for DML than pairwise constraints.

Several online algorithms have been developed to reduce the computational cost of DML (Globerson and Roweis 2005; Davis et al. 2007; Jain et al. 2008). Most of these methods are based on stochastic gradient descent. At each iteration, they randomly sample one constraint, and update the distance metric based on the sampled constraint. The updated distance metric is further projected onto the PSD cone to ensure that it is PSD. Although these approaches are significantly more scalable than the batch learning algorithms for DML (Weinberger and Saul 2009), they suffer from the high computational cost in the projection step that has to be performed at every iteration. One way to improve the efficiency of SGD for DML is to reduce the dimensionality \(d\) of the data as the computational cost of projection is at least \(O(d^2)\) (Weinberger and Saul 2009). The main problem with these approaches is that they often result in a significantly worse performance than using the original data (Davis and Dhillon 2008).

An alternative approach for improving the efficiency of SGD for DML is to reduce the number of projections. A common approach for reducing the number of projections is to use a non-smooth loss function, such as the hinge loss. In addition, in Shaw et al. (2011), the authors proposed the structure preserving metric learning (SPML) algorithm, which combines mini-batch with the hinge loss to further reduce the number of projections. The main problem with SPML is its relatively poor performance compared to the standard SGD algorithm. This is because, in theory, the mini-batch strategy only works well with a smooth loss (Cotter 2011). Since the hinge loss is non-smooth, combining mini-batch with the hinge loss may result in suboptimal performance. This is verified by our empirical study, in which we observed that the distance metric learned by SPML performs significantly worse than that learned by the standard stochastic gradient descent method. We resolve this problem by presenting a new SGD algorithm for DML that combines mini-batch with a smooth loss instead of the hinge loss.

Finally, it is worth mentioning several recent studies that proposed to avoid projections in SGD. In Hazan and Kale (2012), the authors developed a projection-free SGD algorithm that replaces the projection step with a constrained linear programming problem. In Mahdavi et al. (2012), the authors proposed an SGD algorithm with only one projection, performed at the end of the iterations. Unfortunately, the computational gains of these two algorithms are limited, because they require computing, at each iteration, the minimum eigenvalue and eigenvector of the updated distance metric, an operation with \(O(d^2)\) cost, where \(d\) is the dimensionality of the data.

3 Improved SGD for DML by mini-batch and adaptive sampling

We first review the basic framework of DML with triplet constraints. We then present two strategies to improve the computational efficiency of SGD for DML, one by mini-batch and the other by adaptive sampling. We present the theoretical guarantees for both strategies, and defer more detailed analysis to the appendix. At the end of this section, we present two hybrid approaches that combine mini-batch with adaptive sampling for more efficient DML.

3.1 DML with triplet constraints

Let \(\mathcal {X} \subset {\mathbb {R}}^d\) be the domain for input patterns, where \(d\) is the dimensionality. For the convenience of analysis, we assume that all input patterns have bounded norm, i.e. \(\forall {\mathbf {x}}\in \mathcal {X}, |{\mathbf {x}}|_2 \le r\). Given a distance metric \(M \in {\mathbb {R}}^{d\times d}\), the squared distance between \({\mathbf {x}}_a\) and \({\mathbf {x}}_b\), denoted by \(|{\mathbf {x}}_a - {\mathbf {x}}_b|^2_M\), is measured by

$$\begin{aligned} |{\mathbf {x}}_a - {\mathbf {x}}_b|_M^2 = ({\mathbf {x}}_a - {\mathbf {x}}_b)^{\top }M({\mathbf {x}}_a - {\mathbf {x}}_b) \end{aligned}$$

Let \(\varOmega = \left\{ M: M\succeq 0, \Vert M\Vert _F \le R\right\} \) be the domain for distance metric \(M\), where \(R\) specifies the domain size. Let \(\mathcal {D} = \{({\mathbf {x}}_i^1, {\mathbf {x}}_j^1, {\mathbf {x}}^1_k), \ldots , ({\mathbf {x}}_i^N, {\mathbf {x}}_j^N, {\mathbf {x}}^N_k)\}\) be the set of triplet constraints used for DML, where \({\mathbf {x}}^t_i\) is expected to be closer to \({\mathbf {x}}^t_j\) than to \({\mathbf {x}}^t_k\). Let \(\ell (z)\) be the convex loss function. Define \(\varDelta ({\mathbf {x}}_i^t, {\mathbf {x}}_j^t, {\mathbf {x}}_k^t; M)\) as

$$\begin{aligned} \varDelta ({\mathbf {x}}_i^t, {\mathbf {x}}_j^t, {\mathbf {x}}_k^t; M)&= |{\mathbf {x}}^t_i - {\mathbf {x}}^t_k|_M^2 - |{\mathbf {x}}^t_i - {\mathbf {x}}^t_j|_M^2 \\&= \left\langle M, ({\mathbf {x}}^t_i - {\mathbf {x}}^t_k)({\mathbf {x}}^t_i - {\mathbf {x}}^t_k)^{\top } - ({\mathbf {x}}^t_i - {\mathbf {x}}^t_j)({\mathbf {x}}^t_i - {\mathbf {x}}^t_j)^{\top }\right\rangle = \langle M, A_t\rangle \end{aligned}$$

where

$$\begin{aligned} A_t = ({\mathbf {x}}^t_i - {\mathbf {x}}^t_k)({\mathbf {x}}^t_i - {\mathbf {x}}^t_k)^{\top } - ({\mathbf {x}}^t_i - {\mathbf {x}}^t_j)({\mathbf {x}}^t_i - {\mathbf {x}}^t_j)^{\top } \end{aligned}$$
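For concreteness, the following minimal NumPy sketch (function name is ours, not from the paper) shows how \(A_t\) and \(\varDelta = \langle M, A_t\rangle \) can be computed for a single triplet:

```python
import numpy as np

def triplet_terms(x_i, x_j, x_k, M):
    """Compute A_t and Delta = <M, A_t> for one triplet (x_i, x_j, x_k).

    A_t = (x_i - x_k)(x_i - x_k)^T - (x_i - x_j)(x_i - x_j)^T, so that
    <M, A_t> = |x_i - x_k|_M^2 - |x_i - x_j|_M^2.
    """
    d_ik = x_i - x_k
    d_ij = x_i - x_j
    A_t = np.outer(d_ik, d_ik) - np.outer(d_ij, d_ij)
    delta = np.sum(M * A_t)   # Frobenius inner product <M, A_t>
    return delta, A_t
```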

Given the triplet constraints in \(\mathcal {D}\) and the domain \(\varOmega \), we learn an optimal distance metric \(M \in {\mathbb {R}}^{d\times d}\) by solving the following optimization problem

$$\begin{aligned} \min \limits _{M \in \varOmega }&\widehat{{\mathcal {L}}}(M) = \frac{1}{N}\sum _{t=1}^N \ell \left( \varDelta ({\mathbf {x}}_i^t, {\mathbf {x}}_j^t, {\mathbf {x}}_k^t; M) \right) \end{aligned}$$
(1)

We also define the expectation of the loss function as

$$\begin{aligned} {\mathcal {L}}(M) = {\mathrm {E}}\left[ \ell (\varDelta ({\mathbf {x}}_i, {\mathbf {x}}_j, {\mathbf {x}}_k; M))\right] , \end{aligned}$$
(2)

where the expectation is taken over \({\mathbf {x}}_i\), \({\mathbf {x}}_j\) and \({\mathbf {x}}_k\).

The key idea of online DML is to minimize the empirical loss \(\widehat{{\mathcal {L}}}(M)\) by updating the distance metric based on one sampled constraint at each iteration. More specifically, at iteration \(t\), it samples a triplet constraint \(({\mathbf {x}}_i^t, {\mathbf {x}}_j^t, {\mathbf {x}}_k^t)\), and updates the distance metric \(M_t\) to \(M_{t+1}\) by

$$\begin{aligned} M_{t+1} = \varPi _{\varOmega }\left( M_t - \eta \ell '(\varDelta ({\mathbf {x}}_i^t, {\mathbf {x}}_j^t, {\mathbf {x}}_k^t; M_t))A_t\right) , \end{aligned}$$

where \(\eta > 0\) is the step size, \(\ell '(\cdot )\) is the derivative of \(\ell \), and \(\varPi _{\varOmega }(M)\) projects a matrix \(M\) onto the domain \(\varOmega \). The following proposition shows that \(\varPi _{\varOmega }(M)\) can be computed in two steps: first project \(M\) onto the PSD cone, and then scale the projected matrix to satisfy the constraint \(\Vert M\Vert _F \le R\).

Proposition 1

Boyd and Vandenberghe (2004) We have

$$\begin{aligned} \varPi _{\varOmega }(M) = \frac{1}{\max (\Vert P(M)\Vert _F/R, 1)} P(M). \end{aligned}$$

Here \(P(M)\) projects matrix \(M\) onto the PSD cone and is computed as

$$\begin{aligned} P(M) = \sum _{i=1}^d \max (\lambda _i, 0) {\mathbf {v}}_i{\mathbf {v}}_i^{\top }, \end{aligned}$$

where \((\lambda _i, {\mathbf {v}}_i), i=1, \ldots , d\) are the eigenvalues and corresponding eigenvectors of \(M\).

As indicated by Proposition 1, \(\varPi _{\varOmega }(M)\) requires projecting distance metric \(M\) onto the PSD cone, an expensive operation that requires eigen-decomposition of \(M\).
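A minimal sketch of this two-step projection, assuming NumPy and a symmetric input matrix (the helper name is ours):

```python
import numpy as np

def project_onto_domain(M, R):
    """Project a symmetric matrix M onto Omega = {M PSD, ||M||_F <= R}
    following Proposition 1: PSD projection first, then rescaling."""
    M = (M + M.T) / 2                       # symmetrize for numerical safety
    eigvals, eigvecs = np.linalg.eigh(M)    # eigen-decomposition of M
    eigvals = np.maximum(eigvals, 0.0)      # clip negative eigenvalues
    P = (eigvecs * eigvals) @ eigvecs.T     # P(M) = sum_i max(lambda_i, 0) v_i v_i^T
    scale = max(np.linalg.norm(P, 'fro') / R, 1.0)
    return P / scale
```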

Finally, we approximate the hinge loss by a smooth loss in our study

$$\begin{aligned} \ell (z) = \frac{1}{L}\log (1 + \exp \left( -L(z - 1) \right) ) \end{aligned}$$
(3)

where \(L > 0\) is a parameter that controls the approximation error: the larger the \(L\), the closer \(\ell (z)\) is to the hinge loss. Note that the smooth approximation of the hinge loss was first suggested in Zhang and Oles (2001) for classification and was later verified by an empirical study in Zhang et al. (2003). The key properties of the loss function \(\ell (z)\) in (3) are given in the following proposition.

Proposition 2

For the loss function defined in (3), we have

$$\begin{aligned} \forall z \in {\mathbb {R}}, \quad |\ell '(z)| \le 1, \; |\ell '(z)| \le L\ell (z) \end{aligned}$$

Compared to the hinge loss function, the main advantage of the loss function in (3) is that it is a smooth loss function. As will be revealed by our analysis, it is the smoothness of the loss function that allows us to effectively explore both the mini-batch and adaptive sampling strategies for more efficient DML without having to sacrifice the prediction performance.
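For reference, here is a small sketch of the loss in (3) and its derivative; the default \(L=3\) mirrors the value used in the experiments of Sect. 4, and the function names are ours:

```python
import numpy as np

def smooth_hinge(z, L=3.0):
    """Smoothed hinge loss of Eq. (3): (1/L) * log(1 + exp(-L(z - 1)))."""
    return np.logaddexp(0.0, -L * (z - 1.0)) / L   # numerically stable log(1 + exp(.))

def smooth_hinge_grad(z, L=3.0):
    """Derivative l'(z) = -1 / (1 + exp(L(z - 1))).

    It satisfies |l'(z)| <= 1 and |l'(z)| <= L * l(z), as stated in Proposition 2.
    """
    return -1.0 / (1.0 + np.exp(L * (z - 1.0)))
```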

3.2 Mini-batch SGD for DML (Mini-SGD)

Mini-batch SGD improves the computational efficiency of online DML by grouping multiple constraints into a mini-batch and only updating the distance metric once for each mini-batch. For brevity, we will refer to this algorithm as Mini-SGD for the rest of the paper.

Let \(b\) be the batch size. At iteration \(t\), it samples \(b\) triplet constraints, denoted by

$$\begin{aligned} ({\mathbf {x}}_i^{t,s}, {\mathbf {x}}_j^{t,s}, {\mathbf {x}}_k^{t,s}), s= 1,\ldots , b, \end{aligned}$$

and defines the mini-batch loss at iteration \(t\) as

$$\begin{aligned} \ell _t(M_t) = \frac{1}{b}\sum _{s=1}^b \ell \left( \varDelta ({\mathbf {x}}_i^{t,s}, {\mathbf {x}}_j^{t,s}, {\mathbf {x}}_k^{t,s}; M_t)\right) \end{aligned}$$

Mini-batch DML updates the distance metric \(M_t\) to \(M_{t+1}\) using the gradient of the mini-batch loss function \(\ell _t(M)\), i.e.,

$$\begin{aligned} M_{t+1} = \varPi _{\varOmega }\left( M_t - \eta \nabla \ell _t(M_t)\right) \end{aligned}$$

Algorithm 1 gives the detailed steps of Mini-SGD for DML, where in step 5 Proposition 1 is used to efficiently compute the projection \(\varPi _{\varOmega }(\cdot )\).

Algorithm 1 Mini-batch SGD for DML (Mini-SGD)
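As a compact illustration of the Mini-SGD update loop, the sketch below reuses the helper sketches above (triplet_terms, smooth_hinge_grad, project_onto_domain). The defaults \(b=10\), \(\eta =1\) and \(R=1000\) follow the experimental settings in Sect. 4, and the identity initialization is an assumption consistent with the baselines:

```python
import numpy as np

def mini_sgd(triplets, d, b=10, eta=1.0, R=1000.0, L=3.0):
    """Sketch of Mini-SGD: one projected update per mini-batch of triplets."""
    M = np.eye(d)                                  # initial metric (assumed identity)
    for start in range(0, len(triplets), b):
        batch = triplets[start:start + b]
        G = np.zeros((d, d))
        for (x_i, x_j, x_k) in batch:              # gradient of the mini-batch loss
            delta, A_t = triplet_terms(x_i, x_j, x_k, M)
            G += smooth_hinge_grad(delta, L) * A_t
        G /= len(batch)
        M = project_onto_domain(M - eta * G, R)    # single projection per mini-batch
    return M
```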

The theorem below provides the theoretical guarantee for the Mini-SGD algorithm for DML using the smooth loss function defined in (3).

Theorem 1

Let \(\bar{M}\) be the solution output by Algorithm 1 that uses the loss function defined in (3). Let \(M_*\) be the optimal solution that minimizes \({\mathcal {L}}(M)\). Assuming \(\Vert A_t\Vert _F \le A\) for any triplet constraint and \(\eta < 1/(3LA^2)\), we have

$$\begin{aligned} {\mathrm {E}}[{\mathcal {L}}(\bar{M})] \le \frac{{\mathcal {L}}(M_*)}{1 - 3\eta LA^2} + \frac{b R^2}{2(1 - 3\eta LA^2)\eta N} \end{aligned}$$
(4)

where the expectation is taken over the sequence of triplet constraints.

Remark 1

First, we observe that the second term in the upper bound in (4), i.e., \(bR^2/[2(1 - 3\eta LA^2)\eta N]\), has a linear dependence on the mini-batch size \(b\), implying that the larger the \(b\), the less accurate the distance metric learned by Algorithm 1. Hence, by adjusting the mini-batch size \(b\), we are able to make an appropriate tradeoff between prediction accuracy and computational efficiency: the smaller the \(b\), the more accurate the distance metric, but at the cost of more updates and consequently a higher computational cost. When \({\mathcal {L}}(M_*) = 0\), we have \({\mathrm {E}}[{\mathcal {L}}(\bar{M})] = O(b/N)\), i.e. the expected prediction error is reduced at the rate of \(b/N\), significantly faster than the \(O(1/\sqrt{N})\) rate of the mini-batch SGD algorithm given in Cotter (2011). Second, if we set \(\eta \) as:

$$\begin{aligned} \eta = \frac{\rho }{3LA^2(1 + 2\rho )}\quad \text {where} \quad \rho = \frac{3bLR^2A^2}{N {\mathcal {L}}(M_*)} \end{aligned}$$
(5)

we have

$$\begin{aligned} {\mathrm {E}}[{\mathcal {L}}(\bar{M})] \le 2{\mathcal {L}}(M_*) + \frac{6b LA^2R^2}{N} \end{aligned}$$
(6)

Although the step size in (5) requires knowledge of \({\mathcal {L}}(M_*)\), which is usually unavailable, as suggested in Jin (2013), \({\mathcal {L}}(M_*)\) can be estimated empirically using part of the training examples. Third, the bound in (4) reveals the importance of using a smooth loss function, as the last term in (4) is proportional to \(L\), which measures the smoothness of the loss function. As a result, using a non-smooth loss function (e.g. the hinge loss) in DML will not be able to benefit from the mini-batch strategy. Finally, unlike the analysis in Shaw et al. (2011) (i.e. their Theorem 2), which only considers the case \(b = 1\), Theorem 1 provides a general result for any mini-batch size \(b\).

3.3 Adaptive sampling based SGD for DML (AS-SGD)

We now develop a new approach for reducing the number of updates in SGD in order to improve the computational efficiency of DML. Instead of updating the distance metric at each iteration, the proposed strategy introduces a random binary variable to decide if the distance metric \(M_t\) will be updated given a triplet constraint \(({\mathbf {x}}_i^t, {\mathbf {x}}_j^t, {\mathbf {x}}_k^t)\). More specifically, it computes the derivative \(\ell '(\varDelta ({\mathbf {x}}_i^t, {\mathbf {x}}_j^t, {\mathbf {x}}_k^t; M_t))\), and samples a random variable \(Z_t\) with probability

$$\begin{aligned} \Pr (Z_t = 1) = |\ell '(\varDelta ({\mathbf {x}}_i^t, {\mathbf {x}}_j^t, {\mathbf {x}}_k^t; M_t))| \end{aligned}$$

The distance metric is updated only when \(Z_t = 1\). According to Proposition 2, we have \(|\ell '(\varDelta ({\mathbf {x}}_i^t, {\mathbf {x}}_j^t, {\mathbf {x}}_k^t; M_t))| \le L \ell (\varDelta ({\mathbf {x}}_i^t, {\mathbf {x}}_j^t, {\mathbf {x}}_k^t; M_t))\) for the smooth loss function given in (3), implying that a triplet constraint has a high chance of being used to update the distance metric if it incurs a large loss. The essential idea of the proposed adaptive sampling strategy is therefore to give a large chance of updating the distance metric to triplets that are difficult to classify, and a low chance to triplets that are classified correctly with a large margin. We note that an alternative strategy is to sample a triplet constraint \(({\mathbf {x}}_i^t, {\mathbf {x}}_j^t, {\mathbf {x}}_k^t)\) based on its loss \(\ell (\varDelta ({\mathbf {x}}_i^t, {\mathbf {x}}_j^t, {\mathbf {x}}_k^t; M_t))\). We did not choose the loss as the basis for updating because it is the derivative, not the loss, that is used by SGD to update the distance metric. The detailed steps of adaptive sampling based SGD for DML are given in Algorithm 2. We refer to this algorithm as AS-SGD in the rest of this paper.

The theorem below provides the performance guarantee for AS-SGD. It also bounds the number of updates \(\sum _{t=1}^N Z_t\) performed by AS-SGD.

Theorem 2

Let \(\bar{M}\) be the solution output by Algorithm 2 that uses the loss function defined in (3). Let \(M_*\) be the optimal solution that minimizes \({\mathcal {L}}(M)\). Assuming \(\Vert A_t\Vert _F \le A\) for any triplet constraint and \(\eta < 2/LA^2\), we have

$$\begin{aligned} {\mathrm {E}}\left[ {\mathcal {L}}(\bar{M})\right] \le \frac{{\mathcal {L}}(M_*)}{1-\eta LA^2/2} + \frac{R^2}{2\eta N(1 - \eta LA^2/2)} \end{aligned}$$
(7)

and the number of updates is bounded by

$$\begin{aligned} {\mathrm {E}}\left[ \sum _{t=1}^N Z_t\right] \le \frac{NL{{\mathcal {L}}}(M_*)}{1-\eta LA^2/2} + \frac{LR^2}{2\eta (1 - \eta LA^2/2)}, \end{aligned}$$
(8)

where the expectation is taken over both the binary random variables \(\{Z_t\}_{t=1}^N\) and the sequence of triplet constraints.

Algorithm 2 Adaptive sampling based SGD for DML (AS-SGD)
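Since Algorithm 2 itself appears only as a figure, the following sketch reflects our reading of AS-SGD: the metric is updated only when \(Z_t=1\), and the step uses the sign of the derivative so that, in expectation over \(Z_t\), it matches the full SGD step \(-\eta \ell '(\varDelta _t)A_t\). It reuses the helper sketches above:

```python
import numpy as np

def as_sgd(triplets, d, eta=1.0, R=1000.0, L=3.0, rng=None):
    """Sketch of AS-SGD: update only when Z_t = 1, with Pr(Z_t = 1) = |l'(Delta_t)|."""
    rng = rng or np.random.default_rng(0)
    M = np.eye(d)                                  # initial metric (assumed identity)
    n_updates = 0
    for (x_i, x_j, x_k) in triplets:
        delta, A_t = triplet_terms(x_i, x_j, x_k, M)
        g = smooth_hinge_grad(delta, L)            # g lies in [-1, 0) for the loss in (3)
        if rng.random() < abs(g):                  # Z_t ~ Bernoulli(|l'(Delta_t)|)
            # sign step keeps the expected update equal to -eta * l'(Delta_t) * A_t
            M = project_onto_domain(M - eta * np.sign(g) * A_t, R)
            n_updates += 1
    return M, n_updates
```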

Remark 2

If we set \(\eta \) as

$$\begin{aligned} \eta = \frac{2\rho }{LA^2(1 + 2\rho )}\quad \text {where} \quad \rho = \frac{LA^2 R^2}{4N {\mathcal {L}}(M_*)} \end{aligned}$$

we have

$$\begin{aligned} {\mathrm {E}}[{\mathcal {L}}(\bar{M})] \le 2{\mathcal {L}}(M_*) + \frac{LR^2 A^2}{N} \end{aligned}$$
(9)

and

$$\begin{aligned} {\mathrm {E}}\left[ \sum _{t=1}^N Z_t\right] \le 2NL{\mathcal {L}}(M_*) + L^2A^2R^2 \end{aligned}$$
(10)

The bounds given in (7) and (9) share structures similar to those in (4) and (6), except that they do not contain a mini-batch size \(b\) that can be used to trade off the number of updates against the classification accuracy. The number of updates performed by Algorithm 2 is bounded by (10). The dominant term in (10) is \(O({\mathcal {L}}(M_*) N)\), implying that Algorithm 2 will perform a small number of updates if the optimal distance metric only makes a small number of mistakes on the given set of training triplets. In the extreme case when \({\mathcal {L}}(M_*) \rightarrow 0\), the expected number of updates is bounded by the constant \(L^2A^2R^2\). We note that this is consistent with our intuition: it is easy to learn a good distance metric when the optimal one only makes a few mistakes, and as a result, only a few updates are needed to find a distance metric that is consistent with most of the training triplets. Compared with the result for the Perceptron-based method (Shalev-Shwartz and Singer 2004), we do not assume that the dataset is separable, which makes our bound on the number of updates more practically useful.

3.4 Hybrid approaches: combining mini-batch with adaptive sampling for DML

Algorithm 3 Hybrid approaches combining mini-batch with adaptive sampling for DML

Since mini-batch and adaptive sampling improve the computational efficiency of SGD from different aspects, it is natural to combine them for more efficient DML. Similar to the Mini-SGD algorithm, the hybrid approaches group multiple triplet constraints into a mini-batch. However, unlike Mini-SGD, which updates the distance metric for every mini-batch of constraints, the hybrid approaches follow the idea of adaptive sampling and introduce a binary random variable to decide whether the distance metric will be updated for each mini-batch of constraints. By combining the strengths of mini-batch and adaptive sampling, the hybrid approaches are able to further improve the computational efficiency of DML. Algorithm 3 highlights the key steps of the hybrid approaches.

One of the key steps in the hybrid approaches (step 5 in Algorithm 3) is to choose an appropriate sampling probability \(\gamma _t\) for every mini-batch of constraints \(({\mathbf {x}}_i^{t,s}, {\mathbf {x}}_j^{t,s}, {\mathbf {x}}_k^{t,s}), s=1, \ldots , b\). In this work, we study two different choices for the sampling probability \(\gamma _t\) (a minimal sketch of both choices is given after the list below):

  • The first approach chooses \(\gamma _t\) based on a triplet constraint randomly sampled from the mini-batch. More specifically, given a mini-batch of triplet constraints \(\{{\mathbf {x}}_i^{t,s}, {\mathbf {x}}_j^{t,s}, {\mathbf {x}}_k^{t,s}\}_{s=1}^b\), it randomly samples an index \(s'\) in the range \([1, b]\) and sets the sampling probability \(\gamma _t\) to be the magnitude of the derivative for the sampled triplet, i.e.,

    $$\begin{aligned} \gamma _t = |\ell '(\varDelta ({\mathbf {x}}_i^{t,s'}, {\mathbf {x}}_j^{t,s'}, {\mathbf {x}}_k^{t,s'}; M_t))| \end{aligned}$$
    (11)

    We refer to this approach as HR-SGD.

  • The second approach is based on an average-case analysis. It sets the sampling probability to the normalized norm of the mini-batch gradient \(\nabla \ell _t(M_t)\), i.e.

    $$\begin{aligned} \gamma _t = \frac{1}{W}\Vert \nabla \ell _t(M_t)\Vert _F, \end{aligned}$$
    (12)

    where \(W = \max _t \Vert \nabla \ell _t(M_t)\Vert _F\) is estimated by sampling. We refer to this approach as HA-SGD.
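The sketch below illustrates the two choices of \(\gamma _t\) in (11) and (12), reusing the helpers above; the clipping by \(1\) and the treatment of \(W\) as a given constant are our simplifications:

```python
import numpy as np

def hybrid_sampling_prob(batch, M, L=3.0, W=1.0, mode="HR", rng=None):
    """Sampling probability gamma_t for a mini-batch of (x_i, x_j, x_k) triplets.

    mode="HR": Eq. (11), magnitude of the derivative for one random triplet.
    mode="HA": Eq. (12), Frobenius norm of the mini-batch gradient divided by W.
    """
    rng = rng or np.random.default_rng(0)
    if mode == "HR":
        x_i, x_j, x_k = batch[rng.integers(len(batch))]
        delta, _ = triplet_terms(x_i, x_j, x_k, M)
        return abs(smooth_hinge_grad(delta, L))
    G = np.zeros_like(M)
    for (x_i, x_j, x_k) in batch:
        delta, A_t = triplet_terms(x_i, x_j, x_k, M)
        G += smooth_hinge_grad(delta, L) * A_t
    G /= len(batch)
    return min(np.linalg.norm(G, 'fro') / W, 1.0)   # clip to a valid probability
```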

4 Experiments

Ten datasets are used to validate the effectiveness of the proposed algorithms. Table 1 summarizes the information of these datasets. Datasets dna, letter (Hsu and Lin 2002), protein and sensit (Duarte and Hu 2004) are downloaded from LIBSVM (Chang and Lin 2011). Datasets tdt30 and rcv20 are document corpora: tdt30 is the subset of the tdt2 data (Cai et al. 2009) comprised of documents from the \(30\) most popular categories, and rcv20 is the subset of the large rcv1 dataset (Bekkerman and Scholz 2008) consisting of documents from the \(20\) most popular categories. Following Chechik et al. (2010), we reduce the dimensionality of these document datasets to \(200\) by principal component analysis (PCA). All the other datasets are downloaded directly from the UCI repository (Frank and Asuncion 2010). For all the datasets used in this study, we use the standard training/testing split provided by the original dataset, except for datasets semeion, connect4 and tdt30, where \(70~\%\) of the data is randomly selected for training and the remaining \(30~\%\) is used for testing. All the experiments are repeated five times, and both the average results and their standard deviation are reported. All the experiments are run on a laptop with 8 GB memory and two 2.50 GHz Intel Core i5-2520M CPUs.

Table 1 Statistics for the ten datasets used in our empirical study

4.1 Parameter setting

The parameter \(L\) in the loss function (3) is set to \(3\), following the suggestion in Zhang and Oles (2001). The number of triplet constraints \(N\) is set to \(100,000\) for all the datasets except for the two small datasets semeion and dna, where \(N = 20 n\). To construct triplet constraints, we follow the active sampling strategy given in Weinberger and Saul (2009): at each iteration \(t\), we first randomly pick a training example \({\mathbf {x}}_i^t\), and then pick \({\mathbf {x}}_j^t\) from the 3 positive nearest neighbors of \({\mathbf {x}}_i^t\); we then randomly select a triplet constraint from the set of active constraints involving \({\mathbf {x}}_i^t\) and \({\mathbf {x}}_j^t\). We note that this is different from Chechik et al. (2010), where triplet constraints are selected completely at random. Our empirical study shows that the active sampling strategy is more effective than choosing triplet constraints completely at random, because the actively sampled constraints are more informative to the learned distance metric. Furthermore, to verify that the choice of \(N\) does not lead to overfitting of the training data, particularly for the two small datasets, Fig. 1 shows the training and test errors for dataset dna. Both errors decline over epochs, suggesting that no overfitting occurs even for the small dataset.
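Because the exact definitions of “positive nearest neighbors” and “active constraints” are given in footnotes not reproduced here, the following sketch is only our approximation of the triplet construction described above: \({\mathbf {x}}_j^t\) is one of the 3 same-class nearest neighbors of \({\mathbf {x}}_i^t\), and \({\mathbf {x}}_k^t\) is a differently labeled point that violates a unit margin, falling back to a random differently labeled point when no such violator exists:

```python
import numpy as np

def sample_triplet(X, y, rng, n_neighbors=3):
    """Approximate active sampling of one triplet (x_i, x_j, x_k); assumptions noted above."""
    n = len(X)
    i = rng.integers(n)
    same = np.flatnonzero((y == y[i]) & (np.arange(n) != i))
    dist = np.linalg.norm(X[same] - X[i], axis=1)
    j = same[np.argsort(dist)[rng.integers(min(n_neighbors, len(same)))]]
    diff = np.flatnonzero(y != y[i])
    margin = np.linalg.norm(X[diff] - X[i], axis=1) ** 2 - np.linalg.norm(X[j] - X[i]) ** 2
    active = diff[margin < 1.0]                   # "impostors" within a unit margin (assumption)
    k = active[rng.integers(len(active))] if len(active) else diff[rng.integers(len(diff))]
    return X[i], X[j], X[k]
```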

For Mini-SGD and the hybrid approaches, we set the mini-batch size to \(b = 10\), as in Shaw et al. (2011), leading to a total of \(T = 10,000\) iterations for these approaches. We evaluate the learned distance metric by the classification error of a \(k\)-NN classifier on the test data, where the number of nearest neighbors \(k\) is set to \(3\) based on our experience.

Fig. 1 The training and testing errors over epochs for dataset dna

Parameter \(R\) in the proposed algorithms determines the domain size for the distance metric to be learned. We observe that the classification error of \(k\)-NN remains almost unchanged when varying \(R\) in the range \(\{100, 1000, 10000\}\), and we thus set \(R = 1,000\) for all the experiments. Another important parameter used by the proposed algorithms is the step size \(\eta \). We evaluate the impact of the step size by measuring the classification error of a \(k\)-NN algorithm that uses the distance metric learned by the Mini-SGD algorithm with \(\eta \in \{0.1,1,10\}\). We observe that \(\eta = 1\) yields a low classification error for almost all datasets, and we thus fix \(\eta = 1\) for the proposed algorithms in all the experiments.

4.2 Experiment (I): effectiveness of the proposed SGD algorithms for DML

In this experiment, we compare the performance of the proposed SGD algorithms for DML, i.e. Mini-SGD, AS-SGD and the two hybrid approaches (HR-SGD and HA-SGD), to the full version of SGD for DML (SGD). We also include the Euclidean distance as a reference method in our comparison. Table 2 shows the classification error of \(k\)-NN (\(k=3\)) using the proposed DML algorithms and the baseline algorithms. First, it is not surprising to observe that all the distance metric learning algorithms improve the classification performance of \(k\)-NN compared to the Euclidean distance. Second, for almost all datasets, we observe that the proposed DML algorithms (i.e., Mini-SGD, AS-SGD, HR-SGD, and HA-SGD) yield classification performance similar to that of SGD, the full version of the SGD algorithm for DML. This result confirms that the proposed SGD algorithms are effective for DML despite the modifications we made to the SGD algorithm.

Table 2 Classification error (\(\%\)) of \(k\)-NN (\(k=3\)) using the distance metrics learned by the proposed SGD methods for DML

4.3 Experiment (II): efficiency of the proposed SGD algorithms for DML

Figure 2 summarizes the running time for the proposed DML algorithms and the baseline SGD algorithm. We note that the running times in Fig. 2 do not take into account the time for constructing triplet constraints since it is shared by all the methods in comparison.

Fig. 2 The comparison of running time (seconds) for various SGD methods. Note that LMNN, a batch DML algorithm, is mainly implemented in C, which is computationally more efficient than our Matlab implementation. All the other methods are implemented in Matlab

It is not surprising to observe that all the proposed SGD algorithms, including Mini-SGD, AS-SGD, HA-SGD and HR-SGD, significantly reduce the running time of SGD. For instance, on dataset isolet, SGD takes more than \(35,000\) seconds to learn a distance metric, while the proposed SGD algorithms take less than \(4,000\) seconds, roughly a factor of \(10\) reduction in running time. Comparing the running time of AS-SGD to that of Mini-SGD, we observe that each method has its own advantage: AS-SGD is more efficient on dataset semeion, while Mini-SGD is more efficient on the other datasets. This is because AS-SGD and Mini-SGD employ different mechanisms to reduce the computational cost: AS-SGD improves the computational efficiency of DML by skipping the constraints that are easy to classify, while Mini-SGD improves the computational efficiency of SGD by performing only one update of the distance metric for each mini-batch of triplet constraints. Finally, we observe that the two hybrid approaches, which combine the strengths of adaptive sampling and mini-batch SGD, are the most computationally efficient on almost all datasets. We also observe that HR-SGD appears to be more efficient than HA-SGD on six datasets and only loses its edge on datasets protein and rcv20. This is because HR-SGD computes the sampling probability \(\gamma _t\) based on one randomly sampled triplet, while HA-SGD needs to compute the average derivative over each mini-batch of triplet constraints for the sampling probability.

To further examine the computational efficiency of the proposed SGD algorithms for DML, we summarize in Fig. 3 the number of updates performed by the proposed SGD algorithms and the baseline SGD algorithm. We observe that all the proposed SGD algorithms for DML reduce the number of updates significantly compared to SGD. Comparing Mini-SGD to AS-SGD, we observe that on semeion, the number of updates performed by AS-SGD is significantly smaller than that of Mini-SGD, while it is the other way around on the other datasets. This is again due to the fact that AS-SGD and Mini-SGD deploy different mechanisms for reducing computational costs. As expected, the two hybrid approaches further reduce the number of updates compared to AS-SGD and Mini-SGD, making them more efficient algorithms for DML.

Fig. 3 The comparison of the number of updates for various SGD methods. Note that since POLA and LEGO optimize pairwise constraints, we decompose each triplet constraint into two pairwise constraints for these two methods. As a result, the number of constraints is doubled for these two methods

By comparing the running time in Fig. 2 to the number of updates in Fig. 3, we observe that a small number of updates does not always guarantee a short running time. This is exhibited by the comparison between the two hybrid approaches: although HA-SGD performs a similar number of updates as HR-SGD on datasets semeion and dna, it takes HA-SGD a significantly longer time to finish the computation than HR-SGD. This is also exhibited by comparing the results across different datasets for a fixed method. For example, for the HA-SGD method, the number of updates for the protein dataset is nearly the same as that for the poker dataset, but the running time for the protein dataset is about \(100\) times longer than that for the poker dataset. This result may sound counterintuitive at first glance. However, a more careful analysis reveals that, in addition to the number of updates, the running time of DML is also affected by the computational cost per iteration, which reconciles the results in Figs. 2 and 3. In the case of the two hybrid approaches, HA-SGD incurs a higher computational cost per iteration than HR-SGD because HA-SGD has to compute the norm of the average gradient over each mini-batch, while HR-SGD only needs to compute the derivative of one randomly sampled triplet constraint for each mini-batch. In the case of comparing the running time across different datasets, the protein dataset has a significantly higher dimensionality than the poker dataset and therefore incurs a higher computational cost per iteration, because the computational cost of projecting an updated distance metric onto the PSD cone increases at least quadratically with the dimensionality.

4.4 Experiment (III): comparison with state-of-the-art online DML methods

We compare the proposed SGD algorithms to three state-of-the-art online algorithms and one batch method for DML:

  • SPML (Shaw et al. 2011): an online learning algorithm for DML that is based on mini-batch SGD and the hinge loss,

  • OASIS (Chechik et al. 2010): a state-of-the-art online DML algorithm; we use the symmetric version, which applies only one projection at the end of training,

  • LEGO (Jain et al. 2008): an online version of the information-theoretic DML algorithm of Davis et al. (2007),

  • POLA (Shalev-Shwartz and Singer 2004): a Perceptron-based online DML algorithm.

Finally, for sanity checking, we also compare the proposed SGD algorithms to LMNN (Weinberger and Saul 2009), a state-of-the-art batch learning algorithm for DML.

Both SPML and OASIS use the same set of triplet constraints as the proposed SGD algorithms to learn a distance metric. Since LEGO and POLA are designed for pairwise constraints, for a fair comparison, we generate pairwise constraints for LEGO and POLA by splitting each triplet constraint \(({\mathbf {x}}_i^t, {\mathbf {x}}_j^t, {\mathbf {x}}_k^t)\) into two pairwise constraints: a must-link constraint \(({\mathbf {x}}_i^t, {\mathbf {x}}_j^t)\) and a cannot-link constraint \(({\mathbf {x}}_i^t, {\mathbf {x}}_k^t)\). This splitting operation results in a total of \(2N\) pairwise constraints for LEGO and POLA. Finally, we note that since LMNN is a batch learning method, it is allowed to utilize any triplet constraint derived from the data and is not restricted to the set of triplet constraints we generate for the SGD methods. All the baseline DML algorithms are implemented using code from the original authors, except for SPML, for which we made appropriate changes to the original code in order to avoid large matrix multiplications and improve the computational efficiency. SPML, OASIS, LEGO and POLA are implemented in Matlab, while the core parts of LMNN are implemented in C, which is usually deemed to be more efficient than Matlab. The default parameters suggested by the original authors are used for the baseline algorithms. The step size of LEGO is set to \(1\), as it was observed in Chechik et al. (2010) that the prediction performance of LEGO is in general insensitive to the step size. In all experiments, all the baseline methods set the initial distance metric to the identity matrix.

Table 3 summarizes the classification results of \(k\)-NN (\(k=3\)) using the distance metrics learned by the proposed methods and by the baseline algorithms. SGD* denotes the best result of the proposed methods in Table 2. First, we observe that LEGO and POLA perform significantly worse than the proposed DML algorithms on four datasets, namely semeion, connect4, sensit and poker. LEGO also performs poorly on isolet and tdt30. This can be explained by the fact that LEGO and POLA use pairwise constraints for DML, while the other methods in comparison use triplet constraints. According to Chechik et al. (2010), Shaw et al. (2011) and Weinberger and Saul (2009), triplet constraints are in general more effective than pairwise constraints. Second, although both SPML and Mini-SGD are based on the mini-batch strategy, SPML performs significantly worse than Mini-SGD on three datasets, i.e. protein, connect4, and poker. The performance difference between SPML and Mini-SGD can be explained by the fact that Mini-SGD uses a smooth loss function whereas SPML uses the hinge loss. According to our analysis and the analysis in Cotter (2011), using a smooth loss function is critical for the success of the mini-batch strategy. Third, OASIS yields performance similar to the proposed algorithms on almost all datasets except semeion, dna and poker, on which OASIS performs significantly worse. Overall, we conclude that the proposed DML algorithms yield similar, if not better, performance compared to the state-of-the-art online learning algorithms for DML.

Table 3 Classification error (\(\%\)) of \(k\)-NN (\(k=3\)) using the distance metrics learned by baseline SGD method, online learning algorithms and batch learning approach for DML

Compared to LMNN, a state-of-the-art batch learning algorithm for DML, the proposed SGD algorithms yield similar performance on four datasets. They however perform significantly better than LMNN on datasets semeion and letter, and significantly worse on datasets dna and tdt30. We attribute the difference in classification error to the fact that the proposed DML algorithms are restricted to \(100,000\) randomly sampled triplet constraints, while LMNN is allowed to use all the triplet constraints that can be derived from the data. This restriction on the triplet constraints can sometimes limit the classification performance but at other times helps avoid overfitting. We also observe that LMNN is unable to run on the two large datasets rcv20 and poker, indicating that LMNN does not scale well to large datasets.

The running times for the proposed algorithms and the baseline algorithms are summarized in Fig. 2. The numbers of updates for both groups of algorithms are provided in Fig. 3. It is not surprising to observe that the two online DML algorithms (SPML, OASIS) are significantly more efficient than SGD in terms of both running time and the number of updates. We also observe that Mini-SGD and SPML share the same number of updates and similar running times on all datasets because they use the same mini-batch strategy. Furthermore, compared to the three online DML algorithms (SPML, LEGO and POLA), the two hybrid approaches are significantly more efficient in both running time and the number of updates. Table 4 compares the detailed running time of OASIS and the hybrid methods. We observe that the hybrid methods are more efficient than OASIS on six datasets (i.e. semeion, dna, letter, connect4, sensit and poker), even though OASIS only performs a projection once at the end of the program. We attribute the efficiency of the hybrid approaches to the reduced number of updates they have to perform on the learned metric. Finally, since LMNN is implemented in C, it is not surprising that LMNN shares a similar running time with the other online DML algorithms on relatively small datasets. It is, however, significantly less efficient than the online learning algorithms on datasets of modest size (e.g. letter, connect4 and sensit), and becomes computationally infeasible for the two large datasets rcv20 and poker. Overall, we observe that the two hybrid approaches are significantly more efficient than the other DML algorithms in comparison.

Table 4 The comparison of running time (seconds) for OASIS and the hybrid methods. Average results over five trials are reported

5 Conclusion

In this paper, we propose two strategies to improve the computational efficiency of SGD for DML, i.e. mini-batch and adaptive sampling. The key idea of mini-batch is to group multiple triplet constraints into a mini-batch and only update the distance metric once per mini-batch; the key idea of adaptive sampling is to perform stochastic updating by giving a difficult triplet constraint a larger chance of being used to update the distance metric than an easy triplet constraint. We develop theoretical guarantees for both strategies. We also develop two variants of hybrid approaches that combine mini-batch with adaptive sampling for more efficient DML. Our empirical study confirms that the proposed algorithms yield similar, if not better, prediction performance as the state-of-the-art online learning algorithms for DML, but with significantly less running time. Since our empirical study is currently limited to datasets with a relatively small number of features, we plan to examine the effectiveness of the proposed algorithms for DML with high-dimensional data.