We further divide the techniques that match the distances or similarities of image pairs derived from the two spaces, i.e., the original space and the Hamming space, into two parts as follows:
3.2.1 Difference Loss Minimization.
Deep Supervised Hashing (
DSH) [
111]. DSH utilizes a network consisting of three convolutional-pooling layers and two fully connected layers. Recall that the outputs of the hashing network are
\(\lbrace \mathbf {h}_i \rbrace _{i=1}^N\). The original pairwise loss function is defined as follows:
where
\(d_{ij}^h = ||\mathbf {h}_i-\mathbf {h}_j||_2^2\),
\([\cdot ]_+\) denotes
\(\max (\cdot ,0)\) and
\(m \gt 0\) is a given margin threshold. The loss function follows a distance-similarity product minimization formulation: it encourages similar examples to be mapped to nearby binary codes and penalizes dissimilar examples only when their Hamming distances are smaller than the margin threshold
\(m\). Note that when
\(d_{ij}^h\) is larger than
\(m\), the loss produces no gradient for that pair. This idea is similar to the hinge loss function.
As discussed before, DSH relaxes the binary constraints and adds a regularizer to the continuous outputs of the hashing network, which approximate the binary codes, i.e.,
\(\mathbf {h}\approx sgn(\mathbf {h})\). The pairwise loss is rewritten as
where
\(\mathbf {1}\) denotes an all-one vector and
\(||\cdot ||_p\) produces the
\(\ell _{p}\)-norm of the vector.
\(\lambda _1\) is a parameter to balance the effect of the regularization loss. DSH does not utilize saturating non-linearities because they may slow down the training process. With the above loss function, the neural network can be trained end-to-end with the back-propagation algorithm. For evaluation, the binary codes are derived using the sign function. DSH is a straightforward early deep supervised hashing method, and its idea originates from Spectral Hashing [
171] but with a deep learning framework.
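To make the formulation concrete, below is a minimal PyTorch-style sketch of the relaxed DSH objective described above; the variable names, the margin value, the 1/2 factors, and the weight \(\lambda _1\) are illustrative assumptions rather than the authors' reference implementation.

```python
import torch

# A sketch of the relaxed DSH pairwise loss: `h1`, `h2` are real-valued network
# outputs for a batch of image pairs and `s` is 1 for similar, 0 for dissimilar pairs.
def dsh_loss(h1, h2, s, m=2.0, lambda1=0.01):
    d = (h1 - h2).pow(2).sum(dim=1)                          # squared Euclidean distance d_ij^h
    sim_term = 0.5 * s * d                                   # pull similar pairs together
    dis_term = 0.5 * (1 - s) * torch.clamp(m - d, min=0.0)   # push dissimilar pairs beyond margin m
    # regularizer encouraging outputs to approach the binary codes, i.e., |h| ~ 1
    reg = (h1.abs() - 1).abs().sum(dim=1) + (h2.abs() - 1).abs().sum(dim=1)
    return (sim_term + dis_term + lambda1 * reg).mean()
```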
Pairwise Correlation Discrete Hashing (PCDH) [
27]. PCDH utilizes four fully connected layers after the convolutional-pooling layers, named the deep feature layer, hash-like layer, discrete hash layer, and classification layer, respectively. The third layer can directly generate discrete hash codes. Different from DSH, PCDH leverages the
\(\ell _2\) norm of deep features and hash-like codes. Besides, the classification loss is included in the final function:
where
\(\mathbf {z}_i,\mathbf {h}_i,\) and
\(\mathbf {b}_i\) denote the outputs of the first three fully connected layers. The last term is the classification cross-entropy loss.
Note that the second term is called the pairwise correlation loss, which guides the similarity learning of deep features to avoid overfitting. The classification loss provides semantic supervision, which helps the model achieve competitive performance. Besides, PCDH proposes a pairwise construction module named Pairwise Hard, which samples the positive pairs with the maximum distances between deep features and randomly samples negative pairs whose distances are smaller than a threshold. In effect, Pairwise Hard chooses the hard pairs with large losses for effective hash code learning.
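The Pairwise Hard construction can be pictured with the following hedged sketch, which mines, within a batch, the positive pair with the largest deep-feature distance and a random subset of negative pairs whose distances fall below a threshold; the function name, threshold, and sample count are illustrative assumptions, not PCDH's exact procedure.

```python
import torch

def mine_hard_pairs(feats, labels, neg_threshold=1.0, num_neg=32):
    dist = torch.cdist(feats, feats)                      # pairwise deep-feature distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)     # same-class mask
    eye = torch.eye(len(feats), dtype=torch.bool)
    pos_dist = dist.masked_fill(~same | eye, float('-inf'))
    hard_pos = (pos_dist == pos_dist.max()).nonzero()[0]  # positive pair with maximum distance
    neg_candidates = ((~same) & (dist < neg_threshold)).nonzero()
    if len(neg_candidates) > num_neg:                     # random subset of close negative pairs
        neg_candidates = neg_candidates[torch.randperm(len(neg_candidates))[:num_neg]]
    return hard_pos, neg_candidates
```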
Supervised Deep Hashing (SDH) [
40]. SDH utilizes a fully connected neural network for deep hashing and has a similar loss function, except for a term that enforces a relaxed orthogonality constraint on all projection matrices (i.e., the weight matrices of the fully connected layers). A bit balance regularization is also included, which will be introduced in Equation (
14).
Supervised Hashing with Binary Deep Neural Network (SH-BDNN) [
36]. The architecture of SH-BDNN is stacked by fully connected layers, in which
\(\mathbf {W}_i\) denotes the weights in the
ith layer. SH-BDNN considers not only the bit balance, i.e., each bit takes the values \(+1\) and \(-1\) with equal probability over the dataset, but also the independence of different hash bits. Given the hash code matrix
\(\mathbf {B} = [\mathbf {b}_1, \ldots , \mathbf {b}_N]^T\), the two conditions are formulated as
where
\(\mathbf {1}\) is an
\(N\)-dimensional vector whose elements are all one, and
\(\mathbf {I}\) is an identity matrix of size
\(L\) by
\(L\). The loss function is
\(\mathbf {H}\) is stacked by the outputs of the network, and
\(\mathbf {B}\) is stacked by the binary codes to be optimized in Equation (
14).
\(\mathbf {S}\) is the pairwise similarity matrix valued 1 or
\(-\)1. The first term is the similarity difference loss, the second term is the
\(\ell _{2}\) regularization, the third term is the quantization loss, and the last two terms penalize the correlation and the imbalance of the bits, respectively. Note that the
\(\mathbf {B}\) is not the sign of
\(\mathbf {H}\). As a result, the loss function is optimized by alternately updating the network parameters and
\(\mathbf {B}\). To be specific,
\(\mathbf {B}\) is optimized with the neural network fixed, while the neural network is trained with
\(\mathbf {B}\) fixed. SH-BDNN has a well-designed loss function, which follows Kernel-based Supervised Hashing [
113]. However, the architecture does not include the popular convolutional neural network, and it is not an end-to-end model. As a result, the efficiency of this model is low on large-scale datasets.
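The bit-independence and bit-balance conditions translate into two simple penalties; the sketch below, with illustrative names, assumes \(\mathbf {H}\) holds the real-valued outputs of a batch (rows are samples, columns are bits).

```python
import torch

def independence_penalty(H):
    n, L = H.shape
    gram = H.t() @ H / n                      # (1/N) H^T H should approach the L x L identity
    return (gram - torch.eye(L)).pow(2).sum()

def balance_penalty(H):
    # each bit should be +1 for roughly half of the samples and -1 for the rest
    return H.sum(dim=0).pow(2).sum()
```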
Convolutional Neural Network Hashing (CNNH) [
173]. CNNH is the earliest deep supervised hashing framework to our knowledge. It adopts a two-step strategy. In the first step, it optimizes the objective function using a coordinate descent strategy as follows:
which generates approximate binary codes. In the second step, CNNH utilizes the obtained hash codes to train a convolutional neural network with
\(L\) output units. Besides, if class labels are available, a fully connected layer with
\(K\) output units, corresponding to the
\(K\) class labels of the images, is added, and the classification loss is added to the loss function. Although CNNH uses labels in a clumsy manner, this two-step strategy is still popular in deep supervised hashing and inspires many other state-of-the-art methods.
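The first CNNH step can be pictured with the hedged sketch below, which fits relaxed codes so that their scaled inner products reproduce the similarity matrix; for brevity it uses plain gradient descent with clipping instead of the original coordinate descent procedure, and the step count and learning rate are illustrative.

```python
import torch

def approximate_codes(S, L, steps=500, lr=0.1):
    """S: {-1, +1} pairwise similarity matrix (N x N); L: code length."""
    N = S.shape[0]
    H = torch.randn(N, L, requires_grad=True)          # relaxed codes in R^{N x L}
    opt = torch.optim.Adam([H], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((H @ H.t() / L - S) ** 2).sum()         # fit scaled inner products to S
        loss.backward()
        opt.step()
        with torch.no_grad():
            H.clamp_(-1.0, 1.0)                         # keep the relaxation bounded
    return torch.sign(H)                                # targets for the second-step network
```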
Deep Discrete Supervised Hashing (DDSH) [
82]. DDSH uses column sampling to partition the training data into
\(\lbrace \mathbf {x}_i\rbrace _{i\in \Omega }\) and
\(\lbrace \mathbf {x}_i\rbrace _{i\in \Gamma }\), where
\(\Omega\) and
\(\Gamma\) are the corresponding index sets. The loss function is designed in an asymmetric form:
where
\(\mathbf {b}_i\) and
\(\mathbf {h}_i\) are the binary code to be optimized and the output of the network, respectively.
\(\mathbf {b}_i\) and
\(\mathbf {h}_i\) are updated alternately following [
36]. DDSH is notable for adopting an asymmetric strategy for learning to hash, which benefits both binary code generation and continuous feature learning through the pairwise similarity structure.
Hashing with Binary Matrix Pursuit (HBMP) [
9]. HBMP also takes advantage of the two-step strategy introduced above. Different from CNNH, HBMP utilizes weighted Hamming distances and adopts a different traditional hashing algorithm, called binary code inference, to obtain the hash codes. In the first step, the objective function is written as
where
\(\mathbf {\Lambda }\) is a diagonal weight matrix. Note that the similarity matrix, whose elements are
\(S^h_{ij}=\mathbf {b}_i^T\mathbf {\Lambda }\mathbf {b}_j\), can be approximated by a step-wise algorithm. HBMP then trains a convolutional neural network on the obtained hash codes with a point-wise hinge loss and shows that deep neural networks help to simplify the optimization problem and produce robust hash codes.
Asymmetric Deep Supervised Hashing (ADSH) [
84]. ADSH treats the samples in the database and the query set in an asymmetric manner, which helps to train the model more effectively, especially for large-scale nearest neighbor search. ADSH contains two critical components, i.e., a feature learning part and a loss function part. The first one utilizes a hashing network to learn hash codes for query points. The second one directly learns discrete codes for the database points by minimizing the same objective function with supervised information. The loss function is formulated as
where
\(\Omega\) is the index set of query points and
\(\Gamma\) is the index set of database points. Network parameters
\(\Theta\) and binary codes
\(\mathbf {b}_j\) are updated alternately following SH-BDNN [
36] during the optimization process. If only the database points are available, we let
\(\Omega \subset \Gamma\) and add a quantization loss
\(\sum _{i \in \Omega }\Vert \mathbf {b}_i-\mathbf {h}_i\Vert _2^2\) with the coefficient
\(\gamma\). This asymmetric strategy combines deep hashing and traditional hashing, which can help achieve better performance.
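A hedged sketch of the asymmetric objective follows, assuming `h_query` are \(\tanh\)-activated outputs of the hashing network for the sampled query points, `B_db` are the database codes being optimized, and `S` is the \(\lbrace -1,+1\rbrace\) supervision; the names and the omission of the optional quantization term are simplifications.

```python
import torch

def adsh_loss(h_query, B_db, S):
    L = h_query.shape[1]              # code length
    inner = h_query @ B_db.t()        # asymmetric inner products between queries and database codes
    return ((inner - L * S) ** 2).sum()
```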
Deep Incremental Hashing Network (DIHN) [
172]. DIHN tries to learn hash codes in an incremental manner. Similar to ADSH [
84], the dataset is divided into two parts, i.e., the original database and the incremental database. When a new image arrives from the incremental database, its hash code is learned while the hash codes of the original database are kept unchanged. The optimization process still uses the strategy of alternately updating the parameters.
Deep Ordinal Hashing (DOH) [
85]. DOH generates ordinal hash codes by taking advantage of both local and global features. Specifically, two subnetworks learn the local semantics using a fully convolutional network enhanced with a spatial attention module and the global semantics using a convolutional neural network, respectively. Afterward, the two outputs are combined to produce
\(R\) ordinal outputs
\(\lbrace \mathbf {h}_i^r\rbrace _{r=1}^R\). For each segment
\(\mathbf {h}_i^r\), the corresponding hash code can be obtained as follows:
The full hash code can be obtained by concatenating \(\lbrace \mathbf {b}_i^r\rbrace _{r=1}^R\). DOH adopts an end-to-end ranking-to-hashing framework, which avoids using the non-differentiable sign function. Furthermore, it uses a relatively complex network that is able to handle large datasets with high performance.
3.2.2 Likelihood Loss Minimization.
Deep Pairwise Supervised Hashing (DPSH) [
101]. DPSH adopts CNN-F [
23] as the backbone of the hashing network and the standard form of likelihood loss based on similarity information. Besides the similarity information, a quantization loss is also introduced into the final loss function, i.e.,
where
\(s_{ij}^h=\tfrac{1}{2}\mathbf {h}_{i}^T\mathbf {h}_{j}\) and
\(\mathbf {h}_{i}\) is the output of the hashing network. Although the triplet loss was popular at that time, DPSH adopts the pairwise form to simultaneously learn deep features and hash codes, which improves both accuracy and efficiency. This likelihood loss function can easily incorporate different Bayesian priors, making it flexible in applications and often achieving better performance than difference loss functions.
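A minimal sketch of a DPSH-style objective is given below; it assumes `H` are the network outputs for a batch, `S` is the \(\lbrace 0,1\rbrace\) pairwise similarity, the binary codes are taken as \(sgn(\mathbf {h}_i)\), and the trade-off weight is illustrative.

```python
import torch
import torch.nn.functional as F

def dpsh_loss(H, S, eta=0.1):
    theta = 0.5 * H @ H.t()                            # s_ij^h = (1/2) h_i^T h_j
    nll = (F.softplus(theta) - S * theta).sum()        # negative log-likelihood of the similarities
    quant = (torch.sign(H).detach() - H).pow(2).sum()  # quantization loss toward the binary codes
    return nll + eta * quant
```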
Deep Hashing Network (DHN) [
200]. It has a likelihood loss function similar to that of DPSH. Differently, DHN treats the quantization loss as a Bayesian prior and proposes a bimodal Laplacian prior for the output
\(\mathbf {h}_i\), i.e.,
and the negative log likelihood (i.e., quantization loss) is
which can be smoothed by a smooth surrogate [
74] into
where
\(\mathbf {h}_{ik}\) is the
kth element of
\(\mathbf {h}_i\). We notice that DHN replaces the
\(\ell _2\) norm (ITQ quantization error [
48]) with the
\(\ell _1\) norm. The work in [
200] also shows that the
\(\ell _{1}\) norm is an upper bound of the
\(\ell _{2}\) norm, and the
\(\ell _{1}\) norm encourages sparsity and is easier to optimize.
HashNet [
19]. As a variant of DHN, HashNet considers the training imbalance problem that the positive (similar) pairs are far fewer than the negative (dissimilar) pairs. Hence, it adopts a
Weighted Maximum Likelihood (
WML) loss with different weights for each image pair. The weight is formulated as
where
\(\mathcal {S}_{1}=\lbrace (i,j)\in \mathcal {E}: s_{i j}^o=1\rbrace\) comprises similar image pairs, while
\(\mathcal {S}_0 = \mathcal {E}/\mathcal {S}_1\) comprises dissimilar image pairs.
\(c_{i j}=\tfrac{|\mathbf {y}_{i} \cap \mathbf {y}_{j}|}{|\mathbf {y}_{i} \cup \mathbf {y}_{j}|}\) for multi-label datasets and equals 1 for single-label datasets. Besides, the sigmoid function in the conditional probability is substituted by the adaptive sigmoid function
\(1/(1+e^{-\alpha x})\), which is equivalent to adding a hyper-parameter into the hash code similarity computation, i.e.,
\(s_{ij}^h=\alpha \mathbf {b}_i^T\mathbf {b}_j\). Different from other methods, HashNet continuously approximates the sign function through the hyperbolic tangent function.
The activation function for the outputs is \(\tanh (\beta _t \cdot)\); by increasing \(\beta _t\rightarrow \infty\) step-wise, the optimal network with \(\operatorname{sgn}(\cdot)\) activation can be derived. Besides, this operation can be interpreted as multi-stage pretraining: the deep network using activation function \(\tanh (\beta _{t+1}\cdot)\) is initialized with the well-trained network using activation function \(\tanh (\beta _t\cdot)\). The two techniques proposed by HashNet greatly increase the performance of deep supervised hashing.
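The two techniques can be combined in a hedged sketch like the following, where `Z` are pre-activation outputs, `S` is the \(\lbrace 0,1\rbrace\) similarity, `W` carries the WML weights computed as above, and the values of \(\alpha\) and \(\beta _t\) are illustrative.

```python
import torch
import torch.nn.functional as F

def hashnet_loss(Z, S, W, beta_t, alpha=0.5):
    H = torch.tanh(beta_t * Z)            # tanh(beta_t * .) approaches sgn(.) as beta_t grows
    theta = alpha * H @ H.t()             # adaptive-sigmoid similarity alpha * h_i^T h_j
    nll = F.softplus(theta) - S * theta   # pairwise negative log-likelihood
    return (W * nll).sum()                # weighted maximum likelihood
```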
Deep Priority Hashing (DPH) [
20]. DPH also adds different weights to different image pairs, but reduces the weights of pairs with higher confidence, which is similar to AdaBoost [
138]. The confidence is measured by
\(q_{ij}\), which indicates how easily a pair is classified as similar when
\(s_{ij}^o=1\) or as dissimilar when
\(s_{ij}^o=0\). In formulation,
Besides, the weight characterizing class imbalance is measured by
\(\alpha _{ij}\):
where
\(\mathcal {S}_{i}=\lbrace (i,j)\in \mathcal {E}:\forall j\rbrace\), and
The final priority weight is formulated as
where
\(\gamma\) is a hyper-parameter. With the priority cross-entropy loss, DPH down-weights confident image pairs and prioritizes difficult image pairs with low confidence. Similarly, the priority quantization loss changes the weight of each image to
\(w_i^{\prime }=(1-q_i)^\gamma\), where
\(q_i\) measures how likely a continuous output can be perfectly quantized into a binary code. In this way, DPH achieves better performance than HashNet.
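The priority weighting can be summarized with the hedged one-liner below, where `q` is the per-pair confidence, `alpha` the class-imbalance weight described above, and \(\gamma\) the hyper-parameter; the concrete value of \(\gamma\) is illustrative.

```python
import torch

def priority_weight(q, alpha, gamma=2.0):
    # confident pairs (large q) receive small weights; difficult pairs dominate the loss
    return alpha * (1.0 - q).pow(gamma)
```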
Deep Supervised Discrete Hashing (DSDH) [
100]. Besides leveraging the pairwise similarity information, DSDH also takes advantage of label information by adding a linear regression loss with regularization to the loss function. By dropping the binary restrictions, the loss is formulated as
where
\(s_{ij}^h=\tfrac{1}{2}\mathbf {h}_i^T\mathbf {h}_j\) and the label is encoded in one-hot format
\(\mathbf {y}_i\). The second term in Equation (
30) is the linear regression term and the last term is an
\(\ell _2\) regularization.
\(\lbrace \mathbf {h}_i\rbrace _{i=1}^N\),
\(\lbrace \mathbf {b}_i\rbrace _{i=1}^N\), and
\(\mathbf {W}\) are updated alternately using the gradient descent method and the discrete cyclic coordinate descent method. DSDH greatly increases the performance of image retrieval since it takes advantage of both label information and pairwise similarity information. It should be noted that, in the linear regression term, the binary codes are updated by discrete cyclic coordinate descent, so the discreteness constraint is satisfied.
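A hedged sketch of the relaxed DSDH objective follows, assuming row-wise layouts \(\mathbf {H},\mathbf {B}\in \mathbb {R}^{N\times L}\), one-hot labels \(\mathbf {Y}\in \mathbb {R}^{N\times C}\), and a regression matrix \(\mathbf {W}\in \mathbb {R}^{L\times C}\); in the actual method \(\mathbf {B}\) is updated by discrete cyclic coordinate descent rather than by back propagation.

```python
import torch
import torch.nn.functional as F

def dsdh_loss(H, B, Y, W, S, mu=1.0, nu=0.1):
    theta = 0.5 * H @ H.t()                       # s_ij^h = (1/2) h_i^T h_j
    nll = (F.softplus(theta) - S * theta).sum()   # pairwise likelihood term
    reg = (Y - B @ W).pow(2).sum()                # linear regression from codes to labels
    return nll + mu * reg + nu * W.pow(2).sum()   # plus l2 regularization on W
```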
Deep Cauchy Hashing (DCH) [
15]. DCH is a Bayesian learning framework similar to DHN, but it replaces the sigmoid function in the conditional probability with a function based on the Cauchy distribution. DCH aims at improving the search accuracy within a Hamming radius of 2. The probability based on the generalized sigmoid function can still be large when Hamming distances are much greater than 2, which is detrimental to Hamming-ball retrieval. DCH tackles this problem by incorporating the Cauchy distribution, whose probability drops rapidly once Hamming distances exceed 2. The Cauchy distribution is formulated as
where
\(\gamma\) is a hyper-parameter and
\(d_{ij}^h\) is measured by the normalized Euclidean distance, i.e.,
\(d_{ij}^h= d(\mathbf {h}_i,\mathbf {h}_j)=\tfrac{L}{2}(1-\cos (\mathbf {h}_i,\mathbf {h}_j))\). Besides, the prior is based on a variant of the Cauchy distribution, i.e.,
The final loss function is formulated as the log-likelihood plus the quantization loss based on the prior weight. However, this loss function tends to produce almost identical hash codes for images with the same label. Even worse, the relationship between dissimilar pairs is not well considered.
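The Cauchy-based probability can be sketched as follows under the normalized-Euclidean distance surrogate given above; the value of \(\gamma\) is illustrative.

```python
import torch
import torch.nn.functional as F

def cauchy_probability(H, gamma=20.0):
    L = H.shape[1]
    Hn = F.normalize(H, dim=1)
    d = 0.5 * L * (1.0 - Hn @ Hn.t())   # d_ij^h = (L/2)(1 - cos(h_i, h_j))
    return gamma / (gamma + d)          # drops quickly once the distance grows beyond gamma
```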
Maximum-Margin Hamming Hashing (MMHH) [
87]. In view of the shortcomings of DCH, MMHH utilizes the t-distribution and employs different objective functions for similar and dissimilar pairs. The total loss is the weighted sum of the two losses. Besides, a margin
\(\zeta\) is utilized to avoid producing exactly the same hash codes. The Cauchy distribution in DCH is replaced by
The loss function is the weighted log-likelihood of conditional probability, i.e.,
The last term is a standard quantization loss. MMHH also proposes a semi-batch optimization strategy to alleviate the imbalance problem. Specifically, the binary codes of the training data are stored in an extra memory bank. The pairwise loss is calculated with the new codes computed in the current epoch together with their similar and dissimilar pairs from the memory, and the memory bank is then updated for the next epoch. In general, MMHH addresses the shortcomings of DCH, which greatly improves search performance.
Deep Fisher Hashing (DFH) [
103]. DFH points out that pairwise loss minimization is similar to Fisher's linear discriminant, which maximizes the gaps between inter-class examples while minimizing the gaps between intra-class examples. Its logistic loss function is similar to that of MMHH, and the final loss function is formulated as
in which
\(\epsilon\) is a margin parameter. Besides, a quantized center loss is added to the objective function, which not only minimizes the intra-class distances but also maximizes the inter-class distances between the binary hash codes.
Deep Asymmetric Pairwise Hashing (DAPH) [
139]. Similar to ADSH, DAPH also adopts an asymmetric strategy. The difference is that DAPH uses two networks with different parameters for the database and the queries. Besides, bit independence, bit balance, and quantization losses are added to the loss function following SH-BDNN. The loss function is optimized by updating the two neural networks alternately.
Deep Attention-guided Hashing (DAgH) [
182]. DAgH adopts a two-step framework similar to CNNH, but it utilizes neural networks to learn hash codes in both steps. In the first step, the objective function is the combination of the log-likelihood loss and the difference loss with a margin. In the second step, DAgH utilizes a binary point-wise cross-entropy loss for optimization. Besides, the backbone of DAgH includes a fully convolutional network with an attention module for obtaining accurate deep features.
Deep Joint Semantic-Embedding Hashing (DSEH) [
99]. DSEH is the first work to introduce LabNet into deep supervised hashing. It adopts a two-step framework with LabNet and ImgNet, respectively. LabNet is a neural network designed to capture the abundant semantic correlations among image pairs, which can help to guide the hash code learning in the second step.
\(\mathbf {f}_i\) denotes the label embedding produced from one-hot label
\(\mathbf {y}_i\). LabNet takes labels rather than images as input and learns hash codes from the labels with a general hashing loss function. In the second step, ImgNet utilizes an asymmetric loss between the label features obtained in the first step and the newly obtained image features
\(\mathbf {h}_j\), i.e.,
\(s_{ij}^h={\mathbf {f}_i}^T\mathbf {h}_j\), along with a binary cross-entropy loss similar to DAgH [
182]. DSEH makes full use of the label information from the perspectives of both pairwise loss and cross-entropy loss, which helps generate discriminative and similarity-preserving hash codes.
Asymmetric Deep Semantic Quantization (ADSQ) [
181]. ADSQ improves performance by utilizing two hashing networks and reducing the difference between the continuous network outputs and the desired hash codes; the difference loss is also involved.
Deep Anchor Graph Hashing (DAGH) [
26]. In the anchor graph, a small number of anchors are used to link the whole dataset, allowing the similarities between distinct examples to be computed implicitly. DAGH first samples a number of anchors and builds an anchor graph between the training samples and the anchors. Its loss function then consists of two parts: the first part contains a typical pairwise likelihood loss and a linear regression loss, while the second part minimizes the distances between the deep features of training samples and the binary codes of anchors belonging to the same class. This method fully utilizes the remaining labeled data during mini-batch training and helps to obtain efficient binary hash codes.