We further divide the techniques that match the distances or similarities of image pairs derived from the two spaces, i.e., the original space and the Hamming space, into two parts as follows:
3.2.1 Difference Loss Minimization.
Deep Supervised Hashing (
DSH) [
111]. DSH utilizes a network consisting of three convolutional-pooling layers and two fully connected layers. Recall that the outputs of the hashing network are
\(\lbrace \mathbf {h}_i \rbrace _{i=1}^N\). The original pairwise loss function is defined as follows:
where
\(d_{ij}^h = ||\mathbf {h}_i-\mathbf {h}_j||_2^2\),
\([\cdot ]_+\) denotes
\(\max (\cdot ,0)\) and
\(m \gt 0\) is a given margin threshold. The loss function follows a distance-similarity product minimization formulation: it encourages similar examples to be mapped to nearby binary codes and penalizes dissimilar examples only when their Hamming distances are smaller than the margin threshold
\(m\). Note that when
\(d_{ij}^h\) is larger than
\(m\), the loss produces no gradient for that pair. This idea is similar to the hinge loss function.
As discussed before, DSH relaxes the binary constraints and adds a regularizer to the continuous outputs of the hashing network, which approximate the binary codes, i.e.,
\(\mathbf {h}\approx sgn(\mathbf {h})\). The pairwise loss is rewritten as
where
\(\mathbf {1}\) denotes an all-one vector and
\(||\cdot ||_p\) produces the
\(\ell _{p}\)-norm of the vector.
\(\lambda _1\) is a parameter to balance the effect of the regularization loss. DSH does not utilize saturating non-linearities because they may slow down the training process. With the above loss function, the neural network can be trained end-to-end with the back-propagation algorithm. For evaluation, the binary codes are derived using the sign function. DSH is a straightforward early deep supervised hashing method, and its idea originates from Spectral Hashing [
171] but with a deep learning framework.
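To make the formulation concrete, below is a minimal PyTorch-style sketch of the relaxed DSH objective described above; the variable names, the margin value, the 1/2 factors, and the weight \(\lambda _1\) are illustrative assumptions rather than the authors' reference implementation.

```python
import torch

# A sketch of the relaxed DSH pairwise loss: `h1`, `h2` are real-valued network
# outputs for a batch of image pairs and `s` is 1 for similar, 0 for dissimilar pairs.
def dsh_loss(h1, h2, s, m=2.0, lambda1=0.01):
    d = (h1 - h2).pow(2).sum(dim=1)                          # squared Euclidean distance d_ij^h
    sim_term = 0.5 * s * d                                   # pull similar pairs together
    dis_term = 0.5 * (1 - s) * torch.clamp(m - d, min=0.0)   # push dissimilar pairs beyond margin m
    # regularizer encouraging outputs to approach the binary codes, i.e., |h| ~ 1
    reg = (h1.abs() - 1).abs().sum(dim=1) + (h2.abs() - 1).abs().sum(dim=1)
    return (sim_term + dis_term + lambda1 * reg).mean()
```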
Pairwise Correlation Discrete Hashing (PCDH) [
27]. PCDH utilizes four fully connected layers after the convolutional-pooling layers, named the deep feature layer, hash-like layer, discrete hash layer, and classification layer, respectively. The third layer can directly generate discrete hash codes. Different from DSH, PCDH leverages the
\(\ell _2\) norm of deep features and hash-like codes. Besides, the classification loss is included in the final function:
where
\(\mathbf {z}_i,\mathbf {h}_i,\) and
\(\mathbf {b}_i\) denote the outputs of the first three fully connected layers. The last term is the classification cross-entropy loss.
Note that the second term is called the pairwise correlation loss, which guides the similarity learning of deep features to avoid overfitting. The classification loss provides semantic supervision, which helps the model achieve competitive performance. Besides, PCDH proposes a pairwise construction module named Pairwise Hard, which samples the positive pairs with the maximum distances between deep features and randomly samples negative pairs whose distances are smaller than a threshold. In effect, Pairwise Hard chooses the hard pairs with large losses for effective hash code learning.
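The Pairwise Hard construction can be pictured with the following hedged sketch, which mines, within a batch, the positive pair with the largest deep-feature distance and a random subset of negative pairs whose distances fall below a threshold; the function name, threshold, and sample count are illustrative assumptions, not PCDH's exact procedure.

```python
import torch

def mine_hard_pairs(feats, labels, neg_threshold=1.0, num_neg=32):
    dist = torch.cdist(feats, feats)                      # pairwise deep-feature distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)     # same-class mask
    eye = torch.eye(len(feats), dtype=torch.bool)
    pos_dist = dist.masked_fill(~same | eye, float('-inf'))
    hard_pos = (pos_dist == pos_dist.max()).nonzero()[0]  # positive pair with maximum distance
    neg_candidates = ((~same) & (dist < neg_threshold)).nonzero()
    if len(neg_candidates) > num_neg:                     # random subset of close negative pairs
        neg_candidates = neg_candidates[torch.randperm(len(neg_candidates))[:num_neg]]
    return hard_pos, neg_candidates
```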
Supervised Deep Hashing (SDH) [
40]. SDH utilizes a fully connected neural network for deep hashing and has a similar loss function, except for a term that enforces a relaxed orthogonality constraint on all projection matrices (i.e., the weight matrices of the fully connected layers). A bit balance regularization is also included, which will be introduced in Equation (
14).
Supervised Hashing with Binary Deep Neural Network (SH-BDNN) [
36]. The architecture of SH-BDNN is stacked by fully connected layers, in which
\(\mathbf {W}_i\) denotes the weights in the
ith layer. SH-BDNN considers not only the bit balance, i.e., each bit takes the values \(+1\) and \(-1\) with equal probability over the dataset, but also the independence of different hash bits. Given the hash code matrix
\(\mathbf {B} = [\mathbf {b}_1, \ldots , \mathbf {b}_N]^T\), the two conditions are formulated as
where
\(\mathbf {1}\) is an
\(N\)-dimensional vector whose elements are all one, and
\(\mathbf {I}\) is an identity matrix of size
\(L\) by
\(L\). The loss function is
\(\mathbf {H}\) is stacked by the outputs of the network, and
\(\mathbf {B}\) is stacked by the binary codes to be optimized in Equation (
14).
\(\mathbf {S}\) is the pairwise similarity matrix valued 1 or
\(-\)1. The first term is the similarity difference loss, the second term is the
\(\ell _{2}\) regularization, the third term is the quantization loss, and the last two terms penalize the correlation and the imbalance of the bits, respectively. Note that the
\(\mathbf {B}\) is not the sign of
\(\mathbf {H}\). As a result, the loss function is optimized by alternately updating the network parameters and
\(\mathbf {B}\). To be specific,
\(\mathbf {B}\) is optimized with the neural network fixed, while the neural network is trained with
\(\mathbf {B}\) fixed. SH-BDNN has a well-designed loss function, which follows Kernel-based Supervised Hashing [
113]. However, the architecture does not include the popular convolutional neural network, and it is not an end-to-end model. As a result, the efficiency of this model is low on large-scale datasets.
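The bit-independence and bit-balance conditions translate into two simple penalties; the sketch below, with illustrative names, assumes \(\mathbf {H}\) holds the real-valued outputs of a batch (rows are samples, columns are bits).

```python
import torch

def independence_penalty(H):
    n, L = H.shape
    gram = H.t() @ H / n                      # (1/N) H^T H should approach the L x L identity
    return (gram - torch.eye(L)).pow(2).sum()

def balance_penalty(H):
    # each bit should be +1 for roughly half of the samples and -1 for the rest
    return H.sum(dim=0).pow(2).sum()
```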
Convolutional Neural Network Hashing (CNNH) [
173]. CNNH is the earliest deep supervised hashing framework to our knowledge. It adopts a two-step strategy. In the first step, it optimizes the objective function using a coordinate descent strategy as follows:
which generates approximate binary codes. In the second step, CNNH utilizes the obtained hash codes to train a convolutional neural network with
\(L\) output units. Besides, if class labels are available, a fully connected layer with
\(K\) output units, corresponding to the
\(K\) class labels of the images, is added, and the classification loss is added to the loss function. Although CNNH uses labels in a clumsy manner, this two-step strategy is still popular in deep supervised hashing and inspires many other state-of-the-art methods.
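The first CNNH step can be pictured with the hedged sketch below, which fits relaxed codes so that their scaled inner products reproduce the similarity matrix; for brevity it uses plain gradient descent with clipping instead of the original coordinate descent procedure, and the step count and learning rate are illustrative.

```python
import torch

def approximate_codes(S, L, steps=500, lr=0.1):
    """S: {-1, +1} pairwise similarity matrix (N x N); L: code length."""
    N = S.shape[0]
    H = torch.randn(N, L, requires_grad=True)          # relaxed codes in R^{N x L}
    opt = torch.optim.Adam([H], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((H @ H.t() / L - S) ** 2).sum()         # fit scaled inner products to S
        loss.backward()
        opt.step()
        with torch.no_grad():
            H.clamp_(-1.0, 1.0)                         # keep the relaxation bounded
    return torch.sign(H)                                # targets for the second-step network
```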
Deep Discrete Supervised Hashing (DDSH) [
82]. DDSH uses column sampling to partition the training data into
\(\lbrace \mathbf {x}_i\rbrace _{i\in \Omega }\) and
\(\lbrace \mathbf {x}_i\rbrace _{i\in \Gamma }\), where
\(\Omega\) and
\(\Gamma\) are the corresponding index sets. The loss function is designed in an asymmetric form:
where
\(\mathbf {b}_i\) and
\(\mathbf {h}_i\) are the binary code to be optimized and the output of the network, respectively.
\(\mathbf {b}_i\) and
\(\mathbf {h}_i\) are updated alternately following [
36]. DDSH is notable for adopting an asymmetric strategy for learning to hash, which benefits both binary code generation and continuous feature learning through the pairwise similarity structure.
Hashing with Binary Matrix Pursuit (HBMP) [
9]. HBMP also takes advantage of the two-step strategy introduced above. Different from CNNH, HBMP utilizes weighted Hamming distances and adopts a different traditional hashing algorithm, called binary code inference, to obtain the hash codes. In the first step, the objective function is written as
where
\(\mathbf {\Lambda }\) is a diagonal weight matrix. Note that the similarity matrix, whose elements are
\(S^h_{ij}=\mathbf {b}_i^T\mathbf {\Lambda }\mathbf {b}_j\), can be approximated by a step-wise algorithm. HBMP then trains a convolutional neural network on the obtained hash codes with a point-wise hinge loss and shows that deep neural networks help to simplify the optimization problem and produce robust hash codes.
Asymmetric Deep Supervised Hashing (ADSH) [
84]. ADSH treats the samples in the database and the query set in an asymmetric manner, which helps to train the model more effectively, especially for large-scale nearest neighbor search. ADSH contains two critical components, i.e., a feature learning part and a loss function part. The first one utilizes a hashing network to learn hash codes for query points. The second one directly learns discrete codes for the database points by minimizing the same objective function with supervised information. The loss function is formulated as
where
\(\Omega\) is the index set of query points and
\(\Gamma\) is the index set of database points. Network parameters
\(\Theta\) and binary codes
\(\mathbf {b}_j\) are updated alternately following SH-BDNN [
36] during the optimization process. If only the database points are available, we let
\(\Omega \subset \Gamma\) and add a quantization loss
\(\sum _{i \in \Omega }\Vert \mathbf {b}_i-\mathbf {h}_i\Vert _2^2\) with the coefficient
\(\gamma\). This asymmetric strategy combines deep hashing and traditional hashing, which can help achieve better performance.
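A hedged sketch of the asymmetric objective follows, assuming `h_query` are \(\tanh\)-activated outputs of the hashing network for the sampled query points, `B_db` are the database codes being optimized, and `S` is the \(\lbrace -1,+1\rbrace\) supervision; the names and the omission of the optional quantization term are simplifications.

```python
import torch

def adsh_loss(h_query, B_db, S):
    L = h_query.shape[1]              # code length
    inner = h_query @ B_db.t()        # asymmetric inner products between queries and database codes
    return ((inner - L * S) ** 2).sum()
```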
Deep Incremental Hashing Network (DIHN) [
172]. DIHN tries to learn hash codes in an incremental manner. Similar to ADSH [
84], the dataset is divided into two parts, i.e., the original database and the incremental database. When a new image arrives from the incremental database, its hash code is learned while the hash codes of the original database are kept unchanged. The optimization process still uses the strategy of alternately updating the parameters.
Deep Ordinal Hashing (DOH) [
85]. DOH generates ordinal hash codes by taking advantage of both local and global features. Specifically, two subnetworks learn the local semantics using a fully convolutional network enhanced with a spatial attention module and the global semantics using a convolutional neural network, respectively. Afterward, the two outputs are combined to produce
\(R\) ordinal outputs
\(\lbrace \mathbf {h}_i^r\rbrace _{r=1}^R\). For each segment
\(\mathbf {h}_i^r\), the corresponding hash code can be obtained as follows:
The full hash code can be obtained by concatenating \(\lbrace \mathbf {b}_i^r\rbrace _{r=1}^R\). DOH adopts an end-to-end ranking-to-hashing framework, which avoids using the non-differentiable sign function. Furthermore, it uses a relatively complex network that is able to handle large datasets with high performance.
3.2.2 Likelihood Loss Minimization.
Deep Pairwise Supervised Hashing (DPSH) [
101]. DPSH adopts CNN-F [
23] as the backbone of the hashing network and the standard form of likelihood loss based on similarity information. Besides the similarity information, a quantization loss is also introduced into the final loss function, i.e.,
where
\(s_{ij}^h=\tfrac{1}{2}\mathbf {h}_{i}^T\mathbf {h}_{j}\) and
\(\mathbf {h}_{i}\) is the output of the hashing network. Although the triplet loss was popular at that time, DPSH adopts the pairwise form to simultaneously learn deep features and hash codes, which improves both accuracy and efficiency. This likelihood loss function can easily incorporate different Bayesian priors, making it flexible in applications and often achieving better performance than difference loss functions.
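A minimal sketch of a DPSH-style objective is given below; it assumes `H` are the network outputs for a batch, `S` is the \(\lbrace 0,1\rbrace\) pairwise similarity, the binary codes are taken as \(sgn(\mathbf {h}_i)\), and the trade-off weight is illustrative.

```python
import torch
import torch.nn.functional as F

def dpsh_loss(H, S, eta=0.1):
    theta = 0.5 * H @ H.t()                            # s_ij^h = (1/2) h_i^T h_j
    nll = (F.softplus(theta) - S * theta).sum()        # negative log-likelihood of the similarities
    quant = (torch.sign(H).detach() - H).pow(2).sum()  # quantization loss toward the binary codes
    return nll + eta * quant
```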
Deep Hashing Network (DHN) [
200]. It has a likelihood loss function similar to that of DPSH. Differently, DHN treats the quantization loss as a Bayesian prior and proposes a bimodal Laplacian prior for the output
\(\mathbf {h}_i\), i.e.,
and the negative log likelihood (i.e., quantization loss) is
which can be smoothed by a smooth surrogate [
74] into
where
\(\mathbf {h}_{ik}\) is the
kth element of
\(\mathbf {h}_i\). We notice that DHN replaces the
\(\ell _2\) norm (ITQ quantization error [
48]) with the
\(\ell _1\) norm. The work in [
200] also shows that the
\(\ell _{1}\) norm is an upper bound of the
\(\ell _{2}\) norm, and the
\(\ell _{1}\) norm encourages sparsity and is easier to optimize.
HashNet [
19]. As a variant of DHN, HashNet considers the training imbalance problem that the positive (similar) pairs are far fewer than the negative (dissimilar) pairs. Hence, it adopts a
Weighted Maximum Likelihood (
WML) loss with different weights for each image pair. The weight is formulated as
where
\(\mathcal {S}_{1}=\lbrace (i,j)\in \mathcal {E}: s_{i j}^o=1\rbrace\) comprises similar image pairs, while
\(\mathcal {S}_0 = \mathcal {E}/\mathcal {S}_1\) comprises dissimilar image pairs.
\(c_{i j}=\tfrac{|\mathbf {y}_{i} \cap \mathbf {y}_{j}|}{|\mathbf {y}_{i} \cup \mathbf {y}_{j}|}\) for multi-label datasets and equals 1 for single-label datasets. Besides, the sigmoid function in the conditional probability is substituted by the adaptive sigmoid function
\(1/(1+e^{-\alpha x})\), which is equivalent to adding a hyper-parameter into the hash code similarity computation, i.e.,
\(s_{ij}^h=\alpha \mathbf {b}_i^T\mathbf {b}_j\). Different from other methods, HashNet continuously approximates the sign function through the hyperbolic tangent function.
The activation function for the outputs is \(\tanh (\beta _t \cdot)\); by increasing \(\beta _t\rightarrow \infty\) step-wise, the optimal network with \(\operatorname{sgn}(\cdot)\) activation can be derived. Besides, this operation can be interpreted as multi-stage pretraining: the deep network using activation function \(\tanh (\beta _{t+1}\cdot)\) is initialized with the well-trained network using activation function \(\tanh (\beta _t\cdot)\). The two techniques proposed by HashNet greatly increase the performance of deep supervised hashing.
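The two techniques can be combined in a hedged sketch like the following, where `Z` are pre-activation outputs, `S` is the \(\lbrace 0,1\rbrace\) similarity, `W` carries the WML weights computed as above, and the values of \(\alpha\) and \(\beta _t\) are illustrative.

```python
import torch
import torch.nn.functional as F

def hashnet_loss(Z, S, W, beta_t, alpha=0.5):
    H = torch.tanh(beta_t * Z)            # tanh(beta_t * .) approaches sgn(.) as beta_t grows
    theta = alpha * H @ H.t()             # adaptive-sigmoid similarity alpha * h_i^T h_j
    nll = F.softplus(theta) - S * theta   # pairwise negative log-likelihood
    return (W * nll).sum()                # weighted maximum likelihood
```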
Deep Priority Hashing (DPH) [
20]. DPH also adds different weights to different image pairs, but reduces the weights of pairs with higher confidence, which is similar to AdaBoost [
138]. The confidence is measured by
\(q_{ij}\), which indicates how easily a pair is classified as similar when
\(s_{ij}^o=1\) or as dissimilar when
\(s_{ij}^o=0\). In formulation,
Besides, the weight characterizing class imbalance is measured by
\(\alpha _{ij}\):
where
\(\mathcal {S}_{i}=\lbrace (i,j)\in \mathcal {E}:\forall j\rbrace\), and
The final priority weight is formulated as
where
\(\gamma\) is a hyper-parameter. With the priority cross-entropy loss, DPH down-weights confident image pairs and prioritizes difficult image pairs with low confidence. Similarly, the priority quantization loss changes the weight of each image to
\(w_i^{\prime }=(1-q_i)^\gamma\), where
\(q_i\) measures how likely a continuous output can be perfectly quantized into a binary code. In this way, DPH achieves better performance than HashNet.
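The priority weighting can be summarized with the hedged one-liner below, where `q` is the per-pair confidence, `alpha` the class-imbalance weight described above, and \(\gamma\) the hyper-parameter; the concrete value of \(\gamma\) is illustrative.

```python
import torch

def priority_weight(q, alpha, gamma=2.0):
    # confident pairs (large q) receive small weights; difficult pairs dominate the loss
    return alpha * (1.0 - q).pow(gamma)
```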
Deep Supervised Discrete Hashing (DSDH) [
100]. Besides leveraging the pairwise similarity information, DSDH also takes advantage of label information by adding a linear regression loss with regularization to the loss function. By dropping the binary restrictions, the loss is formulated as
where
\(s_{ij}^h=\tfrac{1}{2}\mathbf {h}_i^T\mathbf {h}_j\) and the label is encoded in one-hot format
\(\mathbf {y}_i\). The second term in Equation (
30) is the linear regression term and the last term is an
\(\ell _2\) regularization.
\(\lbrace \mathbf {h}_i\rbrace _{i=1}^N\),
\(\lbrace \mathbf {b}_i\rbrace _{i=1}^N\), and
\(\mathbf {W}\) are updated alternately using the gradient descent method and the discrete cyclic coordinate descent method. DSDH greatly increases the performance of image retrieval since it takes advantage of both label information and pairwise similarity information. It should be noted that, in the linear regression term, the binary codes are updated by discrete cyclic coordinate descent, so the discreteness constraint is satisfied.
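A hedged sketch of the relaxed DSDH objective follows, assuming row-wise layouts \(\mathbf {H},\mathbf {B}\in \mathbb {R}^{N\times L}\), one-hot labels \(\mathbf {Y}\in \mathbb {R}^{N\times C}\), and a regression matrix \(\mathbf {W}\in \mathbb {R}^{L\times C}\); in the actual method \(\mathbf {B}\) is updated by discrete cyclic coordinate descent rather than by back propagation.

```python
import torch
import torch.nn.functional as F

def dsdh_loss(H, B, Y, W, S, mu=1.0, nu=0.1):
    theta = 0.5 * H @ H.t()                       # s_ij^h = (1/2) h_i^T h_j
    nll = (F.softplus(theta) - S * theta).sum()   # pairwise likelihood term
    reg = (Y - B @ W).pow(2).sum()                # linear regression from codes to labels
    return nll + mu * reg + nu * W.pow(2).sum()   # plus l2 regularization on W
```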
Deep Cauchy Hashing (DCH) [
15]. DCH is a Bayesian learning framework similar to DHN, but it replaces the sigmoid function in the conditional probability with a function based on the Cauchy distribution. DCH aims at improving the search accuracy within a Hamming radius of 2. The probability based on the generalized sigmoid function can still be large when Hamming distances are much greater than 2, which is detrimental to Hamming-ball retrieval. DCH tackles this problem by incorporating the Cauchy distribution, whose probability drops rapidly once Hamming distances exceed 2. The Cauchy distribution is formulated as
where
\(\gamma\) is a hyper-parameter and
\(d_{ij}^h\) is measured by the normalized Euclidean distance, i.e.,
\(d_{ij}^h= d(\mathbf {h}_i,\mathbf {h}_j)=\tfrac{L}{2}(1-\cos (\mathbf {h}_i,\mathbf {h}_j))\). Besides, the prior is based on a variant of the Cauchy distribution, i.e.,
The final loss function is formulated as the log-likelihood plus the quantization loss based on the prior weight. However, this loss function tends to produce almost identical hash codes for images with the same label. Even worse, the relationship between dissimilar pairs is not well considered.
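The Cauchy-based probability can be sketched as follows under the normalized-Euclidean distance surrogate given above; the value of \(\gamma\) is illustrative.

```python
import torch
import torch.nn.functional as F

def cauchy_probability(H, gamma=20.0):
    L = H.shape[1]
    Hn = F.normalize(H, dim=1)
    d = 0.5 * L * (1.0 - Hn @ Hn.t())   # d_ij^h = (L/2)(1 - cos(h_i, h_j))
    return gamma / (gamma + d)          # drops quickly once the distance grows beyond gamma
```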
Maximum-Margin Hamming Hashing (MMHH) [
87]. In view of the shortcomings of DCH, MMHH utilizes the t-distribution and employs different objective functions for similar and dissimilar pairs. The total loss is the weighted sum of the two losses. Besides, a margin
\(\zeta\) is utilized to avoid producing exactly the same hash codes. The Cauchy distribution in DCH is replaced by
The loss function is the weighted log-likelihood of conditional probability, i.e.,
The last term is a standard quantization loss. MMHH also proposes a semi-batch optimization strategy to alleviate the imbalance problem. Specifically, the binary codes of the training data are stored in an extra memory bank. The pairwise loss is calculated with the new codes computed in the current epoch together with their similar and dissimilar pairs from the memory, and the memory bank is then updated for the next epoch. In general, MMHH addresses the shortcomings of DCH, which greatly improves search performance.
Deep Fisher Hashing (DFH) [
103]. DFH points out that pairwise loss minimization is similar to Fisher's linear discriminant, which maximizes the gaps between inter-class examples while minimizing the gaps between intra-class examples. Its logistic loss function is similar to that of MMHH, and the final loss function is formulated as
in which
\(\epsilon\) is a margin parameter. Besides, a quantized center loss is added to the objective function, which not only minimizes the intra-class distances but also maximizes the inter-class distances between the binary hash codes.
Deep Asymmetric Pairwise Hashing (DAPH) [
139]. Similar to ADSH, DAPH also adopts an asymmetric strategy. The difference is that DAPH uses two networks with different parameters for the database and the queries. Besides, bit independence, bit balance, and quantization losses are added to the loss function following SH-BDNN. The loss function is optimized by updating the two neural networks alternately.
Deep Attention-guided Hashing (DAgH) [
182]. DAgH adopts a two-step framework similar to CNNH, but it utilizes neural networks to learn hash codes in both steps. In the first step, the objective function is the combination of the log-likelihood loss and the difference loss with a margin. In the second step, DAgH utilizes a binary point-wise cross-entropy loss for optimization. Besides, the backbone of DAgH includes a fully convolutional network with an attention module for obtaining accurate deep features.
Deep Joint Semantic-Embedding Hashing (DSEH) [
99]. DSEH is the first work to introduce LabNet into deep supervised hashing. It adopts a two-step framework with LabNet and ImgNet, respectively. LabNet is a neural network designed to capture the abundant semantic correlations among image pairs, which can help to guide the hash code learning in the second step.
\(\mathbf {f}_i\) denotes the label embedding produced from one-hot label
\(\mathbf {y}_i\). LabNet takes labels rather than images as input and learns hash codes from the labels with a general hashing loss function. In the second step, ImgNet utilizes an asymmetric loss between the label features obtained in the first step and the newly obtained image features
\(\mathbf {h}_j\), i.e.,
\(s_{ij}^h={\mathbf {f}_i}^T\mathbf {h}_j\), along with a binary cross-entropy loss similar to DAgH [
182]. DSEH makes full use of the label information from the perspectives of both pairwise loss and cross-entropy loss, which helps generate discriminative and similarity-preserving hash codes.
Asymmetric Deep Semantic Quantization (ADSQ) [
181]. ADSQ improves performance by utilizing two hashing networks and reducing the difference between the continuous network outputs and the desired hash codes; the difference loss is also involved.
Deep Anchor Graph Hashing (DAGH) [
26]. In the anchor graph, a small number of anchors are used to link the whole dataset, allowing the similarities between distinct examples to be computed implicitly. DAGH first samples a number of anchors and builds an anchor graph between the training samples and the anchors. Its loss function then consists of two parts: the first part contains a typical pairwise likelihood loss and a linear regression loss, while the second part minimizes the distances between the deep features of training samples and the binary codes of anchors belonging to the same class. This method fully utilizes the remaining labeled data during mini-batch training and helps to obtain efficient binary hash codes.