research-article

Open access

CL²R: Compatible Lifelong Learning Representations

Authors:

Alberto Del Bimbo

ACM Transactions on Multimedia Computing, Communications and Applications, Volume 18, Issue 2s

Article No.: 132, Pages 1 - 22

https://doi.org/10.1145/3564786

Published: 06 January 2023


Abstract

In this article, we propose a method to partially mimic natural intelligence for the problem of lifelong learning representations that are compatible. We take the perspective of a learning agent that is interested in recognizing object instances in an open dynamic universe in a way in which any update to its internal feature representation does not render the features in the gallery unusable for visual search. We refer to this learning problem as Compatible Lifelong Learning Representations (CL²R), as it considers compatible representation learning within the lifelong learning paradigm. We identify stationarity as the property that the feature representation is required to hold to achieve compatibility and propose a novel training procedure that encourages local and global stationarity on the learned representation. Due to stationarity, the statistical properties of the learned features do not change over time, making them interoperable with previously learned features. Extensive experiments on standard benchmark datasets show that our CL²R training procedure outperforms alternative baselines and state-of-the-art methods. We also provide novel metrics to specifically evaluate compatible representation learning under catastrophic forgetting in various sequential learning tasks. Code is available at https://github.com/NiccoBiondi/CompatibleLifelongRepresentation.

1 Introduction

The universe is dynamic, and the emergence of novel data and new knowledge is unavoidable. The unique capacity of natural intelligence for lifelong learning is highly dependent on memory and knowledge representation [18]. Through memory and knowledge representation, natural intelligent systems continually search, recognize, and learn new objects in an open universe after exposure to one or a few samples. Memory is essentially a cognitive function that encodes, stores, and retrieves knowledge. Artificial representations learned by Deep Convolutional Neural Network (DCNN) models [3, 61, 63, 64, 76] and stored in a memory bank (i.e., the gallery-set) have been shown to be quite effective in searching and recognizing objects in an open-set/open-world learning context. Successful examples are face recognition [10, 14, 59], person re-identification [78, 79, 80], and image retrieval [19, 65, 73].

These approaches rely on learning feature representations from static datasets in which all images are accessible at training time. However, dynamic assimilation of new data for lifelong learning suffers from catastrophic forgetting: the tendency of neural networks to abruptly forget previously learned information [37, 52].

In the case of visual search, even when catastrophic forgetting is avoided by repeatedly training DCNN models on both old and new data, the feature representation still irreversibly changes [31]. Thus, to benefit from the newly learned model, features stored in the gallery must be reprocessed and the “old” features replaced with the “new” ones. Reprocessing requires not only the storage of the original images (a noticeable departure from natural intelligence) but also authorization to access them [66]. More importantly, extracting new features at each update of the model is computationally expensive, or even infeasible in the case of large gallery-sets. The speed at which the representation is updated to benefit from the newly learned data may impose time constraints on the re-indexing process. These constraints range from timescales on the order of weeks/months, as in retrieval systems or social networks [62], to within seconds, as in autonomous robotics or real-time surveillance [43, 48]. Recently, in the work of Shen et al. [62], a novel training procedure was proposed to avoid re-indexing the gallery-set. The representation obtained in this manner is said to be compatible, as the features before and after the learning upgrade can be directly compared. Their training takes advantage of all data from previous tasks (i.e., no lifelong learning), guaranteeing the absence of catastrophic forgetting. The advantage of considering compatible representation learning within the lifelong learning paradigm, as in this work, is that compatible representation allows visual search systems not only to distribute the computation over time but also to avoid or possibly limit the storage of images on private servers for gallery data. This can have important implications for the societal debate related to privacy, ethics, and sustainability issues (e.g., carbon footprint) of modern AI systems [11, 49, 60, 66].

We identify stationarity as the key requirement for feature representation to be compatible during lifelong learning. Stationary features have been shown to be biologically plausible in many studies of working memory in the prefrontal cortex of macaques [33, 39, 40]. Some works [39, 40] decoded the information from the neural activity of the working memory using a classifier with a single fixed set of weights. They noted that a non-stationary feature representation seems to be biologically problematic since it would imply that the synaptic weights would have to change continuously for the information to be continuously available in memory.

Inspired by this, in this article, we formalize the problem of Compatible Lifelong Learning Representations (CL²R) in relation to the relevant areas of compatible learning and lifelong (continual) learning. We refer to any training procedure that aims to obtain compatible features and minimize catastrophic forgetting as CL²R training, and we propose (1) a novel set of metrics to properly evaluate CL²R training procedures and (2) a training procedure based on rehearsal [52, 54] and feature stationarity [46, 47] to jointly address catastrophic forgetting and feature compatibility. Figure 1 provides an overview of the problem and the training procedure. Specifically, our CL²R training procedure is achieved by encouraging global and local stationarity in the learned features.

Fig. 1.

The rest of the article is organized as follows. In Section 2, we discuss related work, and in Section 3, we highlight our contributions. Section 4 presents the formulation of CL²R, Section 5 proposes new metrics to evaluate compatibility, and Section 6 describes a new training procedure. In Section 7, we compare our results with adapted state-of-the-art methods. Section 8 presents the ablation study. We conclude in Section 9.

2 Related Work

Compatible learning. The work proposed by Shen et al. [62], called Backward-Compatible Training (BCT), first formalizes the problem of learning compatible representation to avoid re-indexing. The method takes advantage of an influence loss that encourages the feature representation toward one that can be used by the old classifier. The old classifier is fixed while learning with the novel data (i.e., its parameters are no longer updated by back-propagation) and cooperates with the new representation model. Cooperation is achieved by aligning the prototypes of the new classifier with the prototypes of the old fixed one. The underlying assumption is that the upgraded feature representation follows the representation learned by the old classifier. BCT has been evaluated in scenarios without the effects of catastrophic forgetting by repeatedly training DCNN models on both old and new images (i.e., jointly re-training from scratch at each upgrade). To compare with this learning strategy in a lifelong learning scenario instead of starting from scratch every time, we have added to BCT the capability of learning by fine-tuning the previously learned model according to a memory-based rehearsal strategy [52, 54].

Compatibility under catastrophic forgetting has been implicitly studied in the work of Iscen et al. [25] (FAN), in which the authors presented a method for storing features instead of images in Class-incremental Learning (CiL). They introduce a feature adaptation function to update the preserved features as the network learns novel classes. We compared to this method by storing the updated preserved features obtained at each task. Although designed to improve classification accuracy, the work can be considered close to a lifelong learning approach with compatible representation, in which the feature adaptation function implicitly addresses the problem of feature compatibility as in other works [6, 23, 38, 68]. Differently from BCT, these methods do not completely avoid the cost of re-indexing, since the learned mappings must be evaluated every time the dataset is upgraded, and they are therefore not suited to lifelong learning and/or large gallery-sets. For example, the ResNet-101 architecture is one order of magnitude slower than the mapping proposed in the work of Chen et al. [6]; therefore, when the size of the gallery increases by an order of magnitude, applying the mapping costs as much as re-indexing the images. The method described in the work of Ramanujan et al. [51] trains, in addition to the current feature model, an auxiliary model from the same data in a different way (i.e., using self-supervised learning). The auxiliary feature model is then used with future learned models to learn a mapping model to obtain compatible representations as in other works [25, 38, 68]. The underlying assumption is that, as the auxiliary feature model is trained with a different strategy, it encodes different knowledge that may facilitate learning the mapping between the representation spaces.

Compatibility of the representation in a more general sense has been considered in the work of Li et al. [31] and Wang et al. [70], where similarity between features extracted from identical architectures and trained from different initialization has been extensively evaluated. The work of Budnik and Avrithis [5] avoids re-indexing the gallery, although the new model used for queries is not trained on more data. Their work is motivated by the scenario where the gallery is indexed by a large model and the queries are captured from mobile devices in which the use of small models is the only viable solution.

Lifelong learning. Lifelong learning or continual learning studies the problem of learning from a non-i.i.d. stream of data with the goal of assimilating new knowledge while preventing catastrophic forgetting [9, 37]. Methods for preventing catastrophic forgetting have been explored primarily in the classification task, where catastrophic forgetting often manifests itself as a significant drop in classification accuracy [2, 13, 35, 41, 67]. The key aspects that distinguish lifelong feature learning for visual search from classification are the following: (i) categorical data often have coarser granularity than visual search data, (ii) evaluation metrics do not involve classification accuracy, and (iii) class labels are not required to be explicitly learned. These differences may suggest that these two catastrophic forgetting occurrences are of different origins. In this context, recent works have discussed the importance of the specific task in assessing catastrophic forgetting of learned representations [1, 7, 8, 12, 47, 50]. Among others, empirical evidence presented in the work of Davari and Belilovsky [12] suggests that feature forgetting is not as catastrophic as classification forgetting and that many approaches that address the problem of catastrophic forgetting do not improve feature forgetting in terms of the usefulness of the representation. We argue that such evidence is relevant in visual search and that it can be exploited with techniques that further encourage learning compatible feature representation. Accordingly, we consider CiL as the basic building block for the general purpose of learning feature representation incrementally.

In this article, the focus is on CiL methods based on Knowledge Distillation (KD) [21] and rehearsal [55], which are known to be versatile, effective, and widely applicable to reduce catastrophic forgetting. We leverage the classification task in CiL as a surrogate task to learn feature representation as typically performed in face/body identification and retrieval [14, 65, 79]. The work of Li and Hoiem [32] first introduces KD in lifelong learning as an effective way to preserve the knowledge previously acquired from old tasks. In iCaRL [53], KD is combined with rehearsal, which reserves exemplars of the classes already seen in an episodic memory. The BiC work, proposed by Wu et al. [71], extends the work of Rebuffi et al. [53] by developing a bias correction layer to recalibrate the output probabilities, learning an additional linear layer on a small set of data. Along a similar vein, in the work of Zhao et al. [77], the bias correction is performed by aligning the norms of the weight vectors of the classifier for new classes to those for old classes without using additional model parameters or reserved data. The work of Romero et al. [56] introduces Feature Distillation (FD), a distillation loss evaluated on the feature vectors instead of on the classifier outputs. FD has recently been successfully applied by Hou et al. [22] (LUCIR) and Douillard et al. [16] (PODNet) to reduce catastrophic forgetting. Differently from LUCIR, PODNet uses a spatial-based distillation loss to constrain the statistics of intermediate features after each residual block. Similar to LUCIR, PODNet, and many other works on continual/lifelong learning in the literature, our problem formulation takes advantage of the general concept of KD. Differently from these works, our approach is novel in that it considers FD for the dual purpose of learning feature compatibility and mitigating feature forgetting. The work of Iscen et al. [25] (FAN), also discussed above, combines strategies from other works [22, 32, 53] to learn and preserve previous features. Although the work does not consider the compatibility problem, it is the closest work to our approach. Recently, Yan et al. [72] (DER) showed an interesting performance improvement in CiL by freezing the previously learned representation and expanding its dimension from a new learnable feature extractor. Despite the clear improvements in classification performance, this approach cannot be trivially exploited in compatible training, as the varying dimensions across tasks do not allow direct application of nearest-neighbor search between models. Features with different dimensions typically need to be projected into a single common space before nearest-neighbor search can be applied. The FOSTER method [69] improves upon DER by addressing this specific problem with a trainable linear layer that maps the growing feature vector into a fixed dimension. More generally, CiL methods addressing catastrophic forgetting are in a certain sense related to compatible representation, since forgetting is the change in the feature representation of classifiers that will be learned in the future. We evaluate these methods as baselines to quantify the level of lifelong-compatible representation they intrinsically may have.

3 Main Contributions

(1) We consider compatible representation learning within the lifelong learning paradigm. We refer to this general learning problem as CL²R.

(2) We define a novel set of metrics to properly evaluate CL²R training procedures.

(3) We propose a CL²R training procedure that imposes global and local stationarity on the learned features to achieve compatibility between representations under catastrophic forgetting. Global and local interactions show a significant performance improvement when local stationarity is promoted only from already observed samples in the episodic memory.

(4) We empirically assess the effectiveness of our approach in several benchmarks showing improvements over baselines and adapted state-of-the-art methods.

4 CL²R Problem Formulation

In a CL²R setting, a sequence of representation models, $\lbrace \phi _t \rbrace _{t=1}^{T}$ , is learned incrementally with a sequence of T tasks, $\lbrace (\mathcal {D}_t, K_t) \rbrace _{t=1}^T$ , where $\mathcal {D}_t$ are the images of the t-th task represented by $K_t$ different classes. Specifically, each task is disjoint from the others: $K_k \cap K_t= \emptyset$ with $t \ne k$ . The learned representation model $\phi _t$ is used to transform the query images into feature vectors that are used to retrieve the images most similar to a set of given gallery images transformed with a previous model $\phi _k$ . Specifically, we indicate with the couple $\mathcal {G}=(I_\mathcal {G},F_\mathcal {G})$ the gallery-set, where ${I}_\mathcal {G}=\lbrace \mathbf {x}_i\rbrace _{i=1}^N$ is the image collection from which the features $F_\mathcal {G}=\lbrace \mathbf {f}_i \rbrace _{i=1}^N$ are extracted, and N is the number of elements of the two sets. Without loss of generality, we assume that the features in $F_\mathcal {G}$ are extracted using the representation model $\phi _{ k}:{\mathbb {R}}^D \rightarrow {\mathbb {R}}^d$ that transforms an image $\mathbf {x} \in {\mathbb {R}}^D$ into a feature vector $\mathbf {f} \in {\mathbb {R}}^d$ , where d and D are the dimensionality of the feature and the image space, respectively. Analogously, we will refer to $\mathcal {Q}=(I_\mathcal {Q},F_\mathcal {Q})$ as the query-set, where $I_\mathcal {Q}$ and $F_\mathcal {Q}$ are the corresponding image-set and the feature-set, respectively. As the t-th task becomes available, the model $\phi _{t}$ is incrementally learned from the previous one along with the new task data $\mathcal {D}_t$ . 
Our goal is to design a training procedure to learn the model $\phi _{t}$ so that any query image transformed with it can be used to perform visual search through some distance ${\rm dist}:{\mathbb {R}}^d \times {\mathbb {R}}^d \rightarrow \mathbb {R}_+$ to identify the closest features ${F}_\mathcal {G}$ to the query features ${F}_\mathcal {Q}$ without forgetting the previous representation and without computing $F_\mathcal {G}=\lbrace \mathbf {f} \in \mathbb {R}^d \, | \, \mathbf {f} = \phi _{t}(\mathbf {x}) \, \forall \mathbf {x} \in I_\mathcal {G}\rbrace$ (i.e., re-indexing). If this holds, then the resulting representation $\phi _{t}$ is said to be lifelong compatible with $\phi _{k}$ .
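The search setting described above can be illustrated with a minimal sketch. It assumes a cosine distance as `dist` and uses made-up three-dimensional features; the function names `cosine_distance` and `search` are hypothetical and not taken from the article or its code release. The point is only that the gallery features, extracted once with an old model, are compared directly with a query feature from the upgraded model, with no re-indexing step.

```python
import math

def cosine_distance(a, b):
    """One possible choice of dist: 1 - cos(a, b). Assumed here, not prescribed by the text."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def search(query_feature, gallery_features):
    """Return the index of the stored gallery feature closest to the query feature."""
    return min(range(len(gallery_features)),
               key=lambda i: cosine_distance(query_feature, gallery_features[i]))

# Toy gallery features F_G, extracted once with an old model phi_k and never recomputed.
gallery_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Toy query feature, as it might be produced by the upgraded model phi_t.
query_feature = [0.1, 0.9, 0.2]

best = search(query_feature, gallery_features)
```

If the two models are lifelong compatible, a search of this kind is expected to remain accurate even though the query and gallery features come from different models.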

The main challenge of the CL²R problem is to jointly alleviate catastrophic forgetting and learn a compatible representation between the previously learned models. In Figure 1, we illustrate the complete CL²R training example using rehearsal to alleviate the effects of catastrophic forgetting.

5 Compatibility Evaluation

A representation model $\phi _{\rm new}$ upgraded with new data is said to be compatible with an old representation model $\phi _{\rm old}$ when it holds [62]:

\begin{equation} M\big (\phi _{\rm new}^{\mathcal {Q}}, \phi _{\rm old}^{\mathcal {G}} \big) \gt {M} \big (\phi _{\rm old}^{\mathcal {Q}}, \phi _{\rm old}^{\mathcal {G}} \big). \end{equation}

(1)

Equation (1) represents the Empirical Compatibility Criterion (ECC), where ${M}$ is an evaluation metric specific to the given visual search problem. Notable examples of the metric M can be found in face verification accuracy [24, 30], face verification/identification accuracy in terms of true acceptance rate and false acceptance rate (TAR $@$ FAR) [27], and person re-identification mean average precision (mAP) [74]. The intuition of these metrics is based on the observation that they can be instantiated with two different representation models $\phi _{\rm new}$ and $\phi _{\rm old}$ when considering the query-gallery pair. The specific notation ${M} (\phi _{\rm new}^{\mathcal {Q}}, \phi _{\rm old}^{\mathcal {G}})$ defines the cross-test between the new and the old model, and it represents the case in which $\phi _{\rm new}$ is used to extract the features of the query-set, $F_\mathcal {Q}$ , whereas $\phi _{\rm old}$ is used to extract the gallery-set ones, $F_\mathcal {G}$ . ${M} (\phi _{\rm old}^{\mathcal {Q}}, \phi _{\rm old}^\mathcal {G})$ is the self-test, and it represents the case in which both query and gallery features are extracted with $\phi _{\rm old}$ . When the model is trained incrementally on T tasks, Equation (1) is evaluated according to the multi-model ECC introduced by Biondi et al. [4]:

\[\begin{eqnarray} M \big (\phi _t^\mathcal {Q}, \phi _k^\mathcal {G} \big) \gt M \big (\phi _k^\mathcal {Q}, \phi _k^\mathcal {G} \big) {\rm \quad with \:} t \gt k, \end{eqnarray}\]

(2)

where $t, k \in \lbrace 1,2,\ldots ,T\rbrace$ refer to two different tasks such that task k is processed by the model before task t. The model $\phi _t$ is compatible with the model $\phi _k$ , when the cross-test $M (\phi _t^\mathcal {Q}, \phi _k^\mathcal {G})$ between $\phi _t$ and $\phi _k$ is greater than the self-test $M (\phi _k^\mathcal {Q}, \phi _k^\mathcal {G})$ of the model $\phi _k$ . The underlying intuition is that if the performance of matching the gallery feature vectors extracted with the old model with the query feature vectors extracted with the new model (i.e., cross-test) is better than the performance of matching the gallery feature vectors with the query feature vectors both extracted with the old model (i.e., self-test), then the system is learning compatible representations. In other words, learning from the new task data improves the representation without breaking the compatibility with the previously learned model. Based on Equation (2), the compatibility matrix C is defined as follows:

\begin{equation} C_{t, k} = {\left\lbrace \begin{array}{ll} M \big (\phi _t^\mathcal {Q}, \phi _k^\mathcal {G} \big) & \text{if } t \gt k \\ M \big (\phi _k^\mathcal {Q}, \phi _k^\mathcal {G} \big) & \text{if } t = k \\ \qquad 0 & \text{if } t \lt k \end{array}\right.}, \end{equation}

(3)

where the element in row t and column k of the compatibility matrix denotes the evaluation metric M computed with model $\phi _t$ on the query-set and model $\phi _k$ on the gallery-set. This definition combines the basic intuition of the classification accuracy matrix R defined elsewhere [15, 34], used to evaluate the CiL problem, with the two specific aspects that distinguish the CL²R learning setting from the CiL one. Namely, (i) in CiL at each task, the train and test data are sampled from the same distribution, whereas in CL²R, the test-set classes are sampled from an unknown distribution (i.e., CL²R addresses the open-set recognition problem); (ii) in CiL, the test-set is dynamic (i.e., it grows including images from the task distributions), whereas in CL²R, it is assumed static for the purpose of a reliable evaluation [62]. In the CL²R setting, a dynamic test-set, as used in CiL, is difficult to define, as there are infinitely many ways to make the gallery dynamic and each of them may unexpectedly change the outcome of the evaluation. We follow Shen et al. [62] and perform the evaluation assuming a static test-set (i.e., a static query-gallery pair). Accordingly, we set the elements of the matrix C with $t\lt k$ to zero to indicate the impossibility of a reliable evaluation of a growing test-set that should be sampled from an unknown changing distribution. For the remaining elements, the cross-test values are the elements of the matrix with $t \gt k$ , whereas the self-test values are those of the main diagonal (i.e., when $t = k$ ). Given a compatibility matrix C, the average compatibility (AC) is defined as follows:

\begin{equation} AC = \frac{2}{T(T-1)} \sum \limits _{1 \le k \lt t \le T}{1\!\!1} \Big (M \big (\phi _t^\mathcal {Q}, \phi _k^\mathcal {G} \big) \gt M \big (\phi _k^\mathcal {Q}, \phi _k^\mathcal {G} \big) \Big), \end{equation}

(4)

where ${1\!\!1}(\cdot)$ denotes the indicator function. AC summarizes the compatibility matrix values in a single number that quantifies the number of times that compatibility is verified against all possible $\frac{T(T-1)}{2}$ occurrences.
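As an illustration of Equations (3) and (4), the following sketch builds a compatibility matrix from a metric callable and computes AC. The callable `metric(t, k)`, standing for $M(\phi_t^\mathcal{Q}, \phi_k^\mathcal{G})$ with 1-indexed tasks, and the toy values are assumptions made for the example; they are not results from the article.

```python
def compatibility_matrix(metric, num_tasks):
    """Equation (3): C[t][k] = M(phi_t^Q, phi_k^G) for t >= k, and 0 for t < k.

    `metric(t, k)` is a hypothetical callable taking 1-indexed task ids.
    """
    return [[metric(t, k) if t >= k else 0.0
             for k in range(1, num_tasks + 1)]
            for t in range(1, num_tasks + 1)]

def average_compatibility(C):
    """Equation (4): fraction of (t, k) pairs with t > k whose cross-test beats the self-test of k."""
    T = len(C)  # requires T >= 2
    hits = sum(1 for t in range(T) for k in range(t) if C[t][k] > C[k][k])
    return 2.0 * hits / (T * (T - 1))

# Made-up metric values for T = 3 tasks, keyed by (t, k).
toy_values = {(1, 1): 0.60,
              (2, 1): 0.62, (2, 2): 0.65,
              (3, 1): 0.58, (3, 2): 0.66, (3, 3): 0.70}

C = compatibility_matrix(lambda t, k: toy_values[(t, k)], 3)
ac = average_compatibility(C)
```

In this toy case, two of the three cross-tests exceed the corresponding self-test, so AC is 2/3.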

5.1 Proposed CL²R Metrics

The work of Díaz-Rodríguez et al. [15] and Lopez-Paz and Ranzato [34] proposes a set of metrics to assess the ability of the learner to transfer knowledge based on a matrix that reports the test classification accuracy of the model on task j after learning task i. Along a similar vein, we present a set of metrics to evaluate the compatibility between representation models in a compatible lifelong learning setting.

Let $C \in \mathbb {R}^{ T \times T}$ be the compatibility matrix of Equation (3) for T tasks, and the proposed criteria are the following:

(1) Backward compatibility (BC) measures the gap in compatibility performance between the representation learned at task T and the representation learned at task k with $k \in \lbrace 1, \ldots , T-1\rbrace$ . When BC $\lt 0$ , the learning procedure is also influenced by catastrophic forgetting because the performance degrades with newer learned tasks. BC is defined as follows:

\begin{equation} \mbox{ $BC$} = \frac{1}{T-1} \sum _{k=1}^{T-1} \left(C_{T,k} - C_{k,k}\right). \end{equation}

(5)

(2) Forward compatibility (FC) estimates the influence that learning a representation on a task $k-1$ has on the compatibility performance of the representation learned on a future task k by comparing the cross-test (between models at task k and $k-1$ ) with respect to the self-test at task k. FC $\ge 0$ denotes that, on average, the cross-test values are at least as large as the self-test values evaluated on the subsequent tasks, and therefore re-indexing does not necessarily provide improved results. FC is defined as follows:

\begin{equation} \mbox{ $FC$} = \frac{1}{T-1} \sum _{k=2}^{T} \left(C_{k,k-1} - C_{k,k}\right). \end{equation}

(6)

The intuition behind the definition of this metric comes from noticing that, as the number of tasks increases, the cross-test may turn out better than the self-test. As this is not typically observed when there is no catastrophic forgetting (i.e., when repeatedly training with new and old data), we argue this is due to the joint interaction between the compatibility constraint and catastrophic forgetting. This observation led us to treat as “positive” the case in which the cross-test with the previously learned model is higher than the self-test of the current model. This metric is designed to yield high values when a CL²R training procedure is able to positively exploit the joint interaction between feature forgetting and compatible representation.

From Equations (5) and (6), it can be deduced that BC and FC $\in [-1,1]$ when the metric M takes values in $[0,1]$ . Backward compatibility for the first task and forward compatibility for the last task are not defined. The larger these metrics, the better the model. When AC values are comparable, BC and FC quantify the positive interaction between search accuracy under catastrophic forgetting and compatibility. This allows evaluating how catastrophic forgetting affects the representation and its compatibility.
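Equations (5) and (6) can be computed directly from a compatibility matrix, as in the sketch below. The matrix holds made-up values chosen for the example (they are not results from the article), and the function names are hypothetical.

```python
def backward_compatibility(C):
    """Equation (5): mean of C[T][k] - C[k][k] over k = 1..T-1 (0-indexed below)."""
    T = len(C)
    return sum(C[T - 1][k] - C[k][k] for k in range(T - 1)) / (T - 1)

def forward_compatibility(C):
    """Equation (6): mean of C[k][k-1] - C[k][k] over k = 2..T (0-indexed below)."""
    T = len(C)
    return sum(C[k][k - 1] - C[k][k] for k in range(1, T)) / (T - 1)

# Toy compatibility matrix for T = 3: row t is the query model, column k the gallery model.
C = [[0.60, 0.00, 0.00],
     [0.62, 0.65, 0.00],
     [0.58, 0.66, 0.70]]

bc = backward_compatibility(C)
fc = forward_compatibility(C)
```

With these toy values both metrics are slightly negative: the last model's cross-tests do not, on average, exceed the earlier self-tests, and each cross-test with the immediately preceding model falls below the current self-test.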

As BC evaluates the relationship between the representations learned at the final task T and the previous ones, it is possible to follow the evolution of this relationship during CL²R training. Accordingly, we define the backward compatibility at task t as $BC{(t)} = \frac{1}{t-1} \sum _{k=1}^{t-1} (C_{t,k} - C_{k,k})$ , where $t \in \lbrace 2, \ldots , T\rbrace$ . This represents the average of the element-wise difference between the first $t-1$ elements of the t-th row and the first $t-1$ elements of the main diagonal of the compatibility matrix.

6 Proposed CL²R Training

To achieve compatibility, we encourage global and local stationarity in the feature representation.

Global stationarity is encouraged according to the approach described in the work of Pernici et al. [46], in which features are learned to follow a set of special fixed classifier prototypes. Pernici et al. [46] impose global stationarity using a classifier in which prototypes cannot be trained (i.e., fixed) and are set before training. Under this condition, only the direction of the features aligns toward the fixed directions of the classifier prototypes and not the opposite. This constraint forces the learned features to follow their corresponding fixed prototypes, therefore encouraging representation stationarity. The functionality of a trainable classifier is essentially taken over by the preceding layers. Fixed prototypes are set according to the coordinate vertices of a d-Simplex regular polytope that, in addition to stationarity, allows maximally separated features to be learned [44, 45].

We take advantage of this result and perform CiL as a surrogate task to learn stationary feature representations that achieve compatibility. More formally, let $\mathbf {W}$ be the d-Simplex fixed classifier, shared by all tasks $t \in \lbrace 1, 2, \ldots , T\rbrace$ ; we instantiate the CiL problem as $\sigma (\phi _t \circ \mathbf {W})$ , where $\sigma$ indicates the softmax function, and perform learning according to incremental fine-tuning. The evolving training-set $\mathcal {T}_t \leftarrow \mathcal {M}_{t} \cup \mathcal {D}_t$ is computed according to a rehearsal-based strategy using the episodic memory, $\mathcal {M}_{t}$ , which contains an updating set of samples from $\lbrace \mathcal {D}_1, \ldots , \mathcal {D}_{t-1} \rbrace$ . The memory is updated as $\mathcal {M}_{t+1} \leftarrow \mathcal {M}_{t} \cup \text{Sampling}(\mathcal {D}_{t})$ . The loss optimized in the work of Pernici et al. [46] is adapted to CL²R training as follows:

\[\begin{eqnarray} \mathcal {L}_t= -\dfrac{1}{|\mathcal {T}_{t}|} \sum \limits _{\mathbf {x}_i \in \mathcal {T}_{t}} \log \! \left(\dfrac{\exp \big ({\mathbf {w}}_{y_i}^{\top }\cdot \phi _t(\mathbf {x}_i) \big)}{\sum \nolimits _{j \in K_s} \exp \big ({\mathbf {w}}_{j}^{\top }\cdot \phi _t(\mathbf {x}_i) \big) + \sum \nolimits _{j \in K_u} \exp \big ({\mathbf {w}}_{j}^{\top }\cdot \phi _t(\mathbf {x}_i) \big)} \right) , \end{eqnarray}\]

(7)

where $K_s$ is the set of classes learned up to time t, $|\mathcal {T}_{t}|$ is the number of elements in the training-set, $K_u$ is the set of the outputs of the classifier that have not yet been assigned to classes at time t (i.e., future unseen classes [47]), $\mathbf {w}^{\top }_{(\cdot)}$ is a class prototype of the fixed classifier $\mathbf {W}$ , and $y_i$ is the supervising label. In particular, $\mathbf {W}$ is the weight matrix of the fixed classifier, which does not undergo learning during model training. In the work of Pernici et al. [46], the d-Simplex prototypes are defined as $\mathbf {W} = \lbrace e_1,e_2,\dots ,e_{d-1}, \alpha \sum _{i=1}^{d-1} e_i \rbrace ,$ where d is the feature dimensionality of the d-Simplex, $\alpha =\frac{1-\sqrt {d+1}}{d}$ , and $e_i$ denotes the standard basis in $\mathbb {R}^{d-1}$ , with $i \in \lbrace 1,2, \dots , d-1\rbrace$ .
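A minimal sketch of the fixed simplex prototypes and of a loss in the spirit of Equation (7) is given below. It is an illustration under stated assumptions, not the authors' implementation: the prototypes are parametrized by n, the dimension of the space in which the basis vectors live (how n maps to the symbol d above is left open here), the features are toy vectors rather than network outputs, and the names `simplex_prototypes` and `fixed_classifier_loss` are hypothetical. The softmax denominator runs over every prototype, which corresponds to summing over both $K_s$ and $K_u$.

```python
import math

def simplex_prototypes(n):
    """Return n + 1 mutually equidistant vertices in R^n: e_1..e_n plus alpha * (1, ..., 1)."""
    alpha = (1.0 - math.sqrt(n + 1.0)) / n
    basis = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return basis + [[alpha] * n]

def fixed_classifier_loss(features, labels, prototypes):
    """Mean cross-entropy where the logits are dot products with fixed (non-trainable) prototypes.

    The normalizer sums over all prototypes, i.e. over assigned (K_s) and
    not yet assigned (K_u) classifier outputs.
    """
    total = 0.0
    for f, y in zip(features, labels):
        logits = [sum(w_i * f_i for w_i, f_i in zip(w, f)) for w in prototypes]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_norm - logits[y]
    return total / len(features)

W = simplex_prototypes(4)               # 5 fixed prototypes in R^4
aligned = [5.0 * v for v in W[0]]       # toy feature pointing along prototype 0
loss_correct = fixed_classifier_loss([aligned], [0], W)
loss_wrong = fixed_classifier_loss([aligned], [1], W)
```

Since the prototypes never change, the only way to lower this loss is to move the features toward the prototype of their class, which is the mechanism the text refers to as global stationarity.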

The loss of Equation (7) imposes global stationarity and does not require any knowledge to be extracted from the previously learned models. However, catastrophic forgetting causes misalignment between features and fixed classifier prototypes. Therefore, we further impose additional stationarity constraints in a local neighborhood of a feature by encouraging the current model to mimic the feature representation of the model previously learned. This allows the overall stationarity to also be determined by a local learning mechanism interacting with the global one provided by the d-Simplex classifier of Equation (7). The global-to-local interaction is achieved through the FD loss [56]. Differently from the more common practice of FD in CiL [16, 22, 26] in which each mini-batch is sampled from both the episodic memory and the current task (i.e., $\mathcal {T}_t \leftarrow \mathcal {M}_{t} \cup \mathcal {D}_t$ ), we evaluate the FD loss, at each task t, only on the samples stored in episodic memory $\mathcal {M}_t$ observed from previous tasks:

\begin{equation} \mathcal {L}_{\scriptscriptstyle \textrm {FD}}^{\scriptscriptstyle \mathcal {M}}= \frac{1}{|\mathcal {M}_{t}|} \sum _{\mathbf {x}_i \in \mathcal {M}_{t}} \left(1 - \frac{\phi _{t}(\mathbf {x}_i) \cdot \phi _{t-1}(\mathbf {x}_i)}{\left\Vert \phi _{t}(\mathbf {x}_i)\right\Vert \left\Vert \phi _{t-1}(\mathbf {x}_i)\right\Vert } \right), \end{equation}

(8)

where $\phi _{t-1}$ is the model learned from the previous task. This encourages local stationarity and stability using only the previous classes in the episodic memory, while new knowledge is assimilated (plasticity) only from the classes of the current task. As confirmed by the ablation in Section 8, this learning strategy leads to a significant performance improvement. The final loss function is the weighted sum of Equations (7) and (8):

\begin{equation} \mathcal {L} = \mathcal {L}_{t} + \lambda \; \mathcal {L}_{\scriptscriptstyle \textrm {FD}}^{\scriptscriptstyle \mathcal {M}}, \end{equation}

(9)

where $\lambda$ balances the contributions of the global and local alignment provided by the two losses. Algorithms 1 and 2 give the pseudo-code of our training procedure and of its application in visual search, respectively.
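How Equations (7), (8), and (9) fit together can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the released code: all function names are hypothetical, features are plain lists of floats, and the features of the previous model are treated as constants. In practice an automatic differentiation framework would compute these quantities on mini-batches.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fixed_classifier_loss(features, labels, prototypes):
    """Equation (7): cross-entropy against fixed (non-trainable) prototypes.

    The softmax denominator runs over every prototype, i.e. over both the
    classes seen so far (K_s) and the still-unassigned outputs (K_u).
    """
    total = 0.0
    for feat, label in zip(features, labels):
        logits = [dot(w, feat) for w in prototypes]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_norm - logits[label]
    return total / len(features)


def memory_fd_loss(new_feats, old_feats):
    """Equation (8): mean (1 - cosine similarity) over memory samples."""
    total = 0.0
    for f_new, f_old in zip(new_feats, old_feats):
        norms = math.sqrt(dot(f_new, f_new)) * math.sqrt(dot(f_old, f_old))
        total += 1.0 - dot(f_new, f_old) / norms
    return total / len(new_feats)


def cl2r_loss(features, labels, prototypes, mem_new, mem_old, lam):
    """Equation (9): global term plus lambda-weighted local term."""
    return (fixed_classifier_loss(features, labels, prototypes)
            + lam * memory_fd_loss(mem_new, mem_old))


# Toy values chosen only for illustration.
prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
batch_feats = [[2.0, 0.0], [0.0, 2.0]]
batch_labels = [0, 1]
mem_new = [[1.0, 0.0], [0.0, 3.0]]  # current model on memory samples
mem_old = [[2.0, 0.0], [3.0, 0.0]]  # previous model on the same samples
loss = cl2r_loss(batch_feats, batch_labels, prototypes, mem_new, mem_old, lam=5.0)
```

In the toy data, the first memory sample keeps its direction (zero FD penalty) while the second becomes orthogonal to its old feature (penalty of one), so the FD term averages to 0.5 before weighting.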

7 Experimental Results

7.1 Datasets and Verification Protocol

We compare our proposed CL²R training procedure and the baseline methods on several benchmarks: CIFAR10 [28], ImageNet20,¹ ImageNet100 [22, 53, 71], Labeled Faces in the Wild (LFW) [24], and IJB-C [36]. Evaluation is performed in the open-set 1:1 search problem, with verification accuracy as the performance metric M in Equations (1) and (2) for all datasets except IJB-C, for which the true acceptance rate at a given false acceptance rate (TAR@FAR) is used. These quantities are defined as $\text{TAR} = {\text{TP}}/{(\text{TP} + \text{FN})}$, $\text{FAR} = {\text{FP}}/{(\text{FP} + \text{TN})}$, and $\text{ACC} = {(\text{TP} + \text{TN})}/{(\text{TP} + \text{TN} + \text{FP} + \text{FN})}$, where TP, TN, FP, and FN indicate true positives, true negatives, false positives, and false negatives, respectively [27, 58]. Following the verification protocol defined in the work of Huang et al. [24], we generate a set of pairs of images that do or do not belong to the same class. A pair is verified on the basis of the distance between the feature vectors of its samples. During the evaluation of task t, $\phi _t$ is used to extract the feature representation for the first image of each pair (i.e., the query-set) and $\phi _k$, with $k \in \lbrace 1, \ldots , t\rbrace$, is used to extract the feature representation for the second image (i.e., the gallery-set). When $k=t$, the compatibility test is the self-test; otherwise, it is the cross-test between the two representations learned from the tasks at time t and k. For the LFW and IJB-C evaluation, we use the original pairs provided by the respective datasets; for the CIFAR10, ImageNet20, and ImageNet100 evaluation, the verification pairs are randomly generated. As the open-set evaluation requires no overlap between classes of the training-set and test-set, we use CIFAR100 to perform CiL (i.e., classification is the surrogate task from which the feature representation is learned) and the CIFAR10 pairs are used as the verification test-set.
Similarly, Tiny-ImageNet200 [29] is used as the training-set to evaluate the ImageNet20 pairs; LFW and IJB-C pairs are evaluated with models trained on CASIA-WebFace [75]. Finally, for ImageNet100, we train the models with images not included in ImageNet100 (i.e., the subset of the images of the remaining 900 classes that we named ImageNet900). These datasets are divided into tasks as described in Section 7.2.
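The pair-based protocol above can be sketched as follows. The sketch is only an illustration: the function names are hypothetical, and the choice of cosine distance and of a single fixed threshold are assumptions, since the text only states that pairs are verified by the distance between feature vectors. It also assumes that the pair list contains both same-class and different-class pairs.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def verification_metrics(query_feats, gallery_feats, same_class, threshold):
    """Count TP/TN/FP/FN over pairs and derive TAR, FAR, and ACC.

    query_feats[i] comes from the current model (phi_t) and gallery_feats[i]
    from phi_k; k == t gives the self-test, k < t the cross-test.
    """
    tp = tn = fp = fn = 0
    for q, g, is_same in zip(query_feats, gallery_feats, same_class):
        accepted = cosine_distance(q, g) < threshold
        if accepted and is_same:
            tp += 1
        elif accepted:
            fp += 1
        elif is_same:
            fn += 1
        else:
            tn += 1
    return {
        "TAR": tp / (tp + fn),
        "FAR": fp / (fp + tn),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
    }


# Four toy pairs: two same-class pairs followed by two different-class pairs.
queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gallery = [[1.0, 0.1], [0.0, 1.0], [0.0, 1.0], [1.0, -1.0]]
same = [True, True, False, False]
metrics = verification_metrics(queries, gallery, same, threshold=0.5)
```

Running the same function with gallery features extracted by an older model, while keeping the queries from the current model, corresponds to the cross-test.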

7.2 Implementation Details

Our CL²R training procedure is implemented in PyTorch [42] and uses the publicly available library Continuum [17]. We used four NVIDIA Tesla A100s to train the representation models, and the neural network architectures are based on the PODNet implementation.² The evaluation is carried out on several ResNet [20] architectures. Specifically, a 32-, 18-, and 50-layer ResNet is used for CIFAR10, ImageNet20 and ImageNet100, and LFW and IJB-C, respectively. As is typically used in CiL [22, 71], the episodic memory $\mathcal {M}$ contains 20 samples for each class. The value of $\lambda$ in Equation (9) is set to $\lambda = \lambda _{\rm base} \sqrt {{k_n}/{k_0}}$ [22], in which $\lambda _{\rm base}$ is a scalar, $k_n$ is the number of classes of the current task, and $k_0$ is the number of old classes in the episodic memory. The training details for each dataset are listed next.
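As a small worked example of the adaptive weight, the sketch below evaluates $\lambda = \lambda _{\rm base} \sqrt {k_n/k_0}$ over a hypothetical 10-task run with 10 new classes per task. The function name is illustrative, and starting from the second task is an assumption made because the first task has no old classes.

```python
import math


def adaptive_lambda(lambda_base, num_new_classes, num_old_classes):
    """lambda = lambda_base * sqrt(k_n / k_0), as stated in the text."""
    return lambda_base * math.sqrt(num_new_classes / num_old_classes)


# Hypothetical 10-task CIFAR100 run: 10 new classes per task, lambda_base = 5.
# Entry t - 1 is the weight used at task t + 1, when 10 * t classes are old.
schedule = [adaptive_lambda(5.0, 10, 10 * t) for t in range(1, 10)]
```

Under this formula the weight of the FD term shrinks as the number of old classes grows relative to the number of new ones.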

CIFAR100 and CIFAR10. We train the model for 70 epochs for each task with batch size 128, and optimization is performed with SGD with an initial learning rate of 0.1 and weight decay of $2\cdot 10^{-4}$ . The learning rate is divided by 10 at epochs 50 and 64. The input images are RGB, $32 \times 32$ . $\lambda _{\rm base}$ is set to 5.

Tiny-ImageNet200 and ImageNet20. We train the model for 90 epochs at each task with batch size 256, and optimization is performed with SGD with an initial learning rate of 0.1 and a weight decay of $2\cdot 10^{-4}$ . The learning rate is divided by 10 at epochs 30 and 60. To properly evaluate the models in this learning setting, input images and the ImageNet test images are resized to match the Tiny-ImageNet200 input size (RGB $64 \times 64$ ). $\lambda _{\rm base}$ is set to 5.

ImageNet900 and ImageNet100. We train the model for 90 epochs in each task with batch size 256, and optimization is performed with SGD with an initial learning rate of 0.1 and weight decay of $2\cdot 10^{-4}$ . The learning rate is divided by 10 at epochs 30 and 60. The input images are RGB, $224 \times 224$ . $\lambda _{\rm base}$ is set to 10.

CASIA-WebFace and LFW/IJB-C. For each task, we train the model for 120 epochs with batch size 1,024. Optimization is carried out with SGD with an initial learning rate of 0.1 and a weight decay of $2\cdot 10^{-4}$ . The learning rate is divided by 10 at epochs 30, 60, and 90. The input images are RGB, $112 \times 112$ . $\lambda _{\rm base}$ is set to 10.
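The step schedules listed above can be written compactly as a sketch. The helper and dictionary names are hypothetical, and whether epochs are counted from 0 or 1 is an assumption (0-based here); only the epoch counts, milestones, and initial learning rate come from the text.

```python
def step_learning_rate(epoch, base_lr, milestones, factor=10.0):
    """Divide base_lr by `factor` once for every milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** drops)


# Schedules as listed in the text (epoch indexing is an assumption).
SCHEDULES = {
    "CIFAR100": {"epochs": 70, "milestones": [50, 64]},
    "Tiny-ImageNet200": {"epochs": 90, "milestones": [30, 60]},
    "ImageNet900": {"epochs": 90, "milestones": [30, 60]},
    "CASIA-WebFace": {"epochs": 120, "milestones": [30, 60, 90]},
}

cifar_lrs = [
    step_learning_rate(e, 0.1, SCHEDULES["CIFAR100"]["milestones"])
    for e in range(SCHEDULES["CIFAR100"]["epochs"])
]
```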

In Table 1, we summarize the datasets and the training details of our experiments.

Table 1.

network	input size	training-set	# training classes	test-set	# test pairs
ResNet-32	$32 \times 32$	CIFAR100	100	CIFAR10	6k
ResNet-18	$64 \times 64$	Tiny-ImageNet200	200	ImageNet20	6k
ResNet-18	$224 \times 224$	ImageNet900	900	ImageNet100	6k
ResNet-50	$112 \times 112$	CASIA-WebFace	10,575	LFW	6k
ResNet-50	$112 \times 112$	CASIA-WebFace	10,575	IJB-C	15M

Table 1. Datasets Used in CL²R Training Procedures

Training-set and test-set of the same configuration have non-overlapping classes to properly evaluate different approaches in an open-set setup.

7.3 Baselines and Compared Methods

We compare our training procedure with both the CiL methods and the recently proposed methods for compatible learning. Our baselines include LwF [32], LUCIR [22], BiC [71], PODNet [16], FOSTER [69], FAN [25], and BCT [62]. In particular, FAN and BCT are the only approaches with an explicit mechanism to address feature compatibility. We adapted FAN so that the learned adaptation functions are used to transform the features into compatible features. Since in BCT the model is trained from scratch at each task using all available data, for a fair comparison, we also re-implemented it with an episodic memory and refer to it as lifelong-BCT ( $\ell$ -BCT). At each task, the model is initialized with the parameters of the model of the previous task and the data of the previous tasks can be accessed only through the episodic memory. For LwF, BiC, and PODNet, we use their publicly available implementations,² whereas for LUCIR and FOSTER, we adopted their official implementations.³ Finally, we also include a traditional Experience Replay (ER)-based baseline, where the model is continuously fine-tuned as new tasks become available. To evaluate our training procedure without considering the catastrophic forgetting phenomenon, we define as upper bound (UB) our training procedure re-trained from scratch at each task using an episodic memory with infinite size.

7.4 Evaluation on CIFAR10

In this section, we report the experiments in 2-, 3-, 5-, and 10-task CL²R settings with models trained on CIFAR100 (i.e., using 50, 33, 20, and 10 classes per task) where compatibility is evaluated on the CIFAR10 generated pairs.

In Table 2, we summarize the performance of our CL²R training procedure with respect to the other baselines in the two-task scenario. We evaluate the compatibility of the updated model according to the ECC (Equation (1)), BC (Equation (5)), and FC (Equation (6)). The first row of Table 2 reports the verification accuracy of the model trained on the first 50 classes of CIFAR100. Experiments show that, among the methods compared, LUCIR and PODNet may have an inherent, although limited, degree of compatibility. This substantially confirms the importance of having some form of mechanism to preserve the local geometry of the learned features. Our training procedure achieves the highest cross-test, BC, and FC, which makes it the most suitable training procedure for avoiding re-indexing.

Table 2.

method	self-test	cross-test	ECC	BC	FC
Initial Task	0.65	–	–	–	–
ER	0.64	0.62	$\times$	–0.034	–0.210
LwF	0.64	0.64	$\times$	–0.009	–0.002
BiC	0.66	0.63	$\times$	–0.015	–0.028
LUCIR	0.70	0.66	$\surd$	–0.012	–0.038
FAN	0.66	0.63	$\times$	–0.023	–0.035
FOSTER	0.66	0.57	$\times$	–0.080	–0.090
$\ell$-BCT	0.65	0.60	$\times$	–0.047	–0.044
PODNet	0.67	0.66	$\surd$	–0.014	–0.013
Ours	0.66	0.67	$\surd$	–0.017	–0.006
BCT*	0.72	0.65	$\surd$	–0.003	–0.071
Ours (UB)*	0.73	0.69	$\surd$	–0.039	–0.040

Table 2. CIFAR10 Evaluation

Two-task CL²R setting with models trained on CIFAR100. Initial Task (i.e., the previous task) shows the verification accuracy on the first 50 classes, and the other rows represent the performance obtained after two tasks.

*Not subject to catastrophic forgetting.

In the last rows of the table, we report the performance of the BCT and our UB that are not affected by catastrophic forgetting. The effect of catastrophic forgetting and its implications on the reduction of performance in compatibility can be observed in the self-test, as these values are significantly higher than the values reported by the methods learned using CiL.

In Table 3, results for the scenario of 3-, 5-, and 10-task CL²R are presented. For each experiment, we report AC (Equation (4)), BC (Equation (5)), and FC (Equation (6)). As can be noticed, our method always achieves the highest AC, thus obtaining the largest number of compatible representations between models, and always achieves the highest BC between methods that are subject to catastrophic forgetting. FAN achieves almost the same performance as our procedure in the 3-task scenario, but when the number of tasks increases, it has a significant decrease in performance, especially in the 10-task setting. This may be due to the increasing number of adaptation functions between different feature spaces that FAN uses to adapt old features with respect to the new ones. As can be noticed from the two tables, FOSTER does not learn compatible features. This may be due to the fact that feature space compression forces the representation to change abruptly reducing the overall compatibility with previous models. BCT reports higher values since its representation is learned from scratch for each new task. Compared to the UB, our training procedure achieves lower AC and BC, and this is due to the influence of catastrophic forgetting. From the table, it can also be noticed that BiC, LUCIR, and PODNet do not satisfy compatibility when catastrophic forgetting is more severe, as, for example, in the case of 10 tasks. Overall, these results suggest that the interaction between local and global stationarity promoted by our training procedure shows a significant improvement in performance that FD alone cannot provide.
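To make the AC values easier to read, the sketch below computes the fraction of model pairs whose cross-test exceeds the self-test of the older model. Equations (1) and (4) are defined earlier in the article, so this reading is an assumption rather than a restatement: it is chosen because it reduces to the ECC check for two tasks and yields values such as 0.67 for three tasks. The function name and the accuracy values are hypothetical.

```python
def average_compatibility(m):
    """Fraction of (i, j) pairs, j < i, whose cross-test beats the self-test.

    m[i][j] is the verification metric with queries from model i and the
    gallery from model j (only j <= i is used).
    """
    num_tasks = len(m)
    pairs = [(i, j) for i in range(1, num_tasks) for j in range(i)]
    compatible = sum(1 for i, j in pairs if m[i][j] > m[j][j])
    return compatible / len(pairs)


# Hypothetical 3-task accuracies (not taken from the article's tables).
m = [
    [0.60],
    [0.62, 0.63],
    [0.59, 0.64, 0.65],
]
ac = average_compatibility(m)
```

With three tasks there are three model pairs; in the toy matrix two of them pass the check, so `ac` is two thirds.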

Table 3.

7.5 Evaluation on ImageNet

In this section, we conducted the experiments with models trained on Tiny-ImageNet200 in CL²R settings with 2 (Table 4), 3, 5, and 10 (Table 5) tasks.

Table 4.

method	self-test	cross-test	ECC	BC	FC
Initial Task	0.61	–	–	–	–
ER	0.62	0.59	$\times$	–0.012	–0.028
LwF	0.63	0.60	$\times$	–0.007	–0.032
BiC	0.60	0.61	$\times$	–0.001	0.005
LUCIR	0.60	0.62	$\surd$	0.012	0.015
FAN	0.61	0.62	$\surd$	0.008	0.009
$\ell$-BCT	0.61	0.57	$\times$	–0.042	–0.038
Ours	0.61	0.63	$\surd$	0.017	0.015
BCT*	0.65	0.64	$\surd$	0.026	–0.05
Ours (UB)*	0.66	0.64	$\surd$	0.031	–0.018

Table 4. ImageNet20 Evaluation

The two-task CL²R setting with models trained on Tiny-ImageNet200. The Initial Task (i.e., the previous task) shows verification accuracy on the first 100 classes, and the other rows represent the performance obtained after two tasks.

*Not subject to catastrophic forgetting.

Table 5.

Table 4 follows the same structure as Table 2, showing the ECC (Equation (1)), BC (Equation (5)), and FC (Equation (6)) values. For all compared methods, the initial model (i.e., the previous model) is trained on the first 100 classes of Tiny-ImageNet200. As can be seen in the table, our method achieves the best performance. Other methods, such as FAN and LUCIR, also show a certain level of compatibility, although with lower values, which again confirms that the distillation with which they are equipped is a useful tool to support learning compatible features. As also observed in the CIFAR results, methods not subject to catastrophic forgetting (i.e., BCT and our UB) achieve higher BC and lower FC.

Table 5 shows the 3-, 5-, and 10-task CL²R settings for Tiny-ImageNet200. In these learning scenarios, each task is made up of 66, 40, and 20 classes, respectively. In this table, we discuss the results by analyzing the values of AC (Equation (4)), BC (Equation (5)), and FC (Equation (6)). Our approach always achieves the highest value of AC. In particular, ER, LwF, BiC, FAN, and $\ell$-BCT do not achieve lifelong-compatible representation in the 3-task setting, as indicated by AC = 0. In the 10-task CL²R setting, it is more evident that as the number of tasks increases, methods without any specific mechanism to preserve the representation typically cannot learn compatible representations. LUCIR, BiC, and $\ell$-BCT obtain significantly lower values than our method. Specifically, the AC performance is more than twice that of BCT, which means that our CL²R procedure obtains more than twice as many compatible representations as BCT. This may be caused by the fact that the constraints imposed by these techniques on the learned representation seem to have very little effect on its stationarity, and consequently on its compatibility. The results on the 10-task setting are also important, as they suggest that catastrophic forgetting is not an intrinsic impediment to learning compatible representations. The performance difference of 0.11 in AC with respect to the UB can be considered clear evidence of this effect. Finally, the table shows that our training procedure provides the highest FC and is the only case where FC is always positive. As a result, our training procedure achieves, on average, cross-tests higher than self-tests, indicating that the system performs better even without re-indexing the gallery.

Table 6 reports ImageNet100 results when models are trained on ImageNet900 with two and three tasks. We compare our approach only with $\ell$-BCT, since it has reasonable performance and an explicit mechanism to learn compatible features under catastrophic forgetting. As can be seen in the table, our CL²R training clearly outperforms $\ell$-BCT. Our method achieves good scores for AC in both scenarios. The reduced performance of $\ell$-BCT appears to be connected to the fact that its training procedure is based only on pairwise model training (i.e., compatibility is only learned from the previous model). In contrast, our method is not based only on pairwise learning and does not use previous classifiers, which may be incorrectly learned.

Table 6.

	two tasks			three tasks
method	AC	BC	FC	AC	BC	FC
$\ell$-BCT	0.00	–0.127	–0.101	0.00	–0.073	0.006
Ours	1.00	0.005	–0.009	0.67	0.019	–0.011

Table 6. ImageNet100 Evaluation

The two and three-task CL²R settings with models trained on ImageNet900. We compare our training procedure and $\ell$ -BCT reporting AC, BC, and FC.

7.6 Face Verification

In this section, we report the experimental results on the LFW and IJB-C benchmarks in the 2-, 3-, 5-, and 10-task CL²R settings. We incrementally train the representation models on CASIA-WebFace, resulting in tasks composed of 5,287, 3,525, 2,115, and 1,057 classes, respectively.
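The task sizes quoted above follow from dividing the 10,575 CASIA-WebFace classes evenly among the tasks. The short sketch below reproduces the arithmetic; the helper name is hypothetical, and since the text does not state how any leftover classes are assigned, the sketch only reports the remainder.

```python
def classes_per_task(num_classes, num_tasks):
    """Floor division; returns (classes per task, classes left over)."""
    return divmod(num_classes, num_tasks)


# CASIA-WebFace has 10,575 classes (Table 1).
casia = {t: classes_per_task(10575, t) for t in (2, 3, 5, 10)}
```

The same arithmetic gives 33 classes per task for CIFAR100 and 66 for Tiny-ImageNet200 in the 3-task settings.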

The results are summarized in Tables 7 and 8 for LFW and IJB-C, respectively. In particular, for IJB-C, we report AC, BC, and FC at different false acceptance rates (FAR): $10^{-1}$, $10^{-2}$, and $10^{-4}$. In this evaluation, we do not report LUCIR when training on CASIA-WebFace due to the excessive memory requirements of the original implementation.³ Although in the 2-task scenario on LFW our results are comparable to those of $\ell$-BCT, in the 3- and 5-task settings our training procedure achieves complete compatibility, with AC = 1 and BC always positive. In the 10-task setting, the performance gap widens further, confirming a clear overall positive performance. Generally, the reported performances are higher on face datasets than on CIFAR10, ImageNet20, and ImageNet100. A possible reason is that in face recognition, the domain shift between classes is lower than that for CIFAR or ImageNet. Finally, this experiment shows that the proposed method is effective not only with a larger number of model updates but also with larger datasets.

Table 7.

	two tasks			three tasks			five tasks			ten tasks
Method	AC	BC	FC	AC	BC	FC	AC	BC	FC	AC	BC	FC
$\ell$-BCT	1.00	0.005	–0.010	0.67	–0.007	–0.005	0.40	–0.002	–0.015	0.31	–0.002	–0.010
Ours	1.00	0.003	–0.001	1.00	0.004	–0.005	1.00	0.006	–0.005	0.82	0.006	–0.005

Table 7. Face Verification on the LFW Dataset

The 2-, 3-, 5-, and 10-task CL²R settings with models trained on CASIA-WebFace. We compare our training procedure and $\ell$ -BCT reporting BC (Equation (5)), FC (Equation (6)), and AC (Equation (4)), which corresponds to the ECC (Equation (1)) when evaluated in two tasks.

Table 8.

		two tasks			three tasks			five tasks			ten tasks
FAR	Method	AC	BC	FC	AC	BC	FC	AC	BC	FC	AC	BC	FC
$10^{-1}$	$\ell$-BCT	1.00	0.002	–0.010	0.33	–0.007	–0.017	0.20	–0.031	–0.028	0.22	–0.029	–0.029
$10^{-1}$	Ours	1.00	0.005	–0.009	1.00	0.004	–0.006	0.80	0.001	–0.008	0.76	0.011	–0.002
$10^{-2}$	$\ell$-BCT	0.00	–0.026	–0.015	0.33	–0.011	–0.010	0.10	–0.038	–0.025	0.09	–0.020	–0.034
$10^{-2}$	Ours	1.00	0.005	–0.017	1.00	0.010	0.009	0.80	0.008	–0.014	0.73	0.010	–0.003
$10^{-4}$	$\ell$-BCT	0.00	–0.012	–0.004	0.33	–0.010	–0.012	0.00	–0.041	–0.028	0.09	–0.016	–0.009
$10^{-4}$	Ours	1.00	0.023	0.005	0.67	0.002	0.005	0.80	0.001	–0.003	0.73	0.012	0.007

Table 8. Face Verification on the IJB-C Dataset

The 2-, 3-, 5-, and 10-task CL²R settings with models trained on CASIA-WebFace. We report AC, BC, and FC at FAR = $10^{-1}$, $10^{-2}$, and $10^{-4}$ and compare our training procedure with $\ell$-BCT.

7.7 Compatibility and Catastrophic Forgetting

In this section, we study how compatibility is related to the problem of catastrophic forgetting. In Figure 2, we show the evolution of BC in the 5- and 10-task CL²R scenarios. In particular, Figure 2(a) and (b) and Figure 2(c) and (d) show the evaluations on the CIFAR10 and ImageNet20 datasets, respectively. We compared our approach with ER, LwF, BiC, LUCIR, FAN, and $\ell$-BCT. As can be observed, our training procedure achieves the highest performance. Since its BC is, on average, closer to zero than that of the other evaluated methods, the representation learned by our training procedure can be considered the most compatible and, from the perspective of visual search, equivalent to the representation models learned from previous tasks. In practice, this reduces the computational cost of re-indexing.

Fig. 2.

In contrast, FAN achieves a negative value of BC in all four settings, confirming that the composition of an increasing number of feature adaptation functions between sequentially learned representations causes a decrease in compatibility. Even in the absence of a considerable performance loss, as in the case of FAN, negative BC values indicate a constant deterioration in performance as the number of tasks increases. In general, the figure shows that all methods except ours follow a common trend with lower performance.

8 Ablation Studies

We analyze by ablation the main components of our training procedure. The ablation is performed on the CIFAR100 dataset as described in Section 7.4 and considers the 10-task CL²R setting, which can be regarded as a worst-case scenario for this dataset. We analyze the impact of (i) the specific classifier: Trainable vs. fixed d-Simplex with or without the FD component, (ii) how the FD loss is evaluated, and (iii) the sensitivity of the number of samples reserved per class in the episodic memory.

Impact of the d-Simplex fixed classifier and FD. As can be seen in Table 9, the Trainable classifier is not able to learn compatible representations. When combined with FD, the performance improves only marginally and not sufficiently to be comparable with the CiL approaches shown in Table 3. Evaluating FD only on the samples stored in episodic memory, as defined in $\mathcal {L}_{\scriptscriptstyle \textrm {FD}}^{\scriptscriptstyle \mathcal {M}}$ (Equation (8)), improves the reported metrics, indicating a better supervision signal for the updated model. The d-Simplex classifier alone improves on the previous configurations, obtaining AC = 0.27 and FC = 0.003, which are higher than those of the Trainable classifier with $\mathcal {L}_{\scriptscriptstyle \textrm {FD}}^{\scriptscriptstyle \mathcal {M}}$. This underlines the importance of preserving the global geometry of the learned features according to the d-Simplex fixed classifier.

Table 9.

classifier		distillation		ten tasks
Trainable	Fixed	$\mathcal {L}_{\scriptscriptstyle \textrm {FD}}$	$\mathcal {L}_{\scriptscriptstyle \textrm {FD}}^{\scriptscriptstyle \mathcal {M}}$	AC	BC	FC
$\surd$				0.04	–0.130	–0.083
$\surd$		$\surd$		0.13	–0.050	–0.049
$\surd$			$\surd$	0.22	–0.043	–0.013
	$\surd$			0.27	–0.078	0.003
	$\surd$	$\surd$		0.40	–0.019	–0.011
	$\surd$		$\surd$	0.44	–0.003	0.005

Table 9. Ablation of the Different Main Components of Our CL²R Training Procedure

The evaluation is performed on CIFAR10 and training is based on CIFAR100 with 10 tasks, where Trainable indicates the traditional ER baseline, Fixed indicates ER with stationary features learned from Equation (7) according to the fixed d-Simplex classifier, $\mathcal {L}_{\scriptscriptstyle \textrm {FD}}$ is the traditional FD, and $\mathcal {L}_{\scriptscriptstyle \textrm {FD}}^{\scriptscriptstyle \mathcal {M}}$ is the FD evaluated only on the samples stored in episodic memory as defined in Equation (8).

Impact of memory samples on FD. Table 9 shows that when the distillation loss is evaluated only on the samples stored in episodic memory, $\mathcal {L}_{\scriptscriptstyle \textrm {FD}}^{\scriptscriptstyle \mathcal {M}}$ (Equation (8)), our approach achieves better overall results. We argue that this positive effect is mostly due to the interaction between the global feature stationarity learned using the fixed classifier and the local one promoted through FD using only the observed samples in the episodic memory. The interaction is most likely related to the fact that the fixed d-Simplex classifier in general does not allow novel classes to interfere in the feature space of the already learned ones. This in turn provides favorable working conditions (i.e., a kind of coarse pre-alignment) for achieving feature alignment with respect to the previous model by the distillation loss. As expected, the impact intensifies when FD is evaluated only on already known classes, as alignment is less prone to unexpected noisy features, which may reduce the degree of alignment. This confirms the effectiveness of restricting FD to memory samples, in contrast to the traditional FD commonly used in CiL.

Impact of the episodic memory size. Figure 3 shows the effect of different numbers of reserved samples per class for both our learning procedure and other baselines. As expected, the more samples per class are reserved in the episodic memory, the better the performance. Our approach, with 20 samples per class, achieves results similar to those obtained by the other methods with more examples per class. Although ER, LUCIR, and FAN have a better relative improvement with 50 samples per class, overall our approach results in the highest performance in learning compatible features.

Fig. 3.

We also evaluated the methods in the challenging memory-free training setting (i.e., without the episodic memory). Our training procedure achieves the highest results in this condition as well, highlighting the fact that CiL methods typically do not have an inherent mechanism to learn compatible features.

9 Conclusion

In this article, we have introduced the problem of CL²R, which considers the compatibility learning problem within the lifelong learning paradigm. We introduced a novel set of metrics to properly evaluate this problem and proposed a novel CL²R training procedure that imposes global and local stationarity on the learned features to achieve compatibility between representations under catastrophic forgetting. Global and local stationarity is imposed according to the d-Simplex fixed classifier and the FD loss, respectively. Empirical evaluation of the learned lifelong-compatible representation shows the effectiveness of our method with respect to baselines and state-of-the-art methods.

Footnotes

¹ To meet the open-set protocol, we generated a training set from ImageNet [57] by randomly sampling 20 classes that are not included in the Tiny-ImageNet200 dataset. The indices of the ImageNet classes we use are the following: {n02276258, n01728572, n03814906, n02817516, n03769881, n03220513, n04442312, n04252225, n13037406, n04266014, n03929855, n02804414, n01873310, n03532672, n01818515, n03916031, n03345487, n02114855, n04589890, n03776460}.

² https://github.com/arthurdouillard/incremental_learning.pytorch.

³ https://github.com/hshustc/CVPR19_Incremental_Learning and https://github.com/G-U-N/ECCV22-FOSTER.

References

[1]

Tommaso Barletti, Niccoló Biondi, Federico Pernici, Matteo Bruni, and Alberto Del Bimbo. 2022. Contrastive supervised distillation for continual representation learning. In Proceedings of the International Conference on Image Analysis and Processing. 597–609.
