An Ensemble of Epoch-Wise Empirical Bayes for Few-Shot Learning

Yaoyao Liu¹²,
Bernt Schiele¹² &
Qianru Sun¹³

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12361)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Few-shot learning aims to train efficient predictive models with a few examples. The lack of training data leads to poor models that make high-variance or low-confidence predictions. In this paper, we propose to meta-learn the ensemble of epoch-wise empirical Bayes models (E$^3$BM) to achieve robust predictions. “Epoch-wise” means that each training epoch has a Bayes model whose parameters are specifically learned and deployed. “Empirical” means that the hyperparameters, e.g., those used for learning and ensembling the epoch-wise models, are generated by hyperprior learners conditional on task-specific data. We introduce four kinds of hyperprior learners by considering inductive vs. transductive, and epoch-dependent vs. epoch-independent, in the paradigm of meta-learning. We conduct extensive experiments for five-class few-shot tasks on three challenging benchmarks: miniImageNet, tieredImageNet, and FC100, and achieve top performance using the epoch-dependent transductive hyperprior learner, which captures the richest information. Our ablation study shows that both “epoch-wise ensemble” and “empirical” encourage high efficiency and robustness in the model performance (our code is open-sourced at https://gitlab.mpi-klsb.mpg.de/yaoyaoliu/e3bm).

1 Introduction

Humans can readily learn new concepts from a handful of examples, whereas this remains challenging for machine models, whose typical training requires a significant amount of data for good performance [34]. However, in many real-world applications, e.g., in the medical domain, a significant amount of training data is not available. It is thus desirable to improve machine learning models to handle few-shot settings where each new concept has very scarce examples [13, 30, 39, 70].

Meta-learning methods aim to tackle the few-shot learning problem by transferring experience from similar few-shot tasks [7]. There are different meta strategies, among which the gradient descent based methods are particularly promising for today’s neural networks [1, 13, 14, 15, 20, 25, 38, 70, 74, 80, 82, 83, 85]. These methods follow a unified meta-learning procedure that contains two loops. The inner loop learns a base-learner for each individual task, and the outer loop uses the validation loss of the base-learner to optimize a meta-learner. In previous works [1, 13, 14, 70], the task of the meta-learner is to initialize the base-learner for fast and efficient adaptation to the few training samples in the new task.

In this work, we aim to address two shortcomings of the previous works. First, the learning process of a base-learner for few-shot tasks is quite unstable [1], and often results in high-variance or low-confidence predictions. An intuitive solution is to train an ensemble of models and use the combined prediction, which should be more robust [6, 29, 54]. However, it is not obvious how to obtain and combine multiple base-learners given that only a very limited number of training examples are available. Rather than learning multiple independent base-learners [79], we propose a novel method of utilizing the sequence of epoch-wise base-learners (while training a single base-learner) as the ensemble. Second, it is well-known that the values of hyperparameters, e.g., for initializing and updating models, are critical for best performance, and are particularly important for few-shot learning. In order to explore the optimal hyperparameters, we propose to employ the empirical Bayes method in the paradigm of meta-learning. Specifically, we meta-learn hyperprior learners with meta-training tasks, and use them to generate task-specific hyperparameters, e.g., for updating and ensembling multiple base-learners. We call the resulting novel approach E$^3$BM, which learns the Ensemble of Epoch-wise Empirical Bayes Models for each few-shot task. Our “epoch-wise models” are different models since each of them results from a specific training epoch and is trained with a specific set of hyperparameter values. During test, E$^3$BM combines the ensemble of models’ predictions with soft ensembling weights to produce more robust results. In this paper, we argue that during model adaptation to few-shot tasks, the most active adapting behaviors actually happen in the early epochs, and the models then converge to and even overfit the training data in later epochs. Related works use the single base-learner obtained from the last epoch, so their meta-learners learn only partial adaptation experience [13, 14, 25, 70]. In contrast, our E$^3$BM leverages an ensemble modeling strategy that adapts base-learners at different epochs, each of which has task-specific hyperparameters for updating and ensembling. It thus obtains the optimized combinational adaptation experience. Figure 1 presents the conceptual illustration of E$^3$BM, compared to those of the classical method MAML [13] and the state-of-the-art SIB [25].

Our main contributions are three-fold. (1) A novel few-shot learning approach E$^3$BM that learns to learn and combine an ensemble of epoch-wise Bayes models for more robust few-shot learning. (2) Novel hyperprior learners in E$^3$BM to generate the task-specific hyperparameters for learning and combining epoch-wise Bayes models. In particular, we introduce four kinds of hyperprior learners by considering inductive [13, 70] and transductive learning methods [25], each with either epoch-dependent (e.g., LSTM) or epoch-independent (e.g., epoch-wise FC layer) architectures. (3) Extensive experiments on three challenging few-shot benchmarks, miniImageNet [73], tieredImageNet [58] and Fewshot-CIFAR100 (FC100) [53]. We plug our E$^3$BM into the state-of-the-art few-shot learning methods [13, 25, 70] and obtain consistent performance boosts. We conduct extensive model comparison and observe that our E$^3$BM employing an epoch-dependent transductive hyperprior learner achieves the top performance on all benchmarks.

2 Related Works

Few-Shot Learning & Meta-Learning. Research literature on few-shot learning paradigms exhibits a high diversity, from using data augmentation techniques [9, 75, 77] over sharing feature representation [2, 76] to meta-learning [18, 72]. In this paper, we focus on the meta-learning paradigm that leverages few-shot learning experiences from similar tasks based on the episodic formulation (see Section 3). Related works can be roughly divided into three categories. (1) Metric learning methods [12, 24, 40, 41, 64, 71, 73, 78, 81] aim to learn a similarity space, in which the learning should be efficient for few-shot examples. The metrics include Euclidean distance [64], cosine distance [8, 73], relation module [24, 41, 71] and graph-based similarity [45, 62]. Metric-based task-specific feature representation learning has also been presented in many related works [12, 24, 41, 78]. (2) Memory network methods [50, 52, 53] aim to learn training “experience” from the seen tasks and then generalize it to the learning of the unseen ones. A model with external memory storage is designed specifically for fast learning in a few iterations, e.g., Meta Networks [52], Neural Attentive Learner (SNAIL) [50], and Task Dependent Adaptive Metric (TADAM) [53]. (3) Gradient descent based methods [1, 13, 14, 20, 25, 37, 38, 43, 57, 70, 85] usually employ a meta-learner that learns to quickly adapt an NN base-learner to a new task within a few optimization steps. For example, Rusu et al. [61] introduced a classifier generator as the meta-learner, which outputs parameters for each specific task. Lee et al. [37] presented a meta-learning approach with convex base-learners for few-shot tasks. Finn et al. [13] designed a meta-learner called MAML, which learns to effectively initialize the parameters of an NN base-learner for a new task. Sun et al. [69, 70] introduced an efficient knowledge transfer operator on deeper neural networks and achieved a significant improvement for few-shot learning models. Hu et al. [25] proposed to update the base-learner with synthetic gradients generated by a variational posterior conditional on unlabeled data. Our approach is closely related to gradient descent based methods [1, 13, 25, 69, 70]. An important difference is that we learn how to combine an ensemble of epoch-wise base-learners and how to generate efficient hyperparameters for base-learners, while other methods such as MAML [13], MAML++ [1], LEO [61], MTL [69, 70], and SIB [25] use a single base-learner.

Hyperparameter Optimization. Building a model for a new task is a process of exploration-exploitation. Exploring suitable architectures and hyperparameters is important before training. Traditional methods are model-free, e.g., based on grid search [4, 28, 42]. They require multiple full training trials and are thus costly. Model-based hyperparameter optimization methods are adaptive but sophisticated, e.g., using random forests [27], Gaussian processes [65], input warped Gaussian processes [67], or scalable Bayesian optimization [66]. In our approach, we meta-learn a hyperprior learner to output optimal hyperparameters by gradient descent, without additional manual labor. Related methods using gradient descent mostly work for single model learning in an inductive way [3, 10, 15, 44, 46, 47, 48, 49]. In contrast, our hyperprior learner generates a sequence of hyperparameters for multiple models, in either the inductive or the transductive learning manner.

Ensemble Modeling. Ensemble modeling is a strategy [26, 84] that uses multiple algorithms to improve machine learning performance, and it has proved effective in reducing problems related to overfitting [35, 68]. Mitchell et al. [51] provided a theoretical explanation for it. Boosting is one classical way to build an ensemble, e.g., AdaBoost [16] and Gradient Tree Boosting [17]. Stacking combines multiple models by learning a combiner, and it applies to tasks in both supervised learning [6, 29, 54] and unsupervised learning [63]. Bootstrap aggregating (i.e., Bagging) builds an ensemble of models through parallel training [6], e.g., random forests [22]. The ensemble can also be built on a temporal sequence of models [36]. Some recent works have applied ensemble modeling to few-shot learning. Yoon et al. proposed Bayesian MAML (BMAML) that trains multiple instances of the base-model to reduce meta-level overfitting [79]. The most recent work [11] encourages multiple networks to cooperate while keeping predictive diversity. Its networks are trained with carefully-designed penalty functions, different from our automated method using empirical Bayes. Besides, its method needs to train many more network parameters than ours. Detailed comparisons are given in the experiment section.

3 Preliminary

In this section, we introduce the unified episodic formulation of few-shot learning, following [13, 57, 73]. This formulation was first proposed for few-shot classification in [73]. Its problem definition differs from traditional classification in three aspects: (1) the main phases are not training and test but meta-training and meta-test, each of which includes training and test; (2) the samples in meta-training and meta-testing are not datapoints but episodes, i.e., few-shot classification tasks; and (3) the objective is not to classify unseen datapoints but to quickly adapt the meta-learned knowledge to the learning of new tasks.

Given a dataset $\mathcal {D}$ for meta-training, we first sample few-shot episodes (tasks) $\{\mathcal {T}\}$ from a task distribution $p(\mathcal {T})$ such that each episode $\mathcal {T}$ contains a few samples of a few classes, e.g., 5 classes and 1 shot per class. Each episode $\mathcal {T}$ includes a training split $\mathcal {T}^{(tr)}$ to optimize a specific base-learner, and a test split $\mathcal {T}^{(te)}$ to compute a generalization loss to optimize a global meta-learner. For meta-test, given an unseen dataset $\mathcal {D}_{un}$ (i.e., samples are from unseen classes), we sample a test task $\mathcal {T}_{un}$ with training/test splits of the same sizes. We first initialize a new model with meta-learned network parameters (output from our hyperprior learner), then train this model on the training split $\mathcal {T}^{(tr)}_{un}$. We finally evaluate the performance on the test split $\mathcal {T}^{(te)}_{un}$. If we have multiple tasks, we report average accuracy as the final result.
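The episode sampling described above can be illustrated with a minimal sketch. The function name `sample_episode`, the dictionary-based dataset layout, and the number of test (query) samples per class (`n_query=15`) are assumptions made for illustration; they are not specified in this section.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, n_query=15, rng=None):
    """Sample one episode from `dataset`, a dict mapping class label -> list of samples.

    Returns (train_split, test_split), each a list of (sample, episode_label)
    pairs, where episode labels are re-indexed to 0..n_way-1.
    n_query (test samples per class) is an assumed value, not taken from the paper.
    """
    rng = rng or random.Random()
    classes = rng.sample(sorted(dataset), n_way)          # pick the episode's classes
    train_split, test_split = [], []
    for episode_label, cls in enumerate(classes):
        picked = rng.sample(dataset[cls], k_shot + n_query)  # disjoint train/test samples
        train_split += [(x, episode_label) for x in picked[:k_shot]]
        test_split += [(x, episode_label) for x in picked[k_shot:]]
    return train_split, test_split

# Toy dataset: 20 classes with 30 placeholder "images" each
toy = {c: [f"img_{c}_{i}" for i in range(30)] for c in range(20)}
tr, te = sample_episode(toy, n_way=5, k_shot=1, n_query=15, rng=random.Random(0))
```

Meta-test episodes would be drawn the same way from the classes of $\mathcal {D}_{un}$, so that training and test splits keep the same sizes as in meta-training.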

4 An Ensemble of Epoch-Wise Empirical Bayes Models

As shown in Fig. 2, E$^3$BM trains a sequence of epoch-wise base-learners $\{\varTheta _m\}$ with training data $\mathcal {T}^{(tr)}$ and learns to combine their predictions $\{z^{(te)}_m\}$ on test data $x^{(te)}$ for the best performance. This ensembling strategy achieves more robustness during prediction. The hyperparameters of each base-learner, i.e., learning rates $\alpha $ and combination weights v, are generated by the hyperprior learners conditional on task-specific data, e.g., $x^{(tr)}$ and $x^{(te)}$. This approach encourages the high diversity and informativeness of the ensembling models.

4.1 Empirical Bayes Method

Our approach can be formulated as an empirical Bayes method that learns two levels of models for a few-shot task. The first level has hyperprior learners that generate hyperparameters for updating and combining the second-level models. More specifically, these second-level models are trained with the loss derived from the combination of their predictions on training data. After that, their losses on test data are used to optimize the hyperprior learners. This process is also called meta update; see the dashed arrows in Fig. 2.

Specifically, we sample K episodes $\{\mathcal {T}_k\}_{k=1}^K$ from the meta-training data $\mathcal {D}$. Let $\varTheta $ denote the base-learner and $\psi $ represent its hyperparameters. An episode $\mathcal {T}_k$ aims to train $\varTheta $ to recognize different concepts, so we use concept-related (task-specific) data to customize $\varTheta $ through a hyperprior $p(\psi _k)$. To achieve this, we first formulate the empirical Bayes method with the marginal likelihood according to the hierarchical structure among data as follows,

$$\begin{aligned} p(\mathcal {T}) = \prod _{k=1}^K p(\mathcal {T}_k) = \prod _{k=1}^K \int _{\psi _k} p(\mathcal {T}_k|\psi _k)p(\psi _k)d{\psi _k}. \end{aligned}$$

(1)

Then, we use variational inference [23] to estimate $\{p(\psi _k)\}_{k=1}^K$. We parametrize distribution $q_{\varphi _k}(\psi _k)$ with $\varphi _k$ for each $p(\psi _k)$, and update $\varphi _k$ to increase the similarity between $q_{\varphi _k}(\psi _k)$ and $p(\psi _k)$. As in standard probabilistic modeling, we derive an evidence lower bound on the log version of Eq. (1) to update $\varphi _k$,

$$\begin{aligned} \log p(\mathcal {T}) \geqslant \sum _{k=1}^K\Big [ \mathbb {E}_{\psi _k\sim q_{\varphi _k}} \big [ \log p(\mathcal {T}_k|\psi _k) \big ] - D_{\mathrm {KL}}(q_{\varphi _k}(\psi _k)||p(\psi _k))\Big ]. \end{aligned}$$

(2)
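For reference, the bound in Eq. (2) follows from applying Jensen's inequality to each factor of Eq. (1). The intermediate steps below are the standard derivation and are not spelled out in the original text:

$$\begin{aligned} \log p(\mathcal {T}_k) &= \log \int _{\psi _k} q_{\varphi _k}(\psi _k) \frac{p(\mathcal {T}_k|\psi _k)p(\psi _k)}{q_{\varphi _k}(\psi _k)} d{\psi _k} \\ &\geqslant \mathbb {E}_{\psi _k\sim q_{\varphi _k}} \Big [ \log p(\mathcal {T}_k|\psi _k) + \log \frac{p(\psi _k)}{q_{\varphi _k}(\psi _k)} \Big ] \\ &= \mathbb {E}_{\psi _k\sim q_{\varphi _k}} \big [ \log p(\mathcal {T}_k|\psi _k) \big ] - D_{\mathrm {KL}}(q_{\varphi _k}(\psi _k)||p(\psi _k)). \end{aligned}$$

Summing over $k=1,\dots ,K$ and using $\log p(\mathcal {T}) = \sum _{k=1}^K \log p(\mathcal {T}_k)$ from Eq. (1) recovers Eq. (2).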

Therefore, the problem of making $q_{\varphi _k}(\psi _k)$ the best possible estimate of $p(\psi _k)$ becomes equivalent to maximizing the evidence lower bound [5, 23, 25] in Eq. (2) with respect to $\{\varphi _k\}_{k=1}^K$, i.e., to minimizing its negation averaged over the K episodes, as follows,

$$\begin{aligned} \min _{\{\varphi _k\}_{k=1}^K} \frac{1}{K} \sum _{k=1}^K\Big [ \mathbb {E}_{\psi _k\sim q_{\varphi _k}} \big [ -\log p(\mathcal {T}_k|\psi _k) \big ] + D_{\mathrm {KL}}(q_{\varphi _k}(\psi _k)||p(\psi _k))\Big ]. \end{aligned}$$

(3)

To improve the robustness of few-shot models, existing methods sample a large number of episodes during meta-training [13, 70]. Each episode employing its own hyperprior $p(\psi _k)$ causes a huge computation burden, making it difficult to solve the aforementioned optimization problem. To tackle this, we leverage a technique called “amortized variational inference” [25, 32, 59]. We parameterize the KL term in $\{\varphi _k\}_{k=1}^K$ (see Eq. (3)) with a unified deep neural network $\varPsi (\cdot )$ taking $x^{(tr)}_k$ (inductive learning) or $\{x^{(tr)}_k, x^{(te)}_k\}$ (transductive learning) as inputs, where $x^{(tr)}_k$ and $x^{(te)}_k$ respectively denote the training and test samples in the k-th episode. In this paper, we call $\varPsi (\cdot )$ the hyperprior learner. As shown in Fig. 3, we additionally feed the training gradients $\nabla \mathcal {L}_{\varTheta }(\mathcal {T}^{(tr)}_k)$ to $\varPsi (\cdot )$ to encourage it to “consider” the current state of the training epoch. We mentioned in Sect. 1 that base-learners at different epochs are adapted differently, so we expect the corresponding hyperprior learner to “observe” and “utilize” this information to produce effective hyperparameters. By replacing $q_{\varphi _k}$ with $q_{\varPsi (\cdot )}$, Problem (3) can be rewritten as:

$$\begin{aligned} \min _{\varPsi } \frac{1}{K} \sum _{k=1}^K\Big [ \mathbb {E}_{\psi _k\sim q_{\varPsi (\cdot )}} \big [ -\log p(\mathcal {T}_k|\psi _k) \big ] + D_{\mathrm {KL}}(q_{\varPsi (\cdot )}(\psi _k)||p(\psi _k))\Big ]. \end{aligned}$$

(4)

Then, we solve Problem (4) by optimizing $\varPsi (\cdot )$ with the meta gradient descent method used in classical meta-learning paradigms [13, 25, 70]. We elaborate the details of learning $\{\varTheta _m\}$ and meta-learning $\varPsi (\cdot )$ in the following sections.

4.2 Learning the Ensemble of Base-Learners

Previous works have shown that training multiple instances of the base-learner is helpful to achieve robust few-shot learning [12, 79]. However, they suffer from the computational burden of optimizing multiple copies of neural networks in parallel, and are not easy to generalize to deeper neural architectures. If the computation of second-order derivatives in meta gradient descent [13] is included, this burden becomes even less affordable. In contrast, our approach is free from this problem, because it is built on top of optimization-based meta-learning models, e.g., MAML [13], MTL [70], and SIB [25], which naturally produce a sequence of models along the training epochs in each episode.

Given an episode $\mathcal {T}=\{\mathcal {T}^{(tr)}, \mathcal {T}^{(te)}\}=\{\{x^{(tr)}, y^{(tr)}\},\{x^{(te)}, y^{(te)}\} \}$, let $\varTheta _{m}$ denote the parameters of the base-learner working at epoch m (i.e., the m-th base-learner, or BL-m), with $m \in \{1, ..., M\}$. Basically, we initialize BL-1 with parameters $\theta $ (network weights and bias) and hyperparameters (e.g., learning rate $\alpha $), where $\theta $ is meta-optimized as in MAML [13], and $\alpha $ is generated by the proposed hyperprior learner $\varPsi _{\alpha }$. We then adapt BL-1 with normal gradient descent on the training set $\mathcal {T}^{(tr)}$, and use the adapted weights and bias to initialize BL-2. The general process is thus as follows,

$$\begin{aligned} \varTheta _{0} \leftarrow \theta , \end{aligned}$$

(5)

$$\begin{aligned} \varTheta _{m} \leftarrow \varTheta _{m-1} - \alpha _{m}\nabla _{\varTheta }\mathcal {L}^{(tr)}_{m} = \varTheta _{m-1} - \varPsi _{\alpha }(\tau , \nabla _{\varTheta }\mathcal {L}^{(tr)}_{m})\nabla _{\varTheta }\mathcal {L}^{(tr)}_{m}, \end{aligned}$$

(6)

where $\alpha _{m}$ is the learning rate output by $\varPsi _{\alpha }$, and $\nabla _{\varTheta }\mathcal {L}^{(tr)}_{m}$ are the derivatives of the training loss, i.e., gradients. $\tau $ represents either $x^{(tr)}$ in the inductive setting, or $\{x^{(tr)}, x^{(te)}\}$ in the transductive setting. Note that $\varTheta _0$ is introduced to make the notation consistent, and a subscript m is omitted from $\varPsi _{\alpha }$ for conciseness. Let $F(x; \varTheta _m)$ denote the prediction scores of input x, so the base-training loss on $\mathcal {T}^{(tr)}=\big \{x^{(tr)}, y^{(tr)}\big \}$ can be unfolded as,

$$\begin{aligned} \mathcal {L}^{(tr)}_{m}= L_{ce}\big (F(x^{(tr)}; \varTheta _{m-1}), y^{(tr)}\big ), \end{aligned}$$

(7)

where $L_{ce}$ is the softmax cross entropy loss. During episode test, each base-learner BL-m infers the prediction scores $z_m$ for test samples $x^{(te)}$,

$$\begin{aligned} z_m = F(x^{(te)}; \varTheta _m). \end{aligned}$$

(8)

Assume the hyperprior learner $\varPsi _{v}$ generates the combination weight $v_m$ for BL-m. The final prediction score is initialized as $\hat{y}^{(te)}_1=v_1 z_1$. For the m-th base epoch, the prediction $z_m$ will be calculated and added to $\hat{y}^{(te)}$ as follows,

$$\begin{aligned} \hat{y}^{(te)}_m \leftarrow v_m z_m + \hat{y}^{(te)}_{m-1} = \varPsi _{v}(\tau , \nabla _{\varTheta }\mathcal {L}^{(tr)}_{m}) F(x^{(te)}; \varTheta _{m}) + \hat{y}^{(te)}_{m-1}. \end{aligned}$$

(9)

In this way, we can update prediction scores without storing base-learners or feature maps in the memory.
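The procedure of Eqs. (5) to (9) can be sketched for a toy linear softmax classifier as follows. This is an illustrative sketch only: the base-learner is reduced to a bias-free linear layer, and the learning rates `alphas` and combination weights `vs` are passed in as plain lists standing in for the outputs of $\varPsi _{\alpha }$ and $\varPsi _{v}$. All function names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(theta, x):
    """Prediction scores F(x; Theta) of a linear classifier (one weight row per class)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in theta]

def ce_gradient(theta, xs, ys):
    """Gradient of the mean softmax cross-entropy loss w.r.t. theta."""
    grad = [[0.0] * len(row) for row in theta]
    for x, y in zip(xs, ys):
        probs = softmax(forward(theta, x))
        for c, p in enumerate(probs):
            err = p - (1.0 if c == y else 0.0)
            for j, xj in enumerate(x):
                grad[c][j] += err * xj / len(xs)
    return grad

def epoch_wise_ensemble(theta0, x_tr, y_tr, x_te, alphas, vs):
    """Run M base epochs (Eqs. 5-6) and accumulate weighted test scores (Eqs. 8-9)."""
    theta = [row[:] for row in theta0]              # Eq. (5): Theta_0 <- theta
    y_hat = [[0.0] * len(theta0) for _ in x_te]     # running ensemble scores
    for alpha_m, v_m in zip(alphas, vs):
        grad = ce_gradient(theta, x_tr, y_tr)       # loss evaluated at Theta_{m-1}, Eq. (7)
        theta = [[w - alpha_m * g for w, g in zip(row, grow)]
                 for row, grow in zip(theta, grad)]  # Eq. (6)
        for i, x in enumerate(x_te):
            z_m = forward(theta, x)                 # Eq. (8)
            y_hat[i] = [acc + v_m * z for acc, z in zip(y_hat[i], z_m)]  # Eq. (9)
    return y_hat, theta

# Toy 2-class episode with 2-dimensional features (values chosen for illustration)
theta0 = [[0.0, 0.0], [0.0, 0.0]]
x_tr, y_tr = [[1.0, 0.0], [0.0, 1.0]], [0, 1]
x_te = [[0.9, 0.1], [0.2, 0.8]]
alphas, vs = [0.5, 0.5, 0.5], [0.2, 0.3, 0.5]
y_hat, theta_M = epoch_wise_ensemble(theta0, x_tr, y_tr, x_te, alphas, vs)
```

Only the current parameters and the running scores `y_hat` are kept between epochs, which mirrors the memory argument above. Setting `vs` to `[0.0, 0.0, 1.0]` reduces the sketch to the usual last-epoch prediction.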

4.3 Meta-learning the Hyperprior Learners

As presented in Fig. 3, we introduce two architectures, i.e., LSTM or individual FC layers, for the hyperprior learner. FC layers at different epochs are independent. Using LSTM to “connect” all epochs is expected to “grasp” more task-specific information from the overall training states of the task. In the following, we elaborate the meta-learning details for both designs.

Assume that before the k-th episode, we have meta-learned the base learning rates $\{\alpha _m'\}_{m=1}^M$ and combination weights $\{v_m'\}_{m=1}^M$. Next, in the k-th episode, specifically at the m-th epoch as shown in Fig. 3, we compute the mean values of $\tau $ and $\nabla _{\varTheta _m}\mathcal {L}^{(tr)}_m$, respectively, over all samples. We then input the concatenated value to the FC or LSTM mapping function as follows,

$$\begin{aligned} \varDelta \alpha _m, \varDelta v_m = \text {FC}_m(\text {concat}[\bar{\tau }; \overline{\nabla _{\varTheta _m}\mathcal {L}^{(tr)}_m}]), \mathbf{or} \end{aligned}$$

(10)

$$\begin{aligned}{}[\varDelta \alpha _m, \varDelta v_m], h_{m} = \text {LSTM}(\text {concat}[\bar{\tau }; \overline{\nabla _{\varTheta _m}\mathcal {L}^{(tr)}_m}], h_{m-1}), \end{aligned}$$

(11)

where $h_{m}$ and $h_{m-1}$ are the hidden states at epoch m and epoch $m-1$, respectively. We then use the output values to update hyperparameters as,

$$\begin{aligned} \alpha _m = \lambda _1\alpha _m' + (1-\lambda _1)\varDelta \alpha _m, \ v_m = \lambda _2 v_m' + (1-\lambda _2)\varDelta v_m, \end{aligned}$$

(12)

where $\lambda _1$ and $\lambda _2$ are fixed fractions in (0, 1). Using learning rate $\alpha _m$, we update BL-$(m-1)$ to be BL-m with Eq. (6). After M epochs, we obtain the combination of predictions $\hat{y}^{(te)}_M$ (see Eq. (9)) on test samples. In training tasks, we compute the test loss as,

$$\begin{aligned} \mathcal {L}^{(te)}={L}_{ce}(\hat{y}^{(te)}_M ,y^{(te)}). \end{aligned}$$

(13)
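As an illustration of Eqs. (10), (12) and (13), the following sketch generates one pair of hyperparameters with an epoch-independent FC learner and evaluates the test loss for a single sample. The input dimensions, the weight values and the function names are hypothetical; a single two-output linear layer is used here, which is equivalent to two FC layers with one scalar output each.

```python
import math

def fc_hyperprior(tau_mean, grad_mean, weights, bias):
    """Eq. (10): one FC layer mapping concat[mean(tau); mean(grad)] to (delta_alpha, delta_v)."""
    features = list(tau_mean) + list(grad_mean)
    out = [sum(w * f for w, f in zip(row, features)) + b
           for row, b in zip(weights, bias)]
    return out[0], out[1]

def blend(prev, delta, lam):
    """Eq. (12): convex combination of the meta-learned value and the generated one."""
    return lam * prev + (1.0 - lam) * delta

def cross_entropy(scores, label):
    """Eq. (13) for one test sample: softmax cross-entropy of the combined scores."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]

# Hypothetical inputs: a 2-d mean of tau and a 1-d mean of the training gradients
weights = [[0.2, 0.0, 1.0], [0.0, 0.4, 0.0]]
bias = [0.0, 0.5]
delta_alpha, delta_v = fc_hyperprior([0.5, -0.5], [0.1], weights, bias)
alpha_m = blend(0.01, delta_alpha, lam=0.9)   # alpha'_m = 0.01, lambda_1 = 0.9 (assumed values)
v_m = blend(1.0, delta_v, lam=0.9)            # v'_m = 1.0, lambda_2 = 0.9 (assumed values)
loss = cross_entropy([0.0, 0.0], label=0)     # equals log(2) for uniform scores
```

With $\lambda _1$ and $\lambda _2$ close to 1, the generated values act as small task-specific corrections to the meta-learned $\alpha _m'$ and $v_m'$.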

We use this loss to calculate meta gradients to update $\varPsi $ as follows,

$$\begin{aligned} \varPsi _{\alpha } \leftarrow \varPsi _{\alpha } - \beta _1\nabla _{\varPsi _{\alpha }}\mathcal {L}^{(te)}, \ \ \varPsi _{v} \leftarrow \varPsi _{v} - \beta _2\nabla _{\varPsi _{v}}\mathcal {L}^{(te)}, \end{aligned}$$

(14)

where $\beta _1$ and $\beta _2$ are meta-learning rates that determine the respective step sizes for updating $\varPsi _{\alpha }$ and $\varPsi _{v}$. These updates back-propagate the test gradients to the input layer by unrolling all base-training gradients of $\varTheta _1\sim \varTheta _M$. The process thus involves a gradient through a gradient [13, 14, 70]. Computationally, it requires an additional backward pass through $\mathcal {L}^{(tr)}$ to compute Hessian-vector products, which is supported by standard numerical computation libraries such as TensorFlow [19] and PyTorch [55].

4.4 Plugging-In E$^3$BM to Baseline Methods

The optimization of $\varPsi $ relies on the meta gradient descent method, which was first applied to few-shot learning in MAML [13]. Recently, MTL [70] showed more efficiency by implementing that method on deeper pre-trained CNNs (e.g., ResNet-12 [70], and ResNet-25 [69]). SIB [25] was built on even deeper and wider networks (WRN-28-10), and it achieved top performance by synthesizing gradients in transductive learning. These three methods are all optimization-based, and use the single base-learner of the last base-training epoch. In the following, we describe how to learn and combine multiple base-learners in MTL, SIB and MAML, respectively, using our E$^3$BM approach.

Following [25, 70], we pre-train the feature extractor f on a many-shot classification task using the whole set of $\mathcal {D}$. The meta-learner in MTL is called scaling and shifting weights $\varPhi _{SS}$, and in SIB it is called synthetic information bottleneck network $\phi (\lambda , \xi )$. Besides, there is a common meta-learner called base-learner initializer $\theta $, i.e., the same $\theta $ in Fig. 2, in both methods. In MAML, the only base-learner is $\theta $ and there is no pre-training for its feature extractor f.

Given an episode $\mathcal {T}$, we feed training images $x^{(tr)}$ and test images $x^{(te)}$ to the feature extractor $f\odot \varPhi _{SS}$ in MTL (f in SIB and MAML), and obtain the embeddings $e^{(tr)}$ and $e^{(te)}$, respectively. Then in MTL, we use $e^{(tr)}$ with labels to train base-learner $\varTheta $ for M times to get $\{\varTheta _m\}_{m=1}^M$ with Eq. (6). In SIB, we use its multilayer perceptron (MLP) net to synthesize gradients conditional on $e^{(te)}$ to indirectly update $\{\varTheta _m\}_{m=1}^M$. During these updates, our hyperprior learner $\varPsi _{\alpha }$ derives the learning rates for all epochs. In episode test, we feed $e^{(te)}$ to $\{\varTheta _m\}_{m=1}^M$ and get the predictions $\{z_m\}_{m=1}^M$, which are combined with Eq. (9). Finally, we compute the test loss to meta-update $[\varPsi _{\alpha }; \varPsi _{v}; \varPhi _{SS}; \theta ]$ in MTL, $[\varPsi _{\alpha }; \varPsi _{v}; \phi (\lambda , \xi ); \theta ]$ in SIB, and $[f; \theta ]$ in MAML. We call the resulting methods MTL+E$^3$BM, SIB+E$^3$BM, and MAML+E$^3$BM, respectively, and demonstrate their improved efficiency over baseline models [13, 25, 70] in experiments.

5 Experiments

We evaluate our approach in terms of its overall performance and the effects of its two components, i.e. ensembling epoch-wise models and meta-learning hyperprior learners. In the following sections, we introduce the datasets and implementation details, compare our best results to the state-of-the-art, and conduct an ablation study.

5.1 Datasets and Implementation Details

Datasets. We conduct few-shot image classification experiments on three benchmarks: miniImageNet [73], tieredImageNet [58] and FC100 [53]. miniImageNet is the most widely used in related works [13, 24, 25, 70, 71]. tieredImageNet and FC100 have either a larger scale or a more challenging setting with lower image resolution, and have stricter training-test splits.

miniImageNet was proposed in [73] based on ImageNet [60]. There are 100 classes with 600 samples per class. Classes are divided into 64, 16, and 20 classes respectively for sampling tasks for meta-training, meta-validation and meta-test. tieredImageNet was proposed in [58]. It contains a larger subset of ImageNet [60] with 608 classes (779,165 images) grouped into 34 super-class nodes. These nodes are partitioned into 20, 6, and 8 disjoint sets respectively for meta-training, meta-validation and meta-test. Its super-class based training-test split results in a more challenging and realistic regime with test tasks that are less similar to training tasks. FC100 is based on CIFAR100 [33]. The few-shot task splits were proposed in [53]. It contains 100 object classes and each class has 600 samples of $32 \times 32$ color images. On these datasets, we consider the (5-class, 1-shot) and (5-class, 5-shot) classification tasks. We use the same task sampling strategy as in related works [1, 13, 25].

Backbone Architectures. In MAML+E$^3$BM, we use a 4-layer convolution network (4CONV) [1, 13]. In MTL+E$^3$BM, we use a 25-layer residual network (ResNet-25) [56, 69, 78]. After the convolution layers, we apply an average pooling layer and a fully-connected layer. In SIB+E$^3$BM, we use a 28-layer wide residual network (WRN-28-10), as in SIB [25].

The Configuration of Base-Learners. In MTL [70] and SIB [25], the base-learner is a single fully-connected layer. In MAML [13], the base-learner is the 4-layer convolution network. In MTL and MAML, the base-learner is randomly initialized and updated during meta-learning. In SIB, the base-learner is initialized with the averaged image features of each class. The numbers of base-learners M in MTL+E$^3$BM and SIB+E$^3$BM are 100 and 3, respectively, i.e., the original numbers of training epochs in [25, 70].

The Configuration of Hyperprior Learners. In Fig. 3, we show two options for hyperprior learners (i.e., $\varPsi _{\alpha }$ and $\varPsi _v$). Figure 3(a) is the epoch-independent option, where each epoch has two FC layers to produce $\alpha $ and v respectively. Figure 3(b) is the epoch-dependent option which uses an LSTM to generate $\alpha $ and v at all epochs. In terms of learning the hyperprior learners, we have two settings: inductive learning denoted as “Ind.”, and transductive learning as “Tra.”. “Ind.” is the supervised learning in classical few-shot learning methods [13, 37, 64, 70, 73]. “Tra.” is semi-supervised learning, based on the assumption that all test images of the episode are available. It has been applied in many recent works [24, 25, 45].

Ablation Settings. We conduct a careful ablative study for two components, i.e., “ensembling multiple base-learners” and “meta-learning hyperprior learners”. We show their effects indirectly by comparing our results to those of using arbitrary constant or learned values of v and $\alpha $. In terms of v, we have 5 ablation options: (v1) “E$^3$BM” is our method generating v from $\varPsi _{v}$; (v2) “learnable” is to set v to be updated by meta gradient descent, same as $\theta $ in [13]; (v3) “optimal” means using the values learned by option (v2) and freezing them during the actual learning; (v4) “equal” is a simple baseline using equal weights; (v5) “last-epoch” uses only the last-epoch base-learner, i.e., v is set to [0, 0, ..., 1]. In the experiments of (v1)-(v5), we simply set $\alpha $ as in the following (a4) [13, 25, 70]. In terms of $\alpha $, we have 4 ablation options: (a1) “E$^3$BM” is our method generating $\alpha $ from $\varPsi _{\alpha }$; (a2) “learnable” is to set $\alpha $ to be updated by meta gradient descent, same as $\theta $ in [13]; (a3) “optimal” means using the values learned by option (a2) and freezing them during the actual learning; (a4) “fixed” is a simple baseline that uses manually chosen $\alpha $ following [13, 25, 70]. In the experiments of (a1)-(a4), we simply set v as in (v5), same as the baseline method [70].
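The two constant-weight baselines, (v4) and (v5), can be written down directly. In the sketch below, the helper name `combination_weights` is hypothetical, and the normalization of the “equal” option to weights of 1/M is an assumption, since the text only states that the weights are equal.

```python
def combination_weights(option, num_epochs):
    """Return the fixed weight vector v for the 'equal' (v4) or 'last-epoch' (v5) ablation."""
    if option == "equal":
        # Assumption: equal weights are normalized so that they sum to 1
        return [1.0 / num_epochs] * num_epochs
    if option == "last-epoch":
        # Only the last-epoch base-learner contributes: v = [0, 0, ..., 1]
        return [0.0] * (num_epochs - 1) + [1.0]
    raise ValueError(f"unknown option: {option}")

v_equal = combination_weights("equal", 4)
v_last = combination_weights("last-epoch", 3)
```

Options (v1) to (v3) cannot be expressed as constants of this kind, because their weights are either generated per task or obtained by meta gradient descent.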

Table 1. The 5-class few-shot classification accuracies (%) on miniImageNet, tieredImageNet, and FC100. “(+time, +param)” denote the additional computational time (%) and parameter size (%), respectively, when plugging E$^3$BM into baselines (MAML, MTL and SIB). “–” means no reported results in original papers. The best and second-best results are highlighted.

Table 2. The 5-class few-shot classification accuracies (%) of using different hyperprior learners, on miniImageNet, tieredImageNet, and FC100. “Ind.” and “Tra.” denote the inductive and transductive settings, respectively. The best and second-best results are highlighted.

5.2 Results and Analyses

In Table 1, we compare our best results to the state of the art. In Table 2, we present the results of using different kinds of hyperprior learners, i.e., two architectures (FC and LSTM) and two learning strategies (inductive and transductive). In Fig. 4(a)(b), we show the validation results of our ablative methods and how they change during meta-training iterations. In Fig. 4(c)(d), we plot the generated values of v and $\alpha $ during meta-training.

Comparing to the State of the Art. Table 1 shows that the proposed E$^3$BM achieves the best few-shot classification performance in both 1-shot and 5-shot settings on all three benchmarks. Please note that [12] reports results for different backbones and input image sizes. For a fair comparison, we choose its results under the same setting as ours, i.e., using WRN-28-10 networks and $80\times 80\times 3$ images. In our approach, plugging E$^3$BM into the state-of-the-art model SIB achieves an improvement of $1.6\%$ on average, based on the identical network architecture. The improvement is significantly larger, $2.9\%$, when taking MAML as the baseline. These gains are all the more notable given the tiny overheads of plugging in E$^3$BM. For example, using E$^3$BM adds only $0.04\%$ learning parameters to the original SIB model and incurs only $5.2\%$ average overhead in computational time. It is worth mentioning that the number of learnable parameters in SIB+E$^3$BM is around $80\%$ less than that of the model in [12], which ensembles 5 deep networks in parallel (and later learns a distillation network).

Hyperprior Learners. In Table 2, we can see that transductive learning clearly outperforms inductive learning, e.g., No. 5 vs. No. 4. This is because “transduction” leverages additional data, i.e., the episode-test images (without labels), during base-training. In terms of the network architecture, we observe that LSTM-based learners are slightly better than FC-based ones (e.g., No. 3 vs. No. 2). LSTM is a sequential model and is thus able to “observe” more patterns from the adaptation behaviors of models at adjacent epochs.
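
The difference between the two settings lies only in which images contribute to the task-specific input of the hyperprior learners. As a hedged illustration, the sketch below assumes that this input is a mean over image feature vectors; the function names `mean_embedding` and `task_statistic` and this pooling choice are assumptions made for the example, not details confirmed in this section. No test labels are used in either setting.

```python
def mean_embedding(features):
    """Element-wise mean of a list of equally sized feature vectors."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]


def task_statistic(train_features, test_features=None):
    """Inductive: pool episode-training features only.

    Transductive: additionally pool the unlabeled episode-test features.
    """
    pooled = list(train_features)
    if test_features is not None:
        pooled.extend(test_features)
    return mean_embedding(pooled)


train = [[1.0, 2.0], [3.0, 4.0]]
test = [[5.0, 6.0]]
inductive = task_statistic(train)            # [2.0, 3.0]
transductive = task_statistic(train, test)   # [3.0, 4.0]
```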

Ablation Study. Figure 4(a) shows the comparisons among the $\alpha $-related ablation models. Our E$^3$BM again performs the best, both over the models using an arbitrary $\alpha $ and over the model with $\alpha $ optimized by meta gradient descent [13]. Figure 4(b) shows that our approach E$^3$BM works consistently better than the ablation models related to v. We should emphasize that E$^3$BM is clearly more efficient than the model trained with v meta-learned through meta gradient descent [13]. This is because the E$^3$BM hyperprior learners generate empirical weights conditional on task-specific data. The LSTM-based learners can leverage even more task-specific information, i.e., the hidden states from previous epochs, to improve the efficiency.

The Values of $\alpha $ and v Learned by E$^3$BM. Figure 4(c)(d) shows the values of $\alpha $ and v during the meta-training iterations in our approach. Figure 4(c) shows that the base-learners working at later training epochs (e.g., BL-100) tend to get smaller values of $\alpha $. This is similar to the common manual schedule, i.e., monotonically decreasing learning rates, of conventional large-scale network training [21]. The difference is that in our approach, this is “scheduled” in a fully automated way by the hyperprior learners. Another observation is that the highest learning rate is applied to BL-1. This encourages BL-1 to have as much influence as possible, which helps to reduce meta gradient diminishing when unrolling and back-propagating gradients through many base-learning epochs (e.g., 100 epochs in MTL). Figure 4(d) shows that BL-1, working at the initial epoch, has the lowest values of v. In other words, BL-1 is almost disabled in the prediction on the episode test data. Intriguingly, BL-25 instead of BL-100 gains the highest v values. Our explanation is that during base-learning, base-learners at later epochs become more overfitted to the few training samples. Their functionality is thus suppressed. Note that our empirical results revealed that including the overfitted base-learners slightly improves the generalization capability of the approach.
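
The role of an epoch-wise $\alpha $ can be illustrated on a toy problem. The sketch below is a simplified assumption rather than the paper's training code: it takes one plain gradient step per epoch on a scalar parameter, uses a hand-picked decreasing schedule in place of generated learning rates, and keeps the parameter after every epoch as one base-learner (the names `base_learning` and `toy_grad` are hypothetical).

```python
def base_learning(theta0, alphas, grad_fn):
    """Run one gradient step per epoch, each with its own learning rate.

    Returns the parameter after every epoch, i.e., one base-learner per epoch.
    """
    thetas = []
    theta = theta0
    for alpha in alphas:
        theta = theta - alpha * grad_fn(theta)
        thetas.append(theta)
    return thetas


def toy_grad(theta):
    """Gradient of the toy loss L(theta) = (theta - 1)^2."""
    return 2.0 * (theta - 1.0)


# Hypothetical decreasing schedule: the first epoch gets the largest rate.
snapshots = base_learning(theta0=0.0, alphas=[0.25, 0.125], grad_fn=toy_grad)
print(snapshots)  # [0.5, 0.625]
```

Each entry of `snapshots` plays the role of one BL-m; the values of v then decide how much each of them contributes to the final prediction.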

6 Conclusions

We propose a novel E$^3$BM approach that tackles the few-shot problem with an ensemble of epoch-wise base-learners that are trained and combined with task-specific hyperparameters. Specifically, E$^3$BM meta-learns the hyperprior learners to generate such hyperparameters conditional on the images as well as the training states of each episode. The resulting model makes use of multiple base-learners for more robust predictions. It does not change the basic training paradigm of episodic few-shot learning, and is thus generic and easy to plug into existing methods. By applying E$^3$BM to multiple baseline methods, e.g., MAML, MTL and SIB, we achieved top performance on three challenging few-shot image classification benchmarks, with little computation or parametrization overhead.

Notes

1. In the inductive setting, training images are used to compute $\bar{\tau }$, while in the transductive setting, test images are additionally used.

References

Antoniou, A., Edwards, H., Storkey, A.: How to train your MAML. In: ICLR (2019)
Bart, E., Ullman, S.: Cross-generalization: learning novel classes from a single example by feature replacement. In: CVPR, pp. 672–679 (2005)
Bengio, Y.: Gradient-based optimization of hyperparameters. Neural Comput. 12(8), 1889–1900 (2000)
Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012)
Blei, D.M., Kucukelbir, A., McAuliffe, J.D.: Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112(518), 859–877 (2017)
Breiman, L.: Stacked regressions. Mach. Learn. 24(1), 49–64 (1996)
Caruana, R.: Learning many related tasks at the same time with backpropagation. In: NIPS, pp. 657–664 (1995)
Chen, W.Y., Liu, Y.C., Kira, Z., Wang, Y.C., Huang, J.B.: A closer look at few-shot classification. In: ICLR (2019)
Chen, Z., Fu, Y., Zhang, Y., Jiang, Y., Xue, X., Sigal, L.: Multi-level semantic feature augmentation for one-shot learning. IEEE Trans. Image Process. 28(9), 4594–4605 (2019)
Domke, J.: Generic methods for optimization-based modeling. In: AISTATS, pp. 318–326 (2012)
Dvornik, N., Schmid, C., Julien, M.: f-VAEGAN-D2: A feature generating framework for any-shot learning. In: ICCV, pp. 10275–10284 (2019)
Dvornik, N., Schmid, C., Mairal, J.: Diversity with cooperation: Ensemble methods for few-shot classification. In: ICCV, pp. 3722–3730 (2019)
Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML, pp. 1126–1135 (2017)
Finn, C., Xu, K., Levine, S.: Probabilistic model-agnostic meta-learning. In: NeurIPS, pp. 9537–9548 (2018)
Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., Pontil, M.: Bilevel programming for hyperparameter optimization and meta-learning. In: ICML, pp. 1563–1572 (2018)
Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (1997)
Friedman, J.H.: Stochastic gradient boosting. Comput. Stat. Data Anal. 38(4), 367–378 (2002)
Hinton, G.E., Plaut, D.C.: Using fast weights to deblur old memories. In: CogSci, pp. 177–186 (1987)
Girija, S.S.: TensorFlow: large-scale machine learning on heterogeneous distributed systems. 39 (2016). tensorflow.org
Grant, E., Finn, C., Levine, S., Darrell, T., Griffiths, T.L.: Recasting gradient-based meta-learning as hierarchical Bayes. In: ICLR (2018)
He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., Li, M.: Bag of tricks for image classification with convolutional neural networks. In: CVPR, pp. 558–567 (2019)
Ho, T.K.: Random decision forests. In: ICDAR, vol. 1, pp. 278–282 (1995)
Hoffman, M.D., Blei, D.M., Wang, C., Paisley, J.: Stochastic variational inference. J. Mach. Learn. Res. 14(1), 1303–1347 (2013)
Hou, R., Chang, H., Ma, B., Shan, S., Chen, X.: Cross attention network for few-shot classification. In: NeurIPS, pp. 4005–4016 (2019)
Hu, S.X., et al.: Empirical Bayes meta-learning with synthetic gradients. In: ICLR (2020)
Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J.E., Weinberger, K.Q.: Snapshot ensembles: Train 1, get M for free. In: ICLR (2017)
Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: Coello, C.A.C. (ed.) LION 2011. LNCS, vol. 6683, pp. 507–523. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25566-3_40
Jaderberg, M., et al.: Population based training of neural networks. arXiv:1711.09846 (2017)
Ju, C., Bibaut, A., van der Laan, M.: The relative performance of ensemble methods with deep convolutional neural networks for image classification. J. Appl. Stat. 45(15), 2800–2818 (2018)
Jung, H.G., Lee, S.W.: Few-shot learning with geometric constraints. IEEE Trans. Neural Netw. Learn. Syst. (2020)
Kim, J., Kim, T., Kim, S., Yoo, C.D.: Edge-labeling graph neural network for few-shot learning. In: CVPR, pp. 11–20 (2019)
Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: ICLR (2014)
Krizhevsky, A.: Learning multiple layers of features from tiny images. University of Toronto (2009)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1097–1105 (2012)
Kuncheva, L.I., Whitaker, C.J.: Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Mach. Learn. 51(2), 181–207 (2003)
Laine, S., Aila, T.: Temporal ensembling for semi-supervised learning. In: ICLR (2017)
Lee, K., Maji, S., Ravichandran, A., Soatto, S.: Meta-learning with differentiable convex optimization. In: CVPR, pp. 10657–10665 (2019)
Lee, Y., Choi, S.: Gradient-based meta-learning with learned layerwise metric and subspace. In: ICML, pp. 2933–2942 (2018)
Li, F., Fergus, R., Perona, P.: One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 28(4), 594–611 (2006)
Li, H., Eigen, D., Dodge, S., Zeiler, M., Wang, X.: Finding task-relevant features for few-shot learning by category traversal. In: CVPR, pp. 1–10 (2019)
Li, H., Dong, W., Mei, X., Ma, C., Huang, F., Hu, B.: LGM-Net: learning to generate matching networks for few-shot learning. In: ICML, pp. 3825–3834 (2019)
Li, L., Jamieson, K.G., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Hyperband: a novel bandit-based approach to hyperparameter optimization. J. Mach. Learn. Res. 18, 185:1–185:52 (2017)
Li, X., et al.: Learning to self-train for semi-supervised few-shot classification. In: NeurIPS, pp. 10276–10286 (2019)
Li, Z., Zhou, F., Chen, F., Li, H.: Meta-SGD: learning to learn quickly for few shot learning. arXiv:1707.09835 (2017)
Liu, Y., Lee, J., Park, M., Kim, S., Yang, Y.: Learning to propagate labels: transductive propagation network for few-shot learning. In: ICLR (2019)
Liu, Y., Su, Y., Liu, A.A., Schiele, B., Sun, Q.: Mnemonics training: multi-class incremental learning without forgetting. In: CVPR, pp. 12245–12254 (2020)
Luketina, J., Raiko, T., Berglund, M., Greff, K.: Scalable gradient-based tuning of continuous regularization hyperparameters. In: ICML, pp. 2952–2960 (2016)
Maclaurin, D., Duvenaud, D.K., Adams, R.P.: Gradient-based hyperparameter optimization through reversible learning. In: ICML, pp. 2113–2122 (2015)
Metz, L., Maheswaranathan, N., Cheung, B., Sohl-Dickstein, J.: Meta-learning update rules for unsupervised representation learning. In: ICLR (2019)
Mishra, N., Rohaninejad, M., Chen, X., Abbeel, P.: Snail: a simple neural attentive meta-learner. In: ICLR (2018)
Mitchell, T.: Machine Learning. McGraw-Hill Higher Education, New York (1997)
Munkhdalai, T., Yu, H.: Meta networks. In: ICML, pp. 2554–2563 (2017)
Oreshkin, B.N., Rodríguez, P., Lacoste, A.: TADAM: task dependent adaptive metric for improved few-shot learning. In: NeurIPS, pp. 719–729 (2018)
Ozay, M., Vural, F.T.Y.: A new fuzzy stacked generalization technique and analysis of its performance. arXiv:1204.0171 (2012)
Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS, pp. 8024–8035 (2019)
Qiao, S., Liu, C., Shen, W., Yuille, A.L.: Few-shot image recognition by predicting parameters from activations. In: CVPR, pp. 7229–7238 (2018)
Ravi, S., Larochelle, H.: Optimization as a model for few-shot learning. In: ICLR (2017)
Ren, M., et al.: Meta-learning for semi-supervised few-shot classification. In: ICLR (2018)
Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. In: ICML, pp. 1278–1286 (2014)
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015)
Rusu, A.A., et al.: Meta-learning with latent embedding optimization. In: ICLR (2019)
Satorras, V.G., Estrach, J.B.: Few-shot learning with graph neural networks. In: ICLR (2018)
Smyth, P., Wolpert, D.: Linearly combining density estimators via stacking. Mach. Learn. 36(1–2), 59–83 (1999)
Snell, J., Swersky, K., Zemel, R.S.: Prototypical networks for few-shot learning. In: NIPS, pp. 4077–4087 (2017)
Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: NIPS, pp. 2951–2959 (2012)
Snoek, J., et al.: Scalable Bayesian optimization using deep neural networks. In: ICML, pp. 2171–2180 (2015)
Snoek, J., Swersky, K., Zemel, R.S., Adams, R.P.: Input warping for Bayesian optimization of non-stationary functions. In: ICML, pp. 1674–1682 (2014)
Sollich, P., Krogh, A.: Learning with ensembles: how overfitting can be useful. In: NIPS, pp. 190–196 (1996)
Sun, Q., Liu, Y., Chen, Z., Chua, T., Schiele, B.: Meta-transfer learning through hard tasks. arXiv:1910.03648 (2019)
Sun, Q., Liu, Y., Chua, T.S., Schiele, B.: Meta-transfer learning for few-shot learning. In: CVPR, pp. 403–412 (2019)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H.S., Hospedales, T.M.: Learning to compare: relation network for few-shot learning. In: CVPR, pp. 1199–1208 (2018)
Thrun, S., Pratt, L.: Learning to learn: introduction and overview. In: Thrun, S., Pratt, L. (eds.) Learning to Learn, pp. 3–17. Springer, Boston (1998). https://doi.org/10.1007/978-1-4615-5529-2_1
Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., Wierstra, D.: Matching networks for one shot learning. In: NIPS, pp. 3630–3638 (2016)
Wang, X., Huang, T.E., Darrell, T., Gonzalez, J.E., Yu, F.: Frustratingly simple few-shot object detection. In: ICML (2020)
Wang, Y., Girshick, R.B., Hebert, M., Hariharan, B.: Low-shot learning from imaginary data. In: CVPR, pp. 7278–7286 (2018)
Wang, Y.X., Hebert, M.: Learning from small sample sets by combining unsupervised meta-training with CNNs. In: NIPS, pp. 244–252 (2016)
Xian, Y., Sharma, S., Schiele, B., Akata, Z.: f-VAEGAN-D2: a feature generating framework for any-shot learning. In: CVPR, pp. 10275–10284 (2019)
Ye, H.J., Hu, H., Zhan, D.C., Sha, F.: Learning embedding adaptation for few-shot learning. arXiv:1812.03664 (2018)
Yoon, J., Kim, T., Dia, O., Kim, S., Bengio, Y., Ahn, S.: Bayesian model-agnostic meta-learning. In: NeurIPS, pp. 7343–7353 (2018)
Zhang, C., Cai, Y., Lin, G., Shen, C.: DeepEMD: differentiable earth mover’s distance for few-shot learning. arXiv:2003.06777 (2020)
Zhang, C., Cai, Y., Lin, G., Shen, C.: DeepEMD: few-shot image classification with differentiable earth mover’s distance and structured classifiers. In: CVPR, pp. 12203–12213 (2020)
Zhang, C., Lin, G., Liu, F., Guo, J., Wu, Q., Yao, R.: Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation. In: ICCV, pp. 9587–9595 (2019)
Zhang, C., Lin, G., Liu, F., Yao, R., Shen, C.: CANet: class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In: CVPR, pp. 5217–5226 (2019)
Zhang, L., et al.: Nonlinear regression via deep negative correlation learning. IEEE Trans. Pattern Anal. Mach. Intell. (2019)
Zhang, R., Che, T., Ghahramani, Z., Bengio, Y., Song, Y.: MetaGAN: an adversarial approach to few-shot learning. In: NeurIPS, pp. 2371–2380 (2018)

Acknowledgments

This research was supported by the Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 1 grant. We thank all reviewers and area chairs for their constructive suggestions.

Author information

Authors and Affiliations

Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
Yaoyao Liu & Bernt Schiele
School of Information Systems, Singapore Management University, Singapore, Singapore
Qianru Sun

Corresponding author

Correspondence to Yaoyao Liu.

Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm

Cite this paper

Liu, Y., Schiele, B., Sun, Q. (2020). An Ensemble of Epoch-Wise Empirical Bayes for Few-Shot Learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12361. Springer, Cham. https://doi.org/10.1007/978-3-030-58517-4_24

DOI: https://doi.org/10.1007/978-3-030-58517-4_24
Published: 10 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58516-7
Online ISBN: 978-3-030-58517-4
eBook Packages: Computer Science, Computer Science (R0)