Abstract
Object recognition in the real world requires handling long-tailed or even open-ended data. An ideal visual system needs to reliably recognize the populated head visual concepts and, at the same time, efficiently learn about emerging new tail categories from a few training instances. Class-balanced many-shot learning and few-shot learning each tackle one side of this problem, by either learning strong classifiers for the head or learning to learn few-shot classifiers for the tail. In this paper, we investigate the problem of generalized few-shot learning (GFSL), where a deployed model is required to learn about tail categories with few shots and simultaneously classify the head classes. We propose ClAssifier SynThesis LEarning (Castle), a learning framework that learns how to synthesize calibrated few-shot classifiers in addition to the multi-class classifiers of the head classes with a shared neural dictionary, shedding light on inductive GFSL. Furthermore, we propose an adaptive version of Castle (aCastle) that adapts the head classifiers conditioned on the incoming tail training examples, yielding a framework that allows effective backward knowledge transfer. As a consequence, aCastle can handle GFSL with classes from heterogeneous domains effectively. Castle and aCastle demonstrate superior performance to existing GFSL algorithms and strong baselines on the MiniImageNet and TieredImageNet datasets. More interestingly, they outperform previous state-of-the-art methods when evaluated with standard few-shot learning criteria.
Notes
We use the superscripts \({{\mathcal {S}}}\) and \({{\mathcal {U}}}\) to denote that a set or an instance is sampled from \({{\mathcal {S}}}\) and \({{\mathcal {U}}}\), respectively.
\(|\mathcal {S}|\) and \(|\mathcal {U}|\) denote the total number of classes from the seen and unseen class sets, respectively.
Our implementation will be publicly available on https://github.com/Sha-Lab/aCASTLE.
References
Akata, Z., Perronnin, F., Harchaoui, Z., & Schmid, C. (2013). Label-embedding for attribute-based classification. In IEEE conference on computer vision and pattern recognition (pp. 819–826).
Antoniou, A., Edwards, H., & Storkey, A. J. (2019). How to train your MAML. In Proceedings of the 7th international conference on learning representations.
Ba, L. J., Kiros, R., & Hinton, G. E. (2016). Layer normalization. CoRR arXiv:1607.06450.
Bertinetto, L., Henriques, J. F., Torr, P. H. S., & Vedaldi, A. (2019). Meta-learning with differentiable closed-form solvers. In Proceedings of the 7th international conference on learning representations.
Cao, K., Wei, C., Gaidon, A., Arechiga, N., & Ma, T. (2019). Learning imbalanced datasets with label-distribution-aware margin loss. Advances in Neural Information Processing Systems, 32, 1565–1576.
Changpinyo, S., Chao, W. L., & Sha, F. (2017). Predicting visual exemplars of unseen classes for zero-shot learning. In IEEE international conference on computer vision (pp. 3496–3505).
Changpinyo, S., Chao, W. L., Gong, B., & Sha, F. (2016). Synthesized classifiers for zero-shot learning. In IEEE conference on computer vision and pattern recognition (pp. 5327–5336).
Changpinyo, S., Chao, W. L., Gong, B., & Sha, F. (2020). Classifier and exemplar synthesis for zero-shot learning. International Journal of Computer Vision, 128(1), 166–201.
Chao, W. L., Changpinyo, S., Gong, B., & Sha, F. (2016). An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In Proceedings of the 14th European conference on computer vision (pp. 52–68).
Chen, W. Y., Liu, Y. C., Kira, Z., Wang, Y. C. F., & Huang, J. B. (2019). A closer look at few-shot classification. In Proceedings of the 7th international conference on learning representations.
Cui, Y., Jia, M., Lin, T. Y., Song, Y., & Belongie, S. J. (2019). Class-balanced loss based on effective number of samples. In IEEE conference on computer vision and pattern recognition (pp. 9268–9277).
Das, D., & Lee, C. S. G. (2020). A two-stage approach to few-shot learning for image recognition. IEEE Transactions on Image Processing, 29, 3336–3350.
Dong, N., & Xing, E. P. (2018). Domain adaption in one-shot learning. In Proceedings of the European conference on machine learning and knowledge discovery in databases (pp. 573–588).
Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th international conference on machine learning (pp. 1126–1135).
Gao, H., Shou, Z., Zareian, A., Zhang, H., & Chang, S. F. (2018). Low-shot learning via covariance-preserving adversarial augmentation networks. Advances in Neural Information Processing Systems, 31, 983–993.
Ghiasi, G., Lin, T. Y., & Le, Q. V. (2018). Dropblock: A regularization method for convolutional networks. Advances in Neural Information Processing Systems, 31, 10750–10760.
Gidaris, S., & Komodakis, N. (2018). Dynamic few-shot visual learning without forgetting. In IEEE conference on computer vision and pattern recognition (pp. 4367–4375).
Gu, J., Wang, Y., Chen, Y., Li, V. O. K., & Cho, K. (2018). Meta-learning for low-resource neural machine translation. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 3622–3631).
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. In Proceedings of the 34th international conference on machine learning (pp. 1321–1330).
Hariharan, B., & Girshick, R. B. (2017). Low-shot visual recognition by shrinking and hallucinating features. In IEEE international conference on computer vision (pp. 3037–3046).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition (pp. 770–778).
Hinton, G. E., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. CoRR arXiv:1503.02531.
Kang, B., & Feng, J. (2018). Transferable meta learning across domains. In Proceedings of the 34th conference on uncertainty in artificial intelligence (pp. 177–187).
Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng, J., & Kalantidis, Y. (2020). Decoupling representation and classifier for long-tailed recognition. In Proceedings of the 8th international conference on learning representations.
Khosla, A., Jayadevaprakash, N., Yao, B., & Fei-Fei, L. (2011). Novel dataset for fine-grained image categorization. In 1st workshop on fine-grained visual categorization, IEEE conference on computer vision and pattern recognition.
Koch, G., Zemel, R., & Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In ICML deep learning workshop (Vol. 2).
Krause, J., Stark, M., Deng, J., & Fei-Fei, L. (2013). 3D object representations for fine-grained categorization. In 4th international IEEE workshop on 3D representation and recognition.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90.
Lampert, C. H., Nickisch, H., & Harmeling, S. (2014). Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3), 453–465.
Larochelle, H. (2018). Few-shot learning with meta-learning: Progress made and challenges ahead.
Lee, Y., & Choi, S. (2018). Gradient-based meta-learning with learned layerwise metric and subspace. In Proceedings of the 35th international conference on machine learning (pp. 2933–2942).
Lee, K., Maji, S., Ravichandran, A., & Soatto, S. (2019). Meta-learning with differentiable convex optimization. In IEEE conference on computer vision and pattern recognition (pp. 10657–10665).
Li, H., Eigen, D., Dodge, S., Zeiler, M., & Wang, X. (2019). Finding task-relevant features for few-shot learning by category traversal. In IEEE conference on computer vision and pattern recognition (pp. 1–10).
Li, Z., Zhou, F., Chen, F., & Li, H. (2017). Meta-SGD: Learning to learn quickly for few shot learning. CoRR arXiv:1707.09835.
Lifchitz, Y., Avrithis, Y., Picard, S., & Bursuc, A. (2019). Dense classification and implanting for few-shot learning. In IEEE conference on computer vision and pattern recognition (pp. 9258–9267).
Li, F. F., Fergus, R., & Perona, P. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611.
Li, Z., & Hoiem, D. (2018). Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12), 2935–2947.
Li, X., Sun, Q., Liu, Y., Zhou, Q., Zheng, S., Chua, T. S., et al. (2019). Learning to self-train for semi-supervised few-shot classification. Advances in Neural Information Processing Systems, 32, 10276–10286.
Liu, Y., Liu, A. A., Su, Y., Schiele, B., & Sun, Q. (2020). Mnemonics training: Multi-class incremental learning without forgetting. In IEEE conference on computer vision and pattern recognition (pp. 12245–12254).
Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., & Yu, S. X. (2019). Large-scale long-tailed recognition in an open world. In IEEE conference on computer vision and pattern recognition (pp. 2537–2546).
Lopez-Paz, D., & Ranzato, M. (2017). Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems, 30, 6467–6476.
Maji, S., Rahtu, E., Kannala, J., Blaschko, M. B., & Vedaldi, A. (2013). Fine-grained visual classification of aircraft. CoRR arXiv:1306.5151.
Nichol, A., Achiam, J., & Schulman, J. (2018). On first-order meta-learning algorithms. CoRR arXiv:1803.02999.
Oreshkin, B. N., López, P. R., & Lacoste, A. (2018). TADAM: Task dependent adaptive metric for improved few-shot learning. Advances in Neural Information Processing Systems, 31, 719–729.
Qiao, S., Liu, C., Shen, W., & Yuille, A. L. (2018). Few-shot image recognition by predicting parameters from activations. In IEEE conference on computer vision and pattern recognition (pp. 7229–7238).
Quattoni, A., & Torralba, A. (2009). Recognizing indoor scenes. In IEEE conference on computer vision and pattern recognition (pp. 413–420).
Ravi, S., & Larochelle, H. (2017). Optimization as a model for few-shot learning. In Proceedings of the 5th international conference on learning representations.
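Rebuffi, S. A., Kolesnikov, A., Sperl, G., & Lampert, C. H. (2017). iCaRL: Incremental classifier and representation learning. In IEEE conference on computer vision and pattern recognition (pp. 2001–2010).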
Reed, S. E., Chen, Y., Paine, T., van den Oord, A., Eslami, S. M. A., Rezende, D. J., et al. (2018). Few-shot autoregressive density estimation: Towards learning to learn distributions. In Proceedings of the 6th international conference on learning representations.
Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, J.B., et al. (2018). Meta-learning for semi-supervised few-shot classification. In Proceedings of the 6th international conference on learning representations.
Ren, M., Liao, R., Fetaya, E., & Zemel, R. (2019). Incremental few-shot learning with attention attractor networks. Advances in Neural Information Processing Systems, 32, 5276–5286.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Rusu, A. A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., et al. (2019). Meta-learning with latent embedding optimization. In Proceedings of the 7th international conference on learning representations.
Schönfeld, E., Ebrahimi, S., Sinha, S., Darrell, T., & Akata, Z. (2019). Generalized zero- and few-shot learning via aligned variational autoencoders. In IEEE conference on computer vision and pattern recognition (pp. 8247–8255).
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd international conference on learning representations.
Snell, J., Swersky, K., & Zemel, R. S. (2017). Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems, 30, 4080–4090.
Sun, Q., Liu, Y., Chua, T. S., & Schiele, B. (2019). Meta-transfer learning for few-shot learning. In IEEE conference on computer vision and pattern recognition (pp. 403–412).
Triantafillou, E., Zhu, T., Dumoulin, V., Lamblin, P., Evci, U., Xu, K., et al. (2020). Meta-dataset: A dataset of datasets for learning to learn from few examples. In Proceedings of the 8th international conference on learning representations.
Triantafillou, E., Zemel, R. S., & Urtasun, R. (2017). Few-shot learning through an information retrieval lens. Advances in Neural Information Processing Systems, 30, 2252–2262.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998–6008.
Venkateswara, H., Eusebio, J., Chakraborty, S., & Panchanathan, S. (2017). Deep hashing network for unsupervised domain adaptation. In IEEE conference on computer vision and pattern recognition (pp. 5385–5394).
Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., & Wierstra, D. (2016). Matching networks for one shot learning. Advances in Neural Information Processing Systems, 29, 3630–3638.
Vuorio, R., Sun, S. H., Hu, H., & Lim, J. J. (2019). Multimodal model-agnostic meta-learning via task-aware modulation. Advances in Neural Information Processing Systems, 32, 1–12.
Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 Dataset. Technical report CNS-TR-2011-001, California Institute of Technology.
Wang, Y., Chao, W. L., Weinberger, K. Q., & van der Maaten, L. (2019). Simpleshot: Revisiting nearest-neighbor classification for few-shot learning. CoRR arXiv:1911.04623.
Wang, Y. X., Girshick, R. B., Hebert, M., & Hariharan, B. (2018). Low-shot learning from imaginary data. In IEEE conference on computer vision and pattern recognition (pp. 7278–7286).
Wang, T., Zhu, J. Y., Torralba, A., & Efros, A. A. (2018). Dataset distillation. CoRR arXiv:1811.10959.
Wang, Y. X., Ramanan, D., & Hebert, M. (2017). Learning to model the tail. Advances in Neural Information Processing Systems, 30, 7032–7042.
Xian, Y., Schiele, B., & Akata, Z. (2017). Zero-shot learning—The good, the bad and the ugly. In IEEE conference on computer vision and pattern recognition (pp. 3077–3086).
Ye, H. J., Chen, H. Y., Zhan, D. C., & Chao, W. L. (2020). Identifying and compensating for feature deviation in imbalanced deep learning. CoRR arXiv:2001.01385.
Ye, H. J., Hu, H., Zhan, D. C., & Sha, F. (2020). Few-shot learning via embedding adaptation with set-to-set functions. In IEEE conference on computer vision and pattern recognition (pp. 8808–8817).
Yoon, S. W., Seo, J., & Moon, J. (2019). Tapnet: Neural network augmented with task-adaptive projection for few-shot learning. In Proceedings of the 36th international conference on machine learning (pp. 7115–7123).
Zhou, B., Cui, Q., Wei, X. S., & Chen, Z. M. (2020). BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In IEEE conference on computer vision and pattern recognition (pp. 9719–9728).
Acknowledgements
Thanks to Fei Sha for valuable discussions. This research is partially supported by NSFC Grants (61773198, 61751306, 61632004, 62006112), the NSFC-NRF Joint Research Project under Grant 61861146001, NSF Awards IIS-1513966/1632803/1833137 and CCF-1139148, DARPA Award FA8750-18-2-0117, DARPA-D3M Award UCB-00009528, Google Research Awards, gifts from Facebook and Netflix, and ARO Awards W911NF-12-1-0241 and W911NF-15-1-0484.
Appendices
Appendix A: Implementation Details
A.1 Pre-training Strategy
On MiniImageNet, we add a linear layer on top of the backbone output and optimize a 64-way classification problem on the meta-training set with the cross-entropy loss. Stochastic gradient descent with an initial learning rate of 0.1 and momentum 0.9 is used for this optimization. The 16 MiniImageNet classes reserved for model selection also guide the choice of the pre-trained model: after each epoch, we use the current embedding and measure the nearest-neighbor few-shot classification performance on tasks sampled from these 16 classes, and the most suitable embedding function is recorded. The learned backbone is then used to initialize the embedding part \(\phi \) of the whole model. The same strategy is applied to the meta-training sets of the TieredImageNet, Heterogeneous, and Office-Home datasets, where 351-way, 100-way, and 25-way classifiers are pre-trained, respectively. A sketch of this procedure is given below.
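The following PyTorch-style sketch summarizes the pre-training and model-selection loop under simplifying assumptions. The parameter names (`backbone`, `train_loader`, `evaluate_fewshot`) and the default `num_epochs` are illustrative placeholders, not identifiers or values from the released code.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def pretrain_backbone(backbone, train_loader, evaluate_fewshot, num_epochs=90):
    """64-way pre-training with cross-entropy, keeping the epoch whose embedding
    gives the best nearest-neighbor few-shot accuracy on the 16 validation classes.

    `evaluate_fewshot(backbone)` is a placeholder for the validation routine;
    `num_epochs=90` is an assumed value, not taken from the paper."""
    classifier = nn.Linear(640, 64)                    # linear head on the 640-d embedding
    params = list(backbone.parameters()) + list(classifier.parameters())
    optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9)

    best_acc, best_state = 0.0, None
    for _ in range(num_epochs):
        backbone.train()
        for images, labels in train_loader:
            logits = classifier(backbone(images))      # 64-way classification
            loss = F.cross_entropy(logits, labels)     # cross-entropy pre-training loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        backbone.eval()
        with torch.no_grad():
            acc = evaluate_fewshot(backbone)           # nearest-neighbor validation accuracy
        if acc > best_acc:                             # keep the most suitable embedding
            best_acc, best_state = acc, copy.deepcopy(backbone.state_dict())
    return best_state
```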
A.2 Feature Network Specification
We follow Qiao et al. (2018) and Rusu et al. (2019) when investigating multi-domain GFSL, where images are resized to \(84\times 84\times 3\). Concretely, an initial convolutional layer (stride 1, padding 1) over the image is followed by three residual blocks with 160/320/640 channels, stride 2, and padding 2. A global average pooling layer then yields a 640-dimensional embedding. For the benchmark experiments on MiniImageNet and TieredImageNet, we follow Lee et al. (2019) and use a 12-layer ResNet with DropBlock (Ghiasi et al. 2018) to prevent over-fitting.
We use the pre-trained backbone to initialize the embedding part \(\phi \) of the model for Castle/aCastle and for our re-implemented comparison methods, such as MC+kNN, ProtoNet+ProtoNet, MC+ProtoNet, L2ML (Wang et al. 2017), and DFSL (Gidaris and Komodakis 2018). When the backbone is initialized this way, we set the initial learning rate to 1e-4 and optimize the model with momentum SGD; the learning rate is halved after 2,000 mini-batches. During meta-learning, all methods are optimized over 5-way few-shot tasks, where the number of shots in a task is consistent with the inference (meta-test) stage. For example, if the goal is a 1-shot 5-way model, we sample a 1-shot 5-way \({{\mathcal {D}}}^{{\mathcal {S}}}_\mathbf {train}\) during meta-training, together with 15 instances per class in \({{\mathcal {D}}}^{{\mathcal {S}}}_\mathbf {test}\). A minimal episode sampler is sketched below.
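A minimal sketch of this episodic sampling, assuming the images or pre-computed embeddings of each seen class are stored as tensors in a dictionary; the function name and data layout are our own illustration.

```python
import random
import torch

def sample_episode(data_by_class, n_way=5, k_shot=1, n_query=15):
    """Sample an N-way K-shot task with `n_query` test instances per class."""
    classes = random.sample(list(data_by_class), n_way)
    support, query, s_labels, q_labels = [], [], [], []
    for episode_label, c in enumerate(classes):
        items = data_by_class[c]                               # tensor of class-c samples
        idx = torch.randperm(items.size(0))[:k_shot + n_query]
        support.append(items[idx[:k_shot]])                    # few-shot training split
        query.append(items[idx[k_shot:]])                      # test split of the episode
        s_labels += [episode_label] * k_shot
        q_labels += [episode_label] * n_query
    return (torch.cat(support), torch.tensor(s_labels),
            torch.cat(query), torch.tensor(q_labels))
```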
For Castle/aCastle, we take advantage of the multi-classifier training technique to improve learning efficiency. We randomly sample a 24-way task from \({{\mathcal {S}}}\) in each mini-batch and re-sample 64 5-way tasks from it. Note that all instances of the 24-way task are encoded by the ResNet backbone with the same parameters in advance. Embedding the synthesized 5-way few-shot classifiers into the global many-shot classifier then yields 64 different configurations of the generalized few-shot classifier. To evaluate these classifiers, we randomly sample a batch of 128 instances from \({{\mathcal {S}}}\) and compute the GFSL objective in Eq. 5. The sketch below illustrates this training scheme.
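A sketch of the multi-classifier training scheme under simplifying assumptions: `synthesize` stands in for the neural-dictionary-based classifier synthesis (returning one classifier vector per class in `picked`, in that order), `W_seen` for the global many-shot classifier indexed by global seen-class ids, and `(x_eval, y_eval)` for the batch of 128 seen-class instances. These names are ours, not from the released code.

```python
import random
import torch
import torch.nn.functional as F

def multi_classifier_loss(embeddings, labels, W_seen, x_eval, y_eval, synthesize,
                          n_subtasks=64, n_way=5):
    """One mini-batch of multi-classifier GFSL training (illustrative sketch)."""
    classes = labels.unique().tolist()               # the 24 classes of this mini-batch
    loss = 0.0
    for _ in range(n_subtasks):
        picked = random.sample(classes, n_way)       # a 5-way "fake unseen" sub-task
        # Instances were encoded once; synthesis reuses the cached embeddings.
        W_few = synthesize(embeddings, labels, picked)           # (n_way, d)
        W_joint = W_seen.clone()
        W_joint[picked] = W_few    # embed the synthesized heads into the global classifier
        logits = x_eval @ W_joint.t()                # joint scores over all seen classes
        loss = loss + F.cross_entropy(logits, y_eval)
    return loss / n_subtasks
```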
A.3 Baselines for GFSL Benchmarks
Here we describe in detail the baseline approaches compared in the GFSL benchmarks.
(1) Multiclass Classifier (MC) + kNN A \(|{{\mathcal {S}}}|\)-way classifier is trained on the seen classes in a supervised manner, as in standard many-shot classification (He et al. 2016). During inference, test examples of the \({{\mathcal {S}}}\) categories are evaluated with the \(|{{\mathcal {S}}}|\)-way classifier, while the \({{\mathcal {U}}}\) categories are evaluated with a nearest-neighbor classifier over the support embeddings from \({{\mathcal {D}}}_{\mathbf {train}}^{\;{{\mathcal {U}}}}\). To evaluate the generalized few-shot classification task, we take the union of the multi-class classifier's confidence and the nearest-neighbor confidence [the normalized negative distance values as in Snell et al. (2017)] as the joint classification scores on \({{\mathcal {S}}} \cup {{\mathcal {U}}}\).
(2) ProtoNet + ProtoNet We train a few-shot classifier (initialized by the MC classifier's feature mapping) using the Prototypical Network (Snell et al. 2017) (a.k.a. ProtoNet) over tasks sampled from the seen classes, treating them as if they were few-shot. When evaluating on the seen categories, we randomly sample 100 training instances per category to compute the class prototypes. The class prototypes of the unseen classes are computed from the sampled few-shot training set. During generalized few-shot inference, the confidence of a test instance is jointly determined by its (negative) distance to both seen and unseen class prototypes.
(3) MC + ProtoNet We combine the learning objectives of the previous two baselines ((1) and (2)) to jointly learn the MC classifier and the feature embedding. Since there are two objectives, one for many-shot classification (cross-entropy over all seen classes) and one for few-shot classification (the ProtoNet meta-learning objective), this model trades off between many-shot and few-shot learning. The learned model is used as a multi-class linear classifier on the head categories and as ProtoNet on the tail categories. During inference, the model predicts instances from the seen classes \({{\mathcal {S}}}\) with the MC classifier, while taking advantage of the few-shot prototypes to discern unseen class instances. To evaluate the generalized few-shot classification task, we take the union of the multi-class classifier's confidence and the ProtoNet confidence as the joint classification scores on \({{\mathcal {S}}} \cup {{\mathcal {U}}}\). The sketch below illustrates how such joint scores can be formed for baselines (1)–(3).
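A sketch of the joint scoring used by baselines (1)–(3), written here for the prototype-based variants; for MC+kNN the unseen-class scores come from the nearest support embedding rather than the class prototype, and the softmax normalization is an assumption of this illustration. Variable and function names are ours.

```python
import torch
import torch.nn.functional as F

def joint_scores(logits_seen, query_emb, support_emb, support_labels, n_unseen):
    """Union of seen-class classifier confidence and unseen-class distance confidence.

    `logits_seen`: (n, |S|) scores from the MC classifier (or seen-class prototypes);
    `support_emb`, `support_labels`: the few-shot support set of the unseen classes."""
    prototypes = torch.stack([support_emb[support_labels == c].mean(0)
                              for c in range(n_unseen)])          # (|U|, d)
    dists = torch.cdist(query_emb, prototypes)                    # (n, |U|)
    p_seen = F.softmax(logits_seen, dim=1)
    p_unseen = F.softmax(-dists, dim=1)    # normalized negative distances (ProtoNet-style)
    return torch.cat([p_seen, p_unseen], dim=1)                   # scores over S ∪ U
```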
(4) L2ML Wang et al. (2017) propose learning to model the "tail" (L2ML) by connecting a few-shot classifier with its corresponding many-shot classifier. The method learns the transformation from classifiers trained with few samples ("tail"-style) to classifiers trained with many samples (head-style). Since L2ML is originally designed to learn with both seen and unseen classes in a transductive manner, we adapt it to our setting: we learn a classifier mapping based on few-shot tasks sampled from the seen class set \({{\mathcal {S}}}\), and this mapping transforms few-shot classifiers of the unseen class set \({{\mathcal {U}}}\) inductively. Following Wang et al. (2017), we first train a many-shot classifier W upon the ResNet backbone on the seen class set \({{\mathcal {S}}}\). We use the same residual architecture as in Wang et al. (2017) to implement the classifier mapping f, which transforms a few-shot classifier into a many-shot classifier. During the meta-learning stage, an \(|{{\mathcal {S}}}|\)-way few-shot task is sampled in each mini-batch, producing an \(|{{\mathcal {S}}}|\)-way linear few-shot classifier \({\hat{W}}\) on top of the fixed pre-trained embedding. The objective of L2ML not only regresses the mapped few-shot classifier \(f({\hat{W}})\) toward the many-shot one W with a square loss, but also minimizes the classification loss of \(f({\hat{W}})\) over randomly sampled instances from \({{\mathcal {S}}}\). L2ML therefore uses the pre-trained multi-class classifier W for the head categories and the classifiers predicted by f for the tail categories. A sketch of this objective follows.
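A sketch of the L2ML objective as described above; `f` is the residual classifier-mapping network, and the balance weight `lam` is an assumed hyper-parameter (the original formulation may weight the two terms differently).

```python
import torch
import torch.nn.functional as F

def l2ml_loss(W_few, W_many, f, x, y, lam=1.0):
    """Regress the mapped few-shot classifier toward the many-shot one while
    keeping it discriminative on instances sampled from the seen classes."""
    W_mapped = f(W_few)                              # predicted "many-shot" classifier
    regression = F.mse_loss(W_mapped, W_many)        # square loss toward the target W
    classification = F.cross_entropy(x @ W_mapped.t(), y)
    return classification + lam * regression
```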
Appendix B: More Analysis on GFSL Benchmarks
In this appendix, we analyze the influence of reusing the many-shot classifier when training a GFSL model and study different implementation choices of the proposed methods. We mainly report results for Castle on MiniImageNet; the results for aCastle and on other datasets reveal similar trends.
B.1 Reusing the Many-Shot Classifier Facilitates the Calibration for GFSL
We compare two strategies for training Castle: from scratch versus fine-tuning from the many-shot classifier. Table 7 reports the 1-shot 5-way few-shot classification performance and the GFSL performance with 5 unseen classes for Castle trained from a random initialization or from the provided initialization. Training from scratch yields only slightly lower few-shot classification accuracy than the fine-tuning strategy, but a much lower GFSL harmonic mean accuracy. Reusing the parameters of the many-shot classifier therefore benefits a GFSL model's predictions on both seen and unseen classes, so we use the pre-trained embedding to initialize the backbone.
B.2 Comparison with One-Phase Incremental Learning Methods
Inductive generalized few-shot learning is also related to one-phase incremental learning (Li and Hoiem 2018; Liu et al. 2020), where a model is required to adapt itself to an open-set environment. In other words, after training over the closed-set categories, the classifier should be updated based on data from novel distributions or categories. One important thread of incremental learning methods relies on experience replay, where a set of closed-set instances is preserved and the classifier for all classes is optimized over the saved data and the novel few-shot data. In our inductive GFSL, the Castle variants do not store seen class instances and instead rely on the neural dictionary to adapt the classifier for joint classification; they therefore have lower computational (time) costs during inference.
For a comprehensive comparison, we also investigate two popular incremental learning methods, LwF (Li and Hoiem 2018) and iCaRL (Rebuffi et al. 2017). We randomly save 5 images per seen class for both methods. Combining the stored images with the newly given unseen class images, the model is updated with a cross-entropy loss and a distillation loss (Hinton et al. 2015), as sketched below. We tune the weight balancing the classification and distillation losses, the initial learning rate for fine-tuning, and the number of optimization steps for both methods on the validation set. The harmonic mean accuracies in various evaluation scenarios over 10,000 tasks are listed in Table 8.
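A sketch of the replay-based fine-tuning objective used for these incremental learning baselines; the balance weight `lam` and temperature `T` are placeholders standing in for the values tuned on the validation set.

```python
import torch
import torch.nn.functional as F

def replay_distill_loss(logits_new, labels, logits_old, lam=1.0, T=2.0):
    """Cross-entropy on stored seen + new unseen images, plus a distillation term that
    keeps the updated model close to the frozen pre-update model (Hinton et al. 2015)."""
    ce = F.cross_entropy(logits_new, labels)
    n_seen = logits_old.size(1)                      # the old model only covers seen classes
    distill = F.kl_div(F.log_softmax(logits_new[:, :n_seen] / T, dim=1),
                       F.softmax(logits_old / T, dim=1),
                       reduction="batchmean") * (T * T)
    return ce + lam * distill
```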
In our empirical evaluations, we find that the incremental learning methods obtain better results than our baselines, since they fine-tune the model with the distillation loss. However, their results are not stable, as they involve many hyper-parameters. Compared with these approaches, our Castle variants still keep their superiority on all criteria.
B.3 Light-Weight Adaptation of Castle Variants
As shown in the previous subsection, directly fine-tuning the whole model is prone to over-fitting even with an additional distillation loss. Inspired by Sun et al. (2019) and Li et al. (2019), we consider a light-weight fine-tuning step on top of the classifiers synthesized by the Castle variants. In detail, we reformulate the model \(\mathbf {W}^\top \phi (\mathbf{x})\) as \(\mathbf {W}^\top ((\mathbf{1} + \mathrm{scale})\cdot \phi (\mathbf{x}) + \mathrm{bias})\), where \(\mathbf {W}\) is the classifier output by the neural dictionary, \(\mathrm{scale}\in {\mathbb {R}}^d\) and \(\mathrm{bias}\in {\mathbb {R}}^d\) are additional learnable vectors, and \(\mathbf{1}\) is a d-dimensional vector of all ones.
Given a few-shot task with unseen class instances, the model is updated as follows: 5 images per seen class are randomly selected; then, with the backbone \(\phi \) frozen, the classifier \(\mathbf {W}\), the scale, and the bias are optimized with a cross-entropy loss over both the stored seen class images and the unseen class images. We tune the initial learning rate and the number of optimization steps on the validation set. A sketch of this procedure is given below.
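A minimal sketch of this light-weight adaptation step; `steps` and `lr` are placeholders for the values tuned on the validation set, and the function name is ours.

```python
import torch
import torch.nn.functional as F

def lightweight_adapt(W, phi_x, y, steps=100, lr=1e-3):
    """Optimize the synthesized joint classifier W together with elementwise
    scale/bias on the frozen embeddings: W^T((1 + scale) * phi(x) + bias)."""
    d = phi_x.size(1)
    W = W.detach().clone().requires_grad_(True)
    scale = torch.zeros(d, requires_grad=True)       # starts as identity scaling
    bias = torch.zeros(d, requires_grad=True)        # starts as zero shift
    optimizer = torch.optim.SGD([W, scale, bias], lr=lr)
    for _ in range(steps):
        feats = (1.0 + scale) * phi_x + bias
        loss = F.cross_entropy(feats @ W.t(), y)     # stored seen + unseen support images
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return W.detach(), scale.detach(), bias.detach()
```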
The results of this model adaptation strategy are listed in Table 9. With further model adaptation, both Castle and aCastle can be improved.
B.4 Effects of the Neural Dictionary Size \(|{{\mathcal {B}}}|\)
Figure 11 shows the effect of the dictionary size (expressed as a ratio of the number of seen classes, 64) on standard few-shot learning, measured by mean accuracy over tasks with 5 unseen classes. A neural dictionary with a ratio of 2 or 3 works best among all dictionary sizes, so we fix the dictionary size to 128 across all experiments. Note that when \(|{{\mathcal {B}}}|=0\), our method degenerates to optimizing the unified objective in Eq. 5 without the neural dictionary (the Castle\(^-\) model in Sect. 5).
B.5 How Well Do Synthesized Classifiers Compare with Multi-class Classifiers?
To assess the quality of the synthesized classifiers, we compare them against ProtoNet and the multi-class classifier on the head (seen) concepts. To do so, we sample few-shot training instances from each seen category to synthesize classifiers (or to compute class prototypes for ProtoNet), and then use the synthesized classifiers/class prototypes alone to evaluate multi-class accuracy. The results are shown in Fig. 12. The learned synthesized classifiers outperform ProtoNet, and the model trained with the unified learning objective improves over the vanilla synthesized classifiers. Note that a gap remains against multi-class classifiers trained on the entire dataset. This suggests that the learned classifier synthesis is more effective than using instance embeddings alone.
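The evaluation protocol of Fig. 12 amounts to replacing the many-shot classifier by the synthesized classifiers (or prototypes) and measuring plain multi-class accuracy on seen-class test instances, roughly as follows (names are illustrative):

```python
import torch

def seen_class_accuracy(W_synth, phi_test, y_test):
    """Multi-class accuracy on seen classes using only classifiers synthesized
    (or prototypes computed) from the sampled few-shot instances."""
    preds = (phi_test @ W_synth.t()).argmax(dim=1)
    return (preds == y_test).float().mean().item()
```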
B.6 Different Choices of the Classifier Synthesis
As in Eq. 6, when there is more than one instance per class in a few-shot task (i.e., \(K > 1\)), Castle computes the averaged embeddings first and then uses the prototype of each class as the input to the neural dictionary to synthesize the corresponding classifier. Here we explore another way to deal with multiple instances per class: synthesizing a classifier from each instance first, and then averaging the synthesized classifiers of each class. This option is equivalent to an ensemble strategy that averages the predictions of the per-instance synthesized classifiers. We denote the pre-average strategy (the one used in Castle) as "Pre-AVG" and the post-average strategy as "Post-AVG"; both are sketched below. The 5-shot 5-way classification results on MiniImageNet for the two strategies are shown in Table 10. "Post-AVG" does not noticeably improve FSL or GFSL performance, and averaging the synthesized classifiers after synthesis costs more memory during meta-training, so we choose the "Pre-AVG" option when there is more than one shot per class. In our experiments, the same conclusion also applies to aCastle.
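The two strategies can be contrasted with the following sketch, where `synthesizer` stands in for the neural-dictionary synthesis that maps each input embedding (or prototype) to one classifier vector; the helper names are ours.

```python
import torch

def synthesize_pre_avg(emb, labels, synthesizer, n_way):
    # "Pre-AVG": average the K support embeddings of each class first,
    # then synthesize one classifier per class prototype.
    prototypes = torch.stack([emb[labels == c].mean(0) for c in range(n_way)])
    return synthesizer(prototypes)                                # (n_way, d)

def synthesize_post_avg(emb, labels, synthesizer, n_way):
    # "Post-AVG": synthesize a classifier from every individual instance,
    # then average the K per-instance classifiers of each class.
    per_instance = synthesizer(emb)                               # (n_way * K, d)
    return torch.stack([per_instance[labels == c].mean(0) for c in range(n_way)])
```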
B.7 How Does Multi-Classifier Learning Impact Training?
Both Castle and aCastle adopt a multi-classifier training strategy (as described in Sect. 3), i.e., considering multiple GFSL tasks with different combinations of classifiers in a single mini-batch. Table 11 shows the influence of the number of classifiers used during training on the GFSL performance (harmonic mean). With a larger number of classifiers during training, the performance of Castle asymptotically converges to its upper bound; aCastle shares a similar trend.
B.8 The Gap to the Performance "Upper Bound" (UB)
We focus on the (generalized) few-shot learning scenario where only a budgeted number of examples is available for the unseen class tasks. To show the potential room for improvement, we also investigate an upper-bound model in which all available images are used to build the unseen class classifiers during inference.
We implement the upper-bound model based on ProtoNet; the results are in Table 12. Specifically, in the FSL scenario, all unseen class images except those held out for evaluation are used to build more precise prototypes, and the mean accuracy over 10,000 tasks is recorded; in the GFSL scenario, the many-shot unseen class images are utilized as well, and the calibrated harmonic mean is used as the performance measure.
Since the upper bound takes advantage of all available training images for the few-shot categories, it performs better than the few-shot Castle and aCastle in all scenarios. The gap between the few-shot learning methods and the upper bound becomes larger as more unseen classes (ways) are involved.
Cite this article
Ye, HJ., Hu, H. & Zhan, DC. Learning Adaptive Classifiers Synthesis for Generalized Few-Shot Learning. Int J Comput Vis 129, 1930–1953 (2021). https://doi.org/10.1007/s11263-020-01381-4