1 Introduction

Although artificial intelligence (AI) is now pervasive and widely used, a troubling reality is that a handful of tech giants tightly control the technology (Verdegem 2022). These tech giants eliminate numerous friction points in our lives as consumers: consider the route optimization algorithms used by Google Maps, the personalized shopping suggestions on Amazon, and the natural language comprehension employed by Apple’s Siri. Yet although we engage with AI regularly, less than 15% of companies in the corporate sector have AI in operational use, and approximately an equivalent proportion are confident that they possess the technological framework needed to sustain AI ventures (cf. Fig. 1). The disparity between Big Tech and the rest widens further when the focus is narrowed to Machine Learning (ML). This is because, firstly, big data is necessary for training an ML model, which ultimately allows the model to produce accurate predictions. A corporation aiming to automate an internal process may have only 100 relevant samples, unlike Google, which has the advantage of some 130 trillion indexed pages to improve its algorithms. Secondly, engaging a team of ML experts to automate internal operations is out of reach for average or small businesses, because these experts form a tiny talent pool and can work almost anywhere, at almost any price. As a result, many companies use off-the-shelf tools from outside vendors rather than developing their own AI tools, which helps small and mid-sized businesses overcome the lack of resources and expertise. The near-monopoly of Big Tech companies on AI thus comes down primarily to one key factor: the shortage of data. Big Tech corporations have easy access to sufficient data, while small and mid-sized corporations must strive for it, creating AI inequality.

Fig. 1 Illustration of real-world ML pipeline

Considering that data is the fuel for ML models, it may be possible to end Big Tech’s AI monopoly and promote equality by building AI products with less data. The last several years have seen tremendous developments in deep learning (DL) (Lecun et al. 2015; Menghani 2023; Marcus et al. 2018), a subfield of AI and ML, which has encouraged enterprises across sectors to incorporate DL solutions into their AI strategy. DL has enabled many sophisticated new AI applications, ranging from chatbots in customer service to image and object identification in retail. The remarkable success of DL algorithms on complicated tasks has made them particularly desirable to many businesses in recent years. However, we live in a world where data is never endless. DL systems often need to generalise beyond the data they are trained on, such as when encountering a new word pronunciation or an unfamiliar image, and because data is finite, formal guarantees of high-quality performance are inherently constrained.

The significant contributions of this review paper are as follows:

  • The article outlines the major challenges associated with DL models.

  • A PRISMA-based search is conducted to identify relevant studies, covering 175+ research articles.

  • The study thoroughly reviews historical and contemporary DL techniques explicitly designed for small datasets.

  • The paper investigates the application of DL techniques to small datasets, deviating from their conventional use with large datasets. We concentrate on small, structured datasets and assess the performance of several DL algorithms designed for such datasets, with the goal of evaluating the efficacy and potential benefits of these specialised approaches.

  • A comparative study of different small dataset techniques is performed using various metrics.

  • The article discusses several unresolved research issues and provides recommendations for further research.

The remaining sections of the paper are organised as follows. Section 2 discusses the search methodology and the statistical distribution analysis of publications utilising DL models for small datasets. Section 3 presents the limitations of DL models. Section 4 describes the motivation for this article. Section 5 details the techniques for developing DL models for small datasets. Section 6 overviews some open research issues and provides recommendations for further research. Finally, Sect. 7 concludes the paper.

2 Search methodology and statistical distribution analysis of deep learning models utilising small datasets

2.1 PRISMA model design

The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline offers instructions for writing systematic reviews, incorporating advancements in techniques for identifying, choosing, evaluating, and synthesising studies (Page et al. 2021). A PRISMA model search study is conducted to identify relevant studies, considering 175+ research articles. There is a considerable amount of existing research focused on DL models that leverage big datasets [e.g., (Marcus et al. 2018; Chen and Lin 2014; Ahmed et al. 2023)]. However, much less research has gone into reviewing the usage of small datasets for training DL models [e.g., (Gheisari et al. 2017; Ahmed et al. 2023; Bansal et al. 2022)]. This study performs a systematic search of IEEE, PubMed, Google Scholar, ScienceDirect and arXiv. The keywords used for our literature search are (“Small datasets” OR “Short datasets” OR “Limited datasets” OR “Low datasets” OR “Few samples”) AND (“Machine Learning” OR “Deep Learning” OR “Computer Vision”). We also performed reference tracking, i.e. examining the papers cited in existing reviews: we carefully examined the papers referenced in the recent review articles (Zhang et al. 2019; ul Sabha et al. 2024) and thoroughly investigated the recent articles citing these reviews via their Google Scholar pages. It is important to note that our search was limited to articles written in English and published in peer-reviewed journals or conference proceedings.

In the literature review process, 300+ papers were initially identified as relevant to DL models for small datasets. Of these, 200 highly related articles were selected by excluding papers that: (i) do not use AI (including its branches, such as Computer Vision (CV), Machine Learning (ML), and Deep Learning (DL)); (ii) have insufficient data; and (iii) are not relevant. Among the selected articles, 42 papers had the keywords “small”, “limited”, “short”, “low”, or “few” in their titles. Figure 2 shows the PRISMA model, illustrating the references related to ML, DL and CV techniques relevant to this review paper.

Fig. 2 PRISMA analysis for selection of studies

2.2 Novel contributions compared to existing reviews

Survey articles that focus on applications of DL techniques to small datasets are very scarce. This gap in the literature is mainly due to a prevailing misconception that DL techniques cannot be applied to small dataset problems, a notion that has led many researchers to disregard small datasets when considering DL applications. Thus, most existing research prioritises large datasets, traditionally considered more suitable for DL models. However, several innovative techniques make DL models compatible with small datasets. Our study systematically categorises these small dataset techniques and highlights the diversity of methods available. Existing survey articles, such as Gheisari et al. (2017), Ahmed et al. (2023), Bansal et al. (2022), Zhang et al. (2019), and ul Sabha et al. (2024), address only a few small dataset techniques, as shown in Table 1. The comparison illustrates the broader scope and greater depth of our review. By thoroughly examining the methods and categorising them comprehensively, our study fills a critical gap in the literature. We aim to challenge the prevailing misconceptions and encourage more researchers to explore DL applications with small datasets. Our survey, therefore, stands out by offering a more extensive and detailed overview of the strategies that make DL viable in data-constrained environments.

Table 1 Comparison with the existing review articles on deep learning with small datasets

2.3 Overview of the reviewed studies

In this study, 175+ high-quality papers were reviewed. Figure 3 shows a word-cloud visualisation of the titles of these articles, which offers a concise summary of the major areas covered in the reviewed literature. The most frequently used words are “data”, “deep”, “learning”, “augmentation”, “networks”, “classification”, “small”, and “semi-supervised”.

Fig. 3 Word-cloud visualisation of the titles of all 175+ articles

2.4 Statistical distributions

As DL and ML continue to advance globally, it is crucial to know the publications and their countries of origin. Our analysis sheds light on this subject based on the number of publications per year over the last 10 years, as depicted in Fig. 4a. While there was an upward trend in publications from 2015 to 2019, a decline occurred between 2020 and 2021; the maximum number of articles was recorded in 2022. Figure 4b shows the yearly number of publications that have “small”, “shot”, “few”, “low”, or “limited” data keywords in the article title. The pie chart in Fig. 6a illustrates the distribution of articles by publisher, indicating that IEEE contributed the highest percentage of papers (43%), followed by arXiv (29%), Springer (9%), and Elsevier (8%). The Association for Computing Machinery (ACM) and the Multidisciplinary Digital Publishing Institute (MDPI) contributed 6% and 3% of the publications, respectively. In addition, 2% of the publications came from Nature, and the remaining 13% came from other sources.

Fig. 4 a Number of publications selected per year over the last 10 years. b Number of publications per year with keywords such as “small”, “limited”, “shot”, “few”, or “low” AND “data”, “dataset”, or “sample” in the paper titles

The pie chart in Fig. 5a shows the percentage of publications belonging to each country. The USA accounts for 38% of the total publications included in this study, followed by China (24%), the UK and Australia (7% each), and Germany (4%). Figure 5b shows the country-wise percentage of publications that have “small”, “shot”, “few”, “low”, or “limited” data keywords in the article title. These data reveal that the USA and China are the primary contributors, with 39% and 37% of such publications, respectively, suggesting that both countries are making significant strides in DL that leverages small datasets. Germany contributed 10%, the UK and Japan contributed 5% each, followed by Singapore and India with 2% each.

Fig. 5 a Country-wise percentage of publications selected in this review article. b Country-wise percentage of publications with titles containing keywords such as “small”, “shot”, “few”, “limited”, and “low”

Researchers have investigated several techniques for building DL models for small datasets (DLS), such as data augmentation (DA), transfer learning (TL), generative adversarial networks (GANs), few-shot learning (FSL), and loss function-, model architecture-, and regularisation-based methods. Moreover, these methods can be integrated to improve DL performance on limited datasets. Figure 6b shows the contribution percentage of each of these methods. Traditional DA-based techniques contribute 11% of the DLS publications, followed by TL methods (26%), GAN-based techniques (15%), and loss function-based methods (2%). Some papers combined these techniques, contributing 6% of the publications. Articles that used small dataset techniques other than those mentioned contributed 23% of the total.

Fig. 6 a Proportion of publishers involved in the article review. b Proportion of articles employing small dataset techniques

3 Limitations of deep learning

Despite significant achievements in DL models, two aspects of human conceptual understanding have escaped computer systems. First, although humans can acquire new concepts from a few samples with high generalisation ability, traditional DL models require thousands of instances to perform with comparable accuracy. Second, even for basic notions, humans learn richer representations than machines and use them for broader purposes (Lake et al. 2022). Some of the limitations of DL models are discussed below.

3.1 Deep learning has been data-hungry thus far

DL models often demand significant data to attain optimal results, primarily because of the large number of parameters to be optimised throughout the training phase. The availability of big datasets improves their generalisation capacity, enabling them to predict and respond to new inputs accurately. The test data must come from the same distribution as the training data for the model to interpolate new responses between existing ones (Kim et al. 2023; Power et al. 2022). In Krizhevsky et al. (2017), a convolutional neural network (CNN) with eight layers, 60 million parameters, and 650,000 neurons was trained on nearly a million different samples spanning a thousand classes. On the ImageNet dataset, this sort of brute-force method works well. The quantity of data required for high-quality DL depends on the problem’s complexity and the network size. For example, GPT-3 has 175 billion parameters and is one of the largest networks ever trained (Chen and Lin 2014). When a model has many parameters that must be optimised exclusively by feeding it training data, we must ensure that it has many training samples. A model that needs extensive data for training clashes with the nature of human intelligence (Świechowski 2022), and no small or medium-sized business can amass this much training data.

AI draws inspiration from human intelligence, and most approaches for determining whether we have achieved it involve analogies to humans. Humans learn to be efficient at an almost infinite variety of tasks. Children, for instance, do not need to see many different cats to recognise one. While the learning process in children is not entirely understood, it is most plausible that the human brain constructs an abstract internal representation of a concept very quickly (Spicer and Sanborn 2019). Furthermore, training sizeable DL networks on extensive data incurs high computing costs and, consequently, extended training times (OpenAI et al. 2019). It would be highly inefficient to train a DL model for an extended period for every task AI confronts. We require quicker methods of training or creating AI models (OpenAI et al. 2019). DL, thus far, has not been regarded as an optimal solution for small dataset problems.

3.2 Deep learning models lack interpretability

The largest compliance barrier for AI is a lack of interpretability. DL models are black box models (Quinn et al. 2022; Rai 2020), whereby the “black box” problem refers to the difficulty of interpreting and expressing the reasons behind a model’s predictions. How can we determine whether a model is adequately trained and tested if we cannot understand its output? The greater the interpretability of an ML model, the simpler it is to understand why particular judgements or predictions were made (Quinn et al. 2022). If humans understand one model’s outcomes more easily than another’s, the former is said to be more interpretable or explainable (Huang et al. 2020; Gu et al. 2021; Miller 2017). Moreover, fairness and unbiasedness have lately emerged as crucial auxiliary criteria for model improvement. ML interpretability is a critical tool for testing key properties of ML systems.

For ML models, interpretability is critical. In DL-based predictive analysis, there is a trade-off between what prediction the model makes (for example, the chance that a patient has a brain tumour) and why the prediction is made. In certain circumstances, we do not care why a judgement was made; what truly matters is the prediction performance on test data (Heider and Simmel 1944). However, understanding the “why” provides valuable insights into the problem, the data, and potential model failures. Some approaches may not require explanations because they are used in a low-risk context (e.g., a movie recommender system) or have already been widely investigated and evaluated (e.g., optical character recognition). The requirement for interpretability stems from an incompleteness in problem formalisation (Heider and Simmel 1944): for particular issues or tasks, simply obtaining the prediction (the “what”) is insufficient. Because a correct forecast only partially answers the original problem, the model must also explain how it arrived at the prediction (the “why”). The more significant the impact of an ML model’s choice on a person’s life, the more vital it is for the machine to explain its behaviour. For example, if a DL model rejects a loan application, the applicant may be utterly surprised (Rayhan and Hashem 2023), and can only reconcile this discrepancy between anticipation and reality through some explanation. Incorporating DL-based models into our daily lives is critical for increasing societal acceptability; people attribute beliefs, desires, and intents to these models (Heider and Simmel 1944). Regarding compliance, the black box dilemma impedes AI’s march towards global integration. AI can never fulfil its full potential as long as it remains unexplained.

The majority of applications where interpretability is required involve small dataset problems, such as human illness diagnosis, disaster analysis, defence-related applications, etc. For these applications to be accepted by society, they must be able to explain the predicted behaviours.

3.3 Weakly supervised learning is a problem

Current supervised approaches have achieved significant success, but obtaining sufficient supervision information, such as complete ground-truth labels, is challenging. We require a pre-collected dataset with ground truth, such as the ImageNet (Krizhevsky et al. 2017) or PASCAL VOC (Everingham et al. 2009) datasets, to train a model with many parameters and hundreds of layers (ul Sabha et al. 2024). However, due to the high cost of data labelling (such as labelling small-scale events in surveillance footage, including crowded large-scale situations) or a lack of expertise (such as annotating MRI scans as tumour versus non-tumour or for any other disease), obtaining such high-quality annotations for many samples may be impractical. Additionally, many datasets are gathered via crowdsourcing or search engines to cut the cost of human labour, but they often contain many mediocre (i.e., coarse or even inaccurate) annotations. This causes the well-known weakly supervised learning problem (Zhou 2018). One solution would be a DL model that learns effectively from a few samples; the alternative is to create a model that can function with weak supervision, and some studies are already being done in this area (Settles 2009; Chen and Wang 2011).

3.4 Long-tail phenomena in big datasets degrade the deep learning model’s performance

Big datasets frequently suffer from the long-tail phenomenon, in which a small number of classes have abundant data while many more have rare data. Due to this data imbalance, a DL model can perform exceptionally well on classes with more data but poorly on classes with fewer data (ul Sabha et al. 2024). The number of samples in various classes varies greatly across many datasets; in credit card fraud detection datasets, for example, the difference is extreme. Considering that roughly one fraudulent transaction occurs for every 10,000 valid ones, a model that labels every transaction as legitimate would still be about 99.99% accurate while detecting no fraud at all.
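
A minimal sketch (with synthetic data, not from any cited study) of why accuracy is misleading on long-tailed data: a classifier that labels every transaction as legitimate scores near-perfect accuracy while catching no fraud.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y_true = (rng.random(n) < 1 / 10_001).astype(int)  # ~1 fraud per 10,000 legit
y_pred = np.zeros(n, dtype=int)                    # always predict "legitimate"

accuracy = (y_true == y_pred).mean()
fraud_recall = y_pred[y_true == 1].mean() if (y_true == 1).any() else 0.0
print(f"accuracy = {accuracy:.4%}, fraud recall = {fraud_recall:.0%}")
# accuracy is ~99.99% even though not a single fraud case is detected
```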

Rebalancing the training data, by increasing the frequency of samples from rare classes (oversampling) or reducing the number of samples from the most frequent classes (undersampling), is a straightforward improvement (Shen et al. 2016). However, this approach is typically heuristic and of limited effectiveness: undersampling tends to lose crucial feature information from the classes with more samples, whilst oversampling tends to introduce sample redundancy and risks over-fitting to the rare classes (Wang et al. 2017). Small data learning approaches are therefore expected to help resolve the long-tail training problem by exploiting more advantageous prior information from the small-sample classes.
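
A sketch of the two naive rebalancing strategies described above, assuming a NumPy feature matrix `X` and integer labels `y` (a hypothetical data layout):

```python
import numpy as np

def rebalance(X, y, strategy="oversample", seed=0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    # oversampling repeats rare-class samples (redundancy, over-fitting risk);
    # undersampling discards frequent-class samples (loss of information)
    target = counts.max() if strategy == "oversample" else counts.min()
    idx = []
    for c in classes:
        members = np.flatnonzero(y == c)
        idx.append(rng.choice(members, size=target,
                              replace=(strategy == "oversample")))
    idx = np.concatenate(idx)
    return X[idx], y[idx]
```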

3.5 Hype vs reality

Humans have unreasonably high short-term expectations of DL and artificial intelligence. Although DL research is advancing at a breakneck pace, relatively little of this knowledge has made its way into the products and processes that make up the world; most research findings are yet to be applied. Hype can be problematic for emerging technologies because it increases the likelihood of failure to deliver on projected promises or of claims exaggerated beyond reality. In 2011, the producers of the popular game show “Jeopardy!” organised a unique competition pitting IBM’s AI supercomputer “Watson” against two of the show’s most accomplished champions, Ken Jennings and Brad Rutter. Watson was victorious. Cancer was widely anticipated as Watson’s next battle, and IBM partnered with numerous cancer centres from 2012 onwards to apply Watson’s abilities to cancer therapy. Watson’s entry into cancer care and the interpretation of cancer genomes was well-publicised, with largely positive news coverage: “IBM to team up with UNC, Duke hospitals to fight cancer with big data”; “The future of health care could be elementary with Watson”. But three years after IBM introduced Watson to doctors worldwide to propose optimal cancer treatments, it was revealed that Watson was not living up to the enormous expectations set by IBM. The supercomputer was still grappling with the fundamental task of understanding different types of cancer, and only a few hospitals had adopted the system, falling significantly short of IBM’s goal of dominating a market worth billions of dollars.

Yet, AI is still in its early stages and will take time to reach its full potential. When it does, it will have a long-term societal and economic impact that most people tend to underestimate. AI will change medicine, transportation, science, communication, and culture, becoming our portal to the outside world.

4 Motivation

The acquisition, processing, and privacy costs associated with data must be balanced against the benefits it offers. Technologies or societies that generate big data also produce vast numbers of small datasets, and there are cases where small data is preferred over big data: high-quality small data can yield better inference than low-quality big data (Faraway and Augustin 2018). In 1936, for example, the prominent Literary Digest magazine polled its readers to predict the outcome of the US presidential election. An overwhelming 2.4 million people participated in the poll, with 57% supporting Alf Landon and 43% favouring Franklin Roosevelt. During this election, George Gallup’s polling organisation (Gallup, Inc.) was just getting started. Using a sample of only a few thousand, Gallup anticipated a victory for Roosevelt with 56%. Roosevelt defeated Landon by a landslide margin of 62% to 38%. The small dataset of thousands outperformed the big dataset of millions.

Bias and variance may affect any estimation. In a time of severe economic distress, readers of the Literary Digest had the discretionary income to spend on a magazine and were thus generally more affluent than the general population. The large sample size did not mitigate this bias. Gallup’s small sample was exposed to greater variance, but this was a considerably less serious problem than the Digest’s bias. Statistical inference works effectively with small high-quality data but not with low-quality extensive data (Martin Lindstrom Company 2016). Developing DL models that are less data-intensive has several advantages. Accordingly, this research was motivated by the following aims.

Reduce capability differences between big and small tech companies: The heavy dependence of AI applications on big datasets has raised concerns about widening differences in organisations’ capacity to collect, store, and process relevant data. This creates the possibility of a growing gap between the AI “haves”, such as tech giants that can afford these resources, and the “have-nots” that cannot. Approaches that apply AI using small datasets can break down this barrier for smaller organisations.

Minimise the incentive for accumulating personal information: The spread of AI raises serious privacy concerns (Dilmaghani et al. 2019). There are worries that major tech corporations keep gathering increasing amounts of identity-linked consumer information to train their AI models. By decreasing the requirement to gather big real-world datasets for training ML models, certain small data approaches can alleviate such worries to some extent.

Advance in areas where fewer data points are available: Problems with few available data can still be solved in an AI system, for example, by developing an ML model to detect a rare skin disease like urticaria, for which a large amount of data cannot be collected. Small data techniques offer a rational strategy for coping with data scarcity.

Address challenges with dirty data: Small data techniques can help businesses burdened with large amounts of unclean, unstructured data that are unfit for analysis. For example, the US Department of Defense holds a considerable volume of “dirty data” due to legacy systems, necessitating inefficient, labour-intensive data cleaning, labelling, and organising operations (Chahal et al. 2021). Small data techniques can reduce the quantity of data that needs to be pre-processed, saving labour and time.

In the real world of limited budgets, there are trade-offs between quality and quantity. Small data sometimes outperform big data, enabling faster, more reliable, and more cost-efficient conclusions. Small data are derived from experiments or intentionally collected data on a human scale, with an emphasis on causality and comprehension rather than prediction (Faraway and Augustin 2018).

5 Techniques for solving small dataset problems in deep learning

In this section, data augmentation, generative adversarial networks, transfer learning, few-shot learning, and loss function-based techniques are extensively reviewed for solving the problem of small datasets, along with their advantages and drawbacks.

5.1 Data augmentation

The main problem with small data learning is overfitting (Power et al. 2022). With a small dataset, the model is not exposed to every aspect of the data distribution, which ultimately harms generalisation (Kim et al. 2023; Yousefzadeh 2022; Lemberger 2017; Jiang et al. 2019; Nagarajan 2021). Data augmentation (DA) is one approach to addressing overfitting in DL models. DA generates additional training data from the current training samples by augmenting them with a range of random alterations that yield realistic-looking images. Ideally, the model never encounters exactly the same image twice during training; as a result, it is exposed to more aspects of the data and generalises better.
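
A minimal sketch (PyTorch/torchvision assumed) of on-the-fly augmentation: because the random transforms are re-sampled at every epoch, the network effectively never sees exactly the same image twice.

```python
import torchvision.transforms as T
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader

train_tf = T.Compose([
    T.RandomHorizontalFlip(),        # label-safe for natural images
    T.RandomCrop(32, padding=4),     # random translation via padded crop
    T.ColorJitter(brightness=0.2),   # mild photometric variation
    T.ToTensor(),
])
train_set = CIFAR10(root="data", train=True, download=True, transform=train_tf)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
# each pass over train_loader draws fresh random augmentations
```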

The more data an ML system can access, the more effective it can be (Wang and Perez 2017). Even if the data are of poorer quality, the model can still extract relevant information from the original dataset (Wang and Perez 2017). For example, text-to-speech and text-based models improved dramatically after Google released a trillion-word corpus (Halevy et al. 2009), despite the data being gathered from unfiltered websites and containing many inaccuracies.

DA addresses overfitting at the root of the problem, namely the training dataset, on the premise that augmentation extracts additional information from the original data (Shorten and Khoshgoftaar 2019). Augmentations expand the training dataset through data warping or synthetic oversampling. Data warping produces extra training samples by applying transformations in data space, altering existing images while retaining their labels; it includes geometric and colour transformations, adversarial training, random erasing, and neural style transfer. Synthetic oversampling, in contrast, generates new samples in feature space (Wong et al. 2016) and includes image mixing, GANs, and feature-space augmentation (Goodfellow et al. 2014).

One of the earliest uses of DA, in the form of data warping, can be found in LeNet-5 (LeCun et al. 1998) for the classification of handwritten digits. DA is also employed in the AlexNet CNN architecture (Krizhevsky et al. 2017), which revolutionised image classification using convolutional networks on the ImageNet dataset. There, augmentation increases the dataset’s size by a factor of 2048 and helps reduce overfitting when training the deep neural network; according to the study, augmentation lowered the error rate by more than 1%. The study Shijie et al. (2017) tested the performance of DA on different classification problems with the pre-trained AlexNet model. The training data were separated into three scales, small, medium and large, with 200, 500 and 1000 samples per class, respectively. A subset of ten classes was taken from the ImageNet dataset, and the dataset size was doubled and tripled using different augmentation techniques. The percentage gain in accuracy was substantially higher for the smaller dataset. According to the paper, triple combinations can reduce performance, possibly because the images become overly augmented. The DA techniques based on fundamental image manipulations are elucidated below.

5.1.1 Geometric transformations

These transformations alter an image’s shape or geometry by mapping individual pixel values to new locations. Several studies, such as Rodrigues et al. (2019) and Majurski et al. (2019), utilised geometric transformations as a DA technique. Effective geometric transformations include flipping, rotation, and cropping. For example, Krizhevsky et al. (2017) used flipping to supplement the ImageNet dataset, and owing to the popularity of this study, it became one of the most used augmentation schemes. These transformations are computationally efficient and straightforward because, for flipping, only the rows or columns of the image matrix need to be reversed. When applying these transformations, it is essential to consider whether the label of the data sample is preserved after the transformation (Shorten and Khoshgoftaar 2019). For instance, rotations and flips are often safe on ImageNet-style problems, such as cat vs dog datasets, but not on digit recognition datasets, such as MNIST or SVHN, because of digits like “6” versus “9”. Non-label-preserving transformations could potentially strengthen a model’s ability to express uncertainty about its predictions, but implementing this requires post-augmentation label refinement (Bagherinezhad et al. 2018), which is computationally expensive. A further augmentation strategy involves modifications to the colour channels (Shorten and Khoshgoftaar 2019). Cropping creates new image patches, or selects central patches, from image data with varying heights and widths. Other commonly employed DA techniques include rotation, translation, and noise injection (Moreno-Barea et al. 2019).
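
As noted above, basic geometric augmentations reduce to simple index manipulations on the image matrix; a NumPy sketch (array layout H × W × C assumed):

```python
import numpy as np

def hflip(img):
    # reverse column order: safe for cats vs dogs, unsafe for digits (6 vs 9)
    return img[:, ::-1]

def vflip(img):
    # reverse row order
    return img[::-1]

def random_crop(img, size, rng=np.random.default_rng()):
    # cut a random size x size patch (a simple translation-style augmentation)
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]
```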

Geometric transformations are excellent remedies for positional biases in training data. Many possible causes of bias arise when the distribution of training data differs from that of the test data. These transformations are valuable when positional biases are present, such as in a facial recognition dataset where every face is perfectly centred in the image (Zhang et al. 2020). They are effective not just for their remarkable capacity to eliminate positional biases but also for their ease of implementation. However, they have several drawbacks, including the additional memory required to store augmented data, the computational cost of the transformations, and increased training time. Finally, in applications such as medical image analysis, the biases between training and test data often involve factors more complicated than positional and translational variance; consequently, the applicability of geometric transformations there is limited.

5.1.2 Photometric methods

These transformations modify the RGB channels based on predetermined heuristics by shifting each pixel value (r, g, b) to a new pixel value (r′, g′, b′). This manipulation alters the lighting and colour of the image while preserving its geometry, so the effect of photometric modifications is relatively simple to grasp (Shorten and Khoshgoftaar 2019). A simple remedy for images that are too bright or too dark is to iterate over them and shift the intensity values by a predefined amount. Other modification methods include splicing off individual RGB colour matrices and clipping intensities to particular maximum and minimum values. The intrinsic colour representation of digital images thus offers a range of augmentation techniques.
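
A sketch of a simple photometric augmentation in this spirit: each channel of a pixel value (r, g, b) is shifted by a random offset and clipped to the valid range, changing lighting and colour while leaving geometry untouched (the offset range is an assumed hyperparameter):

```python
import numpy as np

def photometric_shift(img, rng=np.random.default_rng()):
    # img: H x W x 3 uint8 array
    offsets = rng.integers(-30, 31, size=3)   # one random offset per channel
    shifted = img.astype(np.int16) + offsets  # widen dtype to avoid wrap-around
    return np.clip(shifted, 0, 255).astype(np.uint8)
```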

5.1.3 Colour space transformation

Transforming the RGB matrices into a single grayscale image simplifies the representation of image datasets, reducing an image from \(height \times width \times 3\) to \(height \times width \times 1\) with less complexity. Nevertheless, this has been shown to reduce accuracy. The study Chatfield et al. (2014) found a 3% decline in classification performance between grayscale and RGB images on ImageNet (Deng et al. 2010) and the PASCAL VOC dataset (Everingham et al. 2009). Like geometric transformations, colour space conversions have certain disadvantages, such as requiring more memory, high transformation costs, and longer training times. Additionally, colour transformations may omit important colour information and may thus affect label preservation. The study by Wah et al. (2011) showed that DA increased CNN classification performance; in terms of Top-1 and Top-5 scores, geometric augmentation schemes performed better than photometric schemes.

5.1.4 Kernel filter-based data augmentation

Kernel filtering techniques are used to sharpen and pre-process images. For example, a Gaussian blur filter produces blurred images, whereas a high-contrast vertical or horizontal edge filter produces sharper ones. Image sharpening for DA may capture additional details about objects of interest. The augmentation procedure termed PatchShuffle regularisation, proposed in Kang et al. (2017), employs a kernel filter that randomly swaps intensity values within a sliding window. Experiments were conducted using the ResNet CNN architecture with varying filter widths and pixel-shuffling probabilities. The method achieved a 5.66% error rate on CIFAR-10, compared to a 6.33% error rate without PatchShuffle regularisation. Kernel filters remain relatively under-explored for DA.
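
A sketch in the spirit of PatchShuffle regularisation: with some probability, the pixels inside each k × k window are randomly permuted (the window size and probability are assumed hyperparameters, not values from the paper; array layout H × W × C assumed):

```python
import numpy as np

def patch_shuffle(img, k=2, p=0.05, rng=np.random.default_rng()):
    out = img.copy()
    h, w = img.shape[:2]
    for top in range(0, h - k + 1, k):
        for left in range(0, w - k + 1, k):
            if rng.random() < p:
                # flatten the k x k window to (k*k, C), permute its pixels,
                # and write the shuffled patch back
                patch = out[top:top + k, left:left + k].reshape(k * k, -1)
                rng.shuffle(patch)
                out[top:top + k, left:left + k] = patch.reshape(k, k, -1)
    return out
```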

5.1.5 Mixing images to augment the dataset

Averaging the pixel intensities of two combined images is an unconventional DA approach. In the research of Inoue (2018), two or more images were randomly cropped from \(256\times 256\) to \(224\times 224\) and flipped horizontally. The samples were combined by averaging the pixel values of each RGB channel, and the resulting mixed image was used to train the classifier, with the new image’s label set to that of the first randomly picked image. Employing this SamplePairing DA approach on the CIFAR-10 dataset dropped the error rate from 8.22% to 6.93%. The researchers also tested a reduced CIFAR-10 with 100 samples per category, i.e. 1000 samples in total; on this smaller dataset, SamplePairing lowered the error rate from 43.1% to 31.0%.
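
A minimal sketch of SamplePairing-style mixing as described above: average the pixel values of two images channel-wise and keep the label of the first image:

```python
import numpy as np

def sample_pairing(img_a, label_a, img_b):
    # img_a, img_b: uint8 arrays of identical shape (e.g. 224 x 224 x 3)
    mixed = (img_a.astype(np.float32) + img_b.astype(np.float32)) / 2
    return mixed.astype(np.uint8), label_a  # label of the first image is kept
```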

5.1.6 Random erasing

Random erasing (Zhong et al.) is another intriguing DA approach, inspired by dropout regularisation. Random erasing is similar to dropout except that it operates in the input data space rather than being integrated into the network architecture. The approach was created primarily to address image recognition problems caused by occlusion, which occurs when some parts of an object are obscured (Shorten and Khoshgoftaar 2019). Random erasing randomly selects a rectangular region in an image and replaces its pixels with random values during training. This generates training samples with varying degrees of occlusion, which reduces the risk of overfitting and makes the model robust to occlusion. Random erasing needs no parameter learning, is easy to implement, and can be combined with most CNN-based recognition algorithms. Despite its simplicity, it complements commonly used DA techniques, such as flipping and random cropping, and offers consistent improvement over strong baselines in image classification, object detection, and person re-identification. Table 2 shows the summary of DA-based approaches for solving small data problems.

Table 2 Overview of DA techniques for addressing small data challenges
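
For the random erasing technique just described, torchvision ships a built-in transform; a minimal sketch (the parameter values are illustrative defaults, not taken from the paper):

```python
import torchvision.transforms as T

train_tf = T.Compose([
    T.ToTensor(),                       # RandomErasing operates on tensors
    T.RandomErasing(p=0.5,              # erase a region in half of the samples
                    scale=(0.02, 0.2),  # erased area as a fraction of the image
                    value="random"),    # fill the region with random values
])
```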

Traditional DA techniques significantly enhance the predictive performance of DL models and are used extensively across various applications. However, these methods often involve laborious manual effort: identifying the optimal DA strategy is highly dependent on the specific dataset and task, and a virtually infinite number of augmentation permutations would have to be tested to discover the most effective approach. Manually crafted augmentations are also typically limited in variety, and different types of augmentation yield better results for different DL tasks, making the selection of an appropriate augmentation method a complex and challenging problem. Furthermore, techniques that enhance generalisation on one dataset may not be effective on others; for example, research (DeVries and Taylor 2017) showed that while CutOut (DeVries and Taylor 2017) boosts performance on CIFAR-10, it does not have the same effect on the ImageNet dataset. Another study (Raileanu et al. 2021) suggests that traditional DA methods are not well suited to reinforcement learning tasks.

Extensive research is being conducted to automate the process of DA. Automated Machine Learning (AutoML) (Kim et al. 2022) aims to automate all aspects of designing, training, deploying, and monitoring ML solutions. AutoML frameworks can perform DA, feature engineering, and even construct the network architecture of DL models.

The concept of automated DA involves defining a set of basic transformation functions (e.g., rotations, flipping, colour jittering, solarisation, scaling) and then using AutoML techniques to algorithmically apply different combinations of these operations to the data, with the goal of selecting the most effective set of DA operations. Typically, black-box optimisation techniques are employed to determine the best augmentation strategies. The optimisation must identify not only the relevant transformations but also the optimal level of each transformation; for image augmentation, these levels might include rotation angles, translation offsets, and saturation values. The AutoML field is relatively new, and extensive research is still required to enhance its capability to select augmentation strategies (Mumuni and Mumuni 2024).
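
A toy sketch of this search loop: sample random policies (operation plus magnitude), train briefly, and keep the best-scoring policy. Here `train_and_validate` is a hypothetical user-supplied function that trains a model with the given transform and returns a validation score; the operations and magnitude ranges are illustrative assumptions.

```python
import random
import torchvision.transforms as T

OPS = {
    "rotate": lambda m: T.RandomRotation(degrees=30 * m),
    "colour": lambda m: T.ColorJitter(brightness=m, saturation=m),
    "crop":   lambda m: T.RandomResizedCrop(32, scale=(1 - 0.5 * m, 1.0)),
    "hflip":  lambda m: T.RandomHorizontalFlip(p=m),
}

def sample_policy(n_ops=2):
    # a policy = a few operations, each with a random magnitude in (0, 1]
    names = random.sample(list(OPS), n_ops)
    return [(name, random.uniform(0.1, 1.0)) for name in names]

def search(n_trials=20):
    best_policy, best_score = None, float("-inf")
    for _ in range(n_trials):
        policy = sample_policy()
        tf = T.Compose([OPS[name](mag) for name, mag in policy] + [T.ToTensor()])
        score = train_and_validate(tf)  # hypothetical: short training run
        if score > best_score:
            best_policy, best_score = policy, score
    return best_policy
```

More sophisticated search strategies (reinforcement learning, Bayesian optimisation, population-based training) replace the random sampling, but the overall loop is the same.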

5.2 GANs for solving limited data problems

“GANs are the most interesting idea in the last ten years in Machine Learning”.

– Facebook AI research director Yann LeCun (Wang 2020)

DA may not be enough to train DL models efficiently when training data are scarce (dos Santos Tanaka and Aranha 2019). It frequently fails to produce the variance needed to accurately reflect the whole task distribution (www.causaLens.com), which leads to model overfitting. To represent a larger portion of the task distribution, similar data can be synthesised to increase the variance in the training data. The term “Generative Adversarial Network” (GAN) was initially introduced by Goodfellow et al. (2014). GANs are generative models that synthesise new images based on the training data. The study Marchesi (2017) created high-resolution photorealistic images (up to \(1024\times 1024\) pixels) using fewer than 2000 training images, employing the DCGAN (Gao et al. 2018) variant of generative modelling; the generated photorealistic images can be a valuable asset for commercial use. The study Zhang et al. (2021a) proposed the Deep Adversarial Data Augmentation (DADA) model, i.e. learning-based DA built on a GAN. The paper also offers a novel loss for the GAN discriminator, referred to as the \(2k\) loss, in contrast to the \(k+1\) loss employed by many existing GANs. Experiments were conducted on the CIFAR-10, CIFAR-100 and SVHN datasets, sampled to simulate very low data regimes (fewer than 1000 samples). The study compared DADA-based augmentation against traditional DA (TDA) and no augmentation by measuring the classifier’s performance; the experimental findings reveal that DADA significantly outperforms both TDA and several GAN-based models.
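
A minimal GAN training sketch (PyTorch assumed): the generator maps noise to fake samples while the discriminator learns to tell real from fake, and the two are optimised adversarially. The layer sizes here are illustrative, not taken from any cited work.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(),
                  nn.Linear(256, 784), nn.Tanh())         # noise -> fake image
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))                      # image -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(real):                                       # real: (batch, 784)
    z = torch.randn(real.size(0), 64)
    fake = G(z)
    # discriminator step: push real -> 1 and fake -> 0
    loss_d = bce(D(real), torch.ones(real.size(0), 1)) + \
             bce(D(fake.detach()), torch.zeros(real.size(0), 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # generator step: try to fool the discriminator (fake -> 1)
    loss_g = bce(D(fake), torch.ones(real.size(0), 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

Once trained on the small dataset, samples drawn from `G` can be added to the training set as synthetic augmentation.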

In another study on augmenting small datasets with GANs, dos Santos Tanaka and Aranha (2019) synthesised high-quality skin lesion samples using the StyleGAN model. The study utilised the International Skin Imaging Collaboration (ISIC 2018) classification challenge dataset (Codella et al. 2018), which consists of 10,015 skin lesion images. The dataset is imbalanced across seven classes: more than 77% of the samples belong to only two categories, melanocytic nevi and melanoma, while the other classes have only a few hundred samples each; vascular skin lesions, for instance, have only 115 samples. GAN-based models were used to synthesise images, which is very challenging with such a small dataset. The CNN classifier exhibited improved accuracy when synthesised images were added to the training set (Frid-Adar et al. 2018).

One drawback of training GANs on limited datasets is that the generator produces only images with small variance around a restricted number of modes, which characterise the manifold it has learned (Bowles et al. 2018). Sufficient data must be available for a smooth manifold to be learned; yet, when that much annotated data is available, augmentation, and hence a GAN, is probably unnecessary. One study (Bowles et al. 2018) therefore aimed to learn this smooth manifold from a considerably smaller set of labelled images, using a method influenced by TL, many unlabelled images, and a few labelled images. On small datasets, the network parameters are not fully determined, resulting in poor generalisation (Antoniou et al. 2017). The study by Antoniou et al. (2017) demonstrated that a much more comprehensive range of augmentations is possible: their Data Augmentation Generative Adversarial Network (DAGAN) model, based on the CGAN model, shows a noteworthy enhancement in overall performance and generalisation when used to augment data in low data regimes. The paper Frid-Adar et al. (2018) used a GAN to augment CT images of the liver on a dataset of only 182 liver images in three categories (65 haemangiomas, 64 metastases, and 53 cysts). A CNN was used for classification, and its performance was evaluated against TDA and GAN-based augmentation; the study revealed better performance with GAN-based augmentation.

The authors of Karras et al. (2020) propose an adaptive discriminator augmentation technique that significantly stabilises GAN training in limited data regimes. Without changing loss functions or network designs, the approach can be used to train from scratch or to fine-tune an existing GAN on a different dataset. The authors applied this technique to a number of limited-data problems and produced noteworthy results; consequently, in applications where limited data would otherwise yield poor-quality synthetic images, this method can be applied to improve the synthesis. Table 3 shows the summary of GAN-based techniques for solving small data problems applied in the literature.

Table 3 Overview of GAN-based techniques for addressing small data challenges

GANs have emerged as the most versatile generative models due to their unparalleled data synthesis capabilities across various domains. However, training GANs remains highly unstable for several reasons, including vanishing or exploding gradients and oscillatory or diverging dynamics when attempting to find a Nash equilibrium. Although recent studies have proposed various solutions to enhance training stability, achieving stable GAN training continues to be an open research challenge. Another issue with GANs is mode collapse, where the model generates outputs of limited diversity. Numerous solutions have been proposed to mitigate this problem, including modifications to network structures, optimised loss functions, and improved training algorithms. While these techniques have partially addressed mode collapse, further research is needed to enhance the diversity of generated data, particularly for large-scale datasets (Ahmad et al. 2024).

5.3 Transfer learning

“Transfer Learning will be the next ML Success”.

– Andrew Ng, NIPS 2016 tutorial

In traditional ML models, it is typically assumed that the training and testing data come from the same distribution. However, in many real-world scenarios, this assumption does not hold. The solutions discussed in Sects. 5.1 and 5.2 address one problem of these models, i.e. insufficient data. Another challenge, limited computational capacity, can be addressed with cloud computing and distributed learning, but these solutions have several drawbacks, such as high cost, inefficiency, and security concerns. TL addresses these challenges and has recently become a viable approach to mitigating such problems (Chen et al. 2021). Some recent survey papers on TL can be referred to in this regard (Niu et al. 2020; Tan et al. 2018; Zhuang et al. 2021). To avoid starting from scratch with big datasets, TL primarily seeks to complete the target task by leveraging the knowledge gained from source tasks across multiple domains (Pan et al. 2011). TL is frequently used to minimise the impact of small datasets (Pan et al. 2011; Ibragimov and Xing 2017; Interian et al. 2018).

On small datasets, algorithms overfit easily (Yosinski et al. 2014), and it has been demonstrated that feature transfer performance degrades as the source and target become increasingly dissimilar (Yosinski et al. 2014). The relation between a model’s training data size and its trainable parameter count significantly impacts model performance (Romero et al. 2019). As a result, there is increasing interest in employing TL to train big models, such as CNNs, in areas with a dearth of training data or other limitations. Small target datasets have proven considerably more sensitive to variations in TL hyperparameters; hence, it is helpful to distinguish across target dataset sizes (Plested and Gedeon 2019a). The research of Plested and Gedeon (2019a) demonstrates that the TL protocols frequently used for small target datasets lead to increased overfitting and dramatically lower accuracy than optimal protocols (Goceri 2021); the study also shows the relationship between the appropriate number of layers to transfer and the hyperparameters used for fine-tuning. The work of Yosinski et al. (2014) represents the most organised and extensive examination of TL on CNNs to date. They demonstrated that networks pre-trained on a comparable dataset and then fine-tuned generalise better than those trained directly on a massive target dataset. Performance on the target dataset improves with larger source datasets (Plested and Gedeon 2022). However, pre-training on bigger, more general source datasets can sometimes outperform source data carefully selected to resemble the target data more closely (Singh et al. 2022; Mormont et al. 2018). A preliminary study (Huh et al. 2016) demonstrates that extra pre-training data is only helpful if it is highly related to the target task; in certain circumstances, augmenting with irrelevant training data degrades performance.

The study (Zhao et al. 2022) presents a deep TL strategy utilising a CNN to address the cross-domain diagnostic challenge. The technique extracts features from the source domain data with a CNN and generates a pre-trained model, which is then fine-tuned with a small dataset from the target domain through the TL strategy, leading to the final intelligent diagnostic model. The paper Zhao et al. (2022) also employed a massive-training artificial neural network (MTANN) to detect lung nodules in a small dataset of lung CT scans; the model performed considerably better than a TL-based AlexNet.
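
A sketch of the standard TL recipe described above (PyTorch assumed): load an ImageNet pre-trained backbone, freeze its transferred weights, replace the final classification layer, and fine-tune the new head on the small target dataset. The class count and learning rate are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)   # source-task weights
for p in model.parameters():
    p.requires_grad = False                      # freeze the transferred layers
model.fc = nn.Linear(model.fc.in_features, 10)   # new head for 10 target classes

head_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(head_params, lr=1e-3)
# optionally unfreeze later blocks and fine-tune with a small learning rate;
# as discussed above, these choices matter most when the target dataset is small
```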

Nevertheless, pre-training with a smaller source dataset resulted in much lower performance when ImageNet 5k or 9k, or the more problem-specific Caltech Birds (Wah et al. 2011) and Places365 (Zhou et al. 2018), was used as the target dataset. Only a handful of large image datasets exist that are unrelated to the image classification tasks often used for pre-training; Places365, with 1.8 million training images, is one such source task. It was demonstrated that when the source and target datasets were less related, the larger and more diversified the source training dataset, the better the results on the target dataset.

5.3.1 Short target datasets

As the size of the target dataset diminishes, performance becomes more heavily reliant on TL, and TL hyperparameters exert greater influence (Plested and Gedeon 2019b). Two challenging factors affect TL’s performance as the target dataset shrinks: (1) the empirical risk estimate becomes less trustworthy, increasing the likelihood of overfitting on the target dataset; and (2) the pre-trained weights implicitly regularise the fine-tuned model, so the final weights do not deviate much from their pre-trained values (Raghu et al. 2019; Neyshabur et al. 2020). As a result of Point 1, there is a growing need for TL and other approaches that prevent overfitting. The implicit regularisation in Point 2 may help reduce the overfitting of the empirical risk estimate described in Point 1. However, if the weights transferred from the source dataset are inappropriate for the target dataset, this can have a negative impact (negative transfer): when the weights, and the features they create, are confined to values far from ideal, the implicit regularisation exacerbates the negative influence on performance (Plested and Gedeon 2019b).

5.3.2 Smaller target datasets with similar tasks

A well-known work on TL (Yosinski et al. 2014) used an AlexNet (Krizhevsky et al. 2017) with vast, tightly connected source and target datasets. In Plested and Gedeon (2019b), the same experiments were performed with various datasets but with a smaller target dataset size than used by Yosinski et al. (2014). Compared to the conventional hyperparameters from Yosinski et al. (2014), there was a considerable improvement when employing more optimal TL hyperparameters, and the improvement in accuracy grew as the sample size shrank. The average accuracy increased from 20.86% to 30.12% on the smallest target dataset of only ten samples for each of the 500 classes when employing optimal rather than commonly utilised hyperparameters. The study also demonstrates that the conventional approach of transferring all but the final classification layer is not optimal. The improvement of TL over random initialisation correlates positively with how closely the target and source datasets are related, as demonstrated in Deng et al. (2010) by transferring the pre-trained CNN model to significantly smaller datasets. However, the improvement correlates negatively with the target dataset size. The study Kornblith et al. (2019) states that the performance improvement over a model trained from scratch is marginal for the CARS and FGVC Aircraft (Kornblith et al. 2019) datasets, approximately 0.6% and 0.2% respectively, because the similarity between the source (ImageNet 1k) and target datasets is very low. According to Kornblith et al. (2019), there is a negative association between target dataset size and the improvement over the baseline, but the lower the baseline accuracy, the greater the gain in accuracy, since there is more room for improvement.

5.3.3 Smaller target datasets with less similar tasks

TL often works better on smaller target datasets that are more closely related to the source dataset than on big datasets that are less related (Kornblith et al. 2019). Self-supervised learning approaches customised to a specific task and applied to more comparable but unlabelled source datasets usually outperform supervised learning techniques applied to less similar source datasets (Azizi et al. 2021; Zoph et al. 2020b). Recent research shows that TL may accelerate convergence even when the source and target datasets are vastly dissimilar (Azizi et al. 2021; Siuly and Zhang 2016). Some tasks rely on TL because their target datasets are significantly smaller, including face detection (Zhang et al. 2020), facial expression recognition (FER) (Li and Deng 2022; Revina and Emmanuel 2021) and medical image diagnosis (Siuly and Zhang 2016; Chen et al. 2018; Singha et al. 2021; Anwar et al. 2018; Xu et al. 2021; Afshar et al. 2019). These applications usually have very few training datasets available, and some unique challenges arise with them. In face recognition, there is minimal variation among the samples within a class, because each class represents only one individual; a further challenge is that there can be hundreds of thousands or even millions of classes, far more than in the ImageNet dataset. TL plays a significant role in these problems: a DL model can be trained on publicly available celebrity faces and then transferred to the limited target dataset (Plested and Gedeon 2022). In FER, the data are often limited, making these problems challenging: most popular FER datasets contain fewer than 10,000 images or videos, and even the largest frequently used ones feature only about 100 different subjects, which results in high correlation between the individual images. A further obstacle specific to facial expression recognition is the significant intra-class variance caused by personal characteristics, such as age, gender, ethnic origin, and degree of expressiveness (Li and Deng 2022).

The study Deng et al. (2010) demonstrated that pre-training with source data more closely related to the target dataset improves performance: pre-training on a sizeable facial recognition dataset outperformed pre-training on the more general and distantly related ImageNet 1k (Deng et al. 2010). Performance was also enhanced by a multi-stage pre-training pipeline in Ng et al. (2015), which uses an extensive FER dataset for interim fine-tuning before the final fine-tuning on the small target dataset. Medical imaging is another application of TL. Here, DL models face two issues: (1) data scarcity, as medical image training databases frequently contain only hundreds or thousands of images, which is insufficient for a DL model to be effectively trained (Mazurowski et al. 2019); and (2) imbalanced datasets, as there are frequently many more healthy samples than unhealthy ones. DL models are not adequately trained as a result of these issues. TL techniques are used in every medical imaging modality, including CT scans, pathological samples, X-rays, PET, and MRI (Mazurowski et al. 2019). Despite this, there is relatively little research on optimal practices for deep TL in medical scan identification. In Tajbakhsh et al. (2016), the researchers investigated an AlexNet pre-trained on ImageNet 1k with and without fine-tuning, an AlexNet trained from scratch, and traditional models with hand-crafted features. A pre-trained AlexNet with appropriate fine-tuning regularly outperformed, or was on par with, training from random initialisation and conventional methods. While the performance advantage of a pre-trained, fine-tuned AlexNet was slight for comparatively bigger target datasets, it became considerably more important as the target dataset size was lowered. The study Tajbakhsh et al. (2016) utilised a simplified AlexNet (an AlexNet with fewer parameters) as the DL model. The paper constructed a classifier for online facial expressions using a small dataset of only 480 images; as measured by average fivefold cross-validation, the model achieved an accuracy of 78.69%, and expanding the dataset boosted the classifier’s accuracy.

The study Keshari et al. (2020) presents a Dynamic Attention Pooling (DAP) method that extracts global knowledge from the most discriminative sub-part of the feature map. The performance of DAP was analysed with a ResNet model on comparatively small publicly available datasets, such as SVHN, C10, C100, and TinyImageNet. The proposed ResNet-based DAP showed improvements of 1.75%, 0.47%, and 1.87% on C10, C100, and TinyImageNet, respectively. However, several recent results show deep TL yielding little or no improvement over random initialisation (Raghu et al. 2019; He et al. 2019; Zoph et al. 2020a). The findings of Barbero-Aparicio et al. (2024) highlight the potential of deep transfer learning as a cutting-edge approach for protein fitness prediction: by utilising pre-trained models and fine-tuning them on small datasets, researchers can attain performance levels that exceed those of traditional supervised and semi-supervised methods. Table 4 presents a summary of TL-based techniques for solving small data problems applied in the literature. The success of TL is not always assured. When the source and target tasks are unrelated, or the transferred representation lacks sufficient information relevant to the target task, TL may fail to improve—and can even degrade—performance compared to training from scratch on the target task, a phenomenon known as negative transfer (Zhang et al. 2023). Understanding when and what to transfer between tasks, so that transfer learning remains effective, is therefore an essential area of study (Tan et al. 2024). A current trend in transferability research (Tan et al. 2024) focuses on efficiently predicting transfer performance beforehand, with minimal or no training of the transfer model. Several effective transferability metrics have been introduced, such as negative conditional entropy (NCE) (Tran et al. 2019) and the H-score (Bao et al. 2019). The study (Barbero-Aparicio et al. 2024) presents a DL model that leverages TL, using the pre-trained Inception V3 network to apply its knowledge to a small labelled dataset in the construction context. This enables the model to learn meaningful representations from the limited training data, thereby enhancing its accuracy in classifying material conditions. Moreover, GLCM-based texture features are extracted from the images to capture textural variations in construction materials. The proposed approach achieved an accuracy of 97% with 208 images and 71% with 70 images.
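To illustrate what such a transferability metric computes, the sketch below implements the H-score of Bao et al. (2019)—the trace of the (pseudo-)inverse feature covariance multiplied by the covariance of the class-conditional feature means—on pre-extracted features; a higher score suggests the features will transfer better to the target labels. The NumPy rendering is our own simplification.

```python
import numpy as np

def h_score(features: np.ndarray, labels: np.ndarray) -> float:
    """H-score transferability estimate: tr(pinv(cov(f)) @ cov(E[f|y]))."""
    f = features - features.mean(axis=0)       # centre the features
    cov_f = np.cov(f, rowvar=False)            # overall feature covariance
    g = np.zeros_like(f)
    for c in np.unique(labels):
        idx = labels == c
        g[idx] = f[idx].mean(axis=0)           # replace each row by its class mean
    cov_g = np.cov(g, rowvar=False)            # covariance of class-conditional means
    return float(np.trace(np.linalg.pinv(cov_f) @ cov_g))
```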

Table 4 Overview of TL-based techniques for addressing small data challenges

5.4 Few-shot learning

In 1950, Alan Turing posed the query “Can machines think?” in his famous paper “Computing Machinery and Intelligence” (Tsai and Salakhutdinov 2017). The paper states that “the idea behind digital computers may be explained by saying that these machines are intended to carry out any operations that could be done by a human computer”—that is, such machines should be capable of performing any task a human computer could perform. The ultimate aim of machines is to match human intelligence, and numerous DL methods have helped AI surpass human accuracy levels on specific benchmarks. CNNs (Krizhevsky et al. 2017) and LSTMs (Hochreiter and Schmidhuber 1997) are two model families that have contributed to this advancement. Big datasets like ImageNet, which contains 1000 categories (Krizhevsky et al. 2017), are readily available in the era of big data and are used to train DL models. AI has also advanced thanks to distributed platforms and powerful processing hardware such as GPUs.

To narrow the gap between AI and humans, ML models must generalise from a small number of instances and learn from experience (Fei-Fei et al. 2006). Few Shot Learning (FSL) is an ML paradigm that enables new information to be learned from small datasets. ResNet (He et al. 2016) surpasses humans in ImageNet classification, but only when each class in the dataset has enough samples, which is not feasible for all applications. For data-intensive applications, FSL can reduce data collection effort. Applications of FSL include face recognition, image classification (Liu et al. 2019), object tracking (Bertinetto et al. 2016), image retrieval (Triantafillou et al. 2017), video event detection (Zhang et al. 2019), language modelling (Vinyals et al. 2016; Bansal et al. 2019), and gesture recognition (Pfister et al. 2014; Feng and Duarte 2019).
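A common metric-based baseline makes the FSL setting concrete: each class is summarised by a “prototype” averaged from a handful of support examples, and queries are classified by their distance to the nearest prototype. The PyTorch sketch below follows this prototypical-network style; it is a standard FSL baseline rather than the specific method of any study reviewed here, and the encoder is assumed to map inputs to fixed-length embeddings.

```python
import torch
import torch.nn.functional as F

def prototypical_episode(encoder, support_x, support_y, query_x, n_classes):
    """One few-shot episode: classify queries by distance to class prototypes."""
    z_support = encoder(support_x)                       # [n_support, d]
    z_query = encoder(query_x)                           # [n_query, d]
    # One prototype per class: the mean of that class's support embeddings.
    prototypes = torch.stack(
        [z_support[support_y == c].mean(dim=0) for c in range(n_classes)]
    )                                                    # [n_classes, d]
    logits = -torch.cdist(z_query, prototypes) ** 2      # closer => higher score
    return F.log_softmax(logits, dim=1)                  # per-class log-probabilities
```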

The study Sun et al. (2021) devised an FSL-based model that discovers discriminative features by emphasising critical areas in the image, employing a focus-area localisation method to identify visually comparable areas among different objects. Furthermore, a real-world fine-grained dataset, miniPPlankton, typical of FSL datasets in marine ecological environments, was constructed and extensively validated. Its fine-grained phytoplankton images were collected with an electron microscope, but only a few samples are available per class; plankton image classification is becoming increasingly crucial for marine observation and aquaculture. Medical datasets are another application where FSL can have a significant impact, because most available datasets are small and thus insufficient for conventional training, and FSL approaches can be very helpful in resolving these issues. Another study in this field (Medela et al. 2019) validated FSL approaches for knowledge transfer from a well-defined source domain of colon tissue to a more general domain comprising colon, lung, and breast tissue using only a small number of training samples: with only 60 training images, FSL achieved a balanced accuracy of 90%. Other studies that have used limited medical datasets to investigate FSL include Cai et al. (2020), Chen et al. (2020a, b), Wibowo et al. (2022), and Feyjie et al. (2020). The paper Feng and Duarte (2019) presented a few-shot human activity recognition technique that employs a DL approach to extract features and perform classification, with knowledge transfer via model parameter transfer. Because human-generated activity data are expensive to obtain and activity modes share inherent similarities, borrowing information from existing activity recognition models may be more efficient than collecting additional data to train a new model from scratch when only limited training data are available.

In Iwata and Kumagai (2020), the authors offer an FSL approach that predicts the future value of a time series in a target task from a limited number of time series in the target domain. The study Bansal et al. (2019) presents a new approach, LEOPARD, that enables optimisation-based meta-learning across tasks with distinct categories and analyses alternative strategies for generalisation to various NLP classification problems. LEOPARD is trained on a state-of-the-art transformer architecture and exhibits improved generalisation to tasks unseen during training, with as few as four examples per class. In an evaluation spanning 17 NLP tasks across diverse domains—entity typing, sentiment analysis, natural language inference, and many other text classification tasks—LEOPARD outperformed several strong baselines by learning initial parameters for FSL more effectively than, for example, self-supervised pre-training or multi-task training, yielding a 14.6% average relative improvement in accuracy on unseen problems with only four samples per class. The research Zhao et al. (2023) investigated an FSL model with TL on a small dataset of import and export commodities. They used a ResNet18 backbone and DA to enlarge the tiny initial dataset before training, which helped mitigate the CNN model's overfitting; an attention module was also included in the backbone.

The paper Drumond et al. (2023) provides a few-shot motion prediction model built on the underlying network structure. The model employs heterogeneous sensors and shows a considerable performance increase over all relevant baselines, from 10.4 to 39.3%. The study anticipates motion for previously unseen actions using only a few labelled instances; the benefit is that end users can contribute additional movements by demonstrating an activity a few times before the model reliably categorises it and forecasts future frames. Another study, Zheng et al. (2022), proposed the ANomaly dEtection framework with Multi-scale cONtrastive lEarning (ANEMONE), a broad framework based on contrastive learning for graph anomaly detection. The approach uses multi-scale information at the patch and context levels to detect abnormal patterns concealed in complex networks. Comprehensive trials with ANEMONE and its variant ANEMONE-FS in fully unsupervised and few-shot anomaly detection settings show that both approaches consistently outperformed state-of-the-art methods on six benchmark datasets. Table 5 provides a summary of FSL-based techniques for solving small data challenges applied in the literature.

Table 5 Overview of FSL techniques for addressing small data challenges

In recent years, innovative FSL approaches have been developed to address various computer vision challenges, including object detection, 3D reconstruction, and video inputs. Current few-shot image classification methods, however, primarily focus on general datasets (Gharoun et al. 2024). Owing to data security concerns and data collection challenges, there is limited research on specialised datasets. Consequently, constructing larger-scale, higher-quality image datasets and developing dedicated datasets for specific fields is a critical research challenge in few-shot image classification. FSL techniques often encounter overfitting because they rely on very few samples, whose limited diversity makes it difficult to learn complex features effectively. Considerable research is therefore required to overcome these challenges and enhance the performance of FSL techniques. FSL is also still in its early stages in its application to multimodal data. Some research in this area, such as Peng et al. (2019) and Xing et al. (2019), has explored multimodal FSL for image classification. For instance, the study Drumond et al. (2023) combined image features and semantic features for FSL classification. Similarly, the AM3 method proposed in Zheng et al. (2022) adaptively and selectively combines semantic and visual features, significantly enhancing classification performance compared to the original algorithm.

5.5 Loss function, regularisation, and architecture-based methods

One consistent convention in the current DL discourse is that categorical cross-entropy loss following softmax activation is the preferred technique for classification. One study, Barz and Denzler (2020), revealed that the cosine loss function performs significantly better than cross-entropy on datasets with only a few samples per class, demonstrating a 30% gain in accuracy over cross-entropy on the CUB 200-2011 dataset without pre-training.
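For concreteness, a minimal sketch of the one-hot variant of this loss follows: it rewards cosine similarity between the L2-normalised feature vector and the unit-norm one-hot target instead of applying softmax cross-entropy. This is our own PyTorch rendering of the idea, not the authors' reference implementation (which also supports semantic class embeddings).

```python
import torch
import torch.nn.functional as F

def cosine_loss(features: torch.Tensor, labels: torch.Tensor, n_classes: int):
    """One-hot cosine loss: 1 - cos(normalised feature, one-hot target)."""
    target = F.one_hot(labels, n_classes).float()   # one-hot rows have unit norm
    feat = F.normalize(features, dim=1)             # project features to unit sphere
    return (1.0 - (feat * target).sum(dim=1)).mean()
```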

The Orthogonal Softmax Layer (OSL), which keeps the weight vectors in the classification layer orthogonal during both the training and test phases, is proposed as a solution in Li et al. (2020). The proposed OSL shows superior performance compared to the comparison techniques on four benchmark datasets with small samples, and experimental findings suggest that it also applies to datasets with large samples; a sketch of the orthogonality constraint follows below. Table 6 provides a summary of techniques, including loss function, DL architecture, and regularisation-based techniques, for solving small data challenges in the literature.
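One simple way to realise such an orthogonality constraint is to restrict each class's weight vector to its own disjoint block of feature coordinates, which makes the class vectors orthogonal by construction. The sketch below is our illustration of this idea rather than the exact layer of Li et al. (2020); the even block partition and initialisation scale are assumptions.

```python
import torch
import torch.nn as nn

class OrthogonalSoftmaxLayer(nn.Module):
    """Classification layer whose class weight vectors occupy disjoint
    coordinate blocks and are therefore orthogonal by construction."""
    def __init__(self, in_features: int, n_classes: int):
        super().__init__()
        assert in_features % n_classes == 0
        self.weight = nn.Parameter(0.01 * torch.randn(n_classes, in_features))
        block = in_features // n_classes
        mask = torch.zeros(n_classes, in_features)
        for c in range(n_classes):
            mask[c, c * block:(c + 1) * block] = 1.0   # disjoint support per class
        self.register_buffer("mask", mask)             # fixed during train and test

    def forward(self, x):
        return x @ (self.weight * self.mask).t()       # logits fed to softmax
```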

Table 6 Overview of loss function, architectural changes, and regularisation-based techniques for addressing small data challenges in literature

Table 7 shows a comparative study of recent DL approaches applied to various small datasets. Each reviewed paper is categorised by three characteristics: the reference (abbreviated Ref.); the approach employed to address small dataset problems (abbreviated TDA, ODA, GAN, TL, FSL, L&A, and OT); and the evaluation metric utilised. The complete forms of these abbreviations are given in Table 9. Table 7 reveals that TDA was the least used technique, while TL and FSL were the most frequently employed, highlighting their popularity and effectiveness for small dataset scenarios. A substantial number of studies explored other methods—such as the cosine loss function, network architecture changes like adding an orthogonal softmax layer, and regularisation-based variations—to tackle small dataset challenges. The primary evaluation metric employed across the studies is classification accuracy, emphasising its importance as a measure of model effectiveness.

Table 7 Comparative assessment of deep learning models for small datasets

In Table 8, dataset size is divided into five categories—D1, D2, D3, D4, and D5—representing ranges of dataset size based on the number of samples per class in the training dataset. D1 includes datasets with fewer than 100 samples per class. D2 and D3 encompass datasets with 101 to 1000 and 1001 to 3000 samples per class, respectively. D4 includes datasets with 3001 to 10,000 samples per class, and D5 represents datasets with 10,001 or more samples per class. The distribution summary depicted in Fig. 7 shows that 33% of the studies used D1 datasets, i.e., fewer than 100 samples per class. The next most common category is D2, accounting for 29% of the studies (101 to 1000 samples per class). D3 (1001 to 3000 samples per class) contributes 18% of the studies, while D4 and D5, representing larger datasets, each comprise 10%. Table 8 also provides additional information, such as references (cited as “Ref.”), the NDS column denoting the number of datasets studied, the technique employed to address small dataset problems, and any additional method apart from DA, TL, GAN, FSL, and L&A. Datasets marked with (*) in the NDS column are imbalanced.

Table 8 Size distribution of small datasets and the wide range of methods used in recent research articles
Fig. 7 The percentage distribution of studies conducted based on different dataset sizes

Table 9 Abbreviations used in comparison tables

6 Open issues and future research directions

6.1 Open issues

A study conducted by Gartner, Inc.Footnote 7 reported that 70% of enterprises will shift their focus from big data to small and wide data by 2025, giving more context for analytics and making AI less data-hungry. Nevertheless, several issues remain open in DL with small datasets; the major ones are discussed below.

6.1.1 Poor generalisation with small datasets

ML theoreticians have long relied on the Independent and Identically Distributed (IID) assumption, which states that test cases are drawn from the same distribution as the training samples. Unfortunately, in the real world this is not a reasonable assumption, and the performance of today's state-of-the-art AI systems suffers when they move from the controlled laboratory to the field.

The goal is to improve a model's robustness when challenged with variations in sample distribution. Generalisation refers to the model's capacity to perform effectively on unseen data after being trained on a small dataset. A major reason generalisation suffers is overfitting, in which the model becomes overly specialised to the training data and fails to generalise to new cases. Overfitting arises when the model is excessively complex for the available data, causing it to memorise the training instances instead of learning generalisable patterns; it is compounded when the training data do not accurately represent the population the model is meant to serve, resulting in biased predictions.

Recent research helps us understand how different DL architectures perform in terms of systematic generalisation. How can we develop future ML systems with improved generalisation capabilities that adapt faster when the data are out-of-distribution? Several studies discuss generalisation in DL in detail (Kawaguchi et al. 2022; Bousquet and Elisseeff 2002; Zhang et al. 2021b; Olson et al. 2018; Power et al. 2022; Caro et al. 2022; Chatterjee and Zielinski 2022).

Enabling higher-level cognition in deep learning models

DL models frequently lack the ability to reason as humans do. To enable higher-level cognition, researchers are creating new strategies that allow DL models to reason and learn from small datasets. One promising method is incorporating symbolic reasoning into DL models. The capacity to handle abstract symbols and apply logical principles has long been a characteristic of human cognition; by building it in, researchers hope to give DL models the ability to reason about relationships and abstract concepts and to generalise to new tasks and domains. For instance, it has been demonstrated that DL models that include symbolic reasoning perform well on tasks such as visual question answering and common-sense reasoning (Storrs and Kriegeskorte 2019a; Perconti and Plebe 2020; Goyal and Bengio 2022).

Incorporating prior information into DL models is another strategy for enabling higher-level cognition. This may entail integrating knowledge from subject-matter specialists or from other sources, including language models.

Overall, enabling higher-level cognition in DL models is an active research topic, and much remains to be done before DL models can reason and generalise in a way more akin to human cognition (Goyal and Bengio 2022; Storrs and Kriegeskorte 2019b; Battleday et al. 2021).

6.1.2 Robustness

DL models are often vulnerable to changes in the input data, including adversarial attacks. With small datasets this is particularly problematic, since even a small number of adversarial cases can greatly degrade the model's performance (Battleday et al. 2021; Qian et al. 2022; Allen-Zhu and Li 2022; Shaukat et al. 2022). Further study is required to determine how to strengthen DL models, particularly for limited datasets.
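As a concrete illustration of the threat model, the sketch below implements the classic fast gradient sign method (FGSM), a standard one-step attack used to probe robustness; it is included only as an example of the kind of perturbation involved, not as a technique proposed in the studies cited above.

```python
import torch

def fgsm_attack(model, loss_fn, x, y, epsilon=0.03):
    """One-step adversarial perturbation along the sign of the input gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()     # nudge inputs to increase the loss
    return x_adv.clamp(0.0, 1.0).detach()   # keep pixel values in a valid range
```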

6.1.3 Unsupervised learning

Unsupervised learning is one fundamental route beyond supervised, data-hungry versions of DL. DL and unsupervised learning are not in logical opposition: DL is generally utilised in a supervised scenario with labelled data, but there are unsupervised applications where excellent results can be obtained with DL techniques. Many sectors have good reasons to shift away from the huge data requirements that supervised DL typically imposes.

Unsupervised learning seeks to build usable data representations without explicit labelling or supervision. Autoencoders, variational autoencoders, and generative adversarial networks are DL algorithms that have made remarkable strides in unsupervised learning tasks, including clustering, anomaly detection, and dimensionality reduction. Nevertheless, employing DL for unsupervised learning still faces many obstacles and open problems, among them developing efficient training algorithms, designing appropriate architectures, and understanding the theoretical foundations of deep unsupervised learning (Agarwal et al. 2022; Akcakaya et al. 2022; Tao et al. 2022).
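As a minimal example of this family, the sketch below defines a small autoencoder that learns a compact representation of unlabelled inputs by reconstructing them; the layer sizes are placeholders, and in practice it would be trained with a reconstruction loss such as mean squared error.

```python
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Learns a low-dimensional representation without labels by
    compressing inputs and reconstructing them."""
    def __init__(self, in_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)           # compact unsupervised representation
        return self.decoder(z), z     # reconstruction and embedding
```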

6.1.4 Data diversity problem

The data diversity problem remains open in DL models, especially with small datasets. It can be mitigated with techniques such as DA, TL, regularisation, and ensemble learning, although these methods do not always work and are not feasible in all circumstances. The issue may also worsen as DL models grow more complex and more data become available to choose from, because complex models are more prone to overfitting the training data and therefore need more varied instances to acquire robust input representations.

Another problem is that defining or quantifying data diversity is not always straightforward. It can be challenging to determine whether a model's poor performance on fresh samples stems from bias, noise, labelling mistakes, or a lack of variety in the training data. Considerable research is therefore still being done on data diversity in DL, and new methods and techniques are expected to emerge as the subject develops.

6.1.5 How small is actually “small” in deep learning models?

It is uncertain what minimum amount of data is required for DL models to function well; the notion of a small dataset differs across applications and model architectures.

6.1.6 Effective data augmentation and regularisation

DA techniques, such as rotation, scaling, and cropping, can help generate additional training data, but it is unclear which augmentation technique is most effective and how to balance augmentation against overfitting. While regularisation approaches reduce overfitting, it is difficult to ascertain which strategies are most effective on small datasets; techniques such as weight decay and dropout are extensively used in the literature.
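The sketch below shows what these standard ingredients look like in practice—an image augmentation pipeline plus dropout and weight decay. All values are illustrative defaults rather than recommendations, since, as noted above, the most effective settings are dataset-dependent.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms

# Augmentation pipeline for a small image dataset (illustrative values).
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                 # rotation
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # scaling + cropping
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# A small classifier with dropout regularisation.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 224 * 224, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),               # dropout: randomly zero activations
    nn.Linear(256, 10),
)

# Weight decay (L2 regularisation) applied through the optimiser.
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```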

6.2 Future research directions

Some primary future research directions, beyond those discussed in studies such as Zhang et al. and ul Sabha et al. (2024), are as follows:

  • Future research should extensively validate the small dataset techniques reviewed above on real-world applications. Applying them to diverse fields such as healthcare, agriculture, and finance can demonstrate their practicality and lead to tailored solutions for specific domains.

  • Over the past decade, DL models, particularly deep neural networks, have seen substantial advancements. In many practical applications, the model architecture has matured to a point where it can be considered a solved problem. Consequently, it is now more beneficial to maintain a fixed neural network architecture and direct research efforts towards enhancing the quality and quantity of data. Future research should, therefore, prioritize data-centric AI, focusing on innovative ways to improve data to boost model performance.

  • Another promising direction is the targeted use of GAN-based data synthesis. For example, consider a model trained on a dataset with five classes that performs well on four but poorly on one. Instead of enhancing the overall model or dataset, the data for the underperforming class can be specifically augmented with GAN-generated samples (see the sketch after this list). This targeted approach to addressing specific weaknesses can be a valuable strategy for future research, allowing more precise and effective improvements.

  • Another future research direction could be the development of a tool designed to identify the most beneficial subset of a big dataset for model training. This tool would select a small, representative dataset that maximizes training efficiency. Additionally, it could pinpoint specific areas where data augmentation is needed, thereby reducing the effort and resources required to collect additional data across the board. Instead of gathering more data indiscriminately, this targeted approach would focus on augmenting data for only those classes that truly need it, streamlining the data collection process.

  • A significant research direction in the area of small datasets involves developing customized loss functions and model architectures specifically suited for limited data scenarios. Future research could explore adaptive loss functions that dynamically adjust based on data characteristics and the learning stage. Additionally, lightweight model architectures requiring fewer parameters should be investigated to enhance robustness and generalisation when working with small datasets. This approach aims to optimize model performance and efficiency, making DL more effective in data-constrained environments.
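As a sketch of the targeted GAN-based synthesis mentioned above, the hypothetical helper below draws synthetic examples of a single underperforming class from an already-trained conditional GAN generator, to be mixed into the training set for that class only; the generator(noise, labels) interface and latent dimensionality are assumptions.

```python
import torch

@torch.no_grad()
def synthesise_for_weak_class(generator, class_idx, n_samples, latent_dim=100):
    """Sample synthetic inputs of one weak class from a trained
    conditional GAN generator (assumed signature: generator(z, labels))."""
    z = torch.randn(n_samples, latent_dim)                          # latent noise
    labels = torch.full((n_samples,), class_idx, dtype=torch.long)  # fixed class
    return generator(z, labels)       # synthetic samples for that class only
```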

7 Conclusion

This study comprehensively analyses the advancements in DL models trained on small datasets. The state-of-the-art techniques in this area were thoroughly reviewed, illustrating their advantages and disadvantages. A PRISMA model search was performed to identify 165 relevant studies, which were subsequently analysed on various attributes, such as publisher, country, small dataset technique utilised, dataset size, and performance. A comparative analysis of the different small dataset techniques across different metrics was then conducted. Based on our findings, several critical paths for future research in DL on small datasets were identified. Overall, this publication is anticipated to be a helpful resource for academics and industry professionals interested in this area and to inspire further studies addressing the challenges of DL with small datasets. Beyond the limitations caused by the lack of data, there is significant interest in investigating cutting-edge methods to enhance the effectiveness of DL models.