Introduction

Disasters, whether natural or human-induced, often cause loss of life, destruction of property, and damage that can impose a significant impact on communities over a long period. With the proliferation of smart mobile devices, people increasingly use social media applications during disasters to share updates, check on loved ones, or inform authorities of issues that need to be addressed (e.g., damaged infrastructure, stranded livestock). Besides physical sensors and many other sources, human sensors, such as people who use smart mobile devices, generate massive amounts of data in different modalities (such as text, audio, video, and images) during a crisis. Such datasets are generally characterised as multimodal [17].

Disaster response (DR) tasks bring together groups of officials who often serve different organizations and represent different positions, and their information requirements remain complex, dynamic, and ad hoc [101]. Also, it is beyond the capacity of the individual human brain to combine different forms of data in real time and process them to form meaningful information in a complex and fast-moving situation [102]. Therefore, the main challenge faced by emergency responders is effectively extracting, analyzing, and interpreting the enormous range of multimodal data that is available from different sources within a short time period. As a result, emergency responders still depend mostly on text-based reports prepared by field officers for their decision-making processes, avoiding many other sources that could provide them with useful information.

Previously, the DR research community applied classical Machine Learning (ML) techniques to automate DR activities [2, 94]. However, the reliance of ML algorithms on handcrafted features prevented the timely use of such models, and the research processes built around them were labour-intensive and time-consuming [86]. More recently, Deep Learning (DL) methods based on Deep Neural Networks, which rely less on handcrafted features and instead learn representations directly from input data, have been used extensively to learn high-level deep features and have proven highly effective in many application areas, such as speech recognition, image captioning, and emotion recognition [14, 17, 66, 119]. As DL techniques gain popularity among researchers, there is a timely need to discuss their potential for DR activities. Researchers and practitioners need to understand what has been done in the literature and where the current knowledge gaps lie in order to make further improvements. Thus, this article analyses and systematically reviews the intersection of the two research fields (DL for DR).

We have organized our review around the components of learning as proposed by Abu-Mostafa [121] and used by Watson et al. [125] for their systematic review. Abu-Mostafa [121] demonstrated the application of five components of learning to any ML problem. These components provide a clear mapping to establish a roadmap for investigating DL approaches in DR research. Our objective is to identify application scenarios, best practices and future research directions in using DL to support DR activities. Therefore, we synthesize five main Research Questions (RQs) and eight sub-questions that support the main RQs according to the components of learning. To answer the RQs, we create a data extraction form with 15 attributes, such as DR Task, Data Type, Data Source, and DL Architecture. We create a taxonomy of DR tasks in response to the first RQ, which is then utilized to derive answers for the subsequent RQs. Finally, we use the Knowledge Discovery in Databases (KDD) process to uncover hidden relationships among the extracted values for the attributes in the extraction form. Based on our findings, we propose a flowchart with guidelines for DR researchers to follow when using DL models in future research.

We found multiple review articles that discussed the use of multimodal data for disaster response (for example, [6, 105]), outlining applications and challenges. However, many of these have not explicitly considered using DL for feature extraction. We also observed other review articles focused on individual data sources. For example, the studies [11, 55, 72, 91, 111, 124] addressed the frameworks, methodologies, technologies, future trends, and applications for disaster response using social media datasets. Among other reviews, Gomez et al. [37] analyzed remotely sensed UAV data, considering cases of different disaster scenarios. Overall, each of these reviews focuses on a single source of data and how it can be used for disaster response. The more recent article by Sun et al. [118] provides an overview of using Artificial Intelligence (AI) methods for disaster management. Our work differs significantly from the work by Sun et al. in a number of ways. Firstly, we analyze the articles systematically, adopting the components of learning as proposed by Abu-Mostafa [121]. Secondly, our analysis is confined to trending DL techniques as a subset of AI. Thirdly, we provide a wider discussion of the datasets, preprocessing, DL architectures, hyperparameter tuning, challenges and solutions in processing data for the DL task, and clarify future research directions.

The remainder of this article is organized as follows. We first provide a synthesis of the research questions in Section “Research Question Synthesis”. Section “Methodology” outlines the methodology used to analyze the literature. Sections “RQ1: What types of DR Problems have been Addressed by DL Approaches?”– “RQ5: What are the Underlying Challenges and Replicability of DL for DR Studies?” provide the analysis of the research questions and Section “Opportunities, Directions and Future Research Challenges” summarises opportunities and future research challenges. Section “Results of the Association Rule Mining” discusses the relationships extracted during the KDD process. In Section “Flowchart and Guidelines for Applying DL in Future DR Research” a flowchart is provided with recommendations for future research. Finally, in Section “Conclusion”, we broadly discuss research gaps and conclusions. An online appendix contains the full details of the analysis process, as well as the resources [12].

Research Question Synthesis

Our overarching objectives during this study are to identify research challenges and best practices, and provide directions for future research while using DL methods for DR tasks. Therefore, we have centralized our analysis around the elements of learning (see Fig. 1) and formulated the main RQs accordingly. As a result, we ensure that our analysis effectively captures the essential components of DL applications while also allowing us to perform a descriptive content analysis across these components. Furthermore, we formulated sub-questions supporting the main RQs to analyze more details. The next subsections discuss the formulation of the main RQs and sub-questions according to the components of learning.

Fig. 1
figure 1

The components of learning as proposed by Abu-Mostafa [121]

The First Component of Learning: The Target Function

The first component of the learning problem is an “unknown target function \((f:x \rightarrow y)\)” as illustrated in Fig. 1, which represents the relationship between known input (x) and output (y). The Target Function is the optimal function that we are attempting to approximate with our learning model. Therefore, the first component of learning enables the researcher to identify main application areas in the research field. As a result, we formulated our first research question to identify target functions in the DR domain, as follows:

figure a

\(\mathbf {RQ_{1}}\) aims to discover DR tasks that have been investigated previously using DL methodologies. Furthermore, the answers to our first RQ provide a taxonomy for analyzing the next research questions.

The Second Component of Learning: The Training Data

The second component of learning is the historical data (training data), required by the algorithm to learn the unknown target function. A thorough understanding of the training data leads to insights about the target function, possible features, and DL architecture design. Furthermore, the quality of the output of a DL model is directly coupled with the provided training data. Therefore, our second question is formulated to understand training data.

figure b

Our goal during this question is to capture the types of training data, the extraction sources, and the preprocessing techniques applied to prepare them for the DL tasks. To support and allow a deeper understanding of the main RQ, we examine this through three sub-questions.

  • \(\mathrm{RQ}_{2.1}\) What types of DR data have been used?

  • \(\mathrm{RQ}_{2.2}\) What sources have been used to extract data, and how have data been extracted?

  • \(\mathrm{RQ}_{2.3}\) How have data been preprocessed before applying the DL models?

The answers we extract during questions \(\mathrm{RQ}_{2.1}\) and \(\mathrm{RQ}_{2.2}\) will enable future researchers to see what types and sources of data have been explored in previous studies and what data have not yet been investigated. Furthermore, \(\mathrm{RQ}_{2.3}\) provides them with the details of the preprocessing techniques that have been followed during the studies.

The Third and Fourth Components of Learning: The Learning Algorithm and Hypothesis Test

According to Abu-Mostafa [121], the third and fourth components of learning are together known as the “learning model”. The learning model consists of the learning algorithm and the hypothesis set. A learning algorithm tries to define a model to fit a given dataset. For example, the algorithm generally uses a probability distribution over the input data to approximate the optimal hypothesis from the hypothesis set. The hypothesis set consists of all the hypotheses to which the input data can be mapped. Therefore, the learning algorithm and the hypothesis set are tightly coupled. Considering the learning algorithm and hypothesis set together, we formulate our third RQ as follows.

figure c

We aim to identify and evaluate the various DL models that have been applied for DR tasks. Hence, we consider three further sub-questions to capture specific architectures and types of DL models.

  • \(\mathrm{RQ}_{3.1}\) What types of DL architectures are used?

  • \(\mathrm{RQ}_{3.2}\) What types of learning algorithms and training processes are used?

  • \(\mathrm{RQ}_{3.3}\) What methods are used to avoid overfitting and underfitting?

The answers to \(\mathrm{RQ}_{3.1}\) identify the DL architectures that have been adopted for various DR tasks. Our goal is to determine whether certain DL architectures are preferred by researchers and the reasons for those trends. As part of the analysis, we capture how transfer learning approaches have been adopted to address algorithm training and performance issues. During \(\mathrm{RQ}_{3.2}\), we intend to examine the types of learning algorithms and the training processes involved, including how parameter optimization has been achieved. Moreover, in \(\mathrm{RQ}_{3.3}\), we aim to analyze the methods used to combat overfitting and underfitting. Answers to both \(\mathrm{RQ}_{3.2}\) and \(\mathrm{RQ}_{3.3}\) will provide future researchers with an idea of how parameter tuning and optimization have been applied in DL for DR research to improve the accuracy of the output.

The Fifth Component of Learning: The Final Hypothesis

The final component of learning is the “final hypothesis”. This is the target function learnt by the algorithm to predict unseen data points. Through this component of learning, we aim to analyze the effectiveness of the algorithm at achieving the hypothesis for the selected DR task. Therefore, our fourth RQ is formulated as follows:

figure d

During the analysis for \(\mathbf {RQ_{4}}\), we derive the metrics used to evaluate the performance of DL models. Future researchers can utilize these metrics and the extracted values to compare the results achieved by their models. Additionally, we examine two sub-questions to perform a deeper evaluation of the selected question.

  • \(\mathrm{RQ}_{4.1}\) What evaluation metrics are used to evaluate the performance of DL models?

  • \(\mathrm{RQ}_{4.2}\) What “baseline” models have been compared?

Our intention with \(\mathrm{RQ}_{4.1}\) is to derive a taxonomy of the performance metrics used by the analyzed studies, while \(\mathrm{RQ}_{4.2}\) will identify the “baseline” models that have been used for comparison and allow future researchers to select those appropriate for comparing their own results.

The Final Analysis

Our fifth RQ is designed to identify and characterize underlying problems that arise when utilizing DL models for DR tasks. Our goal is to provide researchers with challenges faced by the DR research community in employing DL-based approaches. This will enable future research to be designed in a way that addresses or avoids these challenges and better utilizes DL algorithms to support DR tasks. Furthermore, we aim to analyze the replicability of DL models and architectures. Researchers are more likely to re-implement, improve, or compare new models if the existing DL architectures are easily replicable, which will eventually increase the quality and quantity of DL for DR research. Thus, our final RQ is formulated as follows:

figure e

In summary, the Systematic Literature Review (SLR) conducted in this paper answers the following research questions:

figure f

Methodology

Fig. 2
figure 2

Literature selection process

Multiple techniques have been proposed to understand the content of a body of scholarly literature, including scoping reviews, umbrella reviews, or systematic reviews [38]. Among them, the systematic review aims to exhaustively and comprehensively search for research evidence on a topic area and appraise and synthesize it thoroughly [38]. In this analysis, we are interested in identifying the gaps in the research and whether there are opportunities for researchers and practitioners to investigate new problems that have not yet been addressed in the DR domain using DL. We, therefore, consider a systematic review to be the most appropriate approach to find answers to the above formulated RQs. To the best of our knowledge, this is the first systematic review that investigates the intersection of the DL and DR research fields. Our study adopts the following steps to guide the SLR process, as highlighted by Yigitcanlar et al. [128].

  1. 1.

    Develop a research plan.

  2. 2.

    Search for relevant articles.

  3. 3.

    Apply exclusion criteria.

  4. 4.

    Extract relevant data from the selected articles.

  5. 5.

    Analyse the literature data.

Develop a Research Plan

As the first step for carrying out the SLR, a research plan was developed, including research aim, keywords, and a set of inclusion and exclusion criteria. The research aim was to identify the usage of DL techniques on disaster data to support DR tasks as outlined in RQs 1–5. Hence, “disaster” and “deep learning” were selected as the search keywords. The search also included variants of these keywords. The alternate search terms for “disaster” included ‘hazard’, ‘emergency’, ‘crisis’, and ‘damage’. Also, ‘deep neural network’ was used as an alternative keyword for DL. Some research has considered “machine learning” as an alternative keyword for DL. However, since we were particularly interested in Deep Neural Networks, we omitted “machine learning” as a keyword in the search. The inclusion criteria limited the sources to peer-reviewed academic publications available online in a full-text format and relevant to the research aims. The exclusion criteria were determined as publications in languages other than English; grey literature, such as government or industry reports; and non-academic research.

Search for Relevant Articles

In the second step, the search for relevant articles was conducted using a keyword search in each of the following databases: Scopus, Web of Science, and the EBSCO Discovery Service on April 2, 2021. Articles published since April 2011 were considered because a scan of existing literature suggested that there was not much literature related to DL in disaster research before then. The initial search produced 592 results.

Apply Exclusion Criteria

In this step, the results were filtered to remove duplicates between the databases, which reduced the number to 295 unique articles. We used a simple Python script to remove duplicates using the title of the article. We confined our scope to only papers that discuss natural or human-induced disasters. Therefore, the abstracts were manually read and removed if they discussed political crises, medical emergencies or financial crises. We also removed articles that did not provide sufficient details related to the attributes in our extraction form (see Table 1). Finally, 83 articles were selected for the review. Fig. 2 illustrates the process and the steps that we followed to filter the results and the quantity of papers returned at each step. Moreover, we provide the publication venues of the 83 articles in Fig. 3.

Fig. 3
figure 3

Publication venues of the articles. The number of grey boxes corresponds to the number of articles published in each publication venue. Full publication venue names are available in Appendix B

Extract Relevant Data from the Selected Articles

The next step in our methodology was to extract relevant data from the selected articles. We developed a data extraction form including the information shown in Table 1. The extracted information was collected manually and added to a Google sheet and later downloaded as a tab-separated (.tsv) file for the data analysis steps. The extracted data sheet is available in the online appendix [12].

Table 1 Attributes in the data extraction form

Analyse Data Using the Knowledge Discovery in Databases (KDD) process

The final step in our SLR methodology was to analyze the extracted data. We used the steps discussed in [125], namely data collection, initial coding and focused coding. After the coding process, we used the Knowledge Discovery in Databases (KDD) process to understand relationships among attributes in the extraction form. The KDD process is used to extract knowledge from databases using five steps: selection, preprocessing, transformation, data mining, and interpretation/evaluation [33]. We combined data preprocessing and transformation into one step as both steps involve preparing data for the mining step. The details of each stage are listed as follows.

  • Selection: This stage is related to the selection of relevant data for the analysis. As described in the previous section, we selected 83 articles and extracted 15 attributes from them for the analysis.

  • Preprocessing: In this stage, we cleaned the extracted values by removing noise, such as misspellings, incorrect punctuation and mismatched coding. We noticed a number of variations on particular terms and standardized these to ensure appropriate matching (e.g., ConvNet/CNN, F-measure/F1-value/F-score/F1-score).

  • Data mining: The third stage is related to identifying relationships among extracted data. We applied association rule mining to derive relationships discussed further in Section “Association Rule Mining”.

  • Interpretation/Evaluation: We interpret the findings of the KDD process in Section “Results of the Association Rule Mining”. These relationships demonstrate actionable knowledge for future researchers from the 83 articles analyzed through the SLR process.

Association Rule Mining

We followed the association rule mining process introduced by Samia et al. [61] for literature analysis. Our association rules are extracted using the Apriori algorithm. Association rules help to discover relationships in categorical datasets. For instance, the rules generated during the process identify frequent patterns in the dataset. Associations are generally represented by “Support”, “Confidence”, and “Lift”. We illustrate this using the values in the Data Source column in the extraction form. “Support” and “Confidence” are the two indicators evaluating the interestingness of a given rule. Supp(Twitter) is the fraction of articles for which Twitter appears in the Data Source column of the extraction form as given in Eq. 1.

$$\begin{aligned} supp (Twitter) = \frac{\text {Number of Articles in which}\, \textit{Twitter}\,\text {appears in the Data Source column}}{\text {Total Number of Articles}}. \end{aligned}$$
(1)

If we consider the values in both the Data Source and the Data Type columns of the extraction form, the confidence of the association rule Twitter\(\rightarrow\) Text is the fraction of articles having Twitter in the Data Source column that also have Text in the Data Type column (see Eq. 2).

$$\begin{aligned} conf (Twitter\rightarrow Text)= \frac{supp (Twitter \cup Text)}{supp(Twitter)}. \end{aligned}$$
(2)

“Lift” measures how much more likely Text is to be found in the Data Type column when Twitter is found in the Data Source column, relative to how often Text appears overall, as given in Eq. 3. A “Lift” value greater than 1 means that Twitter and Text co-occur in the respective columns more often than expected by chance (a positive association), while a value less than 1 means that they co-occur less often than expected (a negative association).

$$\begin{aligned} lift (Twitter\rightarrow Text) = \frac{supp (Twitter \cup Text)}{supp(Twitter) \times supp (Text)}. \end{aligned}$$
(3)

These associations can provide guidance for future researchers during the planning stages of a project applying DL to DR research, supporting them in choosing different attributes, such as data source, deep learning algorithm and learning type. We used the Python apyori libraryFootnote 1 to discover association rules, details of which are presented in the online appendix [12].
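To make the mining step concrete, the following is a minimal sketch of how association rules can be extracted with the apyori library. The transactions, thresholds, and attribute values shown are illustrative placeholders, not the actual extraction-form data.

```python
from apyori import apriori

# Hypothetical "transactions": one list of extracted attribute values per analysed article.
transactions = [
    ["Twitter", "Text", "CNN", "Supervised"],
    ["Remote Sensing", "Image", "CNN", "Supervised"],
    ["Twitter", "Multimodal", "CrisisMMD", "Supervised"],
]

# Mine rules above illustrative support and confidence thresholds.
rules = apriori(transactions, min_support=0.3, min_confidence=0.7)

for record in rules:
    for stat in record.ordered_statistics:
        print(list(stat.items_base), "->", list(stat.items_add),
              f"supp={record.support:.2f}",
              f"conf={stat.confidence:.2f}",
              f"lift={stat.lift:.2f}")
```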

\(\mathbf {RQ_{1}}\): What types of DR Problems have been addressed by DL Approaches?

This RQ explores the types of DR problems that have been investigated with DL models. We derived a taxonomy of DR tasks to capture relationships between other learning components, as illustrated in Fig. 5. From the 83 papers that we analysed, we identified nine main DR tasks (level-1 of the taxonomy) that have been addressed using DL approaches. Figure 4 shows the number of papers published in each year by the main DR tasks. During the ten-year duration of studies we analysed, unsurprisingly, little work was undertaken between 2011 and 2015. There was a sudden interest in exploring DL architectures in the DR domain from 2017 onwards. This interest coincides with the introduction of popular DL frameworks, such as KerasFootnote 2 and TensorFlowFootnote 3 in 2015 and PyTorchFootnote 4 in 2016. Disaster event detection was the first task to be explored using DL algorithms. Among the other tasks, Disaster damage assessment, Disaster-related information filtering and Disaster-related information classification were explored in 2017. Remotely sensed images were the main source of data for multiple early studies that used DL approaches. Early research may have used remotely sensed data for various reasons. Firstly, in 2011, Google EarthFootnote 5 launched a platform that allowed researchers to download massive volumes of satellite imagery. This inspired researchers to investigate remotely sensed data for DR tasks. Furthermore, researchers were also able to successfully employ DL approaches since these images were available in larger quantities. Secondly, the advancement of computer vision techniques, such as DL structures pre-trained on huge datasets, made visual data processing easier.

Fig. 4
figure 4

Papers published per year according to DR task

The number of studies combining DL and DR tasks rapidly increased from 2017 to 2018, more than doubling. Furthermore, researchers extended their interest to explore multiple DR tasks over time, including Disaster rescue and resource allocation, Location reference identification, and Understanding sentiments. However, we see a slight drop in the number of articles published in 2020. This inconsistency may be due to the COVID-19 global pandemic and the physical and mental challenges that researchers encountered. We notice a significant amount of literature emerging during the first quarter of 2021, potentially representing a COVID-19 lag effect in publication.

Fig. 5
figure 5

Taxonomy of DR Tasks

Disaster damage assessment has been the most popular DR task analysed using DL approaches over the years, with 26 of the 83 articles exploring it. There are three likely reasons for the popularity of Disaster damage assessment. First, there is a strong driver and a clear need for damage assessment, as it is urgently required following an event and there is little time for manual data collection. Second, training datasets extracted from social media and remote sensing platforms were readily available for supervised learning approaches. Third, there is a clear mapping between training data and the target function (e.g., images of cracked buildings), which helps researchers extract effective features when designing DL-based applications. We observed an increasing interest in Disaster-related information filtering and Disaster-related information classification tasks. These DR tasks are mainly based on text datasets extracted from Twitter. A possible explanation for this trend could be the increased popularity of using Twitter as a communication channel during disasters. Moreover, the advancement of Natural Language Processing (NLP) techniques, together with the increased availability of annotated data corpora, aids further developments in the information filtering and classification tasks.

DR tasks such as Missing, found and displaced people identification and Location reference identification have received less attention from researchers, accounting for a total of only 4 of the 83 articles reviewed. The lack of large-scale training datasets and annotated data to train supervised learning approaches could be the main reason for the reduced popularity of these DR tasks. We summarise the papers addressing each of the main DR tasks in Table 2.

Table 2 Main DR tasks of the analysed articles

\(\mathbf {RQ_{2}}\): How have the Training Datasets been Extracted, Preprocessed, and Used in DL-Based Approaches for DR Tasks?

For this research question, we analyze the types of disaster data that have been used by DL models to support disaster response. The accuracy and effectiveness of DL algorithms depend on the training dataset and its clarity. Therefore, we aim to understand the various types of disaster data used by DL approaches, the sources and methods employed to extract them, and the preprocessing steps. All of these points are important in understanding and designing DL approaches for DR tasks.

\(\mathrm{RQ}_{2.1}\) What Types of DR Data have been used?

Our analysis of the types of data that have been used for DR tasks using DL approaches reveals relationships between DR tasks and data types, illustrated in Fig. 6. Among the 83 articles analysed, 37 used images as their data type. Surprisingly, in practice, disaster responders rely significantly on textual data sources, such as emails and field reports [39]. This finding indicates that image-based approaches have been mostly pursued in academic contexts. We assume multiple reasons contribute to the popularity of image data for DR tasks: first, the power of visuals in conveying messages over textual content; second, the availability of pre-trained networks and the use of transfer learning techniques for image feature extraction; and third, the easy accessibility of image datasets through web search and web databases. Disaster damage assessment is the most popular DR task among the studies that used image datasets.

Fig. 6
figure 6

Data types used for DR task

Text data were used by 22 of the 83 articles and are more prominent in Disaster-related information filtering and Disaster-related information classification tasks. Currently available annotated disaster-related text data repositories (particularly those using social media data) provide a clear guide for specific target problems. As a result, many researchers have used text data for supervised learning approaches in information filtering and classification applications.

There has been little interest in using video datasets for DR tasks. Only 6 articles discussed the usage of video datasets, for Disaster-related information filtering, classification, and Disaster event detection tasks. Possible reasons include the difficulties of storing and transferring video data and the need for special computing facilities, such as Graphics Processing Units (GPUs), to analyse them.

We observed a significant interest in using multimodal data to extract information for DR tasks between 2018 and 2020. Multimodal data have been used for Disaster-related information filtering, Disaster-related information classification, Disaster damage assessment and Disaster event detection, accounting for 18 of the analysed papers. We assume the popularity of multimodal DL networks stems from three factors. First, the combination of multiple modalities provides more complementary information than learning from a single data modality. Second, multimodal learning helps to integrate data from different sources and provides access to large quantities of data. Third, more recent multimodal DL networks show improved results over unimodal analysis.

\(\mathrm{RQ}_{2.2}\) What Sources have been Used to Extract Data, and How have Data been Extracted?

In this RQ, we analyse the sources (including accessible disaster data repositories) used to extract data used in DL models.

Fig. 7
figure 7

Sources used to extract data types

Image data have mainly been extracted using remote sensing from sources such as satellites, aerial vehicles and LiDAR. Apart from that, Twitter and the Web have been used by 7 and 6 articles, respectively, to extract image datasets (we grouped research that extracted data from websites and Google search under Web). Twitter has been the prominent source of text information, used in 19 of the 83 articles analysed (i.e., 19 of the 22 articles that used text data). The growing number of human-annotated disaster-related Twitter data repositories is likely to have increased the amount of research using them with DL approaches. We observed that 5 articles used a combination of multiple sources to extract data, such as Twitter, web mining, Baidu, Flickr, Instagram, and Facebook. Most notably, Facebook was rarely used (1/83) as a source due to its data extraction limitations (e.g., the requirement of prior approval from Facebook to use the public feed Application Programming Interface (API) [104]). Figure 7 shows the sources used to extract different modalities of data.

Researchers have employed multiple techniques to extract data from different sources. Twitter data have been extracted through the Twitter Streaming API using general or specific keywords (e.g., earthquake, Nepal Earthquake), and a spatial bounding box covering the impacted area is often used while extracting tweets. However, it is notable that a total of 28 articles downloaded data from annotated Twitter repositories produced by previous research, such as CrisisNLPFootnote 6 and CrisisLex,Footnote 7 indicating the importance of annotated data repositories catering for DR problems. Web mining and web databases were used in 22 articles to download data. Workshops and conferences, for example, MediaEval,Footnote 8 have provided researchers with annotated datasets and meta-data for target problems. Table 3 summarizes the different data collection methods.

Table 3 Disaster data collection methods
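As an illustration of the keyword- and location-based collection described above, the following is a minimal sketch using the tweepy library's v3-style streaming interface. The credentials, keywords, and bounding box are placeholders, and both tweepy and the Twitter/X API have since changed, so this should be treated as illustrative only.

```python
import tweepy

# Placeholder credentials obtained from the Twitter developer portal.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

class DisasterStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        # Store the tweet id and text for later preprocessing and annotation.
        print(status.id_str, status.text)

stream = tweepy.Stream(auth=auth, listener=DisasterStreamListener())

# Filter the stream by disaster-related keywords and a spatial bounding box
# (south-west corner then north-east corner) covering the impacted area.
stream.filter(track=["earthquake", "Nepal Earthquake"],
              locations=[80.0, 26.3, 88.2, 30.4])
```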

\(\mathrm{RQ}_{2.3}\) How have data been Preprocessed Before Applying the DL Models?

To address \(\mathrm{RQ}_{2.3}\), we derive a taxonomy of preprocessing steps that researchers have used to clean raw data for use in DL algorithms. Cleaning and transforming data to be used effectively by DL models are critical steps towards improved performance. However, 19 articles out of 83 analysed did not explicitly mention the preprocessing steps that were undertaken.

We observe three common preprocessing steps across the articles analyzed: filtering, annotation, and dataset splitting. Data filtering helps reduce noise in raw data. Annotation deals with labelling the data according to the target function. A total of 10 of the 83 articles employed external annotators or hired them through annotation service providers such as Figure EightFootnote 9 (formerly known as CrowdFlower). The annotated datasets are generally split into training, test, and validation sets during the preprocessing steps. The training set is used to train the DL model, the test set provides unseen data on which the trained model is evaluated, and the validation set is used to tune the hyperparameters of the DL model.
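The following is a minimal sketch of the train/validation/test split described above, using scikit-learn; the 70/15/15 proportions and the stratified split are illustrative assumptions rather than a prescription from the analysed studies.

```python
from sklearn.model_selection import train_test_split

def split_dataset(samples, labels, seed=42):
    # Hold out 30% of the annotated data, then split that half-and-half
    # into validation and test sets, preserving the class distribution.
    x_train, x_rest, y_train, y_rest = train_test_split(
        samples, labels, test_size=0.30, random_state=seed, stratify=labels)
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=0.50, random_state=seed, stratify=y_rest)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```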

Our analysis identified that the design of the preprocessing steps largely depends upon the modality of data. For example, text data preprocessing steps included tokenizing, lowercasing, stemming, lemmatization and removal of stop words, tokens having less than 3 characters, sentences having less than 3 words, user mentions, punctuation, extra spaces, line breaks, emojis, emoticons, special characters, symbols, hashtags, numbers, and duplicates. Text normalization using the Out of Vocabulary (OOV) dictionary is used to replace slang, mistakenly added words, abbreviations, and misspellings. Image data preparation steps included data filtering, duplicate removal, patch generation, resizing, pixel value normalization, and image augmentations. Video data preprocessing included clipping to extract keyframes, shot boundary detection and removal of duplicates and blurred and noisy frames. Table 4 illustrates the preprocessing steps involved in preparing raw data for DL algorithms, as found in the analyzed articles.

Table 4 Data preprocessing steps
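As an example of the text preprocessing steps listed above (lowercasing, tokenizing, and removal of URLs, user mentions, hashtags, punctuation, numbers, stop words, and very short tokens), the following is a minimal sketch using NLTK; the exact set of rules and the sample tweet are illustrative only.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def preprocess_tweet(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)       # remove user mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)      # remove punctuation, numbers, emojis
    tokens = word_tokenize(text)
    # Drop stop words and tokens shorter than three characters.
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= 3]

print(preprocess_tweet("Major #earthquake near Kathmandu!! @user http://t.co/xyz 7.8 magnitude"))
```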

\(\mathbf {RQ_{3}}\): What DL Models are Used to Support DR Tasks?

In this section, we analyze the types of DL architectures used for DR tasks and learning algorithms. Our aim is to identify the relationship between DR tasks and the DL architectures. We provide a short overview of different deep learning architectures in our online appendix [12].

\(\mathrm{RQ}_{3.1}\) What Types of DL Architectures are Used?

Through this question, we analyze the types of DL architectures used to extract features for DR tasks. Across the studies we analyzed, we observed that six main DL architectures had been used, namely Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs) and their variant Bi-directional LSTMs (Bi-LSTMs), Domain Adversarial Neural Networks (DANNs), and AutoEncoders (AEs). Moreover, popular language models such as Bidirectional Encoder Representations from Transformers (BERT) and the Robustly Optimized BERT Pre-training Approach (RoBERTa) have been used for Natural Language Processing (NLP) tasks.

Fig. 8
figure 8

DL architectures used by DR tasks except for CNN as a single architecture

Fig. 9
figure 9

Usage of CNN by DR tasks

Figure 8 shows the usage of DL algorithms according to DR task, excluding CNNs. We demonstrate the application of CNNs to DR tasks in a separate diagram (see Fig. 9), and we present the usage of DL architectures by publication year in Fig. 10. There has been significant and growing interest in using CNNs over the years across all DR tasks, with 71 of the 83 articles analyzed employing them. We consider it likely that CNNs have been adopted largely due to their capability for learning features automatically, parameter sharing and dimensionality reduction [114]. However, CNNs have performed poorly at identifying word order in a sentence for text classification tasks [73]. Moreover, the computational cost (e.g., training time) of CNNs can be considerable, particularly when the training dataset is large.

Fig. 10
figure 10

DL architectures used by DR tasks by year

Fig. 11
figure 11

Pre-trained DL networks used by DR tasks

RNNs, LSTMs, and Bi-LSTMs have been used to analyze variable-length sequence data such as sentences (e.g., tweet text). Although RNNs have been successful in many sequence prediction tasks, they have issues in learning long-term dependencies due to the vanishing gradient problem, which arises from propagating gradients through the recurrent network over many steps [73]. LSTM networks have been proposed to overcome these drawbacks and have shown better results for multiple text classification tasks [99]. Recent studies have demonstrated further improvements using Bi-LSTMs. One of the major advantages of Bi-LSTMs is that they can capture long-range dependencies of variable length by analyzing a sequence in both directions (e.g., past and future entries) [43, 52].
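For illustration, the following is a minimal sketch of a Bi-LSTM text classifier of the kind described above, written with the Keras API; the vocabulary size, layer dimensions, and the four output classes are illustrative assumptions rather than a configuration taken from the analysed studies.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=20000, output_dim=128),  # word embeddings
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),     # read the sequence in both directions
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),               # e.g., four informativeness classes
])
model.summary()
```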

We observe that many studies adopt DL models pre-trained on larger data sets, such as Places365Footnote 10 and ImageNet.Footnote 11 Fifty-one of the analyzed papers used pre-trained DL networks for word embeddings, visual feature extraction, object detection and classification. The advantage of adopting a pre-trained model is that it saves time and resources relative to training a model from scratch. Figure 11 provides a taxonomy of pre-trained networks adopted by our analyzed studies.
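A common way of adopting such pre-trained networks is to use them as frozen feature extractors and train only a small task-specific head. The following is a minimal sketch of this transfer learning pattern with an ImageNet-pre-trained VGG16 in Keras; the two-class damage head and input size are illustrative assumptions.

```python
import tensorflow as tf

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained convolutional layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),  # e.g., damaged vs. undamaged
])
```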

In addition, we observed that 17 studies adopted multiple DL architectures. This is very common in research that uses different modalities of data. For example, CNNs are often used to extract image features, while RNNs, LSTMs or Bi-LSTMs are used for text feature extraction.

\(\mathrm{RQ}_{3.2}\) What Training Processes are Used to Optimize DL Models?

In this RQ, we analyze the processes used to train DL algorithms focusing on optimization and error calculation.

All but four of the 83 articles used supervised learning as the training type for the selected DR problem. In supervised learning, the DL algorithm extracts features to associate data with the required classification labels; therefore, a labelled training dataset is required. In contrast, unsupervised learning assigns a class label by grouping similar data together based on extracted features and so does not require labelled training data, while semi-supervised approaches use partially labelled datasets. However, unsupervised and semi-supervised approaches were rarely used, appearing in only 4 of the 83 analyzed articles. The current preference for supervised learning approaches is mostly due to readily available labelled datasets. However, such datasets become outdated and fail to reflect temporal variations; therefore, further improvements to DL architectures are required so that they can make approximations without large labelled training sets.

The classical gradient descent algorithm was the most frequently adopted learning algorithm in the articles we analyzed for updating weights during backpropagation. Although researchers widely use gradient descent, its computational complexity is considerable because the entire dataset is considered every time the parameters are updated [98]. Multiple other algorithms, such as Adaptive Moment Estimation (Adam), Adadelta, and RMSProp, have been proposed to overcome this issue; these newer techniques were used for optimization in 45 articles. The selection of the optimization algorithm significantly affects the results of the model. However, only 31% of the analyzed articles explicitly mentioned the optimization process and the algorithms they used.

Our analysis found that multiple algorithms have been adopted to calculate the error rate. Categorical cross-entropy is the most frequently used loss function, while negative log-likelihood was adopted by one article. The loss function quantifies the prediction error that the optimizer minimizes to tune the weights in the deep neural network layers. However, only 22 of the papers discussed the error function.
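As an illustration of this optimization setup, the following is a minimal sketch of compiling a small classifier with the Adam optimizer and a categorical cross-entropy loss in Keras; the toy architecture, input dimension, and learning rate are illustrative assumptions only.

```python
import tensorflow as tf

# Hypothetical small classifier; any of the architectures discussed above could be used instead.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(300,)),
    tf.keras.layers.Dense(4, activation="softmax"),
])

# Adam optimizer with a categorical cross-entropy loss, the combination
# most frequently reported in the analysed articles.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```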

\(\mathrm{RQ}_{3.3}\) What Methods are Used to Avoid Overfitting and Underfitting

Fig. 12
figure 12

Methods used to avoid overfitting and underfitting by DR tasks

Two common problems associated with generalizing a trained DL model are known as “overfitting” and “underfitting”. Overfitting happens when the model learns training data extremely well but is not able to perform well on unseen data [42]. In contrast, an underfitted model fails to learn training data well and hence performs poorly on new unseen data. This happens due to the lack of capacity of the model or not having sufficient training iterations [49]. In both these cases, the model is not generalized well for the target problem.

To combat overfitting and underfitting, we observed that researchers used multiple techniques, such as Dropout, Batch normalization, Early stopping, Pooling layers, Cross-validation, Undersampling, Pre-trained weights and Data augmentation. Figure 12 illustrates these methods by DR task. A total of 24 articles used Dropout layers and 12 articles used Early stopping to avoid overfitting. Dropout layers randomly ignore nodes in a hidden layer when training the neural network, which prevents the neurons in that layer from co-adapting their weights [116]. The batch normalization technique was proposed to achieve higher accuracy with fewer training steps, eliminating the need for Dropout [48]. During model training, the Early stopping technique evaluates the performance of the model on the validation dataset and stops the training process when the accuracy starts decreasing. As a result, however, this technique prevents the use of all available training data. Rice et al. [107] provide remedies for overfitting based on a series of experimental evaluations.
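The following is a minimal sketch of three of these regularization techniques (dropout, batch normalization, and early stopping) in Keras; the layer sizes, dropout rate, and patience value are illustrative assumptions, and the fit call is commented out because the training arrays are placeholders.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(300,)),
    tf.keras.layers.BatchNormalization(),  # normalize activations between layers
    tf.keras.layers.Dropout(0.5),          # randomly disable 50% of units during training
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Stop training once the validation loss has not improved for 5 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])
```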

Addressing underfitting while training DL models is a complex task, and there are no well-defined techniques for it [125]. We observed that 2 articles used pre-trained weights to avoid underfitting. However, 29 of the analyzed articles did not discuss the methods used for combating overfitting or underfitting.

\({\mathrm{RQ}_{4}}\): How well do DL Approaches Perform in Supporting Various DR tasks?

In this RQ, we analyze the effectiveness of DL approaches for DR tasks, including reviewing the evaluation metrics and baseline models and comparing the results achieved.

\(\mathrm{RQ}_{4.1}\) What Evaluation Metrics are Used to Evaluate the Performance of DL Models?

Through this question, we explore the different performance metrics adopted by the studies we analysed. Our aim is to identify how the existing research evaluated their results. Evaluation of the performance of a model is a core activity when employing DL algorithms, as it helps to improve the model constructively. We observed that 76 of the 83 articles had adopted standard performance evaluation metrics, such as precision, recall, accuracy, and F1-score (see the definitions of these metrics in Eqs. 4–9). These measures are based on the “true positive”, “false positive”, “true negative”, and “false negative” values, which evaluate the correctness of the results.

figure g
$$\begin{aligned} \text {Precision/Positive Predictive Value (PPV)}=\frac{T_\mathrm{p}}{T_\mathrm{p} + F_\mathrm{p}} \times 100\% \end{aligned}$$
(4)
$$\begin{aligned} \text {Recall/Sensitivity}=\frac{T_\mathrm{p}}{T_\mathrm{p} + F_\mathrm{n}} \times 100\% \end{aligned}$$
(5)
$$\begin{aligned} \text {Accuracy}=\frac{T_\mathrm{p} + T_\mathrm{n}}{T_\mathrm{p} + T_\mathrm{n} + F_\mathrm{p} + F_\mathrm{n}} \times 100\% \end{aligned}$$
(6)
$$\begin{aligned} F\text {1-Score}=2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}} \times 100\% \end{aligned}$$
(7)
$$\begin{aligned} \text {Specificity/True Negative Rate (TNR)}=\frac{T_\mathrm{n}}{T_\mathrm{n} + F_\mathrm{p}} \times 100\% \end{aligned}$$
(8)
$$\begin{aligned} \text {Negative Predictive Value (NPV)}= \frac{T_\mathrm{n}}{T_\mathrm{n} + F_\mathrm{n}} \times 100\% \end{aligned}$$
(9)
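To relate Eqs. 4–9 to code, the following is a minimal sketch that computes the main metrics from true/false positive and negative counts and prints the scikit-learn equivalents for comparison; the label vectors are illustrative only.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)                          # Eq. 4
recall = tp / (tp + fn)                             # Eq. 5
accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. 6
f1 = 2 * precision * recall / (precision + recall)  # Eq. 7

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(accuracy, accuracy_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```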

We also observed that the Area Under the Receiver Operating Characteristic (ROC) curve has been used by 6 articles. The ROC curve plots sensitivity against (1-specificity). Sixty-four of the analysed articles presented their performance using more than one metric, while the remaining 19 used one metric only. Other metrics used by the analysed articles include Average Precision (AP) and Intersection over Union (IoU). Our analysis suggests that researchers primarily selected performance metrics based on the baseline work that they selected as a comparison for their results. Therefore, it is essential to use standard metrics so that other researchers can compare and contrast results in future studies. Table 5 shows the best accuracy scores obtained for level-1 and level-2 DR tasks in our taxonomy, revealing that across most tasks DL performs very well, with slightly lower success rates for sub-tasks such as Damage evaluation and Spatial information filtering.

Table 5 Best accuracy scores for DR tasks

\(\mathrm{RQ}_{4.2}\) What “baseline” Models have been Compared?

This question explores the benchmarks that have been chosen by the analysed articles. We observed that the vast majority of the analysed articles generated their own benchmarks. Specifically, 35 of the studies evaluated the performance of their proposed approach against self-generated tests, while 25 evaluated DL approaches against classical ML approaches. We consider it likely that this is because, until recently, there have not been many DL-based approaches with which to compare. Moreover, the majority of the studies have not published their adopted models or code for future researchers to easily implement and evaluate. Only 12 of the articles selected DL methods proposed by previous research as baselines. We also see that some baselines have been used for comparison in multiple articles, as described in our online appendix [12].

\({\mathrm{RQ}_{5}}\): What are the Underlying Challenges and Replicability of DL for DR Studies?

In \({\mathrm{RQ}_{5}}\), we analyse the challenges researchers face in employing DL algorithms for DR studies and how well the current work can be adopted in future research. We aim to identify common challenges and provide future researchers with knowledge to better design future DL-based projects. Furthermore, we provide the details of research available for replication and reproduction in future research.

We observed that the challenges mostly depend on the data types and sources, including the following, which were extracted from 61 research articles:

  1. 1.

    Data annotation: Early studies using supervised approaches found very few publicly available annotated datasets. Therefore, they downloaded their own datasets and recruited people to annotate them. This took a massive amount of time and resources and delayed experiments. Furthermore, multi-label problems (one data item can belong to one or more informative categories), task subjectivity (difficulty in agreeing on one informative class), and conflicting annotations by human annotators were major issues. Even though many annotated datasets have become available recently, data incompleteness and bias remain common problems in processing DR data.

  2. 2.

    High-level of noise: Due to the high volume of heterogeneous data collected from social media platforms in the wake of disasters, the level of noise in the resulting data sets is extremely high (for example, spam, bots, data duplication). Furthermore, the content is informal, mostly using colloquial language, and very brief with casual acronyms and sometimes with non-literal language devices, like sarcasm, metaphors, and double entendre. Thus, it is challenging to train a DL model that can correctly interpret the intention of human expressions of this kind.

  3. 3.

    High variability: High variability in image quality resulting from different sensors and environmental conditions (for example, mist, cloud cover, and poor illumination) is challenging when applying DL models. Moreover, debris and damaged buildings look completely different depending on the disaster and structure of the building (e.g. concrete buildings, masonry buildings, or buildings made from natural materials), and are characterised by different features and patterns when captured in an image. As a result, the replicability of an already implemented solution for such a task is very low.

  4. 4.

    Semantic segmentation: Semantic segmentation of images to differentiate ground objects, such as roads and trees, from intact and damaged buildings, is a major challenge while using satellite, airborne and UAV imagery.

Despite these challenges, we observed that a very limited number of studies had made their datasets, annotations, and implementation code available for future research. For example, only 5 of the analysed articles made their resources publicly available. This trend results in researchers generating their own baselines, which reduces research quality and slows the evolution of the field. Therefore, there is a considerable barrier for researchers wishing to adopt previous research as baselines.

Opportunities, Directions and Future Research Challenges

With rapid climate change and human-induced global warming, the variety and frequency of disasters have increased at an unprecedented rate [28]. As a result, managing disasters while reducing their impacts on communities and the environment will be one of the main problems of the next decade. The increasing number of smart mobile devices and their embedded sensors enables the generation of massive amounts of heterogeneous data during disasters within a significantly shorter time than seen previously [1, 45]. Therefore, there is an immediate need for robust methods to automatically analyze and fuse such multimodal datasets and provide consolidated information to assist disaster management.

Data from different sources and formats bring complementary information regarding an event and lead to more robust inferences. Thus, future DL models will require the analysis of heterogeneous, incomplete, and high-dimensional datasets to fill the missing information gaps in each data source or modality [98]. Multiple studies have explored the use of multimodal data for understanding the big picture of a disaster event [1, 3, 92, 99, 123]. However, increasingly advanced DL approaches are required to solve core challenges in multimodal deep learning, such as missing data, differing noise levels, and the effective fusion of heterogeneous data [17].

To address this problem, we identify that training data acquisition and preprocessing play a major role when employing DL approaches. For example, large-scale human-annotated datasets are required to train DL algorithms to successfully predict the class label for unseen data. While a few annotated data repositories have been created (e.g., CrisisNLP, CrisisMMD, and CrisisLex), more datasets are required to reflect temporal variations. Furthermore, there are still no large-scale benchmark datasets incorporating a variety of disaster data types except for CrisisMMD [10]. Therefore, current research is mostly limited to small-scale home-grown datasets covering specific disaster types.

This leads to the next challenge: data irregularities that occur in datasets and reduce a classifier’s ability to learn from the data. The most common data irregularities include class imbalance, missing features, absent features, class skew and small disjuncts [29]. Class imbalance occurs when the classes present in a dataset do not have equal numbers of training instances. For example, datasets for classifying disaster-related social media posts typically contain far more non-related posts than related ones. Data-level methods, such as under-sampling techniques (e.g., Random Under-Sampling (RUS) [50]) and over-sampling techniques (e.g., Generative Adversarial Minority Oversampling (GAMO) [84] and Major-to-minor Translation (M2m) [54]), have been explored to mitigate the effects of class imbalance. Although researchers often assume fully observed instances, practical datasets frequently contain missing features. Data imputation methods, model-based methods and, more recently, DL methods have been proposed to handle missing features. A complete guide to methods for tackling these data irregularities is provided by Das et al. [29]. Even though methods to handle irregularities have been explored extensively, more research is required as the velocity and variability of data generation accelerate.
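As an illustration, the following is a minimal sketch of two simple class-imbalance remedies mentioned above: random under-sampling of the majority class and class-weighted training. The label distribution and the use of scikit-learn's balanced weighting are illustrative assumptions.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# e.g., 900 non-related posts (class 0) vs. 100 disaster-related posts (class 1).
y = np.array([0] * 900 + [1] * 100)

# Option 1: random under-sampling of the majority class down to the minority size.
rng = np.random.default_rng(42)
majority_idx = np.where(y == 0)[0]
minority_idx = np.where(y == 1)[0]
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
balanced_idx = np.concatenate([kept_majority, minority_idx])

# Option 2: class weights passed to the loss during training
# (e.g., via the class_weight argument of Keras model.fit).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # the minority class receives a larger weight
```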

Another key issue is the varied characteristics of disasters, which limit the reusability and generalizability of already-trained DL algorithms: the input data representations extracted during different disasters can vary substantially. Recent DL studies have focused on domain adaptation during learning, where the distribution of the training data differs from the distribution of the test data [47]. Future research should focus on developing domain adaptation techniques for the DR domain.

According to current trends, people will increasingly use social media platforms for disaster data acquisition and dissemination, challenging traditional media sources [40, 111, 118]. Therefore, crowd-sourced data will become more prominent in providing first-hand experiences of disaster scenes. However, responding organizations have concerns regarding the trustworthiness of user-generated content, a problem which remains largely unsolved [23]. For example, fake news, misinformation, rumours, digital manipulation of images (e.g., deepfakes [126]) and the re-posting of content from previous events are a few of the challenges that future researchers will face in improving the integrity of social media content.

Another challenge in the DR domain is that previous research has largely explored the most common tasks, such as Disaster damage assessment, Disaster event detection and Location reference identification. However, there are other important DR tasks, including evacuation management, health and safety assurance, and critical infrastructure services, as outlined in the Guidance of Emergency Response and Recovery [32]. These tasks have not yet been analyzed using DL approaches. Some possible reasons could be insufficient training datasets; a lack of computational resources to store, manage, and process data; and inadequate accuracy of existing DL architectures. These underrepresented topics need further attention from DL researchers to better support DR tasks. Moreover, the accuracy of the output produced by DL algorithms is determined by a number of factors, including the optimization algorithm and the loss function used. Thus, further research is important in this area to find the correct combination of data, DL architecture, optimization algorithm, and loss function.

Results of the Association Rule Mining

Table 6 Some association rules extracted from the analysed papers

This section discusses the interesting relationships discovered through our association rule mining task. We introduced the association rule mining process in Section “Association Rule Mining”. Our goal is to identify hidden relationships between the values extracted from the articles for the attributes in the extraction form. The highest-scoring rules are listed in Table 6. We discuss the patterns that resulted in higher “Support”, “Confidence” and “Lift” values. However, all the associations are illustrated in our online appendix [12]. Our analysis highlights that CNN, Supervised, Image and Twitter have higher support values (\(>0.45\)). This result indicates that the majority of studies recorded Image as their data type, CNN as their DL architecture, Supervised as their learning type and Twitter as their data source.

Disaster Damage Assessment \(\rightarrow\) Remote Sensing; Remote Sensing \(\rightarrow\) Image; Multimodal, CrisisMMD \(\rightarrow\) Twitter and Remote Sensing, CNN\(\rightarrow\) Image are some of the association rules having a confidence score of 1.0. This means that, for example, the rule Disaster Damage Assessment \(\rightarrow\) Remote Sensing holds in 100% of the relevant articles: every analysed article that addressed Disaster Damage Assessment used Remote Sensing. Similarly, all the research that used Remote Sensing as the data extraction method analysed Image as its data source.

The highest lift score of 4.5 resulted for the multimodal, Twitter\(\rightarrow\) CrisisMMD rule. This means that when researchers used multimodal as their data type and Twitter as their data source, CrisisMMD was commonly the data extraction method. Furthermore, multimodal\(\rightarrow\)CrisisMMD, Twitter; Remote Sensing \(\rightarrow\) Disaster Damage Assessment, Image; and Image\(\rightarrow\)Remote Sensing, CNN were among the other rules with high lift values. Interestingly, we observed rules such as Twitter\(\rightarrow\)CNN; CNN \(\rightarrow\) text and text, Twitter \(\rightarrow\) CNN having a “Lift” score of less than 1. This indicates a negative relationship between the parameter values; for example, it is very unlikely that research that used Text as the data type also used CNN as the DL architecture. All these association rules provide future researchers with a guide for selecting parameters in a DL-based project, such as data sources, learning algorithms, and learning types.

Flowchart and Guidelines for Applying DL in Future DR Research

Fig. 13
figure 13

Flowchart for conducting DL for DR research

In this section, we provide a flowchart and guidelines for conducting future work using DL for DR tasks based on the findings of our SLR. Figure 13 shows how we have mapped the components of learning into RQs and then as the steps in the flowchart. The extracted flowchart is a general one based on the 83 analyzed papers. However, more specific details can be added to it based on the DR task to be solved.

After identifying the DR problem to be addressed, researchers should consider whether DL is a suitable approach. That decision can be made partly based on whether it is possible to obtain or create the necessary data. If enough data can be obtained, the researcher can select supervised, unsupervised or semi-supervised learning methods. We discussed these methods in Section “RQ3.2 What Training Processes are Used to Optimize DL Models?”. If the identified problem can be better solved using a supervised approach, the next step is to decide where annotated datasets can be obtained, or whether raw data must be annotated. Data annotation is generally labour-intensive and time-consuming, and therefore the researcher can hire paid workers or arrange volunteers based on budget and availability. We have discussed the annotated data sources and annotation methods in Sections “RQ2.2 What Sources have been Used to Extract Data, and How Have Data Been Extracted?” and “RQ2.3 How have data been Preprocessed Before Applying the DL Models?”. Once the dataset is ready, the researcher should conduct an exploratory analysis to identify the nature of the raw data. This analysis provides the researcher with an overview of the size, distribution, and characteristics of the data. A proper understanding of the raw data guides the design of the preprocessing steps, which have to be well reported to enable replication. This includes outlining all the steps involved, including the normalization processes and data augmentation strategies.

After the data filtering and cleaning steps, the researcher should identify the learning algorithm and DL architecture. The researcher should report the details of the DL architecture, including the types of layers (e.g., embedding, dropout and soft-max), the number of layers, filters, and the learning rate. Furthermore, all necessary details regarding optimizers, the loss function and hyper-parameter tuning have to be reported to enable replication. Information regarding training, such as the number of iterations (epochs), strategies for combating overfitting and underfitting, training time, computing environment, special computing resources (e.g., GPUs, high-performance computing) and platforms used (e.g., Google Colaboratory), should also be explained (see Section “RQ3: What DL Models are Used to Support DR Tasks?”).

Finally, the researcher should report the results compared to the selected “baseline model”. If the researchers used their own dataset, they must first implement the baseline against their data to compare the results. Any limitations and challenges encountered while applying DL models should also be discussed to provide guidance for future researchers in designing DL-based approaches for DR tasks. Furthermore, researchers can support the quality and the future of the DR research field by making publicly available the datasets, annotations, and DL architectures.

Conclusion

This study has presented a systematic literature review of DL in DR research. We started by identifying RQs for the analysis according to the components of learning described by Abu-Mostafa [121]. Then, a data extraction form with 15 attributes was created to extract answers to the questions from the selected articles. Finally, we used the KDD process to identify relationships among different attributes of the extracted data. The answers to the research questions indicate that, while some DR tasks have received much investigation, others have received less attention. Furthermore, there are multiple challenges in collecting, annotating, and preprocessing datasets for DL tasks. Despite these challenges, however, researchers have achieved better performance with DL methods than with traditional methods for DR tasks.

This research has identified opportunities, future research challenges, and many directions for further investigation. For example, multiple DR tasks are yet to be studied using DL approaches, such as evacuation management and critical infrastructure services. Moreover, we highlighted the need for new annotated multimodal datasets targeted at DR concerns. Some of the future research challenges are handling data irregularities, improving the integrity of social media data, and developing generalizable DL approaches across multiple disasters. Additionally, data preprocessing, DL architecture selection, word embeddings and hyperparameter tuning are areas of further exploration. Finally, we emphasized the importance of comprehensive reporting and making implemented DL methodologies publicly available for the advancement of the DL in the DR area.