

Self-tuning Database Systems: A Systematic Literature Review of Automatic Database Schema Design and Tuning

Published: 29 June 2024

Abstract

Self-tuning is a feature of autonomic databases that includes the problem of automatic schema design. It aims at providing an optimized schema that increases the overall database performance. While in relational databases automatic schema design focuses on the automated design of the physical schema, in NoSQL databases all levels of representation are considered: conceptual, logical, and physical. This is mainly because the latter are mostly schema-less and lack a standard schema design procedure as is the case for SQL databases. In this work, we carry out a systematic literature survey on automatic schema design in both SQL and NoSQL databases. We identify the levels of representation and the methods that are used for the schema design problem, and we present a novel taxonomy to classify and compare different schema design solutions. Our comprehensive analysis demonstrates that, despite substantial progress that has been made, schema design is still a developing field and considerable challenges need to be addressed, notably for NoSQL databases. We highlight the most important findings from the results of our analysis and identify areas for future research work.

1 Introduction

The maintenance of database systems has become costly and time-consuming due to their growing complexity. In fact, a major part of today’s database expenses relates to database administration. For this reason, we have recently seen an increasing number of studies on self-adaptive features in database management systems (DBMSs), which led to the design and development of so-called self-managing or autonomic DBMSs (ADBMSs). These systems have the capability to manage and maintain themselves with little or no human intervention. Autonomic database management systems can be classified into the following four self-CHOP categories [87]: self-configuring, self-healing, self-optimizing, and self-protecting. Some works [77] consider additional categories, such as self-organizing and self-inspecting.
Self-configuring (or self-tuning) is the ability of a database system to configure itself in order to achieve its performance goals [94]. There are hundreds of configuration parameters, which can be categorized into knob configuration, resource consumption thresholds, and data structure design in the database schema [77]. The field of self-tuning databases is thus large. Due to lack of space, in this study we focus only on database schema design and tuning; other areas of schema management, such as schema evolution and data migration, are not covered.
From a schema design viewpoint, SQL databases are designed to manage well-structured data, requiring users to design a schema before storing data. This is a schema-on-write approach, in which data is written into a fixed and predefined schema. There are also efforts to extend SQL databases to manage flexible schemas [154]. In contrast, NoSQL schema design is based on a schema-on-read approach, offering the flexibility to handle unstructured or dynamic data without a fixed schema. While SQL databases can efficiently handle structured data, they are not as flexible as NoSQL databases in dealing with unstructured or semi-structured data due to their fixed schema requirements.
Generally, three levels of representation are distinguished in databases: conceptual, logical, and physical data models. The automatic schema design in relational databases focuses on automated physical schema design [47]. This is mainly because there are well-known principles and procedures for translating a conceptual data model to a normalized logical data model, i.e., a set of tables. The physical schema describes the implementation of the logical model on secondary storage. In order to tune the physical design for a given workload and achieve a more efficient data access, tables are supplemented with additional structures such as indexes and materialized views. As pointed out in [47], the problem of physical database design can be defined as a search problem of finding a good configuration of physical structures (i.e., indexes, materialized views), which minimizes the execution cost of a given workload.
In contrast, NoSQL stores adopt more flexible data models that are mostly schemaless. A schemaless data model allows data to have arbitrary structures as they are not explicitly defined using a data definition language (schema-on-write), but are instead implicitly encoded in the application logic (schema-on-read) [69]. This implies a high level of flexibility in data modeling, allowing the data to be structured with greater freedom than in relational databases. Due to the schemaless nature of NoSQL stores, automatic schema design is not limited to the physical level; instead, all levels of representation (conceptual, logical, or physical) are considered. There is no fully standardized schema design (or data modeling) procedure for NoSQL databases, and manual schema design often relies on rules of thumb for choosing a good schema. For instance, eBay [183] and Netflix [133] have shared examples and general guidelines for the schema design of NoSQL wide column stores. The problem of NoSQL schema design is to transform conceptual, logical, and physical models into each other. At the conceptual level, the traditional models used for SQL databases are adopted, such as UML, ER, and EER. At the logical level, in contrast, new schemas are defined based on models such as documents and graphs, thereby bringing the logical model closer to specific NoSQL databases. Finally, at the physical level, the specific schema and features of a concrete NoSQL database, such as MongoDB, Neo4j, or Cassandra, are used. At this point, some questions arise: At which level of representation (conceptual, logical, or physical) does the schema design process occur? Which schema structures or models are designed at each level of representation? What are the methods used to model NoSQL databases?
The goal of this study is to conduct a systematic literature review on existing approaches for automated database schema design, considering academic works in both SQL and NoSQL databases. We analyze 186 primary studies that were published between 1970 and 2021, and we provide a comprehensive taxonomy to compare the numerous approaches along relevant facets.
Studies on physical schema design in relational databases have been conducted since the early 1970s. While the initial focus was on the index selection problem, extensive work on selecting optimized sets of indexes, materialized views, and other physical structures followed. Nowadays, all major commercial database vendors ship automated physical design tools, e.g., the SQL Access Advisor in Oracle [64, 138], the Tuning Advisor in Microsoft SQL Server [6], and the DB2 Design Advisor in IBM’s DB2 [236, 237]. In contrast, there have been fewer studies on automatic database schema design for NoSQL databases, which appeared only after 2010 when NoSQL databases became popular.
The remainder of this article is organized as follows: Section 2 summarizes previous surveys on database schema design, followed by a brief description of the systematic reviewing approach adopted in this article in Section 3. Section 4 gives an overview of the selected primary studies and the proposed taxonomy. A thorough overview of the research works in automated schema design in SQL and NoSQL databases is presented in Sections 5 and 6, respectively, whereas Section 7 analyzes previous works from the viewpoint of workload type and tuning approach. A discussion of the review results is presented in Section 8. In Section 9, the limitations and threats to validity are outlined. Finally, Section 10 concludes the article.

2 Related Surveys

In the literature, various surveys or assessments of automatic database schema design have been conducted, which focus either on SQL or NoSQL databases. However, to the best of our knowledge, no comprehensive survey on automatic database schema design covering both SQL and NoSQL databases exists yet. This work is the first survey to address this topic, providing a thorough analysis of the state-of-the-art as well as precise and well-supported suggestions for future research.
For SQL databases, the works in [34, 152] focus on the principles and advances of physical database design, and they examine physical design methodologies and automatic design tools used in commercial RDBMSs. The studies in [43, 77, 94, 160] analyze autonomic (or self-managing) features in popular relational database products, namely DB2, Oracle, and Microsoft SQL Server. While Chaudhuri and Narasayya [47] review prominent solutions in the area of automated physical database design with a particular focus on work done in the AutoAdmin project, there exist several publications that provide further details on self-tuning databases [49, 200, 223]. Weikum et al. [223] survey and assess general approaches of automatic tuning, putting a particular emphasis on the COMFORT automatic tuning project [222]. A tutorial on foundations and principles of automated tuning as well as an overview of tuning tools in commercial RDBMSs is provided in [49, 200]. The work in [134] compares eight existing algorithms for the index selection problem along different dimensions, such as solution quality, runtime, multi-column support, solution granularity, and complexity. This comparison includes an experimental evaluation of the algorithms for three analytical benchmarks, namely, the TPC-H, TPC-DS, and Join Order Benchmark.
With the growth of NoSQL databases and the need for schema design in such databases, various academic studies have been published. A survey on NoSQL design methodologies is described in [17]. It presents a classification of methodologies, first by an attempt to unify most NoSQL databases under a uniform design methodology, and then by type of NoSQL databases. Finally, an evaluation process for NoSQL design methodologies is proposed. The authors of [194] identify several key criteria in the design of new databases (including Object-Oriented, NoSQL, and NewSQL databases) and analyze existing methods accordingly. The work in [62] provides a static analysis of five representative JSON schema inference approaches to identify their strengths and weaknesses. The authors of [220] conducted a systematic review to analyze academic works and extract characteristics of data modeling in NoSQL databases. This systematic review aims at answering three research questions about the level of representation, models used, and contexts where the modeling process occurred in NoSQL databases. [216] provides a general overview of challenges and state-of-the-art in NoSQL schema management, e.g., schema design and schema evolution, and provides an outlook on research opportunities related to NoSQL schema evolution and data migration.

3 Systematic Review Process

In a systematic literature review, it is of paramount importance to consider all primary research works that are relevant for the investigated topic. We adopt the guidelines introduced by Kitchenham [129, 130], which provide the crucial steps that are required for identifying, evaluating, and interpreting all relevant research works. The process is divided into three phases: planning the review, conducting the review, and reporting the review (cf. Figure 1).
Fig. 1. The review process adopted in this article.

3.1 Phase 1: Planning the Review

The planning phase involves three steps: the first justifies the need for the systematic review, the second defines the research questions, and the third develops the review protocol.

3.1.1 Need for a Systematic Review.

Automated physical schema tuning in SQL databases is well-studied, and there is a large number of related articles (cf. Section 5). In contrast, research on automatic schema tuning in NoSQL databases started only in 2010, hence less research has been conducted so far (cf. Section 6). While there exist several surveys on automatic schema tuning for SQL and NoSQL databases, none of these studies provides an in-depth overview, classification, or comparison of existing approaches from multiple facets. To the best of our knowledge, our work is the first substantial effort to explore automatic schema tuning approaches in both SQL and NoSQL databases, as well as the first work to propose a taxonomy for evaluating and comparing different schema tuning solutions. This survey helps to investigate the gap between the motivations and capabilities of current solutions, and consequently to identify areas for future research.

3.1.2 Research Questions.

This survey concentrates on the following four research questions:
RQ1
What are the venues and publication statistics of existing studies on automated database schema design?
Answering this question allows us to identify the distribution of the publications across publishers and publication types as well as periods when the topic was particularly attractive to researchers.
RQ2
At what level of representation does the schema design occur, and what are the schema design structures or models?
The purpose of this question is to identify which level of representation (conceptual, logical, and physical) is considered in the schema design process, and which schema structures or models are designed at the respective level.
RQ3
At each level of representation, what methods are used to solve the schema design problem?
This question explores the various solutions that have been presented for addressing the schema design problem, and whether the proposed methods support fully automatic schema tuning.
RQ4
How can we define a taxonomy for schema design approaches considering significant facets?
Presenting a classification and taxonomy reveals a comprehensive understanding of significant facets of the schema design problem and helps to position existing works in the research landscape. This also facilitates the development of new approaches.

3.1.3 Systematic Review Protocol.

This is the most vital stage of the process and defines the data sources, the search strategy, and the selection criteria for the primary publications. We used electronic indexing systems, such as IEEE, Springer, Science Direct, ACM, Citeseer, Web of Science, and Scopus, to find candidates for primary studies. The search was carried out using search strings derived from the research questions and combined with Boolean operators, following the guidelines of Spanos and Angelis [212]. According to [212], the determination of search keywords is an iterative procedure: it begins with trial searches using different keywords, starting from an initial set of articles that is known to belong to the research field of the review, and ends when the trial searches retrieve this initial set of articles. In our case, we used the following primary keywords: “NoSQL databases”, “relational databases”, “schema design”, “modeling”, “schema tuning”, “schema configuration”. To search for academic works related to the structures or models of schema design, the keywords “index”, “materialized view”, “partitioning”, “clustering”, “graph model”, “document model”, “column model”, “key-value model” were used. Furthermore, an in-depth analysis was conducted to identify more representative keywords and maximize the candidate set of primary studies. Finally, these keywords were combined using Boolean ANDs and ORs, yielding our search string shown in Figure 2.
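The keyword combination described above can be sketched as follows. This is an illustrative reconstruction of the AND/OR pattern only, not the exact search string shown in Figure 2.

```python
# Illustrative sketch: combining the primary and structure keyword groups
# with Boolean operators, as described in the review protocol.
primary = ["NoSQL databases", "relational databases", "schema design",
           "modeling", "schema tuning", "schema configuration"]
structures = ["index", "materialized view", "partitioning", "clustering",
              "graph model", "document model", "column model", "key-value model"]

def or_group(terms):
    """Join a keyword group into a parenthesized OR clause."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

# Groups are ORed internally and ANDed with each other.
search_string = or_group(primary) + " AND " + or_group(structures)
print(search_string)
```

Real indexing systems (Scopus, Web of Science, etc.) each have their own query syntax, so such a string would still need adaptation per system.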
Fig. 2. Search String.
Next, the following inclusion and exclusion criteria were applied:
Inclusion Criteria (IC):
(IC1)
Academic works that develop or extend schema design and tuning approaches for databases (SQL or NoSQL) as a main part of the study.
(IC2)
Academic works published as book, book chapter, journal, conference article, thesis, or technical report.
Exclusion Criteria (EC):
(EC1)
Academic works written in languages other than English.
(EC2)
Academic works not available on the web.
(EC3)
Academic works representing a survey or systematic review.
(EC4)
Academic works published as short article.
(EC5)
Academic works focusing on other areas (i.e., not schema design and data modeling).
(EC6)
Academic works related to other areas of self-tuning databases, such as memory and other database components.
(EC7)
Academic works related to other areas of schema management, such as schema evolution and data migration.
(EC8)
Academic works outside the time period 1970 to 2021.

3.2 Phase 2: Conducting the Review

Phase 2 is divided into the following steps: identifying primary studies using the search string, selecting the primary studies based on the inclusion and exclusion criteria, assessing the quality of the primary studies, and extracting and synthesizing the data.

3.2.1 Identifying Primary Studies.

To find candidates for primary studies, we use the search string shown in Figure 2 in the electronic indexing systems IEEE, Springer, Science Direct, ACM, Citeseer, Web of Science, and Scopus. For completeness, a manual search was carried out in Google Scholar with the aim of finding additional works on automatic database schema design. Our search strategy targeted literature published as book, book chapter, journal, conference article, thesis, or technical report in the period from 1st January 1970 to 31st December 2021. As a result, we obtained 8,275 candidates for primary studies.

3.2.2 Selecting Primary Studies.

This step applies the inclusion and exclusion criteria in order to filter out academic works that are not relevant to our study. We first applied the exclusion criteria EC5–EC8, and as a result 1,824 studies remained. Then, the other inclusion and exclusion criteria (EC1–EC4, IC1–IC2) were applied. Finally, 142 academic works related to SQL databases and 98 academic works related to NoSQL databases remained for a more detailed study.

3.2.3 Assessing the Quality of Primary Studies.

In addition to the inclusion and exclusion criteria, it is important to ensure that the primary studies are of high quality. According to Kitchenham [130], the quality assessment of primary studies is complex and depends on various factors. We adopted the following criteria for our study: (a) availability of the publication, including information on how to access it, e.g., URLs, DOIs, or databases; (b) detailed description and documentation of the presented methods and models for schema design; and (c) comprehensive presentation of the results. Based on these criteria, the academic works were subjected to thorough text-reading, leaving 114 works related to SQL databases and 72 works related to NoSQL databases for this survey. An overview of the identified primary studies is given in Table 1. The table shows statistics with respect to publication type and year of publication. Symposia and workshop publications are counted as conferences. We can see that research on automated physical design in SQL databases covers the entire period of our study, whereas research works on schema design in NoSQL databases have appeared only from 2010 onward.
| Database Type | Publication Type | Primary Studies | Years of Publication |
| --- | --- | --- | --- |
| SQL Databases | Journal | 40 | 1974–2021 |
| | Conference | 64 | 1976–2021 |
| | Book | 3 | 1983–2020 |
| | Dissertation or technical report | 7 | 1970–2020 |
| | Total | 114 | 1970–2021 |
| NoSQL Databases | Journal | 21 | 2014–2021 |
| | Conference | 49 | 2010–2021 |
| | Book | 1 | 2019 |
| | Dissertation or technical report | 1 | 2018 |
| | Total | 72 | 2010–2021 |

Table 1. The Information about Selected Studies

3.2.4 Extracting Data of Primary Studies.

In this stage, we analyze the primary studies in order to answer the research questions. For this, we extracted the following main features:
(1)
Publication source, year of publication, and type of article (RQ1).
(2)
Database type SQL or NoSQL, type of NoSQL databases (Key-Value, Document, Wide-column, Graph) (RQ2, RQ3, RQ4).
(3)
Level of representation, i.e., conceptual, logical, or physical (RQ2).
(4)
Schema design structures, for example physical structures, such as indexes, materialized views, multidimensional clustering, and partitioning (RQ2).
(5)
Source and target models in the schema design process (RQ2).
(6)
Schema design methods (RQ3, RQ4).

3.3 Phase 3: Reporting the Review

The reporting phase consists of the final assessment and presentation of the results from the systematic review process. In the following sections, we describe in detail the results and findings that we obtained by answering and analyzing our four research questions.

4 Overview of Primary Studies and Taxonomy

This section provides a general description of the selected primary studies (answer to RQ1) and a new taxonomy to classify schema design and tuning approaches (answer to RQ4).

4.1 Overview of Selected Primary Studies

This section provides publication statistics for the selected primary studies with the goal of answering RQ1. Figure 3(a) shows the distribution of primary studies in terms of publication type, i.e., journal article, conference article, book, dissertation or technical report (D & R). Recall that symposia and workshops are counted as conferences. A substantial number of primary studies are published in conferences, for both SQL databases (64) and NoSQL databases (49).
Fig. 3. Distribution of primary studies by publication type and by publisher.
Figure 3(b) reveals that ACM and IEEE are the dominating publishers for research on SQL databases. This is mainly because they publish specialist conference proceedings, such as ACM SIGMOD and ICDE. Springer dominates for NoSQL databases, due to the many articles published in Springer conferences, such as the ER conference, which was one of the earliest conferences soliciting contributions in the area of data modeling in NoSQL databases. VLDB is one of the main conferences for databases; its proceedings were published by Morgan Kaufmann (MK) until 2004, then by ACM from 2005 to 2007, and since 2008 by the VLDB Endowment in open access (PVLDB).
Figure 4 shows the publication statistics with respect to the publication year. Studies about physical schema design in relational databases cover the entire period from 1970 to 2021 considered in our study. The rapidly growing number of publications over the years demonstrates an increasing interest in this topic among researchers and practitioners. Studies on schema design in NoSQL databases have only been conducted since 2010, and their number is growing fast.
Fig. 4. Distribution of primary studies by publication type over the years.
Table 2 shows the main conferences of our primary studies. In the area of physical schema design and tuning for SQL databases, the most notable conferences are ACM SIGMOD (15), VLDB (14), and ICDE (13). In the area of schema design and tuning for NoSQL databases, the most notable conferences are ER (7), and ICEIS (5). Articles published in journals were spread over 19 journals for SQL databases and 16 journals for NoSQL databases. Table 3 shows the most relevant journals.
| Database Type | Acronym | Conference Name | Primary Studies |
| --- | --- | --- | --- |
| SQL Databases | ACM SIGMOD | ACM SIGMOD International Conference on Management of Data | 15 |
| | VLDB | International Conference on Very Large Data Bases | 14 |
| | ICDE | IEEE International Conference on Data Engineering | 13 |
| | EDBT | International Conference on Extending Database Technology | 3 |
| NoSQL Databases | ER | International Conference on Conceptual Modeling | 7 |
| | ICEIS | International Conference on Enterprise Information Systems | 5 |
| | IEEE BigData | IEEE International Conference on Big Data | 3 |
| | IEEE IRI | International Conference on Information Reuse and Integration for Data Science | 3 |

Table 2. Main Conferences
| Database Type | Publisher | Journal Title | Primary Studies |
| --- | --- | --- | --- |
| SQL Databases | PVLDB | PVLDB | 10 |
| | Springer | The VLDB Journal | 5 |
| | IEEE Computer Society | IEEE Transactions on Software Engineering | 4 |
| | IEEE Computer Society | IEEE Transactions on Knowledge and Data Engineering | 3 |
| | Elsevier | Information Systems | 3 |
| NoSQL Databases | Elsevier | Information Systems | 4 |
| | Springer | Journal of Big Data | 3 |
| | Springer | The VLDB Journal | 2 |
| | IEEE Computer Society | IEEE Transactions on Knowledge and Data Engineering | 2 |

Table 3. Main Journals

4.2 A Taxonomy of Schema Design and Tuning

To facilitate a systematic and thorough analysis of the numerous approaches, we present a new taxonomy of schema design and tuning in Figure 5 (answer to RQ4). The first level distinguishes between database type, workload type, and tuning approach. We use this taxonomy in the following sections to provide a thorough analysis of the selected primary studies (answer to RQ2 and RQ3). We begin our analysis with respect to the two different database types considered in this study and discuss SQL databases in Section 5 and NoSQL databases in Section 6. Afterward, in Section 7 we analyze the selected works from the perspective of workload type and tuning approach.
Fig. 5. Taxonomy of Schema Design and Tuning.

5 Schema Design and Tuning in SQL Databases

The schema design problem in relational databases focuses on automated physical model design, i.e., finding a configuration of physical structures that minimizes the execution cost for a given workload. Physical design structures include auxiliary objects, such as indexes and materialized views, which provide faster access to data and thus faster response times. They also include objects that define how the data is physically organized in tables, e.g., the use of clustering keys in multidimensional relational databases or partitioning keys in shared-nothing parallel database systems [87]. Generally, automated physical design approaches first generate a set of alternative designs, apply a cost model, and then choose the design with the lowest cost. Accordingly, we distinguish two main aspects: (1) a cost model to evaluate alternative designs and estimate the total workload cost of each design, and (2) a search strategy to find the physical configuration with the lowest total cost.
In the following, we first analyze cost models and search strategies, followed by a comprehensive overview of research conducted on automated physical design.
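The generate-and-cost pipeline just described can be sketched in a few lines. The workload, the candidate structures, and the cost numbers below are all invented for illustration; real tools use optimizer-derived cost estimates instead of a hand-written cost function.

```python
from itertools import chain, combinations

# Toy workload: query -> execution frequency.
WORKLOAD = {"q1": 100, "q2": 40}
CANDIDATES = ["idx_a", "idx_b"]   # candidate physical structures

def query_cost(query, config):
    """Invented per-query cost model: an index halves the cost of the
    single query it benefits (idx_a helps q1, idx_b helps q2)."""
    base = {"q1": 10.0, "q2": 8.0}[query]
    helps = {"q1": "idx_a", "q2": "idx_b"}[query]
    return base / 2 if helps in config else base

def workload_cost(config):
    """Total workload cost = frequency-weighted sum of query costs."""
    return sum(freq * query_cost(q, config) for q, freq in WORKLOAD.items())

def best_configuration(candidates, budget):
    """Search strategy: exhaustively cost all subsets within a size budget
    and return the cheapest one."""
    subsets = chain.from_iterable(combinations(candidates, r)
                                  for r in range(len(candidates) + 1))
    feasible = [set(s) for s in subsets if len(s) <= budget]
    return min(feasible, key=workload_cost)

print(best_configuration(CANDIDATES, budget=1))  # -> {'idx_a'}
```

With a budget of one structure, the search picks the index helping the more frequent query; the sections below survey how real systems replace both the cost model (Section 5.1) and the exhaustive search (Section 5.2) with scalable alternatives.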

5.1 Cost Model

Research on cost models can be categorized into three main groups: stand-alone cost model, optimizer cost model, and learned cost model.

5.1.1 Stand-alone Approaches.

Stand-alone approaches build an external cost model outside of the DBMS for comparing alternative design configurations. Stonebraker [214] and Schkolnick [202] presented a probabilistic model that considers statistical properties of the transactions as input parameters, e.g., the probability that a particular column appears in a query. These approaches are not practical since maintaining the statistics is expensive and the required amount of storage is very high. Thus, they work in special restricted cases, but not in the general case. Hammer and Chan [102] used exponential smoothing, a forecasting technique, to derive the parameters of the cost model. Another approach adopts rule-based expert systems [63, 80, 195]: they formalize the knowledge of human experts as rules and use these rules to generate a good physical design configuration. [105] introduced a simple linear cost model for estimating the number of rows in a table that a query is expected to process. Similarly, [126] developed an analytical cost model to predict the impact of data correlations on the performance of secondary index look-ups. The work in [234] uses an approximate mean value analysis to estimate the cost of executing a workload under a given configuration. Finally, the studies in [16, 21, 24, 57, 78, 95, 120, 157, 192, 224, 225, 226] use I/O-based cost models to calculate the total workload cost, which is measured as the number of page or block accesses, or as a weighted sum of I/O and CPU operations.
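The I/O-based cost models in the last group can be illustrated as a frequency-weighted sum over the workload. The weights and per-query numbers below are invented; the cited works derive them from storage and hardware parameters.

```python
# Minimal sketch of an I/O-based cost model: total workload cost as a
# weighted sum of page accesses and CPU operations. Weights are assumptions.
W_IO, W_CPU = 1.0, 0.01   # invented relative weights of I/O vs. CPU work

def query_cost(page_accesses, cpu_ops):
    """Cost of one query execution under the current physical design."""
    return W_IO * page_accesses + W_CPU * cpu_ops

def total_workload_cost(workload):
    """workload: list of (frequency, page_accesses, cpu_ops) triples."""
    return sum(f * query_cost(p, c) for f, p, c in workload)

cost = total_workload_cost([(100, 12, 500), (40, 30, 2000)])
print(cost)   # frequency-weighted sum over the two queries
```

Setting `W_CPU = 0` recovers the pure page-access variant also used in several of the cited studies.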
Approaches based on external cost models suffer from serious drawbacks. They cannot guarantee that the proposed storage scheme will be used to its full potential by the optimizer. Moreover, the tool becomes obsolete whenever the optimizer changes.

5.1.2 Optimizer Cost Model.

Instead of an external cost model, the pioneering work in [81] uses the cost model of the query optimizer, which maintains statistics about the queries’ execution cost. Such an approach has significant advantages compared to external cost models. The generated physical schemes are based on information extracted from the optimizer, and thus will be used by the optimizer to its full potential. Moreover, they remain synchronized if the optimizer evolves over time. A shortcoming of this approach is that creating statistics via traditional full-scan techniques is impractical for very large databases with many columns.
The AutoAdmin project initiated by Microsoft in 1997 extends the optimizer with a “what-if” interface for simulating hypothetical physical structures [46]. These structures are not materialized; instead, they are simulated inside the optimizer by adding metadata and statistical information to the system catalog. This interface allows physical design tools to stay in sync with the optimizer and to consider a large space of alternative physical designs without materializing them. Similar “what-if” interfaces were subsequently adopted by other DBMSs, for instance HypoPG, which allows creating hypothetical indexes in PostgreSQL and observing their impact on the optimizer. Large problem instances imply a very high number of what-if optimization calls, which dramatically increases the cost of physical structure tuning. To tackle this problem, several studies [38, 68, 181, 203] propose fast what-if optimization techniques, which execute fewer what-if calls and significantly reduce the cost of index tuning.
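The what-if idea can be sketched as follows. This is a hypothetical illustration of simulating an index as catalog metadata only; the class, method names, and the benefit factor are invented and do not reflect the actual interface of AutoAdmin or HypoPG.

```python
# Hypothetical sketch of a "what-if" interface: an index is registered as
# metadata only (never materialized), and the cost model is asked to
# estimate cost as if it existed. All names and numbers are invented.
class WhatIfOptimizer:
    def __init__(self):
        self.hypothetical = set()    # simulated indexes, metadata only

    def create_hypothetical_index(self, name):
        self.hypothetical.add(name)  # no physical structure is built

    def drop_all_hypothetical(self):
        self.hypothetical.clear()

    def estimate_cost(self, plan_cost, useful_index):
        """Cost the plan, pretending the hypothetical index exists."""
        if useful_index in self.hypothetical:
            return plan_cost * 0.5   # invented benefit factor
        return plan_cost

opt = WhatIfOptimizer()
baseline = opt.estimate_cost(1000.0, "idx_orders_date")
opt.create_hypothetical_index("idx_orders_date")
with_index = opt.estimate_cost(1000.0, "idx_orders_date")
print(baseline, with_index)          # -> 1000.0 500.0
```

The key design point is that only `estimate_cost` ever sees the hypothetical set, so a tuning tool can probe arbitrarily many candidate designs without paying the cost of building them.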
Other studies, such as [56, 58, 59], are based on a combination of an optimizer-based and a knowledge-based approach to reap the benefits of both techniques.

5.1.3 Learned Cost Model.

As pointed out in [143, 181], invoking the optimizer to estimate the cost of queries under various configurations is costly, and good configurations might be missed due to erroneous cardinality estimates. Therefore, recent research has studied the use of machine learning techniques, especially deep neural networks and reinforcement learning, to estimate cost models [26, 136, 146, 176, 179, 184, 185, 187, 197, 231]. The goal is to develop a tuning strategy that does not require prior knowledge of a cost model. Instead, the cost model is learned. Oracle’s cloud-based autonomous DBMS [176] is one notable example of augmenting an existing DBMS with machine learning agents. Instead of augmenting an existing DBMS, other works [184, 185] have looked into creating new DBMS architectures, which use different ML algorithms to predict the cost and benefit of deploying configuration components.
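The core idea of a learned cost model, fitting cost from observed executions rather than deriving it analytically, can be illustrated with a simple one-feature least-squares fit. The cited works use far richer models (deep neural networks, reinforcement learning) and feature sets; the data below is synthetic.

```python
# Sketch of a *learned* cost model: fit cost from observed
# (feature, runtime) pairs instead of specifying a formula a priori.
def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Synthetic observations: pages scanned vs. measured runtime,
# generated from runtime = 2 * pages + 5.
pages   = [10, 20, 30, 40]
runtime = [25, 45, 65, 85]
slope, intercept = fit_linear(pages, runtime)

def predict_cost(p):
    """Learned cost estimate for a query scanning p pages."""
    return slope * p + intercept

print(predict_cost(50))   # -> 105.0
```

The appeal, as noted above, is that the model improves with observed executions and does not inherit the optimizer's cardinality estimation errors.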

5.2 Search Strategy

Research on search strategies can be categorized into two main groups: exact approaches and approximate approaches.

5.2.1 Exact Approaches.

Early works use exact approaches based on an analytical model to find an optimal physical configuration [16, 128, 178, 202, 214]. Exact solutions find an optimal set of physical structures by considering all possible design choices. Since the physical structure selection problem is NP-complete [188], an exhaustive search strategy is impractical and not scalable. Therefore, the above approaches made many simplifying assumptions. General problems with analytical approaches include: (1) significant simplifications have to be made, and (2) if the query processing strategy or other modeled aspects of the DBMS change, the model becomes obsolete.

5.2.2 Approximate Approaches.

To tackle the complexity of exact approaches, approximate solutions have been investigated. The most widely used technique is to apply heuristic search methods to prune the search space of design models and choose a near-optimal configuration. Most of the works use heuristic solutions [21, 23, 24, 30, 56, 57, 58, 59, 63, 80, 81, 82, 102, 103, 195, 211, 217, 224, 229], greedy approaches [6, 45, 96, 97, 105, 127], evolutionary algorithms like genetic algorithms [137, 172, 191, 232], branch-and-bound algorithms [171, 186, 234], relaxation-based approaches [35, 36, 37], or hybrid approaches [233]. Other studies exploit combinatorial optimization techniques to choose an optimal set in potentially exponential time, or to find a near-optimal solution along with a bound on the distance to the optimal solution. The works in [40, 41, 68, 126, 180, 192] adopt an Integer Linear Programming (ILP) formulation of the problem and employ a commercial ILP solver based on techniques such as branch-and-bound, Lagrangian relaxation, or column generation. In [78, 95, 119, 120], a polynomial-time approximation algorithm is used to solve the index selection problem, which is formulated as a knapsack optimization problem. The work in [237] models the index and materialized view selection problem as a variant of the knapsack problem, followed by a random-swapping algorithm.
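A minimal sketch of a greedy, knapsack-style heuristic of the kind cited above: repeatedly pick the structure with the best benefit-per-size ratio that still fits the storage budget. The candidate names, benefits, and sizes are invented for illustration.

```python
# Greedy knapsack-style heuristic for physical structure selection.
def greedy_select(candidates, budget):
    """candidates: dict name -> (workload_benefit, storage_size).
    Returns the chosen structures within the storage budget."""
    chosen, used = [], 0
    remaining = dict(candidates)
    while remaining:
        # Pick the best benefit-per-size ratio among remaining candidates.
        name, (benefit, size) = max(remaining.items(),
                                    key=lambda kv: kv[1][0] / kv[1][1])
        del remaining[name]
        if used + size <= budget:   # skip structures that no longer fit
            chosen.append(name)
            used += size
    return chosen

cands = {"idx_a": (90, 3), "mv_b": (120, 6), "idx_c": (25, 1)}
print(greedy_select(cands, budget=7))   # -> ['idx_a', 'idx_c']
```

With a budget of 7, the heuristic skips the high-benefit but large materialized view `mv_b`; unlike exact enumeration, it may therefore miss the optimal set, which is exactly the quality trade-off these approximate approaches accept.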
The approaches described above lack a mechanism to learn about the goodness of the recommended set of physical structures. Their quality can be improved by learning from the effects of deployed actions, e.g., adding an index. Recent advances in (deep) reinforcement learning inspired researchers to move toward learning-based approaches. For instance, [140, 197, 210] demonstrate how deep reinforcement learning can be used for index selection. The works in [26] and [179] specify a general reinforcement learning approach for automated and dynamic database indexing. Liang et al. [150] investigate how to use deep reinforcement learning for selecting materialized views and improving the performance of OLAP workloads. Yuan et al. [231] solve the view selection problem by first modeling it as an ILP problem, then modeling the ILP as a Markov Decision Process (MDP), and finally using a deep reinforcement learning technique to solve it.
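The learning idea can be hinted at with a deliberately simplified sketch: an epsilon-greedy agent learns from noisy reward observations which single index yields the largest speed-up. This is a toy bandit, not the deep RL architectures of the cited works, and the reward table is an invented stand-in for measured latency improvements.

```python
# Much-simplified illustration of the learning idea behind RL-based index
# advisors: an epsilon-greedy agent tries actions (indexes), observes noisy
# rewards (speed-ups), and converges to the most beneficial one.
import random

def epsilon_greedy_advisor(true_rewards, episodes=2000, eps=0.1, seed=0):
    rng = random.Random(seed)
    actions = list(true_rewards)
    q = {a: 0.0 for a in actions}        # estimated value per action
    n = {a: 0 for a in actions}          # times each action was tried
    for _ in range(episodes):
        if rng.random() < eps:
            a = rng.choice(actions)      # explore
        else:
            a = max(actions, key=q.get)  # exploit current estimate
        reward = true_rewards[a] + rng.gauss(0, 0.05)  # noisy observation
        n[a] += 1
        q[a] += (reward - q[a]) / n[a]   # incremental mean update
    return max(actions, key=q.get)

# Synthetic speed-ups per candidate index (invented for illustration).
rewards = {"idx_a": 0.2, "idx_b": 0.9, "idx_c": 0.5}
print(epsilon_greedy_advisor(rewards))  # converges to "idx_b"
```

Full RL index advisors extend this scheme with states (the current configuration), multi-step actions, and learned value functions.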

5.3 Overview of Physical Database Design Approaches

This section provides a survey of research on automated physical design in RDBMSs. Since the AutoAdmin project at Microsoft Research developed the first automated physical design tool for a commercial DBMS in 1998, we review the work in three categories: works prior to the AutoAdmin project, the evolution of the AutoAdmin project, and related works following the AutoAdmin project.

5.3.1 Physical Database Design before the AutoAdmin Project.

The physical design problem has been studied since the early 1970s, initially focusing on the index selection problem for files [16, 128, 178, 202, 214]. These early works adopt an analytical approach, using a probabilistic cost model of the transactions, to derive an exact solution, i.e., an optimal set of indices. These exact approaches suffer from performing a complete enumeration of all possible index sets. Other works use approximate solutions and deal with a limited number of physical structures instead of all possible subsets, e.g., indexes [21, 24, 30, 56, 57, 58, 59, 63, 78, 81, 82, 102, 107, 108, 120, 195, 224, 234], partitioning [103, 169, 170, 195, 234], and multi-dimensional clustering [153]. Finally, there also exist approximate solutions for materialized view or index selection that are specifically tailored for OLAP/Data Cubes [23, 96, 97, 105, 139, 211, 217, 229, 232]. Notice that none of the above works developed physical design tools for commercial DBMSs.

5.3.2 History of the AutoAdmin Project.

The AutoAdmin project2 started in 1996 at Microsoft Research with the aim of making RDBMSs self-tuning. A primary focus was automating the physical database design. In 1998, the first physical database design tool was released, called the Index Tuning Wizard (ITW), which shipped with Microsoft SQL Server 7.0 and is based on techniques presented in [45]. A key feature of this tool is an extension of the query optimizer to support a “what-if” interface [46]. In the SQL Server 2000 release, ITW was enhanced with integrated recommendations for indexes and materialized views in the context of multidimensional (OLAP) databases [7].
Horizontal and vertical partitioning are other important aspects of physical database design with significant impact on performance and manageability. Horizontal partitioning allows tables, indexes, and materialized views to be partitioned into disjoint sets of rows, which are stored and accessed separately. Vertical partitioning allows a table to be partitioned into disjoint sets of columns. The work in [8] presented a scalable solution to the integrated physical design problem of indexes, materialized views, and vertical and horizontal partitioning, addressing both performance and manageability. In Microsoft SQL Server 2005, the functionality of ITW was replaced by a full-fledged application, called the Database Engine Tuning Advisor (DTA) [6]. It significantly increased the scope and usability of ITW. DTA provides integrated recommendations for indexes, indexed views, indexes on indexed views, and horizontal range partitioning. It scales to large databases and workloads using techniques such as workload compression and reduced statistics creation. ITW and DTA employ a “candidate selection” step to identify a set of likely configurations in a cost-based manner by consulting the query optimizer.
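The “what-if” interface can be sketched conceptually as follows: the tuner asks the optimizer to cost a query under a hypothetical configuration without materializing it. The class, matching rule, and cost numbers below are invented for illustration and do not reflect SQL Server's actual API.

```python
# Conceptual sketch of a "what-if" optimizer interface: cost a query under
# a hypothetical (not materialized) index configuration. The cost model
# and the toy matching rule are invented stand-ins.

class WhatIfOptimizer:
    def __init__(self, base_costs):
        self.base_costs = base_costs  # query -> cost without extra indexes

    def estimate(self, query, hypothetical_indexes):
        """Pretend each matching hypothetical index halves the query cost."""
        cost = self.base_costs[query]
        for idx in hypothetical_indexes:
            if idx.endswith(query):   # invented matching rule
                cost /= 2
        return cost

opt = WhatIfOptimizer({"q1": 100.0, "q2": 40.0})
print(opt.estimate("q1", ["idx_q1"]))  # 50.0: the index helps q1
print(opt.estimate("q2", ["idx_q1"]))  # 40.0: the index is irrelevant to q2
```

Candidate selection then compares such estimates across configurations instead of building and benchmarking each one.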
The ITW and DTA tools assume that the DBA knows when to invoke a physical design tuning tool. In this regard, DBAs face two main challenges. First, changes in data distributions and workloads might render the current configuration sub-optimal, so DBAs must repeatedly invoke the tuning tool to recommend changes to the current configuration. Second, since running a tuning tool incurs an overhead, it should only be invoked when the current configuration is indeed no longer optimal. Therefore, the work in [36] introduced a more lightweight tool, called Alerter, which identifies when a physical design tool should be invoked. Finally, the work in [37] introduced an online index selection approach that continuously monitors changes in the workload and in the data, and modifies the physical design as needed. It was prototyped inside the Microsoft SQL Server engine.
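The Alerter idea of invoking the expensive tuner only when worthwhile can be sketched as a simple threshold test; the cost numbers and threshold below are invented.

```python
# Hypothetical sketch of an Alerter-style trigger: the expensive tuning
# tool is only invoked when a lightweight estimate suggests that retuning
# would yield at least a threshold relative improvement.

def should_invoke_tuner(current_cost, estimated_best_cost, threshold=0.2):
    """Return True when retuning promises at least `threshold` relative gain."""
    improvement = (current_cost - estimated_best_cost) / current_cost
    return improvement >= threshold

print(should_invoke_tuner(100.0, 90.0))  # 10% gain -> skip tuning
print(should_invoke_tuner(100.0, 60.0))  # 40% gain -> invoke the tuner
```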
Since a crucial requirement for a self-tuning database system is the ability to efficiently monitor the database and identify potential performance threats, the AutoAdmin project also developed a SQL Continuous Monitoring engine (SQLCM) [44]. It enables continuous monitoring inside the database system and has the capability to automatically take actions based on the monitoring results.

5.3.3 Database Physical Design following the AutoAdmin Project.

Physical database schema tuning can be carried out in three different modes: manual, semi-automatic, and fully-automatic. Manual (or offline) approaches are invoked by the DBA when deemed necessary and recommend a physical configuration to the DBA to improve the DBMS performance. Semi-automatic techniques keep the DBA in the loop by recommending a physical configuration that requires feedback from the DBA. Fully-automatic approaches, known as online physical design tuning, continuously tune the configuration as the workload or database state changes, without the need for any DBA intervention.
Manual Mode. Manual tools for physical design tuning in commercial database systems other than Microsoft SQL Server are the IBM DB2 Design Advisor [219, 236, 237] and the SQL Access Advisor in Oracle 10g [64]. The DB2 Design Advisor [237] recommends indexes and materialized views for a given workload. As of DB2 version 8.2 [236], the Design Advisor provides integrated recommendations for physical design features, including indexes, materialized views, partitioning, and multi-dimensional clustering. It adopts a hybrid algorithm that efficiently searches through the large solution space while taking into consideration the interactions of related features. This approach makes it possible to break the implementation of the different features into smaller components. The implementation details can be found in [151, 191, 237]. For instance, a knapsack algorithm followed by a random-swapping phase is proposed for index and materialized view selection [219, 237], and a rank-based search and a genetic algorithm were proposed for partitioning [191]. The SQL Access Advisor in Oracle 10g [64] recommends indexes and materialized views for a given workload. It receives as input a workload and a set of candidates (created on a per-query basis by the Automatic Tuning Optimizer) and outputs a recommendation for the overall workload. In Oracle 11g [28, 138], the advisor was extended to also make recommendations for partitioning.
The index selection approach in [203] recursively applies a greedy selection that effectively accounts for index interactions. The approach decreases the number of “what-if” optimizer calls by employing caching and workload information, and thus scales to large problems.
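A hypothetical sketch of greedy selection with cached “what-if” calls is shown below; the cost table stands in for optimizer estimates, and memoization avoids repeated calls for the same configuration.

```python
# Sketch of greedy index selection with memoized "what-if" cost calls.
# A real tool would obtain these costs from the optimizer; here they are
# an invented per-configuration cost table.
from functools import lru_cache

COSTS = {
    frozenset(): 100,
    frozenset({"a"}): 70, frozenset({"b"}): 80, frozenset({"c"}): 95,
    frozenset({"a", "b"}): 50, frozenset({"a", "c"}): 68,
    frozenset({"b", "c"}): 78, frozenset({"a", "b", "c"}): 49,
}

@lru_cache(maxsize=None)            # cache repeated what-if calls
def what_if_cost(config):
    return COSTS[config]

def greedy_select(candidates, k):
    """Greedily add up to k indexes, each step keeping the cheapest config."""
    config = frozenset()
    for _ in range(k):
        best = min(candidates - config,
                   key=lambda i: what_if_cost(config | {i}))
        if what_if_cost(config | {best}) >= what_if_cost(config):
            break                   # no remaining candidate helps
        config |= {best}
    return config

print(sorted(greedy_select(frozenset({"a", "b", "c"}), 2)))  # ['a', 'b']
```

The cited approach goes further by exploiting workload structure to prune candidates; the caching pattern itself is what the sketch shows.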
The work in [192] presents a physical design tool for C-store [215], a clustered column-store RDBMS. This tool builds an optimal candidate set consisting of materialized views and clustering indexes for OLAP workloads. The work in [233] provides a hybrid evolutionary approach, combining the advantages of heuristic and evolutionary algorithms to select materialized views in a data warehouse, whereas [98] presents a simple greedy heuristic for selecting views. The cost model in these studies is calculated in terms of query processing and view maintenance.
In [127], the authors investigate how to effectively integrate index compression techniques into physical database design. The presented index selection approach considers compressed alternatives for each index based on a given memory budget. The optimizer’s cost model is extended to handle compressed indexes in the configurations. The work in [193] jointly optimizes different configuration decisions (compression, index, and ordering) using linear programming to determine the best runtime performance for a given workload and memory budget.
Finally, there exist a number of state-of-the-art offline approaches for suggesting vertical and horizontal partitions [124, 171, 186] and multidimensional hierarchical clustering of OLAP data [159].
Semi-automatic or Fully-automatic Mode. The work in [205] proposes a semi-automatic index tuning technique, which adapts its recommendations to changes in the workload and feedback from the DBA. Fully-automatic approaches became popular in the form of database cracking [113, 114], continuous monitor-and-tune approaches [13, 14, 37, 88, 111, 112, 122, 147, 198, 204, 227], and machine learning-based approaches [26, 76, 135, 136, 146, 150, 179, 184, 185, 187, 210, 231]. Database cracking is an online index selection approach, where indexes are created and continuously refined as part of query processing. In this approach, each query is interpreted not only as a request for a particular result set, but also as advice to crack the physical database store into smaller pieces. There is a large body of work on extending and improving database cracking [90, 91, 92, 93, 100, 110, 115, 189, 206, 207]. A comparative study of different cracking techniques is presented in [208].
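The cracking idea can be sketched as follows: answering a range query also partitions the column around the query bounds, so later overlapping queries touch less data. This toy version rebuilds the column with comprehensions; actual cracking implementations reorganize in place with swaps and maintain a cracker index of piece boundaries.

```python
# Minimal sketch of the database-cracking idea: a range query physically
# reorganizes ("cracks") the column into [< lo | lo..hi | > hi] pieces.
# Real systems crack in place and keep an index of piece boundaries.

def crack_in_two(column, lo, hi):
    """Partition column around [lo, hi]; return the middle piece and bounds."""
    left = [v for v in column if v < lo]
    mid = [v for v in column if lo <= v <= hi]
    right = [v for v in column if v > hi]
    column[:] = left + mid + right          # in-place reorganization
    return mid, (len(left), len(left) + len(mid))

col = [13, 4, 55, 9, 27, 2, 42, 17]
result, bounds = crack_in_two(col, 10, 30)
print(sorted(result))   # [13, 17, 27]: the query answer
print(col)              # [4, 9, 2, 13, 27, 17, 55, 42]: clustered pieces
```

A later query on, say, [15, 30] would only need to scan the middle piece recorded in `bounds`.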
Continuous monitor-and-tune approaches [13, 14, 37, 88, 111, 112, 122, 147, 198, 204, 227] monitor changes in the workload in a continuous way, collect statistics, and automatically make online decisions to perform changes to the physical design as needed. The works in [204] and [198] share similar goals with [37], but differ in their design. The authors of [198] developed the earliest online index selection solution, called the QUIET framework, implemented on top of DB2. It aims at creating indexes automatically at runtime, while considering a DBA-defined space budget. In comparison to [37], this approach requires multiple additional calls to the optimizer to obtain information about candidate indexes. The work in [204] developed a self-tuning framework, called COLT, that continuously monitors the workload of a database system and enriches the existing physical design with a set of effective indexes. COLT has been implemented inside PostgreSQL, extending the optimizer with a “what-if” interface. In contrast to [37], it is not fully integrated in the query optimizer. The authors of [227] developed an autonomic tuning expert for IBM DB2 and evaluated it under changing workloads and system states. The approach stores DBA experts’ tuning knowledge for the autonomic management process.
An online, DBMS-independent framework for automatic index tuning was introduced in [14], which can be used with different DBMSs. It is based on heuristics that run continuously to guide decisions on the current physical database configuration in order to react to workload changes. In [13], a model-driven approach was developed to support tuning decisions by different heuristics. An instantiation of the model is the database self-tuning task for physical design tuning, including indexes and materialized views. In this model, a database tuning requester can be a human agent, such as a DBA, or a computational agent, such as a monitoring tool. The work in [147] employs an online approach to monitor database queries. It collects statistical information in order to detect performance trends and to make decisions for automatic re-partitioning. In [111, 112], the authors presented a method for automatically deriving database configurations by evaluating an objective function. The proposed self-management logic provides continuous monitoring and light-weight analysis of the workload and database state, compares it against goals (e.g., response times, throughput, CPU or disk space usage, availability, and operation costs) defined by a DBA, and starts a reconfiguration analysis when there is a risk of missing the goals. Multi-objective optimization techniques are used to automatically derive database configurations that meet these goals. Additionally, this approach stores an explicit workload model to identify patterns in the workload that make it possible to predict workload shifts. Finally, an online technique to recommend indexes for high-dimensional databases was introduced in [88]. This approach uses a feedback control loop to monitor the query workload, detect when the query pattern changes, and recommend corresponding adjustments to the index set.
There are numerous studies [26, 76, 135, 136, 140, 146, 150, 179, 184, 185, 187, 210, 231] that have leveraged machine learning techniques for automated tuning of DBMSs. They extract the state and performance metrics from a DBMS and use this data to train models for choosing tuning features that will benefit the most. Some studies [135, 136, 146, 184, 185] propose autonomous (self-managing) database systems, which leverage machine learning techniques for workload modeling and forecasting, fully automated physical design, and other configuration components. Thus, they do not focus on one particular configuration aspect and take a holistic view, which considers combinations of various configuration components.
The authors of [184] present the first self-driving RDBMS, called Peloton (later NoisePage [185]), which automatically configures and optimizes various aspects (e.g., indexes, materialized views, data partitioning, knobs) as the database and the workload evolve over time. It supports hybrid workloads, OLTP and OLAP, and predicts future workloads based on past arrival rates [156] so that the system can adapt itself. NoisePage uses behavior models generated by different ML algorithms to predict the cost and benefit of deploying configuration components. Based on these cost/benefit estimates, NoisePage adopts tree-based optimization methods (e.g., Monte Carlo tree search (MCTS) or RL methods) to select actions that improve the system’s target objective function.
The authors of [135, 136] propose a framework for self-driving database systems that automatically tunes various configuration features, such as physical database design, knob configurations, or hardware resources. The framework is able to predict future workloads and to deliver configurations based on these predictions. They also propose a linear programming (LP) model to tune the multiple dependent features in a recursive way. Additionally, an approach for adaptive cost estimation is suggested, which can be realized by applying linear regressions, gradient boosting regressors, or neural networks. The proposed framework is integrated into the research DBMS Hyrise [74], a relational columnar store for hybrid transactional and analytical processing (HTAP).
In [146], the authors built an autonomous database framework and integrated ML techniques into the open-source database system openGauss [175]. They proposed learning-based models for developing learned optimizers and advisors, such as learned cost estimation, learned knob tuning, learned database diagnosis, and a learned view/index advisor. For the latter, a deep reinforcement learning model to select high-quality indexes and materialized views is proposed.
Other researchers have developed learning approaches to solve the problem of automatically indexing a database [26, 140, 179, 187, 197, 210], materialized view selection [150, 231], and partitioning [76] using reinforcement learning or deep reinforcement learning.
Finally, we would like to mention some autonomous features in commercial databases, including Oracle Autonomous database [176], Alibaba self-driving database [12, 145], and fully-automated indexing in Microsoft Azure SQL Database [67].

6 Schema Design and Tuning in NoSQL Databases

This section provides an analysis of the research on NoSQL schema design using the taxonomy facets in NoSQL databases, including NoSQL type, level of data model, and modeling approach.

6.1 NoSQL Type and Level of Data Model

These two sub-facets in our taxonomy deal with database type and the level of the data model in NoSQL schema design. Concerning the type of databases, more than 200 different NoSQL stores have been reported, and the number is growing.3 Here we use the categorization from [69]:
Key-Value Stores are the least complex NoSQL databases from an architectural point of view. Data are represented as (key, value)-pairs in which a key serves as a unique identifier.
Document Stores are extended key-value stores, in which the value is a document stored in a semi-structured format, such as JSON or XML. Documents have flexible schemas, i.e., documents are not required to have the same schema or to contain the same level of information.
Wide-Column Stores, also known as extensible record stores, store data in columns and groups of columns. In fact, these systems store data in a tabular format of rows, called column family. Each row is identified by a key and can contain an arbitrary number of columns with a similar structure, which are usually accessed together. Rows in a column family are not required to share the same columns, i.e., columns can be added or modified at runtime.
Graph Stores are based on the graph structure that includes nodes and edges. Nodes and edges may also have properties.
From the viewpoint of data model, we distinguish between single-model and multi-model systems. The latter support the storage and processing of structurally different data, i.e., data with different models. Multi-model databases can be further divided into single databases and multi databases [155]. Multi-model single databases build one single engine to manage different data models with a fully integrated back-end to satisfy the system’s demand for performance, scalability, and fault tolerance. In contrast, multi-model multi databases (i.e., polystores [75] and multistores [132]) combine multiple different data stores, each with distinct native data models and different languages. An integrated access to all stores through one or more query languages is provided. In multi-model databases, a store-independent global schema is needed to integrate the different data models and to offer a uniform access. That is, at the logical and physical level, a unified data model can be defined, which is mapped to the local schemas. Several generic metamodels, such as SOS [19], NoAM [18], TDM [142], U-Schema [39], have been proposed.

6.2 Modeling Approach

Modeling NoSQL databases can be performed by using either a forward, a reverse, or a roundtrip engineering methodology.

6.2.1 Forward Engineering/Schema-first Approaches.

Forward Engineering (or schema-first) approaches start from a conceptual model or a relational database and transform it into a NoSQL data model. In this process, workloads may be considered to optimize the proposed schema. Thus, we can distinguish two categories of approaches: workload-agnostic and workload-driven approaches.
Workload-agnostic Approaches. These approaches define a set of conversion rules that transform a relational or conceptual schema into a NoSQL schema. They largely ignore workloads as well as the evaluation of alternative designs. In this context, the authors of [1] provide a set of mapping rules that transform a UML conceptual model into different NoSQL physical models (column, document, or graph). They provide a generic intermediate logical model that describes data according to the features of NoSQL models and supports a sufficient degree of independence, so as to enable the mapping to several platforms. The work in [65] presents a set of mappings from conceptual schemas expressed in UML/OCL into logical schemas (data store specific models) and integrates them in a unique multi-data store design method. This approach embeds three transformations, each specific to a data store: UmlToSQL, UmlToGraphDB, and UmlToDocumentDB. Some authors propose transformations for column store databases [144, 149] and document databases [104, 174]. For the mapping of conceptual schemas to graph databases, the works in [5, 11, 66] have transformed a UML model, relational and RDF schemas, and an extended ER model to a graph model. Additionally, data modeling for document-oriented and graph-based databases has been investigated in [125]. The study in [15] has investigated the process of transforming a relational database schema into various NoSQL database schemas, namely document-based, column-based, and graph-based.
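One such conversion rule can be sketched in a few lines: a one-to-many relationship in the conceptual model becomes an embedded array in a document model. Entity and field names are invented for illustration.

```python
# Illustrative sketch of a workload-agnostic mapping rule: a one-to-many
# relationship in a conceptual model becomes an embedded array inside the
# parent document. Entity and field names are invented.

def map_one_to_many(parent, children, child_key):
    """Embed child entities as an array inside the parent document."""
    doc = dict(parent)
    doc[child_key] = [dict(c) for c in children]
    return doc

author = {"_id": 1, "name": "Ada"}
books = [{"title": "Vol. 1"}, {"title": "Vol. 2"}]
print(map_one_to_many(author, books, "books"))
# {'_id': 1, 'name': 'Ada', 'books': [{'title': 'Vol. 1'}, {'title': 'Vol. 2'}]}
```

Actual rule sets also cover many-to-many relationships, references versus embedding, and target-specific constructs such as column families or graph edges.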
The studies in [25, 89] proposed a modeling language for data modeling in hybrid polystores, which involve a combination of relational and NoSQL databases. The language provides mapping rules between the conceptual schema of a polystore according to the Entity-Relationship model and the underlying physical schemas.
Finally, it is worth mentioning contributions related to modeling and implementing data warehouses in NoSQL databases [10, 51, 52, 53, 54, 72, 73, 209, 230]. These approaches define a set of mapping rules that transform a multidimensional conceptual data model into NoSQL models.
Workload-driven Approaches. Workload-driven schema design considers the workload when optimizing the schema, i.e., the schema design is based on the analysis of the workload queries. The workload-driven approach is suitable when the queries and data structure are known beforehand. Approaches in this area often rely on schema denormalization and data redundancy to optimize the schema and improve query performance.
The works in [29, 218] use denormalization to optimize the database schema for read queries. By analyzing read queries, [218] employs foreign key constraints in the conceptual schema to generate all possible denormalized relations for each query, whereas [29] uses secondary indexes to identify all feasible table variants that could answer each read query via a single read request. A cost function is used to select the schema with the minimal cost.
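The per-query candidate generation and cost-based selection described above can be sketched as follows, under the simplifying assumption that variant costs are independent across queries; all names and costs are invented.

```python
# Hedged sketch of workload-driven denormalization: for each read query,
# enumerate candidate table variants that can answer it in one request,
# then pick the variant with minimal estimated cost. Queries, variants,
# and costs are invented for illustration.

# candidate variants per query: (variant_name, estimated_read_cost)
CANDIDATES = {
    "q_user_orders": [("users_with_orders", 3), ("orders_by_user", 2)],
    "q_order_items": [("orders_with_items", 4), ("items_by_order", 5)],
}

def pick_schema(candidates):
    """Choose the cheapest variant per query (costs assumed independent)."""
    return {q: min(variants, key=lambda v: v[1])[0]
            for q, variants in candidates.items()}

print(pick_schema(CANDIDATES))
# {'q_user_orders': 'orders_by_user', 'q_order_items': 'orders_with_items'}
```

In the cited works the cost function also weighs storage and update overhead introduced by the resulting redundancy, which this sketch omits.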
Rafik et al. [177] present a set of conversion rules to support the schema conversion of an existing relational database into HBase. To obtain a proper design, the conversion rules depend on an analysis of the most frequent access patterns and of the heavy queries that are frequently run.
Mior [163] developed an approach, called NoSE, for automated schema generation in extensible record stores. This cost-based approach uses a novel binary integer programming (BIP) formulation to map a conceptual model to a NoSQL schema. NoSE receives a conceptual model and workload as input, and then formulates the schema optimization problem as a BIP, which selects a set of column families that minimize the execution cost of the entire workload. This approach has been implemented for the Cassandra extensible record store [164].
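A toy brute-force analogue of such a cost-based formulation is sketched below: it selects a subset of candidate column families that covers all workload queries at minimal combined storage and query cost. It enumerates subsets rather than solving a BIP, and all names and numbers are invented.

```python
# Toy analogue of a NoSE-style formulation: choose a subset of candidate
# column families so that every workload query is answerable, minimizing
# total (storage + query) cost. All names and numbers are invented; NoSE
# itself solves a binary integer program instead of enumerating subsets.
from itertools import combinations

# column family -> (storage cost, queries it can answer, per-query cost)
CF = {
    "cf1": (5, {"q1"}, 2),
    "cf2": (8, {"q1", "q2"}, 3),
    "cf3": (4, {"q2"}, 1),
}
QUERIES = {"q1", "q2"}

def best_plan(cf, queries):
    """Exhaustively pick the cheapest feasible subset of column families."""
    best, best_cost = None, float("inf")
    names = list(cf)
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            covered = set().union(*(cf[n][1] for n in subset))
            if not queries <= covered:
                continue            # infeasible: some query unanswered
            storage = sum(cf[n][0] for n in subset)
            # each query is answered by its cheapest covering column family
            qcost = sum(min(cf[n][2] for n in subset if q in cf[n][1])
                        for q in queries)
            if storage + qcost < best_cost:
                best, best_cost = subset, storage + qcost
    return sorted(best), best_cost

print(best_plan(CF, QUERIES))       # (['cf1', 'cf3'], 12)
```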
The works in [50] and [71] consider workload information to select an optimized logical schema for read queries by reducing the number of accesses to the database. In [50], a big data modeling tool, called KDM, is presented, which automates Cassandra database schema design for queries defined in an application workflow. It starts with a query-driven mapping from a conceptual data model to a logical data model and ends with a logical-to-physical mapping. In [71], the researchers defined mapping rules that guide a workload-driven transition from a conceptual schema to a logical schema and its physical implementation for NoSQL document stores.
Imam et al. [118] presented a long list of NoSQL modeling guidelines for the logical and physical design of document-store databases, with the aim of facilitating the data modeling process and improving developers’ modeling skills. In another work [117], the authors presented a schema suggestion model for NoSQL document stores. The model suggests a schema based on system requirements and parameters, such as the list of entities and Create, Read, Update and Delete (CRUD) statements. This model aims at balancing significant system requirements, such as consistency, availability, and scalability. An extension of the model, called the DSP model [116], focuses on performance and describes how the model supports query complexities. For this, a binary integer formulation was used to map the conceptual data model to a database schema.
The work in [70] presents a framework for the automatic design of NoSQL databases (column family model and document model), which takes a conceptual data model and access queries as inputs. It includes the definition of a set of rules to transform the conceptual data model first to a logical model and then to the final implementation in the target NoSQL database.
Physical design guidelines for graph databases with the aim of improving query execution time are presented in [123]. In this work, indexes, path materialization, and query rewriting are considered as guidelines for physical design in Neo4j.
CONST loop [167] is the first online NoSQL schema design approach. It adopts a self-tuning feedback control loop for continuous workload monitoring and analysis, as well as for optimized schema design in wide-column stores, and has been implemented in Apache Cassandra. This loop describes a design pattern for the self-tuning feature, which automatically modifies the current schema design in reaction to changes in the application’s workload. Details of the architecture of the monitor and analyze phases are described in [166].
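A self-tuning feedback loop of this kind can be sketched schematically in the monitor-analyze-plan-execute style; all components below are stubs, and the shift metric is invented.

```python
# Schematic sketch of a self-tuning feedback loop (monitor -> analyze ->
# plan -> execute) reacting to workload shifts. All components are stubs;
# a "schema" here is simply the set of query shapes it is optimized for.

def feedback_loop(workload_samples, current_schema, threshold=0.3):
    log = []
    for sample in workload_samples:                      # Monitor
        shift = workload_shift(sample, current_schema)   # Analyze
        if shift > threshold:
            current_schema = redesign(sample)            # Plan
            log.append(("retuned", current_schema))      # Execute
    return current_schema, log

def workload_shift(sample, schema):
    """Stub metric: fraction of queries the current schema does not serve."""
    unserved = [q for q in sample if q not in schema]
    return len(unserved) / len(sample)

def redesign(sample):
    """Stub redesign: optimize for exactly the query shapes just observed."""
    return set(sample)

samples = [["q1", "q2"], ["q1", "q3", "q3"], ["q3", "q4"]]
schema, log = feedback_loop(samples, {"q1", "q2"})
print(sorted(schema), len(log))  # ['q3', 'q4'] 2
```

Real loops replace the stubs with query-log statistics, cost models, and online schema migration.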
Noguera and Lucrédio [173] attempted to implement a schema-independent way to specify and execute NoSQL queries; without such independence, the developer is not free to change the schema in which the data is stored without first adapting the data access code (queries). This work employed a generative approach to simplify the writing of queries independently from the actual data structure in which the data is stored. It uses mapping information between the ER model and NoSQL document-oriented databases, and automatically transforms ER-based queries into native code for a specific NoSQL database. It yields a more flexible schema and allows changes to be made to a NoSQL database schema without requiring changes to all related queries. However, it requires upfront effort to correctly create the mapping information.
Several metamodels have been proposed to represent multi-model NoSQL schemas. SOS [19] is a metamodel to represent schemas for aggregate-based NoSQL stores (i.e., key/value, column, and document), supporting application development by hiding the specific details of the various database systems. Later, SOS evolved to the NoSQL Abstract Data Model (NoAM) [18], which is an intermediate system-independent data model to support a design methodology for aggregate-based NoSQL databases. The design methodology uses aggregate design (a group of related objects), which is mainly driven by data access patterns of the application, as well as by scalability and consistency needs. Mali et al. [158] describe an optimization process to generate a platform-independent, generic logical model used to generate five target models (relational & NoSQL models). They apply a transformation heuristic to reduce the search space of the model generation. This work did not address mapping rules for transforming the generic logical model to physical models. In another work, Li and Chen [148] proposed a query-oriented data modeling approach, called QODM, to generate a platform-independent data model for NoSQL databases based on query requirements, which can be transformed to a specific NoSQL database. The QODM approach includes key-value stores, document stores, and extensible record stores; graph databases are not supported.
The research efforts in [31, 42, 109, 199, 228] introduce methods to design NoSQL schemas with the goal of enhancing the performance of OLAP queries. For document warehouses (DocW), Messaoud et al. [161, 162] propose an approach for transforming a document warehouse model into a column-oriented or document-oriented NoSQL model.

6.2.2 Reverse Engineering Approaches.

Most NoSQL data stores are schema-less, that is, data can be stored without a prior formal definition of its structure. The flexibility of these data models allows the user to easily incorporate new data without making changes to the data model (evolution of data over time). However, to express queries there is still a need to know how the data is structured. Therefore, for processing data without explicit schema information, reverse engineering can be used, which makes it possible to generate the data model from an existing database. This is important to facilitate the understanding, documentation, and schema visualization of the data.
In [131] a schema extraction approach has been introduced that generates a JSON schema from collections of JSON documents. This approach is based on an earlier approach for XML schema extraction from XML documents [165]. Similarly, [20, 121, 221] deal with the problem of inferring schemas from JSON sources by using hierarchical data structures to summarize the structural information of JSON documents. Some studies [2, 3, 4, 32, 33, 55, 61, 196] present an MDE-based reverse engineering approach for extracting the data model from schema-less NoSQL databases. Model-Driven Engineering (MDE) techniques provide a formal framework for automatic model transformation, i.e., mapping one input model into an output model. This transformation process defines a set of transformation rules to be applied between source and target models.
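A minimal flavor of schema extraction from schema-less JSON collections: union the fields observed across documents and record the types seen for each field. Real extractors additionally handle nesting, arrays, and structural variants.

```python
# Minimal sketch of schema inference from schema-less JSON documents:
# collect, per field, the set of value types observed across a collection.
# Real extractors also handle nested objects, arrays, and version variants.
import json

def infer_schema(documents):
    schema = {}
    for doc in documents:
        for field, value in doc.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

docs = [json.loads(s) for s in (
    '{"name": "Ada", "age": 36}',
    '{"name": "Alan", "age": "41", "city": "London"}',
)]
print(infer_schema(docs))
```

The mixed types inferred for `age` illustrate why such extractors must summarize structural variants rather than assume one schema per collection.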
Frozza et al. [84] present an approach for extracting a schema from a JSON or Extended JSON document collection stored in a MongoDB database. This work combines hierarchical structures and MDE-based processes. Additionally, aggregation operations are considered to improve the performance of the schema extraction task. The authors also present methods for extracting schemas in the JSON document schema format from graph [85] and columnar [83] models.
Other works, such as [106, 141], present property graph schema inference methods from graph databases that are based on the property graph data model that is used in Neo4j. Some works focus only on extracting the physical or logical model from the data stored in NoSQL databases [4, 20, 33, 83, 84, 85, 106, 131, 141, 196, 221], while others address the extraction of the conceptual model from the NoSQL physical schema [2, 3, 32, 55] or the stored data [61, 121].
Regarding multi-model databases, a Tensor-based Data Model (TDM) [142] was defined to support logical data independence in polystore systems. TDM subsumes various data models (relational, NoSQL, etc.) and proposes a set of operators for defining views (virtual or materialized views) that can dynamically transform data into appropriate data structures. Reverse mappings from relational, key/value, wide-column, and graph models to TDM were presented.
There exist also some approaches for directly executing OLAP queries on NoSQL data sources [60, 86, 101]. The works in [60] and [86] focus on extracting multidimensional schemas (facts and hierarchies) from document stores. The work in [101] proposes a pay-as-you-go approach, which enables OLAP queries against a polystore supporting relational, document, and column data models by hiding heterogeneity behind a dataspace layer.

6.2.3 Roundtrip Engineering Approaches.

Roundtrip engineering (RTE) is one aspect of model-driven engineering (MDE) that combines forward and reverse engineering. Since code and model are interrelated, changing the code will change the model and vice versa. Akoka and Comyn-Wattiau [9] present a framework describing a roundtrip engineering process for graph databases, which was the first RTE approach for NoSQL databases. As pointed out in this work, roundtrip engineering allows a bidirectional transformation between the model and the source code. The authors of [168] first present the process of extracting the physical schema from the data stored in MongoDB, which captures characteristics such as existing indexes, data organization, and statistical features, and then propose a bidirectional transformation between logical and physical models.
Candel et al. [39] propose a unified logical metamodel, termed U-Schema, which represents logical schemas for a multi-model database supporting relational and the most popular NoSQL systems (columnar, document, key–value, and graph). The U-Schema metamodel has been defined with the Ecore metamodeling language of the Eclipse Modeling Framework (EMF) [213]. A MapReduce operation is performed on the database to infer the logical data model for each NoSQL paradigm. Next, a round-trip strategy is defined, consisting of a forward mapping from the inferred logical schema to U-Schema and a reverse mapping in the opposite direction.

7 Workload Type and Tuning Approach

7.1 Workload Type

The main objective of the schema design process is to provide an optimized schema that increases the performance of an application’s workload. Thus, the type of the workload plays an important role. This facet of the proposed taxonomy deals with the classification of workloads into analytical (OLAP), transactional (OLTP), or hybrid (OLAP and OLTP) workload.
Since the goal of SQL physical schema design is to find a good physical configuration for a given workload, the schema design depends entirely on the current workload type. In NoSQL workload-driven approaches, the schema design is based on the analysis of workload queries to optimize the schema and improve the performance of OLTP workloads [29, 50, 71, 163, 167, 218] or OLAP workloads [31, 42, 109, 199, 228]. As a result, the workload type plays an important role in these approaches. In contrast, NoSQL workload-agnostic approaches largely ignore the analysis of the workload, and hence the workload type is unimportant. However, there are some workload-agnostic approaches for modeling and implementing data warehouse and OLAP systems in NoSQL databases, defining a set of mapping rules that transform a multidimensional conceptual data model into NoSQL models [10, 51, 52, 53, 54, 72, 73, 209, 230]. There also exist some reverse approaches for directly executing OLAP queries on NoSQL data sources [60, 86, 101].
While there exist solutions for the optimization of physical structures for hybrid workloads in SQL databases [6, 136, 184, 185, 236], no solutions to support hybrid workloads in NoSQL schema design have been suggested so far. This is due to the different ways in which conceptual and logical schemas are represented in multidimensional (OLAP) databases as compared to OLTP databases.

7.2 Tuning Approach

Another important facet of the taxonomy is the approach that is used to incorporate tuning into a database management system. The tuning approaches can be categorized into static and dynamic.

7.2.1 Static (or Offline) Tuning.

Static tuning approaches recommend an efficient schema for a given workload to the DBA. While they help to decrease the time required to tune a database, they leave several important decisions to DBAs, such as continuously monitoring the database to detect changes and re-tune the schema design. Static tuning approaches are designed to be invoked manually whenever deemed necessary. Thus, they require considerable human intervention and do not meet the requirements of an autonomic database. A further shortcoming of these approaches is that they can only be used when the workload of the system is known and stable. Most studies for SQL [6, 45, 64, 98, 192, 203, 233, 236] and NoSQL [29, 50, 70, 71, 117, 164, 177, 218] databases take an offline approach to the schema design problem.

7.2.2 Dynamic (or Online) Tuning.

In dynamic environments where workloads are unknown, unstable, and change over time, offline schema tuning is far from ideal. Instead, dynamic approaches are able to track changes in both the workload and the database state and make decisions online to reconfigure the current schema design as needed. Such approaches provide fully automated solutions that meet the requirements of an autonomic database system and reduce the maintenance overhead for DBAs. The topic of online physical tuning of SQL databases rose to popularity in the form of database cracking [113, 114], continuous monitor-and-tune approaches [13, 14, 37, 88, 111, 112, 122, 147, 198, 204, 227], and machine learning-based approaches [26, 76, 135, 136, 146, 150, 179, 184, 185, 187, 210, 231]. For NoSQL databases, CONST [167] is the only online schema design approach; it implements a self-tuning feedback control loop for continuous workload monitoring, analysis, and optimized schema design in wide column stores.
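The monitor-analyze-reconfigure cycle that such continuous monitor-and-tune approaches share can be sketched as follows. All component names (the monitor, designer, and DBMS interfaces) are hypothetical placeholders, not the API of CONST or any other cited system; the sketch only illustrates the feedback-loop structure.

```python
import time

class SelfTuningLoop:
    """A skeletal feedback control loop for online schema tuning.
    The injected components are assumed to exist; only the loop logic is shown."""

    def __init__(self, monitor, designer, dbms, improvement_threshold=0.1):
        self.monitor = monitor       # collects recent workload statistics
        self.designer = designer     # proposes a schema and estimates costs
        self.dbms = dbms             # exposes the current schema, applies changes
        self.threshold = improvement_threshold

    def step(self):
        workload = self.monitor.snapshot()
        candidate = self.designer.design(workload)
        current_cost = self.designer.estimate_cost(self.dbms.schema, workload)
        candidate_cost = self.designer.estimate_cost(candidate, workload)
        # Reconfigure only if the estimated benefit clearly outweighs a
        # threshold, which guards against reconfigurations whose cost would
        # exceed their minor performance improvement.
        if current_cost - candidate_cost > self.threshold * current_cost:
            self.dbms.apply(candidate)

    def run(self, interval_seconds=60):
        while True:             # continuous monitoring in a real deployment
            self.step()
            time.sleep(interval_seconds)
```

The threshold in `step` is one simple way to express the first robustness facet discussed below: a reconfiguration is only triggered when its estimated benefit is substantial relative to the current cost.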
To be fully autonomic, a DBMS must be able to predict future workloads based on historical data in order to select proper configurations in a timely manner. Since database systems are often exposed to unanticipated events, it is difficult to anticipate future workloads. Therefore, robustness is a crucial property of self-tuning DBMSs, and it comes with several facets. The first facet of robustness concerns tuning decisions in the presence of reconfiguration costs. It might happen that a reconfiguration of the database schema produces only a minor performance improvement. To avoid such situations, the cost of the reconfiguration needs to be considered in the optimization. The second facet of robustness refers to how workload changes affect the system’s performance. Robust configurations aim at providing acceptable performance for most workload scenarios so that small changes do not have a large impact [135]. To obtain robust configurations, not only the current workload but also information about the distribution of potential future workloads must be incorporated. Robust workload forecasting models are essential since the workload may deviate significantly from the past. Some approaches explicitly incorporate these robustness aspects [111, 135, 136, 156, 185].
Within dynamic tuning approaches, we identify two sub-facets: reactive/proactive tuning, which captures when approaches adapt to changes in the data or workload, and internal/external tuning, which expresses whether the tuning mechanism is internal or external to the system.
Reactive/Proactive Tuning. This category specifically deals with the time point when dynamic approaches apply the tuning process. In the reactive mode, systems respond to a change when it already happened. In the proactive mode, systems predict when changes are expected to occur (e.g., predict periodic changes in the workload), and therefore they can adapt the DBS configuration beforehand. For example, a self-tuning database system chooses its configurations proactively according to the expected workload patterns in the future. For SQL databases, there are numerous dynamic tuning approaches, which are either reactive [26, 37, 113, 114, 122, 147, 150, 179, 198, 204, 231] or proactive [111, 112, 136, 184, 185]. Currently, there are no proactive tuning approaches for NoSQL databases, and only one work [167] is reactive.
Internal/External Tuning. From a different perspective, dynamic tuning can be divided into two categories with respect to the separation of the tuning mechanism and database system. For internal approaches, the schema self-tuning process is integrated as a module in the DBMS engine. A DBMS internally monitors and analyzes the workload and database state. If it discovers that the database requires changes, the tuning component proceeds to recommend schema changes for better performance. The internal approach has some notable drawbacks. For instance, the system is costly to maintain or evolve, and it is often not scalable. Some works [37, 113, 114, 136, 184, 185, 204] present a dynamic internal approach for physical schema design in SQL databases. However, there is currently no work on the dynamic internal approach for NoSQL databases.
On the other hand, works such as [26, 76, 111, 112, 150, 179, 198, 231] for SQL and [167] for NoSQL databases, employ an external self-tuning manager. These approaches are based on an external self-tuning logic, which controls the reconfiguration decisions of the DBMS. A significant advantage of the external approach is the reusability of the tuning process with respect to different systems, which can be easily customized and configured for different applications.

8 Discussion of Review Results

In this section, we discuss the review results and analyze research questions RQ2, RQ3, and RQ4.

8.1 Level of Representation for Database Schema Design

To analyze past work in database schema design, three levels of representation – conceptual, logical, and physical – were taken into account. In SQL databases, studies have focused exclusively on the automated design of physical structures, such as indexes, materialized views, multidimensional clustering, and horizontal or vertical partitioning. Table 4 provides an overview. We can observe that the majority of studies on physical schema design in RDBMSs address index selection, i.e., choosing indexes on base tables or materialized views. Most of the works focus on a single structure.
Table 4.
StudyIndexMat. ViewsPartitioningMultidim. Clustering
[16], [128], [178], [202], [214], [21], [24], [30], [58], [57], [67] [59], [56], [63], [78], [80], [81], [82], [102], [107], [108], [120], [119], [224], [219], [45], [203], [205], [97], [114], [113], [37], [198], [204], [88], [112], [111], [26], [179], [187], [197], [210], [140], [14], [137], [172], [40], [41], [127], [193]\(\checkmark\)   
[23], [98], [105], [211], [217], [229], [232], [233], [231], [98], [150] \(\checkmark\)  
[76], [103], [147], [191], [186], [235], [124], [171], [169], [170], [122]  \(\checkmark\) 
[151], [153], [159]   \(\checkmark\)
[139], [7], [13], [64], [146], [237]\(\checkmark\)\(\checkmark\)  
[234], [195], [136], [135]\(\checkmark\) \(\checkmark\) 
[6], [8], [138], [184]\(\checkmark\)\(\checkmark\)\(\checkmark\) 
[236]\(\checkmark\)\(\checkmark\)\(\checkmark\)\(\checkmark\)
Table 4. Classification of Schema Design Studies in SQL Databases by Physical Structures
Conversely, due to the flexibility of NoSQL data models, schema design has been studied at all three levels of representation – conceptual, logical, and physical. Table 5 shows source model, target model, and the level of representation investigated in our primary studies. Forward methods start with a conceptual model or relational database and infer logical or physical models from them. Reverse methods work in the opposite direction: they either extract the conceptual model from NoSQL physical schemas [2, 3, 32, 55] or from the stored data in the database [61, 121], or they extract the physical or logical model from the stored data in the database [4, 20, 33, 84, 131, 196, 221].
Table 5.
MethodStudySource ModelTarget ModelLevel of Representation
Forward[1]UMLGeneric logical model, Physical models (Column, Document, Graph)Logical, Physical
[65]UML/OCLMulti-store (Relational, Document, Graph)Logical, Physical
[144], [218]RDBColumnLogical
[149]UMLColumnPhysical
[5]RDB/RDFGraphPhysical
[66]UML/OCLGraphLogical, Physical
[125]ERGraph, DocumentLogical, Physical
[15]RDBColumn, Document, GraphLogical
[31]DW Relational SchemaColumnLogical
[10, 209]Multidimensional conceptual modelGraphLogical
[53], [230]Multidimensional conceptual modelDocument, ColumnLogical
[51], [72]Multidimensional conceptual modelColumnLogical
[73]Multidimensional conceptual modelKey-ValueLogical, Physical
[52], [54]Multidimensional conceptual modelDocumentLogical
[29]ERColumnLogical
[177]RDBColumnLogical, Physical
[164]Entity Graph (restricted type of ER)ColumnLogical, Physical
[50], [167]ERColumnLogical, Physical
[71]EER (Extended ER)DocumentLogical, Physical
[11]EER (Extended ER)GraphLogical, Physical
[116], [104], [173]ERDocumentLogical, Physical
[89]ERPolystore (Relational, NoSQL models)Physical
[70]UMLColumn, DocumentLogical, Physical
[18]UMLGeneric logical model, Physical models (Column, Document, Key-value)Logical, Physical
[158]UMLGeneric logical model (Relational, NoSQL models)Logical
[31]Relational DWColumnLogical, Physical
[162]Multidimensional galaxy model of DocWDocumentLogical, Physical
[161]Multidimensional of DocWColumnLogical, Physical
Reverse[61]Graph DBEER Conceptual ModelLogical, Conceptual
[85]Graph DBJSON Document SchemaPhysical
[83]Column DBJSON Document SchemaPhysical
[141], [106]Graph DBProperty Graph SchemaLogical, Physical
[121]JSON Document DBUML Conceptual ModelPhysical, Conceptual
[55]Document SchemaEntity Differentiation Conceptual modelConceptual
[20], [131], [4], [33], [221]JSON Document DBDocument SchemaPhysical
[196]Aggregate-oriented NoSQL DBLogical Schemas (UML model of EMF)Logical
[2], [3], [32]Document Schema (Physical model)UML Conceptual ModelConceptual
[84]JSON and Extended JSON DBDocument SchemaPhysical
[142]Polystore (Relational, Key/value, Column, Graph)Generic logical modelLogical
[60], [86]JSON Document DBMultidimensional Conceptual SchemaConceptual
Roundtrip[9]Graph DB EER Conceptual ModelConceptual, Logical, Physical
[39]Relational and NoSQL DBsGeneric Logical Model (EMF model)Logical
[168]JSON Document DBPhysical and Logical modelsLogical, Physical
Table 5. Classification of Schema Design Studies in NoSQL Databases by Level of Representation
Regarding the roundtrip methods (last three rows in Table 5), the first work [9] presents a bidirectional transformation between a conceptual model and a NoSQL graph DB. Similarly, the work in [168] first presents the process of extracting the physical schema from the stored data in MongoDB and then defines a bidirectional transformation between logical and physical models. Finally, [39] proposes a unified logical metamodel, termed U-Schema, which represents a logical schema for a multi-model database supporting both relational and the four most popular NoSQL systems (columnar, document, key–value, and graph). First, a MapReduce operation infers the logical data model for each NoSQL paradigm. Then, a round-trip strategy is defined, consisting of forward and reverse mappings between the inferred logical schema and the U-Schema.

8.2 Methods used for Schema Design and Tuning Process

Cost model and search strategy are the two main factors when choosing a physical design for SQL databases. Table 6 shows the different methods used in our primary studies for cost model and search strategy. The methods that can be used for the cost model are stand-alone, optimizer, and learned cost models. Regarding the search strategy, early work used exact approaches to select an optimal configuration. The exact solutions apply exhaustive search to consider all possible configurations; they are very time-consuming and hence impractical. Later studies shifted the focus toward determining a good (near-optimal) physical design instead of an optimal one using approximate solutions. Most works use heuristic solutions, greedy search, genetic algorithms, branch-and-bound, relaxation-based approaches, or hybrid approaches. Recently, advances in machine learning, especially in reinforcement learning and deep reinforcement learning, have motivated researchers to move toward learning-based approaches for finding a physical configuration with the lowest total cost.
Table 6.
    Tuning approach
StudyCost ModelSearch StrategyWorkloadStaticDynamic
     InternalExternalReactiveProactive
[16], [128], [178], [202], [214]Stand-aloneExact approachesOLTP\(\checkmark\)    
[21], [24], [30], [58], [102], [224], [103]Stand-aloneHeuristic solutionsOLTP\(\checkmark\)    
[81], [82]OptimizerHeuristic solutionsOLTP\(\checkmark\)    
[120]Stand-aloneKnapsack problemOLTP\(\checkmark\)    
[234]Stand-aloneBranch-and-boundOLTP\(\checkmark\)
[96], [97], [105]Stand-aloneGreedy approachOLAP\(\checkmark\)    
[23], [211], [217], [229]Stand-aloneHeuristic solutionsOLAP\(\checkmark\)    
[232]Stand-aloneGenetic algorithmOLAP\(\checkmark\)    
[139]Stand-aloneA* algorithmOLAP\(\checkmark\)    
[63], [80], [195]Stand-alone (knowledge-based approaches)Heuristic solutionsOLTP\(\checkmark\)    
[58], [59], [56]Combination of an optimizer-based and a knowledge-based approachHeuristic solutionsOLTP\(\checkmark\)    
[45]OptimizerGreedy approachOLTP\(\checkmark\)    
[6]OptimizerGreedy approachHybrid\(\checkmark\)    
[219]OptimizerKnapsack algorithmOLTP\(\checkmark\)    
[236]OptimizerHybrid algorithmHybrid\(\checkmark\)    
[192]Stand-aloneInteger Linear Programming (ILP)OLAP\(\checkmark\)    
[98]Stand-aloneGreedy algorithmOLAP\(\checkmark\)    
[233]Stand-aloneHybrid evolutionary approachOLAP\(\checkmark\)    
[127]Compression-aware optimizerGreedy approachOLTP\(\checkmark\)    
[193]Stand-aloneLinear programmingOLTP\(\checkmark\)    
[68]OptimizerBinary integer programming (BIP)OLTP\(\checkmark\)    
[203]OptimizerRecursive greedy approachOLTP\(\checkmark\)    
[37]OptimizerRelaxation-based approachOLTP \(\checkmark\) \(\checkmark\) 
[198]OptimizerGreedy approachOLTP  \(\checkmark\)\(\checkmark\) 
[204]OptimizerHeuristic solutionOLTP \(\checkmark\) \(\checkmark\) 
[112], [111]OptimizerMulti objective optimization techniqueOLTP  \(\checkmark\) \(\checkmark\)
[114], [113]Cracker-aware optimizerCracking algorithmOLTP \(\checkmark\) \(\checkmark\) 
[185]Several ML algorithms (for each action, one ML cost model)MCTS or RL methodsHybrid \(\checkmark\)  \(\checkmark\)
[136]Several ML algorithmsRecursively Linear ProgrammingHybrid \(\checkmark\)  \(\checkmark\)
[26], [179]Reinforcement learningModeling index selection as a MDP and solving by RLOLTP  \(\checkmark\)\(\checkmark\) 
[140]OptimizerDRLOLTP\(\checkmark\)    
[150]DRLModeling MV selection as a MDP and solving by DRLOLAP  \(\checkmark\)\(\checkmark\) 
[231]DRLModeling MV selection as an ILP problem, modeling ILP as an MDP, and solving DRLOLAP  \(\checkmark\)\(\checkmark\) 
Table 6. A Comparison of Existing Works in Schema Design for SQL Databases
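The stand-alone cost model plus greedy search combination that recurs in Table 6 can be illustrated with a toy index-selection sketch. All numbers, the storage sizes, and the cost function are invented for illustration; real advisors use far more elaborate cost models (often the query optimizer itself) and richer candidate structures.

```python
def workload_cost(indexes, workload):
    """Toy stand-alone cost model: a query is assumed to be 5x cheaper
    when an index exists on its filter column."""
    cost = 0.0
    for query in workload:
        covered = query["column"] in indexes
        cost += query["base_cost"] * (0.2 if covered else 1.0)
    return cost

def greedy_index_selection(candidates, workload, budget):
    """Greedily add the index with the largest cost reduction until the
    storage budget is exhausted or no index improves the workload cost."""
    chosen, used = set(), 0
    while True:
        best, best_gain = None, 0.0
        base = workload_cost(chosen, workload)
        for col, size in candidates.items():
            if col in chosen or used + size > budget:
                continue
            gain = base - workload_cost(chosen | {col}, workload)
            if gain > best_gain:
                best, best_gain = col, gain
        if best is None:
            return chosen
        chosen.add(best)
        used += candidates[best]

workload = [{"column": "a", "base_cost": 100}, {"column": "b", "base_cost": 10}]
candidates = {"a": 5, "b": 5, "c": 5}   # candidate index -> storage size
print(greedy_index_selection(candidates, workload, budget=5))  # picks {'a'}
```

Exact approaches would instead enumerate all subsets of candidates, which is exponential in their number; the greedy loop trades optimality for a polynomial number of cost-model evaluations, which is precisely the shift from the exact to the approximate solutions discussed above.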
Regarding NoSQL modeling, the approaches presented in existing works can be categorized into forward, reverse, or roundtrip engineering methods. While forward engineering methods go from a conceptual to a physical data model, reverse engineering methods move from an existing database (the stored data) to a conceptual model. Roundtrip engineering is based on a combination of forward and reverse engineering. The modeling approach used in existing work is illustrated in Table 7. Roundtrip modeling is the least explored method.
Table 7.
StudyNoSQL TypeLevel of Data ModelWorkloadModeling Approach Tuning Approach
    ForwardReverseRoundtrip StaticDynamic
[1]Column, Document, GraphMulti-modelOLTP\(\checkmark\)   \(\checkmark\) 
[15]Column, Document, GraphSingle-modelOLTP\(\checkmark\)   \(\checkmark\) 
[65]Document, GraphMulti-model (Multistore)OLTP\(\checkmark\)   \(\checkmark\) 
[125]Document, GraphSingle-modelOLTP\(\checkmark\)   \(\checkmark\) 
[149], [144], [177]ColumnSingle-modelOLTP\(\checkmark\)   \(\checkmark\) 
[104]DocumentSingle-modelOLTP\(\checkmark\)   \(\checkmark\) 
[66], [5], [11]GraphSingle-modelOLTP\(\checkmark\)   \(\checkmark\) 
[72], [51], [31], [199]ColumnSingle-modelOLAP\(\checkmark\)   \(\checkmark\) 
[230], [53]Column, DocumentSingle-modelOLAP\(\checkmark\)   \(\checkmark\) 
[52], [54]DocumentSingle-modelOLAP\(\checkmark\)   \(\checkmark\) 
[209], [10]GraphSingle-modelOLAP\(\checkmark\)   \(\checkmark\) 
[73]key-valueSingle-modelOLAP\(\checkmark\)   \(\checkmark\) 
[218], [29], [164], [50], [70]ColumnSingle-modelOLTP\(\checkmark\)   \(\checkmark\) 
[71], [117], [173], [116]DocumentSingle-modelOLTP\(\checkmark\)   \(\checkmark\) 
[18], [19]key-value, Column, DocumentMulti-modelOLTP\(\checkmark\)   \(\checkmark\) 
[158]key-value, Column, Document, GraphMulti-modelOLTP\(\checkmark\)   \(\checkmark\) 
[167]ColumnSingle-modelOLTP\(\checkmark\)    \(\checkmark\)
[89]key-value, Column, Document, GraphMulti-model (Polystore)OLTP\(\checkmark\)   \(\checkmark\) 
[161]ColumnSingle-modelOLAP (DocW)\(\checkmark\)   \(\checkmark\) 
[162]DocumentSingle-modelOLAP (DocW)\(\checkmark\)   \(\checkmark\) 
[42]GraphSingle-modelOLAP\(\checkmark\)   \(\checkmark\) 
[131], [20], [221], [55], [121], [3], [2], [4], [32], [84], [33]DocumentSingle-modelOLTP \(\checkmark\)  \(\checkmark\) 
[196]Document, Key-value, ColumnSingle-modelOLTP \(\checkmark\)  \(\checkmark\) 
[61], [85], [141]GraphSingle-modelOLTP \(\checkmark\)  \(\checkmark\) 
[83]ColumnSingle-modelOLTP \(\checkmark\)  \(\checkmark\) 
[142]key-value, Column, GraphMulti-model (Polystore)OLTP \(\checkmark\)  \(\checkmark\) 
[60], [86]DocumentSingle-modelOLAP \(\checkmark\)  \(\checkmark\) 
[101]Document, ColumnMulti-model (Polystore)OLAP \(\checkmark\)  \(\checkmark\) 
[168]DocumentSingle-modelOLTP  \(\checkmark\) \(\checkmark\) 
[9]GraphSingle-modelOLTP  \(\checkmark\) \(\checkmark\) 
[39]Document, Column, Graph, Key-valueMulti-modelOLTP  \(\checkmark\) \(\checkmark\) 
Table 7. A Comparison of Existing Work in Schema Design of NoSQL Data Stores

8.3 Taxonomy of Schema Design and Tuning Approaches

The different classifications of the numerous approaches for database schema design not only reveal a comprehensive overview of the significant facets and aspects of this research field, but they also provide a framework for identifying open challenges, research gaps, and promising directions for future research. Several works have already discussed different aspects of database schema design. For the automated physical schema design of relational databases, Chaudhuri and Narasayya [47] discuss the two cost models, stand-alone and optimizer, as well as offline and online tuning approaches. However, they only address a small number of publications, released until 2007. On the other hand, Störl et al. [216] focus more on the approaches of NoSQL modeling and classify them into forward and reverse engineering. Our article unifies these classifications into a taxonomy and introduces new facets to fill in the gaps.
Using our taxonomy in Figure 5, Table 6 and Table 7 compare the state-of-the-art in schema design for SQL and NoSQL databases, respectively. The following observations can be drawn from the two tables: (a) As evident from the columns Cost Model and Search Strategy of Table 6, the current research trend is toward learning-based approaches, especially reinforcement learning and deep reinforcement learning, for finding optimized physical configurations in relational databases. This is because learning-based approaches can potentially provide faster approximations for optimization problems. (b) While research interest in polystores and multistores is growing, research on schema design for such systems is still at an early stage. First articles [65, 101, 142] outline potential solutions. (c) Online approaches are recommended for providing fully automated solutions that meet the goals of an autonomic database system. There is extensive work on online physical design tuning in relational databases, yet research on NoSQL online schema tuning remains an open task. A first article [167] sketches ideas for potential solutions.

8.4 Additional Results

As additional results, we incorporate contributions from the industrial area, provide examples of real-world applications (e.g., cloud environments and large-scale data systems), and explore interactions of the schema design problem with other fields.

8.4.1 Schema Design Efforts in the Industrial Area.

As mentioned in Section 5.3, all major commercial database vendors ship automated physical design tools with relational databases, e.g., DTA in Microsoft SQL Server [6], the DB2 Design Advisor [236, 237], and the SQL Access Advisor in Oracle [64]. There also exist several tools for database schema design in industrial applications, such as Hackolade, erwin, and ER/Studio. Hackolade is a multi-platform schema design tool for NoSQL databases, which offers graphical schema design, physical data modeling, and forward and reverse engineering. Currently supported databases include Apache Cassandra, Apache HBase, and ScyllaDB among column databases; MongoDB, Couchbase, CouchDB, Elasticsearch, Firebase Realtime Database, and Google Cloud Firestore among document stores; Neo4j and TinkerPop among graph databases; and DynamoDB among key-value stores. The popular relational modeling tools erwin and ER/Studio are expanding toward NoSQL stores. erwin Data Modeler now supports Apache Cassandra, MongoDB, and Couchbase as target NoSQL databases, allowing schema forward and reverse engineering. The ER/Studio tool supports schema forward and reverse engineering for MongoDB.

8.4.2 Cloud Environments.

Offline schema tuning approaches rely on DBAs, which is expensive and does not adapt to cloud computing platforms because of their scale and complexity. Chaudhuri et al. [48] argue that DBA expenses are the key factor in the total cost of ownership. These costs can be further increased by the higher complexity of schema tuning tasks in cloud environments, caused by non-stable workloads and a lack of domain knowledge and application context [67]. Therefore, fully automatic schema tuning for cloud platforms is an urgent research issue [67, 145, 146, 176, 185].

8.4.3 Large-scale Data Systems.

In large-scale systems with thousands of queries in the workload and hundreds of tables, schema selection approaches often face specific challenges, such as solution constraints (e.g., a memory budget), reconfiguration costs, and prohibitive runtimes. Some studies have presented scalable and efficient approaches for such systems. For example, the DTA tool [6] scales to large databases and workloads using techniques such as workload compression and reduced statistics. The methods in [38, 68, 181, 203] significantly speed up what-if optimization by decreasing the number of what-if optimizer calls. The studies in [136, 146, 185] propose autonomous database systems based on machine learning methods that are applicable to large-scale problems. Only a few works [18, 20, 39] investigate scalability requirements for schema design in NoSQL databases. For instance, the works in [20, 39] use MapReduce to tackle scalability issues.

8.4.4 Exploring Broader Connections.

The importance of automated schema design goes beyond SQL and NoSQL databases. For instance, schema extraction from a wide variety of heterogeneous data sources is a crucial element in data lakes, as the source data needs to be described by a data model to capture its semantics. To support the “load-first, schema-later” paradigm, assessing data quality and cleaning large volumes of heterogeneous data sources are essential tasks in unveiling the value of big data [79]. The quality of data can refer to the extension of the data (i.e., data values) or to the intension of the data (i.e., the schema) [27]. Schema extraction from raw data may be intertwined with the preparation for operations such as discovery, integration, and cleaning [79, 99, 182]. For example, CLAMS transforms heterogeneous data in a data lake into a unified data model and enforces quality constraints on this model for cleaning purposes [79]. Several data quality verification and data cleaning methods are based on reverse engineering [190, 201].
As another example, [22] proposes an ontology-driven meta-model to conceptualize data representation in heterogeneous databases, providing a common conceptual abstraction based on semantically enriched vocabularies for both NoSQL and SQL databases, from which logical or physical schemas can be generated.

9 Limitations and Threats to Validity

The validity of a systematic review is largely defined by the extent to which the design and conduct of the study are likely to be true and free from bias. Here we briefly discuss potential threats to the validity of our work as well as actions we took to minimize these risks. We followed the methodology of Kitchenham [129, 130] to minimize bias and maximize the validity of the systematic review. Based on this methodology, this study is subject to the following limitations:
Construct Validity. One of the most common threats of systematic reviews is to ensure that all relevant academic articles are covered. In order to minimize this threat, our search for primary studies was driven by a deterministic list of inclusion and exclusion criteria and based on a set of specified research questions. Furthermore, we have defined the keywords using relevant terms based on the research questions to determine a reliable search string, and we used well-known data sources to find candidates for primary studies (cf. Section 3.1.3).
Internal Validity. This is the extent to which the design and conduct of the study are likely to prevent systematic errors [130]. We designed our methodology (cf. Section 3.2) to minimize systematic errors in the data extraction process. To ensure coherent and correct data extraction, we supplemented our core methodology with thorough discussions among all authors of this article.
External Validity. This is the extent to which the effects observed in the study are applicable outside of the study [130]. In this context, we strongly believe that both industrial and scientific communities in the area of databases and, more specifically, schema design and tuning, will largely benefit from this systematic review and the introduced taxonomy.

10 Conclusions and Future Trends

Automated database schema design has attracted a broad range of interests from academia and industry. We performed a systematic literature review to accurately present the state-of-the-art and to build an understanding of different solutions in the field of automated database schema design. For this, 75 primary studies for NoSQL databases and 108 primary studies for SQL databases from 1st January 1970 to 31st December 2021 were analyzed. The number of identified documents indicates that there has been less research on this topic for NoSQL databases. This is due to the fact that research on NoSQL automatic schema design first appeared in 2010.
The main findings in this review show that there is extensive work on automated physical schema design for SQL databases, and tools are available in commercial DBMSs. Recent research points toward the development of self-tuning database systems, which automatically configure and optimize the model in response to changes in the workload and the data. Such systems automatically tune various configuration features, such as physical database design, knob configurations, or hardware resources. Our main conclusion with respect to schema design in NoSQL databases is that there is no consensus yet on a standard to model NoSQL databases, and only a few contributions from industry have been made so far. An emerging research trend is to shift the focus toward global and generic schema design in polystores and multistores that handle heterogeneous data.
Based on our investigation and insights, we identify various directions for future research:
Online approaches for full autonomic schema tuning. Online approaches provide fully automated solutions that meet the goals of an autonomic database system. They are able to track changes in both the workload and the database state, and they make online decisions to modify the current schema design as needed. Contrary to SQL databases, little work has been done on NoSQL online schema design. Thus, there is great potential in closing this gap for NoSQL data stores.
Handling overhead and overreaction in online approaches. Online approaches aim at reducing the maintenance overhead for DBAs, but they impose a processing overhead on the database system. Reducing this overhead is therefore critical for self-tuning databases. On the other hand, online approaches should be robust against temporary variations in the workload and database state, and should prevent unnecessary reactions to these variations. Therefore, another challenge in self-tuning databases is the development of approaches that are able to accurately predict the effect of re-tuning and avoid overreaction in the presence of temporary variations.
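To illustrate the overreaction concern, the following is a minimal sketch (not taken from any surveyed system; all names and thresholds are illustrative assumptions) of a tuning controller that triggers re-tuning only when workload drift is sustained over several consecutive observations, so that temporary variations are damped:

```python
from collections import deque

class RetuningController:
    """Triggers schema re-tuning only when workload drift is sustained,
    damping overreaction to temporary variations (illustrative sketch)."""

    def __init__(self, drift_threshold=0.3, window=5):
        self.drift_threshold = drift_threshold  # relative cost drift that counts as change
        self.window = window                    # consecutive drifting observations required
        self.recent = deque(maxlen=window)      # sliding window of drift flags

    def observe(self, baseline_cost, current_cost):
        """Record one workload-cost observation; return True if re-tuning should start."""
        drift = abs(current_cost - baseline_cost) / max(baseline_cost, 1e-9)
        self.recent.append(drift > self.drift_threshold)
        # Re-tune only if the entire window shows sustained drift.
        return len(self.recent) == self.window and all(self.recent)
```

A single cost spike fills only part of the window and is ignored; only drift that persists across the full window triggers the (expensive) re-tuning step.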
Proactive approaches. Proactive schema design approaches rely on change prediction to determine when a tuning process should be applied. A major challenge is to accurately predict the change point, and to adapt the current database schema accordingly. Machine learning models can be applied to accurately predict when a change will occur. While some proactive schema tuning strategies exist for SQL databases, so far there is no such work for NoSQL databases.
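As a toy illustration of change-point detection on a workload statistic (e.g., the fraction of write operations), the following is a minimal one-sided CUSUM-style detector; it is a generic sketch under assumed parameters, not a method from any surveyed paper:

```python
def cusum_change_point(values, target, slack=0.05, threshold=0.5):
    """Return the index at which the cumulative upward drift of `values`
    away from `target` first exceeds `threshold`, or None if no change.
    `slack` tolerates small fluctuations around the target (CUSUM sketch)."""
    s = 0.0
    for i, v in enumerate(values):
        # Accumulate only the excess above target + slack; reset at zero.
        s = max(0.0, s + (v - target - slack))
        if s > threshold:
            return i
    return None
```

In a proactive tuner, the detected index would mark the point at which a workload shift becomes significant enough to start preparing a new schema; learned models can replace this simple statistic with an actual forecast of the shift.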
Generic schema design for multi-model databases. Multi-model database systems support more than one data model. To integrate concepts from several data models while offering a uniform representation, a store-independent global schema is required.

Acknowledgments

We would like to thank André Conrad for his thoughtful comments.

References

[1]
Fatma Abdelhédi, Amal Ait Brahim, Faten Atigui, and Gilles Zurfluh. 2017. UMLtoNoSQL: Automatic transformation of conceptual schema to NoSQL databases. In Proceedings of the AICCSA. IEEE Computer Society, 272–279.
[2]
Fatma Abdelhédi, Amal Ait Brahim, Rabah Tighilt Ferhat, and Gilles Zurfluh. 2020. Discovering of a conceptual model from a NoSQL database. In Proceedings of the ICEIS. SCITEPRESS, 61–72.
[3]
Fatma Abdelhédi, Amal Ait Brahim, Rabah Tighilt Ferhat, and Gilles Zurfluh. 2020. Reverse engineering approach for NoSQL databases. In Proceedings of the DaWaK(LNCS, Vol. 12393). Springer, 60–69.
[4]
Fatma Abdelhédi, Amal Ait Brahim, Hela Rajhi, Rabah Tighilt Ferhat, and Gilles Zurfluh. 2021. Automatic extraction of a document-oriented NoSQL schema. In Proceedings of the ICEIS. SCITEPRESS, 192–199.
[5]
Dippy Aggarwal and Karen C. Davis. 2016. Employing graph databases as a standardization model towards addressing heterogeneity. In Proceedings of the IRI. IEEE Computer Society, 198–207.
[6]
Sanjay Agrawal, Surajit Chaudhuri, Lubor Kollár, Arunprasad P. Marathe, Vivek R. Narasayya, and Manoj Syamala. 2004. Database tuning advisor for microsoft SQL server 2005. In Proceedings of the VLDB. 1110–1121.
[7]
Sanjay Agrawal, Surajit Chaudhuri, and Vivek R. Narasayya. 2000. Automated selection of materialized views and indexes in SQL databases. In Proceedings of the VLDB. 496–505.
[8]
Sanjay Agrawal, Vivek R. Narasayya, and Beverly Yang. 2004. Integrating vertical and horizontal partitioning into automated physical database design. In Proceedings of the SIGMOD. ACM, 359–370.
[9]
Jacky Akoka and Isabelle Comyn-Wattiau. 2018. Roundtrip engineering of NoSQL databases. Enterprise Modelling and Information Systems Architectures 13, Special (2018), 281–292.
[10]
Jacky Akoka, Isabelle Comyn-Wattiau, Cédric du Mouza, and Nicolas Prat. 2021. Mapping multidimensional schemas to property graph models. In Proceedings of the ER (Workshops)(LNCS, Vol. 13012). Springer, 3–14.
[11]
Jacky Akoka, Isabelle Comyn-Wattiau, and Nicolas Prat. 2017. A four v’s design approach of NoSQL graph databases. In Proceedings of the ER Workshops(LNCS, Vol. 10651). Springer, 58–68.
[12]
Alibaba. 2020. Alibaba self-driving database. Retrieved March 1, 2021 from https://www.alibabacloud.com/product/das/
[13]
Ana Carolina Almeida, Fernanda Baião, Sérgio Lifschitz, Daniel Schwabe, and Maria Luiza Machado Campos. 2021. Tun-OCM: A model-driven approach to support database tuning decision making. Decis. Support Syst. 145 (2021), 113538.
[14]
Ana Carolina Almeida, Angelo Brayner, José Maria Monteiro, Sérgio Lifschitz, and Rafael Pereira de Oliveira. 2016. Automatic physical design tuning based on hypothetical plans. In Proceedings of the SBBD. SBC, 115–120.
[15]
Obaid Alotaibi and Eric Pardede. 2019. Transformation of schema from relational database (RDB) to NoSQL databases. Data 4, 4 (2019), 148.
[16]
Henry D. Anderson and P. Bruce Berra. 1977. Minimum cost selection of secondary indexes for formatted files. ACM Transactions on Database Systems 2, 1 (1977), 68–90.
[17]
Chaimae Asaad and Karim Baïna. 2018. NoSQL databases – seek for a design methodology. In Proceedings of the International Conference on Model and Data Engineering. Springer, 25–40.
[18]
Paolo Atzeni, Francesca Bugiotti, Luca Cabibbo, and Riccardo Torlone. 2020. Data modeling in the NoSQL world. Computer Standards & Interfaces 67 (2020), 103149.
[19]
Paolo Atzeni, Francesca Bugiotti, and Luca Rossi. 2014. Uniform access to NoSQL systems. Inf. Syst. 43 (2014), 117–133.
[20]
Mohamed Amine Baazizi, Dario Colazzo, Giorgio Ghelli, and Carlo Sartiani. 2019. Parametric schema inference for massive JSON datasets. VLDB J. 28, 4 (2019), 497–521.
[21]
R. Bonanno, Dario Maio, and Paolo Tiberio. 1985. An approximation algorithm for secondary index selection in relational database physical design. Comput. J. 28, 4 (1985), 398–405.
[22]
Shreya Banerjee and Anirban Sarkar. 2016. Ontology driven meta-modeling for NoSQL databases: A conceptual perspective. International Journal of Software Engineering and Its Applications 10, 12 (2016), 41–64.
[23]
Elena Baralis, Stefano Paraboschi, and Ernest Teniente. 1997. Materialized view selection in a multidimensional database. In Proceedings of the VLDB. 156–165.
[24]
Elena Barcucci, Renzo Pinzani, and Renzo Sprugnoli. 1990. Optimal selection of secondary indexes. IEEE Trans. Software Eng. 16, 1 (1990), 32–38.
[25]
Francesco Basciani, Juri Di Rocco, Davide Di Ruscio, Alfonso Pierantonio, and Ludovico Iovino. 2020. TyphonML: A modeling environment to develop hybrid polystores. In Proceedings of the MoDELS (Companion). ACM, 2:1–2:5.
[26]
Debabrota Basu, Qian Lin, Weidong Chen, Hoang Tam Vo, Zihong Yuan, Pierre Senellart, and Stéphane Bressan. 2016. Regularized cost-model oblivious database tuning with reinforcement learning. Trans. Large Scale Data Knowl. Centered Syst. 28 (2016), 96–132.
[27]
Carlo Batini, Cinzia Cappiello, Chiara Francalanci, and Andrea Maurino. 2009. Methodologies for data quality assessment and improvement. ACM Comput. Surv. 41, 3 (2009), 16:1–16:52.
[28]
Peter Belknap, Benoît Dageville, Karl Dias, and Khaled Yagoub. 2009. Self-tuning for SQL performance in oracle database 11g. In Proceedings of the ICDE. IEEE Computer Society, 1694–1700.
[29]
David Bermbach, Steffen Müller, Jacob Eberhardt, and Stefan Tai. 2015. Informed schema design for column store-based database services. In Proceedings of the SOCA. IEEE Computer Society, 163–172.
[30]
Flavio Bonfatti, Dario Maio, and Paolo Tiberio. 1983. A separability-based method for secondary index selection in physical database design. In Proceedings of the Methodology and Tools for Data Base Design. North-Holland, 149–160.
[31]
Mohamed Boussahoua, Omar Boussaid, and Fadila Bentayeb. 2017. Logical schema for data warehouse on column-oriented NoSQL databases. In Proceedings of the DEXA(LNCS, Vol. 10439). Springer, 247–256.
[32]
Amal Ait Brahim, Rabah Tighilt Ferhat, and Gilles Zurfluh. 2019. Extraction process of conceptual model from a document-oriented NoSQL database. In Proceedings of the KSE. IEEE, 1–5.
[33]
Amal Ait Brahim, Rabah Tighilt Ferhat, and Gilles Zurfluh. 2019. Model driven extraction of NoSQL databases schema: Case of MongoDB. In Proceedings of the KDIR. 145–154.
[34]
Nicolas Bruno. 2011. Automated Physical Database Design and Tuning. CRC Press.
[35]
Nicolas Bruno and Surajit Chaudhuri. 2005. Automatic physical database tuning: A relaxation-based approach. In Proceedings of the SIGMOD. ACM, 227–238.
[36]
Nicolas Bruno and Surajit Chaudhuri. 2006. To tune or not to tune? A lightweight physical design alerter. In Proceedings of the VLDB. ACM, 499–510.
[37]
Nicolas Bruno and Surajit Chaudhuri. 2007. An online approach to physical design tuning. In Proceedings of the ICDE. IEEE, 826–835.
[38]
Nicolas Bruno and Rimma V. Nehme. 2008. Configuration-parametric query optimization for physical design tuning. In Proceedings of the SIGMOD. ACM, 941–952.
[39]
Carlos Javier Fernández Candel, Diego Sevilla Ruiz, and Jesús Joaquín García Molina. 2021. A unified metamodel for NoSQL and relational databases. Inf. Syst. 104 (2021), 101898.
[40]
Alberto Caprara, Matteo Fischetti, and Dario Maio. 1995. Exact and approximate algorithms for the index selection problem in physical database design. IEEE Trans. Knowl. Data Eng. 7, 6 (1995), 955–967.
[41]
Alberto Caprara and Juan José Salazar González. 1999. Separating lifted odd-hole inequalities to solve the index selection problem. Discrete Applied Mathematics 92, 2-3 (1999), 111–134.
[42]
Arnaud Castelltort and Anne Laurent. 2014. NoSQL graph-based OLAP analysis. In Proceedings of the KDIR. SciTePress, 217–224.
[43]
Surajit Chaudhuri, Benoît Dageville, and Guy M. Lohman. 2004. Self-managing technology in database management systems. In Proceedings of the VLDB. 1243.
[44]
Surajit Chaudhuri, Arnd Christian König, and Vivek R. Narasayya. 2004. SQLCM: A continuous monitoring framework for relational database engines. In Proceedings of the ICDE. IEEE, 473–484.
[45]
Surajit Chaudhuri and Vivek R. Narasayya. 1997. An efficient cost-driven index selection tool for microsoft SQL server. In Proceedings of the VLDB. 146–155.
[46]
Surajit Chaudhuri and Vivek R. Narasayya. 1998. AutoAdmin ’what-if’ index analysis utility. In Proceedings of the SIGMOD. ACM, 367–378.
[47]
Surajit Chaudhuri and Vivek R. Narasayya. 2007. Self-tuning database systems: A decade of progress. In Proceedings of the VLDB. ACM, 3–14.
[48]
Surajit Chaudhuri and Gerhard Weikum. 2018. Self-management technology in databases. In Proceedings of the Encyclopedia of Database Systems. Springer.
[49]
Surajit Chaudhuri and Gerhard Weikum. 2006. Foundations of automated database tuning. In Proceedings of the ICDE. 104.
[50]
Artem Chebotko, Andrey Kashlev, and Shiyong Lu. 2015. A big data modeling methodology for apache cassandra. In Proceedings of the BigData Congress. IEEE Computer Society, 238–245.
[51]
Max Chevalier, Mohammed El Malki, Arlind Kopliku, Olivier Teste, and Ronan Tournier. 2015. Implementation of multidimensional databases in column-oriented NoSQL systems. In Proceedings of the ADBIS. Springer, 79–91.
[52]
Max Chevalier, Mohammed El Malki, Arlind Kopliku, Olivier Teste, and Ronan Tournier. 2015. Implementation of multidimensional databases with document-oriented NoSQL. In Proceedings of the DaWaK(LNCS, Vol. 9263). Springer, 379–390.
[53]
Max Chevalier, Mohammed El Malki, Arlind Kopliku, Olivier Teste, and Ronan Tournier. 2015. Implementing multidimensional data warehouses into NoSQL. In Proceedings of the ICEIS. SciTePress, 172–183.
[54]
Max Chevalier, Mohammed El Malki, Arlind Kopliku, Olivier Teste, and Ronan Tournier. 2016. Document-oriented models for data warehouses - NoSQL document-oriented for data warehouses. In Proceedings of the ICEIS. SciTePress, 142–149.
[55]
Alberto Hernández Chillón, Diego Sevilla Ruiz, Jesus García Molina, and Severino Feliciano Morales. 2019. A model-driven approach to generate schemas for object-document mappers. IEEE Access 7 (2019), 59126–59142.
[56]
Ramtjander Choenni. 1995. On the Automation of Physical Database Design. Ph. D. Dissertation. University of Twente.
[57]
Sunil Choenni, Henk Blanken, and Thiel Chang. 1993. Index selection in relational databases. In Proceedings of the ICCI. IEEE, 491–496.
[58]
Sunil Choenni, Henk M. Blanken, and Thiel Chang. 1993. On the automation of physical database design. In Proceedings of the 1993 ACM/SIGAPP Symposium on Applied Computing: States of the Art and Practice. 358–367.
[59]
Sunil Choenni, Henk Wagterveld, Henk M. Blanken, and Thiel Chang. 1995. TOPYDE: A tool for physical database design. In Proceedings of the International Conference on Database and Expert Systems Applications. Springer, 502–511.
[60]
Mohamed L. Chouder, Stefano Rizzi, and Rachid Chalal. 2019. EXODuS: Exploratory OLAP over document stores. Inf. Syst. 79 (2019), 44–57.
[61]
Isabelle Comyn-Wattiau and Jacky Akoka. 2017. Model driven reverse engineering of NoSQL property graph databases: The case of Neo4j. In Proceedings of the BigData. IEEE, 453–458.
[62]
Pavel Čontoš and Martin Svoboda. 2020. JSON schema inference approaches. In Proceedings of the ER. Springer, 173–183.
[63]
Christopher E. Dabrowski, David K. Jefferson, John V. Carlis, and Salvatore T. March. 1989. Integrating a knowledge-based component into a physical database design system. Inf. Manag. 17, 2 (1989), 71–86.
[64]
Benoît Dageville, Dinesh Das, Karl Dias, Khaled Yagoub, Mohamed Zaït, and Mohamed Ziauddin. 2004. Automatic SQL tuning in oracle 10g. In Proceedings of the VLDB. 1098–1109.
[65]
Gwendal Daniel, Abel Gómez, and Jordi Cabot. 2019. UMLto[No]SQL: Mapping conceptual schemas to heterogeneous datastores. In Proceedings of the RCIS. IEEE, 1–13.
[66]
Gwendal Daniel, Gerson Sunyé, and Jordi Cabot. 2016. UMLtoGraphDB: Mapping conceptual schemas to graph databases. In Proceedings of the ER. Springer, 430–444.
[67]
Sudipto Das, Miroslav Grbic, Igor Ilic, Isidora Jovandic, Andrija Jovanovic, Vivek R. Narasayya, Miodrag Radulovic, Maja Stikic, Gaoxiang Xu, and Surajit Chaudhuri. 2019. Automatically indexing millions of databases in microsoft azure SQL database. In Proceedings of the SIGMOD. ACM, 666–679.
[68]
Debabrata Dash, Neoklis Polyzotis, and Anastasia Ailamaki. 2011. Cophy: A scalable, portable, and interactive index advisor for large workloads. Proc. VLDB Endow. 4, 6 (2011), 362–372.
[69]
Ali Davoudian, Liu Chen, and Mengchi Liu. 2018. A survey on NoSQL stores. ACM Computing Surveys 51, 2 (2018), 1–43.
[70]
Alfonso de la Vega, Diego García-Saiz, Carlos Blanco, Marta E. Zorrilla, and Pablo Sánchez. 2018. Mortadelo: A model-driven framework for NoSQL database design. In Proceedings of the MEDI. Springer, 41–57.
[71]
Claudio de Lima and Ronaldo dos Santos Mello. 2015. A workload-driven logical design approach for NoSQL document databases. In Proceedings of the iiWAS. ACM, 73:1–73:10.
[72]
Khaled Dehdouh, Fadila Bentayeb, Omar Boussaid, and Nadia Kabachi. 2015. Using the column oriented NoSQL model for implementing big data warehouses. In Proceedings of the PDPTA’15. CSREA Press, 469–475.
[73]
Francesco Di Tria, Ezio Lefons, and Filippo Tangorra. 2014. Design process for big data warehouses. In Proceedings of the 2014 International Conference on Data Science and Advanced Analytics. IEEE, 512–518.
[74]
Markus Dreseler, Jan Kossmann, Martin Boissier, Stefan Klauck, Matthias Uflacker, and Hasso Plattner. 2019. Hyrise re-engineered: An extensible database system for research in relational in-memory data management. In Proceedings of the EDBT. 313–324.
[75]
Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magdalena Balazinska, Bill Howe, Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, and Stanley B. Zdonik. 2015. The BigDAWG polystore system. SIGMOD Rec. 44, 2 (2015), 11–16.
[76]
Gabriel Campero Durand, Marcus Pinnecke, Rufat Piriyev, Mahmoud Mohsen, David Broneske, Gunter Saake, Maya S. Sekeran, Fabián Rodriguez, and Laxmi Balami. 2018. GridFormation: Towards self-driven online data partitioning using reinforcement learning. In Proceedings of the aiDM@SIGMOD. ACM, 1:1–1:7.
[77]
Said Elnaffar, Wendy Powley, Darcy Benoit, and Pat Martin. 2003. Today’s DBMSs: How autonomic are they? In Proceedings of the DEXA Workshops. IEEE, 651–655. https://ieeexplore.ieee.org/abstract/document/1232095/
[78]
Bernd-Juergen Falkowski. 1992. Comments on an optimal set of indices for a relational database. IEEE Transactions on Software Engineering 18, 2 (1992), 168.
[79]
Mina H. Farid, Alexandra Roatis, Ihab F. Ilyas, Hella-Franziska Hoffmann, and Xu Chu. 2016. CLAMS: Bringing quality to data lakes. In Proceedings of the SIGMOD. ACM, 2089–2092.
[80]
Yishai A. Feldman and Jacob Reouven. 2003. A knowledge-based approach for index selection in relational databases. Expert Systems with Applications 25, 1 (2003), 15–37.
[81]
Sheldon Finkelstein, Mario Schkolnick, and Paolo Tiberio. 1988. Physical database design for relational databases. ACM Transactions on Database Systems 13, 1 (1988), 91–128.
[82]
Martin R. Frank, Edward R. Omiecinski, and Shamkant B. Navathe. 1992. Adaptive and automated index selection in RDBMS. In Proceedings of the International Conference on Extending Database Technology. Springer, 277–292.
[83]
Angelo Augusto Frozza, Eduardo Dias Defreyn, and Ronaldo dos Santos Mello. 2021. An approach for schema extraction of NoSQL columnar databases: The HBase case study. J. Inf. Data Manag. 12, 5 (2021), 384–395.
[84]
Angelo Augusto Frozza, Ronaldo dos Santos Mello, and Felipe de Souza da Costa. 2018. An approach for schema extraction of JSON and extended JSON document collections. In Proceedings of the IRI. IEEE, 356–363.
[85]
Angelo Augusto Frozza, Salomão Rodrigues Jacinto, and Ronaldo dos Santos Mello. 2020. An approach for schema extraction of NoSQL graph databases. In Proceedings of the IRI. IEEE, 271–278.
[86]
Enrico Gallinucci, Matteo Golfarelli, and Stefano Rizzi. 2019. Approximate OLAP of document-oriented databases: A variety-aware approach. Inf. Syst. 85 (2019), 114–130.
[87]
Christian Garcia-Arellano, Sam Lightstone, Guy M. Lohman, Volker Markl, and Adam J. Storm. 2006. Autonomic features of the IBM DB2 universal database for linux, UNIX, and windows. IEEE Trans. Syst. Man Cybern. Syst. 36, 3 (2006), 365–376.
[88]
Michael Gibas, Guadalupe Canahuate, and Hakan Ferhatosmanoglu. 2008. Online index recommendations for high-dimensional databases using query workloads. IEEE Trans. Knowl. Data Eng. 20, 2 (2008), 246–260.
[89]
Maxime Gobert, Loup Meurice, and Anthony Cleve. 2021. Conceptual modeling of hybrid polystores. In Proceedings of the ER. Springer, 113–122.
[90]
Goetz Graefe, Felix Halim, Stratos Idreos, Harumi Kuno, Stefan Manegold, and Bernhard Seeger. 2014. Transactional support for adaptive indexing. The VLDB Journal 23, 2 (2014), 303–328.
[91]
Goetz Graefe, Stratos Idreos, Harumi A. Kuno, and Stefan Manegold. 2010. Benchmarking adaptive indexing. In Proceedings of the TPCTC(LNCS, Vol. 6417). Springer, 169–184.
[92]
Goetz Graefe and Harumi Kuno. 2010. Adaptive indexing for relational keys. In Proceedings of the ICDEW. IEEE, 69–74.
[93]
Goetz Graefe and Harumi A. Kuno. 2010. Self-selecting, self-tuning, incrementally optimized indexes. In Proceedings of the EDBT(ACM International Conference Proceeding Series). ACM, 371–381.
[94]
Katarina Grolinger and Miriam AM Capretz. 2012. Autonomic database management: State of the art and future trends. In Proceedings of the CATA. 276–281. https://ir.lib.uwo.ca/electricalpub/23/
[95]
T. I. Gündem. 1999. Near optimal multiple choice index selection for relational databases. Computers & Mathematics with Applications 37, 2 (1999), 111–120.
[96]
Himanshu Gupta. 1997. Selection of views to materialize in a data warehouse. In Proceedings of the ICDT. Springer, 98–112.
[97]
Himanshu Gupta, Venky Harinarayan, Anand Rajaraman, and Jeffrey D. Ullman. 1997. Index selection for OLAP. In Proceedings of the ICDE. IEEE, 208–219.
[98]
Himanshu Gupta and Inderpal Singh Mumick. 2005. Selection of views to materialize in a data warehouse. IEEE Transactions on Knowledge and Data Engineering 17, 1 (2005), 24–43.
[99]
Rihan Hai, Christoph Quix, and Dan Wang. 2019. Relaxed functional dependency discovery in heterogeneous data lakes. In Proceedings of the ER(LNCS, Vol. 11788). Springer, 225–239.
[100]
Felix Halim, Stratos Idreos, Panagiotis Karras, and Roland H. C. Yap. 2012. Stochastic database cracking: Towards robust adaptive indexing in main-memory column-stores. Proc. VLDB Endow. 5, 6 (2012), 502–513.
[101]
Hamdi Ben Hamadou, Enrico Gallinucci, and Matteo Golfarelli. 2019. Answering GPSJ queries in a polystore: A dataspace-based approach. In Proceedings of the ER(LNCS, Vol. 11788). Springer, 189–203.
[102]
Michael Hammer and Arvola Chan. 1976. Index selection in a self-adaptive data base management system. In Proceedings of the SIGMOD. 1–8.
[103]
Michael Hammer and Bahram Niamir. 1979. A heuristic approach to attribute partitioning. In Proceedings of the SIGMOD. 93–101.
[104]
Shady Hamouda and Zurinahni Zainol. 2017. Document-oriented data schema for relational database migration to NoSQL. In Proceedings of the Innovate-Data. IEEE, 43–50.
[105]
Venky Harinarayan, Anand Rajaraman, and Jeffrey D. Ullman. 1996. Implementing data cubes efficiently. In Proceedings of the SIGMOD. ACM Press, 205–216.
[106]
Olaf Hartig and Jan Hidders. 2019. Defining schemas for property graphs by using the GraphQL schema definition language. In Proceedings of the GRADES/NDA@SIGMOD/PODS. ACM, 6:1–6:11.
[107]
Michael Hatzopoulos and J. Kollias. 1985. On the selection of a reduced set of indexes. Comput. J. 28, 4 (1985), 406–408.
[108]
Michael Hatzopoulos and John G. Kollias. 1984. On the optimal selection of multilist database structures. IEEE Transactions on Software Engineering 10, 6 (1984), 681–687.
[109]
Victor Herrero, Alberto Abelló, and Oscar Romero. 2016. NOSQL design for analytical workloads: Variability matters. In Proceedings of the ER(LNCS, Vol. 9974). 50–64.
[110]
Pedro Holanda, Mark Raasveldt, Stefan Manegold, and Hannes Mühleisen. 2019. Progressive indexes: Indexing for interactive data analysis. Proceedings of the VLDB Endowment 12, 13 (2019), 2366–2378.
[111]
Marc Holze. 2012. Self-Management Concepts for Relational Database Systems. Ph. D. Dissertation. University of Hamburg.
[112]
Marc Holze and Norbert Ritter. 2011. System models for goal-driven self-management in autonomic databases. Data & Knowledge Engineering 70, 8 (2011), 685–701.
[113]
Stratos Idreos, Martin L. Kersten, and Stefan Manegold. 2007. Database cracking. In Proceedings of the CIDR. 68–78. www.cidrdb.org.
[114]
Stratos Idreos, Martin L. Kersten, and Stefan Manegold. 2007. Updating a cracked database. In Proceedings of the SIGMOD. ACM, 413–424.
[115]
Stratos Idreos, Stefan Manegold, Harumi Kuno, and Goetz Graefe. 2011. Merging what’s cracked, cracking what’s merged: Adaptive indexing in main-memory column-stores. Proceedings of the VLDB Endowment 4, 9 (2011), 586–597.
[116]
Abdullahi Abubakar Imam, Shuib Basri, Rohiza Ahmad, Amirudin A. Wahab, María T. González-Aparicio, Luiz Fernando Capretz, Ammar K. Alazzawi, and Abdullateef O. Balogun. 2020. Dsp: Schema design for non-relational applications. Symmetry 12, 11 (2020), 1799.
[117]
Abdullahi Abubakar Imam, Shuib Basri, Rohiza Ahmad, Junzu Watada, and María T. González-Aparicio. 2018. Automatic schema suggestion model for NoSQL document-stores databases. Journal of Big Data 5, 1 (2018), 1–17.
[118]
Abdullahi Abubakar Imam, Shuib Basri, Rohiza Ahmad, Junzo Watada, María Teresa González Aparicio, and Malek Ahmad Almomani. 2018. Data modeling guidelines for NoSQL document-store databases. International Journal of Advanced Computer Science and Applications 9, 10 (2018), 544–555.
[119]
Maggie Yuet Lin Ip. 1981. An Approximation Algorithm for the Index Selection Problem. University of Regina.
[120]
Maggie Y. L. Ip, Lawrence V. Saxton, and Vijay V. Raghavan. 1983. On the selection of an optimal set of indexes. IEEE Transactions on Software Engineering 9, 2 (1983), 135–143.
[121]
Javier Luis Cánovas Izquierdo and Jordi Cabot. 2016. JSONDiscoverer: Visualizing the schema lurking behind JSON documents. Knowledge-Based Systems 103 (2016), 52–55.
[122]
Alekh Jindal and Jens Dittrich. 2011. Relax and let the database do the partitioning online. In Proceedings of the BIRTE(Lecture Notes in Business Information Processing, Vol. 126). Springer, 65–80.
[123]
Marcos Jota, Marlene Goncalves, and Ritces Parra. 2019. A physical design strategy on a NoSQL DBMS. In Proceedings of the Data Science. CRC Press, 69–93.
[124]
Donghe Kang, Ruochen Jiang, and Spyros Blanas. 2021. Jigsaw: A data storage and query processing engine for irregular table partitioning. In Proceedings of the SIGMOD. 898–911.
[125]
Karamjit Kaur and Rinkle Rani. 2013. Modeling and querying data in NoSQL databases. In Proceedings of the BigData. IEEE, 1–7.
[126]
Hideaki Kimura, George Huo, Alexander Rasin, Samuel Madden, and Stanley B. Zdonik. 2010. CORADD: Correlation aware database designer for materialized views and indexes. Proc. VLDB Endow. 3, 1 (2010), 1103–1113.
[127]
Hideaki Kimura, Vivek R. Narasayya, and Manoj Syamala. 2011. Compression aware physical database design. PVLDB 4, 10 (2011), 657–668.
[128]
W. F. King. 1974. On the Selection of Indices for a File. Research report RJ 1341. IBM, San Jose, CA.
[129]
Barbara Kitchenham, Pearl Brereton, David Budgen, Mark Turner, John Bailey, and Stephen Linkman. 2009. Systematic literature reviews in software engineering–a systematic literature review. Inf. Softw. Technol. 51, 1 (2009), 7–15.
[130]
B. Kitchenham and S. Charters. 2007. Guidelines for performing Systematic Literature Reviews in Software Engineering. Technical report, Ver. 2.3. School of Computer Science and Mathematics, Keele University, Keele, U.K. 57 pages.
[131]
Meike Klettke, Uta Störl, and Stefanie Scherzinger. 2015. Schema extraction and structural outlier detection for JSON-based NoSQL data stores. In Proceedings of the BTW(LNI, Vol. P-241). GI, 425–444.
[132]
Boyan Kolev, Patrick Valduriez, Carlyna Bondiombouy, Ricardo Jiménez-Peris, Raquel Pau, and José Pereira. 2016. CloudMdsQL: Querying heterogeneous cloud data stores with a common language. Distributed and Parallel Databases 34, 4 (2016), 463–503.
[133]
N. Korla. 2013. Cassandra data modeling: Practical considerations @ Netflix. Retrieved October 22, 2019 from http://www.slideshare.net/nkorla1share/cass-summit-3
[134]
Jan Kossmann, Stefan Halfpap, Marcel Jankrift, and Rainer Schlosser. 2020. Magic mirror in my hand, which is the best in the land? An experimental evaluation of index selection algorithms. Proc. VLDB Endow. 13, 11 (2020), 2382–2395.
[135]
Jan Kossmann and Rainer Schlosser. 2019. A framework for self-managing database systems. In Proceedings of the ICDEW. IEEE, 100–106.
[136]
Jan Kossmann and Rainer Schlosser. 2020. Self-driving database systems: A conceptual approach. Distributed and Parallel Databases 38, 4 (2020), 795–817.
[137]
Jozef Kratica, Ivana Ljubić, and Dušan Tošić. 2003. A genetic algorithm for the index selection problem. In Workshops on Applications of Evolutionary Computation. Springer, 280–290.
[138]
Darl Kuhn, Sam R. Alapati, and Bill Padfield. 2012. Sql access advisor. In Proceedings of the Expert Indexing in Oracle Database 11g. Springer, 233–248.
[139]
Wilburt Juan Labio, Dallan Quass, and Brad Adelberg. 1997. Physical database design for data warehouses. In Proceedings of the ICDE. IEEE, 277–288.
[140]
Hai Lan, Zhifeng Bao, and Yuwei Peng. 2020. An index advisor using deep reinforcement learning. In Proceedings of the CIKM. 2105–2108.
[141]
Hanâ Lbath, Angela Bonifati, and Russ Harmer. 2021. Schema inference for property graphs. In Proceedings of the EDBT. 499–504.
[142]
Éric Leclercq, Annabelle Gillet, Thierry Grison, and Marinette Savonnet. 2019. Polystore and tensor data model for logical data independence and impedance mismatch in big data analytics. Trans. Large Scale Data Knowl. Centered Syst. 42 (2019), 51–90.
[143]
Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How good are query optimizers, really? Proceedings of the VLDB Endowment 9, 3 (2015), 204–215.
[144]
Chongxin Li. 2010. Transforming relational database into HBase: A case study. In Proceedings of the ICSESS. IEEE, 683–687.
[145]
Feifei Li. 2019. Cloud-native database systems at alibaba: Opportunities and challenges. Proceedings of the VLDB Endowment 12, 12 (2019), 2263–2272.
[146]
Guoliang Li, Xuanhe Zhou, Ji Sun, Xiang Yu, Yue Han, Lianyuan Jin, Wenbo Li, Tianqing Wang, and Shifu Li. 2021. opengauss: An autonomous database system. Proceedings of the VLDB Endowment 14, 12 (2021), 3028–3042.
[147]
Liangzhe Li and Le Gruenwald. 2013. Self-managing online partitioner for databases (SMOPD): A vertical database partitioning system with a fully automatic online approach. In Proceedings of the IDEAS. ACM, 168–173.
[148]
Xiang Li, Zhiyi Ma, and Hongjie Chen. 2014. QODM: A query-oriented data modeling approach for NoSQL databases. In Proceedings of the 2014 IEEE Workshop on Advanced Research and Technology in Industry Applications. IEEE, 338–345.
[149]
Yan Li, Ping Gu, and Chao Zhang. 2014. Transforming UML class diagrams into HBase based on meta-model. In Proceedings of the 2014 International Conference on Information Science, Electronics and Electrical Engineering. IEEE, 720–724.
[150]
Xi Liang, Aaron J. Elmore, and Sanjay Krishnan. 2019. Opportunistic view materialization with deep reinforcement learning. CoRR abs/1903.01363 (2019).
[151]
Sam Lightstone and Bishwaranjan Bhattacharjee. 2004. Automated design of multidimensional clustering tables for relational databases. In Proceedings of the VLDB. 1170–1181.
[152]
Sam S. Lightstone, Toby J. Teorey, and Tom Nadeau. 2010. Physical Database Design: The Database Professional’s Guide to Exploiting Indexes, Views, Storage, and More. Morgan Kaufmann.
[153]
JH Liou and S. Bing Yao. 1977. Multi-dimensional clustering for data base organizations. Inf. Syst. 2, 4 (1977), 187–198.
[154]
Zhen Hua Liu and Dieter Gawlick. 2015. Management of flexible schema data in RDBMSs - opportunities and limitations for NoSQL -. In Proceedings of the CIDR. www.cidrdb.org.
[155]
Jiaheng Lu and Irena Holubová. 2019. Multi-model databases: A new journey to handle the variety of data. ACM Computing Surveys 52, 3 (2019), 1–38.
[156]
Lin Ma, Dana Van Aken, Ahmed Hefny, Gustavo Mezerhane, Andrew Pavlo, and Geoffrey J. Gordon. 2018. Query-based workload forecasting for self-driving database management systems. In Proceedings of the SIGMOD. ACM, 631–645.
[157]
Dario Maio, Claudio Sartori, and Maria Rita Scalas. 1987. A modular user-oriented decision support for physical database design. Decision Support Systems 3, 2 (1987), 155–163.
[158]
Jihane Mali, Faten Atigui, Ahmed Azough, and Nicolas Travers. 2020. ModelDrivenGuide: An approach for implementing NoSQL schemas. In Proceedings of the DEXA(LNCS, Vol. 12391). Springer, 141–151.
[159]
Volker Markl, Frank Ramsak, and Rudolf Bayer. 1999. Improving OLAP performance by multidimensional hierarchical clustering. In Proceedings of the IDEAS. IEEE Computer Society, 165–177.
[160]
Abdul Mateen, Basit Raza, Muhammad Sher, Mian Muhammad Awais, and Tauqeer Hussain. 2010. Evolution of autonomic database management systems. In Proceedings of the ICCAE. IEEE, 33–37.
[161]
Ines Ben Messaoud, Refka Ben Ali, and Jamel Feki. 2017. From document warehouse to column-oriented NoSQL document warehouse. In Proceedings of the ICSOFT. SciTePress, 85–94.
[162]
Ines Ben Messaoud, Abdulrahman A. Alshdadi, and Jamel Feki. 2021. Building a document-oriented warehouse using NoSQL. Int. J. Oper. Res. Inf. Syst. 12, 2 (2021), 33–54.
[163]
Michael J. Mior. 2018. Physical Design for Non-relational Data Systems. Ph. D. Dissertation. University of Waterloo.
[164]
Michael J. Mior, Kenneth Salem, Ashraf Aboulnaga, and Rui Liu. 2017. NoSE: Schema design for NoSQL applications. IEEE Trans. Knowl. Data Eng. 29, 10 (2017), 2275–2289.
[165]
Chuang-Hue Moh, Ee-Peng Lim, and Wee-Keong Ng. 2000. DTD-miner: A tool for mining DTD from XML documents. In Proceedings of the WECWIS. IEEE, 144–151.
[166]
Maryam Mozaffari, Eslam Nazemi, and Amir-Masoud Eftekhari-Moghadam. 2020. Feedback control loop design for workload change detection in self-tuning NoSQL wide column stores. Expert Syst. Appl. 142 (2020), 112973.
[167]
Maryam Mozaffari, Eslam Nazemi, and Amir-Masoud Eftekhari-Moghadam. 2021. CONST: Continuous online NoSQL schema tuning. Softw. Pract. Exp. 51, 5 (2021), 1147–1169.
[168]
Pablo D. Muñoz-Sánchez, Carlos Javier Fernández Candel, Jesús García Molina, and Diego Sevilla Ruiz. 2020. Managing physical schemas in MongoDB stores. In Proceedings of the ER (Workshops). Springer, 162–172.
[169]
Shamkant B. Navathe, Stefano Ceri, Gio Wiederhold, and Jinglie Dou. 1984. Vertical partitioning algorithms for database design. ACM Trans. Database Syst. 9, 4 (1984), 680–710.
[170]
Shamkant B. Navathe and Minyoung Ra. 1989. Vertical partitioning for database design: A graphical algorithm. In Proceedings of the SIGMOD. ACM, 440–450.
[171]
Rimma Nehme and Nicolas Bruno. 2011. Automated partitioning design in parallel database systems. In Proceedings of the SIGMOD. 1137–1148.
[172]
Priscilla Neuhaus, Julia Couto, Jonatas Wehrmann, Duncan Dubugras Alcoba Ruiz, and Felipe Meneguzzi. 2019. GADIS: A genetic algorithm for database index selection (S). In Proceedings of the SEKE. KSI, 39–54.
[173]
Viviana Noguera and Daniel Lucrédio. 2019. Implementing a classic ER algebra to automatically generate complex queries for document-oriented databases. In Proceedings of the SBCARS. ACM, 43–52.
[174]
Harley Vera Olivera, Maristela Holanda, Valeria Guimarâes, Fernanda Hondo, and Wagner Boaventura. 2015. Data modeling for NoSQL document-oriented databases. In Proceedings of the SIMBig. CEUR-WS.org, 129–135.
[175]
openGauss. 2020. openGauss database system. Retrieved March 21, 2021 from https://github.com/opengauss-mirror/
[176]
Oracle. 2020. Oracle autonomous database. Retrieved April 20, 2021 from https://docs.oracle.com/en/cloud/paas/autonomous-database/
[177]
Rafik Ouanouki, Alain April, Alain Abran, Abraham Gomez, and Jean-Marc Desharnais. 2017. Toward building RDB to HBase conversion rules. Journal of Big Data 4, 1 (2017), 1–21.
[178]
F. Palermo. 1970. A quantitative approach to the selection of secondary indexes. Research report RJ 730. IBM, San Jose, CA.
[179]
Gabriel Paludo Licks, Julia Colleoni Couto, Priscilla de Fátima Miehe, Renata De Paris, Duncan Dubugras Ruiz, and Felipe Meneguzzi. 2020. SMARTIX: A database indexing agent based on reinforcement learning. Applied Intelligence 50, 8 (2020), 2575–2588.
[180]
Stratos Papadomanolakis and Anastassia Ailamaki. 2007. An integer linear programming approach to database design. In Proceedings of the ICDE Workshops. IEEE Computer Society, 442–449.
[181]
Stratos Papadomanolakis, Debabrata Dash, and Anastassia Ailamaki. 2007. Efficient use of the query optimizer for automated database design. In Proceedings of the VLDB. ACM, 1093–1104.
[182]
Thorsten Papenbrock and Felix Naumann. 2016. A hybrid approach to functional dependency discovery. In Proceedings of the SIGMOD. ACM, 821–833.
[183]
J. Patel. 2012. Cassandra data modeling best practices. Retrieved October 22, 2019 from http://ebaytechblog.com/?p=1308
[184]
Andrew Pavlo, Gustavo Angulo, Joy Arulraj, Haibin Lin, Jiexi Lin, Lin Ma, Prashanth Menon, Todd C. Mowry, Matthew Perron, Ian Quah, Siddharth Santurkar, Anthony Tomasic, Skye Toor, Dana Van Aken, Ziqi Wang, Yingjun Wu, Ran Xian, and Tieying Zhang. 2017. Self-driving database management systems. In Proceedings of the CIDR. 1.
[185]
Andy Pavlo, Matthew Butrovich, Lin Ma, Prashanth Menon, Wan Shen Lim, Dana Van Aken, and William Zhang. 2021. Make your database system dream of electric sheep: Towards self-driving operation. Proc. VLDB Endow. 14, 12 (2021), 3211–3221.
[186]
Andrew Pavlo, Carlo Curino, and Stanley B. Zdonik. 2012. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In Proceedings of the SIGMOD. ACM, 61–72.
[187]
Wendel Góes Pedrozo, Júlio César Nievola, and Deborah Carvalho Ribeiro. 2018. An adaptive approach for index tuning with learning classifier systems on hybrid storage environments. In Proceedings of the HAIS. Springer, 716–729.
[188]
Gregory Piatetsky-Shapiro. 1983. The optimal selection of secondary indices is NP-complete. ACM SIGMOD Record 13, 2 (1983), 72–75.
[189]
Holger Pirk, Eleni Petraki, Stratos Idreos, Stefan Manegold, and Martin L. Kersten. 2014. Database cracking: Fancy scan, not poor man’s sort! In Proceedings of the DaMoN. ACM, 4:1–4:8.
[190]
Nataliya Prokoshyna, Jaroslaw Szlichta, Fei Chiang, Renée J. Miller, and Divesh Srivastava. 2015. Combining quantitative and logical data cleaning. Proc. VLDB Endow. 9, 4 (2015), 300–311.
[191]
Jun Rao, Chun Zhang, Nimrod Megiddo, and Guy M. Lohman. 2002. Automating physical database design in a parallel database. In Proceedings of the SIGMOD. ACM, 558–569.
[192]
Alexander Rasin and Stanley B. Zdonik. 2013. An automatic physical design tool for clustered column-stores. In Proceedings of the EDBT. ACM, 203–214.
[193]
Keven Richly, Rainer Schlosser, and Martin Boissier. 2021. Joint index, sorting, and compression optimization for memory-efficient spatio-temporal data management. In Proceedings of the ICDE. IEEE, 1901–1906.
[194]
Noa Roy-Hubara and Arnon Sturm. 2020. Design methods for the new database era: A systematic literature review. Software and Systems Modeling 19, 2 (2020), 297–312.
[195]
Steve Rozen and Dennis E. Shasha. 1991. A framework for automating physical database design. In Proceedings of the VLDB. 401–411.
[196]
Diego Sevilla Ruiz, Severino Feliciano Morales, and Jesús García Molina. 2015. Inferring versioned schemas from NoSQL databases and its applications. In Proceedings of the ER (LNCS, Vol. 9381). Springer, 467–480.
[197]
Zahra Sadri, Le Gruenwald, and Eleazar Leal. 2020. Online index selection using deep reinforcement learning for a cluster database. In Proceedings of the ICDEW. IEEE, 158–161.
[198]
Kai-Uwe Sattler, Ingolf Geist, and Eike Schallehn. 2003. QUIET: Continuous query-driven index tuning. In Proceedings of the VLDB. 1129–1132.
[199]
Lucas C. Scabora, Jaqueline Joice Brito, Ricardo Rodrigues Ciferri, and Cristina Dutra de Aguiar Ciferri. 2016. Physical data warehouse design on NoSQL databases - OLAP query processing over HBase. In Proceedings of the ICEIS. SciTePress, 111–118.
[200]
Eike Schallehn. 2010. Database tuning and self-tuning. Technical report. Faculty of Computer Science, OvG University Magdeburg, Germany. 130 pages.
[201]
Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Bießmann, and Andreas Grafberger. 2018. Automating large-scale data quality verification. Proc. VLDB Endow. 11, 12 (2018), 1781–1794.
[202]
Mario Schkolnick. 1975. The optimal selection of secondary indices for files. Inf. Syst. 1, 4 (1975), 141–146.
[203]
Rainer Schlosser, Jan Kossmann, and Martin Boissier. 2019. Efficient scalable multi-attribute index selection using recursive strategies. In Proceedings of the ICDE. IEEE, 1238–1249.
[204]
Karl Schnaitter, Serge Abiteboul, Tova Milo, and Neoklis Polyzotis. 2006. COLT: Continuous on-line tuning. In Proceedings of the SIGMOD. 793–795.
[205]
Karl Schnaitter and Neoklis Polyzotis. 2012. Semi-automatic index tuning: Keeping DBAs in the loop. Proc. VLDB Endow. 5, 5 (2012), 478–489.
[206]
Felix Martin Schuhknecht, Jens Dittrich, and Laurent Linden. 2018. Adaptive adaptive indexing. In Proceedings of the ICDE. 665–676.
[207]
Felix Martin Schuhknecht, Alekh Jindal, and Jens Dittrich. 2013. The uncracked pieces in database cracking. Proceedings of the VLDB Endowment 7, 2 (2013), 97–108.
[208]
Felix Martin Schuhknecht, Alekh Jindal, and Jens Dittrich. 2016. An experimental evaluation and analysis of database cracking. The VLDB Journal 25, 1 (2016), 27–52.
[209]
Amal Sellami, Ahlem Nabli, and Faiez Gargouri. 2018. Transformation of data warehouse schema to NoSQL graph data base. In Proceedings of the International Conference on Intelligent Systems Design and Applications. Springer, 410–420.
[210]
Ankur Sharma, Felix Martin Schuhknecht, and Jens Dittrich. 2018. The case for automatic database administration using deep reinforcement learning. CoRR abs/1801.05643 (2018).
[211]
Amit Shukla, Prasad Deshpande, and Jeffrey F. Naughton. 1998. Materialized view selection for multidimensional datasets. In Proceedings of the VLDB. 488–499.
[212]
Georgios Spanos and Lefteris Angelis. 2016. The impact of information security events to the stock market: A systematic literature review. Comput. Secur. 58 (2016), 216–229.
[213]
Dave Steinberg, Frank Budinsky, Ed Merks, and Marcelo Paternostro. 2008. EMF: Eclipse Modeling Framework. Pearson Education.
[214]
Michael Stonebraker. 1974. The choice of partial inversions and combined indices. Int. J. Parallel Program. 3, 2 (1974), 167–188.
[215]
Michael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth J. O’Neil, Patrick E. O’Neil, Alex Rasin, Nga Tran, and Stanley B. Zdonik. 2005. C-Store: A column-oriented DBMS. In Proceedings of the VLDB. ACM, 553–564.
[216]
Uta Störl, Meike Klettke, and Stefanie Scherzinger. 2020. NoSQL schema evolution and data migration: State-of-the-art and opportunities. In Proceedings of the EDBT. OpenProceedings.org, 655–658.
[217]
Dimitri Theodoratos and Timos K. Sellis. 1997. Data warehouse configuration. In Proceedings of the VLDB. 126–135.
[218]
Tamás Vajk, Péter Fehér, Krisztián Fekete, and Hassan Charaf. 2013. Denormalizing data into schema-free databases. In Proceedings of the 2013 IEEE 4th International Conference on Cognitive Infocommunications. IEEE, 747–752.
[219]
Gary Valentin, Michael Zuliani, Daniel C. Zilio, Guy M. Lohman, and Alan Skelley. 2000. DB2 advisor: An optimizer smart enough to recommend its own indexes. In Proceedings of the ICDE. IEEE Computer Society, 101–110.
[220]
Harley Vera-Olivera, Ruizhe Guo, Ruben Cruz Huacarpuma, Ana Paula Bernardi Da Silva, Ari Melo Mariano, and Maristela Holanda. 2021. Data modeling and NoSQL databases-a systematic mapping review. ACM Computing Surveys 54, 6 (2021), 1–26.
[221]
Lanjun Wang, Oktie Hassanzadeh, Shuo Zhang, Juwei Shi, Limei Jiao, Jia Zou, and Chen Wang. 2015. Schema management for document stores. Proc. VLDB Endow. 8, 9 (2015), 922–933.
[222]
Gerhard Weikum, Christof Hasse, Axel Mönkeberg, and Peter Zabback. 1994. The COMFORT automatic tuning project. Inf. Syst. 19, 5 (1994), 381–432.
[223]
Gerhard Weikum, Axel Mönkeberg, Christof Hasse, and Peter Zabback. 2002. Self-tuning database technology and information services: from wishful thinking to viable engineering. In Proceedings of the VLDB. 20–31.
[224]
Kyu-Young Whang. 1985. Index selection in relational databases. In Proceedings of the FODO. Plenum Press, New York, 487–500.
[225]
Kyu-Young Whang. 1983. A Physical Database Design Methodology Using the Property of Separability. Ph. D. Dissertation. Stanford University, United States.
[226]
Kyu-Young Whang. 1984. Separability: An approach to physical database design. IEEE Trans. Computers 100, 3 (1984), 209–222.
[227]
David Wiese, Gennadi Rabinovitch, Michael Reichert, and Stephan Arenswald. 2008. Autonomic tuning expert: A framework for best-practice oriented autonomic database tuning. In Proceedings of the CASCON. IBM, 3.
[228]
Fangzhou Yang, Dragan Milosevic, and Jian Cao. 2017. Optimising column family for OLAP queries in HBase. International Journal of Big Data Intelligence 4, 1 (2017), 23–35.
[229]
Jian Yang, Kamalakar Karlapalem, and Qing Li. 1997. Algorithms for materialized view design in data warehousing environment. In Proceedings of the VLDB. 136–145.
[230]
Rania Yangui, Ahlem Nabli, and Faïez Gargouri. 2016. Automatic transformation of data warehouse schema to NoSQL data base: Comparative study. In Proceedings of the KES (Procedia Computer Science, Vol. 96). Elsevier, 255–264.
[231]
Haitao Yuan, Guoliang Li, Ling Feng, Ji Sun, and Yue Han. 2020. Automatic view generation with deep learning and reinforcement learning. In Proceedings of the ICDE. IEEE, 1501–1512.
[232]
Chuan Zhang and Jian Yang. 1999. Genetic algorithm for materialized view selection in data warehouse environments. In Proceedings of the DaWaK. Springer, 116–125.
[233]
Chuan Zhang, Xin Yao, and Jian Yang. 2001. An evolutionary approach to materialized views selection in a data warehouse environment. IEEE Trans. Syst. Man Cybern. Part C 31, 3 (2001), 282–294.
[234]
Daniel C. Zilio. 1998. Physical Database Design Decision Algorithms and Concurrent Reorganization for Parallel Database Systems. Ph. D. Dissertation. University of Toronto, Canada.
[235]
Daniel C. Zilio, Anant Jhingran, and Sriram Padmanabhan. 1994. Partitioning Key Selection for a Shared-nothing Parallel Database System. IBM TJ Watson Research Center.
[236]
Daniel C. Zilio, Jun Rao, Sam Lightstone, Guy M. Lohman, Adam J. Storm, Christian Garcia-Arellano, and Scott Fadden. 2004. DB2 design advisor: Integrated automatic physical database design. In Proceedings of the VLDB. 1087–1097.
[237]
Daniel C. Zilio, Calisto Zuzarte, Sam Lightstone, Wenbin Ma, Guy M. Lohman, Roberta Cochrane, Hamid Pirahesh, Latha S. Colby, Jarek Gryz, Eric Alton, Dongming Liang, and Gary Valentin. 2004. Recommending materialized views and indexes with IBM DB2 design advisor. In Proceedings of the ICAC. IEEE Computer Society, 180–188.

    Published In

    ACM Computing Surveys  Volume 56, Issue 11
    November 2024
    977 pages
EISSN: 1557-7341
DOI: 10.1145/3613686
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 29 June 2024
    Online AM: 17 May 2024
    Accepted: 08 May 2024
    Revised: 08 March 2024
    Received: 22 May 2023
    Published in CSUR Volume 56, Issue 11

    Author Tags

    1. Data modeling
    2. schema design and tuning
    3. self-tuning databases
    4. NoSQL databases
    5. SQL databases

    Qualifiers

    • Survey

    Funding Sources

    • National Recovery and Resilience Plan (NRRP)
    • Italian Ministry of University and Research
    • European Union-NextGenerationEU
    • Italian Ministry of University and Research, CUP
