

Self-tuning Database Systems: A Systematic Literature Review of Automatic Database Schema Design and Tuning

Published: 29 June 2024

Abstract

Self-tuning is a feature of autonomic databases that includes the problem of automatic schema design. It aims at providing an optimized schema that increases the overall database performance. While in relational databases automatic schema design focuses on the automated design of the physical schema, in NoSQL databases all levels of representation are considered: conceptual, logical, and physical. This is mainly because the latter are mostly schema-less and lack a standard schema design procedure as is the case for SQL databases. In this work, we carry out a systematic literature survey on automatic schema design in both SQL and NoSQL databases. We identify the levels of representation and the methods that are used for the schema design problem, and we present a novel taxonomy to classify and compare different schema design solutions. Our comprehensive analysis demonstrates that, despite substantial progress that has been made, schema design is still a developing field and considerable challenges need to be addressed, notably for NoSQL databases. We highlight the most important findings from the results of our analysis and identify areas for future research work.

1 Introduction

The maintenance of database systems has become costly and time-consuming due to their growing complexity. In fact, a major part of today’s database expenses relates to database administration. For this reason, we have recently seen an increasing number of studies on self-adaptive features in database management systems (DBMSs), which led to the design and development of so-called self-managing or autonomic DBMSs (ADBMSs). These systems have the capability to manage and maintain themselves with little or no human intervention. Autonomic database management systems can be classified into the following four self-CHOP categories [87]: self-configuring, self-healing, self-optimizing, and self-protecting. Some works [77] consider additional categories, such as self-organizing and self-inspecting.
Self-configuring (or self-tuning) is the ability of a database system to configure itself in order to achieve its performance goals [94]. There are hundreds of configuration parameters, which can be categorized into knob configuration, resource consumption thresholds, and data structure design in the database schema [77]. The field of self-tuning databases is thus large. Due to lack of space, in this study we focus only on database schema design and tuning; other areas of schema management, such as schema evolution and data migration, are not covered.
From a schema design viewpoint, SQL databases are designed to manage well-structured data, requiring users to design a schema before storing data. This is a schema-on-write approach, in which data is written into a fixed and predefined schema. There are also efforts to extend SQL databases to manage flexible schemas [154]. In contrast, NoSQL schema design is based on a schema-on-read approach, offering the flexibility to handle unstructured or dynamic data without a fixed schema. While SQL databases can efficiently handle structured data, they are not as flexible as NoSQL databases in dealing with unstructured or semi-structured data due to their fixed schema requirements.
Generally, three levels of representation are distinguished in databases: conceptual, logical, and physical data models. The automatic schema design in relational databases focuses on automated physical schema design [47]. This is mainly because there are well-known principles and procedures for translating a conceptual data model to a normalized logical data model, i.e., a set of tables. The physical schema describes the implementation of the logical model on secondary storage. In order to tune the physical design for a given workload and achieve a more efficient data access, tables are supplemented with additional structures such as indexes and materialized views. As pointed out in [47], the problem of physical database design can be defined as a search problem of finding a good configuration of physical structures (i.e., indexes, materialized views), which minimizes the execution cost of a given workload.
In contrast, NoSQL stores adopt more flexible data models that are mostly schemaless. A schemaless data model allows data to have arbitrary structures as they are not explicitly defined using a data definition language (schema-on-write), but are instead implicitly encoded in the application logic (schema-on-read) [69]. This implies a high level of flexibility in data modeling, allowing the data to be structured with greater freedom than in relational databases. Due to the schemaless nature of NoSQL stores, automatic schema design is not limited to the physical level; instead, all levels of representation (conceptual, logical, or physical) are considered. There is no fully standardized schema design (or data modeling) procedure for NoSQL databases, and manual schema design often relies on rules of thumb for choosing a good schema. For instance, eBay [183] and Netflix [133] have shared examples and general guidelines for the schema design of NoSQL wide column stores. The problem of NoSQL schema design is to transform conceptual, logical, and physical models into each other. At the conceptual level, the traditional models used for SQL databases are adopted, such as UML, ER, and EER. At the logical level, in contrast, new schemas are defined based on models such as documents and graphs, thereby bringing the logical model closer to specific NoSQL databases. Finally, at the physical level, the specific schema and features of a concrete NoSQL database, such as MongoDB, Neo4j, or Cassandra, are used. At this point, some questions arise: At which level of representation (conceptual, logical, or physical) does the schema design process occur? Which schema structures or models are designed at each level of representation? What are the methods used to model NoSQL databases?
The goal of this study is to conduct a systematic literature review on existing approaches for automated database schema design, considering academic works in both SQL and NoSQL databases. We analyze 186 primary studies that were published between 1970 and 2021, and we provide a comprehensive taxonomy to compare the numerous approaches along relevant facets.
Studies on physical schema design in relational databases have been conducted since the early 1970s. While the initial focus was on the index selection problem, extensive work on selecting optimized sets of indexes, materialized views, and other physical structures followed. Nowadays, all major commercial database vendors ship automated physical design tools, e.g., the SQL Access Advisor in Oracle [64, 138], the Tuning Advisor in Microsoft SQL Server [6], and the DB2 Design Advisor in IBM’s DB2 [236, 237]. In contrast, there have been fewer studies on automatic database schema design for NoSQL databases, which appeared only after 2010 when NoSQL databases became popular.
The remainder of this article is organized as follows: Section 2 summarizes previous surveys on database schema design, followed by a brief description of the systematic reviewing approach adopted in this article in Section 3. Section 4 gives an overview of the selected primary studies and the proposed taxonomy. A thorough overview of the research works in automated schema design in SQL and NoSQL databases is presented in Sections 5 and 6, respectively, whereas Section 7 analyzes previous works from the viewpoint of workload type and tuning approach. A discussion of the review results is presented in Section 8. In Section 9, the limitations and threats to validity are outlined. Finally, Section 10 concludes the article.

2 Related Surveys

In the literature, various surveys or assessments of automatic database schema design have been conducted, which focus either on SQL or NoSQL databases. However, to the best of our knowledge, no comprehensive survey on automatic database schema design covering both SQL and NoSQL databases exists yet. This work is the first survey to address this topic, providing a thorough analysis of the state-of-the-art as well as precise and well-supported suggestions for future research.
For SQL databases, the works in [34, 152] focus on the principles and advances of physical database design, and they examine physical design methodologies and automatic design tools used in commercial RDBMSs. The studies in [43, 77, 94, 160] analyze autonomic (or self-managing) features in popular relational database products, namely DB2, Oracle, and Microsoft SQL Server. While Chaudhuri and Narasayya [47] review prominent solutions in the area of automated physical database design with a particular focus on work done in the AutoAdmin project, there exist several publications that provide further details on self-tuning databases [49, 200, 223]. Weikum et al. [223] survey and assess general approaches of automatic tuning, putting a particular emphasis on the COMFORT automatic tuning project [222]. A tutorial on foundations and principles of automated tuning as well as an overview of tuning tools in commercial RDBMSs is provided in [49, 200]. The work in [134] compares eight existing algorithms for the index selection problem along different dimensions, such as solution quality, runtime, multi-column support, solution granularity, and complexity. This comparison includes an experimental evaluation of the algorithms for three analytical benchmarks, namely, the TPC-H, TPC-DS, and Join Order Benchmark.
With the growth of NoSQL databases and the need for schema design in such databases, various academic studies have been published. A survey on NoSQL design methodologies is described in [17]. It presents a classification of methodologies, first by an attempt to unify most NoSQL databases under a uniform design methodology, and then by type of NoSQL databases. Finally, an evaluation process for NoSQL design methodologies is proposed. The authors of [194] identify several key criteria in the design of new databases (including Object-Oriented, NoSQL, and NewSQL databases) and analyze existing methods accordingly. The work in [62] provides a static analysis of five representative JSON schema inference approaches to identify their strengths and weaknesses. The authors of [220] conducted a systematic review to analyze academic works and extract characteristics of data modeling in NoSQL databases. This systematic review aims at answering three research questions about the level of representation, models used, and contexts where the modeling process occurred in NoSQL databases. [216] provides a general overview of challenges and state-of-the-art in NoSQL schema management, e.g., schema design and schema evolution, and provides an outlook on research opportunities related to NoSQL schema evolution and data migration.

3 Systematic Review Process

In a systematic literature review, it is of paramount importance to consider all primary research works that are relevant for the investigated topic. We adopt the guidelines introduced by Kitchenham [129, 130], which provide the crucial steps that are required for identifying, evaluating, and interpreting all relevant research works. The process is divided into three phases: planning the review, conducting the review, and reporting the review (cf. Figure 1).
Fig. 1. The review process adopted in this article.

3.1 Phase 1: Planning the Review

The planning phase involves three steps: the first justifies the need for the systematic review, the second defines the research questions, and the third develops the review protocol.

3.1.1 Need for a Systematic Review.

Automated physical schema tuning in SQL databases is well-studied, and there is a large number of related articles (cf. Section 5). In contrast, research on automatic schema tuning in NoSQL databases started only in 2010, hence less research has been conducted so far (cf. Section 6). While there exist several surveys on automatic schema tuning for SQL and NoSQL databases, none of these studies provides an in-depth overview, classification, or comparison of existing approaches from multiple facets. To the best of our knowledge, our work is the first substantial effort to explore automatic schema tuning approaches in both SQL and NoSQL databases, as well as the first work to propose a taxonomy for evaluating and comparing different schema tuning solutions. This survey helps to investigate the gap between the motivations and capabilities of current solutions, and consequently to identify areas for future research.

3.1.2 Research Questions.

This survey concentrates on the following four research questions:
RQ1
What are the venues and publication statistics of existing studies on automated database schema design?
Answering this question allows us to identify the distribution of the publications across publishers and publication types as well as periods when the topic was particularly attractive to researchers.
RQ2
At what level of representation does the schema design occur, and what are the schema design structures or models?
The purpose of this question is to identify which level of representation (conceptual, logical, and physical) is considered in the schema design process, and which schema structures or models are designed at the respective level.
RQ3
At each level of representation, what methods are used to solve the schema design problem?
This question explores the various solutions that have been presented for addressing the schema design problem, and whether the proposed methods support fully automatic schema tuning.
RQ4
How can we define a taxonomy for schema design approaches considering significant facets?
Presenting a classification and taxonomy reveals a comprehensive understanding of significant facets of the schema design problem and helps to position existing works in the research landscape. This also facilitates the development of new approaches.

3.1.3 Systematic Review Protocol.

This is the most vital stage of the process and defines the data sources, the search strategy, and the selection criteria for the primary publications. We used electronic indexing systems, such as IEEE, Springer, Science Direct, ACM, Citeseer, Web of Science, and Scopus, to find candidates for primary studies. The search was carried out using search strings derived from the research questions and combined with Boolean operators, following the guidelines of Spanos and Angelis [212]. According to [212], the determination of search keywords is an iterative procedure: it begins with trial searches using different keywords, starting from an initial set of articles that is known to belong to the research field of the review, and ends when the trial searches retrieve this initial set of articles. In our case, we used the following primary keywords: “NoSQL databases”, “relational databases”, “schema design”, “modeling”, “schema tuning”, “schema configuration”. To search for academic works related to the structures or models of schema design, the keywords “index”, “materialized view”, “partitioning”, “clustering”, “graph model”, “document model”, “column model”, “key-value model” were used. Furthermore, an in-depth analysis was conducted to identify more representative keywords and maximize the candidate set of primary studies. Finally, these keywords were combined using Boolean ANDs and ORs, yielding our search string shown in Figure 2.
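The keyword combination described above can be sketched as follows. This is an illustrative reconstruction of the AND/OR pattern only, not the exact search string shown in Figure 2.

```python
# Illustrative sketch: combining the primary and structure keyword groups
# with Boolean operators, as described in the review protocol.
primary = ["NoSQL databases", "relational databases", "schema design",
           "modeling", "schema tuning", "schema configuration"]
structures = ["index", "materialized view", "partitioning", "clustering",
              "graph model", "document model", "column model", "key-value model"]

def or_group(terms):
    """Join a keyword group into a parenthesized OR clause."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

# Groups are ORed internally and ANDed with each other.
search_string = or_group(primary) + " AND " + or_group(structures)
print(search_string)
```

Real indexing systems (Scopus, Web of Science, etc.) each have their own query syntax, so such a string would still need adaptation per system.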
Fig. 2. Search String.
Next, the following inclusion and exclusion criteria were applied:
Inclusion Criteria (IC):
(IC1)
Academic works that develop or extend schema design and tuning approaches for databases (SQL or NoSQL) as a main part of the study.
(IC2)
Academic works published as book, book chapter, journal, conference article, thesis, or technical report.
Exclusion Criteria (EC):
(EC1)
Academic works written in languages other than English.
(EC2)
Academic works not available on the web.
(EC3)
Academic works representing a survey or systematic review.
(EC4)
Academic works published as short article.
(EC5)
Academic works focusing on other areas (i.e., not schema design and data modeling).
(EC6)
Academic works related to other areas of self-tuning databases, such as memory and other database components.
(EC7)
Academic works related to other areas of schema management, such as schema evolution and data migration.
(EC8)
Academic works outside the time period 1970 to 2021.

3.2 Phase 2: Conducting the Review

Phase 2 is divided into the following steps: identifying primary studies using the search string, selecting the primary studies based on the inclusion and exclusion criteria, assessing the quality of the primary studies, and extracting and synthesizing the data.

3.2.1 Identifying Primary Studies.

To find candidates for primary studies, we use the search string shown in Figure 2 in the electronic indexing systems IEEE, Springer, Science Direct, ACM, Citeseer, Web of Science, and Scopus. For completeness, a manual search was carried out in Google Scholar with the aim of finding additional works on automatic database schema design. Our search strategy targeted literature published as book, book chapter, journal, conference article, thesis, or technical report in the period from 1st January 1970 to 31st December 2021. As a result, we obtained 8,275 candidates for primary studies.

3.2.2 Selecting Primary Studies.

This step applies the inclusion and exclusion criteria in order to filter out academic works that are not relevant to our study. We first applied the exclusion criteria EC5–EC8, and as a result 1,824 studies remained. Then, the other inclusion and exclusion criteria (EC1–EC4, IC1–IC2) were applied. Finally, 142 academic works related to SQL databases and 98 academic works related to NoSQL databases remained for a more detailed study.

3.2.3 Assessing the Quality of Primary Studies.

In addition to the inclusion and exclusion criteria, it is important to ensure that the primary studies are of high quality. According to Kitchenham [130], the quality assessment of primary studies is complex and depends on various factors. We adopted the following criteria for our study: (a) availability of the publication, including information on how to access it, e.g., URLs, DOIs, or databases; (b) detailed description and documentation of the presented methods and models for schema design; and (c) comprehensive presentation of the results. Based on these criteria, the academic works were subjected to thorough text-reading, leaving 114 works related to SQL databases and 72 works related to NoSQL databases for this survey. An overview of the identified primary studies is given in Table 1. The table shows statistics with respect to publication type and year of publication. Symposia and workshop publications are counted as conferences. We can see that research on automated physical design in SQL databases covers the entire period of our study, whereas research works on schema design in NoSQL databases have appeared only from 2010 onward.
| Database Type | Publication Type | Primary Studies | Years of Publication |
| --- | --- | --- | --- |
| SQL Databases | Journal | 40 | 1974–2021 |
| | Conference | 64 | 1976–2021 |
| | Book | 3 | 1983–2020 |
| | Dissertation or technical report | 7 | 1970–2020 |
| | Total | 114 | 1970–2021 |
| NoSQL Databases | Journal | 21 | 2014–2021 |
| | Conference | 49 | 2010–2021 |
| | Book | 1 | 2019 |
| | Dissertation or technical report | 1 | 2018 |
| | Total | 72 | 2010–2021 |

Table 1. The Information about Selected Studies

3.2.4 Extracting Data of Primary Studies.

In this stage, we analyze the primary studies in order to answer the research questions. For this, we extracted the following main features:
(1)
Publication source, year of publication, and type of article (RQ1).
(2)
Database type SQL or NoSQL, type of NoSQL databases (Key-Value, Document, Wide-column, Graph) (RQ2, RQ3, RQ4).
(3)
Level of representation, i.e., conceptual, logical, or physical (RQ2).
(4)
Schema design structures, for example physical structures, such as indexes, materialized views, multidimensional clustering, and partitioning (RQ2).
(5)
Source and target models in the schema design process (RQ2).
(6)
Schema design methods (RQ3, RQ4).

3.3 Phase 3: Reporting the Review

The reporting phase consists of the final assessment and presentation of the results from the systematic review process. In the following sections, we describe in detail the results and findings that we obtained by answering and analyzing our four research questions.

4 Overview of Primary Studies and Taxonomy

This section provides a general description of the selected primary studies (answer to RQ1) and a new taxonomy to classify schema design and tuning approaches (answer to RQ4).

4.1 Overview of Selected Primary Studies

This section provides publication statistics for the selected primary studies with the goal of answering RQ1. Figure 3(a) shows the distribution of primary studies in terms of publication type, i.e., journal article, conference article, book, dissertation or technical report (D & R). Recall that symposia and workshops are counted as conferences. A substantial number of primary studies are published in conferences, for both SQL databases (64) and NoSQL databases (49).
Fig. 3. Distribution of primary studies by publication type and by publisher.
Figure 3(b) reveals that ACM and IEEE are the dominating publishers for research on SQL databases. This is mainly because they publish specialist conference proceedings, such as ACM SIGMOD and ICDE. Springer dominates for NoSQL databases, due to the many articles published in Springer conferences, such as the ER conference, which was one of the earliest conferences soliciting contributions in the area of data modeling in NoSQL databases. VLDB is one of the main conferences for databases; its proceedings were published by Morgan Kaufmann (MK) until 2004, then by ACM from 2005 to 2007, and since 2008 by the VLDB Endowment in open access (PVLDB).
Figure 4 shows the publication statistics with respect to the publication year. Studies about physical schema design in relational databases cover the entire period from 1970 to 2021 considered in our study. The rapidly growing number of publications over the years demonstrates an increasing interest in this topic among researchers and practitioners. Studies on schema design in NoSQL databases have only been conducted since 2010, and their number is growing fast.
Fig. 4. Distribution of primary studies by publication type over the years.
Table 2 shows the main conferences of our primary studies. In the area of physical schema design and tuning for SQL databases, the most notable conferences are ACM SIGMOD (15), VLDB (14), and ICDE (13). In the area of schema design and tuning for NoSQL databases, the most notable conferences are ER (7), and ICEIS (5). Articles published in journals were spread over 19 journals for SQL databases and 16 journals for NoSQL databases. Table 3 shows the most relevant journals.
| Database Type | Acronym | Conference Name | Primary Studies |
| --- | --- | --- | --- |
| SQL Databases | ACM SIGMOD | ACM SIGMOD International Conference on Management of Data | 15 |
| | VLDB | International Conference on Very Large Data Bases | 14 |
| | ICDE | IEEE International Conference on Data Engineering | 13 |
| | EDBT | International Conference on Extending Database Technology | 3 |
| NoSQL Databases | ER | International Conference on Conceptual Modeling | 7 |
| | ICEIS | International Conference on Enterprise Information Systems | 5 |
| | IEEE BigData | IEEE International Conference on Big Data | 3 |
| | IEEE IRI | International Conference on Information Reuse and Integration for Data Science | 3 |

Table 2. Main Conferences
| Database Type | Publisher | Journal Title | Primary Studies |
| --- | --- | --- | --- |
| SQL Databases | PVLDB | PVLDB | 10 |
| | Springer | The VLDB Journal | 5 |
| | IEEE Computer Society | IEEE Transactions on Software Engineering | 4 |
| | IEEE Computer Society | IEEE Transactions on Knowledge and Data Engineering | 3 |
| | Elsevier | Information Systems | 3 |
| NoSQL Databases | Elsevier | Information Systems | 4 |
| | Springer | Journal of Big Data | 3 |
| | Springer | The VLDB Journal | 2 |
| | IEEE Computer Society | IEEE Transactions on Knowledge and Data Engineering | 2 |

Table 3. Main Journals

4.2 A Taxonomy of Schema Design and Tuning

To facilitate a systematic and thorough analysis of the numerous approaches, we present a new taxonomy of schema design and tuning in Figure 5 (answer to RQ4). The first level distinguishes between database type, workload type, and tuning approach. We use this taxonomy in the following sections to provide a thorough analysis of the selected primary studies (answer to RQ2 and RQ3). We begin our analysis with respect to the two different database types considered in this study and discuss SQL databases in Section 5 and NoSQL databases in Section 6. Afterward, in Section 7 we analyze the selected works from the perspective of workload type and tuning approach.
Fig. 5. Taxonomy of Schema Design and Tuning.

5 Schema Design and Tuning in SQL Databases

The schema design problem in relational databases focuses on automated physical model design, i.e., finding a configuration of physical structures that minimizes the execution cost for a given workload. Physical design structures include auxiliary objects, such as indexes and materialized views, which provide faster access to data and thus faster response times. They also include objects that define how the data is physically organized in tables, e.g., the use of clustering keys in multidimensional relational databases or partitioning keys in shared-nothing parallel database systems [87]. Generally, automated physical design approaches first generate a set of alternative designs, apply a cost model, and then choose the design with the lowest cost. Accordingly, we distinguish two main aspects: (1) a cost model to evaluate alternative designs and estimate the total workload cost of each design, and (2) a search strategy to find the physical configuration with the lowest total cost.
In the following, we first analyze cost models and search strategies, followed by a comprehensive overview of research conducted on automated physical design.
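The generate-and-cost pipeline just described can be sketched in a few lines. The workload, the candidate structures, and the cost numbers below are all invented for illustration; real tools use optimizer-derived cost estimates instead of a hand-written cost function.

```python
from itertools import chain, combinations

# Toy workload: query -> execution frequency.
WORKLOAD = {"q1": 100, "q2": 40}
CANDIDATES = ["idx_a", "idx_b"]   # candidate physical structures

def query_cost(query, config):
    """Invented per-query cost model: an index halves the cost of the
    single query it benefits (idx_a helps q1, idx_b helps q2)."""
    base = {"q1": 10.0, "q2": 8.0}[query]
    helps = {"q1": "idx_a", "q2": "idx_b"}[query]
    return base / 2 if helps in config else base

def workload_cost(config):
    """Total workload cost = frequency-weighted sum of query costs."""
    return sum(freq * query_cost(q, config) for q, freq in WORKLOAD.items())

def best_configuration(candidates, budget):
    """Search strategy: exhaustively cost all subsets within a size budget
    and return the cheapest one."""
    subsets = chain.from_iterable(combinations(candidates, r)
                                  for r in range(len(candidates) + 1))
    feasible = [set(s) for s in subsets if len(s) <= budget]
    return min(feasible, key=workload_cost)

print(best_configuration(CANDIDATES, budget=1))  # -> {'idx_a'}
```

With a budget of one structure, the search picks the index helping the more frequent query; the sections below survey how real systems replace both the cost model (Section 5.1) and the exhaustive search (Section 5.2) with scalable alternatives.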

5.1 Cost Model

Research on cost models can be categorized into three main groups: stand-alone cost model, optimizer cost model, and learned cost model.

5.1.1 Stand-alone Approaches.

Stand-alone approaches build an external cost model outside of the DBMS for comparing alternative design configurations. Stonebraker [214] and Schkolnick [202] presented a probabilistic model that considers statistical properties of the transactions as input parameters, e.g., the probability that a particular column appears in a query. These approaches are not practical since maintaining the statistics is expensive and the required amount of storage is very high. Thus, they work in special restricted cases, but not in the general case. Hammer and Chan [102] used exponential smoothing, a forecasting technique, to derive the parameters of the cost model. Another approach adopts rule-based expert systems [63, 80, 195]: they formalize the knowledge of human experts as rules and use these rules to generate a good physical design configuration. [105] introduced a simple linear cost model for estimating the number of rows in a table that a query is expected to process. Similarly, [126] developed an analytical cost model to predict the impact of data correlations on the performance of secondary index look-ups. The work in [234] uses an approximate mean value analysis to estimate the cost of executing a workload under a given configuration. Finally, the studies in [16, 21, 24, 57, 78, 95, 120, 157, 192, 224, 225, 226] use I/O-based cost models to calculate the total workload cost, which is measured as the number of page or block accesses, or as a weighted sum of I/O and CPU operations.
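The I/O-based cost models in the last group can be illustrated as a frequency-weighted sum over the workload. The weights and per-query numbers below are invented; the cited works derive them from storage and hardware parameters.

```python
# Minimal sketch of an I/O-based cost model: total workload cost as a
# weighted sum of page accesses and CPU operations. Weights are assumptions.
W_IO, W_CPU = 1.0, 0.01   # invented relative weights of I/O vs. CPU work

def query_cost(page_accesses, cpu_ops):
    """Cost of one query execution under the current physical design."""
    return W_IO * page_accesses + W_CPU * cpu_ops

def total_workload_cost(workload):
    """workload: list of (frequency, page_accesses, cpu_ops) triples."""
    return sum(f * query_cost(p, c) for f, p, c in workload)

cost = total_workload_cost([(100, 12, 500), (40, 30, 2000)])
print(cost)   # frequency-weighted sum over the two queries
```

Setting `W_CPU = 0` recovers the pure page-access variant also used in several of the cited studies.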
Approaches based on external cost models suffer from serious drawbacks. They cannot guarantee that the proposed storage scheme will be used to its full potential by the optimizer. Moreover, the tool becomes obsolete whenever the optimizer changes.

5.1.2 Optimizer Cost Model.

Instead of an external cost model, the pioneering work in [81] uses the cost model of the query optimizer, which maintains statistics about the queries’ execution cost. Such an approach has significant advantages compared to external cost models. The generated physical schemes are based on information extracted from the optimizer, and thus will be used by the optimizer to its full potential. Moreover, they remain synchronized if the optimizer evolves over time. A shortcoming of this approach is that creating statistics via traditional full-scan techniques is impractical for very large databases with many columns.
The AutoAdmin project initiated by Microsoft in 1997 extends the optimizer with a “what-if” interface for simulating hypothetical physical structures [46]. These structures are not materialized; instead, they are simulated inside the optimizer by adding metadata and statistical information to the system catalog. This interface allows physical design tools to stay in sync with the optimizer and to consider a large space of alternative physical designs without materializing them. Similar “what-if” interfaces were subsequently adopted by other DBMSs, for instance HypoPG, which allows creating hypothetical indexes in PostgreSQL and observing their impact on the optimizer. Large problem instances imply a very high number of what-if optimization calls, which dramatically increases the cost of physical structure tuning. To tackle this problem, several studies [38, 68, 181, 203] propose fast what-if optimization techniques, which execute fewer what-if calls and significantly reduce the cost of index tuning.
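The what-if idea can be sketched as follows. This is a hypothetical illustration of simulating an index as catalog metadata only; the class, method names, and the benefit factor are invented and do not reflect the actual interface of AutoAdmin or HypoPG.

```python
# Hypothetical sketch of a "what-if" interface: an index is registered as
# metadata only (never materialized), and the cost model is asked to
# estimate cost as if it existed. All names and numbers are invented.
class WhatIfOptimizer:
    def __init__(self):
        self.hypothetical = set()    # simulated indexes, metadata only

    def create_hypothetical_index(self, name):
        self.hypothetical.add(name)  # no physical structure is built

    def drop_all_hypothetical(self):
        self.hypothetical.clear()

    def estimate_cost(self, plan_cost, useful_index):
        """Cost the plan, pretending the hypothetical index exists."""
        if useful_index in self.hypothetical:
            return plan_cost * 0.5   # invented benefit factor
        return plan_cost

opt = WhatIfOptimizer()
baseline = opt.estimate_cost(1000.0, "idx_orders_date")
opt.create_hypothetical_index("idx_orders_date")
with_index = opt.estimate_cost(1000.0, "idx_orders_date")
print(baseline, with_index)          # -> 1000.0 500.0
```

The key design point is that only `estimate_cost` ever sees the hypothetical set, so a tuning tool can probe arbitrarily many candidate designs without paying the cost of building them.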
Other studies, such as [56, 58, 59], are based on a combination of an optimizer-based and a knowledge-based approach to reap the benefits of both techniques.

5.1.3 Learned Cost Model.

As pointed out in [143, 181], invoking the optimizer to estimate the cost of queries under various configurations is costly, and good configurations might be missed due to erroneous cardinality estimates. Therefore, recent research has studied the use of machine learning techniques, especially deep neural networks and reinforcement learning, to estimate cost models [26, 136, 146, 176, 179, 184, 185, 187, 197, 231]. The goal is to develop a tuning strategy that does not require prior knowledge of a cost model. Instead, the cost model is learned. Oracle’s cloud-based autonomous DBMS [176] is one notable example of augmenting an existing DBMS with machine learning agents. Instead of augmenting an existing DBMS, other works [184, 185] have looked into creating new DBMS architectures, which use different ML algorithms to predict the cost and benefit of deploying configuration components.
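The core idea of a learned cost model, fitting cost from observed executions rather than deriving it analytically, can be illustrated with a simple one-feature least-squares fit. The cited works use far richer models (deep neural networks, reinforcement learning) and feature sets; the data below is synthetic.

```python
# Sketch of a *learned* cost model: fit cost from observed
# (feature, runtime) pairs instead of specifying a formula a priori.
def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Synthetic observations: pages scanned vs. measured runtime,
# generated from runtime = 2 * pages + 5.
pages   = [10, 20, 30, 40]
runtime = [25, 45, 65, 85]
slope, intercept = fit_linear(pages, runtime)

def predict_cost(p):
    """Learned cost estimate for a query scanning p pages."""
    return slope * p + intercept

print(predict_cost(50))   # -> 105.0
```

The appeal, as noted above, is that the model improves with observed executions and does not inherit the optimizer's cardinality estimation errors.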

5.2 Search Strategy

Research on search strategies can be categorized into two main groups: exact approaches and approximate approaches.

5.2.1 Exact Approaches.

Early works use exact approaches based on an analytical model to find an optimal physical configuration [16, 128, 178, 202, 214]. Exact solutions find an optimal set of physical structures by considering all possible design choices. Since the physical structure selection problem is NP-complete [188], an exhaustive search strategy is impractical and not scalable. Therefore, the above approaches made many simplifying assumptions. General problems with analytical approaches include: (1) significant simplifications have to be made, and (2) if the query processing strategy or other modeled aspects of the DBMS change, the model becomes obsolete.

5.2.2 Approximate Approaches.

To tackle the complexity of exact approaches, approximate solutions have been investigated. The most widely used technique is to apply heuristic search methods to prune the search space of design models and choose a near-optimal configuration. Most of the works use heuristic solutions [21, 23, 24, 30, 56, 57, 58, 59, 63, 80, 81, 82, 102, 103, 195, 211, 217, 224, 229], greedy approaches [6, 45, 96, 97, 105, 127], evolutionary algorithms like genetic algorithms [137, 172, 191, 232], branch-and-bound algorithms [171, 186, 234], relaxation-based approaches [35, 36, 37], or hybrid approaches [233]. Other studies exploit combinatorial optimization techniques to choose an optimal set in potentially exponential time, or to find a near-optimal solution along with a bound on the distance to the optimal solution. The works in [40, 41, 68, 126, 180, 192] adopt an Integer Linear Programming (ILP) formulation of the problem and employ a commercial ILP solver based on techniques such as branch-and-bound, Lagrangian relaxation, or column generation. In [78, 95, 119, 120], a polynomial-time approximation algorithm is used to solve the index selection problem, which is formulated as a knapsack optimization problem. The work in [237] models the index and materialized view selection problem as a variant of the knapsack problem, followed by a random-swapping algorithm.
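A minimal sketch of a greedy, knapsack-style heuristic of the kind cited above: repeatedly pick the structure with the best benefit-per-size ratio that still fits the storage budget. The candidate names, benefits, and sizes are invented for illustration.

```python
# Greedy knapsack-style heuristic for physical structure selection.
def greedy_select(candidates, budget):
    """candidates: dict name -> (workload_benefit, storage_size).
    Returns the chosen structures within the storage budget."""
    chosen, used = [], 0
    remaining = dict(candidates)
    while remaining:
        # Pick the best benefit-per-size ratio among remaining candidates.
        name, (benefit, size) = max(remaining.items(),
                                    key=lambda kv: kv[1][0] / kv[1][1])
        del remaining[name]
        if used + size <= budget:   # skip structures that no longer fit
            chosen.append(name)
            used += size
    return chosen

cands = {"idx_a": (90, 3), "mv_b": (120, 6), "idx_c": (25, 1)}
print(greedy_select(cands, budget=7))   # -> ['idx_a', 'idx_c']
```

With a budget of 7, the heuristic skips the high-benefit but large materialized view `mv_b`; unlike exact enumeration, it may therefore miss the optimal set, which is exactly the quality trade-off these approximate approaches accept.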
The approaches described above lack a mechanism to learn about the goodness of the recommended set of physical structures. Their quality can be improved by learning from the effects of deployed actions, e.g., adding an index. Recent advances in (deep) reinforcement learning inspired researchers to move toward learning-based approaches. For instance, [140, 197, 210] demonstrate how deep reinforcement learning can be used for index selection. The works in [26] and [179] specify a general reinforcement learning approach for automated and dynamic database indexing. Liang et al. [150] investigate how to use deep reinforcement learning for selecting materialized views and improving the performance of OLAP workloads. Yuan et al. [231] solve the view selection problem by first modeling it as an ILP problem, then modeling the ILP as a Markov Decision Process (MDP), and finally using a deep reinforcement learning technique to solve it.
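The learning idea can be hinted at with a deliberately simplified sketch: an epsilon-greedy agent learns from noisy reward observations which single index yields the largest speed-up. This is a toy bandit, not the deep RL architectures of the cited works, and the reward table is an invented stand-in for measured latency improvements.

```python
# Much-simplified illustration of the learning idea behind RL-based index
# advisors: an epsilon-greedy agent tries actions (indexes), observes noisy
# rewards (speed-ups), and converges to the most beneficial one.
import random

def epsilon_greedy_advisor(true_rewards, episodes=2000, eps=0.1, seed=0):
    rng = random.Random(seed)
    actions = list(true_rewards)
    q = {a: 0.0 for a in actions}        # estimated value per action
    n = {a: 0 for a in actions}          # times each action was tried
    for _ in range(episodes):
        if rng.random() < eps:
            a = rng.choice(actions)      # explore
        else:
            a = max(actions, key=q.get)  # exploit current estimate
        reward = true_rewards[a] + rng.gauss(0, 0.05)  # noisy observation
        n[a] += 1
        q[a] += (reward - q[a]) / n[a]   # incremental mean update
    return max(actions, key=q.get)

# Synthetic speed-ups per candidate index (invented for illustration).
rewards = {"idx_a": 0.2, "idx_b": 0.9, "idx_c": 0.5}
print(epsilon_greedy_advisor(rewards))  # converges to "idx_b"
```

Full RL index advisors extend this scheme with states (the current configuration), multi-step actions, and learned value functions.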

5.3 Overview of Physical Database Design Approaches

This section provides a survey of research on automated physical design in RDBMSs. Since the AutoAdmin project at Microsoft Research developed the first automated physical design tool for a commercial DBMS in 1998, we review the work in three categories: works prior to the AutoAdmin project, the evolution of the AutoAdmin project, and related works following the AutoAdmin project.

5.3.1 Physical Database Design before the AutoAdmin Project.

The physical design problem has been studied since the early 1970s, initially focusing on the index selection problem for files [16, 128, 178, 202, 214]. These early works adopt an analytical approach, using a probabilistic cost model of the transactions, to derive an exact solution, i.e., an optimal set of indices. These exact approaches suffer from performing a complete enumeration of all possible index sets. Other works use approximate solutions and deal with a limited number of physical structures instead of all possible subsets, e.g., indexes [21, 24, 30, 56, 57, 58, 59, 63, 78, 81, 82, 102, 107, 108, 120, 195, 224, 234], partitioning [103, 169, 170, 195, 234], and multi-dimensional clustering [153]. Finally, there also exist approximate solutions for materialized view or index selection that are specifically tailored for OLAP/Data Cubes [23, 96, 97, 105, 139, 211, 217, 229, 232]. Notice that none of the above works developed physical design tools for commercial DBMSs.

5.3.2 History of the AutoAdmin Project.

The AutoAdmin project2 started in 1996 at Microsoft Research with the aim of making RDBMSs self-tuning. A primary focus was automating the physical database design. In 1998, the first physical database design tool was released, called the Index Tuning Wizard (ITW), which shipped with Microsoft SQL Server 7.0 and is based on techniques presented in [45]. A key feature of this tool is an extension of the query optimizer to support a “what-if” interface [46]. In the SQL Server 2000 release, ITW was enhanced with integrated recommendations for indexes and materialized views in the context of multidimensional (OLAP) databases [7].
Horizontal and vertical partitioning are other important aspects of physical database design with significant impact on performance and manageability. Horizontal partitioning allows tables, indexes, and materialized views to be partitioned into disjoint sets of rows, which are stored and accessed separately. Vertical partitioning allows a table to be partitioned into disjoint sets of columns. The work in [8] presented a scalable solution to the integrated physical design problem of indexes, materialized views, and vertical and horizontal partitioning, addressing both performance and manageability. In Microsoft SQL Server 2005, the functionality of ITW was replaced by a full-fledged application, called the Database Engine Tuning Advisor (DTA) [6]. It significantly increased the scope and usability of ITW. DTA provides integrated recommendations for indexes, indexed views, indexes on indexed views, and horizontal range partitioning. It scales to large databases and workloads using techniques such as workload compression and reduced statistics creation. ITW and DTA employ a “candidate selection” step to identify a set of likely configurations in a cost-based manner by consulting the query optimizer.
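The “what-if” interface can be sketched conceptually as follows: the tuner asks the optimizer to cost a query under a hypothetical configuration without materializing it. The class, matching rule, and cost numbers below are invented for illustration and do not reflect SQL Server's actual API.

```python
# Conceptual sketch of a "what-if" optimizer interface: cost a query under
# a hypothetical (not materialized) index configuration. The cost model
# and the toy matching rule are invented stand-ins.

class WhatIfOptimizer:
    def __init__(self, base_costs):
        self.base_costs = base_costs  # query -> cost without extra indexes

    def estimate(self, query, hypothetical_indexes):
        """Pretend each matching hypothetical index halves the query cost."""
        cost = self.base_costs[query]
        for idx in hypothetical_indexes:
            if idx.endswith(query):   # invented matching rule
                cost /= 2
        return cost

opt = WhatIfOptimizer({"q1": 100.0, "q2": 40.0})
print(opt.estimate("q1", ["idx_q1"]))  # 50.0: the index helps q1
print(opt.estimate("q2", ["idx_q1"]))  # 40.0: the index is irrelevant to q2
```

Candidate selection then compares such estimates across configurations instead of building and benchmarking each one.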
The ITW and DTA tools assume that the DBA knows when to invoke a physical design tuning tool. In this regard, DBAs face two main challenges. First, changes in data distributions and workloads might render the current configuration sub-optimal, so DBAs must repeatedly invoke the tuning tool to recommend changes to the current configuration. Second, since running a tuning tool incurs an overhead, it should only be invoked when the current configuration is indeed no longer optimal. Therefore, the work in [36] introduced a more lightweight tool, called Alerter, which identifies when a physical design tool should be invoked. Finally, the work in [37] introduced an online index selection approach that continuously monitors changes in the workload and in the data, and modifies the physical design as needed. It was prototyped inside the Microsoft SQL Server engine.
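The Alerter idea of invoking the expensive tuner only when worthwhile can be sketched as a simple threshold test; the cost numbers and threshold below are invented.

```python
# Hypothetical sketch of an Alerter-style trigger: the expensive tuning
# tool is only invoked when a lightweight estimate suggests that retuning
# would yield at least a threshold relative improvement.

def should_invoke_tuner(current_cost, estimated_best_cost, threshold=0.2):
    """Return True when retuning promises at least `threshold` relative gain."""
    improvement = (current_cost - estimated_best_cost) / current_cost
    return improvement >= threshold

print(should_invoke_tuner(100.0, 90.0))  # 10% gain -> skip tuning
print(should_invoke_tuner(100.0, 60.0))  # 40% gain -> invoke the tuner
```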
Since a crucial requirement for a self-tuning database system is the ability to efficiently monitor the database and identify potential performance threats, the AutoAdmin project also developed a SQL Continuous Monitoring engine (SQLCM) [44]. It enables continuous monitoring inside the database system and has the capability to automatically take actions based on the monitoring results.

5.3.3 Database Physical Design following the AutoAdmin Project.

Physical database schema tuning can be carried out in three different modes: manual, semi-automatic, and fully-automatic. Manual (or offline) approaches are invoked by the DBA when deemed necessary and recommend a physical configuration to the DBA to improve the DBMS performance. Semi-automatic techniques keep the DBA in the loop by recommending a physical configuration that requires feedback from the DBA. Fully-automatic approaches, known as online physical design tuning, continuously tune the configuration as the workload or database state changes, without the need for any DBA intervention.
Manual Mode. Manual tools for physical design tuning in commercial database systems other than Microsoft SQL Server are the IBM DB2 Design Advisor [219, 236, 237] and the SQL Access Advisor in Oracle 10g [64]. The DB2 Design Advisor [237] recommends indexes and materialized views for a given workload. As of DB2 version 8.2 [236], the Design Advisor provides integrated recommendations for physical design features, including indexes, materialized views, partitioning, and multi-dimensional clustering. It adopts a hybrid algorithm that efficiently searches through the large solution space while taking into consideration the interactions of related features. This approach makes it possible to break the implementation of the different features into smaller components. The implementation details can be found in [151, 191, 237]. For instance, a knapsack algorithm followed by a random-swapping phase is proposed for index and materialized view selection [219, 237], and a rank-based search and a genetic algorithm were proposed for partitioning [191]. The SQL Access Advisor in Oracle 10g [64] recommends indexes and materialized views for a given workload. It receives as input a workload and a set of candidates (created on a per-query basis by the Automatic Tuning Optimizer) and outputs a recommendation for the overall workload. In Oracle 11g [28, 138], the advisor was extended to also make recommendations for partitioning.
The index selection approach in [203] recursively applies a greedy selection that effectively accounts for index interactions. The approach decreases the number of “what-if” optimizer calls by employing caching and workload information, and thus scales to large problems.
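A hypothetical sketch of greedy selection with cached “what-if” calls is shown below; the cost table stands in for optimizer estimates, and memoization avoids repeated calls for the same configuration.

```python
# Sketch of greedy index selection with memoized "what-if" cost calls.
# A real tool would obtain these costs from the optimizer; here they are
# an invented per-configuration cost table.
from functools import lru_cache

COSTS = {
    frozenset(): 100,
    frozenset({"a"}): 70, frozenset({"b"}): 80, frozenset({"c"}): 95,
    frozenset({"a", "b"}): 50, frozenset({"a", "c"}): 68,
    frozenset({"b", "c"}): 78, frozenset({"a", "b", "c"}): 49,
}

@lru_cache(maxsize=None)            # cache repeated what-if calls
def what_if_cost(config):
    return COSTS[config]

def greedy_select(candidates, k):
    """Greedily add up to k indexes, each step keeping the cheapest config."""
    config = frozenset()
    for _ in range(k):
        best = min(candidates - config,
                   key=lambda i: what_if_cost(config | {i}))
        if what_if_cost(config | {best}) >= what_if_cost(config):
            break                   # no remaining candidate helps
        config |= {best}
    return config

print(sorted(greedy_select(frozenset({"a", "b", "c"}), 2)))  # ['a', 'b']
```

The cited approach goes further by exploiting workload structure to prune candidates; the caching pattern itself is what the sketch shows.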
The work in [192] presents a physical design tool for C-store [215], a clustered column-store RDBMS. This tool builds an optimal candidate set consisting of materialized views and clustering indexes for OLAP workloads. The work in [233] provides a hybrid evolutionary approach, combining the advantages of heuristic and evolutionary algorithms to select materialized views in a data warehouse, whereas [98] presents a simple greedy heuristic for selecting views. The cost model in these studies is calculated in terms of query processing and view maintenance.
In [127], the authors investigate how to effectively integrate index compression techniques into physical database design. The presented index selection approach considers compressed alternatives for each index based on a given memory budget. The optimizer’s cost model is extended to handle compressed indexes in the configurations. The work in [193] jointly optimizes different configuration decisions (compression, index, and ordering) using linear programming to determine the best runtime performance for a given workload and memory budget.
Finally, there exist a number of state-of-the-art offline approaches for suggesting vertical and horizontal partitions [124, 171, 186] and multidimensional hierarchical clustering of OLAP data [159].
Semi-automatic or Fully-automatic Mode. The work in [205] proposes a semi-automatic index tuning technique, which adapts its recommendations to changes in the workload and feedback from the DBA. Fully-automatic approaches became popular in the form of database cracking [113, 114], continuous monitor-and-tune approaches [13, 14, 37, 88, 111, 112, 122, 147, 198, 204, 227], and machine learning-based approaches [26, 76, 135, 136, 146, 150, 179, 184, 185, 187, 210, 231]. Database cracking is an online index selection approach, where indexes are created and continuously refined as part of query processing. In this approach, each query is interpreted not only as a request for a particular result set, but also as advice to crack the physical database store into smaller pieces. There is a large body of work on extending and improving database cracking [90, 91, 92, 93, 100, 110, 115, 189, 206, 207]. A comparative study of different cracking techniques is presented in [208].
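The cracking idea can be sketched as follows: answering a range query also partitions the column around the query bounds, so later overlapping queries touch less data. This toy version rebuilds the column with comprehensions; actual cracking implementations reorganize in place with swaps and maintain a cracker index of piece boundaries.

```python
# Minimal sketch of the database-cracking idea: a range query physically
# reorganizes ("cracks") the column into [< lo | lo..hi | > hi] pieces.
# Real systems crack in place and keep an index of piece boundaries.

def crack_in_two(column, lo, hi):
    """Partition column around [lo, hi]; return the middle piece and bounds."""
    left = [v for v in column if v < lo]
    mid = [v for v in column if lo <= v <= hi]
    right = [v for v in column if v > hi]
    column[:] = left + mid + right          # in-place reorganization
    return mid, (len(left), len(left) + len(mid))

col = [13, 4, 55, 9, 27, 2, 42, 17]
result, bounds = crack_in_two(col, 10, 30)
print(sorted(result))   # [13, 17, 27]: the query answer
print(col)              # [4, 9, 2, 13, 27, 17, 55, 42]: clustered pieces
```

A later query on, say, [15, 30] would only need to scan the middle piece recorded in `bounds`.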
Continuous monitor-and-tune approaches [13, 14, 37, 88, 111, 112, 122, 147, 198, 204, 227] monitor changes in the workload in a continuous way, collect statistics, and automatically make online decisions to perform changes to the physical design as needed. The works in [204] and [198] share similar goals with [37], but differ in their design. The authors of [198] developed the earliest online index selection solution, called the QUIET framework, implemented on top of DB2. It aims at creating indexes automatically at runtime, while considering a DBA-defined space budget. In comparison to [37], this approach requires multiple additional calls to the optimizer to obtain information about candidate indexes. The work in [204] developed a self-tuning framework, called COLT, that continuously monitors the workload of a database system and enriches the existing physical design with a set of effective indexes. COLT has been implemented inside PostgreSQL, extending the optimizer with a “what-if” interface. In contrast to [37], it is not fully integrated in the query optimizer. The authors of [227] developed an autonomic tuning expert for IBM DB2 and evaluated it under changing workloads and system states. The approach stores DBA experts’ tuning knowledge for the autonomic management process.
An online, DBMS-independent framework for automatic index tuning was introduced in [14], which can be used with different DBMSs. It is based on heuristics that run continuously to guide decisions on the current physical database configuration in order to react to workload changes. In [13], a model-driven approach was developed to support tuning decisions by different heuristics. An instantiation of the model is the database self-tuning task for physical design tuning, including indexes and materialized views. In this model, a database tuning requester can be a human agent, such as a DBA, or a computational agent, such as a monitoring tool. The work in [147] employs an online approach to monitor database queries. It collects statistical information in order to detect performance trends and to make decisions for automatic re-partitioning. In [111, 112], the authors presented a method for automatically deriving database configurations by evaluating an objective function. The proposed self-management logic provides continuous monitoring and light-weight analysis of the workload and database state, compares it against goals (e.g., response times, throughput, CPU or disk space usage, availability, and operation costs) defined by a DBA, and starts a reconfiguration analysis when there is a risk of missing the goals. Multi-objective optimization techniques are used to automatically derive database configurations that meet these goals. Additionally, this approach stores an explicit workload model to identify patterns in the workload that make it possible to predict workload shifts. Finally, an online technique to recommend indexes for high-dimensional databases was introduced in [88]. This approach uses a feedback control loop to monitor the query workload, detect when the query pattern changes, and recommend corresponding adjustments to the index set.
There are numerous studies [26, 76, 135, 136, 140, 146, 150, 179, 184, 185, 187, 210, 231] that have leveraged machine learning techniques for automated tuning of DBMSs. They extract the state and performance metrics from a DBMS and use this data to train models for choosing tuning features that will benefit the most. Some studies [135, 136, 146, 184, 185] propose autonomous (self-managing) database systems, which leverage machine learning techniques for workload modeling and forecasting, fully automated physical design, and other configuration components. Thus, they do not focus on one particular configuration aspect and take a holistic view, which considers combinations of various configuration components.
The authors of [184] present the first self-driving RDBMS, called Peloton (later NoisePage [185]), which automatically configures and optimizes various aspects (e.g., indexes, materialized views, data partitioning, knobs) as the database and the workload evolve over time. It supports hybrid workloads, OLTP and OLAP, and predicts future workloads based on past arrival rates [156] so that the system can adapt itself. NoisePage uses behavior models generated by different ML algorithms to predict the cost and benefit of deploying configuration components. Based on these cost/benefit estimates, NoisePage adopts tree-based optimization methods (e.g., Monte Carlo tree search (MCTS) or RL methods) to select actions that improve the system’s target objective function.
The authors of [135, 136] propose a framework for self-driving database systems that automatically tunes various configuration features, such as physical database design, knob configurations, or hardware resources. The framework is able to predict future workloads and to deliver configurations based on these predictions. They also propose a linear programming (LP) model to tune the multiple dependent features in a recursive way. Additionally, an approach for adaptive cost estimation is suggested, which can be realized by applying linear regressions, gradient boosting regressors, or neural networks. The proposed framework is integrated into the research DBMS Hyrise [74], a relational columnar store for hybrid transactional and analytical processing (HTAP).
In [146], the authors built an autonomous database framework and integrated ML techniques into the open-source database system openGauss [175]. They proposed learning-based models for developing learned optimizers and advisors, such as learned cost estimation, learned knob tuning, learned database diagnosis, and a learned view/index advisor. For the latter, a deep reinforcement learning model to select high-quality indexes and materialized views is proposed.
Other researchers have developed learning approaches to solve the problem of automatically indexing a database [26, 140, 179, 187, 197, 210], materialized view selection [150, 231], and partitioning [76] using reinforcement learning or deep reinforcement learning.
Finally, we would like to mention some autonomous features in commercial databases, including Oracle Autonomous database [176], Alibaba self-driving database [12, 145], and fully-automated indexing in Microsoft Azure SQL Database [67].

6 Schema Design and Tuning in NoSQL Databases

This section provides an analysis of the research on NoSQL schema design using the taxonomy facets in NoSQL databases, including NoSQL type, level of data model, and modeling approach.

6.1 NoSQL Type and Level of Data Model

These two sub-facets in our taxonomy deal with database type and the level of the data model in NoSQL schema design. Concerning the type of databases, more than 200 different NoSQL stores have been reported, and the number is growing.3 Here we use the categorization from [69]:
Key-Value Stores are the least complex NoSQL databases from an architectural point of view. Data are represented as (key, value)-pairs in which a key serves as a unique identifier.
Document Stores are extended key-value stores, in which the value is a document stored in a semi-structured format, such as JSON or XML. Documents have flexible schemas, i.e., documents are not required to have the same schema or to contain the same level of information.
Wide-Column Stores, also known as extensible record stores, store data in columns and groups of columns. In fact, these systems store data in a tabular format of rows, called column family. Each row is identified by a key and can contain an arbitrary number of columns with a similar structure, which are usually accessed together. Rows in a column family are not required to share the same columns, i.e., columns can be added or modified at runtime.
Graph Stores are based on the graph structure that includes nodes and edges. Nodes and edges may also have properties.
From the viewpoint of data model, we distinguish between single-model and multi-model systems. The latter support the storage and processing of structurally different data, i.e., data with different models. Multi-model databases can be further divided into single databases and multi databases [155]. Multi-model single databases build one single engine to manage different data models with a fully integrated back-end to satisfy the system’s demand for performance, scalability, and fault tolerance. In contrast, multi-model multi databases (i.e., polystores [75] and multistores [132]) combine multiple different data stores, each with distinct native data models and different languages. An integrated access to all stores through one or more query languages is provided. In multi-model databases, a store-independent global schema is needed to integrate the different data models and to offer a uniform access. That is, at the logical and physical level, a unified data model can be defined, which is mapped to the local schemas. Several generic metamodels, such as SOS [19], NoAM [18], TDM [142], U-Schema [39], have been proposed.

6.2 Modeling Approach

Modeling NoSQL databases can be performed by using either a forward, a reverse, or a roundtrip engineering methodology.

6.2.1 Forward Engineering/Schema-first Approaches.

Forward Engineering (or schema-first) approaches start from a conceptual model or a relational database and transform it into a NoSQL data model. In this process, workloads may be considered to optimize the proposed schema. Thus, we can distinguish two categories of approaches: workload-agnostic and workload-driven approaches.
Workload-agnostic Approaches. These approaches define a set of conversion rules that transform a relational or conceptual schema into a NoSQL schema. They largely ignore workloads as well as the evaluation of alternative designs. In this context, the authors of [1] provide a set of mapping rules that transform a UML conceptual model into different NoSQL physical models (column, document, or graph). They provide a generic intermediate logical model that describes data according to the features of NoSQL models and supports a sufficient degree of independence, so as to enable the mapping to several platforms. The work in [65] presents a set of mappings from conceptual schemas expressed in UML/OCL into logical schemas (data store specific models) and integrates them in a unique multi-data store design method. This approach embeds three transformations, each specific to a data store: UmlToSQL, UmlToGraphDB, and UmlToDocumentDB. Some authors propose transformations for column store databases [144, 149] and document databases [104, 174]. For the mapping of conceptual schemas to graph databases, the works in [5, 11, 66] have transformed a UML model, relational and RDF schemas, and an extended ER model to a graph model. Additionally, data modeling for document-oriented and graph-based databases has been investigated in [125]. The study in [15] has investigated the process of transforming a relational database schema into various NoSQL database schemas, namely document-based, column-based, and graph-based.
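One such conversion rule can be sketched in a few lines: a one-to-many relationship in the conceptual model becomes an embedded array in a document model. Entity and field names are invented for illustration.

```python
# Illustrative sketch of a workload-agnostic mapping rule: a one-to-many
# relationship in a conceptual model becomes an embedded array inside the
# parent document. Entity and field names are invented.

def map_one_to_many(parent, children, child_key):
    """Embed child entities as an array inside the parent document."""
    doc = dict(parent)
    doc[child_key] = [dict(c) for c in children]
    return doc

author = {"_id": 1, "name": "Ada"}
books = [{"title": "Vol. 1"}, {"title": "Vol. 2"}]
print(map_one_to_many(author, books, "books"))
# {'_id': 1, 'name': 'Ada', 'books': [{'title': 'Vol. 1'}, {'title': 'Vol. 2'}]}
```

Actual rule sets also cover many-to-many relationships, references versus embedding, and target-specific constructs such as column families or graph edges.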
The studies in [25, 89] proposed a modeling language for data modeling in hybrid polystores, which involve a combination of relational and NoSQL databases. The language provides mapping rules between the conceptual schema of a polystore according to the Entity-Relationship model and the underlying physical schemas.
Finally, it is worth mentioning contributions related to modeling and implementing data warehouses in NoSQL databases [10, 51, 52, 53, 54, 72, 73, 209, 230]. These approaches define a set of mapping rules that transform a multidimensional conceptual data model into NoSQL models.
Workload-driven Approaches. Workload-driven schema design considers the workload when optimizing the schema, i.e., the schema design is based on the analysis of the workload queries. The workload-driven approach is suitable when the queries and data structure are known beforehand. Approaches in this area often rely on schema denormalization and data redundancy to optimize the schema and improve query performance.
The works in [29, 218] use denormalization to optimize the database schema for read queries. By analyzing read queries, [218] employs foreign key constraints in the conceptual schema to generate all possible denormalized relations for each query, whereas [29] uses secondary indexes to identify all feasible table variants that could answer each read query via a single read request. A cost function is used to select the schema with the minimal cost.
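The per-query candidate generation and cost-based selection described above can be sketched as follows, under the simplifying assumption that variant costs are independent across queries; all names and costs are invented.

```python
# Hedged sketch of workload-driven denormalization: for each read query,
# enumerate candidate table variants that can answer it in one request,
# then pick the variant with minimal estimated cost. Queries, variants,
# and costs are invented for illustration.

# candidate variants per query: (variant_name, estimated_read_cost)
CANDIDATES = {
    "q_user_orders": [("users_with_orders", 3), ("orders_by_user", 2)],
    "q_order_items": [("orders_with_items", 4), ("items_by_order", 5)],
}

def pick_schema(candidates):
    """Choose the cheapest variant per query (costs assumed independent)."""
    return {q: min(variants, key=lambda v: v[1])[0]
            for q, variants in candidates.items()}

print(pick_schema(CANDIDATES))
# {'q_user_orders': 'orders_by_user', 'q_order_items': 'orders_with_items'}
```

In the cited works the cost function also weighs storage and update overhead introduced by the resulting redundancy, which this sketch omits.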
Rafik et al. [177] present a set of conversion rules to support the schema conversion of an existing relational database into HBase. To obtain a proper design, the conversion rules depend on an analysis of the most frequent access patterns and of the heavy queries that are frequently run.
Mior [163] developed an approach, called NoSE, for automated schema generation in extensible record stores. This cost-based approach uses a novel binary integer programming (BIP) formulation to map a conceptual model to a NoSQL schema. NoSE receives a conceptual model and workload as input, and then formulates the schema optimization problem as a BIP, which selects a set of column families that minimize the execution cost of the entire workload. This approach has been implemented for the Cassandra extensible record store [164].
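A toy brute-force analogue of such a cost-based formulation is sketched below: it selects a subset of candidate column families that covers all workload queries at minimal combined storage and query cost. It enumerates subsets rather than solving a BIP, and all names and numbers are invented.

```python
# Toy analogue of a NoSE-style formulation: choose a subset of candidate
# column families so that every workload query is answerable, minimizing
# total (storage + query) cost. All names and numbers are invented; NoSE
# itself solves a binary integer program instead of enumerating subsets.
from itertools import combinations

# column family -> (storage cost, queries it can answer, per-query cost)
CF = {
    "cf1": (5, {"q1"}, 2),
    "cf2": (8, {"q1", "q2"}, 3),
    "cf3": (4, {"q2"}, 1),
}
QUERIES = {"q1", "q2"}

def best_plan(cf, queries):
    """Exhaustively pick the cheapest feasible subset of column families."""
    best, best_cost = None, float("inf")
    names = list(cf)
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            covered = set().union(*(cf[n][1] for n in subset))
            if not queries <= covered:
                continue            # infeasible: some query unanswered
            storage = sum(cf[n][0] for n in subset)
            # each query is answered by its cheapest covering column family
            qcost = sum(min(cf[n][2] for n in subset if q in cf[n][1])
                        for q in queries)
            if storage + qcost < best_cost:
                best, best_cost = subset, storage + qcost
    return sorted(best), best_cost

print(best_plan(CF, QUERIES))       # (['cf1', 'cf3'], 12)
```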
The works in [50] and [71] consider workload information to select an optimized logical schema for read queries by reducing the number of accesses to the database. In [50], a big data modeling tool, called KDM, is presented, which automates Cassandra database schema design for queries defined in an application workflow. It starts with a query-driven mapping from a conceptual data model to a logical data model and ends with a logical-to-physical mapping. In [71], the researchers defined mapping rules that guide a workload-driven transition from a conceptual schema to a logical schema and its physical implementation for NoSQL document stores.
Imam et al. [118] presented a long list of NoSQL modeling guidelines for the logical and physical design of document-store databases, with the aim of facilitating the data modeling process and improving developers’ modeling skills. In another work [117], the authors presented a schema suggestion model for NoSQL document stores. The model suggests a schema based on system requirements and parameters, such as the list of entities and Create, Read, Update and Delete (CRUD) statements. This model aims at balancing significant system requirements, such as consistency, availability, and scalability. An extension of the model, called the DSP model [116], focuses on performance and describes how the model supports query complexities. For this, a binary integer formulation was used to map the conceptual data model to a database schema.
The work in [70] presents a framework for the automatic design of NoSQL databases (column family model and document model), which takes a conceptual data model and access queries as inputs. It includes the definition of a set of rules to transform the conceptual data model first to a logical model and then to the final implementation in the target NoSQL database.
Physical design guidelines for graph databases with the aim of improving query execution time are presented in [123]. In this work, indexes, path materialization, and query rewriting are considered as guidelines for physical design in Neo4j.
CONST loop [167] is the first online NoSQL schema design approach. It adopts a self-tuning feedback control loop for continuous workload monitoring and analysis, as well as for optimized schema design in wide-column stores, and has been implemented in Apache Cassandra. This loop describes a design pattern for the self-tuning feature, which automatically modifies the current schema design in reaction to changes in the application’s workload. Details of the architecture of the monitor and analyze phases are described in [166].
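A self-tuning feedback loop of this kind can be sketched schematically in the monitor-analyze-plan-execute style; all components below are stubs, and the shift metric is invented.

```python
# Schematic sketch of a self-tuning feedback loop (monitor -> analyze ->
# plan -> execute) reacting to workload shifts. All components are stubs;
# a "schema" here is simply the set of query shapes it is optimized for.

def feedback_loop(workload_samples, current_schema, threshold=0.3):
    log = []
    for sample in workload_samples:                      # Monitor
        shift = workload_shift(sample, current_schema)   # Analyze
        if shift > threshold:
            current_schema = redesign(sample)            # Plan
            log.append(("retuned", current_schema))      # Execute
    return current_schema, log

def workload_shift(sample, schema):
    """Stub metric: fraction of queries the current schema does not serve."""
    unserved = [q for q in sample if q not in schema]
    return len(unserved) / len(sample)

def redesign(sample):
    """Stub redesign: optimize for exactly the query shapes just observed."""
    return set(sample)

samples = [["q1", "q2"], ["q1", "q3", "q3"], ["q3", "q4"]]
schema, log = feedback_loop(samples, {"q1", "q2"})
print(sorted(schema), len(log))  # ['q3', 'q4'] 2
```

Real loops replace the stubs with query-log statistics, cost models, and online schema migration.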
Noguera and Lucrédio [173] attempted to implement a schema-independent way to specify and execute NoSQL queries; without such independence, the developer is not free to change the schema in which the data is stored without first adapting the data access code (queries). This work employed a generative approach to simplify the writing of queries independently from the actual data structure in which the data is stored. It uses mapping information between the ER model and NoSQL document-oriented databases, and automatically transforms ER-based queries into native code for a specific NoSQL database. It yields a more flexible schema and allows changes to be made to a NoSQL database schema without requiring changes to all related queries. However, it requires upfront effort to correctly create the mapping information.
Several metamodels have been proposed to represent multi-model NoSQL schemas. SOS [19] is a metamodel to represent schemas for aggregate-based NoSQL stores (i.e., key/value, column, and document), supporting application development by hiding the specific details of the various database systems. Later, SOS evolved to the NoSQL Abstract Data Model (NoAM) [18], which is an intermediate system-independent data model to support a design methodology for aggregate-based NoSQL databases. The design methodology uses aggregate design (a group of related objects), which is mainly driven by data access patterns of the application, as well as by scalability and consistency needs. Mali et al. [158] describe an optimization process to generate a platform-independent, generic logical model used to generate five target models (relational & NoSQL models). They apply a transformation heuristic to reduce the search space of the model generation. This work did not address mapping rules for transforming the generic logical model to physical models. In another work, Li and Chen [148] proposed a query-oriented data modeling approach, called QODM, to generate a platform-independent data model for NoSQL databases based on query requirements, which can be transformed to a specific NoSQL database. The QODM approach includes key-value stores, document stores, and extensible record stores; graph databases are not supported.
The research efforts in [31, 42, 109, 199, 228] introduce methods to design NoSQL schemas with the goal of enhancing the performance of OLAP queries. For document warehouses (DocW), Messaoud et al. [161, 162] propose an approach for transforming a document warehouse model into a column-oriented or document-oriented NoSQL model.

6.2.2 Reverse Engineering Approaches.

Most NoSQL data stores are schema-less, that is, data can be stored without a prior formal definition of its structure. The flexibility of these data models allows the user to easily incorporate new data without making changes to the data model (evolution of data over time). However, to express queries there is still a need to know how the data is structured. Therefore, for processing data without explicit schema information, reverse engineering can be used, which makes it possible to generate the data model from an existing database. This is important to facilitate the understanding, documentation, and schema visualization of the data.
In [131] a schema extraction approach has been introduced that generates a JSON schema from collections of JSON documents. This approach is based on an earlier approach for XML schema extraction from XML documents [165]. Similarly, [20, 121, 221] deal with the problem of inferring schemas from JSON sources by using hierarchical data structures to summarize the structural information of JSON documents. Some studies [2, 3, 4, 32, 33, 55, 61, 196] present an MDE-based reverse engineering approach for extracting the data model from schema-less NoSQL databases. Model-Driven Engineering (MDE) techniques provide a formal framework for automatic model transformation, i.e., mapping one input model into an output model. This transformation process defines a set of transformation rules to be applied between source and target models.
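A minimal flavor of schema extraction from schema-less JSON collections: union the fields observed across documents and record the types seen for each field. Real extractors additionally handle nesting, arrays, and structural variants.

```python
# Minimal sketch of schema inference from schema-less JSON documents:
# collect, per field, the set of value types observed across a collection.
# Real extractors also handle nested objects, arrays, and version variants.
import json

def infer_schema(documents):
    schema = {}
    for doc in documents:
        for field, value in doc.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

docs = [json.loads(s) for s in (
    '{"name": "Ada", "age": 36}',
    '{"name": "Alan", "age": "41", "city": "London"}',
)]
print(infer_schema(docs))
```

The mixed types inferred for `age` illustrate why such extractors must summarize structural variants rather than assume one schema per collection.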
Frozza et al. [84] present an approach for extracting a schema from a JSON or Extended JSON document collection stored in a MongoDB database. This work combines hierarchical structures and MDE-based processes. Additionally, aggregation operations are considered to improve the performance of the schema extraction task. The authors also present methods for extracting schemas in the JSON document schema format from graph [85] and columnar [83] models.
Other works, such as [106, 141], present property graph schema inference methods from graph databases that are based on the property graph data model that is used in Neo4j. Some works focus only on extracting the physical or logical model from the data stored in NoSQL databases [4, 20, 33, 83, 84, 85, 106, 131, 141, 196, 221], while others address the extraction of the conceptual model from the NoSQL physical schema [2, 3, 32, 55] or the stored data [61, 121].
Regarding multi-model databases, a Tensor-based Data Model (TDM) [142] was defined to support logical data independence in polystore systems. TDM subsumes various data models (relational, NoSQL, etc.) and proposes a set of operators for defining views (virtual or materialized views) that can dynamically transform data into appropriate data structures. Reverse mappings from relational, key/value, wide-column, and graph models to TDM were presented.
There exist also some approaches for directly executing OLAP queries on NoSQL data sources [60, 86, 101]. The works in [60] and [86] focus on extracting multidimensional schemas (facts and hierarchies) from document stores. The work in [101] proposes a pay-as-you-go approach, which enables OLAP queries against a polystore supporting relational, document, and column data models by hiding heterogeneity behind a dataspace layer.

6.2.3 Roundtrip Engineering Approaches.

Roundtrip engineering (RTE) is one aspect of model-driven engineering (MDE) that combines forward and reverse engineering. Since code and model are interrelated, changing the code will change the model and vice versa. Akoka and Comyn-Wattiau [9] present a framework describing a roundtrip engineering process for graph databases, which was the first RTE approach for NoSQL databases. As pointed out in this work, roundtrip engineering allows a bidirectional transformation between the model and the source code. The authors of [168] first present the process of extracting the physical schema from the data stored in MongoDB, which captures characteristics such as existing indexes, data organization, and statistical features, and then propose a bidirectional transformation between logical and physical models.
Candel et al. [39] propose a unified logical metamodel, termed U-Schema, which represents logical schemas for a multi-model database supporting relational and the most popular NoSQL systems (columnar, document, key–value, and graph). The U-Schema metamodel has been defined with the Ecore metamodeling language of the Eclipse Modeling Framework (EMF) [213]. A MapReduce operation is performed on the database to infer the logical data model for each NoSQL paradigm. Next, a round-trip strategy is defined, consisting of a forward mapping from the inferred logical schema to U-Schema and a reverse mapping in the opposite direction.

7 Workload Type and Tuning Approach

7.1 Workload Type

The main objective of the schema design process is to provide an optimized schema that increases the performance of an application’s workload. Thus, the type of the workload plays an important role. This facet of the proposed taxonomy deals with the classification of workloads into analytical (OLAP), transactional (OLTP), or hybrid (OLAP and OLTP) workload.
Since the goal of SQL physical schema design is to find a good physical configuration for a given workload, the schema design depends entirely on the current workload type. In NoSQL workload-driven approaches, the schema design is based on the analysis of workload queries to optimize the schema and improve the performance of OLTP workloads [29, 50, 71, 163, 167, 218] or OLAP workloads [31, 42, 109, 199, 228]. As a result, the workload type plays an important role in these approaches. In contrast, NoSQL workload-agnostic approaches largely ignore the analysis of the workload, and hence the workload type is unimportant. However, there are some workload-agnostic approaches for modeling and implementing data warehouse and OLAP systems in NoSQL databases, defining a set of mapping rules that transform a multidimensional conceptual data model into NoSQL models [10, 51, 52, 53, 54, 72, 73, 209, 230]. There also exist some reverse approaches for directly executing OLAP queries on NoSQL data sources [60, 86, 101].
While there exist solutions for the optimization of physical structures for hybrid workloads in SQL databases [6, 136, 184, 185, 236], no solutions to support hybrid workloads in NoSQL schema design have been suggested so far. This is due to the different ways in which conceptual and logical schemas are represented in multidimensional (OLAP) databases as compared to OLTP databases.

7.2 Tuning Approach

Another important facet of the taxonomy is the approach that is used to incorporate tuning into a database management system. The tuning approaches can be categorized into static and dynamic.

7.2.1 Static (or Offline) Tuning.

Static tuning approaches recommend an efficient schema for a given workload to the DBA. While they help to decrease the time required to tune a database, they leave several important decisions to DBAs, such as continuously monitoring the database to detect changes and re-tune the schema design. Static tuning approaches are designed to be invoked manually whenever deemed necessary. Thus, they require considerable human intervention and do not meet the requirements of an autonomic database. A further shortcoming of these approaches is that they can only be used when the workload of the system is known and stable. Most studies for SQL [6, 45, 64, 98, 192, 203, 233, 236] and NoSQL [29, 50, 70, 71, 117, 164, 177, 218] databases take an offline approach to the schema design problem.

7.2.2 Dynamic (or Online) Tuning.

In dynamic environments where workloads are unknown, unstable, and change over time, offline schema tuning is far from ideal. Instead, dynamic approaches are able to track changes in both the workload and the database state and make decisions online to reconfigure the current schema design as needed. Such approaches provide fully automated solutions that meet the requirements of an autonomic database system and reduce the maintenance overhead for DBAs. The topic of online physical tuning of SQL databases rose to popularity in the form of database cracking [113, 114], continuous monitor-and-tune approaches [13, 14, 37, 88, 111, 112, 122, 147, 198, 204, 227], and machine learning-based approaches [26, 76, 135, 136, 146, 150, 179, 184, 185, 187, 210, 231]. For NoSQL databases, CONST [167] is the only online schema design approach; it implements a self-tuning feedback control loop for continuous workload monitoring, analysis, and optimized schema design in wide column stores.
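The monitor-analyze-reconfigure cycle that such continuous monitor-and-tune approaches share can be sketched as follows. All component names (the monitor, designer, and DBMS interfaces) are hypothetical placeholders, not the API of CONST or any other cited system; the sketch only illustrates the feedback-loop structure.

```python
import time

class SelfTuningLoop:
    """A skeletal feedback control loop for online schema tuning.
    The injected components are assumed to exist; only the loop logic is shown."""

    def __init__(self, monitor, designer, dbms, improvement_threshold=0.1):
        self.monitor = monitor       # collects recent workload statistics
        self.designer = designer     # proposes a schema and estimates costs
        self.dbms = dbms             # exposes the current schema, applies changes
        self.threshold = improvement_threshold

    def step(self):
        workload = self.monitor.snapshot()
        candidate = self.designer.design(workload)
        current_cost = self.designer.estimate_cost(self.dbms.schema, workload)
        candidate_cost = self.designer.estimate_cost(candidate, workload)
        # Reconfigure only if the estimated benefit clearly outweighs a
        # threshold, which guards against reconfigurations whose cost would
        # exceed their minor performance improvement.
        if current_cost - candidate_cost > self.threshold * current_cost:
            self.dbms.apply(candidate)

    def run(self, interval_seconds=60):
        while True:             # continuous monitoring in a real deployment
            self.step()
            time.sleep(interval_seconds)
```

The threshold in `step` is one simple way to express the first robustness facet discussed below: a reconfiguration is only triggered when its estimated benefit is substantial relative to the current cost.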
To be fully autonomic, a DBMS must be able to predict future workloads based on historical data in order to select proper configurations in a timely manner. Since database systems are often exposed to unanticipated events, it is difficult to anticipate future workloads. Therefore, robustness is a crucial property of self-tuning DBMSs, and it comes with several facets. The first facet of robustness concerns tuning decisions in the presence of reconfiguration costs. It might happen that a reconfiguration of the database schema produces only a minor performance improvement. To avoid such situations, the cost of the reconfiguration needs to be considered in the optimization. The second facet of robustness refers to how workload changes affect the system’s performance. Robust configurations aim at providing acceptable performance for most workload scenarios so that small changes do not have a large impact [135]. To obtain robust configurations, not only the current workload but also information about the distribution of potential future workloads must be incorporated. Robust workload forecasting models are essential since the workload may deviate significantly from the past. Some approaches explicitly incorporate these robustness aspects [111, 135, 136, 156, 185].
Within dynamic tuning approaches, we identify two sub-facets: reactive/proactive tuning, which captures when approaches adapt to changes in the data or workload, and internal/external tuning, which expresses whether the tuning mechanism is internal or external to the system.
Reactive/Proactive Tuning. This category specifically deals with the time point when dynamic approaches apply the tuning process. In the reactive mode, systems respond to a change when it already happened. In the proactive mode, systems predict when changes are expected to occur (e.g., predict periodic changes in the workload), and therefore they can adapt the DBS configuration beforehand. For example, a self-tuning database system chooses its configurations proactively according to the expected workload patterns in the future. For SQL databases, there are numerous dynamic tuning approaches, which are either reactive [26, 37, 113, 114, 122, 147, 150, 179, 198, 204, 231] or proactive [111, 112, 136, 184, 185]. Currently, there are no proactive tuning approaches for NoSQL databases, and only one work [167] is reactive.
Internal/External Tuning. From a different perspective, dynamic tuning can be divided into two categories with respect to the separation of the tuning mechanism and database system. For internal approaches, the schema self-tuning process is integrated as a module in the DBMS engine. A DBMS internally monitors and analyzes the workload and database state. If it discovers that the database requires changes, the tuning component proceeds to recommend schema changes for better performance. The internal approach has some notable drawbacks. For instance, the system is costly to maintain or evolve, and it is often not scalable. Some works [37, 113, 114, 136, 184, 185, 204] present a dynamic internal approach for physical schema design in SQL databases. However, there is currently no work on the dynamic internal approach for NoSQL databases.
On the other hand, works such as [26, 76, 111, 112, 150, 179, 198, 231] for SQL and [167] for NoSQL databases, employ an external self-tuning manager. These approaches are based on an external self-tuning logic, which controls the reconfiguration decisions of the DBMS. A significant advantage of the external approach is the reusability of the tuning process with respect to different systems, which can be easily customized and configured for different applications.

8 Discussion of Review Results

In this section, we discuss the review results and analyze research questions RQ2, RQ3, and RQ4.

8.1 Level of Representation for Database Schema Design

To analyze past work in database schema design, three levels of representation – conceptual, logical, and physical – were taken into account. In SQL databases, studies have focused exclusively on the automated design of physical structures, such as indexes, materialized views, multidimensional clustering, and horizontal or vertical partitioning. Table 4 provides an overview. We can observe that the majority of studies on physical schema design in RDBMSs address index selection, i.e., choosing indexes on base tables or materialized views. Most of the works focus on a single structure.
Table 4.
StudyIndexMat. ViewsPartitioningMultidim. Clustering
[16], [128], [178], [202], [214], [21], [24], [30], [58], [57], [67] [59], [56], [63], [78], [80], [81], [82], [102], [107], [108], [120], [119], [224], [219], [45], [203], [205], [97], [114], [113], [37], [198], [204], [88], [112], [111], [26], [179], [187], [197], [210], [140], [14], [137], [172], [40], [41], [127], [193]\(\checkmark\)   
[23], [98], [105], [211], [217], [229], [232], [233], [231], [98], [150] \(\checkmark\)  
[76], [103], [147], [191], [186], [235], [124], [171], [169], [170], [122]  \(\checkmark\) 
[151], [153], [159]   \(\checkmark\)
[139], [7], [13], [64], [146], [237]\(\checkmark\)\(\checkmark\)  
[234], [195], [136], [135]\(\checkmark\) \(\checkmark\) 
[6], [8], [138], [184]\(\checkmark\)\(\checkmark\)\(\checkmark\) 
[236]\(\checkmark\)\(\checkmark\)\(\checkmark\)\(\checkmark\)
Table 4. Classification of Schema Design Studies in SQL Databases by Physical Structures
Conversely, due to the flexibility of NoSQL data models, schema design has been studied at all three levels of representation – conceptual, logical, and physical. Table 5 shows source model, target model, and the level of representation investigated in our primary studies. Forward methods start with a conceptual model or relational database and infer logical or physical models from them. Reverse methods work in the opposite direction: they either extract the conceptual model from NoSQL physical schemas [2, 3, 32, 55] or from the stored data in the database [61, 121], or they extract the physical or logical model from the stored data in the database [4, 20, 33, 84, 131, 196, 221].
Table 5.
MethodStudySource ModelTarget ModelLevel of Representation
Forward[1]UMLGeneric logical model, Physical models (Column, Document, Graph)Logical, Physical
[65]UML/OCLMulti-store (Relational, Document, Graph)Logical, Physical
[144], [218]RDBColumnLogical
[149]UMLColumnPhysical
[5]RDB/RDFGraphPhysical
[66]UML/OCLGraphLogical, Physical
[125]ERGraph, DocumentLogical, Physical
[15]RDBColumn, Document, GraphLogical
[31]DW Relational SchemaColumnLogical
[10, 209]Multidimensional conceptual modelGraphLogical
[53], [230]Multidimensional conceptual modelDocument, ColumnLogical
[51], [72]Multidimensional conceptual modelColumnLogical
[73]Multidimensional conceptual modelKey-ValueLogical, Physical
[52], [54]Multidimensional conceptual modelDocumentLogical
[29]ERColumnLogical
[177]RDBColumnLogical, Physical
[164]Entity Graph (restricted type of ER)ColumnLogical, Physical
[50], [167]ERColumnLogical, Physical
[71]EER (Extended ER)DocumentLogical, Physical
[11]EER (Extended ER)GraphLogical, Physical
[116], [104], [173]ERDocumentLogical, Physical
[89]ERPolystore (Relational, NoSQL models)Physical
[70]UMLColumn, DocumentLogical, Physical
[18]UMLGeneric logical model, Physical models (Column, Document, Key-value)Logical, Physical
[158]UMLGeneric logical model (Relational, NoSQL models)Logical
[31]Relational DWColumnLogical, Physical
[162]Multidimensional galaxy model of DocWDocumentLogical, Physical
[161]Multidimensional of DocWColumnLogical, Physical
Reverse[61]Graph DBEER Conceptual ModelLogical, Conceptual
[85]Graph DBJSON Document SchemaPhysical
[83]Column DBJSON Document SchemaPhysical
[141], [106]Graph DBProperty Graph SchemaLogical, Physical
[121]JSON Document DBUML Conceptual ModelPhysical, Conceptual
[55]Document SchemaEntity Differentiation Conceptual modelConceptual
[20], [131], [4], [33], [221]JSON Document DBDocument SchemaPhysical
[196]Aggregate-oriented NoSQL DBLogical Schemas (UML model of EMF)Logical
[2], [3], [32]Document Schema (Physical model)UML Conceptual ModelConceptual
[84]JSON and Extended JSON DBDocument SchemaPhysical
[142]Polystore (Relational, Key/value, Column, Graph)Generic logical modelLogical
[60], [86]JSON Document DBMultidimensional Conceptual SchemaConceptual
Roundtrip[9]Graph DB EER Conceptual ModelConceptual, Logical, Physical
[39]Relational and NoSQL DBsGeneric Logical Model (EMF model)Logical
[168]JSON Document DBPhysical and Logical modelsLogical, Physical
Table 5. Classification of Schema Design Studies in NoSQL Databases by Level of Representation
Regarding the roundtrip methods (last three rows in Table 5), the first work [9] presents a bidirectional transformation between a conceptual model and a NoSQL graph DB. Similarly, the work in [168] first presents the process of extracting the physical schema from the stored data in MongoDB and then defines a bidirectional transformation between logical and physical models. Finally, [39] proposes a unified logical metamodel, termed U-Schema, which represents a logical schema for a multi-model database supporting both relational and the four most popular NoSQL systems (columnar, document, key–value, and graph). First, a MapReduce operation infers the logical data model for each NoSQL paradigm. Then, a round-trip strategy is defined, consisting of forward and reverse mappings between the inferred logical schema and the U-Schema.

8.2 Methods used for Schema Design and Tuning Process

Cost model and search strategy are the two main factors when choosing a physical design for SQL databases. Table 6 shows the different methods used in our primary studies for cost model and search strategy. The methods that can be used for the cost model are stand-alone, optimizer, and learned cost models. Regarding the search strategy, early work used exact approaches to select an optimal configuration. The exact solutions apply exhaustive search to consider all possible configurations; they are very time-consuming and hence impractical. Later studies shifted the focus toward determining a good (near-optimal) physical design instead of an optimal one using approximate solutions. Most works use heuristic solutions, greedy search, genetic algorithms, branch-and-bound, relaxation-based approaches, or hybrid approaches. Recently, advances in machine learning, especially in reinforcement learning and deep reinforcement learning, have motivated researchers to move toward learning-based approaches for finding a physical configuration with the lowest total cost.
Table 6.
    Tuning approach
StudyCost ModelSearch StrategyWorkloadStaticDynamic
     InternalExternalReactiveProactive
[16], [128], [178], [202], [214]Stand-aloneExact approachesOLTP\(\checkmark\)    
[21], [24], [30], [58], [102], [224], [103]Stand-aloneHeuristic solutionsOLTP\(\checkmark\)    
[81], [82]OptimizerHeuristic solutionsOLTP\(\checkmark\)    
[120]Stand-aloneKnapsack problemOLTP\(\checkmark\)    
[234]Stand-aloneBranch-and-boundOLTP\(\checkmark\)
[96], [97], [105]Stand-aloneGreedy approachOLAP\(\checkmark\)    
[23], [211], [217], [229]Stand-aloneHeuristic solutionsOLAP\(\checkmark\)    
[232]Stand-aloneGenetic algorithmOLAP\(\checkmark\)    
[139]Stand-aloneA* algorithmOLAP\(\checkmark\)    
[63], [80], [195]Stand-alone (knowledge-based approaches)Heuristic solutionsOLTP\(\checkmark\)    
[58], [59], [56]Combination of an optimizer-based and a knowledge-based approachHeuristic solutionsOLTP\(\checkmark\)    
[45]OptimizerGreedy approachOLTP\(\checkmark\)    
[6]OptimizerGreedy approachHybrid\(\checkmark\)    
[219]OptimizerKnapsack algorithmOLTP\(\checkmark\)    
[236]OptimizerHybrid algorithmHybrid\(\checkmark\)    
[192]Stand-aloneInteger Linear Programming (ILP)OLAP\(\checkmark\)    
[98]Stand-aloneGreedy algorithmOLAP\(\checkmark\)    
[233]Stand-aloneHybrid evolutionary approachOLAP\(\checkmark\)    
[127]Compression-aware optimizerGreedy approachOLTP\(\checkmark\)    
[193]Stand-aloneLinear programmingOLTP\(\checkmark\)    
[68]OptimizerBinary integer programming (BIP)OLTP\(\checkmark\)    
[203]OptimizerRecursive greedy approachOLTP\(\checkmark\)    
[37]OptimizerRelaxation-based approachOLTP \(\checkmark\) \(\checkmark\) 
[198]OptimizerGreedy approachOLTP  \(\checkmark\)\(\checkmark\) 
[204]OptimizerHeuristic solutionOLTP \(\checkmark\) \(\checkmark\) 
[112], [111]OptimizerMulti objective optimization techniqueOLTP  \(\checkmark\) \(\checkmark\)
[114], [113]Cracker-aware optimizerCracking algorithmOLTP \(\checkmark\) \(\checkmark\) 
[185]Several ML algorithms (for each action, one ML cost model)MCTS or RL methodsHybrid \(\checkmark\)  \(\checkmark\)
[136]Several ML algorithmsRecursively Linear ProgrammingHybrid \(\checkmark\)  \(\checkmark\)
[26], [179]Reinforcement learningModeling index selection as a MDP and solving by RLOLTP  \(\checkmark\)\(\checkmark\) 
[140]OptimizerDRLOLTP\(\checkmark\)    
[150]DRLModeling MV selection as a MDP and solving by DRLOLAP  \(\checkmark\)\(\checkmark\) 
[231]DRLModeling MV selection as an ILP problem, modeling ILP as an MDP, and solving DRLOLAP  \(\checkmark\)\(\checkmark\) 
Table 6. A Comparison of Existing Works in Schema Design for SQL Databases
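The stand-alone cost model plus greedy search combination that recurs in Table 6 can be illustrated with a toy index-selection sketch. All numbers, the storage sizes, and the cost function are invented for illustration; real advisors use far more elaborate cost models (often the query optimizer itself) and richer candidate structures.

```python
def workload_cost(indexes, workload):
    """Toy stand-alone cost model: a query is assumed to be 5x cheaper
    when an index exists on its filter column."""
    cost = 0.0
    for query in workload:
        covered = query["column"] in indexes
        cost += query["base_cost"] * (0.2 if covered else 1.0)
    return cost

def greedy_index_selection(candidates, workload, budget):
    """Greedily add the index with the largest cost reduction until the
    storage budget is exhausted or no index improves the workload cost."""
    chosen, used = set(), 0
    while True:
        best, best_gain = None, 0.0
        base = workload_cost(chosen, workload)
        for col, size in candidates.items():
            if col in chosen or used + size > budget:
                continue
            gain = base - workload_cost(chosen | {col}, workload)
            if gain > best_gain:
                best, best_gain = col, gain
        if best is None:
            return chosen
        chosen.add(best)
        used += candidates[best]

workload = [{"column": "a", "base_cost": 100}, {"column": "b", "base_cost": 10}]
candidates = {"a": 5, "b": 5, "c": 5}   # candidate index -> storage size
print(greedy_index_selection(candidates, workload, budget=5))  # picks {'a'}
```

Exact approaches would instead enumerate all subsets of candidates, which is exponential in their number; the greedy loop trades optimality for a polynomial number of cost-model evaluations, which is precisely the shift from the exact to the approximate solutions discussed above.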
Regarding NoSQL modeling, the approaches presented in existing works can be categorized into forward, reverse, or roundtrip engineering methods. While forward engineering methods go from a conceptual to a physical data model, reverse engineering methods move from an existing database (the stored data) to a conceptual model. Roundtrip engineering is based on a combination of forward and reverse engineering. The modeling approach used in existing work is illustrated in Table 7. Roundtrip modeling is the least explored method.
Table 7.
StudyNoSQL TypeLevel of Data ModelWorkloadModeling Approach Tuning Approach
    ForwardReverseRoundtrip StaticDynamic
[1]Column, Document, GraphMulti-modelOLTP\(\checkmark\)   \(\checkmark\) 
[15]Column, Document, GraphSingle-modelOLTP\(\checkmark\)   \(\checkmark\) 
[65]Document, GraphMulti-model (Multistore)OLTP\(\checkmark\)   \(\checkmark\) 
[125]Document, GraphSingle-modelOLTP\(\checkmark\)   \(\checkmark\) 
[149], [144], [177]ColumnSingle-modelOLTP\(\checkmark\)   \(\checkmark\) 
[104]DocumentSingle-modelOLTP\(\checkmark\)   \(\checkmark\) 
[66], [5], [11]GraphSingle-modelOLTP\(\checkmark\)   \(\checkmark\) 
[72], [51], [31], [199]ColumnSingle-modelOLAP\(\checkmark\)   \(\checkmark\) 
[230], [53]Column, DocumentSingle-modelOLAP\(\checkmark\)   \(\checkmark\) 
[52], [54]DocumentSingle-modelOLAP\(\checkmark\)   \(\checkmark\) 
[209], [10]GraphSingle-modelOLAP\(\checkmark\)   \(\checkmark\) 
[73]key-valueSingle-modelOLAP\(\checkmark\)   \(\checkmark\) 
[218], [29], [164], [50], [70]ColumnSingle-modelOLTP\(\checkmark\)   \(\checkmark\) 
[71], [117], [173], [116]DocumentSingle-modelOLTP\(\checkmark\)   \(\checkmark\) 
[18], [19]key-value, Column, DocumentMulti-modelOLTP\(\checkmark\)   \(\checkmark\) 
[158]key-value, Column, Document, GraphMulti-modelOLTP\(\checkmark\)   \(\checkmark\) 
[167]ColumnSingle-modelOLTP\(\checkmark\)    \(\checkmark\)
[89]key-value, Column, Document, GraphMulti-model (Polystore)OLTP\(\checkmark\)   \(\checkmark\) 
[161]ColumnSingle-modelOLAP (DocW)\(\checkmark\)   \(\checkmark\) 
[162]DocumentSingle-modelOLAP (DocW)\(\checkmark\)   \(\checkmark\) 
[42]GraphSingle-modelOLAP\(\checkmark\)   \(\checkmark\) 
[131], [20], [221], [55], [121], [3], [2], [4], [32], [84], [33]DocumentSingle-modelOLTP \(\checkmark\)  \(\checkmark\) 
[196]Document, Key-value, ColumnSingle-modelOLTP \(\checkmark\)  \(\checkmark\) 
[61], [85], [141]GraphSingle-modelOLTP \(\checkmark\)  \(\checkmark\) 
[83]ColumnSingle-modelOLTP \(\checkmark\)  \(\checkmark\) 
[142]key-value, Column, GraphMulti-model (Polystore)OLTP \(\checkmark\)  \(\checkmark\) 
[60], [86]DocumentSingle-modelOLAP \(\checkmark\)  \(\checkmark\) 
[101]Document, ColumnMulti-model (Polystore)OLAP \(\checkmark\)  \(\checkmark\) 
[168]DocumentSingle-modelOLTP  \(\checkmark\) \(\checkmark\) 
[9]GraphSingle-modelOLTP  \(\checkmark\) \(\checkmark\) 
[39]Document, Column, Graph, Key-valueMulti-modelOLTP  \(\checkmark\) \(\checkmark\) 
Table 7. A Comparison of Existing Work in Schema Design of NoSQL Data Stores

8.3 Taxonomy of Schema Design and Tuning Approaches

The different classifications of the numerous approaches for database schema design not only reveal a comprehensive overview of the significant facets and aspects of this research field, but they also provide a framework for identifying open challenges, research gaps, and promising directions for future research. Several works have already discussed different aspects of database schema design. For the automated physical schema design of relational databases, Chaudhuri and Narasayya [47] discuss the two cost models, stand-alone and optimizer, as well as offline and online tuning approaches. However, they only address a small number of publications, released until 2007. On the other hand, Störl et al. [216] focus more on the approaches of NoSQL modeling and classify them into forward and reverse engineering. Our article unifies these classifications into a taxonomy and introduces new facets to fill in the gaps.
Using our taxonomy in Figure 5, Table 6 and Table 7 compare the state-of-the-art in schema design for SQL and NoSQL databases, respectively. The following observations can be drawn from the two tables: (a) As evident from the columns Cost Model and Search Strategy of Table 6, the current research trend is toward learning-based approaches, especially reinforcement learning and deep reinforcement learning, for finding optimized physical configurations in relational databases. This is because learning-based approaches can potentially provide faster approximations for optimization problems. (b) While research interest in polystores and multistores is growing, research on schema design for such systems is still at an early stage. First articles [65, 101, 142] outline potential solutions. (c) Online approaches are recommended for providing fully automated solutions that meet the goals of an autonomic database system. There is extensive work on online physical design tuning in relational databases, yet research on NoSQL online schema tuning remains an open task. A first article [167] sketches ideas for potential solutions.

8.4 Additional Results

As additional results, we incorporate contributions from the industrial area, provide examples of real-world applications (e.g., cloud environments and large-scale data systems), and explore interactions of the schema design problem with other fields.

8.4.1 Schema Design Efforts in the Industrial Area.

As mentioned in Section 5.3, all major commercial database vendors ship automated physical design tools with relational databases, e.g., DTA in Microsoft SQL Server [6], the DB2 Design Advisor [236, 237], and the SQL Access Advisor in Oracle [64]. There also exist several tools for database schema design in industrial applications, such as Hackolade, erwin, and ER/Studio. Hackolade is a multi-platform schema design tool for NoSQL databases, which offers graphical schema design, physical data modeling, and forward and reverse engineering. Currently supported databases include Apache Cassandra, Apache HBase, and ScyllaDB among column databases; MongoDB, Couchbase, CouchDB, Elasticsearch, Firebase Realtime Database, and Google Cloud Firestore among document stores; Neo4j and TinkerPop among graph databases; and DynamoDB among key-value stores. The popular relational modeling tools erwin and ER/Studio are expanding toward NoSQL stores. erwin Data Modeler now supports Apache Cassandra, MongoDB, and Couchbase as target NoSQL databases, allowing schema forward and reverse engineering. The ER/Studio tool supports schema forward and reverse engineering for MongoDB.

8.4.2 Cloud Environments.

Offline schema tuning approaches rely on DBAs, which is expensive and does not adapt to cloud computing platforms because of their scale and complexity. Chaudhuri et al. [48] argue that DBA expenses are the key factor in the total cost of ownership. These costs can be further increased by the higher complexity of schema tuning tasks in cloud environments, caused by non-stable workloads and a lack of domain knowledge and application context [67]. Therefore, fully automatic schema tuning for cloud platforms is an urgent research issue [67, 145, 146, 176, 185].

8.4.3 Large-scale Data Systems.

In large-scale systems with thousands of queries in the workload and hundreds of tables, schema selection approaches often face specific challenges, such as solution constraints (e.g., a memory budget), reconfiguration costs, and prohibitive runtimes. Some studies have presented scalable and efficient approaches for such systems. For example, the DTA tool [6] scales to large databases and workloads using techniques such as workload compression and reduced statistics. The methods in [38, 68, 181, 203] significantly speed up what-if optimization by decreasing the number of what-if optimizer calls. The studies in [136, 146, 185] propose autonomous database systems based on machine learning methods that are applicable to large-scale problems. Only a few works [18, 20, 39] investigate scalability requirements for schema design in NoSQL databases. For instance, the works in [20, 39] use MapReduce to tackle scalability issues.

8.4.4 Exploring Broader Connections.

The importance of automated schema design goes beyond SQL and NoSQL databases. For instance, schema extraction from a wide variety of heterogeneous data sources is a crucial element in data lakes, as the source data needs to be described by a data model to capture its semantics. To support the “load-first, schema-later” paradigm, assessing data quality and cleaning large volumes of heterogeneous data sources are essential tasks in unveiling the value of big data [79]. The quality of data can refer to the extension of the data (i.e., data values) or to the intension of the data (i.e., the schema) [27]. Schema extraction from raw data may be intertwined with the preparation for operations such as discovery, integration, and cleaning [79, 99, 182]. For example, CLAMS transforms heterogeneous data in a data lake into a unified data model and enforces quality constraints on this model for cleaning purposes [79]. Several data quality verification and data cleaning methods are based on reverse engineering [190, 201].
As another example, [22] proposes an ontology-driven meta-model to conceptualize data representation in heterogeneous databases, providing a common conceptual abstraction based on semantically enriched vocabularies for both NoSQL and SQL databases, from which logical or physical schemas can be generated.

9 Limitations and Threats to Validity

The validity of a systematic review is largely defined by the extent to which the design and conduct of the study are likely to be true and free from bias. Here we briefly discuss potential threats to the validity of our work as well as actions we took to minimize these risks. We followed the methodology of Kitchenham [129, 130] to minimize bias and maximize the validity of the systematic review. Based on this methodology, this study is subject to the following limitations:
Construct Validity. One of the most common threats of systematic reviews is to ensure that all relevant academic articles are covered. In order to minimize this threat, our search for primary studies was driven by a deterministic list of inclusion and exclusion criteria and based on a set of specified research questions. Furthermore, we have defined the keywords using relevant terms based on the research questions to determine a reliable search string, and we used well-known data sources to find candidates for primary studies (cf. Section 3.1.3).
Internal Validity. This is the extent to which the design and conduct of the study are likely to prevent systematic errors [130]. We designed our methodology (cf. Section 3.2) to minimize systematic errors in the data extraction process. To ensure coherent and correct data extraction, we supplemented our core methodology with thorough discussions among all authors of this article.
External Validity. This is the extent to which the effects observed in the study are applicable outside of the study [130]. In this context, we strongly believe that both industrial and scientific communities in the area of databases and, more specifically, schema design and tuning, will largely benefit from this systematic review and the introduced taxonomy.

10 Conclusions and Future Trends

Automated database schema design has attracted a broad range of interests from academia and industry. We performed a systematic literature review to accurately present the state-of-the-art and to build an understanding of different solutions in the field of automated database schema design. For this, 75 primary studies for NoSQL databases and 108 primary studies for SQL databases from 1st January 1970 to 31st December 2021 were analyzed. The number of identified documents indicates that there has been less research on this topic for NoSQL databases. This is due to the fact that research on NoSQL automatic schema design first appeared in 2010.
The main findings in this review show that there is extensive work on automated physical schema design for SQL databases, and tools are available in commercial DBMSs. Recent research points toward the development of self-tuning database systems, which automatically configure and optimize the model in response to changes in the workload and the data. Such systems automatically tune various configuration features, such as physical database design, knob configurations, or hardware resources. Our main conclusion with respect to schema design in NoSQL databases is that there is no consensus yet on a standard to model NoSQL databases, and only a few contributions from industry have been made so far. An emerging research trend is to shift the focus toward global and generic schema design in polystores and multistores that handle heterogeneous data.
Based on our investigation and insights, we identify various directions for future research:
Online approaches for full autonomic schema tuning. Online approaches provide fully automated solutions that meet the goals of an autonomic database system. They are able to track changes in both the workload and the database state, and they make online decisions to modify the current schema design as needed. Contrary to SQL databases, little work has been done on NoSQL online schema design. Thus, there is great potential in closing this gap for NoSQL data stores.
Handling overhead and overreaction in online approaches. Online approaches aim at reducing the maintenance overhead for DBAs, but they impose a processing overhead on the database system. Reducing this overhead is therefore critical for self-tuning databases. On the other hand, online approaches should be robust against temporary variations in the workload and database state, and should prevent unnecessary reactions to these variations. Therefore, another challenge in self-tuning databases is the development of approaches that are able to accurately predict the effect of re-tuning and avoid overreaction in the presence of temporary variations.
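To illustrate the overreaction concern, the following is a minimal sketch (not taken from any surveyed system; all names and thresholds are illustrative assumptions) of a tuning controller that triggers re-tuning only when workload drift is sustained over several consecutive observations, so that temporary variations are damped:

```python
from collections import deque

class RetuningController:
    """Triggers schema re-tuning only when workload drift is sustained,
    damping overreaction to temporary variations (illustrative sketch)."""

    def __init__(self, drift_threshold=0.3, window=5):
        self.drift_threshold = drift_threshold  # relative cost drift that counts as change
        self.window = window                    # consecutive drifting observations required
        self.recent = deque(maxlen=window)      # sliding window of drift flags

    def observe(self, baseline_cost, current_cost):
        """Record one workload-cost observation; return True if re-tuning should start."""
        drift = abs(current_cost - baseline_cost) / max(baseline_cost, 1e-9)
        self.recent.append(drift > self.drift_threshold)
        # Re-tune only if the entire window shows sustained drift.
        return len(self.recent) == self.window and all(self.recent)
```

A single cost spike fills only part of the window and is ignored; only drift that persists across the full window triggers the (expensive) re-tuning step.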
Proactive approaches. Proactive schema design approaches rely on change prediction to determine when a tuning process should be applied. A major challenge is to accurately predict the change point, and to adapt the current database schema accordingly. Machine learning models can be applied to accurately predict when a change will occur. While some proactive schema tuning strategies exist for SQL databases, so far there is no such work for NoSQL databases.
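As a toy illustration of change-point detection on a workload statistic (e.g., the fraction of write operations), the following is a minimal one-sided CUSUM-style detector; it is a generic sketch under assumed parameters, not a method from any surveyed paper:

```python
def cusum_change_point(values, target, slack=0.05, threshold=0.5):
    """Return the index at which the cumulative upward drift of `values`
    away from `target` first exceeds `threshold`, or None if no change.
    `slack` tolerates small fluctuations around the target (CUSUM sketch)."""
    s = 0.0
    for i, v in enumerate(values):
        # Accumulate only the excess above target + slack; reset at zero.
        s = max(0.0, s + (v - target - slack))
        if s > threshold:
            return i
    return None
```

In a proactive tuner, the detected index would mark the point at which a workload shift becomes significant enough to start preparing a new schema; learned models can replace this simple statistic with an actual forecast of the shift.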
Generic schema design for multi-model databases. Multi-model database systems support more than one data model. To integrate concepts from several data models while offering a uniform representation, a store-independent global schema is required.

Acknowledgments

We would like to thank André Conrad for his thoughtful comments.

References

[1]
Fatma Abdelhédi, Amal Ait Brahim, Faten Atigui, and Gilles Zurfluh. 2017. UMLtoNoSQL: Automatic transformation of conceptual schema to NoSQL databases. In Proceedings of the AICCSA. IEEE Computer Society, 272–279.
[2]
Fatma Abdelhédi, Amal Ait Brahim, Rabah Tighilt Ferhat, and Gilles Zurfluh. 2020. Discovering of a conceptual model from a NoSQL database. In Proceedings of the ICEIS. SCITEPRESS, 61–72.
[3]
Fatma Abdelhédi, Amal Ait Brahim, Rabah Tighilt Ferhat, and Gilles Zurfluh. 2020. Reverse engineering approach for NoSQL databases. In Proceedings of the DaWaK(LNCS, Vol. 12393). Springer, 60–69.
[4]
Fatma Abdelhédi, Amal Ait Brahim, Hela Rajhi, Rabah Tighilt Ferhat, and Gilles Zurfluh. 2021. Automatic extraction of a document-oriented NoSQL schema. In Proceedings of the ICEIS. SCITEPRESS, 192–199.
[5]
Dippy Aggarwal and Karen C. Davis. 2016. Employing graph databases as a standardization model towards addressing heterogeneity. In Proceedings of the IRI. IEEE Computer Society, 198–207.
[6]
Sanjay Agrawal, Surajit Chaudhuri, Lubor Kollár, Arunprasad P. Marathe, Vivek R. Narasayya, and Manoj Syamala. 2004. Database tuning advisor for microsoft SQL server 2005. In Proceedings of the VLDB. 1110–1121.
[7]
Sanjay Agrawal, Surajit Chaudhuri, and Vivek R. Narasayya. 2000. Automated selection of materialized views and indexes in SQL databases. In Proceedings of the VLDB. 496–505.
[8]
Sanjay Agrawal, Vivek R. Narasayya, and Beverly Yang. 2004. Integrating vertical and horizontal partitioning into automated physical database design. In Proceedings of the SIGMOD. ACM, 359–370.
[9]
Jacky Akoka and Isabelle Comyn-Wattiau. 2018. Roundtrip engineering of NoSQL databases. Enterprise Modelling and Information Systems Architectures 13, Special (2018), 281–292.
[10]
Jacky Akoka, Isabelle Comyn-Wattiau, Cédric du Mouza, and Nicolas Prat. 2021. Mapping multidimensional schemas to property graph models. In Proceedings of the ER (Workshops)(LNCS, Vol. 13012). Springer, 3–14.
[11]
Jacky Akoka, Isabelle Comyn-Wattiau, and Nicolas Prat. 2017. A four v’s design approach of NoSQL graph databases. In Proceedings of the ER Workshops(LNCS, Vol. 10651). Springer, 58–68.
[12]
Alibaba. 2020. Alibaba self-driving database. Retrieved March 1, 2021 from https://www.alibabacloud.com/product/das/
[13]
Ana Carolina Almeida, Fernanda Baião, Sérgio Lifschitz, Daniel Schwabe, and Maria Luiza Machado Campos. 2021. Tun-OCM: A model-driven approach to support database tuning decision making. Decis. Support Syst. 145 (2021), 113538.
[14]
Ana Carolina Almeida, Angelo Brayner, José Maria Monteiro, Sérgio Lifschitz, and Rafael Pereira de Oliveira. 2016. Automatic physical design tuning based on hypothetical plans. In Proceedings of the SBBD. SBC, 115–120.
[15]
Obaid Alotaibi and Eric Pardede. 2019. Transformation of schema from relational database (RDB) to NoSQL databases. Data 4, 4 (2019), 148.
[16]
Henry D. Anderson and P. Bruce Berra. 1977. Minimum cost selection of secondary indexes for formatted files. ACM Transactions on Database Systems 2, 1 (1977), 68–90.
[17]
Chaimae Asaad and Karim Baïna. 2018. NoSQL databases – seek for a design methodology. In Proceedings of the International Conference on Model and Data Engineering. Springer, 25–40.
[18]
Paolo Atzeni, Francesca Bugiotti, Luca Cabibbo, and Riccardo Torlone. 2020. Data modeling in the NoSQL world. Computer Standards & Interfaces 67 (2020), 103149.
[19]
Paolo Atzeni, Francesca Bugiotti, and Luca Rossi. 2014. Uniform access to NoSQL systems. Inf. Syst. 43 (2014), 117–133.
[20]
Mohamed Amine Baazizi, Dario Colazzo, Giorgio Ghelli, and Carlo Sartiani. 2019. Parametric schema inference for massive JSON datasets. VLDB J. 28, 4 (2019), 497–521.
[21]
R. Bonanno, Dario Maio, and Paolo Tiberio. 1985. An approximation algorithm for secondary index selection in relational database physical design. Comput. J. 28, 4 (1985), 398–405.
[22]
Shreya Banerjee and Anirban Sarkar. 2016. Ontology driven meta-modeling for NoSQL databases: A conceptual perspective. International Journal of Software Engineering and Its Applications 10, 12 (2016), 41–64.
[23]
Elena Baralis, Stefano Paraboschi, and Ernest Teniente. 1997. Materialized view selection in a multidimensional database. In Proceedings of the VLDB. 156–165.
[24]
Elena Barcucci, Renzo Pinzani, and Renzo Sprugnoli. 1990. Optimal selection of secondary indexes. IEEE Trans. Software Eng. 16, 1 (1990), 32–38.
[25]
Francesco Basciani, Juri Di Rocco, Davide Di Ruscio, Alfonso Pierantonio, and Ludovico Iovino. 2020. TyphonML: A modeling environment to develop hybrid polystores. In Proceedings of the MoDELS (Companion). ACM, 2:1–2:5.
[26]
Debabrota Basu, Qian Lin, Weidong Chen, Hoang Tam Vo, Zihong Yuan, Pierre Senellart, and Stéphane Bressan. 2016. Regularized cost-model oblivious database tuning with reinforcement learning. Trans. Large Scale Data Knowl. Centered Syst. 28 (2016), 96–132.
[27]
Carlo Batini, Cinzia Cappiello, Chiara Francalanci, and Andrea Maurino. 2009. Methodologies for data quality assessment and improvement. ACM Comput. Surv. 41, 3 (2009), 16:1–16:52.
[28]
Peter Belknap, Benoît Dageville, Karl Dias, and Khaled Yagoub. 2009. Self-tuning for SQL performance in oracle database 11g. In Proceedings of the ICDE. IEEE Computer Society, 1694–1700.
[29]
David Bermbach, Steffen Müller, Jacob Eberhardt, and Stefan Tai. 2015. Informed schema design for column store-based database services. In Proceedings of the SOCA. IEEE Computer Society, 163–172.
[30]
Flavio Bonfatti, Dario Maio, and Paolo Tiberio. 1983. A separability-based method for secondary index selection in physical database design. In Proceedings of the Methodology and Tools for Data Base Design. North-Holland, 149–160.
[31]
Mohamed Boussahoua, Omar Boussaid, and Fadila Bentayeb. 2017. Logical schema for data warehouse on column-oriented NoSQL databases. In Proceedings of the DEXA(LNCS, Vol. 10439). Springer, 247–256.
[32]
Amal Ait Brahim, Rabah Tighilt Ferhat, and Gilles Zurfluh. 2019. Extraction process of conceptual model from a document-oriented NoSQL database. In Proceedings of the KSE. IEEE, 1–5.
[33]
Amal Ait Brahim, Rabah Tighilt Ferhat, and Gilles Zurfluh. 2019. Model driven extraction of NoSQL databases schema: Case of MongoDB. In Proceedings of the KDIR. 145–154.
[34]
Nicolas Bruno. 2011. Automated Physical Database Design and Tuning. CRC Press.
[35]
Nicolas Bruno and Surajit Chaudhuri. 2005. Automatic physical database tuning: A relaxation-based approach. In Proceedings of the SIGMOD. ACM, 227–238.
[36]
Nicolas Bruno and Surajit Chaudhuri. 2006. To tune or not to tune? A lightweight physical design alerter. In Proceedings of the VLDB. ACM, 499–510.
[37]
Nicolas Bruno and Surajit Chaudhuri. 2007. An online approach to physical design tuning. In Proceedings of the ICDE. IEEE, 826–835.
[38]
Nicolas Bruno and Rimma V. Nehme. 2008. Configuration-parametric query optimization for physical design tuning. In Proceedings of the SIGMOD. ACM, 941–952.
[39]
Carlos Javier Fernández Candel, Diego Sevilla Ruiz, and Jesús Joaquín García Molina. 2021. A unified metamodel for NoSQL and relational databases. Inf. Syst. 104 (2021), 101898.
[40]
Alberto Caprara, Matteo Fischetti, and Dario Maio. 1995. Exact and approximate algorithms for the index selection problem in physical database design. IEEE Trans. Knowl. Data Eng. 7, 6 (1995), 955–967.
[41]
Alberto Caprara and Juan José Salazar González. 1999. Separating lifted odd-hole inequalities to solve the index selection problem. Discrete Applied Mathematics 92, 2-3 (1999), 111–134.
[42]
Arnaud Castelltort and Anne Laurent. 2014. NoSQL graph-based OLAP analysis. In Proceedings of the KDIR. SciTePress, 217–224.
[43]
Surajit Chaudhuri, Benoît Dageville, and Guy M. Lohman. 2004. Self-managing technology in database management systems. In Proceedings of the VLDB. 1243.
[44]
Surajit Chaudhuri, Arnd Christian König, and Vivek R. Narasayya. 2004. SQLCM: A continuous monitoring framework for relational database engines. In Proceedings of the ICDE. IEEE, 473–484.
[45]
Surajit Chaudhuri and Vivek R. Narasayya. 1997. An efficient cost-driven index selection tool for microsoft SQL server. In Proceedings of the VLDB. 146–155.
[46]
Surajit Chaudhuri and Vivek R. Narasayya. 1998. AutoAdmin ’what-if’ index analysis utility. In Proceedings of the SIGMOD. ACM, 367–378.
[47]
Surajit Chaudhuri and Vivek R. Narasayya. 2007. Self-tuning database systems: A decade of progress. In Proceedings of the VLDB. ACM, 3–14.
[48]
Surajit Chaudhuri and Gerhard Weikum. 2018. Self-management technology in databases. In Proceedings of the Encyclopedia of Database Systems. Springer.
[49]
Surajit Chaudhuri and Gerhard Weikum. 2006. Foundations of automated database tuning. In Proceedings of the ICDE. 104.
[50]
Artem Chebotko, Andrey Kashlev, and Shiyong Lu. 2015. A big data modeling methodology for apache cassandra. In Proceedings of the BigData Congress. IEEE Computer Society, 238–245.
[51]
Max Chevalier, Mohammed El Malki, Arlind Kopliku, Olivier Teste, and Ronan Tournier. 2015. Implementation of multidimensional databases in column-oriented NoSQL systems. In Proceedings of the ADBIS. Springer, 79–91.
[52]
Max Chevalier, Mohammed El Malki, Arlind Kopliku, Olivier Teste, and Ronan Tournier. 2015. Implementation of multidimensional databases with document-oriented NoSQL. In Proceedings of the DaWaK(LNCS, Vol. 9263). Springer, 379–390.
[53]
Max Chevalier, Mohammed El Malki, Arlind Kopliku, Olivier Teste, and Ronan Tournier. 2015. Implementing multidimensional data warehouses into NoSQL. In Proceedings of the ICEIS. SciTePress, 172–183.
[54]
Max Chevalier, Mohammed El Malki, Arlind Kopliku, Olivier Teste, and Ronan Tournier. 2016. Document-oriented models for data warehouses - NoSQL document-oriented for data warehouses. In Proceedings of the ICEIS. SciTePress, 142–149.
[55]
Alberto Hernández Chillón, Diego Sevilla Ruiz, Jesus García Molina, and Severino Feliciano Morales. 2019. A model-driven approach to generate schemas for object-document mappers. IEEE Access 7 (2019), 59126–59142.
[56]
Ramtjander Choenni. 1995. On the Automation of Physical Database Design. Ph. D. Dissertation. University of Twente.
[57]
Sunil Choenni, Henk Blanken, and Thiel Chang. 1993. Index selection in relational databases. In Proceedings of the ICCI. IEEE, 491–496.
[58]
Sunil Choenni, Henk M. Blanken, and Thiel Chang. 1993. On the automation of physical database design. In Proceedings of the 1993 ACM/SIGAPP Symposium on Applied Computing: States of the Art and Practice. 358–367.
[59]
Sunil Choenni, Henk Wagterveld, Henk M. Blanken, and Thiel Chang. 1995. TOPYDE: A tool for physical database design. In Proceedings of the International Conference on Database and Expert Systems Applications. Springer, 502–511.
[60]
Mohamed L. Chouder, Stefano Rizzi, and Rachid Chalal. 2019. EXODuS: Exploratory OLAP over document stores. Inf. Syst. 79 (2019), 44–57.
[61]
Isabelle Comyn-Wattiau and Jacky Akoka. 2017. Model driven reverse engineering of NoSQL property graph databases: The case of Neo4j. In Proceedings of the BigData. IEEE, 453–458.
[62]
Pavel Čontoš and Martin Svoboda. 2020. JSON schema inference approaches. In Proceedings of the ER. Springer, 173–183.
[63]
Christopher E. Dabrowski, David K. Jefferson, John V. Carlis, and Salvatore T. March. 1989. Integrating a knowledge-based component into a physical database design system. Inf. Manag. 17, 2 (1989), 71–86.
[64]
Benoît Dageville, Dinesh Das, Karl Dias, Khaled Yagoub, Mohamed Zaït, and Mohamed Ziauddin. 2004. Automatic SQL tuning in oracle 10g. In Proceedings of the VLDB. 1098–1109.
[65]
Gwendal Daniel, Abel Gómez, and Jordi Cabot. 2019. UMLto[No]SQL: Mapping conceptual schemas to heterogeneous datastores. In Proceedings of the RCIS. IEEE, 1–13.
[66]
Gwendal Daniel, Gerson Sunyé, and Jordi Cabot. 2016. UMLtoGraphDB: Mapping conceptual schemas to graph databases. In Proceedings of the ER. Springer, 430–444.
[67]
Sudipto Das, Miroslav Grbic, Igor Ilic, Isidora Jovandic, Andrija Jovanovic, Vivek R. Narasayya, Miodrag Radulovic, Maja Stikic, Gaoxiang Xu, and Surajit Chaudhuri. 2019. Automatically indexing millions of databases in microsoft azure SQL database. In Proceedings of the SIGMOD. ACM, 666–679.
[68]
Debabrata Dash, Neoklis Polyzotis, and Anastasia Ailamaki. 2011. Cophy: A scalable, portable, and interactive index advisor for large workloads. Proc. VLDB Endow. 4, 6 (2011), 362–372.
[69]
Ali Davoudian, Liu Chen, and Mengchi Liu. 2018. A survey on NoSQL stores. ACM Computing Surveys 51, 2 (2018), 1–43.
[70]
Alfonso de la Vega, Diego García-Saiz, Carlos Blanco, Marta E. Zorrilla, and Pablo Sánchez. 2018. Mortadelo: A model-driven framework for NoSQL database design. In Proceedings of the MEDI. Springer, 41–57.
[71]
Claudio de Lima and Ronaldo dos Santos Mello. 2015. A workload-driven logical design approach for NoSQL document databases. In Proceedings of the iiWAS. ACM, 73:1–73:10.
[72]
Khaled Dehdouh, Fadila Bentayeb, Omar Boussaid, and Nadia Kabachi. 2015. Using the column oriented NoSQL model for implementing big data warehouses. In Proceedings of the PDPTA’15. CSREA Press, 469–475.
[73]
Francesco Di Tria, Ezio Lefons, and Filippo Tangorra. 2014. Design process for big data warehouses. In Proceedings of the 2014 International Conference on Data Science and Advanced Analytics. IEEE, 512–518.
[74]
Markus Dreseler, Jan Kossmann, Martin Boissier, Stefan Klauck, Matthias Uflacker, and Hasso Plattner. 2019. Hyrise re-engineered: An extensible database system for research in relational in-memory data management. In Proceedings of the EDBT. 313–324.
[75]
Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magdalena Balazinska, Bill Howe, Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, and Stanley B. Zdonik. 2015. The BigDAWG polystore system. SIGMOD Rec. 44, 2 (2015), 11–16.
[76]
Gabriel Campero Durand, Marcus Pinnecke, Rufat Piriyev, Mahmoud Mohsen, David Broneske, Gunter Saake, Maya S. Sekeran, Fabián Rodriguez, and Laxmi Balami. 2018. GridFormation: Towards self-driven online data partitioning using reinforcement learning. In Proceedings of the aiDM@SIGMOD. ACM, 1:1–1:7.
[77]
Said Elnaffar, Wendy Powley, Darcy Benoit, and Pat Martin. 2003. Today’s DBMSs: How autonomic are they? In Proceedings of the DEXA Workshops. IEEE, 651–655. https://ieeexplore.ieee.org/abstract/document/1232095/
[78]
Bernd-Juergen Falkowski. 1992. Comments on an optimal set of indices for a relational database. IEEE Transactions on Software Engineering 18, 2 (1992), 168.
[79]
Mina H. Farid, Alexandra Roatis, Ihab F. Ilyas, Hella-Franziska Hoffmann, and Xu Chu. 2016. CLAMS: Bringing quality to data lakes. In Proceedings of the SIGMOD. ACM, 2089–2092.
[80]
Yishai A. Feldman and Jacob Reouven. 2003. A knowledge-based approach for index selection in relational databases. Expert Systems with Applications 25, 1 (2003), 15–37.
[81]
Sheldon Finkelstein, Mario Schkolnick, and Paolo Tiberio. 1988. Physical database design for relational databases. ACM Transactions on Database Systems 13, 1 (1988), 91–128.
[82]
Martin R. Frank, Edward R. Omiecinski, and Shamkant B. Navathe. 1992. Adaptive and automated index selection in RDBMS. In Proceedings of the International Conference on Extending Database Technology. Springer, 277–292.
[83]
Angelo Augusto Frozza, Eduardo Dias Defreyn, and Ronaldo dos Santos Mello. 2021. An approach for schema extraction of NoSQL columnar databases: The HBase case study. J. Inf. Data Manag. 12, 5 (2021), 384–395.
[84]
Angelo Augusto Frozza, Ronaldo dos Santos Mello, and Felipe de Souza da Costa. 2018. An approach for schema extraction of JSON and extended JSON document collections. In Proceedings of the IRI. IEEE, 356–363.
[85]
Angelo Augusto Frozza, Salomão Rodrigues Jacinto, and Ronaldo dos Santos Mello. 2020. An approach for schema extraction of NoSQL graph databases. In Proceedings of the IRI. IEEE, 271–278.
[86]
Enrico Gallinucci, Matteo Golfarelli, and Stefano Rizzi. 2019. Approximate OLAP of document-oriented databases: A variety-aware approach. Inf. Syst. 85 (2019), 114–130.
[87]
Christian Garcia-Arellano, Sam Lightstone, Guy M. Lohman, Volker Markl, and Adam J. Storm. 2006. Autonomic features of the IBM DB2 universal database for linux, UNIX, and windows. IEEE Trans. Syst. Man Cybern. Syst. 36, 3 (2006), 365–376.
[88]
Michael Gibas, Guadalupe Canahuate, and Hakan Ferhatosmanoglu. 2008. Online index recommendations for high-dimensional databases using query workloads. IEEE Trans. Knowl. Data Eng. 20, 2 (2008), 246–260.
[89]
Maxime Gobert, Loup Meurice, and Anthony Cleve. 2021. Conceptual modeling of hybrid polystores. In Proceedings of the ER. Springer, 113–122.
[90]
Goetz Graefe, Felix Halim, Stratos Idreos, Harumi Kuno, Stefan Manegold, and Bernhard Seeger. 2014. Transactional support for adaptive indexing. The VLDB Journal 23, 2 (2014), 303–328.
[91]
Goetz Graefe, Stratos Idreos, Harumi A. Kuno, and Stefan Manegold. 2010. Benchmarking adaptive indexing. In Proceedings of the TPCTC(LNCS, Vol. 6417). Springer, 169–184.
[92]
Goetz Graefe and Harumi Kuno. 2010. Adaptive indexing for relational keys. In Proceedings of the ICDEW. IEEE, 69–74.
[93]
Goetz Graefe and Harumi A. Kuno. 2010. Self-selecting, self-tuning, incrementally optimized indexes. In Proceedings of the EDBT(ACM International Conference Proceeding Series). ACM, 371–381.
[94]
Katarina Grolinger and Miriam AM Capretz. 2012. Autonomic database management: State of the art and future trends. In Proceedings of the CATA. 276–281. https://ir.lib.uwo.ca/electricalpub/23/
[95]
T. I. Gündem. 1999. Near optimal multiple choice index selection for relational databases. Computers & Mathematics with Applications 37, 2 (1999), 111–120.
[96]
Himanshu Gupta. 1997. Selection of views to materialize in a data warehouse. In Proceedings of the ICDT. Springer, 98–112.
[97]
Himanshu Gupta, Venky Harinarayan, Anand Rajaraman, and Jeffrey D. Ullman. 1997. Index selection for OLAP. In Proceedings of the ICDE. IEEE, 208–219.
[98]
Himanshu Gupta and Inderpal Singh Mumick. 2005. Selection of views to materialize in a data warehouse. IEEE Transactions on Knowledge and Data Engineering 17, 1 (2005), 24–43.
[99]
Rihan Hai, Christoph Quix, and Dan Wang. 2019. Relaxed functional dependency discovery in heterogeneous data lakes. In Proceedings of the ER(LNCS, Vol. 11788). Springer, 225–239.
[100]
Felix Halim, Stratos Idreos, Panagiotis Karras, and Roland H. C. Yap. 2012. Stochastic database cracking: Towards robust adaptive indexing in main-memory column-stores. Proc. VLDB Endow. 5, 6 (2012), 502–513.
[101]
Hamdi Ben Hamadou, Enrico Gallinucci, and Matteo Golfarelli. 2019. Answering GPSJ queries in a polystore: A dataspace-based approach. In Proceedings of the ER(LNCS, Vol. 11788). Springer, 189–203.
[102]
Michael Hammer and Arvola Chan. 1976. Index selection in a self-adaptive data base management system. In Proceedings of the SIGMOD. 1–8.
[103]
Michael Hammer and Bahram Niamir. 1979. A heuristic approach to attribute partitioning. In Proceedings of the SIGMOD. 93–101.
[104]
Shady Hamouda and Zurinahni Zainol. 2017. Document-oriented data schema for relational database migration to NoSQL. In Proceedings of the Innovate-Data. IEEE, 43–50.
[105]
Venky Harinarayan, Anand Rajaraman, and Jeffrey D. Ullman. 1996. Implementing data cubes efficiently. In Proceedings of the SIGMOD. ACM Press, 205–216.
[106]
Olaf Hartig and Jan Hidders. 2019. Defining schemas for property graphs by using the GraphQL schema definition language. In Proceedings of the GRADES/NDA@SIGMOD/PODS. ACM, 6:1–6:11.
[107]
Michael Hatzopoulos and J. Kollias. 1985. On the selection of a reduced set of indexes. Comput. J. 28, 4 (1985), 406–408.
[108]
Michael Hatzopoulos and John G. Kollias. 1984. On the optimal selection of multilist database structures. IEEE Transactions on Software Engineering 10, 6 (1984), 681–687.
[109]
Victor Herrero, Alberto Abelló, and Oscar Romero. 2016. NOSQL design for analytical workloads: Variability matters. In Proceedings of the ER(LNCS, Vol. 9974). 50–64.
[110]
Pedro Holanda, Mark Raasveldt, Stefan Manegold, and Hannes Mühleisen. 2019. Progressive indexes: Indexing for interactive data analysis. Proceedings of the VLDB Endowment 12, 13 (2019), 2366–2378.
[111]
Marc Holze. 2012. Self-Management Concepts for Relational Database Systems. Ph. D. Dissertation. University of Hamburg.
[112]
Marc Holze and Norbert Ritter. 2011. System models for goal-driven self-management in autonomic databases. Data & Knowledge Engineering 70, 8 (2011), 685–701.
[113]
Stratos Idreos, Martin L. Kersten, and Stefan Manegold. 2007. Database cracking. In Proceedings of the CIDR. 68–78. www.cidrdb.org.
[114]
Stratos Idreos, Martin L. Kersten, and Stefan Manegold. 2007. Updating a cracked database. In Proceedings of the SIGMOD. ACM, 413–424.
[115]
Stratos Idreos, Stefan Manegold, Harumi Kuno, and Goetz Graefe. 2011. Merging what’s cracked, cracking what’s merged: Adaptive indexing in main-memory column-stores. Proceedings of the VLDB Endowment 4, 9 (2011), 586–597.
[116]
Abdullahi Abubakar Imam, Shuib Basri, Rohiza Ahmad, Amirudin A. Wahab, María T. González-Aparicio, Luiz Fernando Capretz, Ammar K. Alazzawi, and Abdullateef O. Balogun. 2020. Dsp: Schema design for non-relational applications. Symmetry 12, 11 (2020), 1799.
[117]
Abdullahi Abubakar Imam, Shuib Basri, Rohiza Ahmad, Junzu Watada, and María T. González-Aparicio. 2018. Automatic schema suggestion model for NoSQL document-stores databases. Journal of Big Data 5, 1 (2018), 1–17.
[118]
Abdullahi Abubakar Imam, Shuib Basri, Rohiza Ahmad, Junzo Watada, María Teresa González Aparicio, and Malek Ahmad Almomani. 2018. Data modeling guidelines for NoSQL document-store databases. International Journal of Advanced Computer Science and Applications 9, 10 (2018), 544–555.
[119]
Maggie Yuet Lin Ip. 1981. An Approximation Algorithm for the Index Selection Problem. University of Regina.
[120]
Maggie Y. L. Ip, Lawrence V. Saxton, and Vijay V. Raghavan. 1983. On the selection of an optimal set of indexes. IEEE Transactions on Software Engineering 9, 2 (1983), 135–143.
[121]
Javier Luis Cánovas Izquierdo and Jordi Cabot. 2016. JSONDiscoverer: Visualizing the schema lurking behind JSON documents. Knowledge-Based Systems 103 (2016), 52–55.
[122]
Alekh Jindal and Jens Dittrich. 2011. Relax and let the database do the partitioning online. In Proceedings of the BIRTE(Lecture Notes in Business Information Processing, Vol. 126). Springer, 65–80.
[123]
Marcos Jota, Marlene Goncalves, and Ritces Parra. 2019. A physical design strategy on a NoSQL DBMS. In Proceedings of the Data Science. CRC Press, 69–93.
[124]
Donghe Kang, Ruochen Jiang, and Spyros Blanas. 2021. Jigsaw: A data storage and query processing engine for irregular table partitioning. In Proceedings of the SIGMOD. 898–911.
[125]
Karamjit Kaur and Rinkle Rani. 2013. Modeling and querying data in NoSQL databases. In Proceedings of the BigData. IEEE, 1–7.
[126]
Hideaki Kimura, George Huo, Alexander Rasin, Samuel Madden, and Stanley B. Zdonik. 2010. CORADD: Correlation aware database designer for materialized views and indexes. Proc. VLDB Endow. 3, 1 (2010), 1103–1113.
[127]
Hideaki Kimura, Vivek R. Narasayya, and Manoj Syamala. 2011. Compression aware physical database design. PVLDB 4, 10 (2011), 657–668.
[128]
W. F. King. 1974. On the Selection of Indices for a File. Research report RJ 1341. IBM, San Jose, CA.
[129]
Barbara Kitchenham, Pearl Brereton, David Budgen, Mark Turner, John Bailey, and Stephen Linkman. 2009. Systematic literature reviews in software engineering–a systematic literature review. Inf. Softw. Technol. 51, 1 (2009), 7–15.
[130]
B. Kitchenham and S. Charters. 2007. Guidelines for performing Systematic Literature Reviews in Software Engineering. Technical report, Ver. 2.3. School of Computer Science and Mathematics, Keele University, Keele, U.K. 57 pages.
[131]
Meike Klettke, Uta Störl, and Stefanie Scherzinger. 2015. Schema extraction and structural outlier detection for JSON-based NoSQL data stores. In Proceedings of the BTW(LNI, Vol. P-241). GI, 425–444.
[132]
Boyan Kolev, Patrick Valduriez, Carlyna Bondiombouy, Ricardo Jiménez-Peris, Raquel Pau, and José Pereira. 2016. CloudMdsQL: Querying heterogeneous cloud data stores with a common language. Distributed and Parallel Databases 34, 4 (2016), 463–503.
[133]
N. Korla. 2013. Cassandra data modeling: Practical considerations @ Netflix. Retrieved October 22, 2019 from http://www.slideshare.net/nkorla1share/cass-summit-3
[134]
Jan Kossmann, Stefan Halfpap, Marcel Jankrift, and Rainer Schlosser. 2020. Magic mirror in my hand, which is the best in the land? An experimental evaluation of index selection algorithms. Proc. VLDB Endow. 13, 11 (2020), 2382–2395.
[135]
Jan Kossmann and Rainer Schlosser. 2019. A framework for self-managing database systems. In Proceedings of the ICDEW. IEEE, 100–106.
[136]
Jan Kossmann and Rainer Schlosser. 2020. Self-driving database systems: A conceptual approach. Distributed and Parallel Databases 38, 4 (2020), 795–817.
[137]
Jozef Kratica, Ivana Ljubić, and Dušan Tošić. 2003. A genetic algorithm for the index selection problem. In Workshops on Applications of Evolutionary Computation. Springer, 280–290.
[138]
Darl Kuhn, Sam R. Alapati, and Bill Padfield. 2012. Sql access advisor. In Proceedings of the Expert Indexing in Oracle Database 11g. Springer, 233–248.
[139]
Wilburt Juan Labio, Dallan Quass, and Brad Adelberg. 1997. Physical database design for data warehouses. In Proceedings of the ICDE. IEEE, 277–288.
[140]
Hai Lan, Zhifeng Bao, and Yuwei Peng. 2020. An index advisor using deep reinforcement learning. In Proceedings of the CIKM. 2105–2108.
[141]
Hanâ Lbath, Angela Bonifati, and Russ Harmer. 2021. Schema inference for property graphs. In Proceedings of the EDBT. 499–504.
[142]
Éric Leclercq, Annabelle Gillet, Thierry Grison, and Marinette Savonnet. 2019. Polystore and tensor data model for logical data independence and impedance mismatch in big data analytics. Trans. Large Scale Data Knowl. Centered Syst. 42 (2019), 51–90.
[143]
Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How good are query optimizers, really? Proceedings of the VLDB Endowment 9, 3 (2015), 204–215.
[144]
Chongxin Li. 2010. Transforming relational database into HBase: A case study. In Proceedings of the ICSESS. IEEE, 683–687.
[145]
Feifei Li. 2019. Cloud-native database systems at alibaba: Opportunities and challenges. Proceedings of the VLDB Endowment 12, 12 (2019), 2263–2272.
[146]
Guoliang Li, Xuanhe Zhou, Ji Sun, Xiang Yu, Yue Han, Lianyuan Jin, Wenbo Li, Tianqing Wang, and Shifu Li. 2021. opengauss: An autonomous database system. Proceedings of the VLDB Endowment 14, 12 (2021), 3028–3042.
[147]
Liangzhe Li and Le Gruenwald. 2013. Self-managing online partitioner for databases (SMOPD): A vertical database partitioning system with a fully automatic online approach. In Proceedings of the IDEAS. ACM, 168–173.
[148]
Xiang Li, Zhiyi Ma, and Hongjie Chen. 2014. QODM: A query-oriented data modeling approach for NoSQL databases. In Proceedings of the 2014 IEEE Workshop on Advanced Research and Technology in Industry Applications. IEEE, 338–345.
[149]
Yan Li, Ping Gu, and Chao Zhang. 2014. Transforming UML class diagrams into HBase based on meta-model. In Proceedings of the 2014 International Conference on Information Science, Electronics and Electrical Engineering. IEEE, 720–724.
[150]
Xi Liang, Aaron J. Elmore, and Sanjay Krishnan. 2019. Opportunistic view materialization with deep reinforcement learning. CoRR abs/1903.01363 (2019).
[151]
Sam Lightstone and Bishwaranjan Bhattacharjee. 2004. Automated design of multidimensional clustering tables for relational databases. In Proceedings of the VLDB. 1170–1181.
[152]
Sam S. Lightstone, Toby J. Teorey, and Tom Nadeau. 2010. Physical Database Design: The Database Professional’s Guide to Exploiting Indexes, Views, Storage, and More. Morgan Kaufmann.
[153]
JH Liou and S. Bing Yao. 1977. Multi-dimensional clustering for data base organizations. Inf. Syst. 2, 4 (1977), 187–198.
[154]
Zhen Hua Liu and Dieter Gawlick. 2015. Management of flexible schema data in RDBMSs - opportunities and limitations for NoSQL -. In Proceedings of the CIDR. www.cidrdb.org.
[155]
Jiaheng Lu and Irena Holubová. 2019. Multi-model databases: A new journey to handle the variety of data. ACM Computing Surveys 52, 3 (2019), 1–38.
[156]
Lin Ma, Dana Van Aken, Ahmed Hefny, Gustavo Mezerhane, Andrew Pavlo, and Geoffrey J. Gordon. 2018. Query-based workload forecasting for self-driving database management systems. In Proceedings of the SIGMOD. ACM, 631–645.
[157]
Dario Maio, Claudio Sartori, and Maria Rita Scalas. 1987. A modular user-oriented decision support for physical database design. Decision Support Systems 3, 2 (1987), 155–163.
[158]
Jihane Mali, Faten Atigui, Ahmed Azough, and Nicolas Travers. 2020. ModelDrivenGuide: An approach for implementing NoSQL schemas. In Proceedings of the DEXA(LNCS, Vol. 12391). Springer, 141–151.
[159]
Volker Markl, Frank Ramsak, and Rudolf Bayer. 1999. Improving OLAP performance by multidimensional hierarchical clustering. In Proceedings of the IDEAS. IEEE Computer Society, 165–177.
[160]
Abdul Mateen, Basit Raza, Muhammad Sher, Mian Muhammad Awais, and Tauqeer Hussain. 2010. Evolution of autonomic database management systems. In Proceedings of the ICCAE. IEEE, 33–37.
[161]
Ines Ben Messaoud, Refka Ben Ali, and Jamel Feki. 2017. From document warehouse to column-oriented NoSQL document warehouse. In Proceedings of the ICSOFT. SciTePress, 85–94.
[162]
Ines Ben Messaoud, Abdulrahman A. Alshdadi, and Jamel Feki. 2021. Building a document-oriented warehouse using NoSQL. Int. J. Oper. Res. Inf. Syst. 12, 2 (2021), 33–54.
[163]
Michael J. Mior. 2018. Physical Design for Non-relational Data Systems. Ph. D. Dissertation. University of Waterloo.
[164]
Michael J. Mior, Kenneth Salem, Ashraf Aboulnaga, and Rui Liu. 2017. NoSE: Schema design for NoSQL applications. IEEE Trans. Knowl. Data Eng. 29, 10 (2017), 2275–2289.
[165]
Chuang-Hue Moh, Ee-Peng Lim, and Wee-Keong Ng. 2000. DTD-miner: A tool for mining DTD from XML documents. In Proceedings of the WECWIS. IEEE, 144–151.
[166]
Maryam Mozaffari, Eslam Nazemi, and Amir-Masoud Eftekhari-Moghadam. 2020. Feedback control loop design for workload change detection in self-tuning NoSQL wide column stores. Expert Syst. Appl. 142 (2020), 112973.
[167]
Maryam Mozaffari, Eslam Nazemi, and Amir-Masoud Eftekhari-Moghadam. 2021. CONST: Continuous online NoSQL schema tuning. Softw. Pract. Exp. 51, 5 (2021), 1147–1169.
[168]
Pablo D. Muñoz-Sánchez, Carlos Javier Fernández Candel, Jesús García Molina, and Diego Sevilla Ruiz. 2020. Managing physical schemas in MongoDB stores. In Proceedings of the ER (Workshops). Springer, 162–172.
[169]
Shamkant B. Navathe, Stefano Ceri, Gio Wiederhold, and Jinglie Dou. 1984. Vertical partitioning algorithms for database design. ACM Trans. Database Syst. 9, 4 (1984), 680–710.
[170]
Shamkant B. Navathe and Minyoung Ra. 1989. Vertical partitioning for database design: A graphical algorithm. In Proceedings of the SIGMOD. ACM, 440–450.
[171]
Rimma Nehme and Nicolas Bruno. 2011. Automated partitioning design in parallel database systems. In Proceedings of the SIGMOD. 1137–1148.
[172]
Priscilla Neuhaus, Julia Couto, Jonatas Wehrmann, Duncan Dubugras Alcoba Ruiz, and Felipe Meneguzzi. 2019. GADIS: A genetic algorithm for database index selection (S). In Proceedings of the SEKE. KSI, 39–54.
[173]
Viviana Noguera and Daniel Lucrédio. 2019. Implementing a classic ER algebra to automatically generate complex queries for document-oriented databases. In Proceedings of the SBCARS. ACM, 43–52.
[174]
Harley Vera Olivera, Maristela Holanda, Valeria Guimarâes, Fernanda Hondo, and Wagner Boaventura. 2015. Data modeling for NoSQL document-oriented databases. In Proceedings of the SIMBig. CEUR-WS.org, 129–135.
[175]
openGauss. 2020. openGauss database system. Retrieved March 21, 2021 from https://github.com/opengauss-mirror/
[176]
Oracle. 2020. Oracle autonomous database. Retrieved April 20, 2021 from https://docs.oracle.com/en/cloud/paas/autonomous-database/
[177]
Rafik Ouanouki, Alain April, Alain Abran, Abraham Gomez, and Jean-Marc Desharnais. 2017. Toward building RDB to HBase conversion rules. Journal of Big Data 4, 1 (2017), 1–21.
[178]
F. Palermo. 1970. A quantitative approach to the selection of secondary indexes. Research report RJ 730. IBM, San Jose, CA.
[179]
Gabriel Paludo Licks, Julia Colleoni Couto, Priscilla de Fátima Miehe, Renata De Paris, Duncan Dubugras Ruiz, and Felipe Meneguzzi. 2020. SMARTIX: A database indexing agent based on reinforcement learning. Applied Intelligence 50, 8 (2020), 2575–2588.
[180]
Stratos Papadomanolakis and Anastassia Ailamaki. 2007. An integer linear programming approach to database design. In Proceedings of the ICDE Workshops. IEEE Computer Society, 442–449.
[181]
Stratos Papadomanolakis, Debabrata Dash, and Anastassia Ailamaki. 2007. Efficient use of the query optimizer for automated database design. In Proceedings of the VLDB. ACM, 1093–1104.
[182]
Thorsten Papenbrock and Felix Naumann. 2016. A hybrid approach to functional dependency discovery. In Proceedings of the SIGMOD. ACM, 821–833.
[183]
J. Patel. 2012. Cassandra data modeling best practices. Retrieved October 22, 2019 from http://ebaytechblog.com/?p=1308
[184]
Andrew Pavlo, Gustavo Angulo, Joy Arulraj, Haibin Lin, Jiexi Lin, Lin Ma, Prashanth Menon, Todd C. Mowry, Matthew Perron, Ian Quah, Siddharth Santurkar, Anthony Tomasic, Skye Toor, Dana Van Aken, Ziqi Wang, Yingjun Wu, Ran Xian, and Tieying Zhang. 2017. Self-driving database management systems. In Proceedings of the CIDR. 1.
[185]
Andy Pavlo, Matthew Butrovich, Lin Ma, Prashanth Menon, Wan Shen Lim, Dana Van Aken, and William Zhang. 2021. Make your database system dream of electric sheep: Towards self-driving operation. Proc. VLDB Endow. 14, 12 (2021), 3211–3221.
[186]
Andrew Pavlo, Carlo Curino, and Stanley B. Zdonik. 2012. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In Proceedings of the SIGMOD. ACM, 61–72.
[187]
Wendel Góes Pedrozo, Júlio César Nievola, and Deborah Carvalho Ribeiro. 2018. An adaptive approach for index tuning with learning classifier systems on hybrid storage environments. In Proceedings of the HAIS. Springer, 716–729.
[188]
Gregory Piatetsky-Shapiro. 1983. The optimal selection of secondary indices is NP-complete. ACM SIGMOD Record 13, 2 (1983), 72–75.
[189]
Holger Pirk, Eleni Petraki, Stratos Idreos, Stefan Manegold, and Martin L. Kersten. 2014. Database cracking: Fancy scan, not poor man’s sort! In Proceedings of the DaMoN. ACM, 4:1–4:8.
[190]
Nataliya Prokoshyna, Jaroslaw Szlichta, Fei Chiang, Renée J. Miller, and Divesh Srivastava. 2015. Combining quantitative and logical data cleaning. Proc. VLDB Endow. 9, 4 (2015), 300–311.
[191]
Jun Rao, Chun Zhang, Nimrod Megiddo, and Guy M. Lohman. 2002. Automating physical database design in a parallel database. In Proceedings of the SIGMOD. ACM, 558–569.
[192]
Alexander Rasin and Stanley B. Zdonik. 2013. An automatic physical design tool for clustered column-stores. In Proceedings of the EDBT. ACM, 203–214.
[193]
Keven Richly, Rainer Schlosser, and Martin Boissier. 2021. Joint index, sorting, and compression optimization for memory-efficient spatio-temporal data management. In Proceedings of the ICDE. IEEE, 1901–1906.
[194]
Noa Roy-Hubara and Arnon Sturm. 2020. Design methods for the new database era: A systematic literature review. Software and Systems Modeling 19, 2 (2020), 297–312.
[195]
Steve Rozen and Dennis E. Shasha. 1991. A framework for automating physical database design. In Proceedings of the VLDB. 401–411.
[196]
Diego Sevilla Ruiz, Severino Feliciano Morales, and Jesús García Molina. 2015. Inferring versioned schemas from NoSQL databases and its applications. In Proceedings of the ER (LNCS, Vol. 9381). Springer, 467–480.
[197]
Zahra Sadri, Le Gruenwald, and Eleazar Leal. 2020. Online index selection using deep reinforcement learning for a cluster database. In Proceedings of the ICDEW. IEEE, 158–161.
[198]
Kai-Uwe Sattler, Ingolf Geist, and Eike Schallehn. 2003. QUIET: Continuous query-driven index tuning. In Proceedings of the VLDB. 1129–1132.
[199]
Lucas C. Scabora, Jaqueline Joice Brito, Ricardo Rodrigues Ciferri, and Cristina Dutra de Aguiar Ciferri. 2016. Physical data warehouse design on NoSQL databases - OLAP query processing over HBase. In Proceedings of the ICEIS. SciTePress, 111–118.
[200]
Eike Schallehn. 2010. Database tuning and self-tuning. Technical report. Faculty of Computer Science, OvG University Magdeburg, Germany. 130 pages.
[201]
Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Bießmann, and Andreas Grafberger. 2018. Automating large-scale data quality verification. Proc. VLDB Endow. 11, 12 (2018), 1781–1794.
[202]
Mario Schkolnick. 1975. The optimal selection of secondary indices for files. Inf. Syst. 1, 4 (1975), 141–146.
[203]
Rainer Schlosser, Jan Kossmann, and Martin Boissier. 2019. Efficient scalable multi-attribute index selection using recursive strategies. In Proceedings of the ICDE. IEEE, 1238–1249.
[204]
Karl Schnaitter, Serge Abiteboul, Tova Milo, and Neoklis Polyzotis. 2006. COLT: Continuous on-line tuning. In Proceedings of the SIGMOD. 793–795.
[205]
Karl Schnaitter and Neoklis Polyzotis. 2012. Semi-automatic index tuning: Keeping DBAs in the loop. Proc. VLDB Endow. 5, 5 (2012), 478–489.
[206]
Felix Martin Schuhknecht, Jens Dittrich, and Laurent Linden. 2018. Adaptive adaptive indexing. In Proceedings of the ICDE. 665–676.
[207]
Felix Martin Schuhknecht, Alekh Jindal, and Jens Dittrich. 2013. The uncracked pieces in database cracking. Proceedings of the VLDB Endowment 7, 2 (2013), 97–108.
[208]
Felix Martin Schuhknecht, Alekh Jindal, and Jens Dittrich. 2016. An experimental evaluation and analysis of database cracking. The VLDB Journal 25, 1 (2016), 27–52.
[209]
Amal Sellami, Ahlem Nabli, and Faiez Gargouri. 2018. Transformation of data warehouse schema to NoSQL graph data base. In Proceedings of the International Conference on Intelligent Systems Design and Applications. Springer, 410–420.
[210]
Ankur Sharma, Felix Martin Schuhknecht, and Jens Dittrich. 2018. The case for automatic database administration using deep reinforcement learning. CoRR abs/1801.05643 (2018).
[211]
Amit Shukla, Prasad Deshpande, and Jeffrey F. Naughton. 1998. Materialized view selection for multidimensional datasets. In Proceedings of the VLDB. 488–499.
[212]
Georgios Spanos and Lefteris Angelis. 2016. The impact of information security events to the stock market: A systematic literature review. Comput. Secur. 58 (2016), 216–229.
[213]
Dave Steinberg, Frank Budinsky, Ed Merks, and Marcelo Paternostro. 2008. EMF: Eclipse Modeling Framework. Pearson Education.
[214]
Michael Stonebraker. 1974. The choice of partial inversions and combined indices. Int. J. Parallel Program. 3, 2 (1974), 167–188.
[215]
Michael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth J. O’Neil, Patrick E. O’Neil, Alex Rasin, Nga Tran, and Stanley B. Zdonik. 2005. C-Store: A column-oriented DBMS. In Proceedings of the VLDB. ACM, 553–564.
[216]
Uta Störl, Meike Klettke, and Stefanie Scherzinger. 2020. NoSQL schema evolution and data migration: State-of-the-art and opportunities. In Proceedings of the EDBT. OpenProceedings.org, 655–658.
[217]
Dimitri Theodoratos and Timos K. Sellis. 1997. Data warehouse configuration. In Proceedings of the VLDB. 126–135.
[218]
Tamás Vajk, Péter Fehér, Krisztián Fekete, and Hassan Charaf. 2013. Denormalizing data into schema-free databases. In Proceedings of the 2013 IEEE 4th International Conference on Cognitive Infocommunications. IEEE, 747–752.
[219]
Gary Valentin, Michael Zuliani, Daniel C. Zilio, Guy M. Lohman, and Alan Skelley. 2000. DB2 advisor: An optimizer smart enough to recommend its own indexes. In Proceedings of the ICDE. IEEE Computer Society, 101–110.
[220]
Harley Vera-Olivera, Ruizhe Guo, Ruben Cruz Huacarpuma, Ana Paula Bernardi Da Silva, Ari Melo Mariano, and Maristela Holanda. 2021. Data modeling and NoSQL databases-a systematic mapping review. ACM Computing Surveys 54, 6 (2021), 1–26.
[221]
Lanjun Wang, Oktie Hassanzadeh, Shuo Zhang, Juwei Shi, Limei Jiao, Jia Zou, and Chen Wang. 2015. Schema management for document stores. Proc. VLDB Endow. 8, 9 (2015), 922–933.
[222]
Gerhard Weikum, Christof Hasse, Axel Mönkeberg, and Peter Zabback. 1994. The COMFORT automatic tuning project. Inf. Syst. 19, 5 (1994), 381–432.
[223]
Gerhard Weikum, Axel Mönkeberg, Christof Hasse, and Peter Zabback. 2002. Self-tuning database technology and information services: from wishful thinking to viable engineering. In Proceedings of the VLDB. 20–31.
[224]
Kyu-Young Whang. 1985. Index selection in relational databases. In Proceedings of the FODO. Plenum Press, New York, 487–500.
[225]
Kyu-Young Whang. 1983. A Physical Database Design Methodology Using the Property of Separability. Ph. D. Dissertation. Stanford University, United States.
[226]
Kyu-Young Whang. 1984. Separability: An approach to physical database design. IEEE Trans. Computers 100, 3 (1984), 209–222.
[227]
David Wiese, Gennadi Rabinovitch, Michael Reichert, and Stephan Arenswald. 2008. Autonomic tuning expert: A framework for best-practice oriented autonomic database tuning. In Proceedings of the CASCON. IBM, 3.
[228]
Fangzhou Yang, Dragan Milosevic, and Jian Cao. 2017. Optimising column family for OLAP queries in HBase. International Journal of Big Data Intelligence 4, 1 (2017), 23–35.
[229]
Jian Yang, Kamalakar Karlapalem, and Qing Li. 1997. Algorithms for materialized view design in data warehousing environment. In Proceedings of the VLDB. 136–145.
[230]
Rania Yangui, Ahlem Nabli, and Faïez Gargouri. 2016. Automatic transformation of data warehouse schema to NoSQL data base: Comparative study. In Proceedings of the KES (Procedia Computer Science, Vol. 96). Elsevier, 255–264.
[231]
Haitao Yuan, Guoliang Li, Ling Feng, Ji Sun, and Yue Han. 2020. Automatic view generation with deep learning and reinforcement learning. In Proceedings of the ICDE. IEEE, 1501–1512.
[232]
Chuan Zhang and Jian Yang. 1999. Genetic algorithm for materialized view selection in data warehouse environments. In Proceedings of the DaWaK. Springer, 116–125.
[233]
Chuan Zhang, Xin Yao, and Jian Yang. 2001. An evolutionary approach to materialized views selection in a data warehouse environment. IEEE Trans. Syst. Man Cybern. Part C 31, 3 (2001), 282–294.
[234]
Daniel C. Zilio. 1998. Physical Database Design Decision Algorithms and Concurrent Reorganization for Parallel Database Systems. Ph. D. Dissertation. University of Toronto, Canada.
[235]
Daniel C. Zilio, Anant Jhingran, and Sriram Padmanabhan. 1994. Partitioning Key Selection for a Shared-nothing Parallel Database System. IBM TJ Watson Research Center.
[236]
Daniel C. Zilio, Jun Rao, Sam Lightstone, Guy M. Lohman, Adam J. Storm, Christian Garcia-Arellano, and Scott Fadden. 2004. DB2 design advisor: Integrated automatic physical database design. In Proceedings of the VLDB. 1087–1097.
[237]
Daniel C. Zilio, Calisto Zuzarte, Sam Lightstone, Wenbin Ma, Guy M. Lohman, Roberta Cochrane, Hamid Pirahesh, Latha S. Colby, Jarek Gryz, Eric Alton, Dongming Liang, and Gary Valentin. 2004. Recommending materialized views and indexes with IBM DB2 design advisor. In Proceedings of the ICAC. IEEE Computer Society, 180–188.

    Published In

    ACM Computing Surveys  Volume 56, Issue 11
    November 2024
    977 pages
EISSN: 1557-7341
DOI: 10.1145/3613686
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 29 June 2024
    Online AM: 17 May 2024
    Accepted: 08 May 2024
    Revised: 08 March 2024
    Received: 22 May 2023
    Published in CSUR Volume 56, Issue 11

    Author Tags

    1. Data modeling
    2. schema design and tuning
    3. self-tuning databases
    4. NoSQL databases
    5. SQL databases

    Qualifiers

    • Survey

    Funding Sources

    • National Recovery and Resilience Plan (NRRP)
    • Italian Ministry of University and Research
    • European Union-NextGenerationEU
    • Italian Ministry of University and Research, CUP
