This section provides a survey of research on automated physical design in RDBMSs. Since the AutoAdmin project at Microsoft Research developed the first automated physical design tool for a commercial DBMS in 1998, we review the work in three categories: work prior to the AutoAdmin project, the evolution of the AutoAdmin project, and related work following the AutoAdmin project.
5.3.2 History of the AutoAdmin Project.
The AutoAdmin project started in 1996 at Microsoft Research with the aim of making RDBMSs self-tuning. A primary focus was automating physical database design. In 1998, the first physical database design tool was released, called
Index Tuning Wizard (
ITW), which shipped with Microsoft SQL Server 7.0 and was based on techniques presented in [
45]. A key feature of this tool is an extension of the query optimizer to support a “what-if” interface [
46]. In the SQL Server 2000 release, ITW was enhanced with integrated recommendations for indexes and materialized views in the context of multidimensional (OLAP) databases [
7].
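To make the role of the “what-if” interface concrete, the following Python sketch shows how a tuning tool can cost a workload against hypothetical indexes that exist only as metadata. All class and function names and the toy cost model are illustrative assumptions, not the actual SQL Server interface.

```python
# A minimal sketch of the "what-if" pattern: the tuning tool asks the
# optimizer to estimate query costs against hypothetical indexes that exist
# only as metadata (plus statistics), never as physical structures. All
# names and the toy cost model are illustrative assumptions.

from dataclasses import dataclass

@dataclass(frozen=True)
class HypotheticalIndex:
    table: str
    column: str  # single-column index, for simplicity

class WhatIfOptimizer:
    def estimate_cost(self, query: str, config: list) -> float:
        """Toy stand-in for a what-if optimizer call: a query is assumed
        cheaper if some hypothetical index column appears in its text."""
        base_cost = 100.0
        if any(ix.column in query for ix in config):
            return base_cost * 0.2   # pretend an index gives a 5x speedup
        return base_cost

def workload_cost(opt: WhatIfOptimizer, workload: list, config: list) -> float:
    """Total estimated cost of the workload under a candidate configuration."""
    return sum(opt.estimate_cost(q, config) for q in workload)

if __name__ == "__main__":
    workload = ["SELECT * FROM orders WHERE o_custkey = 42",
                "SELECT * FROM orders WHERE o_orderdate > '2024-01-01'"]
    candidate = [HypotheticalIndex("orders", "o_custkey")]
    print(workload_cost(WhatIfOptimizer(), workload, []))         # without indexes
    print(workload_cost(WhatIfOptimizer(), workload, candidate))  # with the candidate
```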
Horizontal and vertical partitioning are other important aspects of physical database design with significant impact on performance and manageability. Horizontal partitioning allows tables, indexes and materialized views to be partitioned into disjoint sets of rows, which are separately stored and accessed. Vertical partitioning allows a table to be partitioned into disjoint sets of columns. In [
8], a scalable solution to the integrated physical design problem of indexes, materialized views, and vertical and horizontal partitioning for both performance and manageability was presented. In Microsoft SQL Server 2005, the functionality of ITW was replaced by a full-fledged application, called the
Database Engine Tuning Advisor (
DTA) [
6]. It significantly increased the scope and usability of ITW. DTA provides integrated recommendations for indexes, indexed views, indexes on indexed views, and horizontal range partitioning. It scales to large databases and workloads using techniques such as workload compression and reduced statistics creation. ITW and DTA employ a “candidate selection” step to identify a set of likely configurations in a cost-based manner by consulting the query optimizer.
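As a rough illustration of such a cost-based candidate selection step, the sketch below costs the syntactically relevant indexes of each query through “what-if” calls and keeps the best few per query. The helpers `relevant_indexes` and `estimate_cost` are assumed to be provided; this is not the actual ITW/DTA algorithm.

```python
# Sketch of a cost-based candidate selection step: for each query, the
# syntactically relevant indexes are costed via what-if calls, and the
# winners across all queries form the candidate set for later enumeration.
# `relevant_indexes(query)` and `estimate_cost(query, config)` are assumed
# helpers; the latter stands in for a what-if optimizer call.

def select_candidates(workload, relevant_indexes, estimate_cost, per_query=2):
    candidates = set()
    for query in workload:
        options = relevant_indexes(query)        # e.g., indexable WHERE columns
        if not options:
            continue
        # Keep the few indexes that reduce this query's estimated cost the most.
        ranked = sorted(options,
                        key=lambda ix: estimate_cost(query, frozenset({ix})))
        candidates.update(ranked[:per_query])
    return candidates
```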
The ITW and DTA tools assume that the DBA knows when to invoke a physical design tuning tool. In this regard, DBAs face two main challenges. First, changes in data distributions and workloads might render the current configuration sub-optimal, so DBAs must repeatedly invoke the tuning tool to recommend changes to it. Second, since running a tuning tool incurs overhead, it should only be run, and the configuration only changed, when the current configuration is no longer optimal. Therefore, the work in [
36] introduced a more lightweight tool, called Alerter, which identifies when a physical design tool should be invoked. Finally, the work in [
37] introduced an online index selection approach that continuously monitors changes in the workload and in the data, and modifies the physical design as needed. It was prototyped inside the Microsoft SQL Server engine.
Since a crucial requirement for a self-tuning database system is the ability to efficiently monitor the database and identify potential performance threats, the AutoAdmin project also developed a
SQL Continuous Monitoring engine (
SQLCM) [
44]. It enables continuous monitoring inside the database system and can automatically take actions based on the collected monitoring data.
5.3.3 Database Physical Design following the AutoAdmin Project.
Physical database schema tuning can be carried out in three different modes: manual, semi-automatic, and full-automatic. Manual (or offline) approaches are invoked by the DBA whenever deemed necessary and recommend a physical configuration to the DBA to improve DBMS performance. Semi-automatic techniques keep the DBA in the loop by recommending a physical configuration that requires feedback from the DBA. Full-automatic approaches, known as online physical design tuning, continuously tune the configuration as the workload or database state changes, without the need for any DBA intervention.
Manual Mode. Manual tools for physical design tuning in commercial database systems other than Microsoft SQL Server are the IBM DB2 Design Advisor [
219,
236,
237] and the SQL Access Advisor in Oracle 10g [
64]. The DB2 Design Advisor [
237] recommends indexes and materialized views for a given workload. As of DB2 version 8.2 [
236], the Design Advisor provides integrated recommendations for physical design features, including indexes, materialized views, partitioning, and multi-dimensional clustering. It adopts a hybrid algorithm that efficiently searches the large solution space while taking the interactions of related features into consideration. This approach makes it possible to break the implementation of the different features into smaller components. The implementation details can be found in [
151,
191,
237]. For instance, a knapsack algorithm followed by a random-swapping phase was proposed for index and materialized view selection [
219,
237], and a rank-based search and a genetic algorithm were proposed for partitioning [
191]. The SQL Access Advisor in Oracle 10g [
64] recommends indexes and materialized views for a given workload. It receives as input a workload and a set of candidates (created on a per-query basis by the Automatic Tuning Optimizer) and outputs a recommendation for the overall workload. In Oracle 11g [
28,
138], the advisor was extended to also make recommendations for partitioning.
The index selection approach in [
203] recursively applies a greedy selection that effectively accounts for index interactions. The approach decreases the number of “what-if” optimizer calls by employing caching and workload information, and thus scales to large problems.
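The following sketch illustrates the general shape of such a greedy, cost-based enumeration with caching of “what-if” estimates. It is a simplified illustration under assumed helper names, not the exact algorithm of [203].

```python
# Sketch of greedy index selection with cached "what-if" estimates; a
# simplified illustration, not the algorithm of [203].

def greedy_select(workload, candidates, estimate_cost, budget):
    """Greedily pick up to `budget` indexes for a workload of query strings.
    `candidates` is a set of (table, column) pairs, and
    `estimate_cost(query, config)` stands in for a what-if optimizer call."""

    cache = {}

    def query_cost(query, config):
        # Key the cache on only the indexes that can matter for this query
        # (here: the index's table name appears in the query text), so
        # estimates are reused across many enumerated configurations.
        relevant = frozenset(ix for ix in config if ix[0] in query)
        key = (query, relevant)
        if key not in cache:
            cache[key] = estimate_cost(query, relevant)
        return cache[key]

    def total_cost(config):
        return sum(query_cost(q, config) for q in workload)

    chosen = frozenset()
    current = total_cost(chosen)
    for _ in range(budget):
        best_ix, best_cost = None, current
        for ix in candidates - chosen:
            cost = total_cost(chosen | {ix})
            if cost < best_cost:
                best_ix, best_cost = ix, cost
        if best_ix is None:          # no remaining candidate improves the cost
            break
        chosen, current = chosen | {best_ix}, best_cost
    return chosen
```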
The work in [
192] presents a physical design tool for C-store [
215], a clustered column-store RDBMS. This tool builds an optimal candidate set consisting of materialized views and clustering indexes for OLAP workloads. The work in [
233] provides a hybrid evolutionary approach, combining the advantages of heuristic and evolutionary algorithms to select materialized views in a data warehouse, whereas [
98] presents a simple greedy heuristic for selecting views. The cost model in these studies accounts for both query processing and view maintenance costs.
In [
127], the authors investigate how to effectively integrate index compression techniques into physical database design. The presented index selection approach considers compressed alternatives for each index based on a given memory budget. The optimizer’s cost model is extended to handle compressed indexes in the configurations. The work in [
193] jointly optimizes different configuration decisions (compression, index, and ordering) using linear programming to determine the best runtime performance for a given workload and memory budget.
Finally, there exist a number of state-of-the-art offline approaches for suggesting vertical and horizontal partitions [
124,
171,
186] and multidimensional hierarchical clustering of OLAP data [
159].
Semi-automatic or Full-automatic Mode. The work in [
205] proposes a semi-automatic index tuning technique, which adapts its recommendations to changes in the workload and to feedback from the DBA. Fully-automatic approaches have become popular in the form of database cracking [
113,
114], continuous monitor-and-tune approaches [
13,
14,
37,
88,
111,
112,
122,
147,
198,
204,
227], and machine learning-based approaches [
26,
76,
135,
136,
146,
150,
179,
184,
185,
187,
210,
231]. Database cracking is an online index selection approach, where indexes are created and continuously refined as part of query processing. In this approach, each query is interpreted not only as a request for a particular result set, but also as advice to crack the physical database store into smaller pieces (a minimal sketch of this idea is given after this paragraph). There is a large body of work on extending and improving database cracking [
90,
91,
92,
93,
100,
110,
115,
189,
206,
207]. A comparative study of different cracking techniques is presented in [
208].
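The following minimal Python sketch illustrates the core cracking idea on a single in-memory column: each range query partially reorders only the piece it touches and records the resulting piece boundaries, so later queries scan ever smaller pieces. It is a deliberately simplified sketch, not the original MonetDB implementation.

```python
import bisect

class CrackedColumn:
    """Minimal sketch of database cracking on one in-memory column. Each
    range query partitions ("cracks") only the piece it touches and records
    the new boundaries in a cracker index for future queries."""

    def __init__(self, values):
        self.data = list(values)
        # Sorted (value, pos) pairs: data[:pos] holds values < value,
        # data[pos:] holds values >= value.
        self.bounds = []

    def _crack(self, v):
        """Partition the piece containing v so that values < v precede
        values >= v; return the split position."""
        keys = [b[0] for b in self.bounds]
        i = bisect.bisect_left(keys, v)
        if i < len(self.bounds) and self.bounds[i][0] == v:
            return self.bounds[i][1]              # already cracked on v
        lo = self.bounds[i - 1][1] if i > 0 else 0
        hi = self.bounds[i][1] if i < len(self.bounds) else len(self.data)
        piece = self.data[lo:hi]
        left = [x for x in piece if x < v]
        self.data[lo:hi] = left + [x for x in piece if x >= v]
        pos = lo + len(left)
        self.bounds.insert(i, (v, pos))
        return pos

    def range_select(self, low, high):
        """Return all values x with low <= x < high, cracking as a side effect."""
        p1, p2 = self._crack(low), self._crack(high)
        return self.data[p1:p2]

if __name__ == "__main__":
    col = CrackedColumn([7, 1, 9, 4, 6, 2, 8, 3, 5])
    print(sorted(col.range_select(3, 7)))  # [3, 4, 5, 6]
    print(col.data)                        # physically reorganized around 3 and 7
```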
Continuous monitor-and-tune approaches [
13,
14,
37,
88,
111,
112,
122,
147,
198,
204,
227] continuously monitor changes in the workload, collect statistics, and automatically make online decisions to change the physical design as needed. The works in [
204] and [
198] share similar goals with [
37], but differ in their design. The authors of [
198] developed the earliest online index selection solution, called the QUIET framework, implemented on top of DB2. It aims to create indexes automatically at runtime while respecting a DBA-defined space budget. In comparison to [
37], this approach requires multiple additional calls to the optimizer to obtain information about candidate indexes. The work in [
204] developed a self-tuning framework, called COLT, that continuously monitors the workload of a database system and enriches the existing physical design with a set of effective indexes. COLT has been implemented inside PostgreSQL, extending the optimizer with a “what-if” interface. In contrast to [
37], it is not fully integrated in the query optimizer. The authors of [
227] have developed an autonomic tuning expert for IBM DB2 and evaluated it under changing workloads and system states. The approach stores DBA experts’ tuning knowledge for the autonomic management process.
An online DBMS-independent framework for automatic index tuning was introduced in [
14], which can be used with different DBMSs. It is based on heuristics that run continuously to guide decisions on the current physical database configuration in order to react to workload changes. In [
13], a model-driven approach was developed to support tuning decisions made by different heuristics. An instantiation of the model is the database self-tuning task for physical design tuning, including indexes and materialized views. In this model, a database tuning requester can be a human agent, such as a DBA, or a computational agent, such as a monitoring tool. The work in [
147] employs an online approach to monitor database queries. It collects statistical information in order to detect performance trends and to make decisions for automatic re-partitioning. In [
111,
112], the authors presented a method for automatically deriving database configurations by evaluating an objective function. The proposed self-management logic provides continuous monitoring and lightweight analysis of the workload and database state, compares them against goals (e.g., response times, throughput, CPU or disk space usage, availability, and operation costs) defined by a DBA, and starts a reconfiguration analysis when there is a risk of missing the goals. Multi-objective optimization techniques are used to automatically derive database configurations that meet these goals. Additionally, this approach stores an explicit workload model to identify patterns in the workload that make it possible to predict workload shifts. Finally, an online technique to recommend indexes for high-dimensional databases was introduced in [
88]. This approach uses a control feedback loop to monitor the query workload, detect when the query pattern changes, and recommend corresponding adjustments to the index set.
There are numerous studies [
26,
76,
135,
136,
140,
146,
150,
179,
184,
185,
187,
210,
231] that have leveraged machine learning techniques for automated tuning of DBMSs. They extract state and performance metrics from a DBMS and use this data to train models that choose the tuning features yielding the greatest benefit. Some studies [
135,
136,
146,
184,
185] propose autonomous (self-managing) database systems, which leverage machine learning techniques for workload modeling and forecasting, fully automated physical design, and other configuration components. Thus, they do not focus on one particular configuration aspect but take a holistic view that considers combinations of various configuration components.
The authors of [
184] present the first self-driving RDBMS, called Peloton (later NoisePage [
185]), which automatically configures and optimizes various aspects (e.g., indexes, materialized views, data partitioning, knobs) as the database and the workload evolve over time. It supports hybrid (OLTP and OLAP) workloads and predicts future workloads based on past arrival rates [
156] so that the system can adapt itself. NoisePage uses behavior models generated by different ML algorithms to predict the cost and benefit of deploying configuration components. Based on these cost/benefit estimates, NoisePage adopts tree-based optimization methods (e.g.,
Monte Carlo tree search (
MCTS) or RL methods) to select actions that improve the system’s target objective function.
The authors of [
135,
136] proposed a framework for self-driving database systems that automatically tune various configuration features, such as physical database design, knob configurations, or hardware resources. The framework is able to predict future workloads and to deliver configurations based on these predictions. They also propose a
linear programming (
LP) model to tune the multiple dependent features in a recursive way. Additionally, an approach for adaptive cost estimation is suggested, based on linear regression, gradient boosting regressors, or neural networks. The proposed framework is integrated into the research DBMS Hyrise [
74], a relational columnar store for hybrid transactional and analytical processing (HTAP).
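As a rough illustration of the flavor of such an LP-based selection, the toy sketch below chooses configuration options (e.g., candidate indexes) that maximize estimated workload benefit under a memory budget via an LP relaxation in SciPy. The numbers and the model are assumptions and do not reproduce the formulation of [135, 136].

```python
# Toy LP relaxation for choosing configuration options (e.g., candidate
# indexes) under a memory budget; an assumed illustration of the general
# LP flavor, not the formulation used in [135, 136].

import numpy as np
from scipy.optimize import linprog

# Assumed inputs: estimated workload benefit and memory footprint per candidate.
benefits = np.array([120.0, 80.0, 60.0, 30.0])      # estimated cost reduction
sizes    = np.array([500.0, 300.0, 400.0, 100.0])   # memory footprint (MB)
budget   = 800.0                                    # available memory (MB)

# Maximize total benefit <=> minimize its negation, subject to the budget.
res = linprog(c=-benefits,
              A_ub=sizes.reshape(1, -1), b_ub=[budget],
              bounds=[(0, 1)] * len(benefits),
              method="highs")

# Fractional selections can be rounded or handed to a follow-up heuristic.
print(dict(zip(["ix0", "ix1", "ix2", "ix3"], np.round(res.x, 2))))
```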
In [
146], the authors built an autonomous database framework and integrated ML techniques into the open-source database system openGauss [
175]. They proposed learning-based models for developing learned optimizers and advisors, such as learned cost estimation, learned knob tuning, learned database diagnosis, and a learned view/index advisor. For the latter, they proposed a deep reinforcement learning model that selects high-quality indexes and materialized views.
Other researchers have developed learning approaches for automatic database indexing [
26,
140,
179,
187,
197,
210], materialized view selection [
150,
231], and partitioning [
76] using reinforcement learning or deep reinforcement learning.
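To give a flavor of how such approaches cast physical design as a reinforcement learning problem, the toy sketch below trains a tabular Q-learning agent whose state is the current index set, whose actions add one candidate index, and whose reward is the estimated cost reduction. The synthetic cost model is an assumption, and none of the cited systems is reproduced here (they typically rely on deep RL rather than a tabular agent).

```python
# Toy tabular Q-learning for index selection; a simplified illustration of
# casting index selection as an RL problem, not any of the cited systems.
# The synthetic cost model below is an assumption made for the demo.

import random
from collections import defaultdict

CANDIDATES = ["ix_a", "ix_b", "ix_c", "ix_d"]
BUDGET = 2  # at most two indexes may be built

def workload_cost(config):
    """Synthetic stand-in for what-if costing: some indexes help more,
    and ix_a + ix_c overlap (together they help less than the sum)."""
    cost = 100.0
    if "ix_a" in config: cost -= 30
    if "ix_b" in config: cost -= 10
    if "ix_c" in config: cost -= 25
    if "ix_d" in config: cost -= 5
    if "ix_a" in config and "ix_c" in config: cost += 15  # overlapping benefit
    return cost

Q = defaultdict(float)               # Q[(state, action)], state = frozenset
alpha, gamma, eps = 0.2, 0.9, 0.2

def legal_actions(state):
    return [ix for ix in CANDIDATES if ix not in state] if len(state) < BUDGET else []

for _ in range(3000):                # training episodes
    state = frozenset()
    while True:
        actions = legal_actions(state)
        if not actions:
            break
        if random.random() < eps:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        nxt = state | {action}
        reward = workload_cost(state) - workload_cost(nxt)   # cost reduction
        future = max((Q[(nxt, a)] for a in legal_actions(nxt)), default=0.0)
        Q[(state, action)] += alpha * (reward + gamma * future - Q[(state, action)])
        state = nxt

# Greedy rollout of the learned policy.
state = frozenset()
while legal_actions(state):
    state |= {max(legal_actions(state), key=lambda a: Q[(state, a)])}
print(sorted(state), workload_cost(state))
```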
Finally, we would like to mention some autonomous features in commercial databases, including the Oracle Autonomous Database [
176], Alibaba's self-driving database [
12,
145], and fully-automated indexing in Microsoft Azure SQL Database [
67].