Abstract
To ensure critical infrastructure is operating as expected, high-quality sensors are increasingly installed. However, due to the enormous amounts of high-frequency time series they produce, it is impossible or infeasible to transfer or even store these time series in the cloud when using state-of-the-practice compression methods. Thus, simple aggregates, e.g., 1–10-minute averages, are stored instead of the raw time series. However, by only storing these simple aggregates, informative outliers and fluctuations are lost. Many Time Series Management Systems (TSMSs) have been proposed to efficiently manage time series, but they are generally designed for either the edge or the cloud. In this paper, we describe a new version of the open-source model-based TSMS ModelarDB. The system is designed to be modular and the same binary can be efficiently deployed on the edge and in the cloud. It also supports continuously transferring high-frequency time series compressed using models from the edge to the cloud. We first provide an overview of ModelarDB, analyze the requirements and limitations of the edge, and evaluate existing query engines and data stores for use on the edge. Then, we describe how ModelarDB has been extended to efficiently manage time series on the edge, a novel file-based data store, how ModelarDB’s compression has been improved by not storing time series that can be derived from base time series, and how ModelarDB transfers high-frequency time series from the edge to the cloud. As the work that led to ModelarDB began in 2015, we also reflect on the lessons learned while developing it.
1 Introduction
High-quality sensors are increasingly used for monitoring and to facilitate automation and error detection, e.g., for Renewable Energy Sources (RES) installations like wind turbines and solar panels. However, the limited amount of bandwidth available for transferring data between the edge and the cloud generally makes transferring high-frequency time series impossible. In addition, it is often either infeasible or prohibitively expensive to store the high-frequency time series in the cloud. Thus, simple aggregates, e.g., 1–10-minute averages, are generally stored in the cloud instead of the high-frequency time series. However, this removes informative outliers and fluctuations. To remedy this problem, many Time Series Management Systems (TSMSs) have been proposed to efficiently ingest, query, and store time series [4, 10, 11]. However, most TSMSs are designed to be deployed on either the edge (e.g., in a wind turbine) or the cloud. Thus, these TSMSs cannot optimize ingestion, query processing, storage, and data transfer across the edge and cloud. Users must also often add additional systems, e.g., Apache Kafka, to continuously transfer data points from the edge to the cloud.
ModelarDB is a state-of-the-art modular model-based TSMS that was originally designed specifically for deployment on a cluster of cloud nodes [12,13,14]. In the context of ModelarDB, a model is any representation that can reconstruct the values of a group of time series within a user-defined error bound (possibly 0%). For example, a linear function is a model that can approximately represent increasing, decreasing, or constant values from a group of time series efficiently. In general, model-based compression of time series provides much better compression than general-purpose compression methods like DEFLATE, LZ4, and Snappy, especially if a small amount of error is allowed [9, 12,13,14, 26]. This paper provides an overview of the current version of ModelarDB and describes how it has been extended so it can be deployed on individual edge nodes, be deployed on a cluster of cloud nodes, and continuously transfer data points efficiently compressed using models from the edge nodes to the cloud nodes. ModelarDB is an open-source project and the source code for the version of ModelarDB described in this paper is available at https://github.com/skejserjensen/ModelarDB. To differentiate between the different versions of ModelarDB referenced in this paper, we use the previous version of ModelarDB to refer to the version of ModelarDB documented in our previous papers [12,13,14], while we use the current version of ModelarDB to refer to the version of ModelarDB with the extensions described in this paper. In this paper we make the following contributions:
1. Provide an overview of the open-source model-based TSMS ModelarDB.
2. Analyze the requirements and limitations of the edge for RES installations.
3. Analyze existing query engines and data stores and evaluate which of them are most suitable for use by ModelarDB when deployed on the edge.
4. Extend ModelarDB to efficiently ingest, query, and store groups of similar time series using multiple different types of models on the edge.
5. Extend ModelarDB with a novel file-based data store for the edge and cloud.
6. Extend ModelarDB to reduce the number of time series that are physically stored by efficiently recomputing the time series that can be derived from a base time series using user-defined functions and dynamic code generation.
7. Extend ModelarDB with a data transfer component that continuously transfers data points compressed using models from the edge to the cloud.
8. Reflect on the lessons learned while designing and developing ModelarDB.
The paper is structured as follows. Definitions are provided in Sect. 2. Section 3 describes the architecture of ModelarDB. The extensions added to ModelarDB so it can be efficiently deployed on the edge are described in Sect. 4. Then each part of ModelarDB is described in detail with ingestion described in Sect. 5, query processing described in Sect. 6, storage described in Sect. 7, and data transfer described in Sect. 8. The lessons learned while designing and developing ModelarDB are documented in Sect. 9. Related work is described in Sect. 10, while our conclusion and future work are presented in Sect. 11.
2 Preliminaries
The definitions and the values used in the examples in this section are reused from [14] as the work in this paper builds upon our previous work [12,13,14].
Definition 1
Time Series: A time series TS is a sequence of data points, in the form of timestamp and value pairs, ordered by time in increasing order \(TS = \langle (t_1, v_1), (t_2, v_2), \ldots \rangle \). For each pair \((t_i, v_i)\), \(1 \le i\), the timestamp \(t_i\) represents when the value \(v_i \in \mathbb {R}\) was recorded. A time series \(TS = \langle (t_1, v_1), \ldots , (t_n, v_n) \rangle \) with a fixed number of n data points is a bounded time series.
Definition 2
Regular Time Series: A time series \(TS = \langle (t_1, v_1), (t_2, v_2), \ldots \rangle \) is regular if the time elapsed between each pair of consecutive data points is always the same, i.e., \(t_{i+1} - t_i = t_{i+2} - t_{i+1}\) for \(1 \le i\) and irregular otherwise.
Definition 3
Sampling Interval: The sampling interval SI of a regular time series \(TS = \langle (t_1, v_1), (t_2, v_2), \ldots \rangle \) is the time elapsed between each pair of consecutive data points in the time series: \(SI = t_{i + 1} - t_i\) for \(1 \le i\).
For example, the time series \(TS_e = \langle (100, 9.43), (200, 9.09), (300, 8.96), (400, 8.62), (500, 8.50), \ldots \rangle \) is a regular unbounded time series with a 100-millisecond sampling interval. A bounded time series can be constructed from \(TS_e\) by extracting the data points with the timestamps \(100 \le t \le 500\).
Definition 4
Gap: A gap between a regular bounded time series \(TS_1 = \langle (t_{1}, v_{1}), \ldots , (t_{s}, v_{s}) \rangle \) and a regular time series \(TS_2 = \langle (t_{e}, v_{e}), (t_{e+1}, v_{e+1}), \ldots \rangle \) with the same sampling interval SI and recorded from the same source, is a pair of timestamps \(G = (t_s, t_e)\) with \(t_e = t_s + m \times SI\), \(m \in \mathbb N_{\ge 2}\), and where no data points exist between \(t_s\) and \(t_e\).
No gaps exist in \(TS_e\) as there is a data point for all timestamps matching the sampling interval. However, for \(TS_g = \langle (100, 9.43), (200, 9.09), (300, 8.96), (400, 8.62), (500, 8.50), (1100, 7.08), \ldots \rangle \) there is a gap \(G = (500, 1100)\) which separates two regular time series. For simplicity, we say that multiple time series recorded from the same source but separated by gaps are one time series containing gaps. As the sampling interval is undefined for a time series with gaps, a regular time series with gaps is defined as a time series where all data points in a gap have the special value \(\bot \) indicating that no real values were collected.
Definition 5
Regular Time Series with Gaps: A regular time series with gaps is a regular time series, \(TS = \langle (t_1, v_1), (t_2, v_2), \ldots \rangle \) where \(v_i \in \mathbb {R} \cup \{\bot \}\) for \(1 \le i\). All sub-sequences in TS of the form \(\langle (t_s, v_s), (t_{s + 1}, \bot ), \ldots , (t_{e - 1}, \bot ), (t_e, v_e) \rangle \) where \(v_s, v_e \in \mathbb {R}\), are denoted as gaps \(G=(t_s, t_e)\).
As an example, \(TS_{rg} = \langle (100, 9.43), (200, 9.09), (300, 8.96), (400, 8.62), (500, 8.50), (600, \bot ), (700, \bot ), (800, \bot ), (900, \bot ), (1000, \bot ), (1100, 7.08), \ldots \rangle \) is a regular time series with gaps and has a sampling interval of 100 milliseconds.
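To make Definition 5 concrete, the following small Scala sketch (not part of ModelarDB’s API; the helper name and types are illustrative) converts a time series with a gap, such as \(TS_g\), into a regular time series with gaps by inserting None (representing \(\bot \)) for every timestamp that falls inside a gap, assuming at least two data points and a known sampling interval:

// Fill every gap in a time series with None for the missing timestamps.
def fillGaps(ts: Seq[(Long, Double)], si: Long): Seq[(Long, Option[Double])] = {
  val filled = ts.sliding(2).flatMap { case Seq((t1, v1), (t2, _)) =>
    ((t1, Option(v1))) +: ((t1 + si) until t2 by si).map(t => (t, Option.empty[Double]))
  }.toVector
  filled :+ ((ts.last._1, Option(ts.last._2)))
}

val tsg = Seq((100L, 9.43), (200L, 9.09), (300L, 8.96), (400L, 8.62), (500L, 8.50), (1100L, 7.08))
// fillGaps(tsg, 100) reproduces TS_rg, e.g., (500, Some(8.50)), (600, None), ..., (1100, Some(7.08)).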
The values of time series can be represented by models which are functions mapping from timestamps to estimates of values such that each estimate is within a given error bound from the actual value. A model-based representation enables efficient compression of time series.
Definition 6
Model: A model of a time series \(TS = \langle (t_1, v_1), (t_2, v_2), \ldots \rangle \) is a function m. For each \(t_i\), \(1 \le i\), m is a real-valued mapping from \(t_i\) to an estimate of the value \(v_i\) for the corresponding data point in TS.
Definition 7
Model Type: A model type is a pair of functions \(M_T = (m_t, e_t)\). \(m_t(TS, \epsilon )\) is a partial function, which, when defined for a bounded time series TS and a non-negative real number \(\epsilon \), returns a model m of TS such that \(e_t(TS, m) \le \epsilon \). \(e_t\) is a mapping from TS and m to a non-negative real number representing the error of the values estimated by m. We call \(\epsilon \) the error bound.
A model type (e.g., linear regression) determines the set of parameters required to create a specific model of that type for approximating the values of a time series. Models represent the values of a time series as the function m with the error of the representation within the error bound \(\epsilon \) as computed by the model type’s function \(e_t\). We say that a model is fitted to a bounded regular time series, e.g., \(TS_b = \langle (100, 9.43), (200, 9.09), (300, 8.96), (400, 8.62), (500, 8.50) \rangle \), when determining the parameters of a model using a model type. A single model may also be able to efficiently represent the values from a group of time series within a given error bound if the time series in the group have similar values.
Definition 8
Time Series Group: A time series group is a set of regular time series, possibly with gaps, \(TSG = \{TS_1, \ldots , TS_n\}\), where for \(TS_i, TS_j \in TSG\) it holds that they have the same sampling interval SI and that \(t_{1i} \bmod SI = t_{1j} \bmod SI\) where \(t_{1i}\) and \(t_{1j}\) are the first timestamp of \(TS_{i}\) and \(TS_{j}\), respectively.
A time series group can only contain time series with the same sampling interval and aligned timestamps. This ensures that a data point is received from all time series in a group at each sampling interval unless gaps occur. If the time series in a group do not have approximately the same values, scaling can be applied to allow a single stream of models to represent the values of all time series in the group. By representing the values from a time series group using one stream of models, the compression ratio can be significantly increased compared to a model-based representation of each individual time series in the group.
Definition 9
Segment: A segment is a 5-tuple \(S = (t_s, t_e, SI, G_{ts} : TSG \rightarrow 2^{\{t_s, t_s+SI, \ldots , t_e\}}, m)\) representing the data points for a bounded time interval of a time series group TSG. The 5-tuple consists of start time \(t_s\), end time \(t_e\), sampling interval SI, a function \(G_{ts}\) which for each \(TS \in TSG\) returns the set of timestamps for which \(v = \bot \) in TS, and where the values for all other timestamps are defined by the model m multiplied by a scaling constant \(C_{TS} \in \mathbb {R}\).
To provide an example of a segment, we use the following three time series:
- \(TS_1 = \langle (100, 9.43), (200, 9.09), (300, 8.96), (400, 8.62), (500, 8.50) \rangle \)
- \(TS_2 = \langle (100, 8.78), (200, 8.55), (300, 8.32), (400, 8.09), (500, 8.96) \rangle \)
- \(TS_3 = \langle (100, 9.49), (200, 9.20), (300, 8.92), (400, 8.73), (500, 8.65) \rangle \)
These three time series are grouped together in a time series group. The values in this time series group can be efficiently compressed as the linear function \(m = -0.003 \times t_i + 9.40\). Using the uniform norm, this model has an error of \(|8.96 - (-0.003 \times 500 + 9.40)| = 1.06\), with the worst case occurring for \(TS_2\) at timestamp 500. If we assume that the error bound is 1, a segment like \(S = (100, 400, 100, G_{ts} = \emptyset {}, m = -0.003\times t_i + 9.40)\), \(1 \le i \le 4\), must be created for the model-based representation to be within the error bound.
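The example can be verified with a few lines of Scala; the following sketch (illustrative, not ModelarDB code) computes the per-timestamp error of the linear model under the uniform norm and shows that the segment can cover timestamps 100–400 but not 500:

// The linear model from the example and the grouped values per timestamp.
def m(t: Long): Double = -0.003 * t + 9.40
val group = Seq(                       // (timestamp, values of TS_1, TS_2, TS_3)
  (100L, Seq(9.43, 8.78, 9.49)), (200L, Seq(9.09, 8.55, 9.20)),
  (300L, Seq(8.96, 8.32, 8.92)), (400L, Seq(8.62, 8.09, 8.73)),
  (500L, Seq(8.50, 8.96, 8.65)))
// Maximum absolute error across the group at each timestamp (uniform norm).
val maxErrors = group.map { case (t, vs) => (t, vs.map(v => math.abs(v - m(t))).max) }
// maxErrors ≈ (100, 0.39), (200, 0.40), (300, 0.46), (400, 0.53), (500, 1.06)
val segmentEndTime = maxErrors.takeWhile(_._2 <= 1.0).last._1  // 400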
We also informally define edge nodes based on discussions with owners of RES installations. Thus, an edge node is a low-end commodity PC, e.g., 4 CPU Cores, 4 GiB RAM, and an HDD, that collects data points from a set of sensors. It is deployed very close to the sensors it collects data from and has a connection with limited bandwidth to the cloud, e.g., 500 Kbits/s to 5 Mbits/s. Each edge node continuously transfers data points to the cloud. Thus, edge nodes provide limited processing power but can access the latest data points with low latency.
Likewise, we informally define a cloud node as a VM, e.g., 8 Virtual CPU Cores, 32 GiB RAM, and an SSD, running on a high-end server that processes data points collected from sensors by a set of edge nodes. It is deployed in a data center far from the sensors the data points are collected from. Multiple cloud nodes are connected to form a cluster using connections with ample bandwidth. Thus, cloud nodes provide almost unlimited processing power, but the latency from when a data point is collected until it can be processed by a cloud node is high.
3 ModelarDB Architecture
ModelarDB was designed to be modular and consists of a Java library named ModelarDB Core which can be interfaced with different query engines and data stores depending on the use case [12]. Switching between different query engines and data stores does not require ModelarDB to be recompiled; instead, users can simply specify which query engine and data store to use in a configuration file. Thus, the same ModelarDB binary can be efficiently deployed on both the edge and in the cloud. The intended configuration for an efficient deployment on the edge is using H2 as the query engine (see Sect. 4) and a JDBC-compatible RDBMS, Apache Parquet files written to the local file system, or Apache ORC files written to the local file system as the data store as shown in Fig. 1. To use ModelarDB with this configuration, the ModelarDB binary and its configuration file simply have to be copied to an edge node, and the binary executed on the edge node using a JVM. The intended configuration for a scalable deployment on a cluster of nodes in the cloud is using Apache Spark as the query engine and Apache Cassandra, Apache Parquet files written to HDFS, or Apache ORC files written to HDFS as the data store as shown in Fig. 2. To use ModelarDB with this configuration, a cluster with stock versions of Apache Spark and Apache Cassandra or HDFS must be available. Then ModelarDB’s configuration file must be copied to the Apache Spark Master and the ModelarDB binary deployed using the spark-submit script included with Apache Spark. While ModelarDB was designed to use these specific combinations of query engines and data stores, the system does support arbitrary combinations of query engines and data stores.
ModelarDB’s architecture consists of three sets of components as shown in Fig. 1 and Fig. 2. The Data Ingestion components ingest data points and compress them to segments containing metadata and models (see Sect. 5) or receive such compressed segments from another ModelarDB instance, the Query Processing components cache segments and execute queries on the segments that are stored locally (see Sect. 6), and the Segment Storage components provide a uniform interface for reading from and writing to local data stores (see Sect. 7) and for transferring segments to another ModelarDB instance represented by a remote data store (see Sect. 8). The remote data store uses Apache Arrow Flight for communicating with the other instance of ModelarDB. By transferring segments instead of data points, ModelarDB significantly reduces the amount of bandwidth required. Data transfer was implemented as a data store so only minor changes had to be made to the remainder of the system. This remote data store also maintains a reference to the local data store the system is configured to use for reading and writing data locally. While ModelarDB is designed to transfer segments from an instance on the edge that uses H2 as the query engine to an instance in the cloud that uses Apache Spark as the query engine, the system is not limited to this configuration. For example, segments can be transferred between instances that are both configured to use H2 as the query engine.
4 Supporting Query Processing and Storage on the Edge
ModelarDB was originally designed to run distributed in the cloud using Apache Spark for query processing and Apache Cassandra for storage [12]. The current version of ModelarDB has been extended to be efficiently deployed on both the edge and in the cloud. This was done by analyzing the requirements for ModelarDB to be deployed on the edge nodes, analyzing the limitations caused by the hardware currently used on the edge nodes, evaluating existing query engines and data stores to determine which are appropriate for deployment on the edge nodes, integrating the RDBMS H2, and developing a new data store that operates directly on files and supports Apache Parquet and Apache ORC.
4.1 Analysis of Requirements and Limitations on the Edge
To efficiently deploy ModelarDB on the edge, the hardware currently used by practitioners for their edge nodes must be taken into account. The hardware being used is a key concern as the hardware used for edge nodes differs significantly depending on the domain, and a system that is designed to run on a commodity PC will be very different from a system designed to run on a microcontroller. From conversations with owners of RES installations we learned that the hardware they deploy on their edge nodes is similar to low-end commodity PCs, e.g., 4 CPU Cores, 4 GiB RAM, 250 GiB HDD, and a 1 Gbit/s Network Card. Due to the high number of CPU cores compared to the amount of memory and the use of an HDD, a high compression ratio should be prioritized over low CPU usage. A high compression ratio ensures that high-frequency time series can be ingested and written to the HDD despite its low write speed and that a large number of data points can be cached in memory for faster query performance without consuming a large fraction of the available memory. Also, to ensure that alerts can be provided on time through real-time analytics, support for executing queries on the time series during ingestion must be provided.
A key benefit of compressing time series using models is that many models represent the values of data points using functions where the number of coefficients required for each function is constant and thus unrelated to the number of values represented by the model. For example, a linear function can approximately represent an unlimited number of increasing, decreasing, or constant values using only two coefficients. This significantly reduces the amount of storage required. Of course, the used functions must match the structure of the time series, e.g., approximately representing increasing values with constant functions requires multiple functions, while approximately representing constant values using a linear function requires an unnecessary coefficient. ModelarDB already compensates for this by using multiple different model types per time series.
In addition to allowing high-frequency time series to be stored for long periods of time on the edge nodes, a high compression ratio is also required for the time series to be transferred to the cloud nodes. From conversations with owners of RES installations, we learned that the available bandwidth from the edge nodes to the cloud nodes can be as low as 500 Kbits/s to 5 Mbits/s depending on the installation. So, high-frequency time series must be highly compressed on the edge nodes before it becomes feasible to transfer them to the cloud nodes instead of transferring simple aggregates. Also, ingestion, compression, and transfer of the time series must be performed automatically and continuously. Finally, until the compressed time series have been successfully transferred to the cloud nodes and can be queried there, it must be possible to execute queries against them on the edge nodes. As the instances of ModelarDB deployed on the edge nodes and the cloud nodes will be an integrated system, ModelarDB must also support executing the exact same queries on both the edge nodes and the cloud nodes.
In summary, for ModelarDB to be used on the edge nodes it must provide a high compression ratio to reduce the amounts of storage and bandwidth required, allow queries to be executed during ingestion, and support executing the exact same queries on the edge nodes and on the cloud nodes. While the previous version of ModelarDB’s model-based compression already provides state-of-the-art compression for time series groups, its use of Apache Spark and Apache Cassandra makes it unsuitable for deployment on the edge nodes due to their limited hardware. Also, it does not provide support for automatically transferring compressed time series and metadata between multiple instances of the system.
4.2 Evaluation of Query Engines and Data Stores for the Edge
The hardware requirements of Apache Spark and Apache Cassandra significantly exceed the hardware available on the edge nodes, and their ability to run distributed across multiple nodes is not beneficial when deploying ModelarDB on the edge nodes. Thus, a query engine and data store optimized for the edge nodes are required. However, while ModelarDB was designed to be modular, the only query engine integrated with ModelarDB Core was Apache Spark. Thus, all queries had to be executed by Apache Spark, which was also required for ingesting multiple time series groups in parallel. Thus, a query engine that can execute the same queries as Apache Spark must be integrated with ModelarDB Core to efficiently execute queries while ingesting time series groups in parallel on the edge nodes. ModelarDB supports using RDBMSs through JDBC for the data store, so a lightweight RDBMS can be used instead of Apache Cassandra when running on the edge nodes. The set of requirements for an edge-optimized query engine and data store to be integrated with ModelarDB Core is shown in Table 1.
Based on these requirements a large set of candidate systems were collected from db-engines.com and dbdb.io. Small-scale experiments were then used to reduce the set of candidate systems, e.g., by testing if their APIs were expressive enough for their query engines to be efficiently integrated with ModelarDB Core and if they were fast enough to process high-frequency time series. As shown in Table 2, most query engines were discarded due to a lack of functionality or documentation, while PostgreSQL was discarded due to poor query performance when queries were executed through PL/Java (22.51x to 47.28x slower than H2).
After the small-scale experiments were performed, only Apache Derby, H2, and HyperSQL remained. A proof-of-concept implementation was created using each of these three systems and their performance was evaluated. During the development of these proofs-of-concept, HyperSQL was also discarded due to its apparent lack of predicate push-down and support for user-defined aggregates with multiple arguments. To compare the query performance of Apache Derby and H2 the following experimental setup and query workloads were used:
- Hardware: i7-2620M, 8 GiB 1333 MHz DDR3, 7,200 RPM HDD.
- Software: Ubuntu Linux 16.04 on ext4, Apache Derby v10.15.2.0, H2 v1.4.200, and InfluxDB 1.4.2 (baseline and the current top-ranked TSMS on db-engines.com).
- Data Sets: EP which consists of 45,353 short regular time series with gaps and uses 339 GiB when stored as CSV, and EF which consists of 197 long regular time series with gaps and uses 372 GiB when stored as CSV.
- Query Workloads: Three types of queries are used to determine the strengths and weaknesses of each system’s query engine. The types of queries used are shown with examples in Listing 1. The aggregate queries are included as we learned from owners and manufacturers of wind turbines that aggregate queries are their most common query type. Thus, the aggregate queries are representative of the real-life use cases ModelarDB was designed for. For ModelarDB, the aggregate queries are manually rewritten so they are executed much faster directly on the segments. The point/range queries are mainly included to get a more complete evaluation of the query engines.
- Setup: ModelarDB uses H2 as the data store in all experiments so only the query engine differs between Apache Derby and H2. In addition, ModelarDB is configured to store the value of each data point within a 10% error bound.
The results for Apache Derby, H2, and InfluxDB are shown in Table 3. In general, H2 was 1.27x to 7066x faster than Apache Derby, while Apache Derby was only 1.30x faster than H2 for a single set of queries on a single data set (point/range queries on a few long time series). The largest difference between Apache Derby and H2 was for small aggregate queries and is primarily due to Apache Derby lacking predicate push-down for WHERE Tid IN { ... } statements where Tid is a time series identifier. In terms of accuracy, ModelarDB provides a per data point error guarantee. However, the actual average error is often significantly lower than the error bound as shown in [14]. For example, even when using a 10% error bound, the highest average actual error was only 0.34% for EP and only 1.72% for EF [14]. This was also reflected in the average actual error of the query results. Specifically, the average query result error was 0.024% for large aggregates, 0.033–0.28% for small aggregates, 0.012–2.01% for point/range, and 0.0027–0.17% for multidimensional aggregates [14]. Thus, the average actual error is typically very small in practice. Of course, if a 0% error bound is used ModelarDB is guaranteed to produce exact query results.
As it provided much better performance than Apache Derby, H2 was selected as the query engine ModelarDB should use when deployed on edge nodes. While the proof-of-concept implementation indicated that H2 provides the necessary functionality and performance, we also determined that it could be replaced with a new query engine built using Apache Calcite, Apache Arrow DataFusion, or pre-compiling queries to Java in the cloud using Apache Spark SQL if necessary. However, these solutions would require more time to develop than integrating H2 with ModelarDB Core. Thus, they were initially considered secondary options. H2 was also selected as the data store ModelarDB should use when deployed on edge nodes. However, as H2 required a significant amount of storage for indexes to efficiently retrieve the requested data, a file-based data store optimized for OLAP was later implemented as a replacement as described in Sect. 7.2.
4.3 Integrating H2 with ModelarDB
As ModelarDB was designed to be modular, adding an additional query engine to the system did not require significant changes to the architecture or the existing components. However, multiple parts of the original implementation implicitly assumed that Apache Spark would be the sole query engine. So, some components had to be refactored to accommodate the new query engine. For example, the default methods for reading and writing segments were removed from the storage interface as specialized storage interfaces had to be implemented for each query engine to efficiently support predicate push-down as described in Sect. 7.1. Methods that could be used by both Apache Spark and H2 were also moved to a shared component. However, these changes were all internal, so users only need to configure ModelarDB to use H2 as the query engine and data store through ModelarDB’s configuration file when deploying it on the edge nodes. Thus, as described in Sect. 3, the same ModelarDB binary can be deployed on both the edge nodes and the cloud nodes with different configurations.
ModelarDB’s modular architecture made it much simpler to integrate H2 as an additional query engine and data store compared to a monolithic architecture. However, this modularity also significantly increased the complexity of the system and made some optimizations significantly harder to implement or required them to be implemented for each combination of query engine and data store. For example, as stated above, the new storage interfaces provide specialized read and write methods for each query engine to achieve higher performance compared to a shared set of read and write methods. However, this adds complexity to the implementation and increases development time. Another example is the query interface. The previous version of ModelarDB used JSON for query results and the current version uses JSON or Apache Arrow. These formats make it very easy for programs to consume data from ModelarDB, however, they require a conversion step compared to using Apache Spark’s and H2’s binary formats.
5 Ingestion
ModelarDB uses the Group Online Lossy and lossless Extensible Multi-Model (GOLEMM) compression method [14] for compressing time series groups within a user-defined error bound (possibly 0%). The method assumes that the time series in each group have the same regular sampling interval and that the timestamps of the time series in each group are aligned. GOLEMM is a window-based approach that dynamically splits time series groups into dynamically sized sub-sequences and then compresses the values of each sub-sequence using one of multiple different model types. By using a window-based approach, bounded time series (e.g., bulk-loaded from files) and unbounded time series (e.g., ingested online from sockets) can be compressed using the exact same compression method.
5.1 Model-Based Compression
ModelarDB includes extended versions of three different model types [14]: PMC-Mean [17] fits a constant function to the data points’ values, Swing [7] fits a linear function to the data points, and Gorilla [24] compresses the data points’ values using XOR and a variable-length binary encoding. These three model types can all be incrementally updated as data points are received. Users can optionally add more model types through an API as described in Sect. 5.2. As both user-defined and predefined model types are dynamically loaded, ModelarDB does not need to be recompiled when additional model types are added.
An example of compressing a time series group containing three time series using GOLEMM is shown in Fig. 3. At \(t_1\) a data point is received from each time series in the group and the first model type is used to fit a model to these data points. In this example, PMC-Mean is used. At \(t_6\) a data point is received that PMC-Mean cannot represent using the current model within the error bound. Thus, GOLEMM switches to the next model type, which is Swing in this example. To initialize Swing, the data points from \(t_1\) to \(t_6\) are passed to it so an initial linear function can be fit to them. Then, at \(t_{13}\) a data point is received that Swing cannot represent using the current model within the error bound, so GOLEMM switches to Gorilla, which is the last model type in this example. To initialize Gorilla, the data points from \(t_1\) to \(t_{13}\) are passed to it so their values can be compressed using XOR and a variable-length binary encoding. As Gorilla uses lossless compression, it will never exceed the error bound, so instead, it uses a user-configurable length bound. In this example, the length bound is exceeded at \(t_{16}\). After all of the model types have tried to fit models to the time series group, GOLEMM emits the model that provides the best compression ratio as part of a segment containing metadata and the model. In this example, a model of type Swing is emitted as it could represent the data points from \(t_1\) to \(t_{12}\). Then, GOLEMM restarts compression of the group from \(t_{13}\) using PMC-Mean.
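The following simplified, offline Scala sketch illustrates the model selection step of this process (the trait and its methods are illustrative and much simpler than ModelarDB’s actual model type API, and GOLEMM itself works online on unbounded time series): each model type is fitted to the data points buffered since the last segment was emitted, and the model providing the best compression ratio is selected for the next segment:

// A minimal stand-in for a fitted model; not ModelarDB's ModelType API.
trait ModelSketch {
  def append(p: (Long, Array[Double])): Boolean // false once the bound is exceeded
  def length: Int                               // data points represented so far
  def sizeInBytes: Int                          // size of the fitted model
}

def selectBestModel(buffer: Seq[(Long, Array[Double])],
                    modelTypes: Seq[() => ModelSketch]): ModelSketch = {
  val fitted = modelTypes.map { create =>
    val m = create()
    buffer.forall(m.append) // feed data points until the model type rejects one
    m
  }
  // Compression ratio: represented data points per byte of model.
  fitted.maxBy(m => m.length.toDouble / m.sizeInBytes)
}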
ModelarDB can only execute queries against models that have been emitted as part of a segment. Thus, the latency from when a data point is received until it can be queried is theoretically unbounded as the error bound may never be exceeded. To remedy this, ModelarDB allows users to set an upper limit on the number of data points that cannot be queried. If this limit is reached, a temporary segment is emitted which contains a model that represents all of the values that models are currently being fitted to for the time series group. This temporary segment is only stored in memory from where it can be queried, but it is never persisted to disk. Thus, ModelarDB provides efficient and extensible model-based compression of bounded and unbounded time series groups. It also ensures that the latency, i.e., the time from when a data point is ingested until it is ready to be queried, is bounded.
5.2 User-Defined Model Types
As stated, ModelarDB supports dynamically loading user-defined model types. Thus, users can optimize ModelarDB by implementing specialized model types for their domain. For a model type to be used by ModelarDB it must derive from the abstract class ModelType whose overridable methods are shown in Table 4. Users must implement all of these methods with the exception of withinErrorBound() which only needs to be overridden if the model type does not provide a relative per data point error guarantee. To load user-defined model types, the classes’ canonical names must be added to ModelarDB’s configuration file and the corresponding class files must be added to the JVM’s classpath.
For each model type, a corresponding segment must also be available to represent the metadata and models returned by the model type’s get() method. However, it is not a requirement that each model type has a unique corresponding segment, e.g., multiple model types that fit linear functions to data points can share the same segment. For a segment to be used by ModelarDB it must derive from the abstract class Segment whose overridable methods are shown in Table 5. Of these methods, only Segment’s new() and get() methods must be implemented, while the remaining methods can optionally be overridden to compute aggregates directly from the segment. If these methods are not overridden, the aggregates are computed by reconstructing the data points represented by the segment, using get() to compute all of the values, and then computing the aggregate. An example of how sum() is implemented for the type of segment created by Swing is shown in Listing 2 and described in detail in Sect. 6.
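As an illustration of the extension point, the sketch below implements a hypothetical constant-value model type against a simplified stand-in for the ModelType API (the actual abstract methods are those listed in Table 4 and differ from the illustrative ones used here; an absolute rather than relative error bound is used to keep the sketch short):

// Simplified stand-in for the extension API; not ModelarDB's actual ModelType.
abstract class ModelTypeSketch(val mtid: Int, val errorBound: Float) {
  def append(values: Array[Float]): Boolean // false if the bound would be exceeded
  def get(): Array[Byte]                    // serialized model for the segment
}

// Represents all values seen so far by the midpoint of their range, which is
// within an absolute error bound as long as the spread is at most twice the bound.
class MidrangeModelType(mtid: Int, errorBound: Float) extends ModelTypeSketch(mtid, errorBound) {
  private var min = Float.PositiveInfinity
  private var max = Float.NegativeInfinity
  override def append(values: Array[Float]): Boolean = {
    val newMin = math.min(min, values.min)
    val newMax = math.max(max, values.max)
    if (newMax - newMin <= 2 * errorBound) { min = newMin; max = newMax; true } else false
  }
  override def get(): Array[Byte] =
    java.nio.ByteBuffer.allocate(4).putFloat((min + max) / 2).array()
}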
6 Query Processing
ModelarDB uses SQL as its query language and is designed to support multiple different query interfaces. Originally, queries could only be submitted to ModelarDB through sockets or a text file and query results were returned as JSON to make it simple to use from other programming languages and command-line applications like Telnet [12]. Subsequent versions added support for HTTP and a REPL to the system. These interfaces also return query results as JSON. However, while the use of JSON makes it simple to query ModelarDB and view the result, converting to and from JSON adds a significant performance overhead.
Thus, the current version of ModelarDB has been extended to accept queries and return query results using Apache Arrow Flight. Apache Arrow provides a binary in-memory column-based data format that is designed to be programming language-independent and implementations exist in many different programming languages, such as C++, Go, Java, and Rust. Apache Arrow Flight provides functionality for sending and receiving data in Apache Arrow’s data format. To reduce the overhead of converting Apache Spark’s and H2’s in-memory row-based data format to Apache Arrow’s column-based data format, dynamic code generation is used. By dynamically generating the methods performing the conversion based on the schema of the query result, the conversion of each row can be performed without any branches. Static and dynamic code generation are also heavily used for projections to remove branches. However, as ModelarDB dynamically generates and compiles high-level Scala code, the use of dynamic code generation can increase the query processing time in some cases. This is a general problem when using high-level languages like C++ and Scala for dynamic code generation [22]. For example, ModelarDB does not use dynamic code generation when evaluating queries that only need to process a small amount of data and Apache Spark is used as the query engine. To determine how much data Apache Spark has to process, the number of partitions in the Apache Spark RDD to process is used as a heuristic [14]. For H2, dynamic code generation is always used as the compiled code is easy to cache when running on a single node.
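To illustrate the kind of branches that code generation removes, the following Scala sketch contrasts a generic projection, which branches on column names for every row, with the schema-specialized method that would be generated for a query result containing only Tid and Value (the generated source shown is illustrative; ModelarDB additionally converts to Apache Arrow’s column-based format):

// Generic projection: one branch per requested column for every single row.
// Only the three Data Point View columns are handled in this sketch.
def projectGeneric(row: Array[Any], columns: Seq[String]): Array[Any] =
  columns.map {
    case "Tid"       => row(0)
    case "Timestamp" => row(1)
    case "Value"     => row(2)
  }.toArray

// For a query result with the schema (Tid, Value), a specialized method like
// the following is generated and compiled once, then applied to every row
// without branching on column names.
val generatedProjection: Array[Any] => Array[Any] =
  (row: Array[Any]) => Array(row(0), row(2))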
Regardless of the query interface and query engine used, ModelarDB exposes the ingested time series at the data point and segment level. This is implemented as two views with the following schemas where \(\texttt {<}\)Dimensions\(\texttt {>}\) are user-defined denormalized dimensions, i.e., metadata that describes the ingested time series:
- Data Point View: Tid int, Timestamp timestamp, Value float, \(\texttt {<}\)Dimensions\(\texttt {>}\).
- Segment View: Tid int, StartTime timestamp, EndTime timestamp, MTid int, Model blob, Offsets blob, \(\texttt {<}\)Dimensions\(\texttt {>}\).
The Data Point View allows users to query the ingested time series as data points and thus supports arbitrary SQL queries. This is enabled by the requirement that every model can reconstruct the values it represents within a user-defined error bound (possibly 0%). However, while the Data Point View supports arbitrary queries, as stated, many aggregates can be executed more efficiently directly from the metadata and models contained in each segment than from reconstructed data points. ModelarDB supports this through the Segment View using a set of UDFs and UDAFs. To simplify using these functions, ModelarDB implements one extension to SQL in the form of #. This is a specialized version of * which is replaced with the columns required to compute aggregates using the UDAFs. An example of how SUM() is computed by the type of segment created by Swing using the Data Point View and the Segment View is shown in Fig. 4.
This example shows that when SUM() is computed using the Data Point View, all of the ingested data points are reconstructed within the error bound and the aggregate is computed by iterating over these data points. The values are computed using the Segment.get() method shown in Table 5. In contrast, when the query is computed using the Segment View, the Segment.sum() method shown in Table 5 is used to compute SUM(). Thus, a specialized method can be used for each type of segment. The implementation of Segment.sum() for the type of segment created by Swing is shown in Listing 2. This method only needs to compute the value for the first and last data point, compute the average for the segment, and then multiply it by the length. Thus, for the type of segment created by Swing, using the Segment View instead of the Data Point View reduces the complexity from linear time to constant time for each segment.
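The following Scala sketch conveys the idea behind Listing 2 without reproducing it (field and method names are illustrative, gaps are ignored, and the scaling constants are those from Definition 9): since the model is a linear function, the sum of the values it represents is the average of the first and last reconstructed value multiplied by the number of data points:

// Constant-time sum for a segment whose model is a linear function (Swing-like).
class SwingSegmentSketch(startTime: Long, endTime: Long, si: Long,
                         slope: Double, intercept: Double,
                         scalingConstants: Seq[Double]) {
  private def m(t: Long): Double = slope * t + intercept
  def sum(): Double = {
    val count = (endTime - startTime) / si + 1      // data points per time series
    val average = (m(startTime) + m(endTime)) / 2.0 // average of a linear function
    scalingConstants.map(c => average * c * count).sum
  }
}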
In summary, when a query is received from a client, all instances of # are replaced with the required columns and the query is forwarded to the query engine ModelarDB is configured to use. The query engine then parses the query and executes it using the view specified in the FROM-clause. The view retrieves the relevant segments from the data store ModelarDB is configured to use, computes the complete query result, and returns it to the client as JSON or Apache Arrow.
7 Storage
7.1 Data Store Overview
As described in Sect. 3, ModelarDB supports storing time series groups as segments containing metadata and models using three different data stores: JDBC-compatible RDBMSs, files read from and written to a local file system or HDFS, and Apache Cassandra. The file-based data store currently supports reading and writing both Apache Parquet and Apache ORC files. The data stores ModelarDB currently supports and their recommended usage are shown in Table 6.
From a user’s perspective, the data stores all provide the same functionality. However, the RDBMSs and the file-based data store (when reading and writing to the local file system) are designed for use with H2 on the edge nodes, while Apache Cassandra and the file-based data store (when reading and writing to HDFS) are designed for use with Apache Spark on the cloud nodes. To ensure that each query engine can efficiently read and write segments from the data stores, ModelarDB defines an interface for each query engine that the data stores must implement. Thus, H2 interfaces with the data stores through the H2Storage interface which is shown in Table 7, and Apache Spark interfaces with the data stores through the SparkStorage interface shown in Table 8. By implementing separate methods for each query engine they can perform predicate push-down without converting their encoding of the predicates to a shared representation. The native representation used by each query engine for segments can also be passed and returned, thus removing the expensive step of converting all segments to and from a shared representation for both query engines. Of course, this comes at the cost of development time as separate methods must be implemented for reading and writing for each combination of data store and query engine.
open() performs the necessary setup before the data store can be used, e.g., creating tables for the RDBMSs and Apache Cassandra. H2Storage does not contain an H2-specific open() method as it is not necessary to pass any H2-specific information to the data stores. Thus, the general open() method required by the abstract class Storage, which all data stores must derive from, can be used. The method storeSegmentGroups() writes batches of segment groups to the data store, while getSegmentGroups() retrieves segment groups from the data store. SegmentGroup is the type used by ModelarDB to represent a dynamically sized sub-sequence of data points from a time series group and it can be converted to one Segment per time series it represents data points for. The segment groups are batched in memory before each batch is passed to storeSegmentGroups(). To reduce the amount of data retrieved from the data store by getSegmentGroups(), predicate push-down is performed by passing query engine-specific representations of the predicates to getSegmentGroups(). ModelarDB only requires that the data stores return at least the requested segment groups. Thus, the returned data is also filtered by the query engine ModelarDB is configured to use.
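A condensed sketch of these interfaces is shown below (the signatures are simplified compared to Tables 7 and 8, and the fields of SegmentGroup are illustrative); the key point is that each query engine exchanges segment groups and push-down predicates in its own native representation:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.sources.Filter
import org.h2.table.TableFilter

// Illustrative representation of a segment group; the real type differs.
case class SegmentGroup(gid: Int, startTime: Long, endTime: Long,
                        mtid: Int, model: Array[Byte], offsets: Array[Byte])

trait H2StorageSketch { // the general Storage.open() is reused, so no H2-specific open()
  def storeSegmentGroups(groups: Array[SegmentGroup]): Unit
  def getSegmentGroups(filter: TableFilter): Iterator[SegmentGroup]
}

trait SparkStorageSketch {
  def open(builder: SparkSession.Builder): SparkSession // Spark-specific setup
  def storeSegmentGroups(df: DataFrame): Unit
  def getSegmentGroups(filters: Array[Filter]): DataFrame
}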
As the file-based data store has been added to the current version of ModelarDB, its implementation is documented in detail in Sect. 7.2. The other data stores are described in the papers documenting the previous version [12,13,14].
7.2 File-Based Data Store
The file-based data store is designed to durably store metadata and segment groups directly as files instead of using a DBMS. It was primarily added as H2 required a significant amount of storage for indexes to efficiently retrieve the requested data as described in Sect. 4. Also, the use of RDBMSs and Apache Cassandra limits the optimizations that can be implemented in ModelarDB. For example, both the start time and the end time of the time interval each segment group stores data points for are stored as part of each segment group. Thus, if a data store partitions the segment groups by their time series group id and sorts them by their start time, they will generally also be sorted by their end time and vice versa. However, to our knowledge, neither RDBMSs nor Apache Cassandra can be configured to exploit such a relationship between attributes.
The file-based data store stores metadata about time series, metadata about model types, and the segment group batches passed to storeSegmentGroups() as separate immutable files. To make reading more efficient, the many small files created during ingestion are regularly merged into fewer larger immutable files. It is also designed to make adding support for different file formats (existing and new) simple by splitting the implementation into a super-class that contains shared functionality and sub-classes that contain functionality for each file format. Specifically, FileStorage implements shared functionality such as determining which files to read for a query, periodic merging without deleting files currently used by a query, and durable writes. ParquetStorage and ORCStorage are sub-classes of FileStorage which implement functionality for reading, writing, and merging Apache ORC and Apache Parquet files, respectively. Support for other file types can be added by creating a new class that derives from FileStorage and implements the abstract methods shown in Table 9.
As stated, FileStorage is designed to provide durability for metadata and segment groups that have been successfully written completely to files. Thus, if ModelarDB is terminated while writing a batch of segment groups, the entire batch is lost. Durability is implemented by never modifying existing files, writing new files to a temporary location before renaming them, and using logging when merging multiple files containing segment groups. The log is required when merging to support recovering from abnormal termination while deleting the input files after the new merged file has been written to a temporary location. Merging is currently performed synchronously by the thread writing the tenth batch of segment groups. This very simple merge strategy provides an acceptable trade-off between read and write performance. However, when to merge is determined by the method FileStorage.shouldMerge(), so a more complex strategy can easily be used by changing this method. For example, an alternative strategy could dynamically decide when to merge based on a trade-off between the available resources (CPU, RAM, disk) and the required query performance.
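A sketch of this count-based policy is shown below (the class and counter are illustrative; only the method name shouldMerge() is taken from the text):

// Merge after every tenth batch of segment groups has been written.
class FileStorageSketch {
  private var batchesSinceMerge = 0
  def shouldMerge(): Boolean = {
    batchesSinceMerge += 1
    if (batchesSinceMerge == 10) { batchesSinceMerge = 0; true } else false
  }
}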
The files that are currently used by a query cannot be deleted and are thus purposely not included in merge operations as their data points otherwise would be duplicated. FileStorage tracks which files are currently used by a query using an approach similar to two-phase locking. In addition to retrieving the relevant segment groups, getSegmentGroups() also increments a counter for all of the files being read and creates a Java PhantomReference to the iterator before returning it. Files with a non-zero counter are skipped when a merge is performed. After the query is complete the iterator is no longer phantom reachable [23], thus the PhantomReference is automatically added to a ReferenceQueue. Before performing a merge operation, the writing thread retrieves the PhantomReferences in the ReferenceQueue and decrements the counters for the corresponding files. Thus, the counter for each file used by a query is incremented when the query starts and only decremented when the query is complete. This could also be implemented by requiring the iterators returned by readSegmentGroupsFiles() and readSegmentGroupsFolders() to increment and decrement the counters. However, this would require duplicating code as these iterators are returned by methods implemented by the format-specific sub-classes of FileStorage. It would also require code that is guaranteed to execute no matter how the query was terminated as the file counters must always be decremented. In contrast, by using PhantomReferences the current implementation of FileStorage only requires that sub-classes implement how to read, write and merge files.
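The following Scala sketch shows how such tracking can be implemented with PhantomReference and ReferenceQueue (the class and its methods are illustrative and simpler than FileStorage’s actual implementation): the counters of the files read by a query are incremented when the iterator is created and decremented lazily, before the next merge, once the iterator is no longer phantom reachable:

import java.lang.ref.{PhantomReference, ReferenceQueue}

class QueryFileTracker {
  private val counters = scala.collection.mutable.Map[String, Int]().withDefaultValue(0)
  private val queue = new ReferenceQueue[AnyRef]()
  private val refToFiles = scala.collection.mutable.Map[PhantomReference[AnyRef], Seq[String]]()

  // Called when getSegmentGroups() creates an iterator over a set of files.
  def track(iterator: AnyRef, files: Seq[String]): Unit = synchronized {
    files.foreach(f => counters(f) += 1)
    refToFiles(new PhantomReference[AnyRef](iterator, queue)) = files
  }

  // Called by the writing thread before a merge: release the files of finished
  // queries and return the files that are still in use and must be skipped.
  def filesInUse(): Set[String] = synchronized {
    var ref = queue.poll()
    while (ref != null) {
      refToFiles.remove(ref.asInstanceOf[PhantomReference[AnyRef]])
        .foreach(_.foreach(f => counters(f) -= 1))
      ref = queue.poll()
    }
    counters.filter(_._2 > 0).keySet.toSet
  }
}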
7.3 Supporting Derived Time Series
As stated, ModelarDB can group time series with similar values and compress them as one stream of models [14]. This significantly reduces the amount of storage required compared to compressing each time series separately. However, it requires that the same model can represent the values from all of the time series in a group for every data point. Thus, for the sub-sequences where the values from each time series in a group are very similar, PMC-Mean and Swing can generally be used to provide excellent compression. However, sub-sequences where the values differ more than allowed by the error bound can only be represented by Gorilla due to its use of lossless compression. As a result, the compression ratio for a time series group is highly dependent on the actual values of each time series in the group in addition to their structure. Thus, while ModelarDB tries to mitigate this problem by dynamically splitting and merging time series groups when a significant drop in compression ratio is detected [14], compressing n time series together rarely provides an n-times reduction in the amount of storage required. However, for some time series, the relationship between their values is static, meaning that the values of one time series can be computed exactly from the values of another time series by a function. In that case, only a single base time series needs to be physically stored as the values of all the other time series can be derived from it (i.e., calculated using a function) during query processing.
Support for these derived time series has been added to the current version of ModelarDB. This is implemented by allowing users to specify which time series can be derived from another time series, thus allowing the system to not store the derived time series. This is shown in Fig. 5. First, the direction of an object that turns over time, e.g., a wind turbine nacelle, is measured. This produces the time series shown in red. Two other time series with the cosine and sine, respectively, of the angles in the first time series are also needed for later analysis. These time series are shown in blue and green, respectively. However, as these two time series can be computed directly from the base time series, ModelarDB only stores the base time series (red) and not the two derived time series (blue and green). When ModelarDB receives a query that includes the derived time series, it reads the base time series and applies the user-defined functions \(\cos (value \times \pi / 180\)) and \(\sin (value \times \pi / 180\)), where value is measured in degrees, to create the cosine (blue) and sine (green) time series.
ModelarDB cannot automatically detect that a time series can be derived from another as the time series are being ingested as streams. So, users must explicitly specify all derived time series in ModelarDB’s configuration file by stating the source from which the base time series is ingested, the source which the derived time series will be associated with, and the user-defined function that must be applied to transform the values in the base time series to the values in the derived time series. The source the derived time series will be associated with is required so different dimension members can be assigned to the base time series and the derived time series as described in Sect. 6. The function must be specified as the Scala code to be executed for each value in the base time series. The Scala code is dynamically compiled to Java bytecode at startup to significantly reduce its execution time compared to interpreting the code. The specification of derived time series is purposely only stored in ModelarDB’s configuration file, and not in the data store, to make it easy for users to add, modify, and remove derived time series. Also, as a derived time series is a mapping of values in the base time series to values in the derived time series, the function cannot be used to aggregate values (e.g., to change the sampling interval) or generate more values (e.g., for forecasting). However, these restrictions are only limitations of the current implementation and not the method.
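As a sketch of this step, the user-specified function can be compiled at startup with Scala’s ToolBox (one possible mechanism, not necessarily the one ModelarDB uses; the configuration handling is omitted and the source string matches the cosine example above):

import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

// Compile the Scala source for the derived time series function to bytecode once.
val toolBox = currentMirror.mkToolBox()
val source = "(value: Float) => math.cos(value * math.Pi / 180.0).toFloat"
val derive = toolBox.compile(toolBox.parse(source))().asInstanceOf[Float => Float]

// The compiled function is then applied to every value read from the base
// time series when a query includes the derived time series.
val cosineValues = Seq(0.0f, 90.0f, 180.0f).map(derive) // ≈ 1.0, 0.0, -1.0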
8 Data Transfer
As stated in Sect. 3, the current version of ModelarDB was extended with support for continuously transferring segment groups from the edge nodes to the cloud nodes using Apache Arrow Flight. An example that shows m edge nodes transferring segment groups to n cloud nodes, where \(m \gg n\), is shown in Fig. 6. Usually, each cloud node can receive segments from many edge nodes as the cloud nodes generally have more powerful hardware than the edge nodes. During ingestion, the data points can be queried on each edge node using H2. When an edge node has created a user-defined number of segment groups, they are transferred in a batch to the cloud nodes. After the segment groups have been transferred, the data points they represent can be queried on the cloud nodes using Apache Spark. Even though ModelarDB was designed to transfer segment groups from low-powered edge nodes running H2 to powerful cloud nodes running Apache Spark, it is not limited to this configuration. For example, segment groups can be transferred between two instances configured to use H2 as the query engine and data store. ModelarDB only requires that users specify which instance is the client and which is the server in their configuration files.
The steps required to start transferring segment groups from an edge node to a cluster of cloud nodes are shown in Fig. 7. First, ModelarDB must be deployed on the cloud nodes with at least one Apache Spark Master and at least one Apache Spark Worker. Then, each Apache Spark Streaming Receiver registers its address with the Apache Spark Master. When a new edge node is deployed, it communicates with the Apache Spark Master to determine where to transfer its segment groups. The edge node first requests the time series ids and time series group ids to use from the Apache Spark Master. Then, the edge node requests an endpoint to transfer its segment groups to from the Apache Spark Master. The Apache Spark Master selects the endpoint that is currently receiving the lowest number of data points per minute compressed as segment groups containing metadata and models and returns its address to the edge node. The ModelarDB instance deployed on the edge node then ensures that it and the ModelarDB instance deployed on the cloud nodes use the same model types and refer to them using the same model type ids. The setup process is completed when the edge node transfers the time series metadata to the Apache Spark Master. With the setup process complete, the edge node starts ingesting data points from time series groups, compresses them as described in Sect. 5, and transfers the segment groups to the cloud nodes in batches.
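The two decisions described above can be sketched as follows (the names are illustrative and the actual implementation uses Apache Arrow Flight for the transfer itself): the Apache Spark Master assigns a new edge node to the receiver currently ingesting the fewest data points per minute, and each edge node transfers its segment groups once a user-defined batch size has been accumulated:

// Illustrative receiver metadata tracked by the Apache Spark Master.
case class Receiver(address: String, dataPointsPerMinute: Long)

// Assign a new edge node to the least loaded receiver.
def selectReceiver(receivers: Seq[Receiver]): Receiver =
  receivers.minBy(_.dataPointsPerMinute)

// Buffer segment groups on the edge node and transfer them in batches.
class TransferBuffer[T](batchSize: Int, transfer: Seq[T] => Unit) {
  private var buffer = Vector.empty[T]
  def add(segmentGroup: T): Unit = {
    buffer :+= segmentGroup
    if (buffer.length >= batchSize) { transfer(buffer); buffer = Vector.empty }
  }
}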
9 Lessons Learned
The work that led to ModelarDB began in 2015. Since then existing TSMSs have been analyzed [10, 11], the system has been designed and implemented [12], and later extended in multiple directions [13, 14]. In this section, we present the lessons we learned while developing ModelarDB and from the feedback we have received from collaborators and potential users. While some of these lessons are already described in the literature, they are reiterated for completeness.
Systems Research is Time Intensive and Reusing Components Limits Optimizations: While creating a proof-of-concept implementation using existing components can be done in relatively little time, making a usable and competitive novel system requires a significant time investment. In addition, removing quick workarounds from ModelarDB generally took significantly more time than anticipated as other components in the system inevitably, and sometimes very quickly, started depending on small differences between the workarounds and the proper implementation. Also, while the time required to implement a system can be significantly reduced by reusing existing components, these components are rarely a perfect fit and generally cannot provide the same level of performance as a bespoke component. For example, the observation that segment groups are sorted on both their start time and end time when sorted on one of them could not be exploited when RDBMSs and Apache Cassandra were used for storage as described in Sect. 7. As another example, the use of Apache Spark meant that even the initial version of ModelarDB [12] was scalable as it could process queries in parallel across multiple distributed nodes. However, as Apache Spark Executors by design are black boxes, it is not possible to optimize how they process queries, e.g., a local mutable cache or index cannot be created directly in the Apache Spark Executor. Also while implementing the Data Point View as an Apache Spark Data Source significantly reduced development time, Apache Spark’s stable APIs only provided required columns and the predicates for each query. Thus, it was not possible to automatically rewrite queries to efficiently compute aggregates directly from the segments containing metadata and models. As a result, the Segment View was added as a workaround despite the increase in complexity for users of the system. The use of Apache Spark also meant that ModelarDB would be implemented on top of the JVM, thus reducing both development time and performance compared to native code. For example, the authors of the Java-based data analytics engine MacroBase found that its throughput was on average 12.76x lower than hand-optimized C++ [1]. IBM also replaced Apache Spark with the Db2 BLU column-based query engine [25] in Db2 Event Store as Apache Spark could not handle low latency queries [8].
Modularity Adds Complexity and Limits Optimizations: Early on, it was decided that ModelarDB should use a highly extensible and modular architecture to facilitate experimentation with different model types, query engines, and data stores. While this provided a lot of flexibility, it also significantly increased the complexity and development time of the system. For example, to integrate two types of components, either specialized methods had to be implemented for each combination or a shared intermediate format had to be created. To integrate the query engines and data stores, we chose the first option by implementing both H2Storage and SparkStorage for all data stores, as described in Sect. 7, due to the additional overhead that converting to and from an intermediate representation would add. The use of a modular architecture also meant that significantly more testing was required to ensure all combinations of the different components worked as intended. In addition to increasing complexity and development time, ModelarDB's extensibility also limited which optimizations could be implemented. For example, ModelarDB's API for adding user-defined model types is designed to be as flexible as possible so that users can implement the compression methods and use the definition of error that are most suitable for their domain. Thus, ModelarDB cannot make any assumptions about these model types. However, this flexibility unfortunately proved not to be a benefit for users. From the feedback we received, it is very clear that users generally prefer systems that work well enough “out of the box” over systems they can fine-tune specifically for their use case. Surprisingly, this was even the case for users with a strong background in computer science, e.g., users with a master's or PhD degree. So while ModelarDB's flexibility made it easy to add support for multiple model types, query engines, and data stores, it was not only a benefit: it also added a significant amount of complexity, significantly increased development time, and made some optimizations impossible to implement. Failing to dynamically load dependencies at run-time also caused run-time errors that were incomprehensible to most users, e.g., when deploying ModelarDB on an unsupported version of Apache Spark or when a JDBC driver was missing.
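As a sketch of the first option, the hypothetical traits below show how each data store can implement one engine-specific storage interface per query engine so that no conversion through a shared intermediate representation is needed. The signatures are simplified and do not match ModelarDB's actual H2Storage and SparkStorage traits.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Simplified, hypothetical interfaces inspired by H2Storage and SparkStorage.
trait Storage {
  def open(): Unit
  def close(): Unit
}

trait H2Storage extends Storage {
  // Segment groups are exchanged with H2 as rows.
  def writeSegmentGroups(rows: Iterator[Array[AnyRef]]): Unit
  def readSegmentGroups(): Iterator[Array[AnyRef]]
}

trait SparkStorage extends Storage {
  // Segment groups are exchanged with Apache Spark as DataFrames.
  def writeSegmentGroups(df: DataFrame): Unit
  def readSegmentGroups(spark: SparkSession): DataFrame
}

// A concrete data store implements both traits, so each query engine reads and
// writes segment groups in its native representation without conversions.
class ExampleFileStore extends H2Storage with SparkStorage {
  override def open(): Unit = ()
  override def close(): Unit = ()
  override def writeSegmentGroups(rows: Iterator[Array[AnyRef]]): Unit = ()
  override def readSegmentGroups(): Iterator[Array[AnyRef]] = Iterator.empty
  override def writeSegmentGroups(df: DataFrame): Unit = ()
  override def readSegmentGroups(spark: SparkSession): DataFrame = ???
}
```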
Code Generation Enables Optimizations but Generally Trades Latency for Throughput and Adds Complexity: To improve the performance of ModelarDB, specialized Scala code was generated through both static [12] and dynamic [14] code generation. This enabled additional optimizations, e.g., removing branches from projections by generating a specialized method for each permutation of columns in a schema. However, while improving performance, the use of code generation also had multiple downsides. For static code generation, some of the generated methods had to be arbitrarily split into multiple methods as the Java Virtual Machine Specification requires the code of each method to be less than 64 KiB [18]. In addition, due to the amount of code initially generated through static code generation, e.g., we initially generated a specialized method for each permutation of columns in the Segment View, the time required to compile ModelarDB increased significantly. This strongly discouraged experimentation. To remedy this problem, specialized projection methods are only generated for the most commonly used permutations of columns in the Segment View, e.g., those required for the UDFs and UDAFs described in Sect. 6. If another permutation of columns is requested, a fallback method implemented using branches or dynamic code generation is used. For dynamic code generation, the main drawback was the significant amount of time required at runtime to compile the generated Scala code, as described in Sect. 6. Thus, it proved more efficient not to use dynamic code generation in some cases, e.g., when evaluating queries that only need to process a small amount of data using Apache Spark. To determine when to use dynamic code generation in this situation, ModelarDB uses the number of partitions in the Apache Spark RDD representing the data to be processed as a heuristic. Other techniques have been proposed for solving this problem, such as using multiple query execution methods and only compiling a query if doing so would improve its execution time based on the estimated time remaining, and directly generating x86_64 machine code to reduce compilation time [22]. Dynamic code generation is always used for H2, as the compiled code can be cached much more easily when ModelarDB is running on a single node. Dynamic code generation and compilation also significantly increased the amount of testing required as we had to ensure that the correct code was generated.
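The following sketch illustrates both ideas with hypothetical names and thresholds: a specialized branch-free projection generated for one common permutation of columns, a branching fallback for the remaining permutations, and a partition-count heuristic like the one described above for deciding when the runtime cost of dynamic code generation is likely to pay off.

```scala
import org.apache.spark.rdd.RDD

// A specialized projection generated for one common column permutation,
// e.g., (tid, start_time, end_time), avoids branching on column names per row.
def projectTidStartEnd(row: Array[AnyRef]): Array[AnyRef] =
  Array(row(0), row(1), row(2))

// The generic fallback used for all other permutations branches per column.
def projectFallback(row: Array[AnyRef], columns: Array[String],
                    columnIndexOf: Map[String, Int]): Array[AnyRef] =
  columns.map(name => row(columnIndexOf(name)))

// Hypothetical heuristic: only pay the runtime cost of compiling generated
// Scala code if the query has to process enough partitions.
def shouldUseDynamicCodeGeneration(rdd: RDD[_], minimumPartitions: Int = 16): Boolean =
  rdd.getNumPartitions >= minimumPartitions
```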
Pull-Based Data Ingestion Improves Performance but Increases Complexity: ModelarDB uses pull-based ingestion to read data points from sources such as files or sockets. While this removes the need for an external process that converts from the source representation to a representation ModelarDB supports, thus improving performance and theoretically reducing complexity, it has a number of downsides. For example, it limits the number of data formats ModelarDB can support, as support for a new data format has to be implemented in the system itself. Also, pull-based data ingestion requires either a mechanism for registering new sources with ModelarDB or a mechanism for ModelarDB to automatically detect new sources to pull data points from. This significantly increases the complexity of the system compared to push-based data ingestion, which only requires that ModelarDB provides an endpoint to which clients can push data points in a single format. As a result, the current version of ModelarDB requires that the sources to ingest data points from are listed in its configuration file, and the system must be restarted whenever a new source is added. Similarly, Google changed Monarch's data ingestion from pull-based to push-based as the infrastructure needed for Monarch to discover the entities to pull data points from made the system's architecture more complex and limited its scalability [2].
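As a minimal sketch of the resulting workflow, and assuming a simple line-based configuration format that lists one source per line, the snippet below shows how sources could be read once at startup and then pulled from. The format and names are illustrative and not ModelarDB's actual configuration syntax.

```scala
import scala.io.Source

// Read the sources to pull data points from once at startup. Adding a new
// source requires editing the configuration file and restarting the system.
def readSources(configPath: String): Seq[String] =
  Source.fromFile(configPath).getLines().toSeq
    .map(_.trim)
    .filter(_.startsWith("source "))
    .map(_.stripPrefix("source ").trim)

def ingest(sources: Seq[String]): Unit =
  sources.foreach { source =>
    // Pull data points from the source (e.g., a file or socket) and compress
    // them; sources added after startup are not detected.
    println(s"Ingesting data points from $source")
  }
```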
10 Related Work
As the amount of time series data increases, many TSMSs have been proposed. Surveys of TSMSs developed through academic and industrial research can be found in [10, 11] and a survey of open-source TSMSs can be found in [4].
10.1 Querying Compressed Time Series
Several TSMSs have been proposed that can execute different types of queries directly on compressed time series. FunctionDB [27] supports fitting polynomial functions to data points and evaluating queries directly on the polynomial functions. Plato [15] supports fitting different types of models to data points, including user-defined model types. Plato evaluates queries directly on the models if the necessary functions are implemented; otherwise, queries are evaluated on data points reconstructed from the models. Plato was later extended to provide deterministic error guarantees for some types of queries [19]. Tristan [21] is a TSMS based on the MiSTRAL architecture [20]. It compresses time series using dictionary compression and can execute queries on this compressed representation. Tristan's compression method was later extended to take correlation into account [16]. SummaryStore [3] splits time series into segments, compresses each segment using different compression methods, and then uses the most appropriate representation when answering queries. Over time, the segments are combined, which reduces the amount of storage required but can increase the error of the representations. Compared to FunctionDB [27] and Tristan [16, 20, 21], ModelarDB compresses each time series group using multiple different model types. Also, unlike Plato [15, 19] and SummaryStore [3], ModelarDB can run distributed across multiple nodes.
10.2 Data Transfer
Unlike ModelarDB, most TSMSs are designed for deployment on either the edge or the cloud and do not support transferring ingested data points to the cloud. However, a few exceptions do exist. Respawn [5] is designed to be deployed on both edge nodes and cloud nodes. Queries are submitted to a Dispatcher that redirects clients to the appropriate nodes. To reduce latency, Respawn continuously migrates data points from the edge nodes to the cloud nodes. Storacle [6] was designed for monitoring smart grids and periodically transfers data points to the cloud. The most recent data points are not immediately deleted after being transferred so they can still be accessed on the edge nodes with low latency. Apache IoTDB [28] is designed to be used as an embedded TSMS, a standalone TSMS, or a distributed TSMS depending on the available hardware. Apache IoTDB uses the novel column-based TsFile file format for storage, which is similar to Apache Parquet, and it supports transferring TsFiles using a File Sync component. Compared to Respawn [5], Storacle [6], and Apache IoTDB [28], ModelarDB supports both lossless and lossy compression of time series groups, automatically selects the type of model that provides the best compression ratio for each dynamically sized sub-sequence, and exploits the fact that the time series are stored as models to answer aggregate queries much more efficiently.
11 Conclusion and Future Work
Motivated by the need to efficiently manage high-frequency time series from sensors across the edge and the cloud, we presented the following: (i) an overview of the open-source model-based TSMS ModelarDB, (ii) an analysis of the requirements and limitations of the edge, (iii) an evaluation of existing query engines and data stores for use on the edge, (iv) extensions for ModelarDB to efficiently manage time series on the edge, (v) a file-based data store, (vi) a method for not storing time series that can be derived from base time series, (vii) extensions for ModelarDB to transfer high-frequency time series from the edge to the cloud, and (viii) reflections on the lessons learned while developing ModelarDB.
In future work, we plan to simplify the use of ModelarDB and increase its performance by: (i) replacing pull-based data ingestion with push-based data ingestion; (ii) dynamically combining similar models within the error bound instead of statically grouping time series; (iii) supporting ingestion and querying of irregular multivariate time series; (iv) replacing the Segment View with a query optimizer; (v) developing novel pruning techniques that exploit that the time series are stored as models; and (vi) performing high-level analytics, e.g., similarity search, directly on the models. These items are already under active development as we are rewriting ModelarDB in Rust using Apache Arrow, Apache Arrow DataFusion, Apache Arrow Flight, and Apache Parquet. The rewrite is being done as an open-source project at https://github.com/ModelarData/ModelarDB-RS and the source code is licensed under version 2.0 of the Apache License.
References
Abuzaid, F., et al.: MacroBase: prioritizing attention in fast data. ACM Trans. Database Syst. 43(4), 1–45 (2018). https://doi.org/10.1145/3276463
Adams, C., et al.: Monarch: Google’s planet-scale in-memory time series database. Proc. VLDB Endow. 13(12), 3181–3194 (2020). https://doi.org/10.14778/3181-3194
Agrawal, N., Vulimiri, A.: Low-latency analytics on colossal data streams with SummaryStore. In: Proceedings 26th ACM Symposium on Operating System Principles, pp. 647–664. ACM (2017). https://doi.org/10.1145/3132747.3132758
Bader, A., Kopp, O., Falkenthal, M.: Survey and comparison of open source time series databases. In: Datenbanksysteme für Business, Technologie und Web - Workshopband, pp. 249–268. GI (2017)
Buevich, M., Wright, A., Sargent, R., Rowe, A.: Respawn: a distributed multi-resolution time-series datastore. In: Proceedings of IEEE 34th Real-Time Systems Symposium, pp. 288–297. IEEE (2013). https://doi.org/10.1109/RTSS.2013.36
Cejka, S., Mosshammer, R., Einfalt, A.: Java embedded storage for time series and meta data in smart grids. In: 2015 IEEE International Conference on Smart Grid Communications, pp. 434–439. IEEE (2015). https://doi.org/10.1109/SmartGridComm.2015.7436339
Elmeleegy, H., Elmagarmid, A.K., Cecchet, E., Aref, W.G., Zwaenepoel, W.: Online piece-wise linear approximation of numerical streams with precision guarantees. Proc. VLDB Endow. 2(1), 145–156 (2009). https://doi.org/10.14778/1687627.1687645
Garcia-Arellano, C., et al.: DB2 event store: a purpose-built IoT database engine. Proc. VLDB Endow. 13(12), 3299–3312 (2020). https://doi.org/10.14778/3415478.3415552
Hung, N.Q.V., Jeung, H., Aberer, K.: An evaluation of model-based approaches to sensor data compression. IEEE Trans. Knowl. Data Eng. 25(11), 2434–2447 (2013). https://doi.org/10.1109/TKDE.2012.237
Jensen, S.K., Pedersen, T.B., Thomsen, C.: Time series management systems: a 2022 survey. In: Palpanas, T., Zoumpatianos, K. (eds.) Data Series Management and Analytics. ACM (forthcoming)
Jensen, S.K., Pedersen, T.B., Thomsen, C.: Time series management systems: a survey. IEEE Trans. Knowl. Data Eng. 29(11), 2581–2600 (2017). https://doi.org/10.1109/TKDE.2017.2740932
Jensen, S.K., Pedersen, T.B., Thomsen, C.: ModelarDB: modular model-based time series management with spark and cassandra. Proc. VLDB Endow. 11(11), 1688–1701 (2018). https://doi.org/10.14778/3236187.3236215
Jensen, S.K., Pedersen, T.B., Thomsen, C.: Demonstration of ModelarDB: model-based management of dimensional time series. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 1933–1936. ACM (2019). https://doi.org/10.1145/3299869.3320216
Jensen, S.K., Pedersen, T.B., Thomsen, C.: Scalable model-based management of correlated dimensional time series in ModelarDB+. In: Proceedings of 37th International Conference on Data Engineering, pp. 1380–1391. IEEE (2021). https://doi.org/10.1109/ICDE51399.2021.00123
Katsis, Y., Freund, Y., Papakonstantinou, Y.: Combining databases and signal processing in Plato. In: Proceedings of 7th Biennial Conference on Innovative Data Systems Research, pp. 1–9 (2015)
Khelifati, A., Khayati, M., Cudré-Mauroux, P.: CORAD: correlation-aware compression of massive time series using sparse dictionary coding. In: Proceedings of 2019 IEEE International Conference on Big Data, pp. 2289–2298. IEEE (2019). https://doi.org/10.1109/BigData47090.2019.9005580
Lazaridis, I., Mehrotra, S.: Capturing sensor-generated time series with quality guarantees. In: Proceedings of 19th International Conference on Data Engineering, pp. 429–440. IEEE (2003). https://doi.org/10.1109/ICDE.2003.1260811
Limitations of the Java Virtual Machine. https://docs.oracle.com/javase/specs/jvms/se11/html/jvms-4.html#jvms-4.11
Lin, C., Boursier, E., Papakonstantinou, Y.: Plato: approximate analytics system over compressed time series with tight deterministic error guarantees. Proc. VLDB Endow. 13(7), 1105–1118 (2020). https://doi.org/10.14778/3384345.3384357
Marascu, A., et al.: MiSTRAL: an architecture for low-latency analytics on massive time series. In: Proceedings of 2013 IEEE International Conference on Big Data, pp. 15–21. IEEE (2013). https://doi.org/10.1109/BigData.2013.6691772
Marascu, A., et al.: TRISTAN: real-time analytics on massive time series using sparse dictionary compression. In: Proceedings of 2014 IEEE International Conference on Big Data, pp. 291–300. IEEE (2014). https://doi.org/10.1109/BigData.2014.7004244
Neumann, T.: Evolution of a compiling query engine. Proc. VLDB Endow. 14(12), 3207–3210 (2021). https://doi.org/10.14778/3476311.3476410
Package java.lang.ref. https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/ref/package-summary.html
Pelkonen, T., et al.: Gorilla: a fast, scalable, in-memory time series database. Proc. VLDB Endow. 8(12), 1816–1827 (2015). https://doi.org/10.14778/2824032.2824078
Raman, V., et al.: DB2 with BLU acceleration: so much more than just a column store. Proc. VLDB Endow. 6(11), 1080–1091 (2013). https://doi.org/10.14778/2536222.2536233
Sathe, S., Papaioannou, T.G., Jeung, H., Aberer, K.: A survey of model-based sensor data acquisition and management. In: Aggarwal, C.C. (ed.) Managing and Mining Sensor Data, pp. 9–50. Springer, Boston (2013). https://doi.org/10.1007/978-1-4614-6309-2_2
Thiagarajan, A., Madden, S.: Querying continuous functions in a database system. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 791–804 (2008). https://doi.org/10.1145/1376616.1376696
Wang, C., et al.: Apache IoTDB: time-series database for internet of things. Proc. VLDB Endow. 13(12), 2901–2904 (2020). https://doi.org/10.14778/3415478.3415504
Acknowledgements
This research was supported by the MORE project funded by Horizon 2020 grant number 957345. We also thank our industry partners for providing detailed information about their domain and access to real-life data.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)