Open Source Data Engineering Landscape 2024 by Alireza Sadeghi Feb, 2024 Medium
Introduction
While the widespread hype surrounding Generative AI and ChatGPT took the
tech world by storm, 2023 was yet another exciting and vibrant year in
the data engineering landscape, which has steadily grown more diverse and
sophisticated, with continuous innovation and evolution across all tiers of
the analytical hierarchy.
The MAD Landscape provides a very comprehensive view of all tools and
services for Machine Learning, AI and Data, including both commercial and
open source, while the landscape presented here provides a more
comprehensive view of active open source projects in the Data part of MAD.
Other reports such as Redpoint Open Source Top 25 and Data50 focus more
on SaaS providers and startups, whereas this report focuses on the open
source projects themselves rather than the SaaS services.
Therefore, due to my interest in open source data stacks, I’ve compiled the
open source tools and services in the data engineering ecosystem.
So without further ado, here is the 2024 Open Source Data Engineering
Ecosystem:
• Projects which have been completely inactive on GitHub over the past
year, and are hardly mentioned in the community, are excluded. Notable
examples are the Apache Pig and Apache Oozie projects.
• Projects which are still quite new and have not gained much traction in
terms of GitHub stars and forks, as well as blog posts, showcases and
mentions in the online communities, are excluded. However, some
promising projects such as OneTable, which have gained notable
traction and are built on the foundation of existing technologies,
are mentioned.
• Data Science, ML and AI tools are excluded, except for ML platform and
• Some tools could belong to more than one category. VoltDB is both an in-
memory database and distributed SQL DBMS. But I have tried to place
them in the category by which they are mostly recognised in the market.
• For certain database systems, there may be a blurry line regarding the
category they actually belong to. For example, ByConity claims to be a
data warehousing solution, but is built on top of ClickHouse, which is
recognised as a real-time OLAP engine. Therefore it is still unclear
whether it is a real-time OLAP system (able to support sub-second
queries) or not.
• Not all the listed projects are fully portable open source tools. Some of
the projects are open core rather than open source. In open core models,
not all components of the full system, as offered by the main SaaS
provider, are made open source. Therefore, when deciding to adopt an
open-source tool, it is important to consider how portable and truly open
source the project is.
1. Storage Systems
Storage systems are the largest category in the presented landscape,
primarily due to the recent surge of specialized database systems. The two
latest trending categories are vector and streaming databases. Materialize
and RisingWave are examples of open-source streaming database systems.
Vector databases are also experiencing rapid growth in the storage systems
field. I have placed vector storage systems in the ML Platform section since
they are primarily used in ML and AI stacks. Distributed file systems and
object stores are also placed in their own category, Data Lake Platform.
Source: Gartner
For the storage layer, distributed file systems and object stores are still
the main technologies serving as the bedrock for both on-premise and
cloud-based data lake implementations. While HDFS is still the primary
technology used for on-premise Hadoop clusters, the Apache Ozone distributed
object store is catching up as an alternative on-premise data lake storage
technology. Cloudera, the main commercial Hadoop provider, now offers
Ozone as part of its CDP Private Cloud offering.
Another key trend in 2023 is the decoupling of storage and compute layers.
Many storage systems now offer integration with cloud-based object storage
solutions like S3, leveraging their inherent efficiency and elasticity. This
approach allows data processing resources to scale independently from
storage, leading to cost savings and enhanced scalability. CockroachDB
supporting S3 as storage backend, and Confluent’s offering of long-term
One of the hottest developments in 2023 was the rise of open table formats.
These frameworks essentially act as a table abstraction and virtual data
management layer sitting atop your data lake storage and data layer as
depicted in the following diagram.
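To make the idea of a table abstraction over data lake files more concrete, here is a minimal sketch in Python. It assumes a toy in-memory model in which a table is nothing more than a log of immutable snapshots, each listing the data files that make up one version of the table. The names `Snapshot`, `ToyTable`, `commit` and `files_at`, as well as the `s3://bucket/...` paths, are hypothetical and chosen for illustration only; this is not the metadata layout of any real table format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One immutable table version: the data files that make up the table."""
    version: int
    files: tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical table abstraction: a log of snapshots over plain files."""
    name: str
    snapshots: list[Snapshot] = field(default_factory=list)

    def commit(self, added=(), removed=()):
        """Record a new snapshot; existing data files are never modified."""
        current = self.snapshots[-1].files if self.snapshots else ()
        dropped = set(removed)
        files = tuple(f for f in current if f not in dropped) + tuple(added)
        snapshot = Snapshot(version=len(self.snapshots) + 1, files=files)
        self.snapshots.append(snapshot)
        return snapshot

    def files_at(self, version=None):
        """Files an engine should read, for the latest or an older version."""
        if not self.snapshots:
            return ()
        if version is None:
            return self.snapshots[-1].files
        return self.snapshots[version - 1].files


# Assumed example paths; any object store or file system would do.
table = ToyTable("events")
table.commit(added=["s3://bucket/events/part-0.parquet"])
table.commit(added=["s3://bucket/events/part-1.parquet"])
table.commit(
    added=["s3://bucket/events/part-2.parquet"],
    removed=["s3://bucket/events/part-0.parquet"],
)

print(table.files_at())           # files in the latest version
print(table.files_at(version=1))  # files as of the first commit
```

Because every commit only appends a new snapshot, a processing engine can read a consistent list of files for any version, which is the kind of virtual data management layer described above.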
The open table format space is currently dominated by a fierce battle for
supremacy between the following three major contenders:
The funding received by the leading SaaS providers in this space in 2023 —
Databricks, Tabular, and OneHouse — emphasises market interest and their
potential to further advance data management on data lakes.
Moreover, a new trend is now unfolding with the emergence of unified data
lakehouse layers. OneTable (recently open-sourced by OneHouse) and
UniForm (currently a non-open-source offering from Databricks) are the first
two such projects, both announced last year. These tools go beyond
individual table formats, offering the ability to work with all three major
contenders under a single umbrella. This empowers users to embrace a
universal format while exposing data to processing engines in their
3. Data Integration
The data integration landscape in 2023 witnessed not only continued
dominance from established players like Apache NiFi, Airbyte, and Meltano,
but also the emergence of promising tools like Apache InLong and Apache
SeaTunnel, which offer compelling alternatives with their unique strengths.
Redpanda’s $100 million Series C funding in 2023 shows the growing interest
in alternative message brokers offering low latency and high throughput.
Veteran tools such as Apache Airflow and Dagster are still going strong and
remain widely used engines amid the recent hot debates in the
community on bundling vs unbundling (and rebundling) of
workflow orchestration engines. On the other hand, in the past two years
GitHub has witnessed the rise of several compelling contenders that have
gained significant traction. Kestra, Temporal, Mage, and Windmill are all
worth watching, each offering unique strengths. Whether focusing on
serverless orchestration like Temporal, or distributed task execution like
Mage, these newcomers can cater to the evolving needs of modern data
pipelines.
While the Apache Ambari project, once popular for managing Hadoop
clusters, was practically abandoned after the Hortonworks-Cloudera merger
in 2019, a recent revival sparks some hope for its future. However, its long-
term fate remains uncertain.
7. ML Platform
Machine Learning Platform has been one of the most active categories, with
an unprecedented rise in interest in vector databases, specialised systems
optimised for the storage and retrieval of high-dimensional data. As
highlighted by DB-Engines’ 2023 report, vector databases emerged as the
most popular database category in the past year.
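As a rough illustration of what "storage and retrieval of high-dimensional data" means in practice, the following is a minimal sketch of a brute-force similarity search using cosine similarity. The class `ToyVectorStore`, its methods and the three-dimensional example vectors are all hypothetical; real vector databases typically rely on approximate nearest-neighbour indexes rather than scanning every vector as done here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store: keeps vectors and scans them all on search."""

    def __init__(self):
        self._items = {}

    def add(self, key, vector):
        self._items[key] = list(vector)

    def search(self, query, k=1):
        """Return the keys of the k stored vectors most similar to the query."""
        ranked = sorted(
            self._items.items(),
            key=lambda item: cosine_similarity(query, item[1]),
            reverse=True,
        )
        return [key for key, _ in ranked[:k]]


store = ToyVectorStore()
store.add("doc-a", [1.0, 0.0, 0.0])
store.add("doc-b", [0.0, 1.0, 0.0])
store.add("doc-c", [0.9, 0.1, 0.0])

top = store.search([0.8, 0.2, 0.0], k=2)
print(top)
```

In an ML or AI stack the vectors would usually be embeddings produced by a model, with hundreds or thousands of dimensions instead of three.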
8. Metadata Management
In recent years, metadata management has taken center stage, propelled by
the growing need to govern and improve management and access to data.
However, the lack of comprehensive metadata management platforms
prompted tech giants like Netflix, Lyft, Airbnb, Twitter, LinkedIn and PayPal
to build their own solutions.
Source: Github
Conclusion
This exploration of the open-source data engineering landscape is a glimpse
into the dynamic and vibrant world of data platforms. While prominent tools
and technologies were covered across various categories, the ecosystem
continues to evolve rapidly, with new solutions emerging continuously.
Remember, this is not an exhaustive list, and the “best” tools are ultimately
determined by your specific needs and use cases. Feel free to share any
notable tools I’ve missed that you think should have been included.