The State of Data Engineering 2022

7/11/22, 9:21 AM The State of Data Engineering 2022 - LakeFS
+
lakeFS Cloud is live!
Read the announcement

Back to Blog
Data Engineering
The State of Data Engineering 2022
Einat Orr, PhD. June 20, 2022  
A year has passed since we shared the State of Data Engineering 2021. And since we
released that article last May, not much has changed in the data landscape. In fact, we
had discussions internally about whether we should even do an update for 2022.
We kid.
It was another year worthy of its own prime-time drama, and we’re back to share our
updated, digestible snapshot of it all!
What has changed this year?

Welcome to the lake, it's great to
have you here!
The major theme we saw play out in the past year is consolidation. Companies expanding
their scope to break into new categories, or simply coming into existence with the value-
prop of replacing several existing tools in the data stack.
https://lakefs.io/the-state-of-data-engineering-2022/?utm_source=linkedin&utm_medium=einat&utm_campaign=state-de-22 1/15
Let’s take a look at the updated diagram, this year in a higher resolution!
Click here for a full sized, zoom-able version of the State of Data Engineering 2022 Map.
Table of Contents 
1. Ingestion
2. Data Lakes
3. Metadata Management
3.1. Open Table Formats
3.2. Metastore's Future Is Still in the Dark…
3.3. Git for data
4. Compute Welcome to the lake, it's great to

have you here!
4.1. Distributed Compute
4.2. Analytics Engines
5. Orchestration and Observability
5.1. Orchestration
5.2. Observability
6. Data Science + Analytics Usability
6.1. End-to-end MLOps tools
6.2. Data-centric ML
6.3. ML observability and monitoring
6.4. Notebooks
7. Catalogs, Permissions and Governance
7.1. About lakeFS
Ingestion
This layer includes streaming technologies and SaaS services that provide pipelines from
operational systems to data storage.
The evolution worth mentioning here is the dramatic rise of Airbyte. Founded in 2020 and
pivoting to its current offering only towards the end of that year, Airbyte is an open-source
project used today by over 15,000 companies. The community counts over 600
contributors. It’s rare to see such exponential growth in usage and community.
Airbyte has just launched its commercial offering and expanded to reverse ETL (a
category we don’t cover in our diagram) by acquiring Grouparoo, an open source for
reverse ETL connectors. We believe reverse ETL to be a very different product then ETL,
as it requires integrating the data into the operational systems in a way that benefits the
users’ workflow within that system.
Curious to see how things turn out.

have you here!
Data Lakes
In 2021, we included data warehouses and lakehouses as part of the data lake layer. But
this year, we decided to keep the data lake category solely for the object storage
technologies used as data lakes. We moved all warehouses and lakehouses to the
analytics engines category.
Why? Most of the architectures data engineers deal with today are complex enough to
include both object storage and an analytics engine. So you either only need an analytics
database (in which case you don’t have a data lake but a data warehouse that serves as
the analytics engine) or you need both. And when you need both, you usually perform
some of the analysis over the object storage and some over the analytics engine. That’s
why they need to work very well together.
This dependency occurs at different layers. Large data sets would be managed in your
object storage while artifacts and serving layer datasets would be stored within analytics
engines and databases. The idea that one of them would conquer the other is something
that we’re not seeing in architectures around us.
What we see in reality is these solutions co-exist. There are several reasons behind this
architecture and one of them is definitely cost considerations. Querying a massive volume
of data in Snowflake or BigQuery is expensive. So instead of having analytics databases
managing your entire lake, you manage everything you can in object storage with cheaper
compute over it and leave all the must-haves for the analytics engines.
We consider a lakehouse to be an analytics engine (although in Databricks it includes both

the data lake and analytics engine). This architecture features an optimized version of
Spark SQL to create an analytics engine over the Delta table format. This delivers the
increased performance and costs expected from an analytics engine.
The same rule applies to Dremio over Iceberg, or to Snowflake supporting Iceberg as
external tables to its database.
Metadata Management Welcome to the lake, it's great to

have you here!
A lot of things are happening in the metadata space! The two layers of metadata, this one
and the organizational one at the top of our chart, are becoming a focus in many
organizations.
Reviewing the evolution of our challenges as scalable data practitioners, we have spent
the past decade innovating around storage and computers – all to make sure they
support the scale of data.
Today, we face mostly manageability problems and can solve them by producing and
managing metadata. This layer includes different aspects of metadata, let’s go through
them.
Open Table Formats
We’ve seen interesting advancements in open table formats in the past year. They’re
becoming a standard for holding structured data in a data lake.
A year ago, Delta Lake was a Databricks project with an actual commercial product called
Delta. Then this year, we have Apache Hudi commercialized by Onehouse and Apache
Iceberg commercialized by Tabular. Both companies were founded by the creators of
these open-source projects.
So, the entire space went from being open-source to fully backed up by commercial
entities. This brings a question mark to how much influence other players would have on
the open-source project now that there’s a commercial interest behind it.
Since all three open-source projects are part of the apache/Linux Foundation, the risk to
the community is low. This doesn’t seem to calm the passionate dispute between the
creators and fans of all these three projects over who is “really” open source and who has
the best solution. Netflix will soon jump on this story as excellent material for a drama
show. 🙂
Metastore's Future Is Still in the Dark…
We’re seeing Hive Metastore being pulled out of architectures Welcome

where it’stopossible to great to
the lake, it's
have you here!
replace it with open table formats. Not all organizations are making full use of Metastore
capabilities, and if their only use case was virtualizing tables then open table formats and
the commercial offerings around them provide a good option. Other use cases for
Metastore have yet to receive a better alternative solution.
Git for data
The concept of Git for data is taking hold in the community. dbt is encouraging analysts to
use best practices over different versions of data (dev, stage and production), though
does not support the creation and maintenance of those data sets in data lakes.
There is a growing demand from DataOps teams to provide cross organizational data
version control to manage data sets that have different revisions over time. A few
examples for different revisions to a data set are: recalculation, necessary for algorithms
and ML models, or of backfills from operational systems as often happens in BI, or
deletion of a subset due to regulations such as the right to be forgotten under GDPR.
This trend is clear from the dramatic growth in the adoption of lakeFS and its community,
which I’ve witnessed first-hand. lakeFS serves both structured and unstructured data
operations, shining in cases where both exist.
Unfortunately public data on usage for Dremio’s Project Nessie is hard to come by.
Interestingly, it is also available as a free service named Arctic. This is likely a strategic
decision made to compete with Tabular.
Compute
We applied some changes to the compute layer this year to reflect the ecosystem better.
First, we removed virtualization as a category altogether. It doesn’t seem to be catching
on at the moment.
We then split the compute engines into two categories: distributed compute and analytics
engines. The main difference between them is how opinionated these tools are regarding
their storage layer.

While distributed compute includes compute engines that aren’t opinionated about
have you here!
storage, the analytics engines category contains compute engines (distributed or not) that
are opinionated.
Distributed Compute
The general distributed compute engines allow engineers to distribute anything that is
SQL or any other code. Sure, they might be opinionated about the programming language,
but they will carry out a general distribution over (usually) federated data. This could be
data stored across many formats and sources.
The two interesting additions to the distributed compute category are Ray and Dask.
Ray is an open-source project that allows engineers to scale any compute-intensive

Python workload, used mainly for machine learning. Dask is also a distributed Python
engine based on Pandas.
You might have thought that Spark would be the distributed engine ruling the scene with
no competition in sight. So, it’s pretty exciting to witness the rise of new technologies
gaining traction in this category.
Analytics Engines
The analytics engines category includes all data warehouses such as Snowflake,
BigQuery, Redshift, Firebolt, and the good old PostgreSQL. It also contains lakehouses like
Databricks lakehouse, Dremio, or Apache Pinot. All are very opinionated about the data
format they support to provide better performance to their querying engines.
Since all analytics engines use the data lake as their deep storage or storage, it’s worth
mentioning that Snowflake now supports Apache Iceberg as one of the external table
formats that can be read by Snowflake directly from the lake.
Orchestration and Observability

This is a new layer populated by existing categories. Orchestration tools
Welcome towere partit'sofgreat
the lake, the to
metadata layer last year, but we moved it over to the compute have youwhere
engine here! it really
belongs – it’s about orchestrating pipelines across compute engines and data sources.
Together with orchestration, we get observability tools that also grew a lot during 2021 to
support more data sources resulting in any compute engine.
Orchestration
Did anything eye-catching happen around orchestration? Airflow is still the biggest open-
source product. Astronomer has backed it for a few years already and since the company
jumped on the cloud bandwagon, it’s now directly competing with cloud providers on the
managed Airflow front.
Another very interesting move by Astronomer is the acquisition of Datakin which provides
data lineage. This makes one wonder – what happens when an orchestration tool has
lineage capabilities?
This, in theory, could help data teams build safer, more resilient pipelines. By knowing
which datasets depend on missing, corrupt or low quality data, it makes impact analysis
considerably easier by tying together the logic (managed by orchestration tools) and their
output (managed in lineage tools). Whether this materializes into an integral part of the
orchestration ecosystem is to be seen.
Observability
The observability category, established by Monte Carlo data, is also led by it. Monte
Carlo’s frequent fund raising is an indication of its products’ rapid adoption in the market.
The product keeps evolving, offering more integrations, for example, the databricks
ecosystem, and additional observability and root cause analysis features. It is this
success that is probably driving the growth of the category, at least from the perspective
of the number of companies exploring this front today. Several companies were founded
or went out of stealth in 2021, most interesting is Elementary, another open source
project out of YC
have you here!
Data Science + Analytics Usability

This layer is meant for the users of data architectures created through the former layers:
data scientists and analysts who unlock insights from data. We split this category into
three subcategories:
End-to-end MLOps tools,

Tools based on the data-centric ML approach,
ML observability and monitoring.
End-to-end MLOps tools
When I set out to review this space, someone told me I should name the category:
“Prepare to be disappointed.” While the category includes great tools, none of them is
really end-to-end. They offer a good solution to certain steps in the ML process but lack at
their offering for other aspects of the ML pipeline.
Still, the approach of offering an end-to-end solution was popular in 2021. Several
mission-specific tools entered the path to become an end-to-end offering. Examples
include Comet, Weights & Biases, Clear.ml and Iguazio.
Data-centric ML
This subcategory doesn’t escape the end-to-end trap either, but the tools listed in it take
on a different approach to the functionality they provide. They put the data itself and its
management at the center of their missions.
The two new joiners to this space are Activeloop and Graviti. They were built by
experienced data practitioners who understand the criticality of data management to the
success of any data operation. We share this sentiment at lakeFS.
A unique approach is taken by DagsHub, which provides an E2E solution that is data-
centric but bases its offering on open source solutions. They excel at each of the ML
lifecycle phases, providing their unique usability and ease of integration
Welcome totouch. In it's
the lake, such a to
great
have solution
confusing space, this is a solid approach to get a good end-to-end you here!that also
delights users. We’re watching the company with anticipation….
ML observability and monitoring

This subcategory includes tools focusing on the monitoring and observability of model
quality. Very much like the data observability category, this space is growing and gaining
momentum. In early 2022 Deepchecks went open source, and had quickly gained traction,
contributors, and partners.
Notebooks
In the category of notebooks, we see Hex getting more traction and validation by
investment from both Databricks and Snowflake. Hex offers more orchestration
capabilities within its notebook. The same is true for Ploomber which emerged to offer
orchestration capabilities to Jupyter.
The category of tools meant for analysts established itself over the past year and gained
some competition. dbt proved itself as a standard for analysts and, in 2021, it released
integrations with the scalable data engineering stack, including object storage, HMS, and
the Databricks offering. While collaborating with the ecosystem, in 2021 Databricks
launched the GA of its ‘live tables’ product, which is in direct competition with dbt.
Catalogs, Permissions and Governance

My sense of the ecosystem is that the need for a data catalog is now clear to companies
of all sizes, and we will see it become a standard. Commercial offerings based on open-
source projects show good levels of adoption.
Meanwhile, long-time providers for enterprise solutions, Alation and Collibra, continue to
expand their offerings. And the security and permissions supplier BigID is attempting to
offer a catalog as well.
Immuta stays persistent in its focus on data access control, and its unique technology is
now compatible with additional data sources. To accelerate itsWelcome
growth,to thecompany
the lake, it's great to
have you here!
raised $90 million in series D funding back in mid-2021.
Conclusion
While the number of companies in the space keeps growing, we can see signs of
consolidation in product offerings across several of these categories.
MLOps is trending towards end-to-end, notebooks are going into orchestration, and
orchestration is turning towards lineage and observability. At the same time, we see open
table formats going into the Metastore functionality. And in the governance layer, security
and permission tools go into catalogs and the other way around…
Are those signs of a (still) limited market, maybe in MLOps? Does this consolidation
reflect the need for differentiation due to fierce competition (which may be the case in
orchestration)? Or is it an opportunity to fill a void that may be the case with open table
formats or data-centric ML? Maybe it’s all because users want to use fewer tools that do
more?
I’m going to leave these questions open in hope to initiate a discussion about the state of
data engineering in 2022.
What other projects are gaining steam in 2022? Which tools are on their way to becoming
de facto standards in the industry? Share your thoughts in the comments section!
About lakeFS
The lakeFS project is an open source technology that provides a git-like version control
interface for data lakes, with seamless integration to popular data tools and
frameworks.
Our mission is to maximize the manageability of open source data analytics solutions
that scale.
To learn more... Welcome to the lake, it's great to

have you here!
Read Related Articles.
Proudly announcing lakeFS Cloud

What is lakeFS? As data practitioners, we use many different terms to talk about what we do – we call
it business intelligence, analytics, data
Read More »
7 Winning Habits of Effective Data Engineers Welcome to the lake, it's great to
As organizations develop new product offerings and data streams, datahave
engineers deal with the largest
you here!
and most complex datasets ever. Add growing teams and new
Read More »
LakeFS Introduces a New Approach to Data Manageability and Reliability

lakeFS Cloud provides a Git-like repository for data lakes in a hosted version available in AWS
Marketplace NEW YORK and TEL AVIV, June 22, 2022–lakeFS,
Read More »
Never miss a post

Your Email
I agree to receive other

communications from lakeFS.
Subscribe

have you here!
Talk to a lakeFS engineer
Email* Let's talk
Cloud
Docs
Blog
Community
Careers
Contact Us

have you here!
© Copyright 2022 | All Rights Reserved | Privacy Policy | Terms of Use
Design by hello.

The State of Data Engineering 2022 - LakeFS

Uploaded by

Copyright:

Available Formats

The State of Data Engineering 2022 - LakeFS

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

The State of Data Engineering 2022 - LakeFS

Uploaded by

Copyright:

Available Formats

7/11/22, 9:21 AM The State of Data Engineering 2022 - LakeFS

Einat Orr, PhD. June 20, 2022  

What has changed this year?

3.1. Open Table Formats

3.2. Metastore's Future Is Still in the Dark…

3.3. Git for data

4. Compute Welcome to the lake, it's great to

4.2. Analytics Engines

5. Orchestration and Observability

6. Data Science + Analytics Usability​

6.1. End-to-end MLOps tools

6.3. ML observability and monitoring

7. Catalogs, Permissions and Governance

7.1. About lakeFS

Curious to see how things turn out.

We consider a lakehouse to be an analytics engine (although in Databricks it includes both

Metadata Management Welcome to the lake, it's great to

Open Table Formats

Metastore's Future Is Still in the Dark…

We’re seeing Hive Metastore being pulled out of architectures Welcome

Git for data

Welcome to the lake, it's great to

Ray is an open-source project that allows engineers to scale any compute-intensive

Orchestration and Observability

Data Science + Analytics Usability​

End-to-end MLOps tools,

End-to-end MLOps tools

delights users. We’re watching the company with anticipation….

ML observability and monitoring

Catalogs, Permissions and Governance

To learn more... Welcome to the lake, it's great to

Read Related Articles.

Proudly announcing lakeFS Cloud

LakeFS Introduces a New Approach to Data Manageability and Reliability

Never miss a post

I agree to receive other

Welcome to the lake, it's great to

Talk to a lakeFS engineer

Email* Let's talk

Welcome to the lake, it's great to

You might also like

6. Data Science + Analytics Usability

Data Science + Analytics Usability