Building a Data Lake with Amazon Web Services
Data Lakes For Maximum Flexibility
July 2017
© 2017, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Notices
This document is provided for informational purposes only. It represents AWS’s
current product offerings and practices as of the date of issue of this document,
which are subject to change without notice. Customers are responsible for
making their own independent assessment of the information in this document
and any use of AWS’s products or services, each of which is provided “as is”
without warranty of any kind, whether express or implied. This document does
not create any warranties, representations, contractual commitments,
conditions or assurances from AWS, its affiliates, suppliers or licensors. The
responsibilities and liabilities of AWS to its customers are controlled by AWS
agreements, and this document is not part of, nor does it modify, any agreement
between AWS and its customers.
Contents
Introduction
Amazon S3 as the Data Lake Storage Platform
Data Ingestion Methods
    Amazon Kinesis Firehose
    AWS Snowball
    AWS Storage Gateway
Data Cataloging
    Comprehensive Data Catalog
    HCatalog with AWS Glue
Securing, Protecting, and Managing Data
    Access Policy Options and AWS IAM
    Data Encryption with Amazon S3 and AWS KMS
    Protecting Data with Amazon S3
    Managing Data with Object Tagging
Monitoring and Optimizing the Data Lake Environment
    Data Lake Monitoring
    Data Lake Optimization
Transforming Data Assets
In-Place Querying
    Amazon Athena
    Amazon Redshift Spectrum
The Broader Analytics Portfolio
    Amazon EMR
    Amazon Machine Learning
    Amazon QuickSight
    Amazon Rekognition
Future Proofing the Data Lake
Contributors
Document Revisions
Abstract
Organizations are collecting and analyzing increasing amounts of data, making
it difficult for traditional on-premises solutions for data storage, data
management, and analytics to keep pace. Amazon S3 and Amazon Glacier
provide an ideal storage solution for data lakes. They offer broad and deep
integration with traditional big data analytics tools, as well as innovative
query-in-place analytics tools that help you eliminate costly and complex
extract, transform, and load (ETL) processes. This guide explains each of
these options and provides best practices for building your Amazon S3-based
data lake.
Introduction
As organizations are collecting and analyzing increasing amounts of data,
traditional on-premises solutions for data storage, data management, and
analytics can no longer keep pace. Data silos that aren’t built to work well
together make storage consolidation for more comprehensive and efficient
analytics difficult. This, in turn, limits an organization’s agility, ability to derive
more insights and value from its data, and capability to seamlessly adopt more
sophisticated analytics tools and processes as its skills and needs evolve.
A data lake, which is a single platform combining storage, data governance, and
analytics, is designed to address these challenges. It’s a centralized, secure, and
durable cloud-based storage platform that allows you to ingest and store
structured and unstructured data, and transform these raw data assets as
needed. You don’t need an innovation-limiting pre-defined schema. You can use
a complete portfolio of data exploration, reporting, analytics, machine learning,
and visualization tools on the data. A data lake makes data and the optimal
analytics tools available to more users, across more lines of business, allowing
them to get all of the business insights they need, whenever they need them.
Until recently, the data lake had been more concept than reality. However,
Amazon Web Services (AWS) has developed a data lake architecture that allows
you to build data lake solutions cost-effectively using Amazon Simple Storage
Service (Amazon S3) and other services.
Using the capabilities of an Amazon S3-based data lake architecture, you can do
the following:
• Ingest and store data from a wide variety of sources into a centralized platform.
• Build a comprehensive data catalog to find and use data assets stored in the data lake.
• Secure, protect, and manage all of the data stored in the data lake.
• Use a broad and deep portfolio of data analytics, data science, machine learning, and visualization tools.
The remainder of this paper provides more information about each of these
capabilities. Figure 1 illustrates a sample AWS data lake platform.
Amazon Kinesis Firehose
Kinesis Firehose can compress data before it’s stored in Amazon S3. It currently
supports GZIP, ZIP, and SNAPPY compression formats. GZIP is the preferred
format because it can be used by Amazon Athena, Amazon EMR, and Amazon
Redshift. Kinesis Firehose encryption supports Amazon S3 server-side
encryption with AWS Key Management Service (AWS KMS) for encrypting
delivered data in Amazon S3. You can choose not to encrypt the data or to
encrypt with a key from the list of AWS KMS keys that you own (see the section
Encryption with AWS KMS). Kinesis Firehose can concatenate multiple
incoming records, and then deliver them to Amazon S3 as a single S3 object.
This is an important capability because it reduces Amazon S3 transaction costs
and the transactions-per-second load.
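As a minimal sketch of this configuration, the following Python (boto3) call creates a Firehose delivery stream that buffers and concatenates records, compresses them with GZIP, and encrypts the delivered data with a KMS key. The stream name, bucket, role, and key ARNs are hypothetical placeholders.

    import boto3

    firehose = boto3.client('firehose')
    firehose.create_delivery_stream(
        DeliveryStreamName='data-lake-ingest',  # hypothetical name
        ExtendedS3DestinationConfiguration={
            'RoleARN': 'arn:aws:iam::111122223333:role/firehose-delivery-role',
            'BucketARN': 'arn:aws:s3:::my-data-lake',  # hypothetical bucket
            'Prefix': 'raw/firehose/',
            # Concatenate records until 128 MB or 300 seconds, whichever comes first
            'BufferingHints': {'SizeInMBs': 128, 'IntervalInSeconds': 300},
            'CompressionFormat': 'GZIP',  # usable by Athena, EMR, and Redshift
            'EncryptionConfiguration': {
                'KMSEncryptionConfig': {
                    'AWSKMSKeyARN': 'arn:aws:kms:us-east-1:111122223333:key/EXAMPLE'
                }
            }
        }
    )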
AWS Snowball
You can use AWS Snowball to securely and efficiently migrate bulk data from
on-premises storage platforms and Hadoop clusters to S3 buckets. After you
create a job in the AWS Management Console, a Snowball appliance will be
automatically shipped to you. After a Snowball arrives, connect it to your local
network, install the Snowball client on your on-premises data source, and then
use the Snowball client to select and transfer the file directories to the Snowball
device. The Snowball client uses AES-256 encryption. Encryption keys are
never shipped with the Snowball device, so the data transfer process is highly
secure. After the data transfer is complete, the Snowball’s E Ink shipping label
will automatically update. Ship the device back to AWS. Upon receipt at AWS,
your data is then transferred from the Snowball device to your S3 bucket and
stored as S3 objects in their original/native format. Snowball also has an HDFS
client, so data may be migrated directly from Hadoop clusters into an S3 bucket
in its native format.
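The console workflow described above can also be scripted. The sketch below creates a Snowball import job with boto3; AWS then ships the appliance to the address on file. The address ID, role ARN, and bucket are hypothetical placeholders.

    import boto3

    snowball = boto3.client('snowball')
    response = snowball.create_job(
        JobType='IMPORT',  # move data from on-premises storage into S3
        Resources={'S3Resources': [
            {'BucketArn': 'arn:aws:s3:::my-data-lake'}  # hypothetical bucket
        ]},
        AddressId='ADID1234ab-1234-abcd-1234-1234abcd1234',  # hypothetical address
        RoleARN='arn:aws:iam::111122223333:role/snowball-import-role',
        SnowballCapacityPreference='T80',
        ShippingOption='SECOND_DAY',
        Description='Bulk migration of on-premises data into the data lake'
    )
    print(response['JobId'])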
AWS Storage Gateway
AWS Storage Gateway can be used to integrate legacy on-premises data
processing platforms with an Amazon S3-based data lake. The File Gateway
configuration of Storage Gateway offers on-premises devices and applications a
network file share; files written to this share are stored as objects in Amazon S3
in their original format, without any proprietary modification. This means that
you can easily integrate applications and platforms that don’t have native
Amazon S3 capabilities—such as on-premises lab equipment, mainframe
computers, databases, and data warehouses—with S3 buckets, and then use
tools such as Amazon EMR or Amazon Athena to process this data.
Data Cataloging
The earliest challenges that inhibited building a data lake were keeping track of
all of the raw assets as they were loaded into the data lake, and then tracking all
of the new data assets and versions that were created by data transformation,
data processing, and analytics. Thus, an essential component of an Amazon S3-
based data lake is the data catalog. The data catalog provides a queryable
interface of all assets stored in the data lake’s S3 buckets. The data catalog is
designed to provide a single source of truth about the contents of the data lake.
There are two general forms of a data catalog: a comprehensive data catalog that
contains information about all assets that have been ingested into the S3 data
lake, and a Hive Metastore Catalog (HCatalog) that contains information about
data assets that have been transformed into formats and table definitions that
are usable by analytics tools like Amazon Athena, Amazon Redshift, Amazon
Redshift Spectrum, and Amazon EMR. The two catalogs are not mutually
exclusive and both may exist. The comprehensive data catalog can be used to
search for all assets in the data lake, and the HCatalog can be used to discover
and query data assets in the data lake.
Comprehensive Data Catalog
The comprehensive data catalog can be created by using standard AWS services
like AWS Lambda, Amazon DynamoDB, and Amazon Elasticsearch Service
(Amazon ES). At a high level, Lambda triggers are used to populate DynamoDB
tables with object names and metadata when those objects are put into Amazon
S3; then Amazon ES is used to search for specific assets, related metadata, and
data classifications. Figure 3 shows a high-level architectural overview of this
solution.
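A minimal sketch of the Lambda piece of this pattern is shown below: an S3 PUT event notification triggers the function, which records the object’s name and metadata in a DynamoDB table. The table name is a hypothetical placeholder.

    import boto3

    catalog = boto3.resource('dynamodb').Table('data-lake-catalog')  # hypothetical table

    def handler(event, context):
        # Invoked by an S3 PUT event notification on the data lake bucket
        for record in event['Records']:
            catalog.put_item(Item={
                'objectKey': record['s3']['object']['key'],   # partition key
                'bucket': record['s3']['bucket']['name'],
                'sizeBytes': record['s3']['object'].get('size', 0),
                'eventTime': record['eventTime'],
            })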
Securing, Protecting, and Managing Data
Securing your data lake begins with implementing very fine-grained controls
that allow authorized users to see, access, process, and modify particular assets,
while ensuring that unauthorized users are blocked from taking any actions that
would compromise data confidentiality and security. A complicating factor is
that access roles may evolve over various stages of a data asset’s processing and
lifecycle. Fortunately, Amazon has a comprehensive and well-integrated set of
security features to secure an Amazon S3-based data lake.
Access Policy Options and AWS IAM
For most data lake environments, we recommend using user policies, so that
permissions to access data assets can also be tied to user roles and permissions
for the data processing and analytics services and tools that your data lake users
will use. User policies are associated with the AWS Identity and Access
Management (IAM) service, which allows you to securely control access to AWS
services and resources. With IAM, you can create IAM users, groups, and roles
in accounts and then attach access policies to them that grant access to AWS
resources, including Amazon S3. The model for user policies is shown in Figure
5. For more details on securing Amazon S3 with user policies and AWS IAM,
see the Amazon Simple Storage Service Developer Guide and the AWS Identity
and Access Management User Guide.
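As an illustration, the following sketch attaches an inline user policy granting read-only access to a single prefix of a hypothetical data lake bucket; real deployments would typically use groups or roles and managed policies instead. The user and policy names are placeholders.

    import json
    import boto3

    policy = {
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Action': ['s3:GetObject', 's3:ListBucket'],
            'Resource': [
                'arn:aws:s3:::my-data-lake',           # hypothetical bucket
                'arn:aws:s3:::my-data-lake/curated/*'  # read-only prefix
            ]
        }]
    }

    iam = boto3.client('iam')
    iam.put_user_policy(
        UserName='analyst-jane',  # hypothetical user
        PolicyName='DataLakeCuratedReadOnly',
        PolicyDocument=json.dumps(policy)
    )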
Data Encryption with Amazon S3 and AWS KMS
Data lakes built on AWS primarily use two types of encryption: Server-side
encryption (SSE) and client-side encryption. SSE provides data-at-rest
encryption for data written to Amazon S3. With SSE, Amazon S3 encrypts user
data assets at the object level, stores the encrypted objects, and then decrypts
them as they are accessed and retrieved. With client-side encryption, data
objects are encrypted before they are written to Amazon S3. For example, a data
lake user could specify client-side encryption before transferring data assets
into Amazon S3 from the Internet, or could specify that services like Amazon
EMR, Amazon Athena, or Amazon Redshift use client-side encryption with
Amazon S3. SSE and client-side encryption can be combined for the highest
levels of protection. Given the intricacies of coordinating encryption key
management in a complex environment like a data lake, we strongly
recommend using AWS KMS to coordinate keys across client- and server-side
encryption and across multiple data processing and analytics services.
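For example, here is a minimal sketch of writing an object with SSE-KMS, using a hypothetical bucket and key ID; Amazon S3 encrypts the object at rest and decrypts it transparently on authorized retrievals.

    import boto3

    s3 = boto3.client('s3')
    s3.put_object(
        Bucket='my-data-lake',               # hypothetical bucket
        Key='raw/events/2017-07-01.json.gz',
        Body=open('2017-07-01.json.gz', 'rb'),
        ServerSideEncryption='aws:kms',      # SSE with an AWS KMS-managed key
        SSEKMSKeyId='1234abcd-12ab-34cd-56ef-1234567890ab'  # hypothetical key ID
    )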
For even greater levels of data lake data protection, other services like Amazon
API Gateway, Amazon Cognito, and IAM can be combined to create a “shopping
cart” model for users to check in and check out data lake data assets. This
architecture has been created for the Amazon S3-based data lake solution
reference architecture, which can be found, downloaded, and deployed at
https://aws.amazon.com/answers/big-data/data-lake-solution/
Protecting Data with Amazon S3
Data protection rests on the inherent durability of the storage platform used.
Durability is defined as the ability to protect data assets against corruption and
loss. Amazon S3 provides 99.999999999% data durability, which is 4 to 6
orders of magnitude greater than that which most on-premises, single-site
storage platforms can provide. Put another way, the durability of Amazon S3 is
designed so that if you store 10,000,000 data assets, you can on average expect
to incur a loss of a single asset once every 10,000 years.
Beyond core data protection, another key element is to protect data assets
against unintentional and malicious deletion and corruption, whether through
users accidentally deleting data assets, applications inadvertently deleting or
corrupting data, or rogue actors trying to tamper with data. This becomes
especially important in a large multi-tenant data lake, which will have a large
number of users, many applications, and constant ad hoc data processing and
application development. Amazon S3 provides versioning to protect data assets
against these scenarios. When enabled, Amazon S3 versioning will keep
multiple copies of a data asset. When an asset is updated, prior versions of the
asset will be retained and can be retrieved at any time. If an asset is deleted, the
last version of it can be retrieved. Data asset versioning can be managed by
policies, to automate management at large scale, and can be combined with
other Amazon S3 capabilities such as lifecycle management for long-term
retention of versions on lower cost storage tiers such as Amazon Glacier, and
Multi-Factor Authentication (MFA) Delete, which requires a second layer of
authentication—typically via an approved external authentication device—to
delete data asset versions.
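Enabling versioning is a one-time bucket-level operation; a minimal sketch with a hypothetical bucket follows. Enabling MFA Delete additionally requires the bucket owner’s root credentials and an MFA device serial and token.

    import boto3

    s3 = boto3.client('s3')
    s3.put_bucket_versioning(
        Bucket='my-data-lake',  # hypothetical bucket
        VersioningConfiguration={'Status': 'Enabled'}
        # To also require MFA for version deletion, pass
        # VersioningConfiguration={'Status': 'Enabled', 'MFADelete': 'Enabled'}
        # together with MFA='<device-serial> <token>' using root credentials.
    )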
Managing Data with Object Tagging
Amazon S3 object tagging lets you categorize and manage data assets by
attaching up to 10 key-value tags to each S3 object, for example to classify
assets by sensitivity or source.
In addition to being used for data classification, object tagging offers other
important capabilities. Object tags can be used in conjunction with IAM to
enable fine-grained control of access permissions. For example, a particular data
lake user can be granted permissions to only read objects with specific tags.
Object tags can also be used to manage Amazon S3 data lifecycle policies, which
are discussed in the next section of this whitepaper. A data lifecycle policy can
contain tag-based filters. Finally, object tags can be combined with Amazon
CloudWatch metrics and AWS CloudTrail logs—also discussed in the next
section of this paper—to display monitoring and action audit data by specific
data asset tag filters.
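A minimal tagging sketch is shown below, applying classification tags to an existing object in a hypothetical bucket; the same tags can then be referenced by IAM policy conditions and lifecycle rule filters.

    import boto3

    s3 = boto3.client('s3')
    s3.put_object_tagging(
        Bucket='my-data-lake',  # hypothetical bucket
        Key='raw/customers/2017-07-01.csv',
        Tagging={'TagSet': [
            {'Key': 'classification', 'Value': 'pii'},
            {'Key': 'source', 'Value': 'crm-export'},
        ]}
    )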
Monitoring and Optimizing the Data Lake Environment
Data Lake Monitoring
Amazon CloudWatch
As an administrator, you need to look at the complete data lake environment
holistically. This can be achieved using Amazon CloudWatch. CloudWatch is a
monitoring service for AWS Cloud resources and the applications that run on
AWS. You can use CloudWatch to collect and track metrics, collect and monitor
log files, set thresholds, and trigger alarms. This allows you to automatically
react to changes in your AWS resources.
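For instance, a sketch like the following sets an alarm on the daily S3 object-count metric for a hypothetical bucket and notifies an assumed SNS topic when the bucket grows past a threshold.

    import boto3

    cloudwatch = boto3.client('cloudwatch')
    cloudwatch.put_metric_alarm(
        AlarmName='data-lake-object-count',
        Namespace='AWS/S3',
        MetricName='NumberOfObjects',  # reported daily per bucket
        Dimensions=[
            {'Name': 'BucketName', 'Value': 'my-data-lake'},      # hypothetical
            {'Name': 'StorageType', 'Value': 'AllStorageTypes'},
        ],
        Statistic='Average',
        Period=86400,               # one day
        EvaluationPeriods=1,
        Threshold=10000000,
        ComparisonOperator='GreaterThanThreshold',
        AlarmActions=['arn:aws:sns:us-east-1:111122223333:data-lake-alerts']
    )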
AWS CloudTrail
An operational data lake has many users and multiple administrators, and may
be subject to compliance and audit requirements, so it’s important to have a
complete audit trail of actions taken and who has performed these actions. AWS
CloudTrail is an AWS service that enables governance, compliance, operational
auditing, and risk auditing of AWS accounts.
CloudTrail continuously monitors and retains events related to API calls across
the AWS services that make up a data lake. CloudTrail provides a history of
AWS API calls for an account, including API calls made through the AWS
Management Console, AWS SDKs, command line tools, and most Amazon S3-
based data lake services. You can identify which users and accounts made
requests or took actions against AWS services that support CloudTrail, the
source IP address the actions were made from, and when the actions occurred.
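Management events are logged by default once a trail exists; object-level (data plane) logging for the data lake bucket must be added explicitly. A sketch with hypothetical trail and bucket names:

    import boto3

    cloudtrail = boto3.client('cloudtrail')
    cloudtrail.put_event_selectors(
        TrailName='data-lake-trail',  # hypothetical existing trail
        EventSelectors=[{
            'ReadWriteType': 'All',
            'IncludeManagementEvents': True,
            # Record object-level GET/PUT/DELETE activity for the bucket
            'DataResources': [{
                'Type': 'AWS::S3::Object',
                'Values': ['arn:aws:s3:::my-data-lake/']
            }]
        }]
    )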
Data Lake Optimization
Keeping more historical data assets, particularly raw data assets, allows for
better training and refinement of models. Additionally, as your organization’s
analytics sophistication grows, you may want to go back and reprocess historical
data to look for new insights and value. These historical data assets are
infrequently accessed and consume a lot of capacity, so they are often well
suited to be stored on an archival storage layer.
Another long-term data storage need for the data lake is to keep processed data
assets and results for long-term retention for compliance and audit purposes, to
be accessed by auditors when needed. Both of these use cases are well served by
Amazon Glacier, which is an AWS storage service optimized for infrequently
used cold data, and for storing write once, read many (WORM) data.
Amazon Glacier
Amazon Glacier is an extremely low-cost storage service that provides durable
storage with security features for data archiving and backup. Amazon Glacier has
the same data durability (99.999999999%) as Amazon S3, the same integration
with AWS security features, and can be integrated with S3 by using S3 lifecycle
management on data assets stored in S3, so that data assets can be seamlessly
migrated from S3 to Glacier. Amazon Glacier is a great storage choice when low
storage cost is paramount, data assets are rarely retrieved, and retrieval latency
of several minutes to several hours is acceptable.
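A minimal lifecycle sketch for a hypothetical bucket follows: raw assets transition to Amazon Glacier 90 days after creation, while remaining listed in the bucket’s index.

    import boto3

    s3 = boto3.client('s3')
    s3.put_bucket_lifecycle_configuration(
        Bucket='my-data-lake',  # hypothetical bucket
        LifecycleConfiguration={'Rules': [{
            'ID': 'archive-raw-assets',
            'Filter': {'Prefix': 'raw/'},
            'Status': 'Enabled',
            'Transitions': [{'Days': 90, 'StorageClass': 'GLACIER'}]
        }]}
    )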
Different types of data lake assets may have different retrieval needs. For
example, compliance data may be infrequently accessed and relatively small in
size but needs to be made available in minutes when auditors request data,
while historical raw data assets may be very large but can be retrieved in bulk
over the course of a day when needed.
Amazon Glacier allows data lake users to specify retrieval times when the data
retrieval request is created, with longer retrieval times leading to lower retrieval
costs. For processed data and records that need to be securely retained, Amazon
Glacier Vault Lock allows data lake administrators to easily deploy and enforce
compliance controls on individual Glacier vaults via a lockable policy.
Administrators can specify controls such as Write Once Read Many (WORM) in
a Vault Lock policy and lock the policy from future edits. Once locked, the policy
becomes immutable and Amazon Glacier will enforce the prescribed controls to
help achieve your compliance objectives, and provide an audit trail for these
assets using AWS CloudTrail.
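The sketch below initiates and completes a Vault Lock on a hypothetical vault with a deny-deletion-before-365-days control; in practice you should validate the policy’s effect during the in-progress window before completing the lock.

    import json
    import boto3

    glacier = boto3.client('glacier')

    # Deny archive deletion until the archive is at least 365 days old (WORM-style)
    lock_policy = {
        'Version': '2012-10-17',
        'Statement': [{
            'Sid': 'deny-early-delete',
            'Effect': 'Deny',
            'Principal': '*',
            'Action': 'glacier:DeleteArchive',
            'Resource': 'arn:aws:glacier:us-east-1:111122223333:vaults/compliance-vault',
            'Condition': {'NumericLessThan': {'glacier:ArchiveAgeInDays': '365'}}
        }]
    }

    resp = glacier.initiate_vault_lock(
        accountId='-',                 # '-' means the credential owner's account
        vaultName='compliance-vault',  # hypothetical vault
        policy={'Policy': json.dumps(lock_policy)}
    )

    # After testing the policy, make it immutable
    glacier.complete_vault_lock(
        accountId='-',
        vaultName='compliance-vault',
        lockId=resp['lockId']
    )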
Data lake environments are designed to ingest and process many types of data,
and store raw data assets for future archival and reprocessing purposes, as well
as store processed and normalized data assets for active querying, analytics, and
reporting. One of the key best practices to reduce storage and analytics
processing costs, as well as improve analytics querying performance, is to use an
optimized data format, particularly a format like Apache Parquet.
Amazon tests comparing 1 TB of log data stored in CSV format with the same
data converted to Parquet format showed the following:
• The query time for a representative Athena query was 34x faster with Parquet (237 seconds for CSV versus 5.13 seconds for Parquet), and the amount of data scanned for that Athena query was 99% less (1.15 TB scanned for CSV versus 2.69 GB for Parquet).
• The cost to run that Athena query was 99.7% less ($5.75 for CSV versus $0.013 for Parquet).
Parquet has the additional benefit of being an open data format that can be used
by multiple querying and analytics tools in an Amazon S3-based data lake,
particularly Amazon Athena, Amazon EMR, Amazon Redshift, and Amazon
Redshift Spectrum.
Transforming Data Assets
The key to ‘democratizing’ the data and making the data lake available to the
widest number of users of varying skill sets and responsibilities is to transform
data assets into a format that allows for efficient ad hoc SQL querying. As
discussed earlier, when a data lake is built on AWS, we recommend
transforming log-based data assets into Parquet format. AWS provides multiple
services to quickly and efficiently achieve this.
There are a multitude of ways to transform data assets, and the “best” way often
comes down to individual preference, skill sets, and the tools available. When a
data lake is built on AWS services, there is a wide variety of tools and services
available for data transformation, so you can pick the methods and tools that
you are most comfortable with. Since the data lake is inherently multi-tenant,
multiple data transformation jobs using different tools can be run concurrently.
The two most common and straightforward methods to transform data assets
into Parquet in an Amazon S3-based data lake use Amazon EMR clusters. The
first method involves creating an EMR cluster with Hive installed using the raw
data assets in Amazon S3 as input, transforming those data assets into Hive
tables, and then writing those Hive tables back out to Amazon S3 in Parquet
format. The second, related method is to use Spark on Amazon EMR. With this
method, a typical transformation can be achieved with only 20 lines of PySpark
code.
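A sketch of such a PySpark job is shown below, assuming JSON log assets under a hypothetical raw prefix and year/month fields to partition on; run on EMR, it reads from and writes back to Amazon S3.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('logs-to-parquet').getOrCreate()

    # Read raw JSON log assets directly from the data lake bucket (hypothetical path)
    logs = spark.read.json('s3://my-data-lake/raw/logs/')

    # Write them back as Parquet, partitioned by assumed year/month columns
    (logs.write
         .mode('overwrite')
         .partitionBy('year', 'month')
         .parquet('s3://my-data-lake/curated/logs/'))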
AWS Glue automatically crawls raw data assets in your data lake’s S3 buckets,
identifies data formats, and then suggests schemas and transformations so that
you don’t have to spend time hand-coding data flows. You can then edit these
transformations, if necessary, using the tools and technologies you already
know, such as Python, Spark, Git, and your favorite integrated development
environment (IDE), and then share them with other AWS Glue users of the data
lake. AWS Glue’s flexible job scheduler can be set up to run data transformation
flows on a recurring basis, in response to triggers, or even in response to AWS
Lambda events.
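A minimal sketch of the crawling step follows, with hypothetical names throughout: the crawler scans a raw S3 prefix nightly and populates a Glue catalog database with inferred table definitions.

    import boto3

    glue = boto3.client('glue')
    glue.create_crawler(
        Name='raw-assets-crawler',           # hypothetical crawler
        Role='AWSGlueServiceRole-datalake',  # hypothetical IAM role
        DatabaseName='data_lake_catalog',    # hypothetical catalog database
        Targets={'S3Targets': [{'Path': 's3://my-data-lake/raw/'}]},
        Schedule='cron(0 2 * * ? *)'  # crawl nightly at 02:00 UTC
    )
    glue.start_crawler(Name='raw-assets-crawler')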
In-Place Querying
One of the most important capabilities of a data lake that is built on AWS is the
ability to do in-place transformation and querying of data assets without having
to provision and manage clusters. This allows you to run sophisticated analytic
queries directly on your data assets stored in Amazon S3, without having to
copy and load data into separate analytics platforms or data warehouses. You
can query S3 data without any additional infrastructure, and you only pay for
the queries that you run. This makes the ability to analyze vast amounts of
unstructured data accessible to any data lake user who can use SQL, and makes
it far more cost-effective than the traditional method of performing an ETL
process, creating a Hadoop cluster or data warehouse, loading the transformed
data into these environments, and then running query jobs. AWS Glue, as
described in the previous sections, provides the data discovery and ETL
capabilities, and Amazon Athena and Amazon Redshift Spectrum provide the
in-place querying capabilities.
Amazon Athena
Amazon Athena is an interactive query service that makes it easy for you to
analyze data directly in Amazon S3 using standard SQL. With a few actions in
the AWS Management Console, you can use Athena directly against data assets
stored in the data lake and begin using standard SQL to run ad hoc queries and
get results in a matter of seconds.
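The same queries can be issued programmatically; a sketch with a hypothetical database, table, and results location:

    import time
    import boto3

    athena = boto3.client('athena')
    resp = athena.start_query_execution(
        QueryString='SELECT status, COUNT(*) AS hits FROM logs '
                    'GROUP BY status ORDER BY hits DESC',  # hypothetical table
        QueryExecutionContext={'Database': 'data_lake_catalog'},
        ResultConfiguration={'OutputLocation': 's3://my-data-lake/athena-results/'}
    )
    query_id = resp['QueryExecutionId']

    # Poll until the query reaches a terminal state, then fetch results
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            break
        time.sleep(1)
    rows = athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']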
Amazon Redshift Spectrum
Amazon Redshift Spectrum extends Amazon Redshift, the AWS managed data
warehouse service, by running SQL queries directly against data stored in
Amazon S3 without first loading it into the warehouse, scaling query processing
out to thousands of nodes for fast results even on very large data sets and complex
queries. Redshift Spectrum can directly query a wide variety of data assets
stored in the data lake, including CSV, TSV, Parquet, Sequence, and RCFile.
Since Redshift Spectrum supports the SQL syntax of Amazon Redshift, you can
run sophisticated queries using the same BI tools that you use today. You also
have the flexibility to run queries that span both frequently accessed data assets
that are stored locally in Amazon Redshift and your full data sets stored in
Amazon S3. Because Amazon Athena and Amazon Redshift share a common
data catalog and common data formats, you can use both Athena and Redshift
Spectrum against the same data assets. You would typically use Athena for ad
hoc data discovery and SQL querying, and then use Redshift Spectrum for more
complex queries and scenarios where a large number of data lake users want to
run concurrent BI and reporting workloads.
The Broader Analytics Portfolio
Amazon EMR
Amazon EMR is a highly distributed computing framework used to quickly and
easily process data in a cost-effective manner. Amazon EMR uses Apache
Hadoop, an open source framework, to distribute data and processing across an
elastically resizable cluster of EC2 instances and allows you to use all the
common Hadoop tools such as Hive, Pig, Spark, and HBase. Amazon EMR does
all the heavy lifting involved with provisioning, managing, and maintaining the
infrastructure and software of a Hadoop cluster, and is integrated directly with
Amazon S3. With Amazon EMR, you can launch a persistent cluster that stays
up indefinitely or a transient cluster that terminates once your analysis is
complete.
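A sketch of launching such a transient cluster with boto3 follows (instance types, roles, and the job script path are hypothetical); the cluster runs one Spark step and then terminates.

    import boto3

    emr = boto3.client('emr')
    emr.run_job_flow(
        Name='transient-transform-cluster',
        ReleaseLabel='emr-5.7.0',
        Applications=[{'Name': 'Spark'}, {'Name': 'Hive'}],
        Instances={
            'InstanceGroups': [
                {'InstanceRole': 'MASTER', 'InstanceType': 'm4.large', 'InstanceCount': 1},
                {'InstanceRole': 'CORE', 'InstanceType': 'm4.large', 'InstanceCount': 2},
            ],
            'KeepJobFlowAliveWhenNoSteps': False  # terminate when the step finishes
        },
        Steps=[{
            'Name': 'logs-to-parquet',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['spark-submit', 's3://my-data-lake/jobs/logs_to_parquet.py']
            }
        }],
        JobFlowRole='EMR_EC2_DefaultRole',
        ServiceRole='EMR_DefaultRole'
    )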
Amazon QuickSight
Amazon QuickSight is a fast, easy-to-use business analytics service that lets you
build visualizations, perform ad hoc analysis, and quickly get business insights
from your data assets stored in the data lake,
anytime, on any device. You can use Amazon QuickSight to seamlessly discover
AWS data sources such as Amazon Redshift, Amazon RDS, Amazon Aurora,
Amazon Athena, and Amazon S3, connect to any or all of these data sources and
data assets, and get insights from this data in minutes. Amazon QuickSight
enables organizations using the data lake to seamlessly scale their business
analytics capabilities to hundreds of thousands of users. It delivers fast and
responsive query performance by using a robust in-memory engine (SPICE).
Amazon Rekognition
Another innovative data lake service is Amazon Rekognition, which is a fully
managed image recognition service powered by deep learning, run against
image data assets stored in Amazon S3. Amazon Rekognition has been built by
Amazon’s Computer Vision teams over many years, and already analyzes
billions of images every day. Amazon Rekognition’s easy-to-use API detects
thousands of objects and scenes, analyzes faces, compares two faces to measure
similarity, and verifies faces in a collection of faces. With Amazon Rekognition,
you can easily build applications that search based on visual content in images,
analyze face attributes to identify demographics, implement secure face-based
verification, and more. Amazon Rekognition is built to analyze images at scale
and integrates seamlessly with data assets stored in Amazon S3, as well as AWS
Lambda and other key AWS services.
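For example, a minimal sketch of label detection against an image asset already stored in a hypothetical data lake bucket:

    import boto3

    rekognition = boto3.client('rekognition')
    resp = rekognition.detect_labels(
        Image={'S3Object': {'Bucket': 'my-data-lake',          # hypothetical bucket
                            'Name': 'images/site-photo-001.jpg'}},
        MaxLabels=10,
        MinConfidence=80.0
    )
    for label in resp['Labels']:
        print(label['Name'], round(label['Confidence'], 1))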
These are just a few examples of powerful data processing and analytics tools
that can be integrated with a data lake built on AWS. See the AWS website for
more examples and for the latest list of innovative AWS services available for
data lake users.
Future Proofing the Data Lake
AWS future-proofs your data lake with a standardized storage solution that
grows with your organization: you can ingest and store all of your business’s
data assets on a platform with virtually unlimited scalability and well-defined
APIs, and integrate that platform with a wide variety of data processing tools.
This allows you to
add new capabilities to your data lake as you need them without infrastructure
limitations or barriers. Additionally, you can perform agile analytics
experiments against data lake assets to quickly explore new processing methods
and tools, and then scale the promising ones into production without the need
to build new infrastructure, duplicate or migrate data, or move users to a new
platform. In closing, a data lake built on AWS allows you to evolve your
business around your data assets, and to use those assets to quickly drive more
business value and competitive differentiation without limits.
Contributors
The following individuals and organizations contributed to this document:
Document Revisions
Date Description