AWS Data Analytics Fundamentals


Definitions

Analysis is a detailed examination of something in order to understand its nature or determine its essential features.

Data analysis is the process of compiling, processing, and analyzing data so that you can use it to make decisions.

Analytics is the systematic analysis of data.

Data analytics is the specific analytical process being applied.

Big data is an industry term that has changed in recent years. Big data solutions are often part of data analysis solutions.

Business challenge
Imagine an organization that is growing rapidly. 

Data is generated in many ways. The big question is where to put it all and how to use
it to create value or generate competitive advantages.
The challenges identified in many data analysis solutions can be summarized by five
key characteristics: volume, velocity, variety, veracity, and value.

Not all organizations experience challenges in every area. Some organizations
struggle with ingesting large volumes of data rapidly. Others struggle with processing
massive volumes of data to produce new predictive insights. Still others have users
who need to perform detailed data analysis on the fly over enormous data sets.

Components of a data analysis solution


A data analysis solution has many components. The analytics performed in each of
these components may require different services and different approaches.

A data analysis solution must address each of the following five challenges.

Volume is the amount of data that a solution must handle. The solution must handle this
volume efficiently and be able to distribute the load across enough servers to handle the
next V: velocity.

Velocity is the speed at which data enters and flows through your solution. Many businesses now use
large volumes of real-time streaming data. Solutions must be able to rapidly ingest and rapidly
process this data.

Variety means ingesting data of many different types from many different sources, which can pose many
different challenges for data analysis. Smart companies build solutions to work with structured,
semistructured, and completely unstructured data types.

Veracity refers to the trustworthiness of your data. Have you ever heard the
saying, “My word is my bond”? It’s supposed to instill trust, to let you know that the person
saying it is honorable and will do what they say they will. That’s veracity. To have
trustworthy data, you have to know the provenance of your data.

Value is the bottom line, really. The whole point of this effort is getting value from data.
That includes creating reports and dashboards that inform critical business decisions. It also includes
highlighting areas for improving the business. And it includes making it easier to find and
communicate critical details about business operations.

Due to the increasing volume, velocity, variety, veracity, and value of data, some data
management challenges cannot be solved with traditional database and processing
solutions. That's where data analysis solutions come in.

Planning a data analysis solution


Data analysis solutions incorporate many forms of analytics to store, process, and
visualize data. Planning a data analysis solution begins with knowing what you need
out of that solution.
Know where your data comes from
The majority of data ingested by data analysis solutions comes from existing on-premises
databases and file stores. This data is often in a state where the required processing within
the solution will be minimal.

Streaming data is a source of business data that is gaining popularity. This data source is
less structured. It may require special software to collect the data and specific processing
applications to correctly aggregate and analyze it in near real-time.

Public data sets are another source of data for businesses.  These include census data, health
data, population data, and many other datasets that help businesses understand the data they
are collecting on their customers.  This data may need to be transformed so that it will contain
only what the business needs.

Know the options for processing your data


There are many different solutions available for processing your data. There is no one-size-
fits-all approach. You must carefully evaluate your business needs and match them to the
services that will combine to provide you with the required results. 

Throughout this course, we will cover the services that AWS offers for each of these
components.

Know what you need to learn from your data


You must be prepared to learn from your data, work with internal teams to optimize efforts,
and be willing to experiment. 

It is vital to spot trends, make correlations, and run more efficient and profitable businesses.
It's time to put your data to work.

Scenario 1
My business has a set of 15 JSON data files that are each about 2.5 GB in size. They are
placed on a file server once an hour. They must be ingested as soon as they arrive in this
location. This data must be combined with all transactions from the financial dashboard for
this same period, then compared to the recommendations from the marketing engine. All data
is fully cleansed. The results from this time period must be made available to decision makers
by 10 minutes after the hour in the form of financial dashboards. Based on the scenario
above, which of the following Vs pose a challenge for this business?

This scenario describes challenges in volume, velocity, variety, and value.

- Volume: This scenario describes huge JSON files to be combined with transactional data and
marketing data.
- Velocity: This scenario is an example of "Wait - now hurry up!" The solution must wait to collect
data for a full hour and then produce meaningful results in less than 10 minutes.
- Variety: This scenario has three data source types: JSON files, transactional data, and
recommendation information that is likely in a key-value format.
- Value: This scenario will populate dashboards that are used by decision makers as soon as they are
made available. Value is achieved because delivering these dashboards requires an understanding of
what the organization is trying to accomplish. A thorough understanding of these initiatives is key.

Scenario 2
My business compiles data generated by hundreds of corporations. This data is delivered to
us in very large files, transactional updates, and even data streams. The data must be cleansed
and prepared to ensure that rogue inputs do not skew the results. Knowing the data source for
each record is vital to the work we do. A large portion of the data gathered is irrelevant to our
analysis, so this data must be eliminated. The final requirement is that all data must be
combined and loaded into our data warehouse, where it will be analyzed. 
This problem involves volume, variety, and veracity.

- Volume: The data is delivered in very large files, transactional updates, and even in data streams.
- Variety: The business will need to combine the data from all three sources into a single data warehouse.
- Veracity: The data is known to be suspect. The data must be cleansed and prepared to ensure that rogue
inputs do not skew the results. Knowing the data source for each record is vital to the work being done.

When businesses have more data than they are able to process and analyze, they have a
volume problem.
Exponential growth of business data
Businesses have been storing data for decades—that is nothing new. What has
changed in recent years is the ability to analyze certain types of data.

There are three broad classifications of data source types:

 Structured data is organized and stored in the form of values that are
grouped into rows and columns of a table.
 Semistructured data is often stored in a series of key-value pairs that
are grouped into elements within a file.
 Unstructured data is not structured in a consistent way. Some data
may have structure similar to semistructured data, but others may
contain only metadata.

Many internet articles tout the huge amount of information sitting within unstructured
data. New applications are being released that can now catalog and provide incredible
insights into this untapped resource.

But what is unstructured data? It is in every file that we store, every picture we take,
and every email we send.

Data sets are getting bigger and more diverse every single day.
Modern data management platforms must capture data from diverse sources at
speed and scale. Data needs to be pulled together in manageable, central
repositories—breaking down traditional silos. The benefits of collection and
analysis of all business data must outweigh the costs.

Topic 1: Introduction to Amazon S3


Data analysis solutions can ingest data from just about anywhere. However, the closer
your data is to your processing system, the better that processing system will perform.
In AWS, the Amazon Simple Storage Service (Amazon S3) is the best place to store
all of your semistructured and unstructured data.
AWS file storage

CUSTOMER NEED
Imagine a business that has implemented Amazon QuickSight as a data visualization tool.
When this tool relies on data stored on-premises, latency may be added into processing. This
latency can become a problem for users. Another common concern is a user's ability to pull
together the correct data sets to perform the necessary analytics.

THE AWS OPTION


Amazon S3 is storage for the internet. This service is designed to make web-scale
computing easier for developers. Amazon S3 provides a simple web service interface
that can be used to store and retrieve any amount of data, at any time, from anywhere
on the web. The service gives any developer access to the same highly scalable,
reliable, secure, fast, inexpensive infrastructure that Amazon uses to run its own
global network of websites. The service aims to maximize benefits of scale and pass
those benefits on to developers.

The benefits of Amazon S3 include the following:

    - Store anything
    - Secure object storage
    - Natively online, HTTP access
    - Unlimited scalability 
    - 99.999999999% durability

Amazon S3 is object storage built to store and retrieve any amount of data from anywhere.

Amazon S3 concepts
To get the most out of Amazon S3, you need to understand a few simple concepts.
First, Amazon S3 stores data as objects within buckets.

An object is composed of a file and any metadata that describes that file. To store an
object in Amazon S3, you upload the file you want to store into a bucket. When you
upload a file, you can set permissions on the object and add any metadata.

Buckets are logical containers for objects. You can have one or more buckets in your
account and can control access for each bucket individually. You control who can
create, delete, and list objects in the bucket. You can also view access logs for the
bucket and its objects and choose the geographical region where Amazon S3 will store
the bucket and its contents.
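To make these concepts concrete, here is a minimal sketch of storing an object with user-defined metadata using the AWS SDK for Python (boto3). The bucket name, object key, local file name, and metadata values are all hypothetical, and credentials are assumed to be configured in the environment.

```python
import boto3

# Create an S3 client (credentials are resolved from the environment).
s3 = boto3.client("s3")

# Hypothetical bucket and key names, for illustration only.
bucket_name = "example-analytics-bucket"
object_key = "sales/2023/january.csv"

# Upload a local file as an object, attaching user-defined metadata.
s3.upload_file(
    Filename="january.csv",
    Bucket=bucket_name,
    Key=object_key,
    ExtraArgs={"Metadata": {"department": "sales", "source": "pos-system"}},
)
```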

Accessing your content


Once objects have been stored in an Amazon S3 bucket, they are given an object key.
Use this, along with the bucket name, to access the object.

Below is an example of a URL for a single object in a bucket named doc, with an
object key composed of the prefix 2006-03-01 and the file named AmazonS3.html:

https://doc.s3.amazonaws.com/2006-03-01/AmazonS3.html

An object key is the unique identifier for an object in a bucket. Because the
combination of a bucket, key, and version ID uniquely identifies each object, you can
think of Amazon S3 as a basic data map between "bucket + key + version" and the
object itself. Every object in Amazon S3 can be uniquely addressed through the
combination of the web service endpoint, bucket name, key, and (optionally) version.
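Continuing the hypothetical example above, this sketch retrieves an object by bucket and key with boto3; the optional VersionId parameter (commented out) would select a specific version when bucket versioning is enabled.

```python
import boto3

s3 = boto3.client("s3")

# Address the object by bucket + key.
response = s3.get_object(
    Bucket="example-analytics-bucket",   # hypothetical bucket
    Key="sales/2023/january.csv",        # hypothetical object key
    # VersionId="<version-id>",          # optional: pick a specific version
)
body = response["Body"].read()      # the object's data
metadata = response["Metadata"]     # user-defined metadata set at upload
```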

Object metadata
For each object stored in a bucket, Amazon S3 maintains a set of system metadata, such
as the object's size and the date it was last modified.
Data analysis solutions on Amazon S3

There are numerous advantages of using Amazon S3 as the storage platform for your
data analysis solution:

- Decoupling of storage from compute and data processing
- Centralized data architecture
- Integration with clusterless and serverless AWS services
- Standardized Application Programming Interfaces (APIs)

Decoupling of storage from compute and data processing


With Amazon S3, you can cost-effectively store all data types in their native formats. You
can then launch as many or as few virtual servers as needed using Amazon Elastic Compute
Cloud (Amazon EC2) and use AWS analytics tools to process your data. You can optimize
your EC2 instances to provide the correct ratios of CPU, memory, and bandwidth for the best
performance.

Decoupling your processing and storage provides a significant number of benefits, including
the ability to process and analyze the same data with a variety of tools.

Centralized data architecture


Amazon S3 makes it easy to build a multi-tenant environment, where many users can bring
their own data analytics tools to a common set of data. This improves both cost and data
governance over traditional solutions, which require multiple copies of data to be distributed
across multiple processing platforms.

Although this may require an additional step to load your data into the right tool, using
Amazon S3 as your central data store provides even more benefits over traditional storage
options.


Integration with clusterless and serverless AWS services


Combine Amazon S3 with other AWS services to query and process data. Amazon S3 also
integrates with AWS Lambda serverless computing to run code without provisioning or
managing servers. Amazon Athena can query Amazon S3 directly using the Structured Query
Language (SQL), without the need for data to be ingested into a relational database.

With all of these capabilities, you only pay for the actual amounts of data you process or the
compute time you consume.
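As an illustration of this clusterless model, here is a minimal sketch of submitting a SQL query to Amazon Athena over data in Amazon S3 with boto3. The database name, table name, and results bucket are hypothetical.

```python
import time
import boto3

athena = boto3.client("athena")

# Submit a SQL query that runs directly against data stored in S3.
# The database, table, and output location below are hypothetical.
execution = athena.start_query_execution(
    QueryString="SELECT product, SUM(amount) AS total FROM sales GROUP BY product",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes (simplified; real code should also
# inspect the failure reason and apply a timeout).
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    print(results["ResultSet"]["Rows"])
```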
Standardized Application Programming Interfaces (APIs)

Representational State Transfer (REST) APIs are programming interfaces commonly used to
interact with files in Amazon S3. Amazon S3's RESTful APIs are simple, easy to use, and
supported by most major third-party independent software vendors (ISVs), including Apache
Hadoop and other leading analytics tool vendors. This allows customers to bring the tools
they are most comfortable with and knowledgeable about to help them perform analytics on
data in Amazon S3.
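One everyday consequence of this REST interface is that any HTTP client can fetch an object once it has a valid URL. A common pattern is a presigned URL, sketched below with boto3 using the same hypothetical names as earlier:

```python
import boto3

s3 = boto3.client("s3")

# Create a time-limited HTTPS URL for a single object. Any HTTP client
# (a browser, curl, or an analytics tool) can then GET it directly.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-analytics-bucket", "Key": "sales/2023/january.csv"},
    ExpiresIn=3600,  # URL validity in seconds
)
print(url)
```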

Topic 2: Introduction to data lakes

Storing business content has always been a point of contention, and often frustration,
within businesses of all types. Should content be stored in folders? Should prefixes and
suffixes be used to identify file versions? Should content be divided by department or
specialty? The list goes on and on.

The issue stems from the fact that many companies start to implement document or
file management systems with the best of intentions but don't have the foresight or
infrastructure in place to maintain the initial data organization.

Out of the dire need to organize the ever-increasing volume of data, data lakes were
born.

Business challenge
Businesses grow over time. As they do, a natural result is that important files and data get
scattered across the enterprise. It is very common to find employees who have no idea where
data can be found and—even worse—how to analyze it when it is in different locations.

A data lake is a centralized repository that allows you to store structured,
semistructured, and unstructured data at any scale.

Data lakes promise the ability to store all data for a business in a single repository.
You can leverage data lakes to store large volumes of data instead of persisting that
data in data warehouses. Data lakes, such as those built in Amazon S3, are generally
less expensive than specialized big data storage solutions. That way, you only pay for
the specialized solutions when using them for processing and analytics and not for
long-term storage. Your extract, transform, and load (ETL) and analytic process can
still access this data for analytics. 

Data lakes offer several promises, each of which comes with a caution:

- Single source of truth: Be careful not to let your data lake become a swamp.
Enforce proper organization and structure for all data entering the lake.
- Store any type of data, regardless of structure: Be careful to ensure
that data within the data lake is relevant and does not go unused. Train users on how
to access the data, and set retention policies to ensure the data stays refreshed.
- Can be analyzed using artificial intelligence (AI) and machine learning: Be careful
to learn how to use data in new ways. Don't limit analytics to typical data
warehouse-style analytics. AI and machine learning offer significant insights.

Benefits of a data lake on AWS

Data lakes built on AWS:

 Are a cost-effective data storage solution. You can durably store a nearly
unlimited amount of data using Amazon S3.
 Implement industry-leading security and compliance. AWS uses
stringent data security, compliance, privacy, and protection mechanisms.
 Allow you to take advantage of many different data collection and
ingestion tools to ingest data into your data lake. These services include
Amazon Kinesis for streaming data and AWS Snowball appliances for
large volumes of on-premises data.
 Help you to categorize and manage your data simply and efficiently.
Use AWS Glue to understand the data within your data lake, prepare it,
and load it reliably into data stores. Once AWS Glue catalogs your data, it
is immediately searchable, can be queried, and is available for ETL
processing (see the sketch after this list).
 Help you turn data into meaningful insights. Harness the power of
purpose-built analytic services for a wide range of use cases, such as
interactive analysis, data processing using Apache Spark and Apache
Hadoop, data warehousing, real-time analytics, operational analytics,
dashboards, and visualizations.
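As referenced in the list above, here is a minimal sketch of cataloging data lake content with AWS Glue using boto3. It assumes a crawler and a catalog database have already been created; their names are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Start a previously created crawler that scans an S3 location and
# registers any tables it discovers in the Glue Data Catalog.
glue.start_crawler(Name="example-data-lake-crawler")

# After the crawler finishes, the cataloged tables are searchable,
# queryable (for example, from Athena), and available for ETL jobs.
tables = glue.get_tables(DatabaseName="example_data_lake_db")
for table in tables["TableList"]:
    print(table["Name"])
```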
Amazon EMR and data lakes

Businesses have begun realizing the power of data lakes. Businesses can place
data within a data lake and use their choice of open source distributed
processing frameworks, such as those supported by Amazon EMR. Apache
Hadoop and Spark are both supported by Amazon EMR, which can help
businesses easily, quickly, and cost-effectively implement data processing
solutions based on Amazon S3 data lakes.
Data lake preparation

Data scientists spend 60% of their time cleaning and organizing data and 19%
collecting data sets.

Data preparation is a huge undertaking. There are no easy answers when it
comes to cleaning, transforming, and collecting data for your data lake.
However, there are services that can automate many of these time-consuming
processes.

Setting up and managing data lakes today can involve a lot of manual,
complicated, and time-consuming tasks. This work includes loading the data,
monitoring the data flows, setting up partitions for the data, and tuning
encryption. You may also need to reorganize data, deduplicate it, match linked
records, and audit data over time.
AWS content organization and curation
Business challenge
Imagine a business that has millions of files stored in numerous on-premises server-based and
network-attached storage solutions. The business is struggling to navigate all of the locations
and provide users with quick, reliable access to this content both locally and from the cloud.

Data lake on AWS

Traditional data storage and analytic tools can no longer provide the agility and flexibility
required to deliver relevant business insights. That’s why many organizations are shifting to a
data lake architecture.

A data lake on AWS can help you do the following:


- Collect and store any type of data, at any scale, and at low cost
- Secure the data and prevent unauthorized access
- Catalog, search, and find the relevant data in the central repository
- Quickly and easily perform new types of data analysis
- Use a broad set of analytic engines for one-time analytics, real-time streaming, predictive
analytics, AI, and machine learning

Imagine a business that has thousands of files stored in Amazon S3. The business needs a
solution for automating common data preparation tasks and organizing the data in a secure
repository.
AWS Lake Formation (currently in preview)

AWS Lake Formation makes it easy to set up a secure data lake in days. A data lake is a
centralized, curated, and secured repository that stores all your data, both in its original form
and when prepared for analysis. A data lake enables you to break down data silos and
combine different types of analytics to gain insights and guide better business decisions.

AWS Lake Formation makes it easy to ingest, clean, catalog, transform, and secure
your data and make it available for analysis and machine learning. Lake Formation
gives you a central console where you can discover data sources, set up transformation
jobs to move data to an Amazon S3 data lake, remove duplicates and match records,
catalog data for access by analytic tools, configure data access and security policies,
and audit and control access from AWS analytic and machine learning services.

Lake Formation automatically configures underlying AWS services to ensure
compliance with your defined policies. If you have set up transformation jobs
spanning AWS services, Lake Formation configures the flows, centralizes their
orchestration, and lets you monitor the execution of your jobs.
Introduction to data storage methods

As the volume of data has increased, so have the options for storing data. Traditional
storage methods such as data warehouses are still very popular and relevant. However,
data lakes have become more popular recently. These new options can confuse
businesses that are trying to be financially wise and technically relevant.

So which is better: data warehouses or data lakes? Neither and both. They are
different solutions that can be used together to maintain existing data warehouses
while taking full advantage of the benefits of data lakes.
Business challenge
Businesses are left asking the question, "Why?" Why should we spend a bunch of time and
money implementing a data lake when we have invested so much into a data warehouse? It is
important to remember that a data lake augments, but does not replace, a data warehouse.

Data warehouses
A data warehouse is a central repository of structured data from many data sources.
This data is transformed, aggregated, and prepared for business reporting and analysis.
A data warehouse is a central repository of information coming from one or more data
sources. Data flows into a data warehouse from transactional systems, relational
databases, and other sources. These data sources can include structured,
semistructured, and unstructured data. These data sources are transformed into
structured data before they are stored in the data warehouse.

Data is stored within the data warehouse using a schema. A schema defines how data
is stored within tables, columns, and rows. The schema enforces constraints on the
data to ensure integrity of the data. The transformation process often involves the
steps required to make the source data conform to the schema. Following the first
successful ingestion of data into the data warehouse, the process of ingesting and
transforming the data can continue at a regular cadence.
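To make the idea of a schema concrete, here is a toy sketch using Python's built-in sqlite3 module. This is not a data warehouse, but the principle is the same: the table definition fixes the columns, types, and constraints, and rows that violate them are rejected. All table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema defines columns, types, and integrity constraints.
conn.execute("""
    CREATE TABLE sales (
        sale_id  INTEGER PRIMARY KEY,
        product  TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0),
        sold_at  TEXT NOT NULL
    )
""")

# Conforming data is accepted...
conn.execute("INSERT INTO sales VALUES (1, 'widget', 19.99, '2023-01-15')")

# ...but data that violates the schema is rejected with an error.
try:
    conn.execute("INSERT INTO sales VALUES (2, 'gadget', -5.00, '2023-01-16')")
except sqlite3.IntegrityError as err:
    print("Rejected by schema constraint:", err)
```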

Business analysts, data scientists, and decision makers access the data through
business intelligence (BI) tools, SQL clients, and other analytics
applications. Businesses use reports, dashboards, and analytics tools to extract insights
from their data, monitor business performance, and support decision making. These
reports, dashboards, and analytics tools are powered by data warehouses, which store
data efficiently to minimize I/O and deliver query results at blazing speeds to
hundreds or even thousands of users concurrently.
