Data Lake on the AWS Cloud with Talend Big Data Platform, AWS Services, and Cognizant Best Practices
This Quick Start builds a data lake environment on the Amazon Web Services (AWS) Cloud by deploying Talend Big Data Platform components and AWS services such as Amazon EMR, Amazon Redshift, Amazon Simple Storage Service (Amazon S3), and Amazon Relational Database Service (Amazon RDS).
The Quick Start also provides an optional sample dataset and Talend jobs developed by Cognizant Technology Solutions to illustrate big data practices for integrating Apache Spark, Apache Hadoop, Amazon EMR, Amazon Redshift, and Amazon S3 technologies into the data lake implementation.
This Quick Start is intended for users who are evaluating big data in the cloud, or who want to accelerate a big data initiative by adopting best practices for big data integration.
The Quick Start offers two deployment options:
- Deploying the data lake environment into a new virtual private cloud (VPC) that's configured for security, scalability, and high availability
- Deploying the data lake environment into an existing VPC in your AWS account
You can also use the AWS CloudFormation templates as a starting point for your own implementation.
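As a rough illustration of how the two deployment options might be driven programmatically, the following Python sketch builds the arguments for the boto3 CloudFormation `create_stack` call. The template URLs, the `VPCID` parameter key, and the need for `CAPABILITY_IAM` are assumptions made for this example, not values taken from this Quick Start. Check the deployment guide for the actual template names and parameters.

```python
from typing import Dict, Optional

# Hypothetical template locations, used only for illustration.
# The real template URLs are listed in the deployment guide.
NEW_VPC_TEMPLATE = "https://example.com/templates/data-lake-new-vpc.template"
EXISTING_VPC_TEMPLATE = "https://example.com/templates/data-lake-existing-vpc.template"


def build_stack_request(
    stack_name: str,
    vpc_id: Optional[str] = None,
    extra_parameters: Optional[Dict[str, str]] = None,
) -> dict:
    """Build keyword arguments for CloudFormation create_stack.

    With no vpc_id, the new-VPC template is selected; with a vpc_id,
    the existing-VPC template is selected and the ID is passed along.
    """
    params = dict(extra_parameters or {})
    if vpc_id is None:
        template_url = NEW_VPC_TEMPLATE
    else:
        template_url = EXISTING_VPC_TEMPLATE
        params["VPCID"] = vpc_id  # hypothetical parameter key

    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in sorted(params.items())
        ],
        # Assumption: the templates create IAM resources.
        "Capabilities": ["CAPABILITY_IAM"],
    }


def launch(request: dict) -> str:
    """Submit the request to CloudFormation and return the new stack ID."""
    import boto3  # imported lazily so building a request needs no AWS SDK

    client = boto3.client("cloudformation")
    return client.create_stack(**request)["StackId"]
```

Calling `build_stack_request("data-lake")` selects the new-VPC option, and passing `vpc_id="vpc-..."` selects the existing-VPC option. Building the request is kept separate from `launch` so that the parameters can be reviewed before any resources are created.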
For architectural details, step-by-step instructions, and customization options, see the deployment guide.
To post feedback, submit feature ideas, or report bugs, use the Issues section of this GitHub repo. If you'd like to submit code for this Quick Start, please review the AWS Quick Start Contributor's Kit.