CDI-E: an elastic cloud service for data engineering
Pages 3319 - 3331
Abstract
We live in the gilded age of data-driven computing. With public clouds offering virtually unlimited amounts of compute and storage, enterprises collecting data about every aspect of their businesses, and advances in analytics and machine learning technologies, data-driven decision making is now timely, cost-effective, and therefore pervasive. Alas, only a handful of power users can wield today's powerful data engineering tools. For one thing, most solutions require knowledge of specific programming interfaces or libraries. Furthermore, running them cost-effectively requires complex configurations and knowledge of the underlying cloud.
We decided that a fundamental redesign is in order to democratize data engineering for the masses at cloud scale. The result is Informatica Cloud Data Integration - Elastic (CDI-E). Since the early 1990s, Informatica has been a pioneer and industry leader in building no-code data engineering tools. Non-experts can express complex data engineering tasks using a graphical user interface (GUI). Informatica CDI-E is built to combine the simplicity of a GUI in the design layer with an elastic and highly scalable run time that handles data in any format with little to no user input, using automated optimizations. Users upload their data to the cloud in any format and can immediately use them in conjunction with their data management and analytic tools of choice through the CDI-E GUI. Implementation began in the Spring of 2017, and Informatica CDI-E has been generally available since the Summer of 2019. Today, CDI-E is used in production by a growing number of small and large enterprises to make sense of data in arbitrary formats.
In this paper, we describe the architecture of Informatica CDI-E and its novel no-code data engineering interface. The paper highlights some of the key features of CDI-E: simplicity without loss of productivity, and extreme elasticity. It concludes with lessons we learned and an outlook on the future.
Published In
August 2022
551 pages
Publisher
VLDB Endowment
Publication History
Published: 01 August 2022
Published in PVLDB Volume 15, Issue 12
Qualifiers
- Research-article