
CDI-E: an elastic cloud service for data engineering

Authors: Shivangi Srivastava, Valentin Moskovich, Anmol Chaturvedi, Mosharaf Chowdhury

Proceedings of the VLDB Endowment, Volume 15, Issue 12

Pages 3319 - 3331

https://doi.org/10.14778/3554821.3554825

Published: 01 August 2022

Abstract

We live in the gilded age of data-driven computing. With public clouds offering virtually unlimited amounts of compute and storage, enterprises collecting data about every aspect of their businesses, and advances in analytics and machine learning technologies, data-driven decision making is now timely, cost-effective, and therefore pervasive. Alas, only a handful of power users can wield today's powerful data engineering tools. For one thing, most solutions require knowledge of specific programming interfaces or libraries. Furthermore, running them cost-effectively requires complex configurations and knowledge of the underlying cloud.

We decided that a fundamental redesign is in order to democratize data engineering for the masses at cloud scale. The result is Informatica Cloud Data Integration - Elastic (CDI-E). Since the early 1990s, Informatica has been a pioneer and industry leader in building no-code data engineering tools, with which non-experts can express complex data engineering tasks using a graphical user interface (GUI). Informatica CDI-E combines the simplicity of a GUI in the design layer with an elastic and highly scalable runtime that uses automated optimizations to handle data in any format with little to no user input. Users upload their data to the cloud in any format and can immediately use it in conjunction with their data management and analytic tools of choice through the CDI-E GUI. Implementation began in the Spring of 2017, and Informatica CDI-E has been generally available since the Summer of 2019. Today, CDI-E is used in production by a growing number of small and large enterprises to make sense of data in arbitrary formats.

In this paper, we describe the architecture of Informatica CDI-E and its novel no-code data engineering interface. The paper highlights two key features of CDI-E: simplicity without loss in productivity, and extreme elasticity. It concludes with lessons we learned and an outlook on the future.

Published In

Proceedings of the VLDB Endowment, Volume 15, Issue 12

August 2022, 551 pages

ISSN: 2150-8097

Editors: Fatma Özcan (Google), Juliana Freire (New York University), Xuemin Lin (University of New South Wales)

Publisher

VLDB Endowment
