Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

ConnectorX: accelerating data loading from databases to dataframes

Published: 01 July 2022 Publication History

Abstract

Data is often stored in a database management system (DBMS) but dataframe libraries are widely used among data scientists. An important but challenging problem is how to bridge the gap between databases and dataframes. To solve this problem, we present ConnectorX, a client library that enables fast and memory-efficient data loading from various databases to different dataframes. We first investigate why the loading process is slow and consumes large memory. We surprisingly find that the main overhead comes from the client-side rather than query execution or data transfer. We integrate several existing and new techniques to reduce the overhead and carefully design the system architecture and interface to make ConnectorX easy to extend to various databases and dataframes. Moreover, we propose server-side result partitioning that can be adopted by DBMSs in order to better support exporting data to data science tools. We conduct extensive experiments to evaluate ConnectorX and compare it with popular libraries. The results show that ConnectorX significantly outperforms existing solutions. ConnectorX is open sourced at: https://github.com/sfu-db/connector-x.

References

[1]
2001--2022. pickle --- Python object serialization. https://docs.python.org/3/library/pickle.html. Accessed: 2022-05-01.
[2]
2011--2019. Apache Sqoop. https://sqoop.apache.org/. Accessed: 2022-05-01.
[3]
2014--2021. Ibis: Write your analytics code once, run it everywhere. http://ibis-project.org. Accessed: 2022-01-27.
[4]
2016. pandas read_sql is unusually slow. https://stackoverflow.com/questions/40045093/pandas-read-sql-is-unusually-slow. Accessed: 2022-01-27.
[5]
2016. Pandas using too much memory with read_sql_table. https://stackoverflow.com/questions/41253326/pandas-using-too-much-memory-with-read-sql-table. Accessed: 2022-01-27.
[6]
2016--2022. Apache Arrow. https://arrow.apache.org/. Accessed: 2022-01-27.
[7]
2017. Program (Time) Bottleneck is Database Interaction. https://stackoverflow.com/questions/44154430/program-time-bottleneck-is-database-interaction. Accessed: 2022-01-27.
[8]
2017. Use Turbodbc/Arrow for read_sql_table. https://github.com/pandas-dev/pandas/issues/17790. Accessed: 2022-01-27.
[9]
2017-2021. Turbodbc - Turbocharged database access for data scientists. https://turbodbc.readthedocs.io/en/latest/. Accessed: 2022-01-27.
[10]
2018. AWS CLI s3. https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/index.html. Accessed: 2022-05-01.
[11]
2021. ConnectorX for ETL workload. https://github.com/sfu-db/connector-x/discussions/133. Accessed: 2022-01-27.
[12]
2021. ConnectorX for ML feature fetching. https://github.com/sfu-db/connector-x/issues/140#issuecomment-948918848. Accessed: 2022-01-27.
[13]
2021. ConnectorX integrates with dataframe system. https://pola-rs.github.io/polars-book/user-guide/howcani/io/read_db.html. Accessed: 2022-01-27.
[14]
2021. DataPrep: The easiest way to prepare data in Python. https://dataprep.ai/. Accessed: 2021-01-27.
[15]
2021. DDoS Dataset. https://www.kaggle.com/devendra416/ddos-datasets. Accessed: 2022-01-27.
[16]
2021. Google BigQuery. https://cloud.google.com/bigquery. Accessed: 2022-01-27.
[17]
2021. Polars: Fast multi-threaded DataFrame library in Rust and Python. https://github.com/pola-rs/polars. Accessed: 2022-01-27.
[18]
2021. TPC-H Homepage. http://www.tpc.org/tpch. Accessed: 2022-01-27.
[19]
2022. Amazon Redshift. https://aws.amazon.com/redshift/. Accessed: 2022-05-01.
[20]
2022. Amazon Redshift: Unloading data to Amazon S3. https://docs.aws.amazon.com/redshift/latest/dg/t_Unloading_tables.html. Accessed: 2022-05-01.
[21]
2022. Amazon S3. https://aws.amazon.com/s3/. Accessed: 2022-05-01.
[22]
2022. Apache Avro. https://avro.apache.org/. Accessed: 2022-05-01.
[23]
2022. Apache ORC. https://orc.apache.org/. Accessed: 2022-05-01.
[24]
2022. Apache Parquet. https://parquet.apache.org/. Accessed: 2022-05-01.
[25]
2022. Azure Blob Storage. https://azure.microsoft.com/en-us/services/storage/blobs/. Accessed: 2022-05-01.
[26]
2022. Azure Data Lake. https://azure.microsoft.com/en-us/solutions/data-lake/. Accessed: 2022-05-01.
[27]
2022. DistCp. https://hadoop.apache.org/docs/r3.1.3/hadoop-distcp/DistCp.html. Accessed: 2022-05-01.
[28]
2022. Google BigQuery: Exporting table data. https://cloud.google.com/bigquery/docs/exporting-data. Accessed: 2022-05-01.
[29]
2022. Google Cloud Storage. https://cloud.google.com/storage. Accessed: 2022-05-01.
[30]
2022. Ray Data. https://docs.ray.io/en/latest/data/getting-started.html. Accessed: 2022-05-01.
[31]
2022. The Rust Programming Language - Macro. https://doc.rust-lang.org/book/ch19-06-macros.html. Accessed: 2022-01-27.
[32]
2022. Spark SQL, DataFrames and Datasets Guide. https://spark.apache.org/docs/latest/sql-programming-guide.html. Accessed: 2022-05-01.
[33]
2022. Unloading Data from Snowlake. https://docs.snowlake.com/en/user-guide-data-unload.html. Accessed: 2022-05-01.
[34]
Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, Timos K. Sellis, Susan B. Davidson, and Zachary G. Ives (Eds.). ACM, 1383--1394.
[35]
SQLAlchemy authors and contributors. 2007--2022. Using Server Side Cursors (a.k.a. stream results). https://docs.sqlalchemy.org/en/14/core/connections.html#using-server-side-cursors-a-k-a-stream-results. Accessed: 2022-01-27.
[36]
Stefania Leone Bogdan Ionut Ghit, Juliusz Sompolski and Reynold Xin. 2021. How We Achieved High-bandwidth Connectivity With BI Tools. https://databricks.com/blog/2021/08/11/how-we-achieved-high-bandwidth-connectivity-with-bi-tools.html. Accessed: 2022-01-27.
[37]
Lingjiao Chen, Arun Kumar, Jeffrey F. Naughton, and Jignesh M. Patel. 2017. Towards Linear Algebra over Normalized Data. Proc. VLDB Endow. 10, 11 (2017), 1214--1225.
[38]
Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD '16). ACM, New York, NY, USA, 785--794.
[39]
Benoît Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, Peter Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner. 2016. The Snowflake Elastic Data Warehouse. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, Fatma Özcan, Georgia Koutrika, and Sam Madden (Eds.). ACM, 215--226.
[40]
Sudipto Das, Yannis Sismanis, Kevin S. Beyer, Rainer Gemulla, Peter J. Haas, and John McPherson. 2010. Ricardo: integrating R and Hadoop. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6--10, 2010, Ahmed K. Elmagarmid and Divyakant Agrawal (Eds.). ACM, 987--998.
[41]
Joseph Vinish D'silva, Florestan De Moor, and Bettina Kemme. 2018. AIDA - Abstraction for Advanced In-Database Analytics. Proc. VLDB Endow. 11, 11 (2018), 1400--1413.
[42]
The Apache Software Foundation. 2016--2021. Apache Arrow Flight. https://arrow.apache.org/docs/format/Flight.html. Accessed: 2022-01-27.
[43]
The Python Software Foundation. 2001. PEP 249 - Python Database API Specification v2.0. https://www.python.org/dev/peps/pep-0249/. Accessed: 2022-01-27.
[44]
Philipp Große, Wolfgang Lehner, Thomas Weichert, Franz Färber, and Wen-Syan Li. 2011. Bridging Two Worlds with RICE Integrating R into the SAP In-Memory Computing Engine. Proc. VLDB Endow. 4, 12 (2011), 1307--1317. http://www.vldb.org/pvldb/vol4/p1307-grosse.pdf
[45]
Stefan Hagedorn, Steffen Kläbe, and Kai-Uwe Sattler. 2021. Putting Pandas in a Box. In 11th Conference on Innovative Data Systems Research, CIDR 2021, Virtual Event, January 11--15, 2021, Online Proceedings. www.cidrdb.org. http://cidrdb.org/cidr2021/papers/cidr2021_paper07.pdf
[46]
Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. 2020. Array programming with NumPy. Nature 585, 7825 (Sept. 2020), 357--362.
[47]
J. D. Hunter. 2007. Matplotlib: A 2D graphics environment. Computing in Science & Engineering 9, 3 (2007), 90--95.
[48]
Matthias Jasny, Tobias Ziegler, Tim Kraska, Uwe Röhm, and Carsten Binnig. 2020. DB4ML - An In-Memory Database Kernel with Machine Learning Support. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14--19, 2020, David Maier, Rachel Pottinger, AnHai Doan, Wang-Chiew Tan, Abdussalam Alawini, and Hung Q. Ngo (Eds.). ACM, 159--173.
[49]
Alekh Jindal, K Venkatesh Emani, Maureen Daum, Olga Poppe, Brandon Haynes, Anna Pavlenko, Ayushi Gupta, Karthik Ramachandra, Carlo Curino, Andreas Mueller, et al. 2021. Magpie: Python at speed and scale using cloud backends. In CIDR.
[50]
Konstantinos Karanasos, Matteo Interlandi, Fotis Psallidas, Rathijit Sen, Kwanghyun Park, Ivan Popivanov, Doris Xin, Supun Nakandala, Subru Krishnan, Markus Weimer, Yuan Yu, Raghu Ramakrishnan, and Carlo Curino. 2020. Extending Relational Query Processing with ML Inference. In 10th Conference on Innovative Data Systems Research, CIDR 2020, Amsterdam, The Netherlands, January 12--15, 2020, Online Proceedings. www.cidrdb.org. http://cidrdb.org/cidr2020/papers/p24-karanasos-cidr20.pdf
[51]
Mahmoud Abo Khamis, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. 2018. In-Database Learning with Sparse Tensors. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10--15, 2018, Jan Van den Bussche and Marcelo Arenas (Eds.). ACM, 325--340.
[52]
Jonathan Lajus and Hannes Mühleisen. 2014. Efficient data management and statistics with zero-copy integration. In Conference on Scientific and Statistical Database Management, SSDBM '14, Aalborg, Denmark, June 30 - July 02, 2014, Christian S. Jensen, Hua Lu, Torben Bach Pedersen, Christian Thomsen, and Kristian Torp (Eds.). ACM, 12:1--12:10.
[53]
Tianyu Li, Matthew Butrovich, Amadou Ngom, Wan Shen Lim, Wes McKinney, and Andrew Pavlo. 2020. Mainlining Databases: Supporting Fast Transactional Workloads on Universal Columnar Data File Formats. Proc. VLDB Endow. 14, 4 (2020), 534--546.
[54]
Xupeng Li, Bin Cui, Yiru Chen, Wentao Wu, and Ce Zhang. 2017. MLog: Towards Declarative In-Database Machine Learning. Proc. VLDB Endow. 10, 12 (2017), 1933--1936.
[55]
Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geofrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis, Hossein Ahmadi, Dan Delorey, Slava Min, Mosha Pasumansky, and Jeff Shute. 2020. Dremel: A Decade of Interactive SQL Analysis at Web Scale. Proc. VLDB Endow. 13, 12 (2020), 3461--3472.
[56]
Xiangrui Meng, Joseph K. Bradley, Burak Yavuz, Evan R. Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, D. B. Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. 2016. MLlib: Machine Learning in Apache Spark. J. Mach. Learn. Res. 17 (2016), 34:1--34:7. http://jmlr.org/papers/v17/15-237.html
[57]
Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: A Distributed Framework for Emerging AI Applications. In 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018, Carlsbad, CA, USA, October 8--10, 2018, Andrea C. Arpaci-Dusseau and Geoff Voelker (Eds.). USENIX Association, 561--577. https://www.usenix.org/conference/osdi18/presentation/nishihara
[58]
The pandas development team. 2008--2021. pandas.read_sql. https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html. Accessed: 2022-01-27.
[59]
The pandas development team. 2020. pandas-dev/pandas: Pandas.
[60]
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825--2830.
[61]
Devin Petersohn, William W. Ma, Doris Jung Lin Lee, Stephen Macke, Doris Xin, Xiangxi Mo, Joseph Gonzalez, Joseph M. Hellerstein, Anthony D. Joseph, and Aditya G. Parameswaran. 2020. Towards Scalable Dataframe Systems. Proc. VLDB Endow. 13, 11 (2020), 2033--2046. http://www.vldb.org/pvldb/vol13/p2033-petersohn.pdf
[62]
R Core Team. 2021. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
[63]
Mark Raasveldt and Hannes Mühleisen. 2017. Don't Hold My Data Hostage - A Case For Client Protocol Redesign. Proc. VLDB Endow. 10, 10 (2017), 1022--1033.
[64]
Mark Raasveldt and Hannes Mühleisen. 2020. Data Management for Data Science - Towards Embedded Analytics. In 10th Conference on Innovative Data Systems Research, CIDR 2020, Amsterdam, The Netherlands, January 12--15, 2020, Online Proceedings. www.cidrdb.org. http://cidrdb.org/cidr2020/papers/p23-raasveldt-cidr20.pdf
[65]
Matthew Rocklin. 2015. Dask: Parallel computation with blocked algorithms and task scheduling. In Proceedings of the 14th python in science conference. Citeseer.
[66]
Maximilian E. Schüle, Matthias Bungeroth, Alfons Kemper, Stephan Günnemann, and Thomas Neumann. 2019. MLearn: A Declarative Machine Learning Language for Database Systems. In Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning, DEEM@SIGMOD 2019, Amsterdam, The Netherlands, June 30, 2019, Sebastian Schelter, Neoklis Polyzotis, Stephan Seufert, and Manasi Vartak (Eds.). ACM, 7:1--7:4.
[67]
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Anthony, Hao Liu, and Raghotham Murthy. 2010. Hive - a petabyte scale data warehouse using Hadoop. In Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1--6, 2010, Long Beach, California, USA, Feifei Li, Mirella M. Moro, Shahram Ghandeharizadeh, Jayant R. Haritsa, Gerhard Weikum, Michael J. Carey, Fabio Casati, Edward Y. Chang, Ioana Manolescu, Sharad Mehrotra, Umeshwar Dayal, and Vassilis J. Tsotras (Eds.). IEEE Computer Society, 996--1005.
[68]
Itamar Turner-Trauring. 2021. Loading SQL data into Pandas without running out of memory. https://pythonspeed.com/articles/pandas-sql-chunking/. Accessed: 2022-01-27.
[69]
Jinze Wu Yizhou Chen Nick Zrymiak Changbo Qu Lampros Flokas George Chow Jiannan Wang Tianzheng Wang Eugene Wu Qingqing Zhou Xiaoying Wang, Weiyuan Wu. 2021. [Technical Report] ConnectorX: Accelerating Data Loading From Database to Dataframe. http://raw.githubusercontent.com/sfu-db/connector-x/main/assets/Technical_Report__ConnectorX.pdf.
[70]
Matei Zaharia, Ali Ghodsi, Reynold Xin, and Michael Armbrust. 2021. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. In 11th Conference on Innovative Data Systems Research, CIDR 2021, Virtual Event, January 11--15, 2021, Online Proceedings. www.cidrdb.org. http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf

Cited By

View all
  • (2023)A Comparison of End-to-End Decision Forest Inference PipelinesProceedings of the 2023 ACM Symposium on Cloud Computing10.1145/3620678.3624656(200-215)Online publication date: 30-Oct-2023
  • (2023)Teaching Blue Elephants the Maths for Machine LearningProceedings of the Seventh Workshop on Data Management for End-to-End Machine Learning10.1145/3595360.3595852(1-4)Online publication date: 18-Jun-2023

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 15, Issue 11
July 2022
980 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 July 2022
Published in PVLDB Volume 15, Issue 11

Badges

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)93
  • Downloads (Last 6 weeks)3
Reflects downloads up to 21 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2023)A Comparison of End-to-End Decision Forest Inference PipelinesProceedings of the 2023 ACM Symposium on Cloud Computing10.1145/3620678.3624656(200-215)Online publication date: 30-Oct-2023
  • (2023)Teaching Blue Elephants the Maths for Machine LearningProceedings of the Seventh Workshop on Data Management for End-to-End Machine Learning10.1145/3595360.3595852(1-4)Online publication date: 18-Jun-2023

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media