DOI:10.1145/3665939.3665958
research-article

Growing a FLOWER: Building a Diagram Unifying Flow and ER Notation for Data Science

Published: 18 June 2024

Abstract

An ER diagram is a fundamental visual abstraction for designing a database. Modern ER notation has evolved with UML symbols to represent both entities (logical level) and relational tables (physical level). Flow diagrams (flowcharts, process flow), on the other hand, remain an important mechanism for visualizing the main steps of a data processing pipeline. However, in modern data science projects a significant fraction of the data either does not come from databases or is exported outside the database system and processed by Python code, without any data model whatsoever. In this paper, we present a novel diagram that is built from source code, together with its associated browser-based GUI for collaborating on data integration and data preprocessing, mixing diverse data sources and diverse programming languages (mainly Python and SQL). Specifically, our targets are data integration, data cleaning and data transformation, which are needed to derive data sets that can be used as input for a machine learning model. We present a couple of target applications and a preliminary GUI, which partially automates diagram creation. We show that our diagram has promise for understanding, extending and reusing both data preparation source code and data sets.
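To make the idea of a diagram "built from source code" concrete, the following is a minimal sketch and not the paper's tool or algorithm. It assumes a simplified setting in which each top-level assignment in a Python script defines one data set, and it uses the standard `ast` module to recover flow edges between data sets. The sample script, the function name `extract_flow_edges`, and the rule that any `read_*` call with a string argument marks an external source are all illustrative assumptions.

```python
import ast

# Hypothetical preprocessing script; it is only parsed, never executed.
SOURCE = '''
import pandas as pd
customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")
clean = customers.dropna()
joined = clean.merge(orders, on="customer_id")
features = joined.groupby("customer_id").sum()
'''


def extract_flow_edges(source):
    """Return (input, output) pairs for top-level assignments.

    An edge (a, b) means data set b is derived from a, where a is either
    a previously assigned variable or a file passed to a read_* call.
    """
    tree = ast.parse(source)
    known = set()  # variables assigned so far, treated as data set nodes
    edges = []
    for stmt in tree.body:
        if not isinstance(stmt, ast.Assign):
            continue
        # Variable names read on the right-hand side.
        reads = {
            node.id
            for node in ast.walk(stmt.value)
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
        }
        # String literals passed to read_* calls count as external sources
        # (an assumption of this sketch).
        files = set()
        for node in ast.walk(stmt.value):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr.startswith("read_")
            ):
                for arg in node.args:
                    if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
                        files.add(arg.value)
        for target in stmt.targets:
            if isinstance(target, ast.Name):
                for src in sorted(files) + sorted(reads & known):
                    edges.append((src, target.id))
                known.add(target.id)
    return edges


EDGES = extract_flow_edges(SOURCE)
for src, dst in EDGES:
    print(f"{src} -> {dst}")
```

Each printed pair could serve as one arrow in a flow-style diagram, with the variable names as candidate entity boxes. A real tool would also need to handle SQL statements, function calls across files, and in-place mutation, none of which this sketch attempts.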



Published In

HILDA 24: Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics
June 2024
91 pages
ISBN:9798400706936
DOI:10.1145/3665939
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. diagram
  2. Python
  3. ER
  4. database model
  5. SQL
  6. source code

Qualifiers

  • Research-article

Conference

HILDA 24
Acceptance Rates

Overall Acceptance Rate 28 of 56 submissions, 50%
