Abstract
Obtaining valuable insights and actionable knowledge from data requires cross-analysis of domain data typically coming from various sources. Doing so, inevitably imposes burdensome processes of unifying different data formats, discovering integration paths, and all this given specific analytical needs of a data analyst. Along with large volumes of data, the variety of formats, data models, and semantics drastically contribute to the complexity of such processes. Although there have been many attempts to automate various processes along the Big Data pipeline, no unified platforms accessible by users without technical skills (like statisticians or business analysts) have been proposed. In this paper, we present a Big Data integration platform (Quarry) that uses hypergraph-based metadata to facilitate (and largely automate) the integration of domain data coming from a variety of sources, and provides an intuitive interface to assist end users both in: (1) data exploration with the goal of discovering potentially relevant analysis facets, and (2) consolidation and deployment of data flows which integrate the data, and prepare them for further analysis (descriptive or predictive), visualization, and/or publishing. We validate Quarry’s functionalities with the use case of World Health Organization (WHO) epidemiologists and data analysts in their fight against Neglected Tropical Diseases (NTDs).
Similar content being viewed by others
Notes
WebVOWL:Web-based Visualization of Ontologies - http://vowl.visualdataweb.org/webvowl.html
References
Abiteboul, S., André, B., & Kaplan, D. (2015). Managing your digital life. Communications of the ACM, 58(5), 32–35.
Angles, R., & Gutiérrez, C. (2008). Survey of graph database models. ACM Computing Surveys, 40(1), 1:1–1:39.
Angles, R., Arenas, M., Barceló, P., Hogan, A., Reutter, J. L., & Vrgoc, D. (2017). Foundations of modern query languages for graph databases. ACM Computing Surveys, 50(5), 68:1–68:40.
Bean, R. (2016). Variety, not volume, is driving big data initiatives. URL https://sloanreview.mit.edu/article/variety-not-volume-is-driving-big-data-initiatives
Bilalli, B., Abelló, A., Aluja-Banet, T., Munir, R. F., & Wrembel, R. (2018a). PRESISTANT: data pre-processing assistant. In CAiSE Forum (pp. 57–65).
Bilalli, B., Abelló, A., Aluja-Banet, T., & Wrembel, R. (2018b). Intelligent assistance for data pre-processing. Computer Standards & Interfaces, 57, 101–109.
Bilalli, B., Abelló, A., Aluja-Banet, T., & Wrembel, R. (2019). PRESISTANT: Learning based assistant for data pre-processing. Data & Knowledge Engineering, 123, 100–122.
Calvanese, D., Cogrel, B., Komla-Ebri, S., Kontchakov, R., Lanti, D., Rezk, M., Rodriguez-Muro, M., & Xiao, G. (2017). Ontop: Answering SPARQL queries over relational databases. Semantic Web, 8(3), 471–487.
Ceravolo, P., Azzini, A., Angelini, M., Catarci, T., Cudré-Mauroux, P., Damiani, E., Mazak, A., van Keulen, M., Jarrar, M., Santucci, G., Sattler, K., Scannapieco, M., Wimmer, M., Wrembel, R., & Zaraket, F. A. (2018). Big data semantics. J. Data Semantics, 7(2), 65–85.
Chen, Y., Alspaugh, S., & Katz, R. H. (2012). Interactive analytical processing in big data systems: A cross-industry study of mapreduce workloads. PVLDB, 5(12), 1802–1813.
Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A., Ilyas, I. F., Ouzzani, M., & Tang, N. (2013). Nadeef: A commodity data cleaning system. In SIGMOD (pp. 541–552).
Deng, D., Fernandez, R. C., Abedjan, Z., Wang, S., Stonebraker, M., Elmagarmid, A. K., Ilyas, I. F., Madden, S., Ouzzani, M., & Tang, N. (2017). The data civilizer system. In CIDR.
Doan, A., Halevy, A. Y., & Ives, Z. G. (2012). Principles of Data Integration. Morgan Kaufmann.
Duggan, J., Elmore, A. J., Stonebraker, M., Balazinska, M., Howe, B., Kepner, J., Madden, S., Maier, D., Mattson, T., & Zdonik, S. B. (2015). The BigDAWG Polystore System. SIGMOD Record, 44(2), 11–16.
Fernandez, R. C., & Madden, S. (2019). Termite: a system for tunneling through heterogeneous data. In aiDM@SIGMOD (p. 7:1–7:8).
Fletcher, G. H. L., & Mandreoli, F. (2016). No users no dataspaces! query-driven dataspace orchestration? In SEBD (pp. 150–157).
Franklin, M. J., Halevy, A. Y., & Maier, D. (2005). From databases to dataspaces: a new abstraction for information management. SIGMOD Record, 34(4), 27–33.
Friedman, M., Levy, A. Y., & Millstein, T. D. (1999). Navigational plans for data integration. IJCAI: In.
Garcia-Molina, H., Papakonstantinou, Y., Quass, D., Rajaraman, A., Sagiv, Y., Ullman, J. D., Vassalos, V., & Widom, J. (1997). The TSIMMIS approach to mediation: Data models and languages. Journal of Intelligent Information System, 8(2), 117–132.
Golshan, B., Halevy, A. Y., Mihaila, G. A., & Tan, W. (2017). Data integration: After the teenage years. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2017, Chicago, IL, USA, May 14-19, 2017 (pp. 101–106).
Gorawski, M., & Lorek, M. (2017). Efficient storage, retrieval and analysis of poker hands: An adaptive data framework. Applied Mathematics and Computer Science, 27(4), 713–726.
Gorton, I., & Klein, J. (2015). Distribution, data, deployment: Software architecture convergence in big data systems. IEEE Software, 32(3), 78–85.
Hai, R., Geisler, S., & Quix, C. (2016). Constance: An intelligent data lake system. In SIGMOD (pp. 2097–2100).
Halevy, A. Y., Rajaraman, A., & Ordille, J. J. (2006). Data integration: The teenage years. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006 (pp. 9–16).
Halevy, A. Y., Korn, F., Noy, N. F., Olston, C., Polyzotis, N., Roy, S., & Whang, S. E. (2016). Managing google’s data lake: an overview of the goods system. IEEE Data Eng. Bull., 39(3), 5–14.
Hewasinghage, M., Varga, J., Abelló, A., & Zimányi, E. (2018). Managing polyglot systems metadata with hypergraphs. In ER (pp. 463–478).
Jovanovic, P., Romero, O., Simitsis, A., Abelló, A., & Mayorova, D. (2014a). A requirement-driven approach to the design and evolution of data warehouses. Information Systems, 44, 94–119.
Jovanovic, P., Simitsis, A., & Wilkinson, K. (2014b). Engine independence for logical analytic flows. In ICDE (pp. 1060–1071).
Jovanovic, P., Romero, O., Simitsis, A., & Abelló, A. (2016). Incremental consolidation of data-intensive multi-flows. IEEE Transactions on Knowledge and Data Engineering, 28(5), 1203–1216.
Konstantinou, N., Koehler, M., Abel, E., Civili, C., Neumayr, B., Sallinger, E., Fernandes, A. A. A., Gottlob, G., Keane, J. A., Libkin, L., & Paton, N. W. (2017). The VADA architecture for cost-effective data wrangling. In SIGMOD (pp. 1599–1602).
Lenzerini, M. (2002). Data integration: A theoretical perspective. In PODS (pp. 233–246).
Lerman, K., Minton, S., & Knoblock, C. A. (2003). Wrapper maintenance: A machine learning approach. Journal of Artificial Intelligence Research, 18, 149–181.
Luján-Mora, S., & Trujillo, J. (2006). Applying the UML and the unified process to the design of data warehouses. JCIS, 46(5), 30–58.
Munir, R. F., Nadal, S., Romero, O., Abelló, A., Jovanovic, P., Thiele, M., & Lehner, W. (2018). Intermediate results materialization selection and format for data-intensive flows. Fundam. Inform., 163(2), 111–138.
Nadal, S., Rabbani, K., Romero, O., & Tadesse, S. (2019a). ODIN: A dataspace management system. In ISWC (pp. 185–188).
Nadal, S., Romero, O., Abelló, A., Vassiliadis, P., & Vansummeren, S. (2019b). An integration-oriented ontology to govern evolution in big data ecosystems. Information Systems, 79, 3–19.
Popovic, A., Hackney, R., Tassabehji, R., & Castelli, M. (2018). The impact of big data analytics on firms’ high value business performance. Information Systems Frontiers, 20(2), 209–222.
Priyatna, F., Corcho, Ó., & Sequeda, J. F. (2014). Formalisation and experiences of r2rml-based SPARQL to SQL query translation using morph. In WWW (pp. 479–490).
Quix, C., & Hai, R. (2019). Data lake. Encyclopedia of Big Data Technologies: In.
Rabbani, K. (2019). Supporting the Semi-Automatic Creation of the Target Schema in Data Integration Systems. Master’s thesis, Technische Univesitat Berlin - Universitat Politècnica de Catalunya, BarcelonaTech.
Saltor, F., Castellanos, M., & García-Solaco, M. (1991). Suitability of data models as canonical models for federated databases. SIGMOD Record, 20(4), 44–48.
Sarma, A. D., Dong, X. L., & Halevy, A. Y. (2011). Uncertainty in data integration and dataspace support platforms. In Schema Matching and Mapping (pp. 75–108).
Simitsis, A., Vassiliadis, P., & Sellis, T. K. (2005). State-space optimization of ETL workflows. IEEE Transactions on Knowledge and Data Engineering, 17(10), 1404–1419.
Simitsis, A., Wilkinson, K., Dayal, U., & Hsu, M. (2013). HFMS: managing the lifecycle and complexity of hybrid analytic data flows. In ICDE (pp. 1174–1185).
Skoutas, D., & Simitsis, A. (2007). Ontology-based conceptual design of ETL processes for both structured and semi-structured data. Int. J. Semantic Web Inf. Syst., 3(4), 1–24.
Stonebraker, M. (2019). The Case for Polystores – ACM SIGMOD Blog. [Online; accessed 27. Jun. 2019].
Stonebraker, M., Bruckner, D., Ilyas, I. F., Beskales, G., Cherniack, M., Zdonik, S. B., Pagan, A., & Xu, S. (2013). Data Curation at Scale: The Data Tamer System. In CIDR.
Tadesse, S., Gómez, C., Romero, O., Hose, K., & Rabbani, K. (2019). ARDI: Automatic Generation of RDFS Models from Heterogeneous Data Sources. In: EDOC.
Terrizzano, I. G., Schwarz, P. M., Roth, M., & Colino, J. E. (2015). Data wrangling: The challenging yourney from the wild to the lake. In CIDR.
Touma, R., Romero, O., & Jovanovic, P. (2015). Supporting data integration tasks with semi-automatic ontology construction. In DOLAP (pp. 89–98).
Varga, J., Romero, O., Pedersen, T. B., & Thomsen, C. (2014). Towards next generation BI systems: The analytical metadata challenge. In DaWaK (pp. 89–101).
Wojciechowski, A. (2018). ETL workflow reparation by means of case-based reasoning. Information Systems Frontiers, 20(1), 21–43.
Acknowledgements
We thank Dr. Lise Grout and Dr. Pedro Albajar-Viñas from the Neglected Tropical Diseases (NTD) department at WHO, for providing the use case. This work is partially supported by GENESIS project, funded by the Spanish Ministerio de Ciencia, Innovación y Universidades under project TIN2016-79269-R.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Jovanovic, P., Nadal, S., Romero, O. et al. Quarry: A User-centered Big Data Integration Platform. Inf Syst Front 23, 9–33 (2021). https://doi.org/10.1007/s10796-020-10001-y
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10796-020-10001-y