Exploring Big Data with Helix: Finding Needles in a Big Haystack

Published: 18 February 2015

Abstract

While much work has focused on efficient processing of Big Data, little work considers how to understand them. In this paper, we describe Helix, a system for guided exploration of Big Data. Helix provides a unified view of sources, ranging from spreadsheets and XML files with no schema, all the way to RDF graphs and relational data with well-defined schemas. Helix users explore these heterogeneous data sources through a combination of keyword searches and navigation of linked web pages that include information about the schemas, as well as data and semantic links within and across sources. At a technical level, the paper describes the research challenges involved in developing Helix, along with a set of real-world usage scenarios and the lessons learned.

Published In

ACM SIGMOD Record, Volume 43, Issue 4
December 2014
54 pages
ISSN: 0163-5808
DOI: 10.1145/2737817

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

• Column
