- poster, September 2024
Integrity Verification in Cloud Data Lakes
SYSTOR '24: Proceedings of the 17th ACM International Systems and Storage Conference, Page 201
https://doi.org/10.1145/3688351.3689166
Cloud data lakes support storage and querying at scale. However, traditional data integrity methods do not apply to them due to a different system model. We propose a novel completeness verification protocol based on a data lake partitioning scheme.
- poster, September 2024
Coverage-Based Caching in Cloud Data Lakes
SYSTOR '24: Proceedings of the 17th ACM International Systems and Storage Conference, Page 193
https://doi.org/10.1145/3688351.3689165
Cloud data lakes are a modern approach to handling large volumes of data. They separate the compute and storage layers, making them highly scalable and cost-effective. However, query performance in cloud data lakes could be faster, and various efforts ...
- short-paper, June 2024
Rethinking Table Retrieval from Data Lakes
aiDM '24: Proceedings of the Seventh International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, Article No.: 2, Pages 1–5
https://doi.org/10.1145/3663742.3663972
Table retrieval from data lakes has recently become important for many downstream tasks, including data discovery and table question answering. Existing table retrieval approaches estimate each table's relevance to a particular information need and ...
- research-article, June 2023
Steered Training Data Generation for Learned Semantic Type Detection
Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 2, Article No.: 201, Pages 1–25
https://doi.org/10.1145/3589786
In this paper, we introduce STEER to adapt learned semantic type extraction approaches to a new, unseen data lake. STEER provides a data programming framework for semantic labeling which is used to generate new labeled training data with minimal ...
- extended-abstract, June 2023
Analyzing large-scale genomic data with cloud data lakes
SYSTOR '23: Proceedings of the 16th ACM International Conference on Systems and Storage, Page 142
https://doi.org/10.1145/3579370.3594750
In recent years there is huge influx of genomic data and a growing need for its analysis, yet existing genomic databases do not allow easy accessibility. We developed a pipeline that continuously pre-processes raw human genetic data. The data is then ...
- short-paper, June 2023
DIALITE: Discover, Align and Integrate Open Data Tables
SIGMOD '23: Companion of the 2023 International Conference on Management of Data, Pages 187–190
https://doi.org/10.1145/3555041.3589732
We demonstrate a novel table discovery pipeline called DIALITE that allows users to discover, integrate and analyze open data tables. DIALITE has three main stages. First, it allows users to discover tables from open data platforms using state-of-the-art ...
- research-article, May 2023
SANTOS: Relationship-based Semantic Table Union Search
- Aamod Khatiwada,
- Grace Fan,
- Roee Shraga,
- Zixuan Chen,
- Wolfgang Gatterbauer,
- Renée J. Miller,
- Mirek Riedewald
Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 1, Article No.: 9, Pages 1–25
https://doi.org/10.1145/3588689
Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce ...
- research-article, January 2022
An overview of the 2022 NISO plus conference: Global conversations/Global Connections
Information Services and Use (INSU), Volume 42, Issue 3-4, Pages 327–376
https://doi.org/10.3233/ISU-220178
This paper offers an overview of some of the highlights of the 2022 NISO Plus Annual Conference that was held virtually from February 15 - February 18, 2022. This was the third such conference and the second to be held in a completely virtual ...
- short-paper, December 2020
Data lakes for digital humanities
DTUC '20: Proceedings of the 2nd International Conference on Digital Tools & Uses Congress, Article No.: 6, Pages 1–4
https://doi.org/10.1145/3423603.3424004
Traditional data in Digital Humanities projects bear various formats (structured, semi-structured, textual) and need substantial transformations (encoding and tagging, stemming, lemmatization, etc.) to be managed and analyzed. To fully master this ...
- research-article, May 2020
Finding Related Tables in Data Lakes for Interactive Data Science
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Pages 1951–1966
https://doi.org/10.1145/3318464.3389726
Many modern data science applications build on data lakes, schema-agnostic repositories of data files and data products that offer limited organization and management capabilities. There is a need to build data lake search capabilities into data science ...
- short-paper, May 2020
Organizing Data Lakes for Navigation
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Pages 1939–1950
https://doi.org/10.1145/3318464.3380605
We consider the problem of creating an effective navigation structure over a data lake. We define an organization as a navigation graph that contains nodes representing sets of attributes within a data lake and edges indicating subset relationships ...
- short-paper, December 2019
Dredging a data lake: decentralized metadata extraction
Middleware '19: Proceedings of the 20th International Middleware Conference Doctoral Symposium, Pages 51–53
https://doi.org/10.1145/3366624.3368170
The rapid generation of data from distributed IoT devices, scientific instruments, and compute clusters presents unique data management challenges. The influx of large, heterogeneous, and complex data causes repositories to become siloed or generally ...
- research-article, December 2019
Serverless Workflows for Indexing Large Scientific Data
- Tyler J. Skluzacek,
- Ryan Chard,
- Ryan Wong,
- Zhuozhao Li,
- Yadu N. Babuji,
- Logan Ward,
- Ben Blaiszik,
- Kyle Chard,
- Ian Foster
WOSC '19: Proceedings of the 5th International Workshop on Serverless Computing, Pages 43–48
https://doi.org/10.1145/3366623.3368140
The use and reuse of scientific data is ultimately dependent on the ability to understand what those data represent, how they were captured, and how they can be used. In many ways, data are only as useful as the metadata available to describe them. ...
- research-article, June 2019
JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes
SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data, Pages 847–864
https://doi.org/10.1145/3299869.3300065
We present a new solution for finding joinable tables in massive data lakes: given a table and one join column, find tables that can be joined with the given table on the largest number of distinct values. The problem can be formulated as an overlap set ...
- short-paper, June 2019
DWS: a data placement approach for smart grid ecosystems
IDEAS '19: Proceedings of the 23rd International Database Applications & Engineering Symposium, Article No.: 39, Pages 1–5
https://doi.org/10.1145/3331076.3331126
In Smart grid ecosystems, it is important to carefully choose the placement of the datasets across different kind of big data systems in order to achieve high performance of the workloads and conformity with the business and data ecosystem. Our approach ...
- research-article, November 2016
The next information architecture evolution: the data lake wave
MEDES: Proceedings of the 8th International Conference on Management of Digital EcoSystems, Pages 174–180
https://doi.org/10.1145/3012071.3012077
Data warehouses and data marts have long been considered as the unique solution for providing end-users with decisional information. More recently, data lakes have been proposed in order to govern data swamps. However, no formal definition has been ...
- research-article, June 2016
CLAMS: Bringing Quality to Data Lakes
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data, Pages 2089–2092
https://doi.org/10.1145/2882903.2899391
With the increasing incentive of enterprises to ingest as much data as they can in what is commonly referred to as "data lakes", and with the recent development of multiple technologies to support this "load-first" paradigm, the new environment presents ...
- research-article, June 2016
Goods: Organizing Google's Datasets
- Alon Halevy,
- Flip Korn,
- Natalya F. Noy,
- Christopher Olston,
- Neoklis Polyzotis,
- Sudip Roy,
- Steven Euijong Whang
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data, Pages 795–806
https://doi.org/10.1145/2882903.2903730
Enterprises increasingly rely on structured datasets to run their businesses. These datasets take a variety of forms, such as structured files, databases, spreadsheets, or even services that provide access to the data. The datasets often reside in ...