Cocoon: Semantic Table Profiling Using Large Language Models
Data profilers play a crucial role in the preprocessing phase of data analysis by identifying quality issues such as missing, extreme, or erroneous values. Traditionally, profilers have relied solely on statistical methods, which lead to high false ...
Growing a FLOWER: Building a Diagram Unifying Flow and ER Notation for Data Science
An ER diagram is a fundamental visual abstraction to design a database. Modern ER notation has evolved with UML symbols to represent both entities (logical level) and relational tables (physical level). On the other hand, flow diagrams (flowcharts, ...
It Took Longer than I was Expecting: Why is Dataset Search Still so Hard?
Dataset search is a long-standing problem across both industry and academia. While most industry tools focus on identifying one or more datasets matching a user-specified query, most recent academic papers focus on the subsequent problems of join and ...
Transparent Data Preprocessing for Machine Learning
Data preprocessing is an important task in machine learning that can significantly improve model outcomes. However, evaluating the impact of data preprocessing is often difficult. There is a need for tools that make it transparent to the user how ...
Key Insights from a Feature Discovery User Study
Multiple works in data management research focus on automating the processes of data augmentation and feature discovery to save users from having to perform these tasks manually. Yet, this automation often leads to a disconnect with the users, as it ...
Pipe(line) Dreams: Fully Automated End-to-End Analysis and Visualization
We exploit large language models (LLMs) to automate the end-to-end process of descriptive analytics and visualization. A user simply declares who they are and provides their data set. Our tool LLM4Vis sets analysis goals or metrics, generates code to ...
CopycHats: Question Sequencing with Artificial Agents
Schema Matching, the task of finding correspondences among attributes of different schemata, plays an important role in data integration. The task has been extensively researched, leading to the development of multiple algorithmic approaches, many of ...
Guided Querying over Videos using Autocompletion Suggestions
A critical challenge with querying video data is that the user is often unaware of the contents of the video, its structure, and the exact terminology to use in the query. While these problems exist in exploratory querying settings over traditional ...
Drag, Drop, Merge: A Tool for Streamlining Integration of Longitudinal Survey Instruments
- Pratik Pokharel
- Juseung Lee
- Oliver Kennedy
- Marianthi Markatou
- Andrew Talal
- Jeff Good
- Raktim Mukhopadhyay
We explore data management for longitudinal study survey instruments: (i) Survey instrument evolution presents a unique data integration challenge; and (ii) Longitudinal study data frequently requires repeated, task-specific integration efforts. We ...
More of that, please: Domain Adaptation of Information Extraction through Examples & Feedback
Automatic information extraction, e.g., into a tabular format, is crucial for leveraging knowledge in large text collections. Yet, creating such extraction pipelines for custom target attributes can cause high overheads, while off-the-shelf tools might ...
Towards Extending XAI for Full Data Science Pipelines
Data preprocessing and engineering are essential parts of any AI system, as indicated by the current trend of data-centric AI. However, until now, explainability efforts have almost exclusively focused on models. We propose explanations for preprocessing ...
Causal Dataset Discovery with Large Language Models
Causal data discovery is crucial in scientific research because it uncovers causal links among a variety of observed variables. Causal dataset discovery is the task of identifying datasets that contain columns that have causal relationships with columns in a ...
LLMs as an Interactive Database Interface for Designing Large Queries
Text2SQL is typically considered a one-shot process where the user gives a natural language query and receives an SQL query in return. This approach is fraught with potential concerns, such as syntactical errors, logical mismatches, and schema ...