research-article

A collaborative framework for tweaking properties in a synthetic dataset

Authors: J. W. Zhang, Yu Wang, Y. C. Tay

Proceedings of the VLDB Endowment, Volume 11, Issue 12

Pages 2010 - 2013

https://doi.org/10.14778/3229863.3236247

Published: 01 August 2018

Abstract

Researchers and developers use benchmarks to compare their algorithms and products. For database systems, a benchmark must have a dataset D. To be application-specific, this dataset D should be empirical. However, a real D may be too small, or too large, for the benchmarking experiments. Therefore, D must first be scaled to the desired size.

Previous related work typically extracts a set of properties Π = {π₁, . . . , π_n} from D, then uses Π to generate the synthetic D~. Π may thus ensure D~ is similar to D. This approach of having some monolithic software enforce properties π₁, . . . , π_n becomes increasingly intractable as n increases. Our demonstration will present ASPECT, a framework that takes a different approach.

With ASPECT, a tool S₀ first scales the dataset to the desired size. The resulting D~ can then be tweaked by tools T₁, . . . , T_n, where T_k enforces π_k in D~.
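
The split between a size scaler and per-property tweaking tools can be sketched as follows. This is a minimal illustration, not ASPECT's actual interface: the toy dataset (a list of numbers), the names `scale_size`, `make_mean_tool`, `make_max_tool` and `run_pipeline`, and the two example properties (mean and maximum) are all assumptions made for the example.

```python
from typing import Callable, List

Dataset = List[float]
Tool = Callable[[Dataset], Dataset]  # a tweaking tool T_k maps D~ to a tweaked D~


def scale_size(d: Dataset, target_size: int) -> Dataset:
    """Hypothetical size scaler S0: repeat the rows cyclically up to target_size."""
    return [d[i % len(d)] for i in range(target_size)]


def make_mean_tool(target_mean: float) -> Tool:
    """Hypothetical T_1: shift every value so the mean equals target_mean."""
    def tool(d: Dataset) -> Dataset:
        shift = target_mean - sum(d) / len(d)
        return [x + shift for x in d]
    return tool


def make_max_tool(target_max: float) -> Tool:
    """Hypothetical T_2: clip values, then set the largest one to target_max."""
    def tool(d: Dataset) -> Dataset:
        out = [min(x, target_max) for x in d]
        out[out.index(max(out))] = target_max
        return out
    return tool


def run_pipeline(d: Dataset, target_size: int, tools: List[Tool]) -> Dataset:
    """Scale first, then let each tool tweak the scaled dataset in turn."""
    synthetic = scale_size(d, target_size)
    for tool in tools:
        synthetic = tool(synthetic)
    return synthetic


if __name__ == "__main__":
    original = [1.0, 2.0, 3.0]
    tools = [make_mean_tool(2.5), make_max_tool(4.0)]
    print(run_pipeline(original, 6, tools))
```

Because each tool only needs to accept and return a dataset, a new property can be supported by adding one more function to the list instead of modifying a monolithic generator.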

At the demonstration, a visitor has a choice of (i) D, (ii) size scaler S₀, (iii) the subset of properties to enforce, and (iv) the order of applying the tools for the chosen properties. The visitor can then see the enforcement error for each π_k and the running time for each T_k.
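
The effect of tool order on enforcement error and running time can be illustrated with a small sketch. Everything here is hypothetical: the `Property` class, the toy mean and maximum tools, and the definition of enforcement error as the absolute difference between the measured and target values are assumptions for illustration, not the metrics or tools used by ASPECT.

```python
import time
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, Dict, List, Sequence, Tuple

Dataset = List[float]


@dataclass(frozen=True)
class Property:
    """A toy property pi_k paired with a toy tool T_k (both hypothetical)."""
    name: str
    measure: Callable[[Dataset], float]            # evaluates pi_k on a dataset
    enforce: Callable[[Dataset, float], Dataset]   # T_k: tweak toward the target


def mean(d: Dataset) -> float:
    return sum(d) / len(d)


def enforce_mean(d: Dataset, target: float) -> Dataset:
    shift = target - mean(d)
    return [x + shift for x in d]


def enforce_max(d: Dataset, target: float) -> Dataset:
    out = [min(x, target) for x in d]
    out[out.index(max(out))] = target
    return out


MEAN = Property("mean", mean, enforce_mean)
MAX = Property("max", max, enforce_max)


def apply_in_order(
    synthetic: Dataset, targets: Dict[str, float], order: Sequence[Property]
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Apply the tools in the given order; return per-property error and time."""
    timings: Dict[str, float] = {}
    for prop in order:
        start = time.perf_counter()
        synthetic = prop.enforce(synthetic, targets[prop.name])
        timings[prop.name] = time.perf_counter() - start
    # Errors are measured on the final dataset, after all tools have run.
    errors = {p.name: abs(p.measure(synthetic) - targets[p.name]) for p in order}
    return errors, timings


if __name__ == "__main__":
    scaled = [0.0, 2.0, 4.0, 6.0]          # stands in for the output of S0
    targets = {"mean": 5.0, "max": 7.0}    # stands in for values extracted from D
    for order in permutations([MEAN, MAX]):
        errors, timings = apply_in_order(scaled, targets, order)
        print([p.name for p in order], errors, timings)
```

In this toy setup, applying the mean tool before the maximum tool leaves a residual error on the mean, while the reverse order leaves one on the maximum. This is one reason the order of tools may matter when several properties are enforced on the same dataset.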

A video of the demonstration is presented here: http://scaler.d2.comp.nus.edu.sg/

Published In

Proceedings of the VLDB Endowment, Volume 11, Issue 12

August 2018

426 pages

ISSN: 2150-8097

Editors: Sihem Amer-Yahia (University of Grenoble Alpes, CNRS), Jian Pei (Simon Fraser University)

Publisher: VLDB Endowment
