research-article

Dscaler: synthetically scaling a given relational database

Editor: Surajit Chaudhuri

Authors: Y. C. Tay

Proceedings of the VLDB Endowment, Volume 9, Issue 14

Pages 1671 - 1682

https://doi.org/10.14778/3007328.3007333

Published: 01 October 2016

Abstract

The Dataset Scaling Problem (DSP) defined in previous work states: Given an empirical set of relational tables D and a scale factor s, generate a database state D̃ that is similar to D but s times its size. A DSP solution is useful for application development (s < 1), scalability testing (s > 1) and anonymization (s = 1). Current solutions assume all table sizes scale by the same ratio s.

However, a real database tends to have tables that grow at different rates. This paper therefore considers non-uniform scaling (nuDSP), a DSP generalization where, instead of a single scale factor s, tables can scale by different factors.
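To make the difference between DSP and nuDSP concrete, the following minimal sketch computes target table sizes under a single scale factor and under per-table factors. The table names, sizes, and factors are hypothetical and are not taken from the paper; the sketch illustrates only the problem statement, not how Dscaler generates tuples.

```python
def uniform_targets(sizes, s):
    """DSP: every table is scaled by the same factor s."""
    return {table: round(n * s) for table, n in sizes.items()}


def nonuniform_targets(sizes, factors):
    """nuDSP: each table is scaled by its own factor."""
    missing = set(sizes) - set(factors)
    if missing:
        raise ValueError(f"no scale factor given for: {sorted(missing)}")
    return {table: round(n * factors[table]) for table, n in sizes.items()}


# Hypothetical schema: tables that grow at different rates.
sizes = {"user": 1000, "photo": 5000, "comment": 20000}
factors = {"user": 2.0, "photo": 3.0, "comment": 5.0}

dsp_sizes = uniform_targets(sizes, 2)
nudsp_sizes = nonuniform_targets(sizes, factors)
```

With a uniform factor the ratio between table sizes is preserved, whereas the per-table factors let the hypothetical comment table grow faster than the user table.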

Dscaler is the first solution for nuDSP. It follows previous work in achieving similarity by reproducing correlation among the primary and foreign keys. However, it introduces the concept of a correlation database that captures fine-grained, per-tuple correlation.

Experiments with well-known real and synthetic datasets D show that Dscaler produces D̃ with greater similarity to D than state-of-the-art techniques. Here, similarity is measured by number of tuples, frequency distribution of foreign key references, and multi-join aggregate queries.
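One of the similarity measures named above, the frequency distribution of foreign key references, can be sketched as follows. This is an illustrative assumption rather than the paper's evaluation code: the function names are hypothetical, and total variation distance is chosen here only as one plausible way to compare two distributions.

```python
from collections import Counter


def reference_frequency(fk_values, parent_keys):
    """Map k -> fraction of parent keys referenced exactly k times."""
    per_key = Counter(fk_values)
    # Parent keys that are never referenced count toward k = 0.
    hist = Counter(per_key.get(key, 0) for key in parent_keys)
    n = len(parent_keys)
    return {k: count / n for k, count in hist.items()}


def distribution_gap(p, q):
    """Total variation distance between two distributions (assumed metric)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


# Hypothetical original table: 4 parent keys, 4 referencing tuples.
original = reference_frequency([1, 1, 2, 3], [1, 2, 3, 4])
# Hypothetical scaled table with twice as many keys and references.
scaled = reference_frequency([1, 1, 2, 3, 5, 5, 6, 7], list(range(1, 9)))
# Hypothetical skewed table: all references point at one key.
skewed = reference_frequency([1, 1, 1, 1], [1, 2, 3, 4])

gap_scaled = distribution_gap(original, scaled)
gap_skewed = distribution_gap(original, skewed)
```

A gap of zero means the scaled table references its parent keys with the same relative frequencies as the original; the skewed table has the same number of tuples but a clearly different reference pattern.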



Published In

Proceedings of the VLDB Endowment, Volume 9, Issue 14
October 2016
96 pages
ISSN: 2150-8097
Editor: Surajit Chaudhuri (Microsoft Research)

Publisher

VLDB Endowment


Qualifiers

Research-article
