research-article

Dscaler: synthetically scaling a given relational database

Published: 01 October 2016

Abstract

The Dataset Scaling Problem (DSP) defined in previous work states: Given an empirical set of relational tables D and a scale factor s, generate a database state D̃ that is similar to D but s times its size. A DSP solution is useful for application development (s < 1), scalability testing (s > 1) and anonymization (s = 1). Current solutions assume all table sizes scale by the same ratio s.
However, a real database tends to have tables that grow at different rates. This paper therefore considers non-uniform scaling (nuDSP), a DSP generalization where, instead of a single scale factor s, tables can scale by different factors.
Dscaler is the first solution for nuDSP. It follows previous work in achieving similarity by reproducing correlation among the primary and foreign keys. However, it introduces the concept of a correlation database that captures fine-grained, per-tuple correlation.
Experiments with well-known real and synthetic datasets D show that Dscaler produces D̃ with greater similarity to D than state-of-the-art techniques. Here, similarity is measured by number of tuples, frequency distribution of foreign key references, and multi-join aggregate queries.
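
One of the similarity measures named above, the frequency distribution of foreign key references, can be illustrated with a small sketch. This is not Dscaler's algorithm or the paper's evaluation code: the table names (`users`, `orders`), the toy key lists, and the use of total variation distance to compare two distributions are all assumptions made for illustration only.

```python
from collections import Counter


def fk_reference_distribution(parent_keys, fk_values):
    """Map k -> fraction of parent tuples referenced exactly k times."""
    refs = Counter(fk_values)  # parent key -> number of referencing child tuples
    degree_counts = Counter(refs.get(key, 0) for key in parent_keys)
    total = len(parent_keys)
    return {k: n / total for k, n in degree_counts.items()}


def distribution_distance(p, q):
    """Total variation distance (an assumed metric, not the paper's)."""
    ks = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in ks)


# Hypothetical original database: 4 users, 4 orders referencing them.
orig_users = [1, 2, 3, 4]
orig_orders_fk = [1, 1, 2, 3]

# Hypothetical non-uniformly scaled database: users x2, orders x3.
scaled_users = list(range(1, 9))
scaled_orders_fk = [1, 1, 1, 2, 2, 3, 5, 5, 5, 6, 6, 7]

orig_dist = fk_reference_distribution(orig_users, orig_orders_fk)
scaled_dist = fk_reference_distribution(scaled_users, scaled_orders_fk)
distance = distribution_distance(orig_dist, scaled_dist)
print(orig_dist, scaled_dist, distance)
```

In this toy case the two tables scale by different factors, so the average number of orders per user changes and the distributions cannot match exactly; a distance of 0 would mean the reference frequencies were reproduced perfectly.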

Information & Contributors

Information

Published In

Proceedings of the VLDB Endowment  Volume 9, Issue 14
October 2016
96 pages
ISSN:2150-8097
Publisher

VLDB Endowment
