Abstract
In this paper we extend STHoles, a very successful algorithm that uses query results to build and maintain multi-dimensional histograms of numerical data. Our contribution is a formal definition of extensions of all relevant concepts, such that they are independent of the domain of the data but subsume the STHoles concepts as their numerical specialization. We also derive specializations for the string domain and implement these in a prototype that we use to validate our approach empirically. Our current implementation uses string prefixes as the machinery for describing string ranges. Although weaker than regular expressions, prefixes can be applied very efficiently and can capture interesting ranges in hierarchically structured string domains, such as those of filesystem pathnames and URIs. We base the empirical validation of the approach on existing, publicly available Semantic Web data, where we demonstrate convergence to accurate and efficient histograms.
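To make the idea of prefixes as string ranges concrete, the following is a minimal sketch, not the STRHist implementation. It assumes that a range is represented by a single prefix string; the function names (`contains`, `encloses`, `intersect`, `tightest_range`) and the `example.org` URIs are hypothetical and chosen only for illustration.

```python
import os.path
from typing import List, Optional


def contains(prefix: str, value: str) -> bool:
    """A prefix range contains every string that starts with the prefix."""
    return value.startswith(prefix)


def encloses(outer: str, inner: str) -> bool:
    """Range `outer` encloses range `inner` if `inner` extends `outer`."""
    return inner.startswith(outer)


def intersect(a: str, b: str) -> Optional[str]:
    """Intersection of two prefix ranges.

    Two prefix ranges either nest or are disjoint, so the intersection
    is the longer prefix when one extends the other, and empty (None)
    otherwise.
    """
    if b.startswith(a):
        return b
    if a.startswith(b):
        return a
    return None


def tightest_range(values: List[str]) -> str:
    """Longest prefix shared by all values (character-wise)."""
    return os.path.commonprefix(values)


if __name__ == "__main__":
    # Hypothetical URIs, used only to exercise the functions.
    a = "http://example.org/records/"
    b = "http://example.org/records/2013/"
    c = "http://example.org/authors/"
    print(contains(a, "http://example.org/records/2013/42"))
    print(intersect(a, b))
    print(intersect(b, c))
    print(tightest_range([b + "42", c + "7"]))
```

Under this assumption the nesting behaviour is what makes prefixes a natural fit for hierarchical domains: a longer prefix always describes a sub-range of a shorter one, much as a nested bucket lies inside its parent in the numerical case.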
Notes
1. STRHist is available at https://bitbucket.org/acharal/strhist. For more details on Semagrow, please see http://www.semagrow.eu.
2. Please see http://agris.fao.org for more details on AGRIS. The AGRIS site mentions 7 million distinct publications, but this includes recent additions that are not in the end-2013 data dump used for these experiments.
3. We use the canonical string representation of URIs as defined in Sect. 2, IETF RFC 7320 (http://tools.ietf.org/html/rfc7320).
Acknowledgements
The work described here was partially carried out at the 2014 edition of the International Research-Centred Summer School, held at NCSR ‘Demokritos’, Athens, Greece, 3–30 July 2014. For more details please see http://irss.iit.demokritos.gr
The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007–2013) under grant agreement No. 318497. More details at http://www.semagrow.eu.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Zoulis, N., Mavroudi, E., Lykoura, A., Charalambidis, A., Konstantopoulos, S. (2015). Workload-Aware Self-Tuning Histograms of String Data. In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds) Database and Expert Systems Applications. Globe DEXA 2015. Lecture Notes in Computer Science, vol 9261. Springer, Cham. https://doi.org/10.1007/978-3-319-22849-5_20
DOI: https://doi.org/10.1007/978-3-319-22849-5_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22848-8
Online ISBN: 978-3-319-22849-5
eBook Packages: Computer Science, Computer Science (R0)