Abstract
Recent demands for querying big data have revealed various shortcomings of traditional database systems. This, in turn, has led to the emergence of a new kind of query mode, the approximate query. Online aggregation is a sample-based technology for approximate querying, and it has become indispensable in today's era of information explosion. Online aggregation continuously reports an approximate result together with an error estimate (usually a confidence interval) until all data are processed. This survey mainly aims at elucidating the two most critical steps of online aggregation: the sampling mechanism and the error estimation method. With the development of MapReduce, researchers have tried to implement online aggregation in the MapReduce framework. We also briefly introduce some implementations of online aggregation in MapReduce and evaluate their features, strengths, and drawbacks. Finally, we disclose some existing challenges in online aggregation, which need the attention of the research community and application designers.
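To make the idea in the abstract concrete, the following is a minimal sketch of online aggregation for an AVG query, not a method taken from the survey. It assumes the data fit in memory, that scanning the rows in a shuffled order yields a uniform random sample at every prefix, and that a normal-approximation (CLT) confidence interval with a finite population correction is acceptable. The function name `online_avg` and its parameters `report_every`, `z`, and `seed` are hypothetical and chosen for illustration only.

```python
import math
import random


def online_avg(values, report_every=1000, z=1.96, seed=0):
    """Yield (rows_seen, estimate, half_width) while scanning rows in random order.

    Illustrative assumptions: in-memory data, a shuffled scan as the sampling
    mechanism, and a CLT-based interval as the error estimate.
    """
    rng = random.Random(seed)
    order = list(values)
    rng.shuffle(order)  # random scan order makes every prefix a uniform sample
    total = len(order)
    n, mean, m2 = 0, 0.0, 0.0  # Welford running mean and sum of squared deviations
    for x in order:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % report_every == 0 or n == total:
            if n > 1:
                sample_var = m2 / (n - 1)
                # Finite population correction: the interval shrinks to zero
                # once every row has been processed.
                fpc = (total - n) / total
                half_width = z * math.sqrt(sample_var / n * fpc)
            else:
                half_width = float("inf")
            yield n, mean, half_width


if __name__ == "__main__":
    data = [float(i) for i in range(10000)]
    for rows_seen, estimate, half_width in online_avg(data):
        print(f"{rows_seen:>6} rows: AVG ~ {estimate:.2f} +/- {half_width:.2f}")
```

A caller can stop consuming the generator as soon as the reported half-width is small enough, which is the early-termination behaviour that makes online aggregation attractive for large data sets.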
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grant No. 61772289 and the Fundamental Research Funds for the Central Universities.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Li, Y., Wen, Y., Yuan, X. (2018). Online Aggregation: A Review. In: Meng, X., Li, R., Wang, K., Niu, B., Wang, X., Zhao, G. (eds) Web Information Systems and Applications. WISA 2018. Lecture Notes in Computer Science, vol 11242. Springer, Cham. https://doi.org/10.1007/978-3-030-02934-0_10
DOI: https://doi.org/10.1007/978-3-030-02934-0_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02933-3
Online ISBN: 978-3-030-02934-0
eBook Packages: Computer Science, Computer Science (R0)