Abstract
In the era of intelligent Internet, the management and analysis of massive spatio-temporal data is one of the important links to realize intelligent applications and build smart cities, in which the interaction of multi-source data is the basis of realizing spatio-temporal data management and analysis. As an important carrier to achieve the interactive calculation of massive data, Flink provides the advanced Operator Join to facilitate user program development. In a Flink job with multi-source data connection operations, the selection of join sequences and the data communication in the repartition phase are both key factors that affect the efficiency of the job. However, Flink does not provide any optimization mechanism for the two factors, which in turn leads to low job efficiency. If the enumeration method is used to find the optimal join sequence, the result will not be obtained in polynomial time, so the optimization effect cannot be achieved. We investigate the above problems, design and implement a more advanced Operator joinTree that can support multi-source data connection in Flink, and introduce two optimization strategies into the Operator. In summary, the advantages of our work are highlighted as follows: (1) the Operator enables Flink to support multi-source data connection operation, and reduces the amount of calculation and data communication by introducing lightweight optimization strategies to improve job efficiency; (2) with the optimization strategy for join sequence, the total running time can be reduced by 29% and the data communication can be reduced by 34% compared with traditional sequential execution; (3) the optimization strategy for data repartition can further enable the job to bring 35% performance improvement, and in the average case can reduce the data communication by 43%.
Similar content being viewed by others
Data availability
All data in the experiment is authoritative and available.
Code availability
All the codes in this research are available.
References
Isaksen ET, Johansen BG (2021) Congestion pricing, air pollution, and individual-level behavioral responses. Memorandum
Ye Y, Wang G, Chen L, Wang H (2015) Graph similarity search on large uncertain graph databases. Vldb Journal 24(2):271–296
Delianidi M, Salampasis M, Diamantaras K, Siomos, T, Karaveli I (2021) A graph-based method for session-based recommendations
Ye Y, Xiang L, Chen L, Sun Y, Wang G (2016) Rsknn: knn search on road networks by incorporating social influence. IEEE Transactions on Knowledge & Data Engineering 28(6):1575–1588
Yuan Y, Lian X, Wang G, Chen L, Ma Y, Wang Y (2019) Weight-constrained route planning over time-dependent graphs. 2019 IEEE 35th international conference on data engineering (ICDE)
Wang Y, Yuan Y, Wang H, Zhou X, Mu C, Wang G (2021) Constrained route planning over large multi-modal time-dependent networks. ICDE, 313–324
Carbone P, Katsifodimos A, Kth, Sweden S, Tzoumas K (2015) Apache flink : Stream and batch processing in a single engine
Failure H, Failure H, Access SD, Access SD, Sets LD, Sets LD, Model SC, Model SC, Computation M, Computation M (2007) The hadoop distributed file system: Architecture and design. Hadoop Project Website 11(11):1–10
Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: Cluster computing with working sets
Scheufele W, Moerkotte G, Seminargebaude A (1997) Constructing optimal bushy processing trees for join queries is np-hard (extended abstract)
Dittrich J, Quiané-Ruiz J, Jindal A, Kargin Y, Setty V, Schad J (2010) Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). Proc. VLDB Endow 3(1):518–529
Eltabakh MY, Tian Y, Özcan F, Gemulla R, Krettek A, McPherson J (2011) Cohadoop: Flexible data placement and its exploitation in hadoop. Proc. VLDB Endow 4(9):575–585
Kimmett B, Thomo A, Venkatesh S (2014) Three-way joins on mapreduce: An experimental study, 227–232
Afrati FN, Ullman JD (2011) Optimizing multiway joins in a map-reduce environment. IEEE Transactions on Knowledge and Data Engineering 23(9):1282–1298
Leis V, Radke B, Gubichev A, Mirchev A, Boncz PA, Kemper A, Neumann T (2018) Query optimization through the looking glass, and what we found running the join order benchmark. VLDB J 27(5):643–668
Li N, Liu Y, Dong Y, Gu J (2008) Application of ant colony optimization algorithm to multi-join query optimization 5370:189–197
Kadkhodaei H, Mahmoudi F (2011) A combination method for join ordering problem in relational databases using genetic algorithm and ant colony, 312–317
A LD, A GW, A JX, A XW, A SH, B RZ (2012) Commapreduce: An improvement of mapreduce with lightweight communication mechanisms. In: International conference on database systems for advanced applications, pp. 224–247
Michael L, Nejdl W, Papapetrou O, Siberski W (2007) Improving distributed join efficiency with extended bloom filter operations. In: 21st international conference on advanced information networking and applications (AINA 2007)
Selinger PG, Astrahan MM, Chamberlin DD, Lorie, RA, Price TG (1979) Access path selection in a relational database management system, 23–34
Vance B, Maier D (1996) Rapid bushy join-order optimization with cartesian products, 35–46
Ahmed R, Sen R, Poess M, Chakkappen S (2014) Of snowstorms and bushy trees. Proc. VLDB Endow 7(13):1452–1461
Blanas S, Li Y, Patel JM (2011) Design and evaluation of main memory hash join algorithms for multi-core cpus, 37–48
Stutzle T, Hoos H (1999) Improving the ant system: A detailed report on the max-min ant system
Barata M, Bernardino J, Furtado P (2015) An overview of decision support benchmarks: Tpc-ds. TPC-H and SSB 353:619–628
Funding
This research was supported by the National Key R&D Program of China under Grant No. 2018YFB1004402; and the NSFC under Grant No. 61872072, 62072087, 61772124, 61932004, 61732003, and 61729201; and the Fundamental Research Funds for the Central Universities under Grant No. N2016009.
Author information
Authors and Affiliations
Contributions
Conceptualization, Hangxu Ji; software, Hangxu Ji; methodology, Hangxu Ji and Yuhai Zhao; supervision, Hangxu Ji and Shiye Wang; validation, Hangxu Ji and Shiye Wang; writing-original draft, Hangxu Ji; writing-review and editing, Gang Wu, George Y. Yuan, and Guoren Wang.
Corresponding author
Ethics declarations
Ethics approval
This article does not contain any studies involving human participants and/or animals by any of the authors.
Consent to participate
All authors have agreed to participate in the research described in this manuscript.
Consent for publication
All authors have read and agreed to the published version of the manuscript.
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ji, H., Wu, G., Zhao, Y. et al. joinTree: A novel join-oriented multivariate operator for spatio-temporal data management in Flink. Geoinformatica 27, 107–132 (2023). https://doi.org/10.1007/s10707-022-00470-5
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10707-022-00470-5