research-article

Analyzing workload trends for boosting triple stores performance

Published: 18 October 2024

Abstract

The Resource Description Framework (RDF) is widely used to model web data. The scale and complexity of the modeled data pose performance challenges for RDF triple stores. Workload adaption is an important strategy for addressing these challenges at the storage level. Current workload-adaption approaches do not generalize the problem: they optimize only part of the storage layer with respect to the workload (mostly replication). This leaves a large performance gap in other data structures (e.g., indexes and caches) that could benefit heavily from the same workload-adaption strategy. Moreover, most current approaches build the workload statistics cumulatively, so the analysis process cannot tell whether workload items are old or recent. As a result, it fails to capture the temporal trends that occur naturally in user queries, and the analysis lags behind rapid changes in the workload. We present a novel universal adaption approach to the storage management of a distributed RDF store. The system aims to find optimal assignments of data to the different indexes, replications, and the join cache within the limited storage space. We present a cost model based on the workload, which often contains frequent patterns. The workload is analyzed dynamically and continuously to evaluate predefined rules that weigh the benefits and costs of every option for assigning data to the storage structures. The objective is to reduce query execution time by letting the different data containers compete for the limited storage space. By modeling the workload statistics as time series, we can apply well-known smoothing techniques that let the importance of older workload items decay over time. This keeps the universal adaption in tune with potential changes in workload trends.
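The decay idea in the abstract can be illustrated with a small sketch. The abstract does not name the smoothing technique, so simple exponential smoothing is assumed here; the function `smooth_counts`, the factor `alpha`, and the pattern names `p_old` and `p_new` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def smooth_counts(windows, alpha=0.5):
    """Exponentially smooth per-pattern access counts over time windows.

    windows: list of dicts mapping a query-pattern id to its access count
             in that window, oldest window first (assumed input format).
    alpha:   smoothing factor in (0, 1]; larger values favor recent windows.
    """
    scores = defaultdict(float)
    for window in windows:
        # Every known pattern is updated, so patterns that are absent
        # from the current window decay toward zero.
        for pattern in set(scores) | set(window):
            observed = window.get(pattern, 0)
            scores[pattern] = alpha * observed + (1 - alpha) * scores[pattern]
    return dict(scores)


# A pattern that was popular early on versus one that became popular recently.
windows = [{"p_old": 100}, {"p_old": 100}, {"p_new": 60}, {"p_new": 60}]
cumulative = {"p_old": 200, "p_new": 120}  # plain totals over all windows
smoothed = smooth_counts(windows, alpha=0.5)
print(smoothed)
```

With plain totals, `p_old` still ranks first even though it has not been queried in the last two windows. After smoothing, `p_new` ranks first, which is the behavior the abstract attributes to time-aware workload statistics.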

Highlights

A new approach to achieve unified workload awareness in a distributed RDF triple store.
Our triple store dynamically determines its need for indexes, replications, and cache.
The algorithm puts indexes, replication, and cache in one optimization problem.
The importance of data triples is set by the workload and the data structures used.
Smoothing techniques allowed the accumulated workload to follow recent changes.
Multiple types of workload access rules overcame the effect of workload fluctuation.
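
The highlights describe indexes, replication, and cache competing for one storage budget. The sketch below shows one way such a competition could look, assuming a greedy heuristic that ranks candidates by benefit per storage unit. This is not the paper's algorithm or cost model; the `Candidate` fields, the `assign` function, and all example identifiers and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    structure: str  # "index", "replication", or "cache"
    item: str       # hypothetical identifier of the data to store
    benefit: float  # assumed workload-derived saving in query time
    size: int       # storage units required (must be positive)


def assign(candidates, budget):
    """Greedily pick candidates by benefit per storage unit within the budget."""
    chosen, used = [], 0
    ranked = sorted(candidates, key=lambda c: c.benefit / c.size, reverse=True)
    for c in ranked:
        # Skip candidates that bring no benefit or no longer fit.
        if c.benefit > 0 and used + c.size <= budget:
            chosen.append(c)
            used += c.size
    return chosen, used


candidates = [
    Candidate("index", "POS:worksFor", 90.0, 30),
    Candidate("replication", "border:dept7", 40.0, 20),
    Candidate("cache", "join:advisor-takesCourse", 50.0, 10),
    Candidate("index", "OSP:name", 10.0, 25),
]
chosen, used = assign(candidates, budget=50)
for c in chosen:
    print(c.structure, c.item)
```

Because all three structure types are ranked in a single list, a cache entry can displace an index or a replica when it saves more query time per unit of storage. A greedy ranking is only a heuristic for this knapsack-style problem and does not guarantee an optimal assignment.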


    Published In

    Information Systems, Volume 125, Issue C
    Nov 2024
    260 pages

    Publisher

    Elsevier Science Ltd.

    United Kingdom

    Author Tags

    1. RDF
    2. Triple-stores
    3. Workload adaption
