Abstract
A key property of Linked Data is the representation and publication of data as interconnected labelled graphs where different resources linked to each other form a network of meaningful information. Searching these important relationships between resources – within single or distributed graphs – can be reduced to a pathfinding or navigation problem, i.e., looking for chains of intermediate nodes. SPARQL1.1, the current standard query language for RDF-based Linked Data defines a construct – called Property Paths (PPs) – to navigate between entities within a single graph. Since Linked Data technologies are naturally aimed at decentralised scenarios, there are many cases where centralising this data is not feasible or even not possible for querying purposes. To address these problems, we propose a SPARQL PP-based graph processing approach – dubbed DpcLD – where users can execute SPARQL PP queries and find paths distributed across multiple, connected graphs exposed as SPARQL endpoints. To execute the distributed path queries we implemented an index-free, cache-based query engine that communicates with a shared algorithm running on each remote endpoint, and computes the distributed paths. In this paper, we highlight the way in which this approach exploits and aggregates partial paths, within a distributed environment, to produce complete results. We perform extensive experiments to demonstrate the performance of our approach on two datasets: One representing 10 million triples from the DBPedia SPARQL benchmark, and another full benchmark dataset corresponding to 124 million triples. We also perform a scalability test of our approach using real-world genomics datasets distributed across multiple endpoints. We compare our distributed approach with other distributed and centralized pathfinding approaches, showing that it outperforms other distributed approaches by orders of magnitude, and provides a good trade-off for cases when the data cannot be centralised.
Similar content being viewed by others
Notes
LOD Cloud: https://lod-cloud.net/
Bio2RDF data portal: https://download.bio2rdf.org/files/release/3/release.html
In our stress testing, for a highly complex path query with over 400K distributed paths, the cache size reached 1.5GB.
We have explicitly asked authors for the availability of their code and/or data used in the evaluation
References
Aluċ, G., Hartig, O., Özsu, M. T., Daudjee, K.: Diversified Stress Testing of Rdf Data Management Systems. In: International Semantic Web Conference, pp. 197–212. Springer (2014)
Anyanwu, K., Maduko, A., Sheth, A.: Sparq2l: towards support for subgraph extraction queries in rdf databases. In: Proceedings of the 16th international conference on World Wide Web, pp 797–806. ACM (2007)
Arenas, M., Conca, S., Pérez, J.: Counting beyond a yottabyte, or how sparql 1.1 property paths will prevent adoption of the standard. In: Proceedings of the 21st international conference on World Wide Web, pp 629–638 (2012)
Bai, Y., Wang, C., Ying, X.: Para-g: Path pattern query processing on large graphs. World Wide Web 20(3), 515–541 (2017)
Beek, W., Rietveld, L., Bazoobandi, H. R., Wielemaker, J., Schlobach, S.: Lod Laundromat: a Uniform Way of Publishing Other People’s Dirty Data. In: International Semantic Web Conference, pp. 213–228. Springer (2014)
Buneman, P., Cong, G., Fan, W., Kementsietsidis, A.: Using partial evaluation in distributed query evaluation. In: Proceedings of the 32nd international conference on Very large data bases, pp 211–222 (2006)
Buneman, P., Cong, G., Fan, W., Kementsietsidis, A.: Using partial evaluation in distributed query evaluation. In: Proceedings of the 32nd international conference on Very large data bases, pp 211–222. VLDB Endowment (2006)
Clark, J., DeRose, S., et al.: Xml path language (xpath) version 1.0 (1999)
Cong, G., Fan, W., Kementsietsidis, A.: Distributed query evaluation with performance guarantees. In: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pp 509–520 (2007)
Cong, G., Fan, W., Kementsietsidis, A., Li, J., Liu, X.: Partial evaluation for distributed xpath query processing and beyond. ACM Trans. Database Syst. (TODS) 37(4), 1–43 (2012)
Cudré-Mauroux, P., Enchev, I., Fundatureanu, S., Groth, P., Haque, A., Harth, A., Keppmann, F. L., Miranker, D., Sequeda, J. F., Wylot, M.: Nosql Databases for Rdf: an Empirical Evaluation. In: International Semantic Web Conference, pp. 310–325. Springer (2013)
De Vocht, L., Verborgh, R., Mannens, E.: Using Triple Pattern Fragments to Enable Streaming of Top-K Shortest Paths via the Web. In: Semantic Web Evaluation Challenge, pp. 228–240. Springer (2016)
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Duan, S., Kementsietsidis, A., Srinivas, K., Udrea, O.: Apples and oranges: a comparison of rdf benchmarks and real rdf datasets. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pp 145–156 (2011)
Duan, S., Kementsietsidis, A., Srinivas, K., Udrea, O.: Apples and oranges: a comparison of rdf benchmarks and real rdf datasets. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pp 145–156 (2011)
Ermilov, I., Lehmann, J., Martin, M., Auer, S.: LODStats: The Data Web Census Dataset. In: International Semantic Web Conference, pp. 38–46. Springer (2016)
Fan, W., Wang, X., Wu, Y.: Performance guarantees for distributed reachability queries. arXiv:1208.0091 (2012)
Galárraga, L., Hose, K., Schenkel, R.: Partout: a distributed engine for efficient rdf processing. In: Proceedings of the 23rd International Conference on World Wide Web, pp 267–268 (2014)
Giraph Team. Apache giraph. http://giraph.apache.org
Görlitz, O., Staab, S.: Splendid: Sparql endpoint federation exploiting void descriptions. In: Proceedings of the Second International Conference on Consuming Linked Data-Volume 782, pp. 13–24. CEUR-WS. org (2011)
Gubichev, A., Neumann, T.: Path Query Processing on Very Large Rdf Graphs. In: WebDB. Citeseer (2011)
Gurajada, S., Seufert, S., Miliaraki, I., Theobald, M.: Triad: a distributed shared-nothing rdf engine based on asynchronous message passing. In: Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pp 289–300 (2014)
Hartig, O.: Querying a web of linked data: foundations and query execution, vol. 24. Ios Press (2016)
Hartig, O., Pirrò, G.: Sparql with property paths on the web. Semantic Web 8(6), 773–795 (2017)
Hertling, S., Schröder, M., Jilek, C., Dengel, A.: Top-K Shortest Paths in Directed Labeled Multigraphs. In: Semantic Web Evaluation Challenge, pp. 200–212. Springer (2016)
Huang, J., Abadi, D. J., Ren, K.: Scalable sparql querying of large rdf graphs. Proc. VLDB Endowment 4(11), 1123–1134 (2011)
Husain, M., McGlothlin, J., Masud, M. M., Khan, L., Thuraisingham, B. M.: Heuristics-based query processing for large rdf graphs using cloud computing. IEEE Trans. Knowl. Data Eng. 23(9), 1312–1327 (2011)
Husain, M. F., Khan, L., Kantarcioglu, M., Thuraisingham, B.: Data Intensive Query Processing for Large Rdf Graphs Using Cloud Computing Tools. In: 2010 IEEE 3Rd International Conference on Cloud Computing, pp. 1–10. IEEE (2010)
Jones, N. D.: An introduction to partial evaluation. ACM Comput. Surv. (CSUR) 28(3), 480–503 (1996)
Kaoudi, Z., Manolescu, I.: Rdf in the clouds: a survey. VLDB J. 24(1), 67–91 (2015)
Karypis, G., Kumar, V.: Analysis of multilevel graph partitioning. In: Supercomputing’95: Proceedings of the 1995 ACM/IEEE conference on Supercomputing, pp 29–29. IEEE (1995)
Khadilkar, V., Kantarcioglu, M., Thuraisingham, B., Castagna, P.: Jena-hbase: A distributed, scalable and efficient rdf triple store. In: Proceedings of the 11th International Semantic Web Conference Posters & Demonstrations Track, ISWC-PD, vol. 12, pp 85–88. Citeseer (2012)
Kochut, K. J., Janik, M.: Sparqler: Extended Sparql for Semantic Association Discovery. In: European Semantic Web Conference, pp. 145–159. Springer (2007)
Kossmann, D.: The state of the art in distributed query processing. ACM Comput. Surv. (CSUR) 32(4), 422–469 (2000)
Kostylev, E. V., Reutter, J. L., Ugarte, M.: Construct Queries in Sparql. In: 18Th International Conference on Database Theory (ICDT 2015). Schloss Dagstuhl-Leibniz-Zentrum Fuer Informatik (2015)
Le Anh, V., Kiss, A.: Efficient Processing Regular Queries in Shared-Nothing Parallel Database Systems Using Tree-And Structural Indexes. In: ADBIS Research Communications (2007)
Lee, K., Liu, L.: Scaling queries over big rdf graphs with semantic hash partitioning. Proc. VLDB Endowment 6(14), 1894–1905 (2013)
Lee, K., Liu, L., Tang, Y., Zhang, Q., Zhou, Y.: Efficient and Customizable Data Partitioning Framework for Distributed Big Rdf Data Processing in the Cloud. In: 2013 IEEE Sixth International Conference on Cloud Computing, pp. 327–334. IEEE (2013)
Li, F., Ooi, B. C., Özsu, M. T., Wu, S.: Distributed data management using mapreduce. ACM Comput. Surv. (CSUR) 46(3), 1–42 (2014)
Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pp 135–146. ACM (2010)
Mehmood, Q., Saleem, M., Sahay, R., Ngomo, A. -C. N., D’Aquin. M.: Qppds Querying property paths over distributed rdf datasets. IEEE Access 7, 101031–101045 (2019)
Morsey, M., Lehmann, J., Auer, S., Ngomo, A. -C. N.: Dbpedia Sparql Benchmark–Performance Assessment with Real Queries on Real Data. In: International Semantic Web Conference, pp. 454–469. Springer (2011)
Myung, J., Yeon, J., Lee, S.-G.: Sparql basic graph pattern processing with iterative mapreduce. In: Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud, pp 1–6 (2010)
Nolé, M., Sartiani, C.: Regular path queries on massive graphs. In: Proceedings of the 28th International Conference on Scientific and Statistical Database Management, pp 1–12 (2016)
Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp 1099–1110 (2008)
Padiya, T., Bhise, M.: Dwahp: workload aware hybrid partitioning and distribution of rdf data. In: Proceedings of the 21st International Database Engineering & Applications Symposium, pp 235–241 (2017)
Papailiou, N., Konstantinou, I., Tsoumakos, D., Koziris, N.: H2rdf: adaptive query processing on rdf data in the cloud. In: Proceedings of the 21st International Conference on World Wide Web, pp 397–400 (2012)
Papailiou, N., Tsoumakos, D., Konstantinou, I., Karras, P., Koziris, N.: H2rdf+ an efficient data management system for big rdf graphs. In: Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pp 909–912 (2014)
Peng, P., Zou, L., Chen, L., Zhao, D.: Adaptive distributed rdf graph fragmentation and allocation based on query workload. IEEE Trans. Knowl. Data Eng. 31(4), 670–685 (2018)
Peng, P., Zou, L., Guan, R.: Accelerating Partial Evaluation in Distributed Sparql Query Evaluation. In: 2019 IEEE 35Th International Conference on Data Engineering (ICDE), pp. 112–123. IEEE (2019)
Przyjaciel-Zablocki, M., Schätzle, A., Hornung, T., Lausen, G.: Rdfpath: Path Query Processing on Large Rdf Graphs with Mapreduce. In: Extended Semantic Web Conference, pp. 50–64. Springer (2011)
Qiao, S., Özsoyoġlu, Z. M.: Rbench: Application-specific rdf benchmarking. In: Proceedings of the 2015 acm sigmod international conference on management of data, pp 1825–1838 (2015)
Quilitz, B., Leser, U.: Querying Distributed Rdf Data Sources with Sparql. In: European Semantic Web Conference, pp. 524–538. Springer (2008)
Rohloff, K., Schantz, R. E.: High-Performance, Massively Scalable Distributed Systems Using the Mapreduce Software Framework: the Shard Triple-Store. In: Programming Support Innovations for Emerging Distributed Applications, pp. 1–5 (2010)
Saleem, M., Hasnain, A., Ngomo, A. -C. N.: Largerdfbench: a billion triples benchmark for sparql endpoint federation. J. Web Semant. 48, 85–125 (2018)
Saleem, M., Khan, Y., Hasnain, A., Ermilov, I., Ngonga Ngomo, A. -C.: A fine-grained evaluation of sparql endpoint federation systems. Semant. Web 7(5), 493–518 (2016)
Saleem, M., Ngomo, A.-C. N. : Hibiscus: Hypergraph-Based Source Selection for Sparql Endpoint Federation. In: European Semantic Web Conference, pp. 176–191. Springer (2014)
Saleem, M., Szárnyas, G., Conrads, F., Bukhari, S. A. C., Mehmood, Q., Ngonga Ngomo, A. -C.: How Representative is a Sparql Benchmark? an Analysis of Rdf Triplestore Benchmarks. In: The World Wide Web Conference, pp. 1623–1633 (2019)
Salihoglu, S., Widom, J.: Gps: A graph processing system. In: Proceedings of the 25th International Conference on Scientific and Statistical Database Management, pp 1–12 (2013)
Santana, L. H. Z., Mello, R. d. S.: An analysis of mapping strategies for storing rdf data into nosql databases. In: Proceedings of the 35th Annual ACM Symposium on Applied Computing, pp 386–392 (2020)
Savenkov, V., Mehmood, Q., Umbrich, J., Polleres, A.: Counting to k or how sparql1. 1 property paths can be extended to top-k path queries. In: Proceedings of the 13th International Conference on Semantic Systems, pp 97–103. ACM (2017)
Schätzle, A., Przyjaciel-Zablocki, M., Hornung, T., Lausen. G.: Pigsparql: ÜBersetzung von sparql nach pig latin. Gesellschaft für Informatik eV (2011)
Schwarte, A., Haase, P., Hose, K., Schenkel, R., Schmidt, M.: Fedx: a Federation Layer for Distributed Query Processing on Linked Open Data. In: Extended Semantic Web Conference, pp. 481–486. Springer (2011)
Shao, B., Wang, H., Trinity, Y. L. i.: A distributed graph engine on a memory cloud. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp 505–516. ACM (2013)
Shvachko, K., Kuang, H., Radia, S., Chansler, R., et al.: The Hadoop Distributed File System. In: MSST, vol. 10, pp 1–10 (2010)
Sirin, E.: Stardog - a path of our own. https://www.stardog.com/blog/a-path-of-our-own/
Team, N.: Neo4j. https://neo4j.com
Umbrich, J., Hogan, A., Polleres, A., Decker, S.: Link traversal querying for a diverse web of data. Semant. Web 6(6), 585–624 (2015)
Valdestilhas, A., Soru, T., Nentwig, M., Marx, E., Saleem, M., Ngomo, A. -C. N.: Where is My Uri?. In: European Semantic Web Conference, pp. 671–681. Springer (2018)
Valiant, L. G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990)
Wang, X., Wang, J., Zhang, X.: Efficient distributed regular path queries on rdf graphs using partial evaluation. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp 1933–1936 (2016)
Wang, X., Wang, J., Zhang, X.: Efficient distributed regular path queries on rdf graphs using partial evaluation. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp 1933–1936. ACM (2016)
Wang, X., Wang, S., Xin, Y., Yang, Y., Li, J., Wang, X.: Distributed pregel-based provenance-aware regular path query processing on rdf knowledge graphs. World Wide Web, 1–32 (2019)
Xin, Y., Wang, X., Jin, D., Wang, S.: Distributed Efficient Provenance-Aware Regular Path Queries on Large Rdf Graphs. In: International Conference on Database Systems for Advanced Applications, pp. 766–782. Springer (2018)
Yu, J. X., queries, J. Cheng.: Graph Reachability a Survey. In: Managing and Mining Graph Data, pp. 181–215. Springer (2010)
Yu, J. X., Cheng, J.: Graph Reachability queries: A Survey. In: Managing and Mining Graph Data, pp. 181–215. Springer (2010)
Zeng, K., Yang, J., Wang, H., Shao, B., Wang, Z.: A distributed graph engine for web scale rdf data. Proc. VLDB Endowment 6(4), 265–276 (2013)
Acknowledgments
This work is partly supported by a research grant from Science Foundation Ireland, co-funded by the European Regional Development Fund, for the Insight SFI Research Centre for Data Analytics under Grant Number SFI/12/RC/2289_P2. The work conducted in the University of Leipzig has been supported by the project LIMBO (Grant no. 19F2029I), OPAL (no. 19F2028A), KnowGraphs (no. 860801), and SOLIDE (no. 13N14456).
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article belongs to the Topical Collection: Special Issue on Large Scale Graph Data Analytics
Guest Editors: Xuemin Lin, Lu Qin, Wenjie Zhang, and Ying Zhang
Rights and permissions
About this article
Cite this article
Mehmood, Q., Saleem, M., Jha, A. et al. Efficient distributed path computation on RDF knowledge graphs using partial evaluation. World Wide Web 25, 1005–1036 (2022). https://doi.org/10.1007/s11280-021-00965-5
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11280-021-00965-5