research-article
DOI: 10.1145/3459637.3481999

DistRDF2ML - Scalable Distributed In-Memory Machine Learning Pipelines for RDF Knowledge Graphs

Published: 30 October 2021

Abstract

This paper presents DistRDF2ML, a generic, scalable, and distributed framework for creating in-memory data preprocessing pipelines for Spark-based machine learning on RDF knowledge graphs. The framework introduces software modules that transform large-scale RDF data into ML-ready, fixed-length numeric feature vectors. The modules are optimized for the multi-modal nature of knowledge graphs. DistRDF2ML follows the same software design and usage principles as common data science stacks, offering an easy-to-use package for creating machine learning pipelines. The modules used in a pipeline, the hyper-parameters, and the results are exported as a semantic structure that can be used to enrich the original knowledge graph. This semantic representation of metadata and machine learning results increases the reusability, explainability, and reproducibility of the machine learning pipelines. The entire framework is open source, integrated into the holistic SANSA stack, documented in Scaladoc, and covered by unit tests. Experiments across different processing power configurations and (hyper-)parameter setups demonstrate the scalability of its design. The framework brings the three worlds of knowledge graph engineers, distributed computation developers, and data scientists closer together and lets all of them create explainable ML pipelines in a few lines of code.
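
To make the central idea of the abstract concrete, the following is a minimal single-machine sketch of turning RDF-style triples into fixed-length numeric feature vectors. It is not the DistRDF2ML API and does not use Spark; the toy triples, the function names `collapse` and `vectorize`, and the encoding choices (mean for numeric literals, multi-hot for string values) are all assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, predicate, object); not taken from the paper.
TRIPLES = [
    ("m1", "runtime", 120.0),
    ("m1", "genre", "drama"),
    ("m1", "genre", "crime"),
    ("m2", "runtime", 95.0),
    ("m2", "genre", "comedy"),
]


def collapse(triples):
    """Group objects per subject and predicate: {subject: {predicate: [objects]}}."""
    rows = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        rows[s][p].append(o)
    return rows


def vectorize(triples):
    """Return (column names, {subject: fixed-length list of floats}).

    Assumed encoding: one column per numeric predicate (mean of its values,
    0.0 if absent) and one column per (predicate, string value) pair (multi-hot).
    """
    rows = collapse(triples)
    numeric = sorted({p for _, p, o in triples if isinstance(o, (int, float))})
    categorical = sorted({(p, o) for _, p, o in triples if isinstance(o, str)})
    columns = numeric + [f"{p}={o}" for p, o in categorical]

    vectors = {}
    for s in sorted(rows):
        vec = []
        for p in numeric:
            # Only numeric objects contribute to the mean.
            values = [v for v in rows[s].get(p, []) if isinstance(v, (int, float))]
            vec.append(sum(values) / len(values) if values else 0.0)
        for p, o in categorical:
            vec.append(1.0 if o in rows[s].get(p, []) else 0.0)
        vectors[s] = vec
    return columns, vectors


if __name__ == "__main__":
    columns, vectors = vectorize(TRIPLES)
    print(columns)
    for subject, vec in vectors.items():
        print(subject, vec)
```

Every subject ends up with a vector of the same length regardless of how many triples describe it, which is the property a downstream ML estimator needs. A distributed implementation would perform the grouping and encoding steps over partitioned data rather than in-memory dictionaries.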




Published In

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
October 2021
4966 pages
ISBN:9781450384469
DOI:10.1145/3459637
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. RDF
  2. SANSA
  3. apache spark
  4. big data
  5. data science
  6. distributed computing
  7. explainable artificial intelligence
  8. knowledge graphs
  9. machine learning
  10. open source framework
  11. preprocessing pipeline
  12. processing
  13. resource description framework
  14. scalable semantic

Qualifiers

  • Research-article

Conference

CIKM '21
Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Article Metrics

  • Downloads (Last 12 months): 27
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 14 Dec 2024

Cited By

  • (2024) AutoRDF2GML: Facilitating RDF Integration in Graph Machine Learning. The Semantic Web – ISWC 2024, 115-133. DOI: 10.1007/978-3-031-77847-6_7. Online publication date: 27-Nov-2024
  • (2024) SparkKG-ML: A Library to Facilitate End–to–End Large–Scale Machine Learning Over Knowledge Graphs in Python. The Semantic Web – ISWC 2024, 3-19. DOI: 10.1007/978-3-031-77847-6_1. Online publication date: 27-Nov-2024
  • (2023) Blockchain Secured Dynamic Machine Learning Pipeline for Manufacturing. Applied Sciences 13:2, 782. DOI: 10.3390/app13020782. Online publication date: 5-Jan-2023
  • (2023) Towards a GML-Enabled Knowledge Graph Platform. 2023 IEEE 39th International Conference on Data Engineering (ICDE), 2946-2954. DOI: 10.1109/ICDE55515.2023.00225. Online publication date: Apr-2023
  • (2022) Ethical and Sustainability Considerations for Knowledge Graph based Machine Learning. 2022 IEEE Fifth International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), 53-60. DOI: 10.1109/AIKE55402.2022.00015. Online publication date: Sep-2022
  • (2022) SimE4KG: Distributed and Explainable Multi-Modal Semantic Similarity Estimation for Knowledge Graphs. 2022 IEEE Fifth International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), 1-8. DOI: 10.1109/AIKE55402.2022.00007. Online publication date: Sep-2022
