Abstract
Knowledge Graphs (KGs) are able to structure and represent knowledge in complex systems, whereby their completeness impacts the quality of any further application. Real world KGs are notoriously incomplete, which is why KG Completion (KGC) methods emerged. Most KGC methods rely on Knowledge Graph Embedding (KGE) based link prediction models, which provide completion candidates for a sparse KG. Metrics like the Mean Rank, the Mean Reciprocal Rank or Hits@K evaluate the quality of those models, like TransR, DistMult or ComplEx. Based on the principle of supervised learning, these metrics evaluate a KGC model trained on a training dataset, based on the partition of true completion candidates it achieves on a test dataset. Dealing with real world, complex KGs, we found that sparsity is not equally distributed across a KG, but rather grouped in clusters. We use modularity-based KG clustering, to approximate sparsity levels in a KG. Furthermore, we postulate that prediction errors of an embedding-based KGC model are correlated within clusters of a KG but uncorrelated between them and formalize a new, cluster-robust KGC evaluation metric. We test our metric using six benchmark dataset and one real-world industrial example dataset. Our experiments show its superiority to existing metrics with regards to the prediction of cluster-robust triplets\(^{1}\)(\(^{1}\)The code is available at https://github.com/simoncharmms/crmrr.).
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
In general, those graphs are referred to as Label-Property-Graphs and are opposed to graphs where also a pre-defined scheme for rules over entities, referred to as an ontology, is underlying [25]. In this paper, we focus on Label-Property-Graphs.
References
Akrami, F., Saeef, M.S., Zhang, Q., Hu, W., Li, C.: Realistic re-evaluation of knowledge graph completion methods: an experimental study (2020). https://doi.org/10.1145/3318464.3380599
Ali, M., et al.: Bringing light into the dark: a large-scale evaluation of knowledge graph embedding models under a unified framework. CoRR (2020). https://doi.org/10.1109/TPAMI.2021.3124805
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2007). https://doi.org/10.1117/1.2819119
Biswas, R.: Embedding based link prediction for knowledge graph completion. In: Proceedings of the 29th ACM International Conference on Information and Knowledge Management, pp. 3221–3224. ACM, Virtual Event, Ireland (2020). https://doi.org/10.1145/3340531.3418512
Bordes, A., Usunier, N., Garcia-Durán, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Proceedings of the 26th International Conference on Neural Information Processing Systems, vol. 2, pp. 2787–2795. Curran Associates Inc., Red Hook, NY, USA (2013). https://doi.org/10.5555/2999792.2999923
Chen, Z., Wang, Y., Zhao, B., Cheng, J., Zhao, X., Duan, Z.: Knowledge graph completion: a review. IEEE Access 8, 192435–192456 (2020). https://doi.org/10.1109/ACCESS.2020.3030076
Colin Cameron, A., Miller, D.L.: A practitioner’s guide to cluster-robust inference. J. Hum. Resourc. 50(2), 317–372 (2015). https://doi.org/10.3368/jhr.50.2.317
Esarey, J., Menger, A.: Practical and effective approaches to dealing with clustered data. Polit. Sci. Res. Methods 7(3), 541–559 (2019). https://doi.org/10.1017/psrm.2017.42
Fagiolo, G.: Clustering in complex directed networks. Phys. Rev. E 76, 026107 (2007). https://doi.org/10.1103/PhysRevE.76.026107
Fayyad, U., Smyth, P., Piatetsky-Shapiro, G.: From data mining to knowledge discovery in databases. AI Mag. 17(3), 37–54 (1996). https://doi.org/10.1609/aimag.v17i3.1230
Fuhr, N.: Some common mistakes in IR evaluation, and how they can be avoided. ACM SIGIR Forum 51(3), 32–41 (2018). https://doi.org/10.1145/3190580.3190586
Hoyt, C.T., Berrendorf, M., Galkin, M., Tresp, V., Gyori, B.M.: A unified framework for rank-based evaluation metrics for link prediction in knowledge graphs. Pre-print (2022). https://doi.org/10.48550/ARXIV.2203.07544
Kumar, S., Hooi, B., Makhija, D., Kumar, M., Faloutsos, C., Subrahmanian, V.: Rev2: fraudulent user prediction in rating platforms. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 333–341. ACM (2018). https://doi.org/10.1145/3159652.3159729
Kumar, S., Spezzano, F., Subrahmanian, V., Faloutsos, C.: Edge weight prediction in weighted signed networks. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 221–230. IEEE (2016). https://doi.org/10.1109/ICDM.2016.0033
Lee, J.R., Gharan, S.O., Trevisan, L.: Multiway spectral partitioning and higher-order Cheeger inequalities. J. ACM 61(6), 1–30 (2014). https://doi.org/10.1145/2665063
Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters (2008). https://doi.org/10.1080/15427951.2009.10129177
Leung, M.P.: Network cluster-robust inference. Pre-print (2021). https://doi.org/10.48550/ARXIV.2103.01470
Lin, Y., Liu, Z., Sun, M., Liu, Y., Zhu, X.: Learning entity and relation embeddings for knowledge graph completion. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 29, no. 1 (2015). https://doi.org/10.1609/aaai.v29i1.9491
MacKinnon, J.G.: How cluster-robust inference is changing applied econometrics. Can. J. Econ./Revue canadienne d’économique 52(3), 851–881 (2019). https://doi.org/10.1111/caje.12388
MacKinnon, J.G., Nielsen, M.Ø., Webb, M.D.: Cluster-robust inference: a guide to empirical practice. J. Econometrics, S0304407622000781 (2022). https://doi.org/10.1016/j.jeconom.2022.04.001
Marina Speranskaya, M.S.: Ranking vs. classifying: measuring knowledge base completion quality. Pre-print (2020). https://doi.org/10.24432/C57G65
McAuley, J., Leskovec, J.: Hidden factors and hidden topics: understanding rating dimensions with review text. In: Proceedings of the 7th ACM Conference on Recommender Systems. RecSys ’13, pp. 165–172. Association for Computing Machinery, New York, NY, USA (2013). https://doi.org/10.1145/2507157.2507163
Mirkin, B.G.: Clustering for Data Mining: A Data Recovery Approach. No. 3 in Computer Science and Data Analysis Series, Chapman & Hall/CRC, Boca Raton, FL (2005). https://doi.org/10.1201/9781420034912
Paranjape, A., Benson, A.R., Leskovec, J.: Motifs in temporal networks. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM (2017). https://doi.org/10.1145/3018661.3018731
Paulheim, H.: Knowledge graph refinement: a survey of approaches and evaluation methods. Semant. Web 8(3), 489–508 (2016). https://doi.org/10.3233/SW-160218
Rosa, G.J.M.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie, T., Tibshirani, R., Friedman, J. Biometrics 66(4), 1315–1315 (2010). https://doi.org/10.1111/j.1541-0420.2010.01516.x
Rozemberczki, B., Allen, C., Sarkar, R.: Multi-scale attributed node embedding (2019). https://doi.org/10.1093/comnet/cnab014
Traag, V.A., Waltman, L., van Eck, N.J.: From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9(1), 5233 (2019). https://doi.org/10.1038/s41598-019-41695-z
Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., Bouchard, G.: Complex embeddings for simple link prediction. Pre-print (2016). https://doi.org/10.48550/ARXIV.1606.06357
van Engelen, J.E., Hoos, H.H.: A survey on semi-supervised learning. Mach. Learn. 109(2), 373–440 (2019). https://doi.org/10.1007/s10994-019-05855-6
White, H.: A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48(4), 817 (1980). https://doi.org/10.2307/1912934
Yang, B., Yih, W.T., He, X., Gao, J., Deng, L.: Embedding entities and relations for learning and inference in knowledge bases. Pre-print (2014). https://doi.org/10.48550/ARXIV.1412.6575
Yang, H., Lin, Z., Zhang, M.: Rethinking knowledge graph evaluation under the open-world assumption (2022). https://doi.org/10.48550/ARXIV.2209.08858
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Appendix
Appendix
1.1 Appendix A: Relevant Notations of this Paper
Notation | Meaning |
---|---|
G | A graph |
\( \lbrace (h, r, t) \rbrace \in \mathcal {T}\) | A set of triples with heads, relations and tails |
\(e \in \mathcal {E}\) | A set of entities e |
\(r \in \mathcal {R}\) | A set of relations r |
\(R_r \in \mathbb {R}^{d \times d}\) | A \(d \times d\) dimensional matrix with coefficients \(a_{ji}\) |
\(r_r \in \mathbb {R}^{d}\) | A d-dimensional vector |
\(k^{out}=\sum _j a_{ji}\) | The outbound degree |
\(k_i^{in}=\sum _j a_{ij}\) | The inbound degree |
Deg | The total degree of a graph |
T(e) | The number of triangles through entity e |
Trans | The transitivity of a graph |
CCO | The mean clustering coefficient of a graph |
\(r_{cc}\) | The rank of a cross-cluster completion candidate |
n | The total count of ranks \(r_{i}\) |
m | The count of cross-cluster ranks \(r_{j \vert cc}\) from \( \lbrace (h,r,t)_{cc} \rbrace \) |
\(|c |\) | The count of clusters of a clustered KG |
\((h,r,t)_{cc}\) | A cross-cluster completion candidate |
1.2 Appendix B: Metrics of Used Knowledge Graphs
Industry KG | \(BTC_{alpha}\) | \(BTC_{otc}\) | \(web-google\) | |
---|---|---|---|---|
[14] | [13] | [16] | ||
Node count | 116699 | 24185 | 35591 | 5105039 |
Relation count | 5710 | 3782 | 5881 | 875713 |
Deg | 40.875 | 12.789 | 12.103 | 11.659 |
Trans | 0.012 | 0.063 | 0.045 | 0.449 |
CCO | 0.044 | 0.158 | 0.151 | 0.369 |
\(|c |\) | 61 | 79 | 87 | 29722 |
\(sx-stackoverflow\) | \(web-amazon\) | \(web-facebook\) | ||
[24] | [22] | [27] | ||
Node count | 574795 | 925872 | 171002 | |
Relation count | 406972 | 334863 | 22470 | |
Deg | 2.824 | 5.529 | 15.220 | |
Trans | 0.001 | 0.103 | 0.124 | |
CCO | 0.001 | 0.198 | 0.179 | |
\(|c |\) | 90266 | 31916 | 790 |
1.3 Appendix C: Training Details and Results by KG and Model.
Model | Epochs | Batch size | m/n | MR | MRR | \(H_{10}\) | CRMRR | |
---|---|---|---|---|---|---|---|---|
\(web-google\) | TransR | 0.381 | 6.070 | 0.142 | 0.887 | 0.221 | ||
DistMult | 510 | 32 | 0.137 | 49.370 | 0.024 | 0.362 | 0.389 | |
ComplEx | 0.152 | 315.440 | 0.001 | 0.135 | 0.229 | |||
\(web-amazon\) | TransR | 0.212 | 11.300 | 0.093 | 0.362 | 0.116 | ||
DistMult | 510 | 32 | 0.365 | 29.620 | 0.012 | 0.566 | 0.217 | |
ComplEx | 0.195 | 508.480 | 0.009 | 0.112 | 0.368 | |||
\(sx-stackoverflow\) | TransR | 0.262 | 7.150 | 0.074 | 0.791 | 0.145 | ||
DistMult | 510 | 32 | 0.165 | 38.400 | 0.025 | 0.498 | 0.285 | |
ComplEx | 0.265 | 484.940 | 0.002 | 0.255 | 0.115 | |||
\(web-facebook\) | TransR | 0.185 | 3.660 | 0.156 | 0.272 | 0.083 | ||
DistMult | 510 | 32 | 0.114 | 26.880 | 0.008 | 0.798 | 0.127 | |
ComplEx | 0.098 | 795.670 | 0.001 | 0.302 | 0.136 | |||
Industry KG | TransR | 0.134 | 8.312 | 0.124 | 0.727 | 0.189 | ||
DistMult | 510 | 32 | 0.119 | 54.854 | 0.018 | 0.578 | 0.283 | |
ComplEx | 0.219 | 470.813 | 0.002 | 0.184 | 0.171 | |||
\(BTC_{alpha}\) | TransR | 0.530 | 7.400 | 0.056 | 0.210 | 0.082 | ||
DistMult | 510 | 32 | 0.262 | 20.840 | 0.011 | 0.438 | 0.180 | |
ComplEx | 0.235 | 404.900 | 0.004 | 0.076 | 0.119 | |||
\(BTC_{otc}\) | TransR | 0.154 | 3.820 | 0.113 | 0.626 | 0.155 | ||
DistMult | 510 | 32 | 0.170 | 21.390 | 0.012 | 0.394 | 0.202 | |
ComplEx | 0.32 | 348.40 | 0.001 | 0.134 | 0.081 |
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Schramm, S., Niklas, U., Schmid, U. (2023). Cluster Robust Inference for Embedding-Based Knowledge Graph Completion. In: Jin, Z., Jiang, Y., Buchmann, R.A., Bi, Y., Ghiran, AM., Ma, W. (eds) Knowledge Science, Engineering and Management. KSEM 2023. Lecture Notes in Computer Science(), vol 14117. Springer, Cham. https://doi.org/10.1007/978-3-031-40283-8_25
Download citation
DOI: https://doi.org/10.1007/978-3-031-40283-8_25
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40282-1
Online ISBN: 978-3-031-40283-8
eBook Packages: Computer ScienceComputer Science (R0)