Abstract
Negative sampling, which samples negative triplets from those not observed in a knowledge graph (KG), is an essential step in KG embedding. Recently, generative adversarial networks (GANs) have been introduced into negative sampling. By sampling negative triplets with large gradients, these methods avoid the vanishing-gradient problem and thus obtain better performance. However, they make the original model more complex and harder to train. In this paper, motivated by the observation that negative triplets with large gradients are important but rare, we propose to keep track of them directly with a cache. In this way, our method acts as a “distilled” version of previous GAN-based methods, which does not waste training time on additional parameters to fit the full distribution of negative triplets. However, how to sample from and how to update the cache are two critical questions. We propose to solve them with automated machine learning techniques; the automated version also covers GAN-based methods as special cases. A theoretical explanation of NSCaching is also provided, justifying its superiority over the fixed sampling scheme. Besides, we further extend NSCaching with the skip-gram model for graph embedding. Finally, extensive experiments show that our method brings significant improvements to various KG embedding models and the skip-gram model, and outperforms state-of-the-art negative sampling methods.
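To make the cache-based idea concrete, below is a minimal Python sketch (not the paper's implementation): it assumes a toy TransE-style scorer, keys one cache of candidate negative tails per (head, relation) pair, samples uniformly from the cache, and refreshes the cache by keeping the highest-scoring candidates. The greedy top-k refresh here is a simplification; the actual NSCaching update balances exploration and exploitation.

```python
# Minimal sketch of cache-based negative sampling (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

n_entities, n_relations, dim, cache_size = 1000, 10, 32, 50
entity_emb = rng.normal(scale=0.1, size=(n_entities, dim))
relation_emb = rng.normal(scale=0.1, size=(n_relations, dim))

def score(h, r, t):
    # TransE-style score: larger (less negative) means more plausible,
    # hence a "harder" negative with a larger gradient.
    return -np.linalg.norm(entity_emb[h] + relation_emb[r] - entity_emb[t])

# One cache of candidate negative tails per (head, relation) pair.
cache = {}

def sample_negative(h, r):
    """Uniformly pick a negative tail from the cache for (h, r)."""
    c = cache.setdefault((h, r), rng.integers(0, n_entities, size=cache_size))
    return int(rng.choice(c))

def update_cache(h, r, n_candidates=50):
    """Refresh the cache: mix it with random candidates and keep the
    highest-scoring (hardest) negatives -- the rare, large-gradient ones."""
    c = cache.setdefault((h, r), rng.integers(0, n_entities, size=cache_size))
    candidates = np.concatenate([c, rng.integers(0, n_entities, size=n_candidates)])
    scores = np.array([score(h, r, int(t)) for t in candidates])
    cache[(h, r)] = candidates[np.argsort(-scores)[:cache_size]]

# Usage inside a (sketched) training step:
h, r = 3, 1
neg_t = sample_negative(h, r)   # negative triplet (h, r, neg_t)
update_cache(h, r)              # keep the cache filled with hard negatives
```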
Notes
The mean rank (MR) metric, which is reported in the conference version [67], is removed in the journal version since (i) space is limited and (ii) the mean rank is easily influenced by low-ranking samples.
References
Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. In: ICLR (2017)
Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: a nucleus for a web of open data. In: The Semantic Web, pp. 722–735. Springer (2007)
Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: ICML, pp. 41–48 (2009)
Bergstra, J., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: NIPS, pp. 2546–2554 (2011)
Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. JMLR 13, 281–305 (2012)
Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collaboratively created graph database for structuring human knowledge. In: ACM SIGMOD, pp. 1247–1250 (2008)
Bordes, A., Chopra, S., Weston, J.: Question answering with subgraph embeddings. In: Conference on EMNLP, pp. 615–620 (2014)
Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: NIPS, pp. 2787–2795 (2013)
Bose, A., Ling, H., Cao, Y.: Adversarial contrastive estimation. In: ACL (Volume 1: Long Papers), pp. 1021–1032 (2018)
Cai, H., Zheng, V., Chang, K.: A comprehensive survey of graph embedding: problems, techniques, and applications. IEEE TKDE 30(9), 1616–1637 (2018)
Cai, L., Wang, W.: KBGAN: adversarial learning for knowledge graph embeddings. In: NAACL, pp. 1470–1480 (2018)
Chen, L., Yuan, F., Jose, J., Zhang, W.: Improving negative sampling for word representation using self-embedded features. In: WSDM, pp. 99–107 (2018)
Dettmers, T., Minervini, P., Stenetorp, P., Riedel, S.: Convolutional 2D knowledge graph embeddings. In: AAAI (2018)
Ding, J., Quan, Y., He, X., Li, Y., Jin, D.: Reinforced negative sampling for recommendation with exposure data. In: IJCAI, pp. 2230–2236. AAAI Press (2019)
Dong, L., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., Murphy, K., Strohmann, T., Sun, S., Zhang, W.: Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In: ACM SIGKDD, pp. 601–610 (2014)
Dori, D.: ViSWeb—the visual semantic web: unifying human and machine knowledge representations with object-process methodology. VLDB J. 13(2), 120–147 (2004)
Fedus, W., Goodfellow, I., Dai, A.: MaskGAN: better text generation via filling in the ______. In: ICLR (2018)
Gao, H., Huang, H.: Self-paced network embedding. In: ACM SIGKDD, pp. 1406–1415 (2018)
Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: International Conference on Artificial Intelligence and Statistics, pp. 249–256 (2010)
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS, pp. 2672–2680 (2014)
Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: ACM SIGKDD, pp. 855–864 (2016)
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.: Improved training of Wasserstein GANs. In: NIPS, pp. 5767–5777 (2017)
Hutter, F., Hoos, H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: LION, pp. 507–523 (2011)
Hutter, F., Kotthoff, L., Vanschoren, J. (eds.): Automated Machine Learning: Methods, Systems, Challenges. Springer (2018)
Ji, G., He, S., Xu, L., Liu, K., Zhao, J.: Knowledge graph embedding via dynamic mapping matrix. In: ACL, pp. 687–696 (2015)
Kadlec, R., Bajgar, O., Kleindienst, J.: Knowledge base completion: baselines strike back. In: The 2nd Workshop on Representation Learning for NLP, pp. 69–74 (2017)
Kazemi, S., Poole, D.: SimplE embedding for link prediction in knowledge graphs. In: NeurIPS, pp. 4289–4300 (2018)
Kingma, D., Ba, J.: Adam: a method for stochastic optimization. Technical report. arXiv:1412.6980 (2014)
Kok, S., Domingos, P.: Statistical predicate invention. In: ICML, pp. 433–440 (2007)
Koller, D., Friedman, N., Džeroski, S., Sutton, C., McCallum, A., Pfeffer, A., Abbeel, P., Wong, M., Heckerman, D., Meek, C., et al.: Introduction to Statistical Relational Learning. The MIT Press (2007)
Kumar, M., Packer, B., Koller, D.: Self-paced learning for latent variable models. In: NIPS, pp. 1189–1197 (2010)
Lao, N., Mitchell, T., Cohen, W.: Random walk inference and learning in a large scale knowledge base. In: Conference on EMNLP, pp. 529–539. ACL (2011)
Li, J., Tao, C., Feng, Y., Zhao, D., Yan, R. et al.: Sampling matters! an empirical study of negative sampling strategies for learning of matching models in retrieval-based dialogue systems. In: Proceedings of the 2019 EMNLP-IJCNLP, pp. 1291–1296 (2019)
March, J.: Exploration and exploitation in organizational learning. Organ. Sci. 2(1), 71–87 (1991)
McCallum, A., Nigam, K., Rennie, J., Seymore, K.: Automating the construction of internet portals with machine learning. Inf. Retrieval 3(2), 127–163 (2000)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: ICLR (2013)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS, pp. 3111–3119 (2013)
Mikolov, T., Yih, W., Zweig, G.: Linguistic regularities in continuous space word representations. In: NAACL, pp. 746–751 (2013)
Mottin, D., Lissandrini, M., Velegrakis, Y., Palpanas, T.: Exemplar queries: a new way of searching. VLDB J. 25(6), 741–765 (2016)
Needell, D., Ward, R., Srebro, N.: Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In: NIPS, pp. 1017–1025 (2014)
Nickel, M., Murphy, K., Tresp, V., Gabrilovich, E.: A review of relational machine learning for knowledge graphs. Proc. IEEE 104(1), 11–33 (2015)
Noy, N., Gao, Y., Jain, A., Narayanan, A., Patterson, A., Taylor, J.: Industry-scale knowledge graphs: lessons and challenges. Queue 17(2), 48–75 (2019)
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS, pp. 8024–8035 (2019)
Perozzi, B., Al-Rfou, R., Skiena, S.: DeepWalk: online learning of social representations. In: ACM SIGKDD, pp. 701–710 (2014)
Rawat, A., Chen, J., Yu, F., Suresh, A., Kumar, S.: Sampled softmax with random Fourier features. In: NeurIPS, pp. 13857–13867 (2019)
Rendle, S., Freudenthaler, C., Gantner, Z., Schmidt-Thieme, L.: BPR: Bayesian personalized ranking from implicit feedback. In: Conference on UAI, pp. 452–461. AUAI Press (2009)
Suchanek, F., Kasneci, G., Weikum, G.: YAGO: a core of semantic knowledge. In: WWW, pp. 697–706 (2007)
Sun, Z., Deng, Z., Nie, J., Tang, J.: RotatE: knowledge graph embedding by relational rotation in complex space. In: ICLR (2019)
Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: LINE: large-scale information network embedding. In: WWW, pp. 1067–1077 (2015)
Toutanova, K., Chen, D.: Observed versus latent features for knowledge base and text inference. In: Workshop on CVSMC, pp. 57–66 (2015)
Trouillon, T., Dance, C., Gaussier, É., Welbl, J., Riedel, S., Bouchard, G.: Knowledge graph completion via complex tensor factorization. JMLR 18(1), 4735–4772 (2017)
Wang, H., Wang, J., Wang, J., Zhao, M., Zhang, W., Zhang, F., Xie, X., Guo, M.: GraphGAN: graph representation learning with generative adversarial nets. In: AAAI (2018)
Wang, J., Yu, L., Zhang, W., Gong, Y., Xu, Y., Wang, B., Zhang, P., Zhang, D.: IRGAN: a minimax game for unifying generative and discriminative information retrieval models. In: ACM SIGIR, pp. 515–524 (2017)
Wang, P., Li, S., Pan, R.: Incorporating GAN for negative sampling in knowledge representation learning. In: AAAI (2018)
Wang, Q., Mao, Z., Wang, B., Guo, L.: Knowledge graph embedding: a survey of approaches and applications. IEEE TKDE 29(12), 2724–2743 (2017)
Wang, Z., Zhang, J., Feng, J., Chen, Z.: Knowledge graph embedding by translating on hyperplanes. In: AAAI, pp. 1112–1119 (2014)
Welford, B.: Note on a method for calculating corrected sums of squares and products. Technometrics 4(3), 419–420 (1962)
Williams, R.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8(3–4), 229–256 (1992)
Wu, C., Manmatha, R., Smola, A., Krahenbuhl, P.: Sampling matters in deep embedding learning. In: Proceedings of the ICCV, pp. 2840–2848 (2017)
Yang, B., Yih, W., He, X., Gao, J., Deng, L.: Embedding entities and relations for learning and inference in knowledge bases. In: ICLR (2015)
Yao, Q., Wang, N., Jair Escalante, H., Guyon, I., Hu, Y., Li, Y., Tu, W., Yang, Q., Yu, Y.: Taking human out of learning applications: a survey on automated machine learning. Technical report, arXiv preprint (2018)
Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W., Leskovec, J.: Graph convolutional neural networks for web-scale recommender systems. In: ACM SIGKDD, pp. 974–983 (2018)
Zhang, C., Li, Y., Du, N., Fan, W., Yu, P.: On the generative discovery of structured medical knowledge. In: SIGKDD, pp. 2720–2728 (2018)
Zhang, F., Yuan, N., Lian, D., Xie, X., Ma, W.: Collaborative knowledge base embedding for recommender systems. In: ACM SIGKDD, pp. 353–362 (2016)
Zhang, S., Yao, L., Sun, A., Tay, Y.: Deep learning based recommender system: a survey and new perspectives. ACM Comput. Surv. 52(1), 1–38 (2019)
Zhang, W., Chen, T., Wang, J., Yu, T.: Optimizing top-n collaborative filtering via dynamic negative item sampling. In: ACM SIGIR, pp. 785–788 (2013)
Zhang, Y., Yao, Q., Shao, Y., Chen, L.: NSCaching: simple and efficient negative sampling for knowledge graph embedding. In: ICDE, pp. 614–625 (2019)
Zhao, P., Zhang, T.: Stochastic optimization with importance sampling for regularized loss minimization. In: ICML, pp. 1–9 (2015)
Zou, L., Chen, L., Özsu, M., Zhao, D.: Answering pattern match queries in large graph databases via graph embedding. VLDB J. 21(1), 97–120 (2012)
A Appendix: Proof of Theorem 1
Proof
Following [68], we consider a more general optimization formulation as \(\min_{\mathbf{w}} \bar{F}(\mathbf{w}) \equiv \frac{1}{n} \sum\nolimits_{i=1}^n \phi_i(\mathbf{w}) + \lambda r(\mathbf{w})\), which covers (11) as a special case with \(\lambda = 0\). Then, the stochastic training gives \(\mathbf{w}^{t+1}\) as
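For reference, the importance-sampled proximal update studied in [68] takes the form \(\mathbf{w}^{t+1} = \arg\min_{\mathbf{w}} \; \eta_t \big\langle \frac{1}{n\, p^t_{i_t}} \nabla \phi_{i_t}(\mathbf{w}^t), \mathbf{w} \big\rangle + \eta_t \lambda\, r(\mathbf{w}) + \mathscr{B}_{\psi}(\mathbf{w}, \mathbf{w}^t)\), where \(i_t\) is the index drawn at step \(t\) with probability \(p^t_{i_t}\); update (18) follows this pattern (the exact weighting shown here is an assumption for illustration).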
where \(\mathscr {B}_{\psi }\) is a Bregman distance function measuring the difference between \(\mathbf {w}\) and \(\mathbf {w}^t\). Based on (18), we state Theorem 3 in [68] as
Theorem 2
([68]) Let \(\mathbf{w}^t\) be generated from (18). Assume \(\mathscr{B}_{\psi}\) is \(\sigma\)-strongly convex w.r.t. a norm \(\Vert \cdot \Vert\) and that \(\bar{F}\) and r are convex. If \(\eta_t = \eta\), the following inequality holds for any \(T \ge 1\)
where the expectation is taken with respect to the distribution \(\mathbf{p}^t\).
In our Algorithm 5, we do not have r, and thus \(\lambda = 0\) in (11). Besides, \(\mathscr{B}_{\psi}(\mathbf{w}, \mathbf{w}^t) = \frac{1}{2} \Vert \mathbf{w} - \mathbf{w}^t \Vert^2\) in our case, so \(\sigma = 1\). Combining these, Theorem 1 follows directly from Theorem 2. \(\square\)