Abstract
Metric indexes are traditionally used to organize unstructured or complex data in order to speed up similarity queries. The most widely used indexes cluster data or divide space using hyperplanes. During search, the mutual distances between objects and the metric properties allow branches with irrelevant data to be pruned; this is usually implemented with selected anchor objects called pivots. Recently, we introduced an alternative to this approach called the Learned Metric Index. In this method, a series of machine-learning models substitutes for the decisions performed on pivots; query evaluation is then determined by the predictions of these models. This technique relies upon a traditional metric index as a template for its own structure; this dependence on a pre-existing index, and the related overhead, is the main drawback of the approach.
In this paper, we propose a data-driven variant of the Learned Metric Index, which organizes the data using their descriptors directly, thus eliminating the need for a template. The proposed learned index shows significant performance gains over its earlier version, as well as over the established M-index structure.
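To make the idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of how a single-level learned metric index could answer a query: a classifier trained on cluster labels predicts which partitions ("buckets") are most relevant, and buckets are visited in order of predicted probability until a stop-condition, expressed as a fraction of the data-set, is met. All names and parameters here are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy stand-in for object descriptors (e.g. image feature vectors).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 8))

# Build one level: cluster the data, then train a model on the cluster labels.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
model = LogisticRegression(max_iter=1000).fit(X, labels)
buckets = {c: X[labels == c] for c in range(4)}

def query(q, stop_fraction=0.5):
    """Visit buckets in order of predicted relevance until the stop-condition
    (fraction of the data-set searched) is reached; return the candidate set."""
    probs = model.predict_proba(q.reshape(1, -1))[0]
    order = np.argsort(probs)[::-1]          # most probable bucket first
    visited, candidates = 0, []
    for c in order:
        candidates.append(buckets[c])
        visited += len(buckets[c])
        if visited >= stop_fraction * len(X):
            break
    return np.vstack(candidates)

cand = query(X[0])
print(len(cand), "candidates out of", len(X))
```

The candidate set is then refined by exact distance computations, as in conventional approximate similarity search; the stop-condition trades recall for query time.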
This research has been supported by the Czech Science Foundation project No. GA19-02033S. Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.
Notes
1. This training procedure consists of two separate phases: one for clustering the data, and one for their categorization. At every level, the data are first clustered in the same way as described above. Subsequently, a supervised categorization machine-learning algorithm is trained on the relevant portion of the data and the cluster labels.
2. Full enumeration of the stop-conditions used: 0.05%, 0.1%, 0.3%, 0.5%, 1%, 5%, 10%, 20%, 30%, 50% and 75% of the data-set size.
3. The M-index configurations selected as baselines for our three data-sets [2]: M-index CoPhIR 200, M-index Profiset 2000 and M-index MoCap 2000.
4. Best LMI setups in [2]: Multi-label trained on CoPhIR (M-index 200), Logistic Regression trained on Profiset (M-tree 2000) and Neural network trained on MoCap (M-index 2000).
5. The best-performing setup was the one achieving 90% recall at the lowest stop-condition and in the shortest time.
6. For consistency of the environments across indexes, we used the Python 3.6 implementation of M-index from [2].
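The per-level, two-phase training procedure described in Note 1 can be sketched as follows. This is a hedged illustration under assumed parameters (K-means clustering, logistic-regression categorization), not the authors' code; the function name `train_level` and the cluster count are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def train_level(data, n_clusters=3):
    """Phase 1: cluster the level's data (unsupervised).
    Phase 2: train a supervised classifier on the cluster labels.
    Returns the trained model and the per-cluster partitions, each of which
    would be trained the same way to form the next level of the index."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(data)
    model = LogisticRegression(max_iter=1000).fit(data, labels)
    partitions = [data[labels == c] for c in range(n_clusters)]
    return model, partitions

# Toy stand-in for object descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
model, parts = train_level(X)
print(len(parts), sum(len(p) for p in parts))
```

Applied recursively to each partition, this yields the tree of models that replaces pivot-based routing at query time.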
References
Antol, M., Dohnal, V.: BM-index: balanced metric space index based on weighted Voronoi partitioning. In: Welzer, T., Eder, J., Podgorelec, V., Kamišalić Latifić, A. (eds.) ADBIS 2019. LNCS, vol. 11695, pp. 337–353. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-28730-6_21
Antol, M., Ol’ha, J., Slanináková, T., Dohnal, V.: Learned metric index — proposition of learned indexing for unstructured data. Inf. Syst. 100, 101774 (2021)
Batko, M., et al.: Building a web-scale image similarity search system. Multimedia Tools Appl. 47(3), 599–629 (2009)
Berrendorf, M., Borutta, F., Kröger, P.: k-distance approximation for memory-efficient RkNN retrieval. In: Amato, G., Gennaro, C., Oria, V., Radovanović, M. (eds.) SISAP 2019. LNCS, vol. 11807, pp. 57–71. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32047-8_6
Chávez, E., Navarro, G., Baeza-Yates, R.A., Marroquín, J.L.: Searching in metric spaces. ACM Comput. Surv. (CSUR) 33(3), 273–321 (2001)
Ciaccia, P., Patella, M., Zezula, P.: M-tree: an efficient access method for similarity search in metric spaces. In: Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB 1997), Athens, Greece, 25–29 August 1997, pp. 426–435. Morgan Kaufmann (1997)
Dong, Y., Indyk, P., Razenshteyn, I.P., Wagner, T.: Learning space partitions for nearest neighbor search. In: 8th International Conference on Learning Representations, ICLR, Addis Ababa, Ethiopia, 26–30 April 2020 (2020)
Ferragina, P., Vinciguerra, G.: The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. Proc. VLDB Endow. 13(8), 1162–1175 (2020)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Houle, M.E., Nett, M.: Rank cover trees for nearest neighbor search. In: Brisaboa, N., Pedreira, O., Zezula, P. (eds.) SISAP 2013. LNCS, vol. 8199, pp. 16–29. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41062-8_3
Hünemörder, M., Kröger, P., Renz, M.: Towards a learned index structure for approximate nearest neighbor search query processing. In: Reyes, N., et al. (eds.) SISAP 2021. LNCS, vol. 13058, pp. 95–103. Springer, Cham (2021)
Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017)
Lin, K.-I., Yang, C.: The ANN-tree: an index for efficient approximate nearest neighbor search. In: Proceedings Seventh International Conference on Database Systems for Advanced Applications, DASFAA 2001, pp. 174–181, April 2001
Kraska, T., Beutel, A., Chi, E.H., Dean, J., Polyzotis, N.: The case for learned index structures. In: Proceedings of the 2018 International Conference on Management of Data, SIGMOD 2018, pp. 489–504. Association for Computing Machinery (2018)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012)
Li, W., et al.: Approximate nearest neighbor search on high dimensional data — experiments, analyses, and improvement. IEEE Trans. Knowl. Data Eng. 32(8), 1475–1488 (2020)
Llaveshi, A., Sirin, U., Ailamaki, A., West, R.: Accelerating B+tree search by using simple machine learning techniques. In: AIDB — VLDB Workshop on Applied AI for Database Systems and Applications (2019)
Macke, S., et al.: Lifting the curse of multidimensional data with learned existence indexes. In: Workshop on ML for Systems at NeurIPS, pp. 1–6 (2018)
Mic, V., Novak, D., Zezula, P.: Binary sketches for secondary filtering. ACM Trans. Inf. Syst. 37(1), 1:1–1:28 (2019). https://doi.org/10.1145/3231936
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Moriyama, A., Rodrigues, L.S., Scabora, L.C., Cazzolato, M.T., Traina, A.J.M., Traina, C.: VD-tree: how to build an efficient and fit metric access method using Voronoi diagrams. In: Proceedings of the 36th Annual ACM Symposium on Applied Computing (SAC), pp. 327–335. ACM, New York (2021)
Muja, M., Lowe, D.G.: Fast approximate nearest neighbors with automatic algorithm configuration. In: International Conference on Computer Vision Theory and Applications (VISAPP), pp. 331–340 (2009)
Müller, M., Röder, T., Clausen, M., Eberhardt, B., Krüger, B., Weber, A.: Documentation Mocap database HDM05. Technical report, CG-2007-2, Universität Bonn (2007)
Nathan, V., Ding, J., Alizadeh, M., Kraska, T.: Learning multi-dimensional indexes. In: Proceedings of the 2020 International Conference on Management of Data (SIGMOD), pp. 985–1000. ACM (2020)
Navarro, G., Reyes, N.: Dynamic spatial approximation trees. J. Exp. Algorithmics 12 (2008). https://doi.org/10.1145/1227161.1322337
Novak, D., Batko, M., Zezula, P.: Metric index: an efficient and scalable solution for precise and approximate similarity search. Inf. Syst. 36, 721–733 (2011)
Novak, D., Batko, M., Zezula, P.: Large-scale image retrieval using neural net descriptors. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1039–1040. ACM (2015)
Novak, D., Zezula, P.: Rank aggregation of candidate sets for efficient similarity search. In: Decker, H., Lhotská, L., Link, S., Spies, M., Wagner, R.R. (eds.) DEXA 2014. LNCS, vol. 8645, pp. 42–58. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10085-2_4
Oosterhuis, H., Culpepper, J.S., de Rijke, M.: The potential of learned index structures for index compression. In: Proceedings of the 23rd Australasian Document Computing Symposium (ADCS) (2018). https://doi.org/10.1145/3291992.3291993
Pedregosa, F., et al.: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Sablayrolles, A., Douze, M., Schmid, C., Jégou, H.: Spreading vectors for similarity search. In: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019. OpenReview.net (2019)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
Wang, H., Fu, X., Xu, J., Lu, H.: Learned index for spatial queries. In: 20th IEEE International Conference on Mobile Data Management (MDM), pp. 569–574 (2019)
Xiang, W., Zhang, H., Cui, R., Chu, X., Li, K., Zhou, W.: Pavo: a RNN-based learned inverted index, supervised or unsupervised? IEEE Access 7, 293–303 (2019)
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Slanináková, T., Antol, M., Oľha, J., Kaňa, V., Dohnal, V. (2021). Data-Driven Learned Metric Index: An Unsupervised Approach. In: Reyes, N., et al. (eds.) Similarity Search and Applications. SISAP 2021. Lecture Notes in Computer Science, vol. 13058. Springer, Cham. https://doi.org/10.1007/978-3-030-89657-7_7
DOI: https://doi.org/10.1007/978-3-030-89657-7_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89656-0
Online ISBN: 978-3-030-89657-7
eBook Packages: Computer Science, Computer Science (R0)