Abstract
Metric indexes are traditionally used to organize unstructured or complex data in order to speed up similarity queries. The most widely used indexes cluster data or divide space using hyperplanes. During search, the mutual distances between objects and the metric properties allow branches with irrelevant data to be pruned; this is usually implemented with selected anchor objects called pivots. Recently, we introduced an alternative to this approach called the Learned Metric Index. In this method, a series of machine-learning models substitutes for the decisions performed on pivots; query evaluation is then determined by the predictions of these models. This technique relies upon a traditional metric index as a template for its own structure; this dependence on a pre-existing index, and the related overhead, is the main drawback of the approach.
In this paper, we propose a data-driven variant of the Learned Metric Index, which organizes the data using their descriptors directly, thus eliminating the need for a template. The proposed learned index shows significant performance gains over its earlier version, as well as over the established M-index structure.
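To make the idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of how a single-level learned metric index could answer a query: a classifier trained on cluster labels predicts which partitions ("buckets") are most relevant, and buckets are visited in order of predicted probability until a stop-condition, expressed as a fraction of the data-set, is met. All names and parameters here are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy stand-in for object descriptors (e.g. image feature vectors).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 8))

# Build one level: cluster the data, then train a model on the cluster labels.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
model = LogisticRegression(max_iter=1000).fit(X, labels)
buckets = {c: X[labels == c] for c in range(4)}

def query(q, stop_fraction=0.5):
    """Visit buckets in order of predicted relevance until the stop-condition
    (fraction of the data-set searched) is reached; return the candidate set."""
    probs = model.predict_proba(q.reshape(1, -1))[0]
    order = np.argsort(probs)[::-1]          # most probable bucket first
    visited, candidates = 0, []
    for c in order:
        candidates.append(buckets[c])
        visited += len(buckets[c])
        if visited >= stop_fraction * len(X):
            break
    return np.vstack(candidates)

cand = query(X[0])
print(len(cand), "candidates out of", len(X))
```

The candidate set is then refined by exact distance computations, as in conventional approximate similarity search; the stop-condition trades recall for query time.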
This research has been supported by the Czech Science Foundation project No. GA19-02033S. Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.
Notes
1. This training procedure consists of two separate phases: one for clustering the data, and one for their categorization. At every level, the data are first clustered in the same way as described above. Subsequently, a supervised categorization machine-learning algorithm is trained on the relevant portion of the data and the cluster labels.
2. Full enumeration of the stop-conditions used: 0.05%, 0.1%, 0.3%, 0.5%, 1%, 5%, 10%, 20%, 30%, 50% and 75% of the data-set size.
3. The M-index configurations selected as baselines for our three data-sets [2]: M-index CoPhIR 200, M-index Profiset 2000 and M-index MoCap 2000.
4. Best LMI setups in [2]: Multi-label trained on CoPhIR (M-index 200), Logistic Regression trained on Profiset (M-tree 2000) and Neural network trained on MoCap (M-index 2000).
5. The best-performing setup was the one achieving 90% recall at the lowest stop-condition and in the shortest time.
6. For consistency of the environments across indexes, we used the Python 3.6 implementation of M-index from [2].
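The per-level, two-phase training procedure described in Note 1 can be sketched as follows. This is a hedged illustration under assumed parameters (K-means clustering, logistic-regression categorization), not the authors' code; the function name `train_level` and the cluster count are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def train_level(data, n_clusters=3):
    """Phase 1: cluster the level's data (unsupervised).
    Phase 2: train a supervised classifier on the cluster labels.
    Returns the trained model and the per-cluster partitions, each of which
    would be trained the same way to form the next level of the index."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(data)
    model = LogisticRegression(max_iter=1000).fit(data, labels)
    partitions = [data[labels == c] for c in range(n_clusters)]
    return model, partitions

# Toy stand-in for object descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
model, parts = train_level(X)
print(len(parts), sum(len(p) for p in parts))
```

Applied recursively to each partition, this yields the tree of models that replaces pivot-based routing at query time.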
References
Antol, M., Dohnal, V.: BM-index: balanced metric space index based on weighted Voronoi partitioning. In: Welzer, T., Eder, J., Podgorelec, V., Kamišalić Latifić, A. (eds.) ADBIS 2019. LNCS, vol. 11695, pp. 337–353. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-28730-6_21
Antol, M., Ol’ha, J., Slanináková, T., Dohnal, V.: Learned metric index — proposition of learned indexing for unstructured data. Inf. Syst. 100, 101774 (2021)
Batko, M., et al.: Building a web-scale image similarity search system. Multimedia Tools Appl. 47(3), 599–629 (2009)
Berrendorf, M., Borutta, F., Kröger, P.: k-distance approximation for memory-efficient RkNN retrieval. In: Amato, G., Gennaro, C., Oria, V., Radovanović, M. (eds.) SISAP 2019. LNCS, vol. 11807, pp. 57–71. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32047-8_6
Chávez, E., Navarro, G., Baeza-Yates, R.A., Marroquín, J.L.: Searching in metric spaces. ACM Comput. Surv. (CSUR) 33(3), 273–321 (2001)
Ciaccia, P., Patella, M., Zezula, P.: M-tree: an efficient access method for similarity search in metric spaces. In: Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB 1997), Athens, Greece, 25–29 August 1997, pp. 426–435. Morgan Kaufmann (1997)
Dong, Y., Indyk, P., Razenshteyn, I.P., Wagner, T.: Learning space partitions for nearest neighbor search. In: 8th International Conference on Learning Representations, ICLR, Addis Ababa, Ethiopia, 26–30 April 2020 (2020)
Ferragina, P., Vinciguerra, G.: The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. Proc. VLDB Endow. 13(8), 1162–1175 (2020)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Houle, M.E., Nett, M.: Rank cover trees for nearest neighbor search. In: Brisaboa, N., Pedreira, O., Zezula, P. (eds.) SISAP 2013. LNCS, vol. 8199, pp. 16–29. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41062-8_3
Hünemörder, M., Kröger, P., Renz, M.: Towards a learned index structure for approximate nearest neighbor search query processing. In: Reyes, N., et al. (eds.) SISAP 2021. LNCS, vol. 13058, pp. 95–103. Springer, Cham (2021)
Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017)
Lin, K.-I., Yang, C.: The ANN-tree: an index for efficient approximate nearest neighbor search. In: Proceedings Seventh International Conference on Database Systems for Advanced Applications, DASFAA 2001, pp. 174–181, April 2001
Kraska, T., Beutel, A., Chi, E.H., Dean, J., Polyzotis, N.: The case for learned index structures. In: Proceedings of the 2018 International Conference on Management of Data, SIGMOD 2018, pp. 489–504. Association for Computing Machinery (2018)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012)
Li, W., et al.: Approximate nearest neighbor search on high dimensional data — experiments, analyses, and improvement. IEEE Trans. Knowl. Data Eng. 32(8), 1475–1488 (2020)
Llaveshi, A., Sirin, U., Ailamaki, A., West, R.: Accelerating B+tree search by using simple machine learning techniques. In: AIDB — VLDB Workshop on Applied AI for Database Systems and Applications (2019)
Macke, S., et al.: Lifting the curse of multidimensional data with learned existence indexes. In: Workshop on ML for Systems at NeurIPS, pp. 1–6 (2018)
Mic, V., Novak, D., Zezula, P.: Binary sketches for secondary filtering. ACM Trans. Inf. Syst. 37(1), 1:1–1:28 (2019). https://doi.org/10.1145/3231936
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Moriyama, A., Rodrigues, L.S., Scabora, L.C., Cazzolato, M.T., Traina, A.J.M., Traina, C.: VD-tree: how to build an efficient and fit metric access method using Voronoi diagrams. In: Proceedings of the 36th Annual ACM Symposium on Applied Computing (SAC), pp. 327–335. ACM, New York (2021)
Muja, M., Lowe, D.G.: Fast approximate nearest neighbors with automatic algorithm configuration. In: International Conference on Computer Vision Theory and Applications (VISAPP), pp. 331–340 (2009)
Müller, M., Röder, T., Clausen, M., Eberhardt, B., Krüger, B., Weber, A.: Documentation Mocap database HDM05. Technical report, CG-2007-2, Universität Bonn (2007)
Nathan, V., Ding, J., Alizadeh, M., Kraska, T.: Learning multi-dimensional indexes. In: Proceedings of the 2020 International Conference on Management of Data (SIGMOD), pp. 985–1000. ACM (2020)
Navarro, G., Reyes, N.: Dynamic spatial approximation trees. J. Exp. Algorithmics 12 (2008). https://doi.org/10.1145/1227161.1322337
Novak, D., Batko, M., Zezula, P.: Metric index: an efficient and scalable solution for precise and approximate similarity search. Inf. Syst. 36, 721–733 (2011)
Novak, D., Batko, M., Zezula, P.: Large-scale image retrieval using neural net descriptors. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1039–1040. ACM (2015)
Novak, D., Zezula, P.: Rank aggregation of candidate sets for efficient similarity search. In: Decker, H., Lhotská, L., Link, S., Spies, M., Wagner, R.R. (eds.) DEXA 2014. LNCS, vol. 8645, pp. 42–58. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10085-2_4
Oosterhuis, H., Culpepper, J.S., de Rijke, M.: The potential of learned index structures for index compression. In: Proceedings of the 23rd Australasian Document Computing Symposium (ADCS) (2018). https://doi.org/10.1145/3291992.3291993
Pedregosa, F., et al.: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Sablayrolles, A., Douze, M., Schmid, C., Jégou, H.: Spreading vectors for similarity search. In: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019. OpenReview.net (2019)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
Wang, H., Fu, X., Xu, J., Lu, H.: Learned index for spatial queries. In: 20th IEEE International Conference on Mobile Data Management (MDM), pp. 569–574 (2019)
Xiang, W., Zhang, H., Cui, R., Chu, X., Li, K., Zhou, W.: Pavo: a RNN-based learned inverted index, supervised or unsupervised? IEEE Access 7, 293–303 (2019)
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Slanináková, T., Antol, M., Oľha, J., Kaňa, V., Dohnal, V. (2021). Data-Driven Learned Metric Index: An Unsupervised Approach. In: Reyes, N., et al. (eds.) Similarity Search and Applications. SISAP 2021. Lecture Notes in Computer Science, vol. 13058. Springer, Cham. https://doi.org/10.1007/978-3-030-89657-7_7
DOI: https://doi.org/10.1007/978-3-030-89657-7_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89656-0
Online ISBN: 978-3-030-89657-7
eBook Packages: Computer Science, Computer Science (R0)