High-dimensional similarity searches using query driven dynamic quantization and distributed indexing

Gheorghi Guzun¹ &
Guadalupe Canahuate²

491 Accesses
4 Citations
1 Altmetric
Explore all metrics

Abstract

The concept of similarity is used as the basis for many data exploration and data mining tasks. Nearest neighbor (NN) queries identify the most similar items, or in terms of distance the closest points to a query point. Similarity is traditionally characterized using a distance function between multi-dimensional feature vectors. However, when the data is high-dimensional, traditional distance functions fail to significantly distinguish between the closest and furthest points, as few dissimilar dimensions dominate the distance function. Localized similarity functions, i.e. functions that only consider dimensions close to the query, quantize each dimension independently and only compute similarity for the dimensions where the query and the points fall into the same bin. These quantizations are query-agnostic and there is potential to improve accuracy when a query-dependent quantization is used. In this work we propose a query dependent equi-depth (QED) on-the-fly quantization method to improve high-dimensional similarity searches. The quantization is done for each dimension at query time and localized scores are generated for the closest p fraction of the points while a constant penalty is applied for the rest of the points. QED not only improves the quality of the distance metric, but also improves query time performance by filtering out non relevant data. We propose a distributed indexing and query algorithm to efficiently compute QED. Our experimental results show improvements in classification accuracy as well as query performance up to one order of magnitude faster than Manhattan-based sequential scan NN queries over datasets with hundreds of dimensions. Furthermore, similarity searches with QED show linear or better scalability in relation to the number of dimensions, and the number of compute nodes.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Effective product quantization-based indexing for nearest neighbor search

Article 04 May 2018

An Approximate Nearest Neighbor Search Algorithm Using Distance-Based Hashing

Is Quantized ANN Search Cursed? Case Study of Quantifying Search and Index Quality

Notes

References

Aggarwal, C.C., Yu, P.S.: The igrid index: reversing the dimensionality curse for similarity indexing in high dimensional space. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 119–129 (2000)
Amato, G., Esuli, A., Falchi, F.: A comparison of pivot selection techniques for permutation-based indexing. Inf. Syst. 52, 176–188 (2015)
Article Google Scholar
Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 5, 4308 (2014)
Article Google Scholar
Barrios, J.M., Bustos, B., Skopal, T.: Analyzing and dynamically indexing the query set. Inf. Syst. 45, 37–47 (2014)
Article Google Scholar
Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is “nearest neighbor” meaningful? In: International Conference on Database Theory, Springer, New York, pp. 217–235 (1999)
Blake, C., Merz, C.: Uci repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine, CA. http://www.ics.uci.edu/~mlearn/mlrepository.html (1998)
Boiman, O., Shechtman, E., Irani, M.: In defense of nearest-neighbor based image classification. In: IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008, pp. 1–8 (2008)
Chambi, S., Lemire, D., Kaser, O., Godin, R.: Better bitmap performance with roaring bitmaps (2014). arXiv:1402.6407
Chen, L., Gao, Y., Zheng, B., Jensen, C.S., Yang, H., Yang, K.: Pivot-based metric indexing. Proc. VLDB Endow. 10(10), 1058–1069 (2017)
Article Google Scholar
Donoho, D.L.: High-dimensional data analysis: the curses and blessings of dimensionality. AMS Math. Chall. Lect. 1, 32 (2000)
Google Scholar
Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. KDD 96, 226–231 (1996)
Google Scholar
Fayyad, U., Irani, K.: Multi-interval discretization of continuous-valued attributes for classification learning. In: Joint Conference on Artificial Intelligence, pp. 1022–1027 (1993)
Gao, J., Jagadish, H.V., Lu, W., Ooi, B.C.: Dsh: data sensitive hashing for high-dimensional k-nn search. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, ACM, pp. 1127–1138 (2014)
García, S., Luengo, J., Sáez, J.A., López, V., Herrera, F.: A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Trans. Knowl. Data Eng. 25(4), 734–750 (2013)
Article Google Scholar
Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., pp. 518–529 (1999)
Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data (New York, NY, USA), SIGMOD ’84, ACM, pp. 47–57 (1984)
Guzun, G., Canahuate, G.: Hybrid query optimization for hard-to-compress bit-vectors. VLDB J 25, 339–354 (2015)
Article Google Scholar
Guzun, G., Canahuate, G.: Supporting dynamic quantization for high-dimensional data analytics. In: Proceedings of the ExploreDB’17 (New York, NY, USA), ExploreDB ’17, ACM, pp. 6:1–6:6 (2017)
Guzun, G., Canahuate, G.: Distributed query-aware quantization for high-dimensional similarity searches. In: Advances in Database Technology: Proceedings. International Conference on Extending Database Technology, vol. 2018, pp. 373–384 (2018)
Guzun, G., Canahuate, G., Chiu, D., Sawin, J.: A tunable compression framework for bitmap indices. In: IEEE 30th International Conference on Data Engineering (ICDE), IEEE, 2014, pp. 484–495 (2014)
Guzun, G., Tosado, J., Canahuate, G.: Slicing the dimensionality: top-k query processing for high-dimensional spaces. In: Transactions on Large-Scale Data-and Knowledge-Centered Systems XIV. Springer, New York, pp. 26–50 (2014)
Guzun, G., Canahuate, G., Chiu, D.: A two-phase Mapreduce algorithm for scalable preference queries over high-dimensional data. In: Proceedings of the 20th International Database Engineering & Applications Symposium, IDEAS (2016)
Guzun, G., McClurg, J.C., Canahuate, G., Mudumbai, R.: Power efficient big data analytics algorithms through low-level operations. In: 2016 IEEE International Conference on Big Data (Big Data), IEEE, pp. 355–361 (2016)
Hamming, R.W.: Error detecting and error correcting codes. Bell Syst. Tech/ J. 29(2), 147–160 (1950)
Article MathSciNet Google Scholar
Huang, Q., Feng, J., Zhang, Y., Fang, Q., Ng, W.: Query-aware locality-sensitive hashing for approximate nearest neighbor search. Proc. VLDB Endow. 9(1), 1–12 (2015)
Article Google Scholar
Jegou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 117–128 (2011)
Article Google Scholar
Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with gpus (2017). arXiv:1702.08734
Kamel, I., Faloutsos, C.: Hilbert r-tree: an improved r-tree using fractals. In: Proceedings of the 20th International Conference on Very Large Data Bases (San Francisco, CA, USA), VLDB ’94, Morgan Kaufmann Publishers Inc., pp. 500–509 (1994)
Katayama, N., Satoh, S.: The sr-tree: an index structure for high-dimensional nearest neighbor queries. SIGMOD Rec. 26(2), 369–380 (1997)
Article Google Scholar
Kerber, R.: Chimerge: discretization of numeric attributes. In: Proceedings of the 10th National Conference on Artificial Intelligence. San Jose, CA, July 12–16, pp. 123–128 (1992)
Kurgan, L.A., Cios, K.J.: Caim discretization algorithm. IEEE Trans. Knowl. Data Eng. 16(2), 145–153 (2004)
Article Google Scholar
Lemire, D., Kaser, O., Gutarra, E.: Reordering rows for better compression: beyond the lexicographic order. ACM Trans. Datab. Syst. 37(3), 20:1–20:29 (2012)
Google Scholar
Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2014)
Book Google Scholar
Liu, F., Lee, H.J.: Use of social network information to enhance collaborative filtering performance. Expert Syst. Appl. 37(7), 4772–4778 (2010)
Article Google Scholar
O’Neil, P., Quass, D.: Improved query performance with variant indexes. In: Proceedings of the 1997 ACM SIGMOD International Conference on Management of data, ACM Press, pp. 38–49 (1997)
Pareto, V.: Manual of Political Economy. Augustus M. Kelley Publishers, New York (1906)
Google Scholar
Phung, S.L., Bouzerdoum, A., Chai, D.: Skin segmentation using color pixel classification: analysis and comparison. IEEE Trans. Pattern Anal. Mach. Intell. 27(1), 148–154 (2005)
Article Google Scholar
Rinfret, D.: Answering preference queries with bit-sliced index arithmetic. In: Proceedings of the 2008 C3S2E Conference, ACM, pp. 173–185 (2008)
Rinfret, D., O’Neil, P., O’Neil, E.: Bit-sliced index arithmetic. In: ACM SIGMOD Record, vol. 30, ACM, pp. 47–57 (2001)
Samet, H.: The Design and Analysis of Spatial Data Structures, vol. 199. Addison-Wesley, Reading, MA (1990)
Google Scholar
Shy Goh, K., Li, B., Chang, E.: Dyndex: a dynamic and non-metric space indexer. In: In ACM Multimedia, pp. 466–475 (2002)
Tsai, C., Lee, C., Yang, W.: A discretization algorithm based on class-attribute contingency coefficient. Inf. Sci. 178(3), 714–731 (2008)
Article Google Scholar
Tung, A., K.H., Zhang, R., Koudas, N., Ooi, B.C.: Similarity search: a matching based approach. In: VLDB’2006: Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB Endowment, pp. 631–642 (2006)
Weber, R.: Parallel va-file. In: Proceeding of ECDL, Springer, New York, pp. 83–92 (2000)
Weber, R., Blott, S.: An approximation based data structure for similarity search. Technical Report (1997)
Weber, R., Schek, H.-J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: Proceedings of the 24rd International Conference on Very Large Data Bases (San Francisco, CA, USA), VLDB ’98, Morgan Kaufmann Publishers Inc., pp. 194–205 (1998)
Wu, K., Otoo, E.J., Shoshani, A., Nordberg, H.: Notes on design and implementation of compressed bit vectors. Technical Report LBNL/PUB-3161, Lawrence Berkeley National Laboratory (2001)
Wu, K., Otoo, E.J., Shoshani, A.: Compressing bitmap indexes for faster search operations. In: Proceedings of the 2002 International Conference on Scientific and Statistical Database Management Conference (SSDBM’02), pp. 99–108 (2002)
Zipf, G.K.: Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Ravenio Books, Cambridge (2016)
Google Scholar

Download references

Acknowledgements

This work was partially supported by the NIH National Cancer Institute/Big Data to Knowledge (BD2K) Program under Grant R01CA214825 and joint NSF/NIH Initiative on Quantitative Approaches to Biomedical Big Data (QuBDD) (R01) Grant R01CA225190.

Author information

Authors and Affiliations

Department of Computer Engineering, San Jose State University, San Jose, CA, USA
Gheorghi Guzun
Department of Electrical and Computer Engineering, The University of Iowa, Iowa, IA, USA
Guadalupe Canahuate

Authors

Gheorghi Guzun
View author publications
You can also search for this author in PubMed Google Scholar
Guadalupe Canahuate
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Gheorghi Guzun.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Guzun, G., Canahuate, G. High-dimensional similarity searches using query driven dynamic quantization and distributed indexing. Distrib Parallel Databases 38, 255–286 (2020). https://doi.org/10.1007/s10619-019-07266-x

Download citation

Published: 11 April 2019
Issue Date: June 2020
DOI: https://doi.org/10.1007/s10619-019-07266-x

High-dimensional similarity searches using query driven dynamic quantization and distributed indexing

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Effective product quantization-based indexing for nearest neighbor search

An Approximate Nearest Neighbor Search Algorithm Using Distance-Based Hashing

Is Quantized ANN Search Cursed? Case Study of Quantifying Search and Index Quality

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

High-dimensional similarity searches using query driven dynamic quantization and distributed indexing

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Effective product quantization-based indexing for nearest neighbor search

An Approximate Nearest Neighbor Search Algorithm Using Distance-Based Hashing

Is Quantized ANN Search Cursed? Case Study of Quantifying Search and Index Quality

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now