Nothing Special   »   [go: up one dir, main page]

Skip to main content

Advertisement

Log in

High-dimensional similarity searches using query driven dynamic quantization and distributed indexing

  • Published:
Distributed and Parallel Databases Aims and scope Submit manuscript

Abstract

The concept of similarity is used as the basis for many data exploration and data mining tasks. Nearest neighbor (NN) queries identify the most similar items, or in terms of distance the closest points to a query point. Similarity is traditionally characterized using a distance function between multi-dimensional feature vectors. However, when the data is high-dimensional, traditional distance functions fail to significantly distinguish between the closest and furthest points, as few dissimilar dimensions dominate the distance function. Localized similarity functions, i.e. functions that only consider dimensions close to the query, quantize each dimension independently and only compute similarity for the dimensions where the query and the points fall into the same bin. These quantizations are query-agnostic and there is potential to improve accuracy when a query-dependent quantization is used. In this work we propose a query dependent equi-depth (QED) on-the-fly quantization method to improve high-dimensional similarity searches. The quantization is done for each dimension at query time and localized scores are generated for the closest p fraction of the points while a constant penalty is applied for the rest of the points. QED not only improves the quality of the distance metric, but also improves query time performance by filtering out non relevant data. We propose a distributed indexing and query algorithm to efficiently compute QED. Our experimental results show improvements in classification accuracy as well as query performance up to one order of magnitude faster than Manhattan-based sequential scan NN queries over datasets with hundreds of dimensions. Furthermore, similarity searches with QED show linear or better scalability in relation to the number of dimensions, and the number of compute nodes.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13

Similar content being viewed by others

Notes

  1. https://github.com/mrsqueeze/spark-hash.

  2. http://www.emc.ncep.noaa.gov/mmb/ylin/pcpanl/stage4/.

References

  1. Aggarwal, C.C., Yu, P.S.: The igrid index: reversing the dimensionality curse for similarity indexing in high dimensional space. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 119–129 (2000)

  2. Amato, G., Esuli, A., Falchi, F.: A comparison of pivot selection techniques for permutation-based indexing. Inf. Syst. 52, 176–188 (2015)

    Article  Google Scholar 

  3. Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 5, 4308 (2014)

    Article  Google Scholar 

  4. Barrios, J.M., Bustos, B., Skopal, T.: Analyzing and dynamically indexing the query set. Inf. Syst. 45, 37–47 (2014)

    Article  Google Scholar 

  5. Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is “nearest neighbor” meaningful? In: International Conference on Database Theory, Springer, New York, pp. 217–235 (1999)

  6. Blake, C., Merz, C.: Uci repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine, CA. http://www.ics.uci.edu/~mlearn/mlrepository.html (1998)

  7. Boiman, O., Shechtman, E., Irani, M.: In defense of nearest-neighbor based image classification. In: IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008, pp. 1–8 (2008)

  8. Chambi, S., Lemire, D., Kaser, O., Godin, R.: Better bitmap performance with roaring bitmaps (2014). arXiv:1402.6407

  9. Chen, L., Gao, Y., Zheng, B., Jensen, C.S., Yang, H., Yang, K.: Pivot-based metric indexing. Proc. VLDB Endow. 10(10), 1058–1069 (2017)

    Article  Google Scholar 

  10. Donoho, D.L.: High-dimensional data analysis: the curses and blessings of dimensionality. AMS Math. Chall. Lect. 1, 32 (2000)

    Google Scholar 

  11. Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. KDD 96, 226–231 (1996)

    Google Scholar 

  12. Fayyad, U., Irani, K.: Multi-interval discretization of continuous-valued attributes for classification learning. In: Joint Conference on Artificial Intelligence, pp. 1022–1027 (1993)

  13. Gao, J., Jagadish, H.V., Lu, W., Ooi, B.C.: Dsh: data sensitive hashing for high-dimensional k-nn search. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, ACM, pp. 1127–1138 (2014)

  14. García, S., Luengo, J., Sáez, J.A., López, V., Herrera, F.: A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Trans. Knowl. Data Eng. 25(4), 734–750 (2013)

    Article  Google Scholar 

  15. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., pp. 518–529 (1999)

  16. Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data (New York, NY, USA), SIGMOD ’84, ACM, pp. 47–57 (1984)

  17. Guzun, G., Canahuate, G.: Hybrid query optimization for hard-to-compress bit-vectors. VLDB J 25, 339–354 (2015)

    Article  Google Scholar 

  18. Guzun, G., Canahuate, G.: Supporting dynamic quantization for high-dimensional data analytics. In: Proceedings of the ExploreDB’17 (New York, NY, USA), ExploreDB ’17, ACM, pp. 6:1–6:6 (2017)

  19. Guzun, G., Canahuate, G.: Distributed query-aware quantization for high-dimensional similarity searches. In: Advances in Database Technology: Proceedings. International Conference on Extending Database Technology, vol. 2018, pp. 373–384 (2018)

  20. Guzun, G., Canahuate, G., Chiu, D., Sawin, J.: A tunable compression framework for bitmap indices. In: IEEE 30th International Conference on Data Engineering (ICDE), IEEE, 2014, pp. 484–495 (2014)

  21. Guzun, G., Tosado, J., Canahuate, G.: Slicing the dimensionality: top-k query processing for high-dimensional spaces. In: Transactions on Large-Scale Data-and Knowledge-Centered Systems XIV. Springer, New York, pp. 26–50 (2014)

  22. Guzun, G., Canahuate, G., Chiu, D.: A two-phase Mapreduce algorithm for scalable preference queries over high-dimensional data. In: Proceedings of the 20th International Database Engineering & Applications Symposium, IDEAS (2016)

  23. Guzun, G., McClurg, J.C., Canahuate, G., Mudumbai, R.: Power efficient big data analytics algorithms through low-level operations. In: 2016 IEEE International Conference on Big Data (Big Data), IEEE, pp. 355–361 (2016)

  24. Hamming, R.W.: Error detecting and error correcting codes. Bell Syst. Tech/ J. 29(2), 147–160 (1950)

    Article  MathSciNet  Google Scholar 

  25. Huang, Q., Feng, J., Zhang, Y., Fang, Q., Ng, W.: Query-aware locality-sensitive hashing for approximate nearest neighbor search. Proc. VLDB Endow. 9(1), 1–12 (2015)

    Article  Google Scholar 

  26. Jegou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 117–128 (2011)

    Article  Google Scholar 

  27. Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with gpus (2017). arXiv:1702.08734

  28. Kamel, I., Faloutsos, C.: Hilbert r-tree: an improved r-tree using fractals. In: Proceedings of the 20th International Conference on Very Large Data Bases (San Francisco, CA, USA), VLDB ’94, Morgan Kaufmann Publishers Inc., pp. 500–509 (1994)

  29. Katayama, N., Satoh, S.: The sr-tree: an index structure for high-dimensional nearest neighbor queries. SIGMOD Rec. 26(2), 369–380 (1997)

    Article  Google Scholar 

  30. Kerber, R.: Chimerge: discretization of numeric attributes. In: Proceedings of the 10th National Conference on Artificial Intelligence. San Jose, CA, July 12–16, pp. 123–128 (1992)

  31. Kurgan, L.A., Cios, K.J.: Caim discretization algorithm. IEEE Trans. Knowl. Data Eng. 16(2), 145–153 (2004)

    Article  Google Scholar 

  32. Lemire, D., Kaser, O., Gutarra, E.: Reordering rows for better compression: beyond the lexicographic order. ACM Trans. Datab. Syst. 37(3), 20:1–20:29 (2012)

    Google Scholar 

  33. Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2014)

    Book  Google Scholar 

  34. Liu, F., Lee, H.J.: Use of social network information to enhance collaborative filtering performance. Expert Syst. Appl. 37(7), 4772–4778 (2010)

    Article  Google Scholar 

  35. O’Neil, P., Quass, D.: Improved query performance with variant indexes. In: Proceedings of the 1997 ACM SIGMOD International Conference on Management of data, ACM Press, pp. 38–49 (1997)

  36. Pareto, V.: Manual of Political Economy. Augustus M. Kelley Publishers, New York (1906)

    Google Scholar 

  37. Phung, S.L., Bouzerdoum, A., Chai, D.: Skin segmentation using color pixel classification: analysis and comparison. IEEE Trans. Pattern Anal. Mach. Intell. 27(1), 148–154 (2005)

    Article  Google Scholar 

  38. Rinfret, D.: Answering preference queries with bit-sliced index arithmetic. In: Proceedings of the 2008 C3S2E Conference, ACM, pp. 173–185 (2008)

  39. Rinfret, D., O’Neil, P., O’Neil, E.: Bit-sliced index arithmetic. In: ACM SIGMOD Record, vol. 30, ACM, pp. 47–57 (2001)

  40. Samet, H.: The Design and Analysis of Spatial Data Structures, vol. 199. Addison-Wesley, Reading, MA (1990)

    Google Scholar 

  41. Shy Goh, K., Li, B., Chang, E.: Dyndex: a dynamic and non-metric space indexer. In: In ACM Multimedia, pp. 466–475 (2002)

  42. Tsai, C., Lee, C., Yang, W.: A discretization algorithm based on class-attribute contingency coefficient. Inf. Sci. 178(3), 714–731 (2008)

    Article  Google Scholar 

  43. Tung, A., K.H., Zhang, R., Koudas, N., Ooi, B.C.: Similarity search: a matching based approach. In: VLDB’2006: Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB Endowment, pp. 631–642 (2006)

  44. Weber, R.: Parallel va-file. In: Proceeding of ECDL, Springer, New York, pp. 83–92 (2000)

  45. Weber, R., Blott, S.: An approximation based data structure for similarity search. Technical Report (1997)

  46. Weber, R., Schek, H.-J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: Proceedings of the 24rd International Conference on Very Large Data Bases (San Francisco, CA, USA), VLDB ’98, Morgan Kaufmann Publishers Inc., pp. 194–205 (1998)

  47. Wu, K., Otoo, E.J., Shoshani, A., Nordberg, H.: Notes on design and implementation of compressed bit vectors. Technical Report LBNL/PUB-3161, Lawrence Berkeley National Laboratory (2001)

  48. Wu, K., Otoo, E.J., Shoshani, A.: Compressing bitmap indexes for faster search operations. In: Proceedings of the 2002 International Conference on Scientific and Statistical Database Management Conference (SSDBM’02), pp. 99–108 (2002)

  49. Zipf, G.K.: Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Ravenio Books, Cambridge (2016)

    Google Scholar 

Download references

Acknowledgements

This work was partially supported by the NIH National Cancer Institute/Big Data to Knowledge (BD2K) Program under Grant R01CA214825 and joint NSF/NIH Initiative on Quantitative Approaches to Biomedical Big Data (QuBDD) (R01) Grant R01CA225190.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Gheorghi Guzun.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Guzun, G., Canahuate, G. High-dimensional similarity searches using query driven dynamic quantization and distributed indexing. Distrib Parallel Databases 38, 255–286 (2020). https://doi.org/10.1007/s10619-019-07266-x

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10619-019-07266-x

Keywords