Abstract
This work presents an adaptive group testing framework for the range-based high-dimensional near neighbor search problem. Our method efficiently marks each item in a database as a neighbor or non-neighbor of a query point, based on a cosine distance threshold, without exhaustive search. Like other methods for large-scale retrieval, our approach exploits the assumption that most of the items in the database are unrelated to the query. Unlike other methods, it does not assume a large gap between the cosine similarity of the query vector with the least related neighbor and that with the most related non-neighbor. Following a multi-stage adaptive group testing algorithm based on binary splitting, we divide the set of items to be searched in half at each step, and perform dot product tests on smaller and smaller subsets, many of which we are able to prune away. We experimentally show that, using softmax-based features, our method achieves a more than ten-fold speed-up over exhaustive search with no loss of accuracy, on a variety of large datasets. Based on empirically verified models for the distribution of cosine distances, we present a theoretical analysis of the expected number of distance computations per query and the probability that a pool with a certain number of members will be pruned. Our method has the following features: (i) It implicitly exploits useful distributional properties of cosine distances, unlike other methods; (ii) All required data structures are created purely offline; (iii) It does not impose any strong assumptions on the number of true near neighbors; (iv) It is adaptable to streaming settings where new vectors are dynamically added to the database; and (v) It does not require any parameter tuning. The high recall of our technique makes it particularly suited to plagiarism detection scenarios where it is important to report every database item that is sufficiently similar to the query.
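The binary-splitting idea described above can be sketched in a few lines. This is a hedged illustration of the pooling principle only, not the authors' exact algorithm: it assumes the query and all database vectors are entrywise non-negative (as softmax-based features are), so the dot product of the query with the sum of a pool's vectors upper-bounds the dot product with any single member, and a pool whose pooled score falls below the similarity threshold `tau` can be pruned without losing any true neighbor. In the paper the pooled sums are built offline; here they are recomputed on the fly for brevity.

```python
import numpy as np

def find_neighbors(q, X, tau):
    """Adaptive binary-splitting near neighbor search (illustrative sketch).

    q   : non-negative query vector, shape (d,)
    X   : non-negative database vectors, shape (n, d)
    tau : similarity threshold; item i is a neighbor if X[i] @ q >= tau
    """
    neighbors = []

    def split(indices):
        # Pooled test: since all entries are non-negative,
        # (sum of pool) @ q >= max over members of (member @ q).
        pooled = X[indices].sum(axis=0)  # precomputed offline in the paper
        if pooled @ q < tau:
            return  # no member can reach tau: prune the whole pool
        if len(indices) == 1:
            # For a singleton pool the pooled test IS the exact test.
            neighbors.append(indices[0])
            return
        mid = len(indices) // 2
        split(indices[:mid])   # binary splitting: recurse on each half
        split(indices[mid:])

    split(list(range(len(X))))
    return neighbors
```

With a toy non-negative database, the pruning is exact: `find_neighbors(np.array([1.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [0.1, 0.1]]), 0.7)` returns `[0, 2]`, matching exhaustive search, while the pools `{1}` and `{3}` are discarded by a single pooled dot product each.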
Acknowledgement
The authors thank the Amazon IIT Bombay AI-ML Initiative for support for the work in this paper.
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Shah, H., Mittal, K., Rajwade, A. (2025). Group Testing for Accurate and Efficient Range-Based Near Neighbor Search for Plagiarism Detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15090. Springer, Cham. https://doi.org/10.1007/978-3-031-73411-3_18
Print ISBN: 978-3-031-73410-6
Online ISBN: 978-3-031-73411-3