Group Testing for Accurate and Efficient Range-Based Near Neighbor Search for Plagiarism Detection

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 15090))

Included in the following conference series:

European Conference on Computer Vision

147 Accesses

Abstract

This work presents an adaptive group testing framework for the range-based high dimensional near neighbor search problem. Our method efficiently marks each item in a database as neighbor or non-neighbor of a query point, based on a cosine distance threshold without exhaustive search. Like other methods for large scale retrieval, our approach exploits the assumption that most of the items in the database are unrelated to the query. Unlike other methods, it does not assume a large difference between the cosine similarity of the query vector with the least related neighbor and that with the least unrelated non-neighbor. Following a multi-stage adaptive group testing algorithm based on binary splitting, we divide the set of items to be searched into half at each step, and perform dot product tests on smaller and smaller subsets, many of which we are able to prune away. We experimentally show that, using softmax-based features, our method achieves a more than ten-fold speed-up over exhaustive search with no loss of accuracy, on a variety of large datasets. Based on empirically verified models for the distribution of cosine distances, we present a theoretical analysis of the expected number of distance computations per query and the probability that a pool with a certain number of members will be pruned. Our method has the following features: (i) It implicitly exploits useful distributional properties of cosine distances unlike other methods; (ii) All required data structures are created purely offline; (iii) It does not impose any strong assumptions on the number of true near neighbors; (iv) It is adaptable to streaming settings where new vectors are dynamically added to the database; and (v) It does not require any parameter tuning. The high recall of our technique makes it particularly suited to plagiarism detection scenarios where it is important to report every database item that is sufficiently similar item to the query.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 64.99; Price excludes VAT (USA)

Softcover Book: USD 79.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Compressed Data Structures for Range Searching

An Empirical Evaluation of Search Strategies for Locality-Sensitive Hashing: Lookup, Voting, and Natural Classifier Search

Provable randomized rounding for minimum-similarity diversification

Article Open access 04 January 2022

References

https://github.com/facebookresearch/faiss
https://www.kaggle.com/competitions/imagenet-object-localization-challenge/data
https://github.com/JoshEngels/FLINNG
https://github.com/FALCONN-LIB/FALCONN
https://github.com/NinhPham/FalconnLSF
https://github.com/google-research/google-research/tree/master/scann
FLANN - fast library for approximate nearest neighbors: wbia-pyflann 4.0.4. https://pypi.org/project/wbia-pyflann/
Generalized binary splitting algorithm. https://en.wikipedia.org/wiki/Group_testing#Generalised_binary-splitting_algorithm
Imdb-wiki - 500k+ face images with age and gender labels. https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/
The InstaCities1M Dataset. https://gombru.github.io/2018/08/01/InstaCities1M/
MIRFLICKR download. https://press.liacs.nl/mirflickr/mirdownload.html
Oxford and Paris buildings dataset. https://www.kaggle.com/datasets/skylord/oxbuildings
Supplemental material. Uploaded on conference portal
Google Scholar
Aldridge, M., Johnson, ., Scarlett, J.: Group testing: an information theory perspective. Found. Trends Commun. Inf. Theory 15(3–4), 196–392 (2019)
Google Scholar
Andoni, A., Razenshteyn, I.: Optimal data-dependent hashing for approximate near neighbors. In: Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 793–801 (2015)
Google Scholar
Babenko, A., Lempitsky, V.: The inverted multi-index. IEEE Trans. Pattern Anal. Mach. Intell. 37(6), 1247–1260 (2014)
Article Google Scholar
Candès, E.J., Wakin, M.B.: An introduction to compressive sampling. IEEE Sig. Process. Mag. 25(2), 21–30 (2008)
Article Google Scholar
Davenport, M., Duarte, M., Eldar, Y., Kutyniok, G.: Introduction to compressed sensing (2012)
Google Scholar
Dong, Y., Indyk, P., Razenshteyn, I., Wagner, T.: Learning space partitions for nearest neighbor search. In: ICLR (2019)
Google Scholar
Dorfman, R.: The detection of defective members of large populations. Ann. Math. Stat. 14(4), 436–440 (1943)
Article Google Scholar
Engels, J., Coleman, B., Shrivastava, A.: Practical near neighbor search via group testing. In: Advances in Neural Information Processing Systems, vol. 34, pp. 9950–9962 (2021)
Google Scholar
Fu, C., Xiang, C., Wang, C., Cai, D.: Fast approximate nearest neighbor search with the navigating spreading-out graph, 12(5), 461–474 (2019)
Google Scholar
Ghosh, S., et al.: A compressed sensing approach to pooled RT-PCR testing for COVID-19 detection. IEEE Open J. Sig. Process. 2, 248–264 (2021)
Article Google Scholar
Guo, R., et al.: Accelerating large-scale inference with anisotropic vector quantization. In: International Conference on Machine Learning, pp. 3887–3896 (2020)
Google Scholar
Hajebi, K., Abbasi-Yadkori, Y., Shahbazi, H., Zhang, H.: Fast approximate nearest-neighbor search with k-nearest neighbor graph. In: IJCAI, pp. 1312–1317 (2011)
Google Scholar
Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton (2015)
Google Scholar
Hwang, F.: A method for detecting all defective members in a population by group testing. J. Am. Stat. Assoc. 67(339), 605–608 (1972)
Article Google Scholar
Indyk, P., Motwani, R.: Approximate nearest authoneighbors: towards removing the curse of dimensionality. In: STOC, pp. 604–613 (1998)
Google Scholar
Iscen, A., Chum, O.: Local orthogonal-group testing. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11206, pp. 460–476. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01216-8_28
Chapter Google Scholar
Li, C.H.: A sequential method for screening experimental variables. J. Am. Stat. Assoc. 57(298), 455–477
Google Scholar
Lv, Q., Josephson, W., Wang, Z., Charikar, M., Li, K.: Multi-probe LSH: efficient indexing for high-dimensional similarity search. In: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 950–961 (2007)
Google Scholar
Malkov, Y., Yashunin, D.: Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell. 42(4), 824–836 (2018)
Article Google Scholar
Muja, M., Lowe, D.: Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans. Pattern Anal. Mach. Intell. 36(11), 2227–2240 (2014)
Article Google Scholar
Pham, N., Liu, T.: Falconn++: a locality-sensitive filtering approach for approximate nearest neighbor search. In: Advances in Neural Information Processing Systems, vol. 35, pp. 31186–31198 (2022)
Google Scholar
Ram, P., Sinha, K.: Revisiting kd-tree for nearest neighbor search. In: SIGKDD, pp. 1378–1388 (2019)
Google Scholar
Shi, M., Furon, T., Jégou, H.: A group testing framework for similarity search in high-dimensional spaces. In: ACMMM, pp. 407–416 (2014)
Google Scholar
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
Google Scholar
Vidyasagar, M.: An introduction to compressed sensing. In: SIAM (2019)
Google Scholar
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR, pp. 586–595 (2018)
Google Scholar

Download references

Acknowledgement

The authors thank the Amazon IIT Bombay AI-ML Initiative for support for the work in this paper.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology, Bombay, Mumbai, India
Harsh Shah, Kashish Mittal & Ajit Rajwade

Authors

Harsh Shah
View author publications
You can also search for this author in PubMed Google Scholar
Kashish Mittal
View author publications
You can also search for this author in PubMed Google Scholar
Ajit Rajwade
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ajit Rajwade .

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1723 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Shah, H., Mittal, K., Rajwade, A. (2025). Group Testing for Accurate and Efficient Range-Based Near Neighbor Search for Plagiarism Detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15090. Springer, Cham. https://doi.org/10.1007/978-3-031-73411-3_18

Download citation

DOI: https://doi.org/10.1007/978-3-031-73411-3_18
Published: 23 November 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73410-6
Online ISBN: 978-3-031-73411-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Group Testing for Accurate and Efficient Range-Based Near Neighbor Search for Plagiarism Detection

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

Compressed Data Structures for Range Searching

An Empirical Evaluation of Search Strategies for Locality-Sensitive Hashing: Lookup, Voting, and Natural Classifier Search

Provable randomized rounding for minimum-similarity diversification

References

Acknowledgement

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

1 Electronic supplementary material

Supplementary material 1 (pdf 1723 KB)

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Subscribe and save

Buy Now

Navigation

Group Testing for Accurate and Efficient Range-Based Near Neighbor Search for Plagiarism Detection

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

Compressed Data Structures for Range Searching

An Empirical Evaluation of Search Strategies for Locality-Sensitive Hashing: Lookup, Voting, and Natural Classifier Search

Provable randomized rounding for minimum-similarity diversification

References

Acknowledgement

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

1 Electronic supplementary material

Supplementary material 1 (pdf 1723 KB)

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation