Nothing Special   »   [go: up one dir, main page]

skip to main content
article

Searching in metric spaces

Published: 01 September 2001 Publication History

Abstract

The problem of searching the elements of a set that are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather general case where the similarity criterion defines a metric space, instead of the more restricted case of a vector space. Many solutions have been proposed in different areas, in many cases without cross-knowledge. Because of this, the same ideas have been reconceived several times, and very different presentations have been given for the same approaches. We present some basic results that explain the intrinsic difficulty of the search problem. This includes a quantitative definition of the elusive concept of "intrinsic dimensionality." We also present a unified view of all the known proposals to organize metric spaces, so as to be able to understand them under a common framework. Most approaches turn out to be variations on a few different concepts. We organize those works in a taxonomy that allows us to devise new algorithms from combinations of concepts not noticed before because of the lack of communication between different communities. We present experiments validating our results and comparing the existing approaches. We finish with recommendations for practitioners and open questions for future development.

References

[1]
APERS, P., BLANKEN, H., AND HOUTSMA, M. 1997. Multimedia Databases in Perspective. Springer- Verlag, New York.]]
[2]
ARYA, S., MOUNT, D., NETANYAHU, N., SILVERMAN, R., AND WU, A. 1994. An optimal algorithm for approximate nearest neighbor searching in fixed dimension. In Proceedings of the Fifth ACM-SIAM Symposium on Discrete Algorithms (SODA'94), 573-583.]]
[3]
AURENHAMMER, F. 1991. Voronoi diagrams-A survey of a fundamental geometric data structure. ACM Comput. Surv. 23,3.]]
[4]
BAEZA-YATES, R. 1997. Searching: An algorithmic tour. In Encyclopedia of Computer Science and Technology, vol. 37, A. Kent and J. Williams, Eds., Marcel Dekker, New York, 331-359.]]
[5]
BAEZA-YATES,R.AND NAVARRO, G. 1998. Fast approximate string matching in a dictionary. In Proceedings of the Fifth South American Symposium on String Processing and Information Retrieval (SPIRE'98), IEEE Computer Science Press, Los Alamitos, Calif., 14-22.]]
[6]
BAEZA-YATES,R.AND RIBEIRO-NETO, B. 1999. Modern Information Retrieval. Addison-Wesley, Reading, Mass.]]
[7]
BAEZA-YATES, R., CUNTO, W., MANBER,U.,AND WU,S. 1994. Proximity matching using fixed-queries trees. In Proceedings of the Fifth Combinatorial Pattern Matching (CPM'94), Lecture Notes in Computer Science, vol. 807, 198-212.]]
[8]
BENTLEY, J. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9, 509-517.]]
[9]
BENTLEY, J. 1979. Multidimensional binary search trees in database applications. IEEE Trans. Softw. Eng. 5, 4, 333-340.]]
[10]
BENTLEY, J., WEIDE,B.,AND YAO, A. 1980. Optimal expected-time algorithms for closest point problems. ACM Trans. Math. Softw. 6, 4, 563-580.]]
[11]
BERCHTOLD, S., KEIM,D.,AND KRIEGEL, H. 1996. The X-tree: An index structure for high-dimensional data. In Proceedings of the 22nd Conference on Very Large Databases (VLDB'96), 28-39.]]
[12]
BHANU, B., PENG,J.,AND QING, S. 1998. Learning feature relevance and similarity metrics in image databases. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries (Santa Barbara, Calif.), IEEE Computer Society, Los Alamitos, Calif., 14-18.]]
[13]
BIMBO,A.D.AND VICARIO, E. 1998. Using weighted spatial relationships in retrieval by visual contents. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries (Santa Barbara, Calif.), IEEE Computer Society, Los Alamitos, Calif., 35-39.]]
[14]
B~HM, C., BERCHTOLD,S.,AND KEIM, A. 2002. Searching in high-dimensional spaces-index structures for improving the performance of multimedia databases. ACM Comput. Surv. To appear.]]
[15]
BOZKAYA,T.AND OZSOYOGLU, M. 1997. Distancebased indexing for high-dimensional metric spaces. In Proceedings of ACM SIGMOD International Conference on Management of Data, SIGMOD Rec. 26, 2, 357-368.]]
[16]
BRIN, S. 1995. Near neighbor search in large metric spaces. In Proceedings of the 21st Conference on Very Large Databases (VLDB'95), 574-584.]]
[17]
BRITO, M., CH~VEZ, E., QUIROZ, A., AND YUKICH,J. 1996. Connectivity of the mutual k-nearest neighbor graph in clustering and outlier detection. Stat. Probab. Lett. 35, 33-42.]]
[18]
BUGNION, E., FHEI, S., ROOS, T., WIDMAYER,P., AND WIDMER, F. 1993. A spatial index for approximate multiple string matching. In Proceedings of the First South American Workshop on String Processing (WSP'93), R. Baeza-Yates and N. Ziviani, Eds., 43-53.]]
[19]
BURKHARD,W.AND KELLER, R. 1973. Some approaches to best-match file searching. Commun. ACM 16, 4, 230-236.]]
[20]
CASCIA, M. L., SETHI,S.,AND SCLAROFF, S. 1998. Combining textual and visual cues for content-based image retrieval on the world wide web. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries (Santa Barbara, Calif.), IEEE Computer Society, Los Alamitos, Calif., 24-28.]]
[21]
CH~VEZ, E. 1999. Optimal discretization for pivot based algorithms. Manuscript. ftp:// garota.fismat.umich.mx/pub/users/elchavez/ minimax.ps.gz.]]
[22]
CH~VEZ,E.AND MARROQUYN, J. 1997. Proximity queries in metric spaces. In Proceedings of the Fourth South American Workshop on String Processing ( WSP'97), R. Baeza-Yates, Ed. Carleton University Press, Ottawa, 21-36.]]
[23]
CH~VEZ,E.AND NAVARRO, G. 2000. An effective clustering algorithm to index high dimensional metric spaces. In Proceedings of the Seventh String Processing and Informational Retrieval (SPIRE'00), IEEE CS Press, 75-86.]]
[24]
CH~VEZ,E.AND NAVARRO, G. 2001a. Towards measuring the searching complexity of metric spaces. In Proceedings of the Mexican Computing Meeting, vol. II, 969-972.]]
[25]
CH~VEZ,E.AND NAVARRO, G. 2001b. A probabilistic spell for the curse of dimensionality. In Proceedings of the Third Workshop on Algorithm Engineering and Experimentation (ALENEX'01), 147-160. Lecture Notes in Computer Science v. 2153.]]
[26]
CH~VEZ,E.AND NAVARRO, G. 2002. A metric index for approximate string matching. In Proceedings of the Fifth Latin American Symposium on Theoretical Informatics (LATIN'02), To appear, Lecture Notes in Computer Science.]]
[27]
CH~VEZ, E., MARROQU~N,J.,AND BAEZA-YATES,R. 1999. Spaghettis: An array based algorithm for similarity queries in metric spaces. In Proceedings of String Processing and Information Retrieval (SPIRE'99), IEEE Computer Science Press, Los Alamitos, Calif., 38-46.]]
[28]
CH~VEZ, E., MARROQUIN,J.L.,AND NAVARRO, G. 2001. Fixed queries array: a fast and economical data structure for proximity searching. Multimedia Tools and Applications 14 (2), 113-135. Kluwer.]]
[29]
CHAZELLE, B. 1994. Computational geometry: A retrospective. In Proceedings of the 26th ACM Symposium on the Theory of Computing (STOC'94), 75-94.]]
[30]
CHIUEH, T. 1994. Content-based image indexing. In Proceedings of the Twentieth Conference on Very Large Databases ( VLDB'94), 582-593.]]
[31]
CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1997. M- tree: An efficient access method for similarity search in metric spaces. In Proceedings of the 23rd Conference on Very Large Databases ( VLDB'97), 426-435.]]
[32]
CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1998a. A cost model for similarity queries in metric spaces. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'98).]]
[33]
CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1998b. Processing complex similarity queries with distance-based access methods. In Proceedings of the Sixth International Conference on Extending Database Technology (EDBT'98).]]
[34]
CLARKSON, K. 1999. Nearest neighbor queries in metric spaces. Discrete Comput. Geom. 22,1, 63-93.]]
[35]
COX,T.AND COX, M. 1994. Multidimensional Scaling. Chapman and Hall, London.]]
[36]
DEHNE,F.AND NOLTEMEIER, H. 1987. Voronoi trees and clustering problems. Inf. Syst. 12,2, 171-175.]]
[37]
DEVROYE, L. 1987. A Course in Density Estimation. Birkhauser, Boston.]]
[38]
DUDA,R.AND HART, P. 1973. Pattern Classification and Scene Analysis. Wiley, New York.]]
[39]
FALOUTSOS,C.AND KAMEL, I. 1994. Beyond uniformity and independence: Analysis of R-trees using the concept of fractal dimension. In Proceedings of the Thirteenth ACM Symposium on Principles of Database Principles (PODS'94), 4-13.]]
[40]
FALOUTSOS,C.AND LIN, K. 1995. Fastmap: A fast algorithm for indexing, data mining and visualization of traditional and multimedia datasets. ACM SIGMOD Rec. 24, 2, 163-174.]]
[41]
FALOUTSOS, C., EQUITZ, W., FLICKNER, M., NIBLACK,W., PETKOVIC,D.,AND BARBER, R. 1994. Efficient and effective querying by image content. J. Intell. Inf. Syst. 3, 3/4, 231-262.]]
[42]
FARAG~, A., LINDER,T.,AND LUGOSI, G. 1993. Fast nearest-neighbor search in dissimilarity spaces. IEEE Trans. Patt. Anal. Mach. Intell. 15,9, 957-962.]]
[43]
FRAKES,W.AND BAEZA-YATES, R., EDS. 1992. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, N.J.]]
[44]
GAEDE,V.AND GUNTHER, O. 1998. Multidimensional access methods. ACMComput. Surv. 30,2, 170-231.]]
[45]
GUTTMAN, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 47-57.]]
[46]
HAIR, J., ANDERSON, R., TATHAM, R., AND BLACK,W. 1995. Multivariate Data Analysis with Readings, 4th ed. Prentice-Hall, Englewood Cliffs, N.J.]]
[47]
HJALTASON,G.AND SAMET, H. 1995. Ranking in spatial databases. In Proceedings of Fourth International Symposium on Large Spatial Databases, 83-95.]]
[48]
JAIN,A.AND DUBES, R. 1988. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, N.J.]]
[49]
KALANTARI,I.AND MCDONALD, G. 1983. A data structure and an algorithm for the nearest point problem. IEEE Trans. Softw. Eng. 9,5.]]
[50]
MEHLHORN, K. 1984. Data Structures and Algorithms, Volume III "Multidimensional Searching and Computational Geometry. Springer-Verlag, New York.]]
[51]
MICO, L., ONCINA,J.,AND CARRASCO, R. 1996. A fast branch and bound nearest neighbour classifier in metric spaces. Patt. Recogn. Lett. 17, 731-739.]]
[52]
MICO, L., ONCINA,J.,AND VIDAL, E. 1994. A new version of the nearest-neighbor approximating and eliminating search (AESA) with linear preprocessing-time and memory requirements. Patt. Recog. Lett. 15, 9-17.]]
[53]
NAVARRO, G. 1999. Searching in metric spaces by spatial approximation. In Proceedings of String Processing and Information Retrieval (SPIRE'99), IEEE Computer Science Press, Los Alamitos, Calif., 141-148.]]
[54]
NAVARRO,G.AND REYES, N. 2001. Dynamic spatial approximation trees. In Proceedings of the Eleventh Conference of the Chilean Computer Science Society (SCCC'01), IEEE CS Press, 213-222.]]
[55]
NENE,S.AND NAYAR, S. 1997. A simple algorithm for nearest neighbor search in high dimensions. IEEE Trans. Patt. Anal. Mach. Intell. 19,9, 989-1003.]]
[56]
NIEVERGELT,J.AND HINTERBERGER, H. 1984. The grid file: An adaptable, symmetric multikey file structure. ACM Trans. Database Syst. 9,1, 38-71.]]
[57]
NOLTEMEIER, H. 1989. Voronoi trees and applications. In Proceedings of the International Workshop on Discrete Algorithms and Complex-ity (Fukuoka, Japan), 69-74.]]
[58]
NOLTEMEIER, H., VERBARG, K., AND ZIRKELBACH,C. 1992. Monotonous bisector" treesA tool for efficient partitioning of complex schenes of geometric objects. In Data Structures and Efficient Algorithms, Lecture Notes in Computer Science, vol. 594, Springer-Verlag, New York, 186-203.]]
[59]
PRABHAKAR, S., AGRAWAL,D.,AND ABBADI, A. E. 1998. Efficient disk allocation for fast similarity searching. In Proceedings of ACM SPAA'98 (Puerto Vallarta, Mexico).]]
[60]
ROUSSOPOULOS, N., KELLEY,S.,AND VINCENT, F. 1995. Nearest neighbor queries. In Proceedings of the ACM SIGMOD'95, 71-79.]]
[61]
SALTON,G.AND MCGILL, M. 1983. Introduction to Modern Information Retrieval. McGraw-Hill, New York.]]
[62]
SAMET, H. 1984. The quadtree and related hierarchical data structures. ACM Comput. Surv. 16, 2, 187-260.]]
[63]
SANKOFF,D.AND KRUSKAL, J., EDS. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, Mass.]]
[64]
SASHA,D.AND WANG, T. 1990. New techniques for best-match retrieval. ACM Trans. Inf. Syst. 8,2, 140-158.]]
[65]
SHAPIRO, M. 1977. The choice of reference points in best-match file searching. Commun. ACM 20, 5, 339-343.]]
[66]
SUTTON,R.AND BARTO, A. 1998. Reinforcement Learning : An Introduction. MIT Press, Cambridge, Mass.]]
[67]
UHLMANN, J. 1991a. Implementing metric trees to satisfy general proximity/similarity queries. Manuscript.]]
[68]
UHLMANN, J. 1991b. Satisfying general proximity/ similarity queries with metric trees. Inf. Proc. Lett. 40, 175-179.]]
[69]
VERBARG, K. 1995. The C-Ttree: A dynamically balanced spatial index. Comput. Suppl. 10, 323-340.]]
[70]
VIDAL, E. 1986. An algorithm for finding nearest neighbors in (approximately) constant average time. Patt. Recogn. Lett. 4, 145-157.]]
[71]
WATERMAN, M. 1995. Introduction to Computational Biology. Chapman and Hall, London.]]
[72]
WEBER,R.SCHEK, H.-J., AND BLOTT, S. 1998. A quantitative analysis and performance study for similarity-search methods in highdimensional spaces. In Proceedings of the International Conference on Very Large Databases (VLDB'98).]]
[73]
WHITE,D.AND JAIN, R. 1996. Algorithms and strategies for similarity retrieval. Tech. Rep. VCL-96-101 (July), Visual Computing Laboratory, University of California, La Jolla.]]
[74]
YAO, A. 1990. Computational Geometry,J.Van Leeuwen, Ed. Elsevier Science, New York, 345-380.]]
[75]
YIANILOS, P. 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the Fourth ACM-SIAM Symposium on Discrete Algorithms (SODA'93), 311-321.]]
[76]
YIANILOS, P. 1999. Excluded middle vantage point forests for nearest neighbor search. In DIMACS Implementation Challenge, ALENEX'99 (Baltimore, Md).]]
[77]
YIANILOS, P. 2000. Locally lifting the curse of dimensionality for nearest neighbor search. In Proceedings of the Eleventh ACM-SIAM Symposium on Discrete Algorithms (SODA'00), 361-370.]]
[78]
YOSHITAKA,A.AND ICHIKAWA, T. 1999. A survey on content-based retrieval for multimedia databases. IEEE Trans. Knowl. Data Eng. 11,1, 81-93.]]

Cited By

View all
  • (2024)$\mathbb{T}-$Relative Fuzzy Linear Programming for $\mathbb{T}-$Relative Fuzzy Target Coverage ProblemsEarthline Journal of Mathematical Sciences10.34198/ejms.14224.317331(317-331)Online publication date: 24-Jan-2024
  • (2024)In-Host Flat-like Quasispecies: Characterization Methods and Clinical ImplicationsMicroorganisms10.3390/microorganisms1205101112:5(1011)Online publication date: 17-May-2024
  • (2024)Rank-based Hashing for Effective and Efficient Nearest Neighbor Search for Image RetrievalACM Transactions on Multimedia Computing, Communications, and Applications10.1145/365958020:10(1-19)Online publication date: 12-Sep-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Computing Surveys
ACM Computing Surveys  Volume 33, Issue 3
September 2001
153 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/502807
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 September 2001
Published in CSUR Volume 33, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Curse of dimensionality
  2. nearest neighbors
  3. similarity searching
  4. vector spaces

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)154
  • Downloads (Last 6 weeks)12
Reflects downloads up to 28 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2024)$\mathbb{T}-$Relative Fuzzy Linear Programming for $\mathbb{T}-$Relative Fuzzy Target Coverage ProblemsEarthline Journal of Mathematical Sciences10.34198/ejms.14224.317331(317-331)Online publication date: 24-Jan-2024
  • (2024)In-Host Flat-like Quasispecies: Characterization Methods and Clinical ImplicationsMicroorganisms10.3390/microorganisms1205101112:5(1011)Online publication date: 17-May-2024
  • (2024)Rank-based Hashing for Effective and Efficient Nearest Neighbor Search for Image RetrievalACM Transactions on Multimedia Computing, Communications, and Applications10.1145/365958020:10(1-19)Online publication date: 12-Sep-2024
  • (2024)GTS: GPU-based Tree Index for Fast Similarity SearchProceedings of the ACM on Management of Data10.1145/36549452:3(1-27)Online publication date: 30-May-2024
  • (2024)nSimplex Zen: A Novel Dimensionality Reduction for Euclidean and Hilbert SpacesACM Transactions on Knowledge Discovery from Data10.1145/364764218:6(1-44)Online publication date: 12-Apr-2024
  • (2024)Inexact Simplification of Symbolic Regression Expressions with Locality-sensitive HashingProceedings of the Genetic and Evolutionary Computation Conference10.1145/3638529.3654147(896-904)Online publication date: 14-Jul-2024
  • (2024)Towards Ptolemaic metric properties of the z-normalized Euclidean distance for multivariate time series indexing2024 IEEE 40th International Conference on Data Engineering Workshops (ICDEW)10.1109/ICDEW61823.2024.00026(153-157)Online publication date: 13-May-2024
  • (2024)HJG: An Effective Hierarchical Joint Graph for ANNS in Multi-Metric Spaces2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00326(4275-4287)Online publication date: 13-May-2024
  • (2024)Stronger compact representations of object trajectoriesGeo-spatial Information Science10.1080/10095020.2024.2310590(1-37)Online publication date: 13-Feb-2024
  • (2024)HubHSP graphInformation Systems10.1016/j.is.2023.102341121:COnline publication date: 1-Mar-2024
  • Show More Cited By

View Options

Get Access

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media