Article

Finding data broadness via generalized nearest neighbors

Authors:

Jayendra Venkateswaran,

Orhan CamogluAuthors Info & Claims

EDBT'06: Proceedings of the 10th international conference on Advances in Database Technology

Pages 645 - 663

https://doi.org/10.1007/11687238_39

Published: 26 March 2006 Publication History

Abstract

A data object is broad if it is one of the k-Nearest Neighbors (k-NN) of many data objects. We introduce a new database primitive called Generalized Nearest Neighbor (GNN) to express data broadness. We also develop three strategies to answer GNN queries efficiently for large datasets of multidimensional objects. The R*-Tree based search algorithm generates candidate pages and ranks them based on their distances. Our first algorithm, Fetch All (FA), fetches as many candidate pages as possible. Our second algorithm, Fetch One (FO), fetches one candidate page at a time. Our third algorithm, Fetch Dynamic (FD), dynamically decides on the number of pages that needs to be fetched. We also propose three optimizations, Column Filter, Row Filter and Adaptive Filter, to eliminate pages from each dataset. Column Filter prunes the pages that are guaranteed to be non-broad. Row Filter prunes the pages whose removal do not change the broadness of any data point. Adaptive Filter prunes the search space dynamically along each dimension to eliminate unpromising objects. Our experiments show that FA is the fastest when the buffer size is large and FO is the fastest when the buffer size is small. FD is always either fastest or very close to the faster of FA and FO. FD is significantly faster than the existing methods adapted to the GNN problem.

References

[1]

Susanne Albers. Competitive Online Algorithms. Technical Report LS-96-2, brics, September 1996.

[2]

N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. In International Conference on Management of Data (SIGMOD), pages 322-331, 1990.

Digital Library

[3]

S. Berchtold, B. Ertl, D.A. Keim, H.-P. Kriegel, and T. Seidl. Fast Nearest Neighbor Search in High-dimensional Space. In International Conference on Data Engineering (ICDE), pages 209-218, 1998.

Digital Library

[4]

K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When Is "Nearest Neighbor" Meaningful? In International Conference on Database Theory (ICDT), pages 217-235, 1999.

Digital Library

[5]

C. Böhm and F. Krebs. The k-Nearest Neighbour Join: Turbo Charging the KDD Process. Knowledge and Information Systems (KAIS), 6(6), 2004.

Digital Library

[6]

O. Çamoğlu, T. Kahveci, and A. K. Singh. Towards Index-based Similarity Search for Protein Structure Databases. Journal of Bioinformatics and Computational Biology (JBCB), 2(1):99-126, 2004.

[7]

Chee Yong Chan and Beng Chin Ooi. Efficient Scheduling of Page Access in Index-Based Join Processing. IEEE Transactions on Knowledge and Data Engineering (TKDE), 9(6):1005-1011, November/December 1997.

Digital Library

[8]

C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. In Computational Systems Bioinformatics Conference (CSB), pages 523-528, 2003.

Digital Library

[9]

G.R. Hjaltason and H. Samet. Ranking in Spatial Databases. In Symposium on Spatial Databases, pages 83-95, Portland, Maine, August 1995.

Digital Library

[10]

X. Huang and A. Madan. CAP3: A DNA Sequence Assembly Program. Genome Research, 9(9):868-877, 1999.

[11]

Ibrahim Kamel and Christos Faloutsos. Hilbert R-tree: An Improved R-tree using Fractals. In International Conference on Very Large Databases (VLDB), pages 500-509, 1994.

Digital Library

[12]

F. Korn and S. Muthukrishnan. Influence sets based on reverse nearest neighbor queries. In International Conference on Management of Data (SIGMOD), pages 201-212, 2000.

Digital Library

[13]

F. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel, and Z. Protopapas. Fast Nearest Neighbor Search in Medical Databases. In International Conference on Very Large Databases (VLDB), pages 215-226, India, 1996.

Digital Library

[14]

T. H. Merrett, Yahiko Kambayashi, and H. Yasuura. Scheduling of Page-Fetches in Join Operations. In International Conference on Very Large Databases (VLDB), pages 488-498, 1981.

Digital Library

[15]

N. Roussopoulos, S. Kelley, and F. Vincent. Nearest Neighbor Queries. In International Conference on Management of Data (SIGMOD), San Jose, CA, 1995.

Digital Library

[16]

Mario Lopez Scott Leutenegger and Jeffrey Edgington. STR: A Simple and Efficient Algorithm for R-Tree Packing. In International Conference on Data Engineering (ICDE), pages 497-506, 1997.

Digital Library

[17]

B. Seeger. An analysis of schedules for performing multi-page requests. Information Systems, 21(5):387-407, 1996.

Digital Library

[18]

T. Seidl and H.P. Kriegel. Optimal Multi-Step k-Nearest Neighbor Search. In International Conference on Management of Data (SIGMOD), 1998.

Digital Library

[19]

I. Stanoi, M. Riedewald, D. Agrawal, and A.E. Abbadi. Discovery of Influence Sets in Frequently Updated Databases. In International Conference on Very Large Databases (VLDB), pages 99-108, 2001.

Digital Library

[20]

Y. Tao, D. Papadias, and X. Lian. Reverse kNN Search in Arbitrary Dimensionality. In International Conference on Very Large Databases (VLDB), 2004.

Digital Library

[21]

C. Xia, H. Lu, B.C. Ooi, and J. Hu. GORDER: An Efficient Method for KNN Join Processing. In International Conference on Very Large Databases (VLDB), 2004.

Digital Library

[22]

C. Yang and K.-I. Lin. An Index Structure for Efficient Reverse Nearest Neighbor Queries. In International Conference on Data Engineering (ICDE), pages 485-492, 2001.

Digital Library

Cited By

Tang Y(2010)Towards evaluating GRASIM for ontology-based data matchingProceedings of the 2010 international conference on On the move to meaningful internet systems: Part II10.5555/1926129.1926164(1009-1017)Online publication date: 25-Oct-2010
https://dl.acm.org/doi/10.5555/1926129.1926164

Finding data broadness via generalized nearest neighbors
1. Information systems
  1. Data management systems
    1. Database management system engines
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory

Recommendations

Improving Locality Sensitive Hashing by Efficiently Finding Projected Nearest Neighbors
Similarity Search and Applications
Abstract
Similarity search in high-dimensional spaces is an important task for many multimedia applications. Due to the notorious curse of dimensionality, approximate nearest neighbor techniques are preferred over exact searching techniques since they can ...
K-Nearest Neighbor Finding Using MaxNearestDist

Similarity searching often reduces to finding the k nearest neighbors to a query object. Finding the k nearest neighbors is achieved by applying either a depth- first or a best-first algorithm to the search hierarchy containing the data. These ...
Extending LAESA Fast Nearest Neighbour Algorithm to Find the k Nearest Neighbours
Proceedings of the Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition

Many pattern recognition tasks make use of the k nearest neighbour (k-NN) technique. In this paper we are interested on fast k- NN search algorithms that can work in any metric space i.e. they are not restricted to Euclidean-like distance functions. ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Guide Proceedings

EDBT'06: Proceedings of the 10th international conference on Advances in Database Technology

March 2006

1204 pages

ISBN:3540329609

Editors:
Yannis Ioannidis
University of Athens, Greece
,
Marc H. Scholl
University of Konstanz, Konstanz, Germany
,
Joachim W. Schmidt
Sustainable Content Logistics Centre, Hamburg, Germany
,
Florian Matthes
Chair of Software Engineering for Business Information Systems, Technische Universität München, Boltzmannstraße 3, Garching b. München
,
Mike Hatzopoulos
Department of Informatics, University of Athens Panepistimiopolis, Boltzmannstraße 3, Athens, Greece

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 26 March 2006

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 13 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Tang Y(2010)Towards evaluating GRASIM for ontology-based data matchingProceedings of the 2010 international conference on On the move to meaningful internet systems: Part II10.5555/1926129.1926164(1009-1017)Online publication date: 25-Oct-2010
https://dl.acm.org/doi/10.5555/1926129.1926164

View Options

View options

Media

Figures

Other

Tables

View Table of Contents