Abstract
Kernel is a kind of data summary which is elaborately extracted from a large dataset. Given a problem, the solution obtained from the kernel is an approximate version of the solution obtained from the whole dataset with a provable approximate ratio. It is widely used in geometric optimization, clustering, and approximate query processing, etc., for scaling them up to massive data. In this paper, we focus on the minimum ε-kernel (MK) computation that asks for a kernel of the smallest size for large-scale data processing. For the open problem presented by Wang et al. that whether the minimum ε-coreset (MC) problem and the MK problem can be reduced to each other, we first formalize the MK problem and analyze its complexity. Due to the NP-hardness of the MK problem in three or higher dimensions, an approximate algorithm, namely Set Cover-Based Minimum ε-Kernel algorithm (SCMK), is developed to solve it. We prove that the MC problem and the MK problem can be Turing-reduced to each other. Then, we discuss the update of MK under insertion and deletion operations, respectively. Finally, a randomized algorithm, called the Randomized Algorithm of Set Cover-Based Minimum ε-Kernel algorithm (RA-SCMK), is utilized to further reduce the complexity of SCMK. The efficiency and effectiveness of SCMK and RA-SCMK are verified by experimental results on real-world and synthetic datasets. Experiments show that the kernel sizes of SCMK are 2x and 17.6x smaller than those of an ANN-based method on real-world and synthetic datasets, respectively. The speedup ratio of SCMK over the ANN-based method is 5.67 on synthetic datasets. RA-SCMK runs up to three times faster than SCMK on synthetic datasets.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Agarwal P K, Har-Peled S, Varadarajan K R. Approximating extent measures of points. Journal of the ACM, 2004, 51(4): 606-635. https://doi.org/10.1145/1008731.1008736.
Wang Y, Li Y, Tan K L. Coresets for minimum enclosing balls over sliding windows. In Proc. the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2019, pp.314-323. https://doi.org/10.1145/3292500.3330826.
Bachem O, Lucic M, Krause A. Scalable k-means clustering via lightweight coresets. In Proc. the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2018, pp.1119-1127. https://doi.org/10.1145/3219819.3219973.
Har-Peled S, Mazumdar S. On coresets for k-means and k-median clustering. In Proc. the 36th Annual ACM Symposium on Theory of Computing, Jun. 2004, pp.291-300. https://doi.org/10.1145/1007352.1007400.
Yu A, Agarwal P K, Yang J. Processing a large number of continuous preference top-k queries. In Proc. the 2012 ACM SIGMOD International Conference on Management of Data, May 2012, pp.397-408. https://doi.org/10.1145/2213836.2213882.
Wang Y, Mathioudakis M, Li Y, Tan K L. Minimum coresets for maxima representation of multidimensional data. In Proc. the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Jun. 2021, pp.138-152. https://doi.org/10.1145/3452021.3458322.
Agarwal P K, Har-Peled S, Varadarajan K R. Geometric approximation via coresets. In Combinatorial and Computational Geometry, 2005, 52. http://library.msri.org/books/Book52/files/01agar.pdf, Oct. 2022.
Yu H, Agarwal P K, Poreddy R, Varadarajan K R. Practical methods for shape fitting and kinetic data structures using coresets. Algorithmica, 2008, 52(3): 378-402. https://doi.org/10.1007/s00453-007-9067-9.
Luo G, Wu K L, Yu P S. Answering linear optimization queries with an approximate stream index. Knowledge and Information Systems, 2009, 20(1): 95-121. https://doi.org/10.1007/s10115-008-0157-z.
Agarwal P K, Kumar N, Sintos S, Suri S. Efficient algorithms for k-regret minimizing sets. In Proc. the 16th International Symposium on Experimental Algorithms, Jun. 2017, Article No. 7. https://doi.org/10.4230/LIPIcs.SEA.2017.7.
Xie M, Wong R C W, Li J, Long C, Lall A. Efficient k-regret query algorithm with restriction-free bound for any dimensionality. In Proc. the 2018 International Conference on Management of Data, Jun. 2018, pp.959-974. https://doi.org/10.1145/3183713.3196903.
Zheng J, Wang Y, Wang X, Ma W. Continuous k-regret minimization queries: A dynamic coreset approach. IEEE Transactions on Knowledge and Data Engineering. https://doi.org/10.1109/TKDE.2022.3166835.
Chan T M. Faster core-set constructions and data-stream algorithms in fixed dimensions. Computational Geometry, 2006, 35(1/2): 20-35. https://doi.org/10.1016/j.comgeo.2005.10.002.
Arya S, Chan T M. Better ε-dependencies for offline approximate nearest neighbor search, Euclidean minimum spanning trees, and ε-kernels. In Proc. the 30th Annual Symposium on Computational Geometry, Jun. 2014, pp.416-425. https://doi.org/10.1145/2582112.2582161.
Arya S, Da Fonseca G D, Mount D M. Near-optimal ε-Kernel construction and related problems. In Proc. the 33rd International Symposium on Computational Geometry, Jul. 2017, Article No. 10. https://doi.org/10.4230/LIPIcs.SoCG.2017.10.
Chan T M. Applications of Chebyshev polynomials to low-dimensional computational geometry. Journal of Computational Geometry, 2018, 9(2): 3-20. https://doi.org/10.20382/jocg.v9i2a2.
Chan T M. Dynamic coresets. Discrete and Computational Geometry, 2009, 42(3): 469-488. https://doi.org/10.1007/s00454-009-9165-3.
Andoni A, Nguyen H L. Width of points in the streaming model. In Proc. the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms, Jan. 2012, pp.447-452. https://doi.org/10.1137/1.9781611973099.38.
Chan T M. Dynamic streaming algorithms for ε-kernels. In Proc. the 32nd International Symposium on Computational Geometry, Jun. 2016, Article No. 27. https://doi.org/10.4230/LIPIcs.SoCG.2016.27.
Guo H, Li J, Gao H. Data source selection for approximate query. Journal of Combinatorial Optimization, 2022, 44(4): 2443-2459. https://doi.org/10.1007/s10878-021-00760-y.
Suh N P. Complexity: Theory and Applications (1st edition). Oxford University Press, 2005.
Bentley J L. Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975, 18(9): 509-517. https://doi.org/10.1145/361002.361007.
Yao A C C. On constructing minimum spanning trees in k-dimensional spaces and related problems. SIAM Journal on Computing, 1982, 11(4): 721-736. https://doi.org/10.1137/0211059.
Author information
Authors and Affiliations
Corresponding author
Supplementary Information
ESM 1
(PDF 1089 kb)
Rights and permissions
About this article
Cite this article
Guo, HJ., Li, JZ. & Gao, H. Minimum Epsilon-Kernel Computation for Large-Scale Data Processing. J. Comput. Sci. Technol. 37, 1398–1411 (2022). https://doi.org/10.1007/s11390-022-2429-6
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11390-022-2429-6