Nothing Special   »   [go: up one dir, main page]

Skip to main content
Log in

Minimum Epsilon-Kernel Computation for Large-Scale Data Processing

  • Regular Paper
  • Published:
Journal of Computer Science and Technology Aims and scope Submit manuscript

Abstract

Kernel is a kind of data summary which is elaborately extracted from a large dataset. Given a problem, the solution obtained from the kernel is an approximate version of the solution obtained from the whole dataset with a provable approximate ratio. It is widely used in geometric optimization, clustering, and approximate query processing, etc., for scaling them up to massive data. In this paper, we focus on the minimum ε-kernel (MK) computation that asks for a kernel of the smallest size for large-scale data processing. For the open problem presented by Wang et al. that whether the minimum ε-coreset (MC) problem and the MK problem can be reduced to each other, we first formalize the MK problem and analyze its complexity. Due to the NP-hardness of the MK problem in three or higher dimensions, an approximate algorithm, namely Set Cover-Based Minimum ε-Kernel algorithm (SCMK), is developed to solve it. We prove that the MC problem and the MK problem can be Turing-reduced to each other. Then, we discuss the update of MK under insertion and deletion operations, respectively. Finally, a randomized algorithm, called the Randomized Algorithm of Set Cover-Based Minimum ε-Kernel algorithm (RA-SCMK), is utilized to further reduce the complexity of SCMK. The efficiency and effectiveness of SCMK and RA-SCMK are verified by experimental results on real-world and synthetic datasets. Experiments show that the kernel sizes of SCMK are 2x and 17.6x smaller than those of an ANN-based method on real-world and synthetic datasets, respectively. The speedup ratio of SCMK over the ANN-based method is 5.67 on synthetic datasets. RA-SCMK runs up to three times faster than SCMK on synthetic datasets.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

Explore related subjects

Discover the latest articles, news and stories from top researchers in related subjects.

References

  1. Agarwal P K, Har-Peled S, Varadarajan K R. Approximating extent measures of points. Journal of the ACM, 2004, 51(4): 606-635. https://doi.org/10.1145/1008731.1008736.

    Article  MathSciNet  MATH  Google Scholar 

  2. Wang Y, Li Y, Tan K L. Coresets for minimum enclosing balls over sliding windows. In Proc. the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2019, pp.314-323. https://doi.org/10.1145/3292500.3330826.

  3. Bachem O, Lucic M, Krause A. Scalable k-means clustering via lightweight coresets. In Proc. the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2018, pp.1119-1127. https://doi.org/10.1145/3219819.3219973.

  4. Har-Peled S, Mazumdar S. On coresets for k-means and k-median clustering. In Proc. the 36th Annual ACM Symposium on Theory of Computing, Jun. 2004, pp.291-300. https://doi.org/10.1145/1007352.1007400.

  5. Yu A, Agarwal P K, Yang J. Processing a large number of continuous preference top-k queries. In Proc. the 2012 ACM SIGMOD International Conference on Management of Data, May 2012, pp.397-408. https://doi.org/10.1145/2213836.2213882.

  6. Wang Y, Mathioudakis M, Li Y, Tan K L. Minimum coresets for maxima representation of multidimensional data. In Proc. the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Jun. 2021, pp.138-152. https://doi.org/10.1145/3452021.3458322.

  7. Agarwal P K, Har-Peled S, Varadarajan K R. Geometric approximation via coresets. In Combinatorial and Computational Geometry, 2005, 52. http://library.msri.org/books/Book52/files/01agar.pdf, Oct. 2022.

  8. Yu H, Agarwal P K, Poreddy R, Varadarajan K R. Practical methods for shape fitting and kinetic data structures using coresets. Algorithmica, 2008, 52(3): 378-402. https://doi.org/10.1007/s00453-007-9067-9.

    Article  MathSciNet  MATH  Google Scholar 

  9. Luo G, Wu K L, Yu P S. Answering linear optimization queries with an approximate stream index. Knowledge and Information Systems, 2009, 20(1): 95-121. https://doi.org/10.1007/s10115-008-0157-z.

    Article  Google Scholar 

  10. Agarwal P K, Kumar N, Sintos S, Suri S. Efficient algorithms for k-regret minimizing sets. In Proc. the 16th International Symposium on Experimental Algorithms, Jun. 2017, Article No. 7. https://doi.org/10.4230/LIPIcs.SEA.2017.7.

  11. Xie M, Wong R C W, Li J, Long C, Lall A. Efficient k-regret query algorithm with restriction-free bound for any dimensionality. In Proc. the 2018 International Conference on Management of Data, Jun. 2018, pp.959-974. https://doi.org/10.1145/3183713.3196903.

  12. Zheng J, Wang Y, Wang X, Ma W. Continuous k-regret minimization queries: A dynamic coreset approach. IEEE Transactions on Knowledge and Data Engineering. https://doi.org/10.1109/TKDE.2022.3166835.

  13. Chan T M. Faster core-set constructions and data-stream algorithms in fixed dimensions. Computational Geometry, 2006, 35(1/2): 20-35. https://doi.org/10.1016/j.comgeo.2005.10.002.

    Article  MathSciNet  MATH  Google Scholar 

  14. Arya S, Chan T M. Better ε-dependencies for offline approximate nearest neighbor search, Euclidean minimum spanning trees, and ε-kernels. In Proc. the 30th Annual Symposium on Computational Geometry, Jun. 2014, pp.416-425. https://doi.org/10.1145/2582112.2582161.

  15. Arya S, Da Fonseca G D, Mount D M. Near-optimal ε-Kernel construction and related problems. In Proc. the 33rd International Symposium on Computational Geometry, Jul. 2017, Article No. 10. https://doi.org/10.4230/LIPIcs.SoCG.2017.10.

  16. Chan T M. Applications of Chebyshev polynomials to low-dimensional computational geometry. Journal of Computational Geometry, 2018, 9(2): 3-20. https://doi.org/10.20382/jocg.v9i2a2.

    Article  MathSciNet  MATH  Google Scholar 

  17. Chan T M. Dynamic coresets. Discrete and Computational Geometry, 2009, 42(3): 469-488. https://doi.org/10.1007/s00454-009-9165-3.

    Article  MathSciNet  MATH  Google Scholar 

  18. Andoni A, Nguyen H L. Width of points in the streaming model. In Proc. the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms, Jan. 2012, pp.447-452. https://doi.org/10.1137/1.9781611973099.38.

  19. Chan T M. Dynamic streaming algorithms for ε-kernels. In Proc. the 32nd International Symposium on Computational Geometry, Jun. 2016, Article No. 27. https://doi.org/10.4230/LIPIcs.SoCG.2016.27.

  20. Guo H, Li J, Gao H. Data source selection for approximate query. Journal of Combinatorial Optimization, 2022, 44(4): 2443-2459. https://doi.org/10.1007/s10878-021-00760-y.

    Article  MathSciNet  MATH  Google Scholar 

  21. Suh N P. Complexity: Theory and Applications (1st edition). Oxford University Press, 2005.

  22. Bentley J L. Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975, 18(9): 509-517. https://doi.org/10.1145/361002.361007.

    Article  MATH  Google Scholar 

  23. Yao A C C. On constructing minimum spanning trees in k-dimensional spaces and related problems. SIAM Journal on Computing, 1982, 11(4): 721-736. https://doi.org/10.1137/0211059.

    Article  MathSciNet  MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Hong Gao.

Supplementary Information

ESM 1

(PDF 1089 kb)

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Guo, HJ., Li, JZ. & Gao, H. Minimum Epsilon-Kernel Computation for Large-Scale Data Processing. J. Comput. Sci. Technol. 37, 1398–1411 (2022). https://doi.org/10.1007/s11390-022-2429-6

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11390-022-2429-6

Keywords

Navigation