Nothing Special   »   [go: up one dir, main page]

skip to main content
article

A comprehensive empirical comparison of hubness reduction in high-dimensional spaces

Published: 01 April 2019 Publication History

Abstract

Hubness is an aspect of the curse of dimensionality related to the distance concentration effect. Hubs occur in high-dimensional data spaces as objects that are particularly often among the nearest neighbors of other objects. Conversely, other data objects become antihubs, which are rarely or never nearest neighbors to other objects. Many machine learning algorithms rely on nearest neighbor search and some form of measuring distances, which are both impaired by high hubness. Degraded performance due to hubness has been reported for various tasks such as classification, clustering, regression, visualization, recommendation, retrieval and outlier detection. Several hubness reduction methods based on different paradigms have previously been developed. Local and global scaling as well as shared neighbors approaches aim at repairing asymmetric neighborhood relations. Global and localized centering try to eliminate spatial centrality, while the related global and local dissimilarity measures are based on density gradient flattening. Additional methods and alternative dissimilarity measures that were argued to mitigate detrimental effects of distance concentration also influence the related hubness phenomenon. In this paper, we present a large-scale empirical evaluation of all available unsupervised hubness reduction methods and dissimilarity measures. We investigate several aspects of hubness reduction as well as its influence on data semantics which we measure via nearest neighbor classification. Scaling and density gradient flattening methods improve evaluation measures such as hubness and classification accuracy consistently for data sets from a wide range of domains, while centering approaches achieve the same only under specific settings.

References

[1]
Andoni A, Indyk P, Laarhoven T, Razenshteyn I, Schmidt L (2015) Practical and optimal LSH for angular distance. In: Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R (eds) Advances in neural information processing systems, vol 28. Curran Associates, Inc., Red Hook, pp 1225---1233
[2]
Aryal S, Ting KM, Washio T, Haffari G (2017) Data-dependent dissimilarity measure: an effective alternative to geometric distance measures. Knowl Inf Syst 53(2):479---506
[3]
Aucouturier JJ, Pachet F (2004) Improving timbre similarity: how high is the sky. J Negat Results Speech Audio Sci 1(1):1---13
[4]
Bellman RE (1961) Adaptive control processes: a guided tour. Princeton University Press, Princeton
[5]
Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. J Mach Learn Res 13:281---305
[6]
Bischl B, Lang M, Kotthoff L, Schiffner J, Richter J, Studerus E, Casalicchio G, Jones ZM (2016) mlr: machine learning in R. J Mach Learn Res 17(170):1---5
[7]
Buza K, Nanopoulos A, Nagy G (2015) Nearest neighbor regression in the presence of bad hubs. Knowl-Based Syst 86:250---260
[8]
Camastra F, Staiano A (2016) Intrinsic dimension estimation: advances and open problems. Inf Sci 328:26---41
[9]
Cawley GC, Talbot NL (2010) On over-fitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res 11:2079---2107
[10]
Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol (TIST) 2(3):27:1---27:27
[11]
Chavarriaga R, Sagha H, Calatroni A, Digumarti ST, Trster G, Milln J del R, Roggen D (2013) The opportunity challenge: a benchmark database for on-body sensor-based activity recognition. Pattern Recogn Lett 34(15):2033---2042
[12]
Ciarelli PM, Salles EOT, Oliveira E (2010) An evolving system based on probabilistic neural network. In: Proceedings of the eleventh Brazilian symposiun on neural networks, pp 182---187
[13]
Danziger SA, Baronio R, Ho L, Hall L, Salmon K, Hatfield GW, Kaiser P, Lathrop RH (2009) Predicting positive p53 cancer rescue regions using most informative positive (MIP) active learning. PLoS Comput Biol 5(9):1---12
[14]
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1---30
[15]
Faddoul JB (2012) DMOZ web directory topics. URLhttp://mldata.org/repository/data/viewslug/dmoz-web-directory-topics/
[16]
Feldbauer R, Flexer A (2016) Centering versus scaling for hubness reduction. In: Villa AE, Masulli P, Rivero AJP (eds.) 25th International conference on artificial neural networks, lecture notes in computer science, pp 175---183. Springer
[17]
Flexer A (2015) Improving visualization of high-dimensional music similarity spaces. In: Proceedings of the 16th international society for music information retrieval (ISMIR) conference, pp 547---553
[18]
Flexer A (2016) An empirical analysis of hubness in unsupervised distance-based outlier detection. In: 16th International conference on data mining workshops (ICDMW), pp 716---723. IEEE
[19]
Flexer A (2016) Hubness-aware outlier detection for music genre recognition. In: Proceedings of the 19th international conference on digital audio effects
[20]
Flexer A, Schnitzer D (2013) Can shared nearest neighbors reduce hubness in high-dimensional spaces? In: IEEE 13th international conference on data mining workshops, pp 460---467. IEEE
[21]
Flexer A, Schnitzer D (2015) Choosing $$\ell ^p$$ℓp norms in high-dimensional spaces based on hub analysis. Neurocomputing 169:281---287
[22]
Flexer A, Stevens J (2018) Mutual proximity graphs for improved reachability in music recommendation. J New Music Res 47(1):17---28
[23]
Francois D, Wertz V, Verleysen M (2007) The concentration of fractional distances. IEEE Trans Knowl Data Eng 19(7):873---886
[24]
Güvenir HA, Acar B, Demiröz G, Çekin A (1997) A supervised machine learning algorithm for arrhythmia analysis. In: Proceedings of the computers in cardiology conference, pp 433---436
[25]
Hara K, Suzuki I, Kobayashi K, Fukumizu K (2015) Reducing hubness: a cause of vulnerability in recommender systems. In: Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, pp 815---818
[26]
Hara K, Suzuki I, Kobayashi K, Fukumizu K, Radovanović M (2016) Flattening the density gradient for eliminating spatial centrality to reduce hubness. In: Proceedings of the 30th AAAI conference on artificial intelligence, pp 1659---1665
[27]
Hara K, Suzuki I, Shimbo M, Kobayashi K, Fukumizu K, Radovanovic M (2015) Localized centering: reducing hubness in large-sample data. In: Proceedings of the 29th AAAI conference on artificial intelligence (AAAI), pp 2645---2651
[28]
Higuera C, Gardiner KJ, Cios KJ (2015) Self-organizing feature maps identify proteins critical to learning in a mouse model of down syndrome. PLoS ONE 10(6):e0129,126
[29]
Hoyer PO, Henschel S, Sonnenburg S, Braun ML, Ong CS (2009) machine learning data set repository. URLhttp://mldata.org/
[30]
Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Trans Neural Netw 13(2):415---425
[31]
Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the thirtieth annual ACM symposium on theory of computing, pp 604---613. ACM
[32]
Jarvis RA, Patrick EA (1973) Clustering using a similarity measure based on shared near neighbors. IEEE Trans Comput 100(11):1025---1034
[33]
Jegou H, Harzallah H, Schmid C (2007) A contextual dissimilarity measure for accurate and efficient image search. In: IEEE conference on computer vision and pattern recognition, pp 1---8. IEEE
[34]
Knees P, Schnitzer D, Flexer A (2014) Improving neighborhood-based collaborative filtering by reducing hubness. In: Proceedings of international conference on multimedia retrieval, ICMR '14, pp 161---168
[35]
Kulis B (2013) Metric learning: a survey. Found Trends Mach Learn 5(4):287---364
[36]
Levina E, Bickel PJ (2005) Maximum likelihood estimation of intrinsic dimension. In: Saul LK, Weiss Y, Bottou L (eds) Advances in neural information processing systems, vol 17. MIT Press, Cambridge, pp 777---784
[37]
Lewis DD, Yang Y, Rose TG, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 5:361---397
[38]
Lichman M (2013) UCI machine learning repository. URLhttp://archive.ics.uci.edu/ml
[39]
Low T, Borgelt C, Stober S, Nürnberger A (2013) The hubness phenomenon: Fact or artifact? In: Borgelt C, Gil MÁ, Sousa JM, Verleysen M (eds) Towards advanced data analysis by combining soft computing and statistics. Springer, Berlin, pp 267---278
[40]
McCallum A, Nigam K (1998) A comparison of event models for naive bayes text classification. In: Proceedings of the AAAI'98 workshop on learning for text categorization
[41]
Mesterharm C, Pazzani MJ (2011) Active learning using on-line algorithms. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, pp 850---858
[42]
Pagnotta F, Amran HM (2016) Using data mining to predict secondary school student alcohol consumption. http://www3.dsi.uminho.pt/pcortez
[43]
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825---2830
[44]
Radovanović M, Nanopoulos A, Ivanović M (2010) Hubs in space: popular nearest neighbors in high-dimensional data. J Mach Learn Res 11:2487---2531
[45]
Radovanović M, Nanopoulos A, Ivanović M (2015) Reverse nearest neighbors in unsupervised distance-based outlier detection. IEEE Trans Knowl Data Eng 27(5):1369---1382
[46]
Sakar BE, Isenkul ME, Sakar CO, Sertbas A, Gurgen F, Delil S, Apaydin H, Kursun O (2013) Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings. IEEE J Biomed Health Inf 17(4):828---834
[47]
Schnitzer D, Flexer A (2015) The unbalancing effect of hubs on k-medoids clustering in high-dimensional spaces. In: International joint conference on neural networks (IJCNN), pp 1---8
[48]
Schnitzer D, Flexer A, Schedl M, Widmer G (2011) Using mutual proximity to improve content-based audio similarity. In: Proceedings of the 12th international society for music information retrieval conference, pp 79---84
[49]
Schnitzer D, Flexer A, Schedl M, Widmer G (2012) Local and global scaling reduce hubs in space. J Mach Learn Res 13(1):2871---2902
[50]
Schnitzer D, Flexer A, Schlüter J (2013) The relation of hubs to the doddington zoo in speaker verification. In: Proceedings of the 21st European signal processing conference (EUSIPCO), pp 1---5. IEEE
[51]
Schnitzer D, Flexer A, Tomašev N (2014) A case for hubness removal in high-dimensional multimedia retrieval. In: European conference on information retrieval, pp 687---692. Springer
[52]
Semeion: Semeion handwritten digit. Tech. rep., Semeion Research Center of Sciences of Communication, via Sersale 117, 00128 Rome, Italy Tattile Via Gaetano Donizetti, 1-3-5,25030 Mairano (Brescia), Italy (2008)
[53]
Soundarapandian P (2015). URLhttp://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease
[54]
Stiglic G, Kokol P (2010) Stability of ranked gene lists in large microarray analysis studies. J Biomed Biotechnol 2010:616358
[55]
Sun J, Yang Z, Wang P, Liu S (2010) Variable length character n-gram approach for online writeprint identification. In: Proceedings of the international conference on multimedia information networking and security, pp 486---490
[56]
Suzuki I, Hara K, Shimbo M, Saerens M, Fukumizu K (2013) Centering similarity measures to reduce hubs. In: Proceedings of the 2013 conference on empirical methods in natural language processing, 13: 613---623
[57]
Tomašev N (2015) Taming the empirical hubness risk in many dimensions. In: Proceedings of the 15th SIAM international conference on data mining (SDM), pp 1---9. SIAM
[58]
Tomašev N, Brehar R, Mladenić D, Nedevschi S (2011) The influence of hubness on nearest-neighbor methods in object recognition. In: IEEE international conference on intelligent computer communication and processing (ICCP), pp 367---374
[59]
Tomašev N, Mladenić D (2014) Hubness-aware shared neighbor distances for high-dimensional k-nearest neighbor classification. Knowl Inf Syst 39(1):89
[60]
Tomašev N, Radovanović M, Mladenić D, Ivanović M (2011) The role of hubness in clustering high-dimensional data. Adv Knowl Discov Data Min 6634:183---195
[61]
Vanschoren J, van Rijn JN, Bischl B, Torgo L (2013) Openml: networked science in machine learning. SIGKDD Explor 15(2):49---60
[62]
Vincent E, Gkiokas A, Schnitzer D, Flexer A (2014) An investigation of likelihood normalization for robust ASR. In: Proceedings of the annual conference of the international speech communication association
[63]
Wang JY (2002) Application of support vector machines in bioinformatics. Master's thesis, National Taiwan University
[64]
Zelnik-Manor L, Perona P (2005) Self-tuning spectral clustering. Adva Neural Inf Process Syst 17:1601---1608

Cited By

View all
  • (2023)Are we describing the same sound? An analysis of word embedding spaces of expressive piano performanceProceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation10.1145/3632754.3632759(58-66)Online publication date: 15-Dec-2023
  • (2022)Fast Hubness-Reduced Nearest Neighbor Search for Entity Alignment in Knowledge GraphsSN Computer Science10.1007/s42979-022-01417-13:6Online publication date: 6-Oct-2022
  • (2021)Hubness-aware User Identity LinkageProceedings of the 30th ACM International Conference on Information & Knowledge Management10.1145/3459637.3482122(3196-3200)Online publication date: 26-Oct-2021

Index Terms

  1. A comprehensive empirical comparison of hubness reduction in high-dimensional spaces
    Index terms have been assigned to the content through auto-classification.

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image Knowledge and Information Systems
    Knowledge and Information Systems  Volume 59, Issue 1
    April 2019
    245 pages

    Publisher

    Springer-Verlag

    Berlin, Heidelberg

    Publication History

    Published: 01 April 2019

    Author Tags

    1. Classification
    2. Curse of dimensionality
    3. Hubness
    4. Nearest neighbors
    5. Secondary distances

    Qualifiers

    • Article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)0
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 28 Sep 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)Are we describing the same sound? An analysis of word embedding spaces of expressive piano performanceProceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation10.1145/3632754.3632759(58-66)Online publication date: 15-Dec-2023
    • (2022)Fast Hubness-Reduced Nearest Neighbor Search for Entity Alignment in Knowledge GraphsSN Computer Science10.1007/s42979-022-01417-13:6Online publication date: 6-Oct-2022
    • (2021)Hubness-aware User Identity LinkageProceedings of the 30th ACM International Conference on Information & Knowledge Management10.1145/3459637.3482122(3196-3200)Online publication date: 26-Oct-2021

    View Options

    View options

    Get Access

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media