article

A comprehensive empirical comparison of hubness reduction in high-dimensional spaces

Authors:

Roman Feldbauer,

Arthur FlexerAuthors Info & Claims

Knowledge and Information Systems, Volume 59, Issue 1

Pages 137 - 166

https://doi.org/10.1007/s10115-018-1205-y

Published: 01 April 2019 Publication History

Abstract

Hubness is an aspect of the curse of dimensionality related to the distance concentration effect. Hubs occur in high-dimensional data spaces as objects that are particularly often among the nearest neighbors of other objects. Conversely, other data objects become antihubs, which are rarely or never nearest neighbors to other objects. Many machine learning algorithms rely on nearest neighbor search and some form of measuring distances, which are both impaired by high hubness. Degraded performance due to hubness has been reported for various tasks such as classification, clustering, regression, visualization, recommendation, retrieval and outlier detection. Several hubness reduction methods based on different paradigms have previously been developed. Local and global scaling as well as shared neighbors approaches aim at repairing asymmetric neighborhood relations. Global and localized centering try to eliminate spatial centrality, while the related global and local dissimilarity measures are based on density gradient flattening. Additional methods and alternative dissimilarity measures that were argued to mitigate detrimental effects of distance concentration also influence the related hubness phenomenon. In this paper, we present a large-scale empirical evaluation of all available unsupervised hubness reduction methods and dissimilarity measures. We investigate several aspects of hubness reduction as well as its influence on data semantics which we measure via nearest neighbor classification. Scaling and density gradient flattening methods improve evaluation measures such as hubness and classification accuracy consistently for data sets from a wide range of domains, while centering approaches achieve the same only under specific settings.

References

[1]

Andoni A, Indyk P, Laarhoven T, Razenshteyn I, Schmidt L (2015) Practical and optimal LSH for angular distance. In: Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R (eds) Advances in neural information processing systems, vol 28. Curran Associates, Inc., Red Hook, pp 1225---1233

Digital Library

[2]

Aryal S, Ting KM, Washio T, Haffari G (2017) Data-dependent dissimilarity measure: an effective alternative to geometric distance measures. Knowl Inf Syst 53(2):479---506

Digital Library

[3]

Aucouturier JJ, Pachet F (2004) Improving timbre similarity: how high is the sky. J Negat Results Speech Audio Sci 1(1):1---13

[4]

Bellman RE (1961) Adaptive control processes: a guided tour. Princeton University Press, Princeton

[5]

Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. J Mach Learn Res 13:281---305

Digital Library

[6]

Bischl B, Lang M, Kotthoff L, Schiffner J, Richter J, Studerus E, Casalicchio G, Jones ZM (2016) mlr: machine learning in R. J Mach Learn Res 17(170):1---5

Digital Library

[7]

Buza K, Nanopoulos A, Nagy G (2015) Nearest neighbor regression in the presence of bad hubs. Knowl-Based Syst 86:250---260

Digital Library

[8]

Camastra F, Staiano A (2016) Intrinsic dimension estimation: advances and open problems. Inf Sci 328:26---41

Digital Library

[9]

Cawley GC, Talbot NL (2010) On over-fitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res 11:2079---2107

Digital Library

[10]

Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol (TIST) 2(3):27:1---27:27

Digital Library

[11]

Chavarriaga R, Sagha H, Calatroni A, Digumarti ST, Trster G, Milln J del R, Roggen D (2013) The opportunity challenge: a benchmark database for on-body sensor-based activity recognition. Pattern Recogn Lett 34(15):2033---2042

Digital Library

[12]

Ciarelli PM, Salles EOT, Oliveira E (2010) An evolving system based on probabilistic neural network. In: Proceedings of the eleventh Brazilian symposiun on neural networks, pp 182---187

Digital Library

[13]

Danziger SA, Baronio R, Ho L, Hall L, Salmon K, Hatfield GW, Kaiser P, Lathrop RH (2009) Predicting positive p53 cancer rescue regions using most informative positive (MIP) active learning. PLoS Comput Biol 5(9):1---12

[14]

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1---30

Digital Library

[15]

Faddoul JB (2012) DMOZ web directory topics. URLhttp://mldata.org/repository/data/viewslug/dmoz-web-directory-topics/

[16]

Feldbauer R, Flexer A (2016) Centering versus scaling for hubness reduction. In: Villa AE, Masulli P, Rivero AJP (eds.) 25th International conference on artificial neural networks, lecture notes in computer science, pp 175---183. Springer

[17]

Flexer A (2015) Improving visualization of high-dimensional music similarity spaces. In: Proceedings of the 16th international society for music information retrieval (ISMIR) conference, pp 547---553

[18]

Flexer A (2016) An empirical analysis of hubness in unsupervised distance-based outlier detection. In: 16th International conference on data mining workshops (ICDMW), pp 716---723. IEEE

[19]

Flexer A (2016) Hubness-aware outlier detection for music genre recognition. In: Proceedings of the 19th international conference on digital audio effects

[20]

Flexer A, Schnitzer D (2013) Can shared nearest neighbors reduce hubness in high-dimensional spaces? In: IEEE 13th international conference on data mining workshops, pp 460---467. IEEE

Digital Library

[21]

Flexer A, Schnitzer D (2015) Choosing $$\ell ^p$$ℓp norms in high-dimensional spaces based on hub analysis. Neurocomputing 169:281---287

[22]

Flexer A, Stevens J (2018) Mutual proximity graphs for improved reachability in music recommendation. J New Music Res 47(1):17---28

[23]

Francois D, Wertz V, Verleysen M (2007) The concentration of fractional distances. IEEE Trans Knowl Data Eng 19(7):873---886

Digital Library

[24]

Güvenir HA, Acar B, Demiröz G, Çekin A (1997) A supervised machine learning algorithm for arrhythmia analysis. In: Proceedings of the computers in cardiology conference, pp 433---436

[25]

Hara K, Suzuki I, Kobayashi K, Fukumizu K (2015) Reducing hubness: a cause of vulnerability in recommender systems. In: Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, pp 815---818

Digital Library

[26]

Hara K, Suzuki I, Kobayashi K, Fukumizu K, Radovanović M (2016) Flattening the density gradient for eliminating spatial centrality to reduce hubness. In: Proceedings of the 30th AAAI conference on artificial intelligence, pp 1659---1665

Digital Library

[27]

Hara K, Suzuki I, Shimbo M, Kobayashi K, Fukumizu K, Radovanovic M (2015) Localized centering: reducing hubness in large-sample data. In: Proceedings of the 29th AAAI conference on artificial intelligence (AAAI), pp 2645---2651

Digital Library

[28]

Higuera C, Gardiner KJ, Cios KJ (2015) Self-organizing feature maps identify proteins critical to learning in a mouse model of down syndrome. PLoS ONE 10(6):e0129,126

[29]

Hoyer PO, Henschel S, Sonnenburg S, Braun ML, Ong CS (2009) machine learning data set repository. URLhttp://mldata.org/

[30]

Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Trans Neural Netw 13(2):415---425

Digital Library

[31]

Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the thirtieth annual ACM symposium on theory of computing, pp 604---613. ACM

Digital Library

[32]

Jarvis RA, Patrick EA (1973) Clustering using a similarity measure based on shared near neighbors. IEEE Trans Comput 100(11):1025---1034

Digital Library

[33]

Jegou H, Harzallah H, Schmid C (2007) A contextual dissimilarity measure for accurate and efficient image search. In: IEEE conference on computer vision and pattern recognition, pp 1---8. IEEE

[34]

Knees P, Schnitzer D, Flexer A (2014) Improving neighborhood-based collaborative filtering by reducing hubness. In: Proceedings of international conference on multimedia retrieval, ICMR '14, pp 161---168

Digital Library

[35]

Kulis B (2013) Metric learning: a survey. Found Trends Mach Learn 5(4):287---364

[36]

Levina E, Bickel PJ (2005) Maximum likelihood estimation of intrinsic dimension. In: Saul LK, Weiss Y, Bottou L (eds) Advances in neural information processing systems, vol 17. MIT Press, Cambridge, pp 777---784

Digital Library

[37]

Lewis DD, Yang Y, Rose TG, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 5:361---397

Digital Library

[38]

Lichman M (2013) UCI machine learning repository. URLhttp://archive.ics.uci.edu/ml

[39]

Low T, Borgelt C, Stober S, Nürnberger A (2013) The hubness phenomenon: Fact or artifact? In: Borgelt C, Gil MÁ, Sousa JM, Verleysen M (eds) Towards advanced data analysis by combining soft computing and statistics. Springer, Berlin, pp 267---278

Digital Library

[40]

McCallum A, Nigam K (1998) A comparison of event models for naive bayes text classification. In: Proceedings of the AAAI'98 workshop on learning for text categorization

[41]

Mesterharm C, Pazzani MJ (2011) Active learning using on-line algorithms. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, pp 850---858

Digital Library

[42]

Pagnotta F, Amran HM (2016) Using data mining to predict secondary school student alcohol consumption. http://www3.dsi.uminho.pt/pcortez

[43]

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825---2830

Digital Library

[44]

Radovanović M, Nanopoulos A, Ivanović M (2010) Hubs in space: popular nearest neighbors in high-dimensional data. J Mach Learn Res 11:2487---2531

Digital Library

[45]

Radovanović M, Nanopoulos A, Ivanović M (2015) Reverse nearest neighbors in unsupervised distance-based outlier detection. IEEE Trans Knowl Data Eng 27(5):1369---1382

Digital Library

[46]

Sakar BE, Isenkul ME, Sakar CO, Sertbas A, Gurgen F, Delil S, Apaydin H, Kursun O (2013) Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings. IEEE J Biomed Health Inf 17(4):828---834

[47]

Schnitzer D, Flexer A (2015) The unbalancing effect of hubs on k-medoids clustering in high-dimensional spaces. In: International joint conference on neural networks (IJCNN), pp 1---8

[48]

Schnitzer D, Flexer A, Schedl M, Widmer G (2011) Using mutual proximity to improve content-based audio similarity. In: Proceedings of the 12th international society for music information retrieval conference, pp 79---84

[49]

Schnitzer D, Flexer A, Schedl M, Widmer G (2012) Local and global scaling reduce hubs in space. J Mach Learn Res 13(1):2871---2902

Digital Library

[50]

Schnitzer D, Flexer A, Schlüter J (2013) The relation of hubs to the doddington zoo in speaker verification. In: Proceedings of the 21st European signal processing conference (EUSIPCO), pp 1---5. IEEE

[51]

Schnitzer D, Flexer A, Tomašev N (2014) A case for hubness removal in high-dimensional multimedia retrieval. In: European conference on information retrieval, pp 687---692. Springer

Digital Library

[52]

Semeion: Semeion handwritten digit. Tech. rep., Semeion Research Center of Sciences of Communication, via Sersale 117, 00128 Rome, Italy Tattile Via Gaetano Donizetti, 1-3-5,25030 Mairano (Brescia), Italy (2008)

[53]

Soundarapandian P (2015). URLhttp://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease

[54]

Stiglic G, Kokol P (2010) Stability of ranked gene lists in large microarray analysis studies. J Biomed Biotechnol 2010:616358

[55]

Sun J, Yang Z, Wang P, Liu S (2010) Variable length character n-gram approach for online writeprint identification. In: Proceedings of the international conference on multimedia information networking and security, pp 486---490

Digital Library

[56]

Suzuki I, Hara K, Shimbo M, Saerens M, Fukumizu K (2013) Centering similarity measures to reduce hubs. In: Proceedings of the 2013 conference on empirical methods in natural language processing, 13: 613---623

[57]

Tomašev N (2015) Taming the empirical hubness risk in many dimensions. In: Proceedings of the 15th SIAM international conference on data mining (SDM), pp 1---9. SIAM

[58]

Tomašev N, Brehar R, Mladenić D, Nedevschi S (2011) The influence of hubness on nearest-neighbor methods in object recognition. In: IEEE international conference on intelligent computer communication and processing (ICCP), pp 367---374

[59]

Tomašev N, Mladenić D (2014) Hubness-aware shared neighbor distances for high-dimensional k-nearest neighbor classification. Knowl Inf Syst 39(1):89

Digital Library

[60]

Tomašev N, Radovanović M, Mladenić D, Ivanović M (2011) The role of hubness in clustering high-dimensional data. Adv Knowl Discov Data Min 6634:183---195

Digital Library

[61]

Vanschoren J, van Rijn JN, Bischl B, Torgo L (2013) Openml: networked science in machine learning. SIGKDD Explor 15(2):49---60

Digital Library

[62]

Vincent E, Gkiokas A, Schnitzer D, Flexer A (2014) An investigation of likelihood normalization for robust ASR. In: Proceedings of the annual conference of the international speech communication association

[63]

Wang JY (2002) Application of support vector machines in bioinformatics. Master's thesis, National Taiwan University

[64]

Zelnik-Manor L, Perona P (2005) Self-tuning spectral clustering. Adva Neural Inf Process Syst 17:1601---1608

Digital Library

Cited By

Peter SChowdhury SCancino-Chacón CWidmer G(2023)Are we describing the same sound? An analysis of word embedding spaces of expressive piano performanceProceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation10.1145/3632754.3632759(58-66)Online publication date: 15-Dec-2023
https://dl.acm.org/doi/10.1145/3632754.3632759
Obraczka DRahm E(2022)Fast Hubness-Reduced Nearest Neighbor Search for Entity Alignment in Knowledge GraphsSN Computer Science10.1007/s42979-022-01417-13:6Online publication date: 6-Oct-2022
https://dl.acm.org/doi/10.1007/s42979-022-01417-1
Li CWang SHuang FXu JYu PDemartini GZuccon GCulpepper JHuang ZTong H(2021)Hubness-aware User Identity LinkageProceedings of the 30th ACM International Conference on Information & Knowledge Management10.1145/3459637.3482122(3196-3200)Online publication date: 26-Oct-2021
https://dl.acm.org/doi/10.1145/3459637.3482122

Index Terms

A comprehensive empirical comparison of hubness reduction in high-dimensional spaces
1. Computing methodologies
  1. Machine learning

Index terms have been assigned to the content through auto-classification.

Recommendations

On the behavior of intrinsically high-dimensional spaces: distances, direct and reverse nearest neighbors, and hubness

Over the years, different characterizations of the curse of dimensionality have been provided, usually stating the conditions under which, in the limit of the infinite dimensionality, distances become indistinguishable. However, these characterizations ...
Hubness-Aware shared neighbor distances for high-dimensional k-nearest neighbor classification
HAIS'12: Proceedings of the 7th international conference on Hybrid Artificial Intelligent Systems - Volume Part II

Learning from high-dimensional data is usually quite a challenging task, as captured by the well known phrase curse of dimensionality. Most distance-based methods become impaired due to the distance concentration of many widely used metrics in high-...
Can Shared Nearest Neighbors Reduce Hubness in High-Dimensional Spaces?
ICDMW '13: Proceedings of the 2013 IEEE 13th International Conference on Data Mining Workshops

'Hubness' is a recently discovered general problem of machine learning in high dimensional data spaces. Hub objects have a small distance to an exceptionally large number of data points, and anti-hubs are far from all other data points. It is related to ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Knowledge and Information Systems

Knowledge and Information Systems Volume 59, Issue 1

April 2019

245 pages

ISSN:0219-1377

Issue’s Table of Contents

Copyright © Copyright © 2019 Springer-Verlag London Ltd., part of Springer Nature.

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 01 April 2019

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

3
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 28 Sep 2024

Other Metrics

View Author Metrics

Citations

Cited By

Peter SChowdhury SCancino-Chacón CWidmer G(2023)Are we describing the same sound? An analysis of word embedding spaces of expressive piano performanceProceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation10.1145/3632754.3632759(58-66)Online publication date: 15-Dec-2023
https://dl.acm.org/doi/10.1145/3632754.3632759
Obraczka DRahm E(2022)Fast Hubness-Reduced Nearest Neighbor Search for Entity Alignment in Knowledge GraphsSN Computer Science10.1007/s42979-022-01417-13:6Online publication date: 6-Oct-2022
https://dl.acm.org/doi/10.1007/s42979-022-01417-1
Li CWang SHuang FXu JYu PDemartini GZuccon GCulpepper JHuang ZTong H(2021)Hubness-aware User Identity LinkageProceedings of the 30th ACM International Conference on Information & Knowledge Management10.1145/3459637.3482122(3196-3200)Online publication date: 26-Oct-2021
https://dl.acm.org/doi/10.1145/3459637.3482122

View Options

View options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Issue’s Table of Contents