Abstract
The Mallows model (MM) occupies a central role in parametric modelling of ranking data to learn preferences of a population of judges. Despite the wide range of metrics for rankings that can be considered in the model specification, the choice is typically limited to the Kendall, Cayley or Hamming distances, due to the closed-form expression of the related model normalizing constant. This work instead focuses on the Mallows model with Spearman distance (MMS). A novel approximation of the normalizing constant is introduced to allow inference even with a large number of items. This allows us to develop and implement an efficient and accurate EM algorithm for estimating finite mixtures of MMS aimed at i) enlarging the applicability to samples drawn from heterogeneous populations, and ii) dealing with partial rankings affected by diverse forms of censoring. These novelties encompass the critical inferential steps that traditionally limited the use of this distance in practice, and render the MMS comparable (or even preferable) to the MMs with other metrics in terms of computational burden. The inferential ability of the EM scheme and the effectiveness of the approximation are assessed by extensive simulation studies. Finally, we show that the application to three real-world datasets endorses our proposals also in the comparison with competing mixtures of ranking models.
Notes
Precisely, they are the unnormalized versions of the corresponding correlation coefficients.
This version is available at https://arxiv.org/abs/1405.7945v3. See in particular Section 5.1 and Appendix C.1.
Note that the definition of \(Z(\theta )\) corresponds to the canonical partition function of statistical mechanics \(Z(T) = \sum_i \exp\{-E_i/(k_\beta T)\} = \sum_E g(E)\exp\{-E/(k_\beta T)\}\), where \(k_\beta \) is the Boltzmann constant, T is the temperature of the system, \(E_i\) is the energy of the i-th state and g(E) is the density of states.
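For concreteness, a minimal sketch of the identification, assuming the standard MM form \(Z(\theta ) = \sum_{\pi } \exp\{-\theta\, d(\pi ,\rho )\}\) with consensus ranking \(\rho \):
\[
Z(\theta ) \;=\; \sum_{\pi \in \mathcal{S}_n} e^{-\theta\, d(\pi ,\rho )} \;=\; \sum_{d} N(d)\, e^{-\theta d},
\qquad
E_i \leftrightarrow d(\pi_i,\rho ), \quad \frac{1}{k_\beta T} \leftrightarrow \theta , \quad g(E) \leftrightarrow N(d),
\]
where \(N(d)\) counts the permutations at distance d from \(\rho \).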
In the original notation, the author proved the existence and differentiability of the function \(Z\left( f, \theta \right) = \lim _{n\rightarrow \infty } \frac{1}{n}\log \text {E}\left[ \exp (n\, \theta \, \nu _{\pi }\left[ f\right] )\right] \). By setting \(f(x,y)=(x - y)^2\) and \(\theta =k\), we obtain the desired result.
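As a hedged illustration of this substitution, assuming \(\nu_{\pi }[f]\) denotes the integral of f against the empirical measure of the points \((i/n, \pi (i)/n)\), i.e. \(\nu_{\pi }[f] = n^{-1}\sum_{i=1}^{n} f(i/n, \pi (i)/n)\):
\[
n\, \theta\, \nu_{\pi }[f] \;=\; \theta \sum_{i=1}^{n} \left( \frac{i-\pi (i)}{n} \right)^2 \;=\; \frac{k}{n^2} \sum_{i=1}^{n} \bigl(i-\pi (i)\bigr)^2 \;=\; \frac{k}{n^2}\, d_S(\pi , e),
\]
that is, the exponent of an MMS with identity consensus e and concentration parameter rescaled by \(n^{2}\), so the limit yields the asymptotic log-normalizing constant of the MMS under this scaling.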
Note that this criterion is useful only for automating the simulations: when choosing the number of clusters in applications, the researcher typically uses the BIC, as well as other relevant measures, as guidance, and decides the number of groups also on the basis of other heuristic criteria (such as the size of the clusters, their separation, or prior information depending on the application at hand).
References
Ali, A., Meilă, M.: Experiments with Kemeny ranking: What works when? Math. Soc. Sci. 64(1), 28–40 (2012)
Alvo, M., Yu, P.L.H.: Statistical methods for ranking data. Frontiers in Probability and the Statistical Sciences. Springer, New York, USA (2014)
Andrieu, C., Roberts, G.O.: The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Stat. 37(2), 697–725 (2009)
Bartholdi, J.J., Tovey, C.A., Trick, M.A.: The computational difficulty of manipulating an election. Soc. Choice Welf. 6(3), 227–241 (1989)
Beckett, L.A.: Maximum likelihood estimation in Mallows's model using partially ranked data. In: Fligner, M.A., Verducci, J.S. (eds.) Probability Models and Statistical Analyses for Ranking Data, pp. 92–107. Springer, New York (1993)
Busse, L.M., Orbanz, P., Buhmann, J.M.: Cluster analysis of heterogeneous rank data. In: Proceedings of the 24th International Conference on Machine Learning, pp. 113–120 (2007)
Caron, F., Teh, Y.W., Murphy, T.B.: Bayesian nonparametric Plackett–Luce models for the analysis of preferences for college degree programmes. Ann. Appl. Stat. 8(2), 1145–1181 (2014)
Chen, Q.R., Vansant, G., Oades, K., Pickering, M., Wei, J.S., Song, Y.K., Monforte, J., Khan, J.: Diagnosis of the small round blue cell tumors using multiplex polymerase chain reaction. J. Mol. Diagn. 9(1), 80–88 (2007)
Cramér, H.: Sur un nouveau théoreme-limite de la théorie des probabilités. Actualités scientifiques et industrielles 736, 2–23 (1938)
Crispino, M., Modugno, L., Mollica, C.: The Mallows model with respondents’ covariates. Preliminary draft available upon request. (2023)
Croux, C., Dehon, C.: Influence functions of the Spearman and Kendall correlation measures. Stat. Methods Appl. 19, 497–515 (2010)
DeConde, R.P., Hawley, S., Falcon, S., Clegg, N., Knudsen, B., Etzioni, R.: Combining results of microarray experiments: a rank aggregation approach. Stat. Appl. Genet. Mol. Biol. 5(1), 23 (2006)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Method. 39(1), 1–38 (1977)
Diaconis, P.: Group Representations in Probability and Statistics. Lecture Notes–Monograph Series, vol. 11. Institute of Mathematical Statistics, Hayward (1988)
Eliseussen, E., Fleischer, T., Vitelli, V.: Rank-based Bayesian variable selection for genome-wide transcriptomic analyses. Stat. Med. 41(23), 4532–4553 (2022)
Ellis, R.S.: Large deviations for a general class of random vectors. Ann. Prob. 12(1), 1–12 (1984)
Feigin, P., Cohen, A.: On a model for concordance between judges. J. R. Stat. Soc. Ser. B Method. 40(2), 203–213 (1978)
Fligner, M.A., Verducci, J.S.: Distance based ranking models. J. R. Stat. Soc. B 48(3), 359–369 (1986)
Gärtner, J.: On large deviations from the invariant measure. Theory Prob. Appl. 22(1), 24–39 (1977)
Goibert, M., Calauzènes, C., Irurozki, E., Clémençon, S.: Robust consensus in ranking data analysis: definitions, properties and computational issues. arXiv preprint arXiv:2303.12878 (2023)
Goibert, M., Clémençon, S., Irurozki, E., Mozharovskyi, P.: Statistical depth functions for ranking distributions: definitions, statistical learning and applications. arXiv preprint arXiv:2201.08105 (2022)
Gormley, I.C., Murphy, T.B.: Exploring voting blocs within the Irish electorate: a mixture modeling approach. J. Am. Stat. Assoc. 103(483), 1014–1027 (2008)
Gormley, I.C., Murphy, T.B.: A mixture of experts model for rank data with applications in election studies. Ann. Appl. Stat. 2(4), 1452–1477 (2008)
Henery, R.J.: Permutation probabilities as models for horse races. J. R. Stat. Soc. Ser. B Stat. Method. 43(1), 86–91 (1981)
Irurozki, E.: Sampling and learning distance-based probability models for permutation spaces. Ph.D. thesis, Department of Computer Science and Artificial Intelligence, University of the Basque Country (2014)
Irurozki, E., Calvo, B., Lozano, A.: PerMallows: an R package for Mallows and generalized Mallows models. J. Stat. Softw. 71(12), 1–30 (2016)
Irurozki, E., Calvo, B., Lozano, A.: Sampling and learning the Mallows and generalized Mallows models under the Cayley distance. Method. Comput. Appl. Prob. 20(1), 1–35 (2018)
Irurozki, E., Calvo, B., Lozano, J.A.: Mallows and generalized Mallows model for matchings. Bernoulli 25(2), 1160–1188 (2019)
Jacques, J., Grimonprez, Q., Biernacki, C.: Rankcluster: An R package for clustering multivariate partial rankings. R J. 6(1), 101–110 (2014)
Kendall, M.G.: Rank correlation methods, 4th edn. Griffin, London (1970)
Khan, J., Wei, J.S., Ringnér, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C., Meltzer, P.S.: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. 7(6), 673–679 (2001)
Lee, P.H., Yu, P.L.H.: Distance-based tree models for ranking data. Comput. Stat. Data Anal. 54(6), 1672–1682 (2010)
Lee, P.H., Yu, P.L.H.: Mixtures of weighted distance-based models for ranking data with applications in political studies. Comput. Stat. Data Anal. 56(8), 2486–2500 (2012)
Lee, P.H., Yu, P.L.H.: An R package for analyzing and modeling ranking data. BMC Med. Res. Method. 13(1), 1–11 (2013)
Liu, Q., Reiner, A., Frigessi, A., Scheel, I.: Diverse personalized recommendations with uncertainty from implicit preference data with the Bayesian Mallows model. Knowl. Based Syst. 186, 104960 (2019)
Mallows, C.L.: Non-null ranking models. I. Biometrika 44(1/2), 114–130 (1957)
Marden, J.I.: Analyzing and Modeling Rank Data. Monographs on Statistics and Applied Probability, vol. 64. Chapman & Hall, Cambridge, USA (1995)
McCullagh, P.: Models on spheres and models for permutations. In: Fligner, M.A., Verducci, J.S. (eds.) Probability Models and Statistical Analyses for Ranking Data, pp. 278–283. Springer, New York (1993)
Meilă, M., Bao, L.: An exponential model for infinite rankings. J. Mach. Learn. Res. 11, 3481–3518 (2010)
Møller, J., Pettitt, A.N., Reeves, R., Berthelsen, K.K.: An efficient Markov Chain Monte Carlo method for distributions with intractable normalising constants. Biometrika 93(2), 451–458 (2006)
Mollica, C., Tardella, L.: Epitope profiling via mixture modeling for ranked data. Stat. Med. 33(21), 3738–3758 (2014)
Mukherjee, S.: Estimation in exponential families on permutations. Ann. Stat. 44(2), 853–875 (2016)
Murphy, T.B., Martin, D.: Mixtures of distance-based models for ranking data. Comput. Stat. Data Anal. 41(3–4), 645–655 (2003)
Murray, I., Ghahramani, Z., MacKay, D.: MCMC for doubly-intractable distributions. arXiv preprint arXiv:1206.6848 (2012)
Qian, Z., Yu, P.L.H.: Weighted distance-based models for ranking data using the R package rankdist. J. Stat. Softw. 90(5), 1–31 (2019)
Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
Shi, C., Wei, B., Wei, S., Wang, W., Liu, H., Liu, J.: A quantitative discriminant method of elbow point for the optimal number of clusters in clustering algorithm. EURASIP J. Wireless Commun. Network. 2021(1), 31 (2021)
Sloane, N. J. A.: The Encyclopedia of Integer Sequences. (2017)
Sørensen, Ø., Crispino, M., Liu, Q., Vitelli, V.: BayesMallows: an R Package for the Bayesian Mallows Model. R J. 12(1), 324–342 (2020)
Tarsitano, A., et al.: Comparing the effectiveness of rank correlation statistics. Working paper, Dipartimento di Economia e Statistica, Università della Calabria (2009)
Varadhan, S.R.S.: Asymptotic probabilities and differential equations. Commun. Pure Appl. Math. 19(3), 261–286 (1966)
Vitelli, V., Sørensen, Ø., Crispino, M., Frigessi, A., Arjas, E.: Probabilistic preference learning with the Mallows rank model. J. Mach. Learn. Res. 18(158), 1–49 (2018)
Xu, H., Alvo, M., Yu, P.L.H.: Angle-based models for ranking data. Comput. Stat. Data Anal. 121, 113–136 (2018)
Zhao, Q., Hautamäki, V.: Knee point detection in BIC for detecting the number of clusters. In: Blanc-Talon, J., Bourennane, S., Philips, W., Popescu, D., Scheunders, P. (eds.) Advanced Concepts for Intelligent Vision Systems. ACIVS 2008. Lecture Notes in Computer Science, vol. 5259, pp. 664–673. Springer, Berlin, Heidelberg (2008)
Acknowledgements
The authors would like to thank an anonymous reviewer for her/his constructive and insightful comments, which greatly improved the manuscript. The authors also wish to thank Prof. Enrico Casadio Tarabusi for his valuable suggestions on the approximation of the Spearman distance distribution. This work was supported by grant no. RP12117A89FFE8A0, funded by Sapienza University of Rome.
The opinions expressed are those of the authors and do not necessarily reflect the views of the Bank of Italy or the Eurosystem.
Author information
Authors and Affiliations
Contributions
Crispino and Mollica are both first authors, as they contributed equally to the development of this work and to the writing of the main manuscript text. Astuti developed and described the new approximation in Section 3 and Supplementary Material A. Tardella mainly contributed to the implementation and computational optimization of the EM algorithm for statistical inference described in Section 2. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Crispino, M., Mollica, C., Astuti, V. et al. Efficient and accurate inference for mixtures of Mallows models with Spearman distance. Stat Comput 33, 98 (2023). https://doi.org/10.1007/s11222-023-10266-8