Nothing Special   »   [go: up one dir, main page]

Skip to main content
Log in

Efficient and accurate inference for mixtures of Mallows models with Spearman distance

  • Original Paper
  • Published:
Statistics and Computing Aims and scope Submit manuscript

Abstract

The Mallows model (MM) occupies a central role in parametric modelling of ranking data to learn preferences of a population of judges. Despite the wide range of metrics for rankings that can be considered in the model specification, the choice is typically limited to the Kendall, Cayley or Hamming distances, due to the closed-form expression of the related model normalizing constant. This work instead focuses on the Mallows model with Spearman distance (MMS). A novel approximation of the normalizing constant is introduced to allow inference even with a large number of items. This allows us to develop and implement an efficient and accurate EM algorithm for estimating finite mixtures of MMS aimed at i) enlarging the applicability to samples drawn from heterogeneous populations, and ii) dealing with partial rankings affected by diverse forms of censoring. These novelties encompass the critical inferential steps that traditionally limited the use of this distance in practice, and render the MMS comparable (or even preferable) to the MMs with other metrics in terms of computational burden. The inferential ability of the EM scheme and the effectiveness of the approximation are assessed by extensive simulation studies. Finally, we show that the application to three real-world datasets endorses our proposals also in the comparison with competing mixtures of ranking models.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6

Similar content being viewed by others

Explore related subjects

Discover the latest articles, news and stories from top researchers in related subjects.

Notes

  1. Precisely, they are the unnormalized versions of the corresponding correlation coefficients.

  2. This version is available at https://arxiv.org/abs/1405.7945v3. See in particular Section 5.1 and Appendix C.1.

  3. Note that the definition of \(Z(\theta )\) corresponds to the canonical partition function of statistical mechanics \(Z(T) = \sum _i \exp \{-E_i/k_\beta T\} = \sum _Eg(E)\exp \{-E/k_\beta T\}\), where \(k_\beta \) is the Boltzmann constant, T is the temperature of a system, \(E_i\) is the energy of the i-th state and g(E) is the density of the states.

  4. In the original notation, the author proved the existence and differentiability of the function \(Z\left( f, \theta \right) = \lim _{n\rightarrow \infty } \frac{1}{n}\log \text {E}\left[ \exp (n\, \theta \, \nu _{\pi }\left[ f\right] )\right] \). By setting \(f(x,y)=(x - y)^2\) and \(\theta =k\), we obtain the desired result.

  5. Note that this criterion is useful only for automatizing the simulations: when choosing the number of clusters in applications the researcher typically uses the BIC, as well as other relevant measures, as a guidance, and decides the number of groups based also on other heuristic criteria (such as the size of the clusters, their separation, or prior information depending on the application at hand).

References

  • Ali, A., Meilă, M.: Experiments with Kemeny ranking: What works when? Math. Soc. Sci. 64(1), 28–40 (2012)

    MathSciNet  MATH  Google Scholar 

  • Alvo, M., Yu, P.L.H.: Statistical methods for ranking data. Frontiers in Probability and the Statistical Sciences. Springer, New York, USA (2014)

    MATH  Google Scholar 

  • Andrieu, C., Roberts, G.O.: The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Stat. 37(2), 697–725 (2009)

    MathSciNet  MATH  Google Scholar 

  • Bartholdi, J.J., Tovey, C.A., Trick, M.A.: The computational difficulty of manipulating an election. Soc. Choice Welf. 6(3), 227–241 (1989)

    MathSciNet  MATH  Google Scholar 

  • Beckett, L.A.: Maximum likelihood estimation in mallows’s model using partially ranked data. In: Fligner, M.A., Verducci, J.S. (eds.) Probability Models and Statistical Analyses for Ranking Data, pp. 92–107. Springer, New York (1993)

    Google Scholar 

  • Busse, L.M., Orbanz, P., Buhmann, J.M.: Cluster analysis of heterogeneous rank data. In: Proceedings of the 24th International Conference on Machine Learning, pp. 113–120. (2007)

  • Caron, F., Teh, Y.W., Murphy, T.B.: Bayesian nonparametric Plackett–Luce models for the analysis of preferences for college degree programmes. Ann. Appl. Stat. 8(2), 1145–1181 (2014)

    MathSciNet  MATH  Google Scholar 

  • Chen, Q.R., Vansant, G., Oades, K., Pickering, M., Wei, J.S., Song, Y.K., Monforte, J., Khan, J.: Diagnosis of the small round blue cell tumors using multiplex polymerase chain reaction. J. Mol. Diagn. 9(1), 80–88 (2007)

    Google Scholar 

  • Cramér, H.: Sur un nouveau théoreme-limite de la théorie des probabilités. Actualités scientifiques et industrielles 736, 2–23 (1938)

    MATH  Google Scholar 

  • Crispino, M., Modugno, L., Mollica, C.: The Mallows model with respondents’ covariates. Preliminary draft available upon request. (2023)

  • Croux, C., Dehon, C.: Influence functions of the Spearman and Kendall correlation measures. Stat. Methods Appl. 19, 497–515 (2010)

    MathSciNet  MATH  Google Scholar 

  • DeConde, R.P., Hawley, S., Falcon, S., Clegg, N., Knudsen, B., Etzioni, R.: Combining results of microarray experiments: a rank aggregation approach. Stat. Appl. Genet. Mol. Biol. 5(1), 23 (2006)

    MathSciNet  MATH  Google Scholar 

  • Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Method. 39(1), 1–38 (1977)

    MathSciNet  MATH  Google Scholar 

  • Diaconis, P.: Group representations in probability and statistics, Volume 11 of Lecture Notes–Monograph Series. Hayward, USA: Institute of Mathematical Statistics. (1988)

  • Eliseussen, E., Fleischer, T., Vitelli, V.: Rank-based Bayesian variable selection for genome-wide transcriptomic analyses. Stat. Med. 41(23), 4532–4553 (2022)

    MathSciNet  Google Scholar 

  • Ellis, R.S.: Large deviations for a general class of random vectors. Ann. Prob. 12(1), 1–12 (1984)

    MathSciNet  MATH  Google Scholar 

  • Feigin, P., Cohen, A.: On a model for concordance between judges. J. R. Stat. Soc. Ser. B Method. 40(2), 203–213 (1978)

    MATH  Google Scholar 

  • Fligner, M.A., Verducci, J.S.: Distance based ranking models. J. R. Stat. Soc. B 48(3), 359–369 (1986)

    MathSciNet  MATH  Google Scholar 

  • Gärtner, J.: On large deviations from the invariant measure. Theory Prob. Appl. 22(1), 24–39 (1977)

    MathSciNet  MATH  Google Scholar 

  • Goibert, M., Calauzènes, C., Irurozki, E., Clémençon, S.: (2023). Robust consensus in ranking data analysis: Definitions, properties and computational issues. arXiv preprint arXiv:2303.12878

  • Goibert, M., Clémençon, S., Irurozki, E., Mozharovskyi, P.: (2022). Statistical depth functions for ranking distributions: definitions, statistical learning and applications. arXiv preprint arXiv:2201.08105

  • Gormley, I.C., Murphy, T.B.: Exploring voting blocs within the Irish electorate: a mixture modeling approach. J. Am. Stat. Assoc. 103(483), 1014–1027 (2008)

    MathSciNet  MATH  Google Scholar 

  • Gormley, I.C., Murphy, T.B.: A mixture of experts model for rank data with applications in election studies. Ann. Appl. Stat. 2(4), 1452–1477 (2008)

    MathSciNet  MATH  Google Scholar 

  • Henery, R.J.: Permutation probabilities as models for horse races. J. R. Stat. Soc. Ser. B Stat. Method. 43(1), 86–91 (1981)

    MathSciNet  MATH  Google Scholar 

  • Irurozki, E.: Sampling and learning distance-based probability models for permutation spaces. Ph. D. thesis, Department of Computer Science and Artificial Intelligence of the University of the Basque Country. (2014)

  • Irurozki, E., Calvo, B., Lozano, A.: PerMallows: an R package for Mallows and generalized Mallows models. J. Stat. Softw. 71(12), 1–30 (2016)

    MATH  Google Scholar 

  • Irurozki, E., Calvo, B., Lozano, A.: Sampling and learning the Mallows and generalized Mallows models under the Cayley distance. Method. Comput. Appl. Prob. 20(1), 1–35 (2018)

    MathSciNet  MATH  Google Scholar 

  • Irurozki, E., Calvo, B., Lozano, J.A.: Mallows and generalized Mallows model for matchings. Bernoulli 25(2), 1160–1188 (2019)

    MathSciNet  MATH  Google Scholar 

  • Jacques, J., Grimonprez, Q., Biernacki, C.: Rankcluster: An R package for clustering multivariate partial rankings. R J. 6(1), 101–110 (2014)

    Google Scholar 

  • Kendall, M.G.: Rank correlation methods, 4th edn. Griffin, London (1970)

    MATH  Google Scholar 

  • Khan, J., Wei, J.S., Ringnér, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C., Meltzer, P.S.: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. 7(6), 673–679 (2001)

    Google Scholar 

  • Lee, P.H., Yu, P.L.H.: Distance-based tree models for ranking data. Comput. Stat. Data Anal. 54(6), 1672–1682 (2010)

    MathSciNet  MATH  Google Scholar 

  • Lee, P.H., Yu, P.L.H.: Mixtures of weighted distance-based models for ranking data with applications in political studies. Comput. Stat. Data Anal. 56(8), 2486–2500 (2012)

    MathSciNet  MATH  Google Scholar 

  • Lee, P.H., Yu, P.L.H.: An R package for analyzing and modeling ranking data. BMC Med. Res. Method. 3(1), 1–11 (2013)

    Google Scholar 

  • Liu, Q., Reiner, A., Frigessi, A., Scheel, I.: Diverse personalized recommendations with uncertainty from implicit preference data with the Bayesian Mallows model. Knowl. Based Syst. 186, 104960 (2019)

    Google Scholar 

  • Mallows, C.L.: Non-null ranking models. I. Biometrika 44(1/2), 114–130 (1957)

    MathSciNet  MATH  Google Scholar 

  • Marden, J. I.: Analyzing and Modeling Rank Data, Volume 64 of Monographs on Statistics and Applied Probability. Cambridge, USA: Chapman & Hall. (1995)

  • McCullagh, P.: Models on spheres and models for permutations. In: Fligner, M.A., Verducci, J.S. (eds.) Probability Models and Statistical Analyses for Ranking Data, pp. 278–283. Springer, New York (1993)

    Google Scholar 

  • Meilǎ, M., Bao, L.: An exponential model for infinite rankings. J. Mach. Learn. Res. 11, 3481–3518 (2010)

    MathSciNet  MATH  Google Scholar 

  • Møller, J., Pettitt, A.N., Reeves, R., Berthelsen, K.K.: An efficient Markov Chain Monte Carlo method for distributions with intractable normalising constants. Biometrika 93(2), 451–458 (2006)

    MathSciNet  MATH  Google Scholar 

  • Mollica, C., Tardella, L.: Epitope profiling via mixture modeling for ranked data. Stat. Med. 33(21), 3738–3758 (2014)

    MathSciNet  Google Scholar 

  • Mukherjee, S.: Estimation in exponential families on permutations. Ann. Stat. 44(2), 853–875 (2016)

    MathSciNet  MATH  Google Scholar 

  • Murphy, T.B., Martin, D.: Mixtures of distance-based models for ranking data. Comput. Stat. Data Anal. 41(3–4), 645–655 (2003)

    MathSciNet  MATH  Google Scholar 

  • Murray, I., Ghahramani, Z., MacKay, D.: MCMC for doubly-intractable distributions. (2012) arXiv preprint arXiv:1206.6848

  • Qian, Z., Yu, P.L.H.: Weighted distance-based models for ranking data using the R package rankdist. J. Stat. Softw. 90(5), 1–31 (2019)

    Google Scholar 

  • Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)

    MathSciNet  MATH  Google Scholar 

  • Shi, C., Wei, B., Wei, S., Wang, W., Liu, H., Liu, J.: A quantitative discriminant method of elbow point for the optimal number of clusters in clustering algorithm. EURASIP J. Wireless Commun. Network. 2021(1), 31 (2021)

    Google Scholar 

  • Sloane, N. J. A.: The Encyclopedia of Integer Sequences. (2017)

  • Sørensen, Ø., Crispino, M., Liu, Q., Vitelli, V.: BayesMallows: an R Package for the Bayesian Mallows Model. R J. 12(1), 324–342 (2020)

    Google Scholar 

  • Tarsitano, A. et al. Comparing the effectiveness of rank correlation statistics. P: Dip. di Economia e Statistica, University of della Calabria. (2009)

  • Varadhan, S.S.: Asymptotic probabilities and differential equations. Commun. Pure Appl. Math. 19(3), 261–286 (1966)

    MathSciNet  MATH  Google Scholar 

  • Vitelli, V., Sørensen, Ø., Crispino, M., Frigessi, A., Arjas, E.: Probabilistic preference learning with the Mallows rank model. J. Mach. Learn. Res. 18(158), 1–49 (2018)

    MathSciNet  MATH  Google Scholar 

  • Xu, H., Alvo, M., Yu, P.L.H.: Angle-based models for ranking data. Comput. Stati. Data Anal. 121, 113–136 (2018)

    MathSciNet  MATH  Google Scholar 

  • Zhao, Q., Hautamäki, V.: Knee Point Detection in BIC for Detecting the Number of Clusters. In Blanc-Talon, J., Bourennane, S., Philips, W., Popescu, D., Scheunders, P. (eds) Advanced Concepts for Intelligent Vision Systems. ACIVS 2008. Lecture Notes in Computer Science, vol 5259. Springer, Berlin, Heidelberg, pp. 664–673. (2008)

Download references

Acknowledgements

The authors would like to thank an anonymous reviewer for her/his constructive and insightful comments that greatly improved the manuscript. The authors wish also to thank prof. Enrico Casadio Tarabusi for his precious suggestions on the approximation of the Spearman distance distribution. This work was supported by the grant N. RP12117A89FFE8A0 funded by Sapienza University of Rome.

The opinions expressed are those of the authors and do not necessarily reflect the views of the Bank of Italy or the Eurosystem.

Author information

Authors and Affiliations

Authors

Contributions

Crispino and Mollica are both first authors as they contributed equally to the development of this work as well as to write the main manuscript text. Astuti developed and described the new approximation in Section 3 and Supplementary Material A. Tardella mainly contributed to implement and computationally optimize the EM algorithm for statistical inference described in Section 2. All authors reviewed the manuscript.

Corresponding author

Correspondence to Cristina Mollica.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 255 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Crispino, M., Mollica, C., Astuti, V. et al. Efficient and accurate inference for mixtures of Mallows models with Spearman distance. Stat Comput 33, 98 (2023). https://doi.org/10.1007/s11222-023-10266-8

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11222-023-10266-8

Keywords

Navigation