Re-evaluating Word Mover’s Distance

Ryoma Sato, Makoto Yamada, Hisashi Kashima
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:19231-19249, 2022.

Abstract

The word mover’s distance (WMD) is a fundamental technique for measuring the similarity of two documents. The crux of WMD is that it can exploit the underlying geometry of the word-embedding space through an optimal transport formulation. The original study on WMD reported that WMD outperforms classical baselines such as bag-of-words (BOW) and TF-IDF by significant margins on various datasets. In this paper, we point out that the evaluation in the original study could be misleading. We re-evaluate the performance of WMD and the classical baselines and find that the classical baselines are competitive with WMD if we apply appropriate preprocessing, namely L1 normalization. In addition, we introduce an analogy between WMD and L1-normalized BOW and find that, in high-dimensional embedding spaces, not only the performance of WMD but also its distance values resemble those of BOW.
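To make the comparison in the abstract concrete, here is a minimal sketch that computes WMD between two short documents via optimal transport and contrasts it with the L1 distance between L1-normalized BOW (nBOW) histograms. It assumes the POT library (`pip install pot numpy`); the random word vectors and the whitespace tokenizer are stand-ins for pretrained embeddings and real preprocessing, not the authors' exact setup.

```python
# Sketch (not the authors' code): WMD as an optimal transport cost
# over word embeddings, versus the L1 distance between L1-normalized
# bag-of-words histograms. Embeddings are random stand-ins for
# pretrained vectors such as word2vec.
import numpy as np
import ot  # POT: Python Optimal Transport

docs = ["obama speaks to the media in illinois",
        "the president greets the press in chicago"]
vocab = sorted(set(" ".join(docs).split()))
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=300) for w in vocab}  # stand-in embeddings

def nbow(doc):
    """L1-normalized bag-of-words histogram over the shared vocabulary."""
    counts = np.array([doc.split().count(w) for w in vocab], dtype=float)
    return counts / counts.sum()

a, b = nbow(docs[0]), nbow(docs[1])

# Ground cost: Euclidean distances between word embeddings.
M = np.array([[np.linalg.norm(emb[u] - emb[v]) for v in vocab]
              for u in vocab])

wmd = ot.emd2(a, b, M)         # optimal transport cost = WMD
l1_bow = np.abs(a - b).sum()   # L1 distance between nBOW histograms

print(f"WMD: {wmd:.4f}  L1 nBOW distance: {l1_bow:.4f}")
```

Both quantities start from the same nBOW histograms; WMD replaces the elementwise comparison with a transport cost over the embedding geometry, which is the analogy the paper examines.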

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-sato22b,
  title     = {Re-evaluating Word Mover’s Distance},
  author    = {Sato, Ryoma and Yamada, Makoto and Kashima, Hisashi},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {19231--19249},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/sato22b/sato22b.pdf},
  url       = {https://proceedings.mlr.press/v162/sato22b.html}
}
Endnote
%0 Conference Paper
%T Re-evaluating Word Mover’s Distance
%A Ryoma Sato
%A Makoto Yamada
%A Hisashi Kashima
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-sato22b
%I PMLR
%P 19231--19249
%U https://proceedings.mlr.press/v162/sato22b.html
%V 162
APA
Sato, R., Yamada, M. & Kashima, H. (2022). Re-evaluating Word Mover’s Distance. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:19231-19249. Available from https://proceedings.mlr.press/v162/sato22b.html.