Unsupervised compositionality prediction of nominal compounds

Published: 01 March 2019

Abstract

Nominal compounds such as red wine and nut case display a continuum of compositionality, with varying contributions from the components of the compound to its semantics. This article proposes a framework for compound compositionality prediction using distributional semantic models, evaluating to what extent they capture idiomaticity compared to human judgments. For evaluation, we introduce data sets containing human judgments in three languages: English, French, and Portuguese. The results obtained reveal a high agreement between model predictions and human judgments, suggesting that the models are able to incorporate information about idiomaticity. We also present an in-depth evaluation of various factors that can affect prediction, such as model and corpus parameters and compositionality operations. General crosslingual analyses reveal the impact of morphological variation and corpus size on the ability of the model to predict compositionality, and the benefit of a uniform combination of the components for best results.
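
As a rough illustration of the kind of prediction the abstract describes, the sketch below scores a compound by the cosine similarity between its own distributional vector and a uniform combination (sum of length-normalized vectors) of its two components. This is one common formulation, not necessarily the exact procedure of the article. The function names and the four-dimensional toy vectors are invented for the example; real vectors would come from a distributional semantic model trained on a corpus.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def uniform_composition(first, second):
    """Sum the two component vectors after normalizing each one,
    so that both components contribute equally."""
    return [a + b for a, b in zip(normalize(first), normalize(second))]


def compositionality_score(compound_vec, first_vec, second_vec):
    """Higher values suggest a more compositional compound."""
    return cosine(compound_vec, uniform_composition(first_vec, second_vec))


# Toy vectors, made up for illustration only (hypothetical data).
vectors = {
    "red": [1.0, 0.0, 0.0, 0.0],
    "wine": [0.0, 1.0, 0.0, 0.0],
    "red_wine": [0.6, 0.8, 0.0, 0.0],   # close to its components
    "nut": [0.0, 0.0, 1.0, 0.0],
    "case": [0.0, 0.0, 0.0, 1.0],
    "nut_case": [0.7, 0.7, 0.1, 0.1],   # far from its components
}

scores = {
    "red_wine": compositionality_score(
        vectors["red_wine"], vectors["red"], vectors["wine"]
    ),
    "nut_case": compositionality_score(
        vectors["nut_case"], vectors["nut"], vectors["case"]
    ),
}

for compound, score in scores.items():
    print(f"{compound}: {score:.3f}")
```

With these toy vectors the compositional compound receives a much higher score than the idiomatic one. In an evaluation of the kind the abstract mentions, such scores would be compared with human compositionality judgments, for instance through a rank correlation.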

Cited By

  • (2024) Training and evaluation of vector models for Galician. Language Resources and Evaluation 58:4 (1419-1462). DOI: 10.1007/s10579-024-09740-0. Online publication date: 1-Dec-2024
  • (2024) Assessing linguistic generalisation in language models: a dataset for Brazilian Portuguese. Language Resources and Evaluation 58:1 (175-201). DOI: 10.1007/s10579-023-09664-1. Online publication date: 1-Mar-2024
  • (2022) “If you can’t beat them, join them”: A Word Transformation based Generalized Skip-gram for Embedding Compound Words. Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation (34-42). DOI: 10.1145/3574318.3574346. Online publication date: 9-Dec-2022
  • (2019) Weighted Compositional Vectors for Translating Collocations Using Monolingual Corpora. Computational and Corpus-Based Phraseology (113-128). DOI: 10.1007/978-3-030-30135-4_9. Online publication date: 25-Sep-2019

Published In

Computational Linguistics, Volume 45, Issue 1
March 2019
195 pages
ISSN: 0891-2017
EISSN: 1530-9312

Publisher

MIT Press

Cambridge, MA, United States
