Abstract
Automatic methods for wordnet development in languages other than English generally exploit information found in Princeton WordNet (PWN) and translations extracted from parallel corpora. A common approach consists in preserving the structure of PWN and transferring its content into new languages using alignments, possibly combined with information extracted from multilingual semantic resources. Even if the role of PWN remains central in this process, these automatic methods offer an alternative to the manual elaboration of new wordnets. However, their limited coverage has a strong impact on that of the resulting resources. Following this line of research, we apply a cross-lingual word sense disambiguation method to wordnet development. Our approach exploits the output of a data-driven sense induction method that generates sense clusters in new languages, similar to wordnet synsets, by identifying word senses and relations in parallel corpora. We apply our cross-lingual word sense disambiguation method to the task of enriching a French wordnet resource, the WOLF, and show how it can be efficiently used for increasing its coverage. Although our experiments involve the English–French language pair, the proposed methodology is general enough to be applied to the development of wordnet resources in other languages for which parallel corpora are available. Finally, we show how the disambiguation output can serve to reduce the granularity of new wordnets and the degree of polysemy present in PWN.
Notes
In these projects, the expand model was occasionally combined with the merge model, which is based on monolingual resources and makes it possible to include language-specific properties in the wordnets of different languages.
The BabelNet resource is available here: http://babelnet.org.
Compared to the initial version of WOLF (0.1.4), version 0.1.6 has extended coverage of adverbs as a result of the work by Sagot et al. (2009).
Sentence pairs with a great difference in length, where one sentence is more than three times longer than the corresponding sentence in the other language.
The weights of the features are omitted for the sake of readability.
The table does not include information on all the neighboring PWN synsets, which was used during WSD. This information can, however, be easily recovered from PWN.
All accuracy scores reported for our system in Table 5 have been computed with respect to the judgments of the two annotators. More precisely, we first computed an accuracy score separately for each annotator and then retained the average of the two scores.
This information can also be highly useful for the evaluation of WSD systems, as it would make it possible to penalize WSD errors involving close senses differently from those involving distant senses (Resnik and Yarowsky 1999).
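The averaging procedure described in the note on accuracy scores can be illustrated with a minimal sketch. It assumes that each annotator gives a binary correct/incorrect judgment for every system output; the function names and the toy judgments below are hypothetical and are not taken from the article.

```python
from typing import Sequence


def accuracy(judgments: Sequence[bool]) -> float:
    """Proportion of system outputs that one annotator judged correct."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    return sum(judgments) / len(judgments)


def mean_annotator_accuracy(
    judgments_a: Sequence[bool], judgments_b: Sequence[bool]
) -> float:
    """Compute one accuracy score per annotator, then average the two."""
    if len(judgments_a) != len(judgments_b):
        raise ValueError("both annotators must judge the same outputs")
    return (accuracy(judgments_a) + accuracy(judgments_b)) / 2


# Toy judgments (hypothetical): True means the annotator accepted the output.
annotator_a = [True, True, False, True]   # accuracy 0.75
annotator_b = [True, False, False, True]  # accuracy 0.50
score = mean_annotator_accuracy(annotator_a, annotator_b)
```

Averaging per-annotator scores keeps the cases where the annotators disagree in the evaluation, rather than restricting it to the items on which both agree.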
References
Apidianaki, M. (2008). Translation-oriented word sense induction based on parallel corpora. In Language resources and evaluation conference (LREC), (pp. 3269–3275). Marrakech, Morocco.
Apidianaki, M. (2009). Data-driven semantic analysis for multilingual WSD and lexical selection in translation. In Proceedings of the 12th conference of the European Chapter of the Association for Computational Linguistics (EACL), (pp. 77–85). Athens: Association for Computational Linguistics.
Apidianaki, M. & He, Y. (2010). An algorithm for cross-lingual sense clustering tested in a MT evaluation setting. In Proceedings of the 7th international workshop on spoken language translation (IWSLT), (pp. 219–226). Paris, France.
Bannard, C. & Callison-Burch, C. (2005). Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd annual meeting of the Association for Computational Linguistics (ACL’05), (pp. 597–604). Ann Arbor, MI: Association for Computational Linguistics.
Carpuat, M. & Wu, D. (2007). Improving statistical machine translation using word sense disambiguation. In Proceedings of the joint EMNLP-CoNLL conference, (pp. 61–72). Prague, Czech Republic.
Chan, Y. S., Ng, H. T. & Chiang, D. (2007). Word sense disambiguation improves statistical machine translation. In Proceedings of the 45th annual meeting of the association for computational linguistics (ACL-07), (pp. 33–40). Prague: Association for Computational Linguistics.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Cruse, D. (1986). Lexical semantics. Cambridge: Cambridge University Press.
Diab, M. & Resnik, P. (2002). An unsupervised method for word sense tagging using parallel corpora. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (ACL’02), (pp. 255–262). Philadelphia: Association for Computational Linguistics.
Dolan, W. B. (1994). Word sense ambiguation: Clustering related senses. In Proceedings of the 15th conference on Computational linguistics—Volume 2, COLING’94, (pp. 712–716). Stroudsburg, PA: Association for Computational Linguistics.
Dyvik, H. (1998). Translations as semantic mirrors: From parallel corpus to wordnet. In Proceedings of the workshop multilinguality in the lexicon II at the 13th biennial European conference on artificial intelligence (ECAI’98), (pp. 24–44). Brighton, UK.
Dyvik, H. (2005). Translations as a semantic knowledge source. In Proceedings of the second Baltic conference on human language technologies. Tallinn, Estonia.
Edmonds, P., & Kilgarriff, A. (2002). Introduction to the special issue on evaluating word sense disambiguation systems. Natural Language Engineering, 8(4), 279–291.
Erk, K. & McCarthy, D. (2009). Graded word sense assignment. In Proceedings of the 2009 conference on empirical methods in natural language processing, (pp. 440–449). Singapore: Association for Computational Linguistics.
Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.
Grefenstette, G. (1994). Explorations in automatic thesaurus discovery. Dordrecht: Kluwer Academic Publishers.
Ide, N., Erjavec, T. & Tufiş, D. (2002). Sense discrimination with parallel corpora. In Proceedings of ACL’02 workshop on word sense disambiguation: Recent successes and future directions, (pp. 54–60). Philadelphia: Association for Computational Linguistics.
Ide, N. & Wilks, Y. (2006). Making sense about sense. In Word sense disambiguation: Algorithms and applications, text, speech and language technology, vol. 33, (pp. 47–74). Dordrecht: Springer.
Jurgens, D. & Klapaftis, I. (2013). SemEval-2013 task 13: Word sense induction for graded and non-graded senses. In Second joint conference on lexical and computational semantics (*SEM), volume 2: Proceedings of the seventh international workshop on semantic evaluation (SemEval 2013), (pp. 290–299). Atlanta, GA: Association for Computational Linguistics.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of MT Summit X, (pp. 79–86). Phuket, Thailand.
Lefever, E., & Hoste, V. (2010). SemEval-2010 task 3: Cross-lingual word sense disambiguation. In Proceedings of the 5th international workshop on semantic evaluation (pp. 15–20). Uppsala: Association for Computational Linguistics.
Manandhar, S., Klapaftis, I., Dligach, D. & Pradhan, S. (2010). SemEval-2010 task 14: Word sense induction and disambiguation. In Proceedings of the 5th international workshop on semantic evaluation, (pp. 63–68). Uppsala: Association for Computational Linguistics.
Mihalcea, R. & Moldovan, D. I. (2002). Automatic generation of a coarse grained WordNet. In 14th Flairs Conference, (pp. 454–458).
Mouton, C. & de Chalendar, G. (2010). JAWS: Just another WordNet subset. In Proceedings of the 10th TALN conference. Montreal, Canada.
Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys, 41(2), 1–69.
Navigli, R. & Ponzetto, S. P. (2010). BabelNet: Building a very large multilingual semantic network. In Proceedings of the 48th annual meeting of the Association for Computational Linguistics, (pp. 216–225). Uppsala: Association for Computational Linguistics.
Ng, H. T. & Chan, Y. S. (2007). SemEval-2007 task 11: English lexical sample task via English–Chinese parallel text. In Proceedings of the 4th international workshop on semantic evaluations (SemEval-2007), (pp. 54–58). Prague: Association for Computational Linguistics.
Ng, H. T., Wang, B. & Chan, Y. S. (2003). Exploiting parallel texts for word sense disambiguation: An empirical study. In Proceedings of the 41st annual meeting of the Association for Computational Linguistics, (pp. 455–462). Sapporo: Association for Computational Linguistics.
Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.
Peters, W., Peters, I. & Vossen, P. (1998). Automatic sense clustering in EuroWordNet. In Proceedings of the first international conference on language resources and evaluation (LREC), vol. 1. Granada, Spain.
Pianta, E., Bentivogli, L. & Girardi, C. (2002). MultiWordNet: Developing an aligned multilingual database. In First international conference on global WordNet, (pp. 293–302). Mysore, India.
Resnik, P., & Yarowsky, D. (1999). Distinguishing systems and distinguishing senses: New evaluation methods for word sense disambiguation. Natural Language Engineering, 5(2), 113–133.
Sagot, B. & Fišer, D. (2008). Building a free French wordnet from multilingual resources. In Ontolex 2008. Marrakech, Morocco.
Sagot, B., Fort, K. & Venant, F. (2009). Extending the adverbial coverage of a French wordnet. In Proceedings of the NODALIDA 2009 workshop on WordNets and other lexical semantic resources. Odense, Denmark.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the international conference on new methods in language processing, (pp. 44–49). Manchester, UK.
Specia, L., Stevenson, M., Nunes, M. D. G., Castelo, G. & Ribeiro, B. (2006). Multilingual versus monolingual WSD. In Proceedings of the EACL workshop making sense of sense: Bringing psycholinguistics and computational linguistics together, April 3–7, (pp. 33–40). Trento: Association for Computational Linguistics.
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D. & Varga, D. (2006). The JRC acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of LREC. Genoa, Italy.
Tufiş, D., Cristea, D. & Stamou, S. (2004). BalkaNet: Aims, methods, results and perspectives. A general overview. In Romanian Journal on Information Science and Technology. Special Issue on BalkaNet, vol. 7, (pp. 9–34).
van der Plas, L. & Tiedemann, J. (2006). Finding synonyms using automatic word alignment and measures of distributional similarity. In Proceedings of the COLING/ACL 2006 main conference poster sessions, (pp. 866–873). Sydney: Association for Computational Linguistics.
Vossen, P. (Ed.). (1998). EuroWordNet: A multilingual database with lexical semantic networks for European languages. Dordrecht: Kluwer.
Cite this article
Apidianaki, M., Sagot, B. Data-driven synset induction and disambiguation for wordnet development. Lang Resources & Evaluation 48, 655–677 (2014). https://doi.org/10.1007/s10579-014-9291-2
DOI: https://doi.org/10.1007/s10579-014-9291-2