Non-English lemmas containing capital letters cannot be looked up using wordnet.lemmas() or wordnet.synsets() #1641

Closed
ExplodingCabbage opened this issue Feb 28, 2017 · 0 comments · Fixed by #1690

This is an existing bug that I stumbled across while using the German WordNet from the EOMW via the custom-WordNet loading code in my unmerged PR at #1621. (It's dramatically more serious for German, since all German nouns are capitalised and so a huge fraction of the language's lemmas cannot be looked up, but it also affects the existing OMW WordNets that NLTK supports out of the box.)

Consider the synset representing London, England. While the synset name is in lowercase, its lemmas are capitalised in both the English WordNet...

>>> from nltk.corpus import wordnet as wn
>>> london_synset = wn.synset('london.n.01')
>>> london_synset.definition()
'the capital and largest city of England; located on the Thames in southeastern England; financial and industrial and cultural center'
>>> london_synset.lemmas()
[Lemma('london.n.01.London'), Lemma('london.n.01.Greater_London'), Lemma('london.n.01.British_capital'), Lemma('london.n.01.capital_of_the_United_Kingdom')]

... and also in the French WordNet:

>>> london_synset.lemmas(lang='fra')
[Lemma('london.n.01.Grand_Londres'), Lemma('london.n.01.Hellgate:_London'), Lemma('london.n.01.London'), Lemma('london.n.01.Londres')]

But with the English WordNet, I can look up the synset (or an individual Lemma) by lemma name, passing in 'London' in whatever capitalisation I like:

>>> wn.synsets('London')
[Synset('london.n.01'), Synset('london.n.02')]
>>> wn.synsets('london')
[Synset('london.n.01'), Synset('london.n.02')]
>>> wn.synsets('lOnDoN')
[Synset('london.n.01'), Synset('london.n.02')]
>>> wn.lemmas('london')
[Lemma('london.n.01.London'), Lemma('london.n.02.London')]
>>> wn.lemmas('London')
[Lemma('london.n.01.London'), Lemma('london.n.02.London')]
>>> wn.lemmas('LoNdoN')
[Lemma('london.n.01.London'), Lemma('london.n.02.London')]

In non-English WordNets, on the other hand, it is impossible to look up this synset by lemma. The first line of wn.synsets() coerces the lemma passed in to lowercase, and that lowercased lemma is then used as a key into a lemma-to-synset dictionary in which Londres is stored capitalised, so the key is never found:

>>> wn.synsets('Londres', lang='fra')
[]
>>> wn.synsets('londres', lang='fra')
[]
>>> wn.lemmas('londres', lang='fra')
[]
>>> wn.lemmas('Londres', lang='fra')
[]

(Contrast this with lemmas that are lowercased in the French WordNet's tab file; they can be looked up regardless of how the lemma passed to synsets() is capitalised:

>>> wn.synsets('calin', lang='fra')
[Synset('cuddlesome.s.01')]
>>> wn.synsets('cAlIn', lang='fra')
[Synset('cuddlesome.s.01')]

)
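The mechanism can be reproduced without NLTK. The following is only a toy model: `lemma_index` and `buggy_synsets` are hypothetical names invented for illustration and are not NLTK's actual data structures or code. It assumes, as described above, that the index keys keep whatever casing the data file uses, while only the query is lowercased.

```python
# Toy stand-in for a per-language lemma-to-synset index.
# Keys keep the casing found in the data file (hypothetical structure,
# not NLTK's real internals).
lemma_index = {
    "Londres": ["london.n.01"],    # capitalised in the data
    "calin": ["cuddlesome.s.01"],  # lowercase in the data
}


def buggy_synsets(lemma):
    """Lowercase the query only, then do an exact-match dictionary lookup."""
    return lemma_index.get(lemma.lower(), [])


print(buggy_synsets("Londres"))  # no match: 'londres' is not a key
print(buggy_synsets("londres"))  # no match either
print(buggy_synsets("cAlIn"))    # matches, because the stored key is lowercase
```

Under that assumption, no capitalisation of the query can ever reach a key that contains an uppercase letter, which matches the behaviour shown in the sessions above.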

To match the English behaviour, synsets() for non-English WordNets should be adjusted so that the lookup is properly case-insensitive. This was probably the intent of coercing the given lemma to lowercase before doing the lookup, but it fails whenever the lemma to be looked up is spelt with a capital letter in the actual WordNet data.
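One possible way to get a properly case-insensitive lookup is to normalise both sides: lowercase the keys when the index is built, and lowercase the query when it is looked up. The sketch below is a minimal illustration under that assumption, using made-up helper names (`build_case_insensitive_index`, `lookup`) and toy data; it is not a description of how NLTK is, or should be, implemented.

```python
from collections import defaultdict


def build_case_insensitive_index(entries):
    """Index (lemma, synset_name) pairs under the lowercased lemma.

    The original spelling is stored alongside the synset name so it can
    still be shown to the user (e.g. 'Londres' rather than 'londres').
    """
    index = defaultdict(list)
    for lemma, synset_name in entries:
        index[lemma.lower()].append((lemma, synset_name))
    return index


def lookup(index, lemma):
    """Return synset names for `lemma`, ignoring case on both sides."""
    return [synset_name for _, synset_name in index.get(lemma.lower(), [])]


# Toy data standing in for a few rows of a non-English tab file.
entries = [
    ("Londres", "london.n.01"),
    ("calin", "cuddlesome.s.01"),
]
index = build_case_insensitive_index(entries)

print(lookup(index, "Londres"))
print(lookup(index, "lOnDrEs"))
print(lookup(index, "cAlIn"))
```

Keeping the original spelling in the stored value means the index can be case-insensitive without losing the capitalisation that the data file records.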
