TreebankWordDetokenizer inverting order of tokens #3260

Open
jmccrae opened this issue May 27, 2024 · 1 comment

@jmccrae
jmccrae commented May 27, 2024

Steps to reproduce

from nltk.tokenize.treebank import TreebankWordDetokenizer
detokenizer = TreebankWordDetokenizer()
s = ['``', 'Shippers', 'are', 'saying', '`', 'the', 'party', "'s", 'over', ',', "'", "''", 'said', 'Mr.', 'LaLonde', '.']
detokenizer.detokenize(s)
# '"Shippers are saying ` the party\'s over,"\' said Mr. LaLonde.'

This should produce

'"Shippers are saying ` the party\'s over,\'" said Mr. LaLonde.'

Note that \' and " are inverted in the actual output.

The fix seems to be to change

(re.compile(r"(\S)\s(\'\')"), r"\1\2"),

to

(re.compile(r"([^'\s])\s(\'\')"), r"\1\2"),
@ekaf
Contributor
ekaf commented Jun 2, 2024

Thanks @jmccrae, your solution avoids inverting the single and the double quote. It may not be a complete fix though, because it does not suppress the extraneous space between the quotes, which was the purpose of the edited substitution in L291.

However, suppressing that space is the cause of the inversion, because a double quote is represented as two single quotes, and once you have three single quotes in a row, there is no way of knowing where the double quote begins or ends. This explains why the substitution in L296 outputs the double quote first.
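The mechanism can be reproduced in isolation with plain `re`. This is only a sketch: `ORIGINAL` and `PROPOSED` are the two patterns quoted in this issue, while `TO_DOUBLE` (rewriting `''` as `"`) and the helper `close_quotes` are stand-ins assumed for illustration, not a copy of the detokenizer's full rule list.

```python
import re

# The two join rules quoted in this issue.
ORIGINAL = re.compile(r"(\S)\s(\'\')")
PROPOSED = re.compile(r"([^'\s])\s(\'\')")

# Assumed stand-in for the later rule that turns '' into a double quote.
TO_DOUBLE = re.compile(r"''")


def close_quotes(text, join_rule):
    """Hypothetical helper: apply one join rule, then convert '' to "."""
    joined = join_rule.sub(r"\1\2", text)
    return TO_DOUBLE.sub('"', joined)


sample = "over,' '' said"

# ORIGINAL glues the quotes into ''' and the leftmost '' becomes ".
print(close_quotes(sample, ORIGINAL))  # over,"' said

# PROPOSED refuses to join after a single quote, so the space survives.
print(close_quotes(sample, PROPOSED))  # over,' " said
```

With the original rule the three quotes run together, and the left-to-right `''` replacement consumes the single quote plus the first half of the double quote, which yields the inverted order.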

Another solution would be to move L291 to after L296. This suppresses the space, but the output can be hard to discern in some fonts. So, after all, it may be desirable to leave a space between both kinds of quotes for readability, and then your proposed fix seems ok.
