Support block comment delimiters > 2 characters in Haskell backend #202

andreasabel · 2017-02-21T10:59:39Z

comment "--(" "--)";

reports

Warning: comment delimiters longer than 2 characters ignored in Haskell:
--( - --)

andreasabel · 2019-12-15T21:56:49Z

The C/Java backends (?Lex) did not support more than one block comment form. This has been fixed now by introducing different start conditions COMMENT, COMMENT1, ... for different block comment forms so that they do not get mixed up.

Further, single line comments should not be expected to be terminated by a newline character, as it could also be the EOF character.
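
As an illustration only (Python, not the generated flex/JLex code), the idea of one start condition per block comment form, plus line comments that may end at EOF, can be sketched as follows. The function name strip_comments and its parameters are hypothetical.

```python
def strip_comments(src, block_forms, line_openers):
    """Remove comments from src.

    block_forms: list of (opener, closer) pairs; line_openers: list of strings.
    `state` is None outside comments, otherwise the index of the block form we
    are inside. This mimics one start condition (COMMENT, COMMENT1, ...) per form.
    """
    out = []
    state = None
    i = 0
    while i < len(src):
        if state is not None:
            # Inside a block comment: only the closer of *this* form ends it.
            closer = block_forms[state][1]
            if src.startswith(closer, i):
                i += len(closer)
                state = None
            else:
                i += 1
            continue
        for n, (opener, _) in enumerate(block_forms):
            if src.startswith(opener, i):
                state = n
                i += len(opener)
                break
        else:
            for opener in line_openers:
                if src.startswith(opener, i):
                    end = src.find("\n", i)
                    # A line comment ends at the newline or, failing that, at EOF.
                    i = len(src) if end == -1 else end
                    break
            else:
                out.append(src[i])
                i += 1
    return "".join(out)
```

With the forms ("/*", "*/") and ("{-", "-}"), a "-}" inside a /* ... */ comment does not close it, and a trailing line comment without a final newline is still removed.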

andreasabel · 2019-12-17T17:56:26Z

For Haskell/Java, I implemented in 83681fa a translation from block comment terminator strings to regular expressions that works correctly for C-style comments /* ... */ and HTML-style comments <!-- ... -->. (The previous solution 799ab5f used a translation in BNFC.Lexing that could not deal with block comments correctly.)
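
For the C-style case, the difficulty can be shown with Python's re module. The two patterns below are textbook regexes chosen for illustration; they are not claimed to be the ones BNFC generates. The sample input ends its first comment with "**/".

```python
import re

# Naive attempt: a '*' inside the comment must be followed by a non-'/' character.
# It cannot stop at "**/" because the first '*' swallows the second one.
NAIVE = re.compile(r"/\*([^*]|\*[^/])*\*/")

# Classic pattern: inside the comment, a run of '*' must be followed by a
# character that is neither '*' nor '/'; the final run of '*' precedes the '/'.
CAREFUL = re.compile(r"/\*([^*]|\*+[^*/])*\*+/")

def strip_block_comments(pattern, src):
    """Delete every match of the given comment pattern from src."""
    return pattern.sub("", src)

SAMPLE = "/* **/ int x; /* */"
```

On SAMPLE, the careful pattern leaves " int x; ", while the naive one treats the whole input as a single comment.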

However, it fails when the terminator string has non-trivial repetitions, e.g., ananas. Here the generated regular expression skips over the terminator, e.g., in the input anananas.

Prg. Program ::= [Integer];
terminator Integer "";

comment "banana" "ananas";

Input:

0
banana anananas
1
banana ananas
2

Output:

[Abstract Syntax]
Prg [0,2]

[Linearized tree]
0 2

The correct lexing of anananas would, after scanning anana and seeing nas, continue as if it had seen anan and were now looking at as.
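
This fall-back step is what a Knuth-Morris-Pratt failure table provides. A minimal Python sketch follows; the names failure_table and find_end are hypothetical, and the fallback=False branch is a deliberately naive stand-in for a matcher that skips the terminator, not BNFC's earlier translation.

```python
def failure_table(word):
    """fail[i] = length of the longest proper prefix of word[:i+1]
    that is also a suffix of word[:i+1]."""
    fail = [0] * len(word)
    k = 0
    for i in range(1, len(word)):
        while k > 0 and word[i] != word[k]:
            k = fail[k - 1]
        if word[i] == word[k]:
            k += 1
        fail[i] = k
    return fail

def find_end(text, word, start=0, fallback=True):
    """Return the index just past the first occurrence of word in
    text[start:], or -1 if there is none.

    With fallback=False the matcher simply resets to state 0 on a mismatch,
    which loses the partial match "ana" when reading "anananas".
    """
    fail = failure_table(word)
    k = 0  # number of characters of word matched so far
    for i in range(start, len(text)):
        c = text[i]
        if fallback:
            while k > 0 and c != word[k]:
                k = fail[k - 1]  # fall back to the longest reusable prefix
            if c == word[k]:
                k += 1
        else:
            k = k + 1 if c == word[k] else 0
        if k == len(word):
            return i + 1
    return -1
```

For the terminator ananas, the table sends state anana back to ana on a mismatch, so the n that follows extends the match to anan instead of discarding it.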

This is the well-known problem of recognizing a search word of length m in a text of length n in O(n+m) time, and solutions are implemented in the lexer generators we target. Thus, maybe we should not try to reinvent the wheel.
Going via regular expressions seems expensive for this problem anyway; using lexer states or start conditions (as flex and JLex do for lexing comments) might be better also when targeting Alex.

This handles cases beyond comment terminators such as "*/" and "-->". In the general but unlikely case, a comment terminator may have non-trivial internal repetitions, as in "ananas". While lexing "anananas", after having seen "anana" we need to fall back to state "ana" in order to handle the rest "nas" of the input correctly and recognize the comment terminator.

Caveat: The OCaml backend cannot handle the general case (like "ananas" as a comment terminator), since ocamllex gives up on the regex we create: transition table overflow, automaton is too big

Literature: See the Knuth-Morris-Pratt algorithm of complexity O(n+m) to recognize a keyword of length m in a text of length n. (Dragon book, second edition, section 3.4.5; Knuth/Morris/Pratt (SIAM J. Computing 1977), "Fast pattern matching in strings").

andreasabel · 2019-12-18T16:46:26Z

In the end, I did reinvent the wheel and generate a (sometimes huge) regex for recognizing the end of a block comment: 4e57e8e.
The reason is that in Alex, unlike flex, JLex etc., start conditions do not work out of the box, but need the monad wrapper. It seemed more work to change this than to solve the puzzle of how to generate the regex. (To my surprise, my solution worked on the first try, but I spent hours formulating it elegantly using foldl/r rather than hacking it in with arrays and indices.)

Caveat: This regex can be too big for ocamllex to handle, which generates automata with max 2^15 transitions. It might be a problem with ocamllex as the resulting DFA should be small. Also, there might be a better way to do this with ocamllex.
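
To illustrate why the automaton for a single terminator should be small, here is a Python sketch that builds the transition table directly, with one state per prefix of the terminator. The function terminator_dfa is hypothetical and uses a brute-force construction for clarity; it says nothing about how ocamllex or Alex build their automata.

```python
def terminator_dfa(word, alphabet):
    """Build delta[state][char] -> state for recognizing `word`.

    State k means: the longest suffix of the input read so far that is a
    prefix of `word` has length k. State len(word) is the accepting state.
    """
    m = len(word)
    delta = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for c in alphabet:
            seen = word[:state] + c
            # Find the longest suffix of `seen` that is a prefix of `word`.
            k = min(m, len(seen))
            while k > 0 and not seen.endswith(word[:k]):
                k -= 1
            delta[state][c] = k
    return delta

def run(delta, text):
    """Feed text through the table and return the final state."""
    state = 0
    for c in text:
        state = delta[state][c]
    return state
```

For the terminator ananas this gives 7 states, and the number of states grows only linearly with the terminator length.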

Alex can be a bit slow turning this regex into a DFA. I wonder whether there are faster algorithms that can deal with large regexes; the intermediate NFA Alex creates seems huge.

andreasabel · 2020-10-08T13:44:00Z

The remaining problem (see #280) is in the OCaml backend: ocamllex cannot stomach the generated regexes, creating too big automata. Filed bug at ocaml/ocaml#9964.

Alex could be faster, reported haskell/alex#163.

andreasabel · 2020-10-12T09:44:17Z

Xavier Leroy recommends using ocamllex states: ocaml/ocaml#9964 (comment)

andreasabel added the comments (Concerning the "comment" pragma) and enhancement labels Nov 24, 2019

andreasabel closed this as completed in 799ab5f Nov 24, 2019

andreasabel self-assigned this Nov 24, 2019

andreasabel added this to the 2.8.4 milestone Nov 24, 2019

andreasabel mentioned this issue Nov 24, 2019

Comments with single-character delimiters not working #169

Closed

andreasabel added a commit that referenced this issue Dec 14, 2019

[ #202 #169 ] remove obsolete warning about block comment delims >= 2

2d3da13

The Haskell backend now supports block comment delimiters of any length.

andreasabel referenced this issue Dec 17, 2019

Refactor generated OCaml lexers to handle custom tokens correctly

d665fa5

* Use PrettyPrint * Allow comment delimiters of more than 2 characters * Fix #111 * This also fixes some pending OCaml tests

andreasabel added a commit that referenced this issue Dec 17, 2019

[ #202 ] Haskell, Ocaml: correct lexing of C-style block comments

83681fa

See also #108: /* **/ int x; /* */ Extra care needed to recognize "**/" as end-of-comment.

andreasabel added a commit that referenced this issue Dec 17, 2019

[ #202 ] Ocaml: fixed doctest

1c491a4

andreasabel reopened this Dec 17, 2019

andreasabel added the OCaml label Dec 18, 2019

andreasabel mentioned this issue Oct 5, 2020

Currently broken bnfc-system-tests #280

Open

8 tasks

andreasabel modified the milestones: 2.8.4, 3.0 Oct 8, 2020

andreasabel mentioned this issue Oct 8, 2020

ocamllex cannot handle large regexes (with small minimal automaton) ocaml/ocaml#9964

Closed
