support for utf-8 / unicode characters #249

MrBenGriffin · 2018-12-19T13:02:50Z

No description provided.

MrBenGriffin · 2018-12-19T13:05:06Z

EAdd. Exp ::= Exp "⌽" Exp1 ;
-- syntax error at line 1, column 22 due to lexer error
The inability to lex simple unicode character declarations is a shortfall.
It may be there's some way of addressing this, but the documentation doesn't mention anything.

andreasabel · 2019-01-02T18:56:38Z

By simply allowing unicode in the generated Haskell lexer, I managed to enable this feature for LBNF and some backends, including Haskell and Java.
For some backends, the testsuite still reports errors: C, C++, OCaml.

MrBenGriffin · 2019-01-08T14:45:43Z

Thanks for this Andreas. I will try again ;-D

andreasabel · 2019-02-11T17:28:40Z

To be precise, the only thing I changed was one line in the generated .x file for Alex, from

$u = [\0-\255]          -- universal: any character

to

$u = [. \n]          -- universal: any character

However, I did not do anything to extend the definitions of capital and lower-case letters to unicode.

sillydan1 · 2019-04-02T09:21:26Z

Changing
$u = [\0-\255] -- universal: any character
to
$u = [\x00-\xffff] -- universal: any character
Seems to handle a lot more characters.
(Mind you, I've only tested on the unicode characters –, ” and “ )

andreasabel · 2019-04-02T13:51:13Z

Did you mean the following?

Changing
$u = [. \n] -- universal: any character
to
$u = [\x00-\xffff] -- universal: any character
seems to handle a lot more characters.

sillydan1 · 2019-04-06T07:37:14Z

Yes.

andreasabel · 2019-04-07T08:23:48Z

Strange, that seems to contradict the alex documentation at https://www.haskell.org/alex/doc/html/charsets.html :

.

    The built-in set ‘.’ matches all characters except newline (\n).

    Equivalent to the set [\x00-\x10ffff] # \n.

Or maybe I do not understand it. From my naive point of view \x10ffff is a bigger number than \xffff, thus, the currently implemented range should include your range.
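The range argument above can be checked with a small sketch. It assumes character sets can be modelled as inclusive code-point ranges; the names `bmpRange`, `fullRange`, `inRange'` and `subsetOf` are made up for illustration and are not part of Alex or BNFC.

```haskell
import Data.Char (ord)

-- An inclusive range of code points (illustrative model, not Alex's own type).
type Range = (Int, Int)

bmpRange, fullRange :: Range
bmpRange  = (0x0000, 0xFFFF)    -- models [\x00-\xffff]
fullRange = (0x0000, 0x10FFFF)  -- models [. \n], i.e. [\x00-\x10ffff]

-- Does the character fall inside the range?
inRange' :: Range -> Char -> Bool
inRange' (lo, hi) c = lo <= ord c && ord c <= hi

-- True when every code point of the first range lies in the second.
subsetOf :: Range -> Range -> Bool
subsetOf (lo1, hi1) (lo2, hi2) = lo2 <= lo1 && hi1 <= hi2

main :: IO ()
main = do
  -- Haskell's Char type tops out at U+10FFFF.
  print (ord (maxBound :: Char) == 0x10FFFF)
  -- The narrower range is contained in the wider one.
  print (bmpRange `subsetOf` fullRange)
  -- The characters mentioned in this thread: U+233D, U+2013, U+201D, U+201C.
  print (map (inRange' fullRange) ['\x233D', '\x2013', '\x201D', '\x201C'])
  -- A character outside the Basic Multilingual Plane is only in the wider range.
  print (inRange' bmpRange '\x1F600', inRange' fullRange '\x1F600')
```

If this model is right, any character matched by `[\x00-\xffff]` should also be matched by `[. \n]`, which is why a minimal test case is needed to explain the reported difference.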

Could you provide me with a minimal test case, please?

MrBenGriffin · 2019-04-07T12:31:10Z

Andreas, you are correct.
Unicode extends beyond 0xffff. As I understand it, Unicode-enabled regular expressions treat ‘.’ as any character other than newline and null, bearing in mind that a Unicode character can be represented with a variable number of bytes. Another feature of multi-byte encodings (such as the pretty normative UTF-8) is that the null character is not found in a legal character string (although, depending on implementation, it may be used to mark the end of a string; that is beyond the scope of regular expressions, which are only interested in the strings themselves). It should be clear already that regular expressions are only useful for string-like types, and are not suitable for raw streams. All of this is better expressed elsewhere.

TL;DR

[. \n] is a good pattern for ‘any character’
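To make the "variable number of bytes" point concrete, here is a minimal hand-written UTF-8 encoder. It is a sketch only: `utf8Encode` is a hypothetical helper, it does not reject surrogate code points, and real code would normally use a library instead. In UTF-8 a zero byte can only come from U+0000 itself, never from part of a multi-byte sequence.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)

-- Encode one character as a list of UTF-8 byte values (0..255).
-- Hypothetical helper for illustration; surrogates are not rejected.
utf8Encode :: Char -> [Int]
utf8Encode c
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 .|. (n `shiftR` 6), cont n]
  | n < 0x10000 = [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise   = [ 0xF0 .|. (n `shiftR` 18)
                  , cont (n `shiftR` 12), cont (n `shiftR` 6), cont n ]
  where
    n = ord c
    -- Continuation byte: the low six bits, tagged with the 10xxxxxx prefix.
    cont x = 0x80 .|. (x .&. 0x3F)

main :: IO ()
main = do
  print (utf8Encode 'A')        -- one byte
  print (utf8Encode '\x233D')   -- the "⌽" from the original report: three bytes
  print (utf8Encode '\x1F600')  -- beyond 0xffff: four bytes
  -- No byte of a multi-byte sequence is ever zero.
  print (all (/= 0) (concatMap utf8Encode ['\x80', '\x233D', '\x1F600']))
```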

andreasabel · 2019-04-07T16:15:01Z

@sillydan1: I am closing this, but you are welcome to reopen it with a MWE.

andreasabel · 2020-01-19T15:26:07Z

Fixed printer for C.
C++ seems to work already.
b8701c3 broke #249 for Java/ANTLR in the lexer,
c49d1fd for Java/CUP.

MrBenGriffin closed this as completed Dec 19, 2018

MrBenGriffin reopened this Dec 19, 2018

andreasabel self-assigned this Jan 2, 2019

andreasabel added the enhancement label Jan 2, 2019

andreasabel added this to the 2.8.3 milestone Jan 2, 2019

andreasabel added a commit that referenced this issue Jan 2, 2019

[ #249 ] Unicode support in LBNF by changing Alex lexer

1cb91fe

Keywords and operators in a LBNF grammar can now contain unicode characters. Works for Haskell, Java, Latex... Broken for C, C++, Ocaml [NEEDS WORK]

andreasabel added a commit that referenced this issue Jan 2, 2019

[ #249 ] updated LBNF-report.pdf [ci skip]

9cdc1a3

andreasabel added bug OCaml C++ C and removed enhancement labels Jan 2, 2019

andreasabel closed this as completed Apr 11, 2019

andreasabel added a commit that referenced this issue May 24, 2019

[ OCaml #249 ] Support unicode: just "show" does not produce valid Oc…

23dda42

…aml strings

andreasabel mentioned this issue Dec 17, 2019

Currently broken bnfc-system-tests #280

Open

8 tasks

andreasabel added the Java label Jan 19, 2020

andreasabel added a commit that referenced this issue Jan 19, 2020

[ #249 C ] fixed printer: use renderS instead of renderC for unicode

30260b7

andreasabel added a commit that referenced this issue Jan 19, 2020

[ #249 Java ] fixed lexer regression introduced by #257 / #276

3745b86

Don't use showLitChar for unicode characters! b8701c3 broke #249 for Java/ANTLR in the lexer, c49d1fd for Java/CUP.
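A short sketch of why `showLitChar` is a poor fit here: for characters above ASCII it emits a Haskell-style decimal escape, which target languages such as Java generally do not accept in string literals. The `javaEscape` function below is a hypothetical replacement written for illustration; it is not the code from the commit, whose actual fix is not shown in this thread.

```haskell
import Data.Char (ord, showLitChar)
import Numeric (showHex)

-- Hypothetical helper (not BNFC's actual code): escape one character
-- for use inside a Java string literal.
javaEscape :: Char -> String
javaEscape c
  | c == '"'            = "\\\""
  | c == '\\'           = "\\\\"
  | c == '\n'           = "\\n"
  | c == '\r'           = "\\r"
  | n >= 0x20 && n < 0x7F = [c]            -- printable ASCII stays as-is
  | n <= 0xFFFF         = unit n           -- one \uXXXX escape
  | otherwise           = unit hi ++ unit lo  -- surrogate pair for U+10000 and above
  where
    n  = ord c
    n' = n - 0x10000
    hi = 0xD800 + n' `div` 0x400
    lo = 0xDC00 + n' `mod` 0x400
    -- Format a 16-bit value as \u followed by four hex digits.
    unit x = "\\u" ++ pad4 (showHex x "")
    pad4 s = replicate (4 - length s) '0' ++ s

main :: IO ()
main = do
  -- Haskell's own escape for "⌽" (U+233D) is decimal: \9021
  putStrLn (showLitChar '\x233D' "")
  -- The sketch emits a Java-style escape instead: \u233d
  putStrLn (javaEscape '\x233D')
  -- A character outside the BMP becomes a surrogate pair.
  putStrLn (javaEscape '\x1F600')
```

Each backend has its own literal syntax, so an escaper like this would have to be written per target rather than shared.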

andreasabel mentioned this issue Nov 13, 2020

Support unicode characters in token definitions #324

Closed

4 tasks
