\documentclass[10pt]{article}
\usepackage{setspace}
\setlength{\leftmargin}{30mm}
\setlength{\oddsidemargin}{0mm}
%\setlength{\evensidemargin}{0mm}
\setlength{\evensidemargin}{-2mm}
\setlength{\topmargin}{-16mm}
\setlength{\textheight}{240mm}
\setlength{\textwidth}{158mm}
\newcommand{\bequ}{\begin{quote}}
\newcommand{\enqu}{\end{quote}}
\newcommand{\begit}{\begin{itemize}}
\newcommand{\enit}{\end{itemize}}
\newcommand{\LBNF}{LBNF}
% commands from LBNF documentation
\newcommand{\emptyP}{\mbox{$\epsilon$}}
\newcommand{\terminal}[1]{\mbox{{\textbf {#1}}}}
\newcommand{\nonterminal}[1]{\mbox{$\langle \mbox{{\sl #1 }} \! \rangle$}}
\newcommand{\arrow}{\mbox{::=}}
\newcommand{\delimit}{\mbox{$|$}}
\newcommand{\reserved}[1]{\mbox{{\textbf {#1}}}}
\newcommand{\literal}[1]{\mbox{{\textbf {#1}}}}
\newcommand{\symb}[1]{\mbox{{\textbf {#1}}}}
\title{{\Large \bf Labelled BNF: \\
A High-Level Formalism for Defining Well-Behaved Programming Languages} \\}
\author{ Markus Forsberg and Aarne Ranta \\
Department of Computing Science \\
Chalmers University of Technology and the University of Gothenburg \\
SE-412 96 Gothenburg, Sweden\\
\{markus,aarne\}@cs.chalmers.se}
\date{\today}
\begin{document}
\maketitle
%\begin{doublespace}
\begin{abstract}
This paper introduces
the grammar formalism \textit{Labelled BNF} (LBNF),
and the compiler construction tool
\textit{BNF Converter}.
Given a grammar written in LBNF,
the BNF Converter produces a complete compiler
front end (up to, but excluding, type checking),
i.e.\ a lexer, a parser, and an abstract
syntax definition. Moreover, it produces a pretty-printer
and a language specification in \LaTeX, as well as
a template file for the compiler back end.
A language specification in LBNF is completely declarative and therefore
portable. It dramatically reduces the effort of implementing a language.
The price to pay is that the language must be ``well-behaved'', i.e.\ that
its lexical structure must be describable by a regular expression and
its syntax by a context-free grammar.
\end{abstract}
\subsubsection*{Keywords}
\textit{compiler construction, parser generator, grammar, labelled BNF,
abstract syntax, pretty printer, document automation}
\section{Introduction}
This paper defends an old idea:
a programming language is defined by a BNF grammar \cite{algol}.
This idea is usually not followed for two reasons. One reason is that
a language may require more
powerful methods (consider, for example, languages with layout rules). The
other reason is that, when parsing, one often wants to do other things at
the same time (such as type checking).
Hence the idea of extending pure BNF with semantic actions,
written in a general-purpose
programming language. However, such actions destroy declarativity and portability.
To describe the language, it becomes
necessary to write a separate document, since the BNF no longer defines
the language. Also the problem of synchronization arises: how to
guarantee that the different modules---the lexer, the parser,
and the document, etc.---describe the same language and that they fit together?
The idea in LBNF is to use BNF, with construction of syntax trees as
the only semantic action. This gives a unique source for all language-related
modules, and it also solves the problem of synchronization. Thereby it
dramatically reduces the effort of implementing a new language. Generating
syntax trees instead of using more complex semantic actions is
a natural phase of \textit{multi-phase compilation}, which is recommended by most
modern textbooks on compiler construction (e.g.\ Appel \cite{appel}).
BNF grammars are an ingredient of all modern compilers.
When designing LBNF, we tried to keep it so simple and intuitive that
it can be learnt in a few minutes by anyone who knows ordinary BNF.
Of course, there are some drawbacks to our approach.
Not all languages can be completely
defined, although surprisingly many can (see Section~\ref{results}).
Another drawback is that the modules generated are not quite as good
as handwritten. But this is a general problem when generating code instead
of handwriting it: a problem shared by all compilers, including the
standard parser and lexer generation tools.
To use LBNF descriptions as implementations, we have
built the \textit{BNF Converter}\cite{bnfc}.
Given an input LBNF grammar,
the BNF Converter produces a lexer, a parser, and an abstract
syntax definition. Moreover, it produces a pretty-printer
and a language specification in \LaTeX.
Since all this is generated from a \textit{single source}, we can be sure
that the documentation corresponds to the actual
language, and that the lexer, parser and abstract syntax fit
seamlessly together.
The BNF Converter is written in the functional programming language
Haskell\cite{haskell98}, and
its target languages are presently Haskell, the associated
compiler tools Happy\cite{happy} and Alex\cite{alex}, and \LaTeX.
%NEW
Happy is a parser generator tool, similar to YACC\cite{johnson-yacc}, which
from a BNF-like description
builds an LALR(1) parser.
Alex is a lexer generator tool, similar to Lex\cite{lesk-lex},
which converts a regular expression
into a finite-state automaton.
%new
Over the years, Haskell and these tools have
proven to be excellent devices for compiler construction,
%NEW
to a large extent because of Haskell's
algebraic data types and a convenient method of
syntax-directed translation via pattern matching;
%new
yet they do not quite remove the need for repetitive and low-level
coding. The BNF Converter can be seen as a high-level front end
to these tools.
However, due to its declarative nature, LBNF
does not crucially depend on the target language, and
it is therefore possible to redirect the BNF Converter as a front end to
another set of compiler tools. This has in fact recently been done
for Java, CUP \cite{cup}, and
JLex \cite{jlex}\footnote{Work by Michael Pellauer at Chalmers}.
%NEW
The only essential difference between Haskell/Happy/Alex
and Java/CUP/JLex or C/YACC/Lex
is the target language included in the parser and
lexer description.
%new
\section{The LBNF Grammar Formalism}
As the first example of LBNF,
consider a triple of rules defining addition expressions with ``1'':
\begin{verbatim}
EPlus. Exp ::= Exp "+" Num ;
ENum. Exp ::= Num ;
NOne. Num ::= "1" ;
\end{verbatim}
Apart from the \textit{labels}, {\tt EPlus}, {\tt ENum}, and {\tt NOne},
the rules are
ordinary BNF rules, with terminal symbols enclosed in
double quotes and nonterminals written without quotes.
The labels serve as \textit{constructors} for
syntax trees.
From an LBNF grammar, the BNF Converter extracts
an \textit{abstract syntax}\ and
a \textit{concrete syntax}.
The abstract syntax is implemented, in Haskell, as a system of
datatype definitions
\begin{verbatim}
data Exp = EPlus Exp Exp | ENum Num
data Num = NOne
\end{verbatim}
(For other languages, including C and Java, an equivalent
representation can be given in the same way as in the Zephyr
abstract syntax specification tool \cite{zephyr}.)
The concrete syntax is implemented by the
lexer, parser and pretty-printer algorithms,
which are defined in other generated program modules.
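To make the connection concrete, the following Haskell fragment (our illustration, not BNF Converter output) repeats the two generated datatypes and adds a hypothetical evaluator, showing how the abstract syntax is consumed by pattern matching:
\small
\begin{verbatim}
-- Illustration only: the two datatypes as generated above, plus an
-- evaluator of our own to show syntax-directed translation by
-- pattern matching. The tree for "1+1" is EPlus (ENum NOne) (ENum NOne).
data Exp = EPlus Exp Exp | ENum Num
data Num = NOne

eval :: Exp -> Integer
eval (EPlus e1 e2) = eval e1 + eval e2
eval (ENum NOne)   = 1

main :: IO ()
main = print (eval (EPlus (ENum NOne) (ENum NOne)))  -- prints 2
\end{verbatim}
\normalsize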
\subsection{LBNF in a nutshell}
Briefly, an LBNF grammar is a BNF grammar where every rule is given a label.
The label is used for constructing a syntax tree whose subtrees are
given by the nonterminals of the rule, in the same order.
More formally, an LBNF grammar consists of a collection of rules,
which have the following form (expressed by a regular expression;
Appendix A gives a complete BNF definition of the notation):
\bequ
Ident "{\tt .}" Ident "{\tt ::=}" (Ident $\mid$ String)* "{\tt;}" ;
\enqu
The first identifier is the \textit{rule label}, followed by the
\textit{value category}. On the right-hand side of the production
arrow ({\tt ::=}) is the list of production items. An item is either
a quoted string (\textit{terminal}) or a category symbol (\textit{non-terminal}).
A rule whose value category is $C$ is also called a \textit{production}\ for $C$.
Identifiers, that is, rule names and category symbols, can be
chosen {\em ad libitum}, subject to the restrictions imposed by the target language. To
satisfy Haskell, as well as C and Java, the following rule is imposed:
\bequ
An identifier is a nonempty sequence of letters, starting with a
capital letter.
\enqu
LBNF is clearly sufficient for defining any context-free language.
However, the abstract syntax that it generates may often become too
detailed. Without destroying the declarative nature and the simplicity
of LBNF, we have added to it four {\em ad hoc} conventions, which are
described in the following subsection.
\subsection{LBNF conventions}
\subsubsection{Predefined basic types}
The first convention concerns predefined basic types.
Basic types, such as integer and character, can of course be
defined in a labelled BNF, for example:
\begin{verbatim}
Char_a. Char ::= "a" ;
Char_b. Char ::= "b" ;
\end{verbatim}
This is, however, cumbersome and inefficient. Instead, we have decided to
extend our formalism with predefined basic types, and represent their
grammar as a part of lexical structure. These types are the following,
as defined by LBNF regular expressions (see \ref{reg} for
the regular expression syntax):
\bequ
\begit
\item {\tt Integer} of integers, defined
\verb6digit+6
\item {\tt Double} of floating point numbers, defined
\verb6digit+ '.' digit+ ('e' '-'? digit+)?6
\item {\tt Char} of characters (in single quotes), defined
\verb6'\'' ((char - ["'\\"]) | ('\\' ["'\\nt"])) '\''6
\item {\tt String} of strings (in double quotes), defined
\verb6'"' ((char - ["\"\\"]) | ('\\' ["\"\\nt"]))* '"'6
\item {\tt Ident} of identifiers, defined
\verb6letter (letter | digit | '_' | '\'')*6
\enit
\enqu
In the abstract syntax, these types are represented by the corresponding
types of the target language. In Haskell, we also need to define a new type for {\tt Ident}:
\begin{verbatim}
newtype Ident = Ident String
\end{verbatim}
For example, the LBNF rules
\begin{verbatim}
EVar. Exp ::= Ident ;
EInt. Exp ::= Integer ;
EStr. Exp ::= String ;
\end{verbatim}
generate the abstract syntax
\begin{verbatim}
data Exp = EVar Ident | EInt Integer | EStr String
\end{verbatim}
where {\tt Integer} and {\tt String} have their standard Haskell meanings.
The lexer only produces the high-precision variants of integers and
floats; authors of applications can
truncate these numbers later if they want to have
low precision instead.
Predefined categories may not have explicit productions in the
grammar, since this would violate their predefined meanings.
\subsubsection{Semantic dummies}
Sometimes the concrete syntax of a language includes rules that make
no semantic difference. An example is
a BNF rule making the parser accept extra semicolons after statements:
\begin{verbatim}
Stm ::= Stm ";" ;
\end{verbatim}
As this rule is semantically a dummy, we do not want to represent it by a
constructor in the abstract syntax.
Instead, we introduce the following convention:
\bequ
A rule label can be an underscore {\tt \_}, which
does not add anything to the syntax tree.
\enqu
Thus we can write the following rule in LBNF:
\begin{verbatim}
_ . Stm ::= Stm ";" ;
\end{verbatim}
Underscores are of course only meaningful as replacements for
one-argument constructors where the value type is the same as the
argument type.
Semantic dummies leave no trace in the pretty-printer.
Thus, for instance, the pretty-printer ``normalizes away''
extra semicolons.
\subsubsection{Precedence levels}
A common idiom in (ordinary) BNF is to use indexed variants of categories
to express precedence levels:
\begin{verbatim}
Exp3 ::= Integer ;
Exp2 ::= Exp2 "*" Exp3 ;
Exp ::= Exp "+" Exp2 ;
Exp ::= Exp2 ;
Exp2 ::= Exp3 ;
Exp3 ::= "(" Exp ")" ;
\end{verbatim}
The precedence level regulates the order of parsing, including
associativity. Parentheses lift an expression of any level
to the highest level.
A straightforward labelling of the above rules creates a grammar that does
have the desired recognition behavior, but the abstract syntax is cluttered
with type distinctions (between {\tt Exp}, {\tt Exp2}, and {\tt Exp3})
and constructors (from the last three rules) with no semantic content.
The BNF Converter solution is to distinguish among
category symbols those that are just indexed variants of each other:
\bequ
A category symbol can end with an integer index
(i.e.\ a sequence of digits), and is then treated as a type
synonym of the corresponding non-indexed symbol.
\enqu
Thus {\tt Exp2} and {\tt Exp3} are indexed variants of {\tt Exp}.
Transitions between indexed variants are
semantically dummy, and we do not want to represent them by
constructors in the abstract syntax. To do this, we extend the use
of underscores to indexed variants.
The example grammar above can now be labelled as follows:
\begin{verbatim}
EInt. Exp3 ::= Integer ;
ETimes. Exp2 ::= Exp2 "*" Exp3 ;
EPlus. Exp ::= Exp "+" Exp2 ;
_. Exp ::= Exp2 ;
_. Exp2 ::= Exp3 ;
_. Exp3 ::= "(" Exp ")" ;
\end{verbatim}
Thus the datatype of expressions becomes simply
\begin{verbatim}
data Exp = EInt Integer | ETimes Exp Exp | EPlus Exp Exp
\end{verbatim}
and the syntax tree for {\tt 2*(3+1)} is
\begin{verbatim}
ETimes (EInt 2) (EPlus (EInt 3) (EInt 1))
\end{verbatim}
Indexed categories {\em can} be used for purposes other than
precedence, since the only thing we can formally check is the
type skeleton (see Section~\ref{typecheck}).
The parser does not need to know
that the indices mean precedence, but only that indexed
variants have values of the same type.
The pretty-printer, however, assumes that
indexed categories are used for precedence, and may produce
strange results if they are used in some other way.
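For illustration (the evaluator below is ours, not generated), the flattened datatype can be interpreted with no reference to precedence levels at all, since the levels have disappeared from the abstract syntax:
\small
\begin{verbatim}
-- Illustration only: evaluating the flattened Exp datatype.
-- Precedence was a concern of the concrete syntax; it leaves no
-- trace in the abstract syntax, so the evaluator ignores it.
data Exp = EInt Integer | ETimes Exp Exp | EPlus Exp Exp

eval :: Exp -> Integer
eval (EInt n)     = n
eval (ETimes a b) = eval a * eval b
eval (EPlus a b)  = eval a + eval b

-- The tree for 2*(3+1) from the text:
tree :: Exp
tree = ETimes (EInt 2) (EPlus (EInt 3) (EInt 1))

main :: IO ()
main = print (eval tree)  -- prints 8
\end{verbatim}
\normalsize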
\subsubsection{Polymorphic lists}
It is easy to define monomorphic list types in LBNF:
\begin{verbatim}
NilDef. ListDef ::= ;
ConsDef. ListDef ::= Def ";" ListDef ;
\end{verbatim}
However, compiler writers in languages like
Haskell may want to use predefined
polymorphic lists, because of the language support for these constructs.
LBNF permits the use of Haskell's list constructors
as labels, and list brackets in category names:
\begin{verbatim}
[]. [Def] ::= ;
(:). [Def] ::= Def ";" [Def] ;
\end{verbatim}
As a general rule, we have
\bequ
\begit
\item {\tt[}$C${\tt ]}, the category of lists of type $C$;
\item {\tt []} and {\tt (:)}, the Nil and Cons rule labels;
\item {\tt (:[])}, the rule label for one-element lists.
\enit
\enqu
The third rule label is used to place an at-least-one restriction,
but also to permit special treatment of one-element lists
in the concrete syntax.
In the \LaTeX\ document (for stylistic reasons) and in the Happy file (for
syntactic reasons), the category name {\tt [X]} is replaced by {\tt ListX}.
In order for this not to cause clashes, {\tt ListX}
may not be at the same time used explicitly in the grammar.
The list category constructor can be iterated: {\tt [[X]]}, {\tt [[[X]]]}, etc.\
behave in the expected way.
The list notation can also be seen as a variant of the Kleene star and plus, and
hence as an ingredient from Extended BNF.
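Since {\tt [Def]} is implemented as an ordinary Haskell list, the standard list library applies directly to parse results. A small sketch, with a stand-in {\tt Def} type of our own (not taken from any generated module):
\small
\begin{verbatim}
-- Illustration only: Def is a hypothetical stand-in category.
-- A parsed sequence of definitions is a plain Haskell list, so
-- functions like length, map, and filter apply without conversion.
data Def = Def String deriving (Eq, Show)

countDefs :: [Def] -> Int
countDefs = length

main :: IO ()
main = print (countDefs [Def "f", Def "g", Def "h"])  -- prints 3
\end{verbatim}
\normalsize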
\subsection{The type-correctness of LBNF rules}
\label{typecheck}
It is customary in parser generators to delegate the checking of certain
errors to the target language. For instance, a Happy source file that
Happy processes without complaints can still produce a Haskell file
that is rejected by Haskell. In the same way, the BNF Converter
delegates some checking to Happy and Haskell (for instance,
the parser conflict check). However, since it is always
easiest for the programmer to understand error messages
related to the source, the BNF Converter performs some checks of its own,
which are mostly connected with the sanity of the abstract syntax.
The type checker uses a notion of the
\textit{category skeleton} of a rule, which is a pair
\[
(C, A\ldots B)
\]
where $C$ is the unindexed left-hand-side non-terminal and $A\ldots B$
is the sequence of unindexed right-hand-side non-terminals of the rule.
In other words, the category skeleton of a rule expresses the abstract-syntax
type of the semantic action associated to that rule.
We also need the notions of
a \textit{regular category} and
a \textit{regular rule label}.
Briefly, regular labels and categories are the user-defined ones.
More formally,
a regular category is none of
{\tt[}$C${\tt]},{\tt Integer}, {\tt Double}, {\tt Char}, {\tt String}
and {\tt Ident}.
A regular rule label is none of
{\tt \_}, {\tt []}, {\tt (:)}, and {\tt (:[])}.
The type checking rules are now the following:
\bequ
\begit
\item A rule labelled by {\tt \_} must have a category skeleton of form $(C,C)$.
\item A rule labelled by {\tt []} must have a category skeleton of form $([C],)$.
\item A rule labelled by {\tt (:)} must have a category skeleton of form $([C],C[C])$.
\item A rule labelled by {\tt (:[])} must have a category skeleton of form $([C],C)$.
\item Only regular categories may have productions with regular rule labels.
\item Every regular category occurring in the grammar
must have at least one production with a regular rule label.
\item All rules with the same regular rule label must have the same
category skeleton.
\enit
\enqu
The second-last rule corresponds to the absence of empty data types in Haskell.
The last rule could
be strengthened so as to require that all regular rule labels be unique:
this is needed to guarantee error-free pretty-printing.
Violating this strengthened rule currently
generates only a warning, not a type error.
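The skeleton computation can be sketched in Haskell as follows; the names below are ours and do not reflect the BNF Converter's internals:
\small
\begin{verbatim}
-- Illustration only: category skeletons and the check for
-- "_"-labelled rules. Names (Rule, unindex, dummyOk) are ours.
import Data.Char (isDigit)

type Cat = String
data Rule = Rule { label :: String, lhs :: Cat, rhs :: [Cat] }

-- Drop a trailing integer index: "Exp2" and "Exp3" both become "Exp".
unindex :: Cat -> Cat
unindex c = reverse (dropWhile isDigit (reverse c))

-- The skeleton pairs the unindexed LHS with the unindexed RHS
-- non-terminals (terminals are assumed to be filtered out already).
skeleton :: Rule -> (Cat, [Cat])
skeleton (Rule _ l rs) = (unindex l, map unindex rs)

-- A "_"-labelled rule must have skeleton (C, C).
dummyOk :: Rule -> Bool
dummyOk r = case skeleton r of
  (c, [c']) -> c == c'
  _         -> False
\end{verbatim}
\normalsize
For example, {\tt dummyOk} accepts the rule \verb6_. Exp1 ::= Exp2 ;6 (skeleton $(Exp, Exp)$) and rejects \verb6_. Stm ::= Exp ;6.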
\section{LBNF Pragmas}
Even well-behaved languages have features that
cannot be expressed naturally in a BNF grammar.
To take care of them, while preserving the single-source nature of
the BNF Converter, we extend the notation with what we call \textit{pragmas}.
All these pragmas are completely declarative, and they are also
reflected in the generated documentation.
\subsection{Comment pragmas}
The first pragma tells what kinds of \textit{comments} the language has.
Normally we do not want comments to appear in the abstract syntax,
but treat them in the lexical analysis. The comment pragma instructs the
lexer generator (and the document generator!) to
treat certain pieces of text as comments and thus to ignore them
(except for their contribution to the position information used in
parser error messages).
The simplest solution to the comment
problem would be to use some default comments
that are hard-coded into the system, e.g.\ Haskell's comments.
But this definition can hardly be stated as a condition for a language
to be well-behaved, and we could not even
define C or Java or ML then. So we have added a comment pragma, whose
regular-expression syntax is
\bequ
"{\tt comment}" String String? "{\tt ;}"
\enqu
The first string tells how a comment begins.
The second, optional, string marks the end of a comment;
if it is not given, the comment extends to the end of the line.
For instance, to describe the Haskell comment convention,
we write the following lines in our LBNF source file:
\begin{verbatim}
comment "--" ;
comment "{-" "-}" ;
\end{verbatim}
Since comments are treated in the lexical analyzer, they must
be recognized by a finite state automaton.
This excludes the use of nested comments unless defined in
the grammar itself. Discarding nested comments is one aspect
of what we call well-behaved languages.
The length of comment end markers is restricted to two characters,
due to the complexities in the lexer caused by longer end markers.
\subsection{Internal pragmas}
Sometimes we want to include in the abstract syntax
structures that are not part of the concrete syntax,
and hence not parsable.
They can be, for instance, syntax trees that are produced by a
type-annotating type checker.
Even though they are not parsable, we may want to
pretty-print them, for instance, in the type checker's
error messages.
To define such an internal constructor, we use a pragma
\begin{verbatim}
"internal" Rule ";"
\end{verbatim}
where {\tt Rule} is a normal LBNF rule. For instance,
\begin{verbatim}
internal EVarT. Exp ::= "(" Ident ":" Type ")";
\end{verbatim}
introduces a type-annotated variant of a variable expression.
\subsection{Token pragmas}
\label{reg}
The predefined lexical types are sufficient in
most cases, but sometimes we would like to have more control over the
lexer. This is provided by \textit{token pragmas}. They use
regular expressions to define new token types.
If we, for example, want to make a finer distinction for identifiers,
a distinction between lower- and upper-case letters, we can introduce
two new token types, {\tt UIdent}\ and {\tt LIdent}, as follows.
\begin{verbatim}
token UIdent (upper (letter | digit | '\_')*) ;
token LIdent (lower (letter | digit | '\_')*) ;
\end{verbatim}
The regular expression syntax of LBNF is specified in the Appendix.
The abbreviations with strings in brackets need a word of explanation:
\begin{quote}
\verb6["abc7%"]6 denotes the union of the characters \verb6'a' 'b' 'c' '7' '%'6
\verb6{"abc7%"}6 denotes the sequence of the characters \verb6'a' 'b' 'c' '7' '%'6
\end{quote}
The atomic expressions {\tt upper}, {\tt lower}, {\tt letter}, and {\tt digit}
denote the character classes suggested by their names (letters are ISO Latin-1).
The expression {\tt char} matches any Unicode character, and
the ``epsilon'' expression {\tt eps} matches the empty string.
\subsection{Entry point pragmas}
The BNF Converter generates, by default, a parser for every category in
the grammar. This is unnecessarily rich in most cases, and makes the parser
larger than needed. If the size of the parser becomes critical,
the \textit{entry points pragma} enables the user
to define which of the parsers are actually exported:
\begin{verbatim}
entrypoints (Ident ",")* Ident ;
\end{verbatim}
For instance, the following pragma defines {\tt Stm} and {\tt Exp} to be
the only entry points:
\begin{verbatim}
entrypoints Stm, Exp ;
\end{verbatim}
\section{BNF Converter code generation}
\subsection{The files}
Given an LBNF source file {\tt Foo.cf},
the BNF Converter generates the following files:
\begin{itemize}
\item {\tt AbsFoo.hs}: The abstract syntax (Haskell source file)
\item {\tt LexFoo.x}: The lexer (Alex source file)
\item {\tt ParFoo.y}: The parser (Happy source file)
\item {\tt PrintFoo.hs}: The pretty printer (Haskell source file)
\item {\tt SkelFoo.hs}: The case skeleton (Haskell source file)
\item {\tt TestFoo.hs}: A test bench file for the parser and pretty printer (Haskell source file)
\item {\tt DocFoo.tex}: The language document (\LaTeX\ source file)
\item {\tt makefile}: A makefile for the lexer, the parser, and the document
\end{itemize}
In addition to these files, the user needs the Alex runtime file
{\tt Alex.hs} and the error monad definition file {\tt ErrM.hs}, both
included in the BNF Converter distribution.
\subsection{Example: {\tt JavaletteLight.cf}}
The following LBNF grammar defines a small C-like language,
Javalette Light\footnote{It is a fragment of the
language Javalette, used in compiler construction courses at Chalmers University.}.
% \end{doublespace}
\small
\begin{verbatim}
Fun. Prog ::= Typ Ident "(" ")" "{" [Stm] "}" ;
SDecl. Stm ::= Typ Ident ";" ;
SAss. Stm ::= Ident "=" Exp ";" ;
SIncr. Stm ::= Ident "++" ";" ;
SWhile. Stm ::= "while" "(" Exp ")" "{" [Stm] "}" ;
ELt. Exp0 ::= Exp1 "<" Exp1 ;
EPlus. Exp1 ::= Exp1 "+" Exp2 ;
ETimes. Exp2 ::= Exp2 "*" Exp3 ;
EVar. Exp3 ::= Ident ;
EInt. Exp3 ::= Integer ;
EDouble. Exp3 ::= Double ;
TInt. Typ ::= "int" ;
TDouble. Typ ::= "double" ;
[]. [Stm] ::= ;
(:). [Stm] ::= Stm [Stm] ;
-- coercions
_. Stm ::= Stm ";" ;
_. Exp ::= Exp0 ;
_. Exp0 ::= Exp1 ;
_. Exp1 ::= Exp2 ;
_. Exp2 ::= Exp3 ;
_. Exp3 ::= "(" Exp ")" ;
-- pragmas
internal ExpT. Exp ::= Typ Exp ;
comment "/*" "*/" ;
comment "//" ;
entrypoints Prog, Stm, Exp ;
\end{verbatim}
\normalsize
%\begin{doublespace}
\subsubsection{The abstract syntax {\tt AbsJavaletteLight.hs}}
The abstract syntax of Javalette generated by the
BNF Converter is essentially what a Haskell programmer would write by hand:
%\end{doublespace}
\small
\begin{verbatim}
data Prog =
Fun Typ Ident [Stm]
deriving (Eq,Show)
data Stm =
SDecl Typ Ident
| SAss Ident Exp
| SIncr Ident
| SWhile Exp [Stm]
deriving (Eq,Show)
data Exp =
ELt Exp Exp
| EPlus Exp Exp
| ETimes Exp Exp
| EVar Ident
| EInt Integer
| EDouble Double
| ExpT Typ Exp
deriving (Eq,Show)
data Typ =
TInt
| TDouble
deriving (Eq,Show)
\end{verbatim}
\normalsize
%\begin{doublespace}
\subsubsection{The lexer {\tt LexJavaletteLight.x}}
The lexer file (in Alex) consists mostly of standard rules
for literals and identifiers, but has
rules added for reserved words and symbols (i.e.\ terminals
occurring in the grammar) and for comments. Here is
a fragment with the definitions characteristic of
Javalette.
%\end{doublespace}
\small
\begin{verbatim}
{ %s = ^( | ^) | ^{ | ^} | ^; | ^= | ^+ ^+ | ^< | ^+ | ^*}
"tokens_lx"/"tokens_acts":-
<> ::= ^/^/ [.]* ^n
<> ::= ^/ ^* ([^u # ^*] | ^* [^u # ^/])* (^*)+ ^/
<> ::= ^w+
<pTSpec> ::= %s %{ pTSpec p = PT p . TS %}
<ident> ::= ^l ^i* %{ ident p = PT p . eitherResIdent TV %}
<int> ::= ^d+ %{ int p = PT p . TI %}
<double> ::= ^d+ ^. ^d+ (e (^-)? ^d+)? %{ double p = PT p . TD %}
eitherResIdent :: (String -> Tok) -> String -> Tok
eitherResIdent tv s = if isResWord s then (TS s) else (tv s) where
isResWord s = elem s ["double","int","while"]
\end{verbatim}
\normalsize
% \begin{doublespace}
The lexer file moreover defines the token type {\tt Tok} used by
the lexer and the parser.
\subsubsection{The parser {\tt ParJavaletteLight.y}}
The parser file (in Happy) has a large number of token definitions
(which we find it extremely valuable to generate automatically),
followed by parsing rules corresponding closely to the source BNF rules.
Here is a fragment containing examples of both parts:
% \end{doublespace}
\small
\begin{verbatim}
%token
'(' { PT _ (TS "(") }
')' { PT _ (TS ")") }
'double' { PT _ (TS "double") }
'int' { PT _ (TS "int") }
'while' { PT _ (TS "while") }
L_integ { PT _ (TI $$) }
L_doubl { PT _ (TD $$) }
%%
Integer : L_integ { (read $1) :: Integer }
Double : L_doubl { (read $1) :: Double }
Stm :: { Stm }
Stm : Typ Ident ';' { SDecl $1 $2 }
| Ident '=' Exp ';' { SAss $1 $3 }
| Ident '++' ';' { SIncr $1 }
| 'while' '(' Exp ')' '{' ListStm '}' { SWhile $3 (reverse $6) }
| Stm ';' { $1 }
Exp0 :: { Exp }
Exp0 : Exp1 '<' Exp1 { ELt $1 $3 }
| Exp1 { $1 }
\end{verbatim} % $
\normalsize
% \begin{doublespace}
The exported parsers have types of the following
form, for any abstract syntax type {\tt T},
\begin{verbatim}
[Tok] -> Err T
\end{verbatim}
returning either a value of type {\tt T} or an error message, using
a simple error monad. The input is a token list received from the lexer.
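For concreteness, a minimal error monad in the spirit of the distributed {\tt ErrM.hs} can be sketched as follows (the actual module may differ in details):
\small
\begin{verbatim}
-- Sketch of a minimal error monad; the distributed ErrM.hs may differ.
-- A parser of type [Tok] -> Err T yields either Ok t or Bad msg.
data Err a = Ok a | Bad String
  deriving (Eq, Show)

instance Functor Err where
  fmap f (Ok a)  = Ok (f a)
  fmap _ (Bad s) = Bad s

instance Applicative Err where
  pure = Ok
  Ok f  <*> x = fmap f x
  Bad s <*> _ = Bad s

instance Monad Err where
  return      = pure
  Ok a  >>= f = f a
  Bad s >>= _ = Bad s
\end{verbatim}
\normalsize
The monad lets error messages propagate through a chain of parsing and post-processing steps: once a {\tt Bad} value appears, the remaining steps are skipped.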
\subsubsection{The pretty-printer {\tt PrintJavaletteLight.hs}}
The pretty-printer consists of a Haskell class {\tt Print} with instances
for all generated datatypes, taking precedence into account. The class method
{\tt prt}\
generates a list of strings for a syntax tree of any type.
% \end{doublespace}
\small
\begin{verbatim}
instance Print Exp where
prt i e = case e of
ELt exp0 exp -> prPrec i 0 (concat [prt 1 exp0 , ["<"] , prt 1 exp])
EPlus exp0 exp -> prPrec i 1 (concat [prt 1 exp0 , ["+"] , prt 2 exp])
ETimes exp0 exp -> prPrec i 2 (concat [prt 2 exp0 , ["*"] , prt 3 exp])
\end{verbatim}
\normalsize
% \begin{doublespace}
The list is then put into layout (indentation, newlines) by a \textit{rendering}
function, which is generated independently of the grammar,
but written with easy modification in mind.
\subsubsection{The case skeleton {\tt SkelJavaletteLight.hs}}
The case skeleton can be used as a basis when defining the compiler
back end, e.g.\ type checker and code generator. The same skeleton is
actually also used in the pretty printer. The case branches in the skeleton
are initialized to show error messages saying that the case is undefined.
% \end{doublespace}
\small
\begin{verbatim}
transExp :: Exp -> Result
transExp x = case x of
ELt exp0 exp -> failure x
EPlus exp0 exp -> failure x
ETimes exp0 exp -> failure x
\end{verbatim}
\normalsize
%\begin{doublespace}
\subsubsection{The language document {\tt DocJavaletteLight.tex}}
We show the main parts of the generated JavaletteLight document
in a typeset form. The grammar symbols in the document are produced
by \LaTeX\ macros, with easy modification in mind.
%\end{doublespace}
\footnotesize
\begin{quote}
\section*{The lexical structure of JavaletteLight}
\subsection*{Identifiers}
Identifiers \nonterminal{Ident} are unquoted strings beginning with a letter,
followed by any combination of letters, digits, and the characters {\tt \_ '},
reserved words excluded.
\subsection*{Literals}
Integer literals \nonterminal{Int}\ are nonempty sequences of digits.
Double-precision float literals \nonterminal{Double}\ have the structure
indicated by the regular expression $\nonterminal{digit}+ \mbox{{\it `.'}} \nonterminal{digit}+ (\mbox{{\it `e'}} \mbox{{\it `-'}}? \nonterminal{digit}+)?$ i.e.\
two sequences of digits separated by a decimal point, optionally
followed by an unsigned or negative exponent.
\subsection*{Reserved words and symbols}
The set of reserved words is the set of terminals appearing in the grammar. Those reserved words that consist of non-letter characters are called symbols, and they are treated in a different way from those that are similar to identifiers. The lexer follows rules familiar from languages like Haskell, C, and Java, including longest match and spacing conventions.
The reserved words used in JavaletteLight are the following: \\
\begin{tabular}{lll}
{\reserved{double}} &{\reserved{int}} &{\reserved{while}} \\
\end{tabular}\\
The symbols used in JavaletteLight are the following: \\
\begin{tabular}{lll}
{\symb{(}} &{\symb{)}} &{\symb{ \{ }} \\
{\symb{ \} }} &{\symb{;}} &{\symb{{$=$}}} \\
{\symb{{$+$}{$+$}}} &{\symb{{$<$}}} &{\symb{{$+$}}} \\
{\symb{*}} & & \\
\end{tabular}\\
\subsection*{Comments}
Single-line comments begin with {\symb{//}}. \\Multiple-line comments are enclosed with {\symb{/*}} and {\symb{*/}}.
\section*{The syntactic structure of JavaletteLight}
Non-terminals are enclosed between $\langle$ and $\rangle$.
The symbols {\arrow} (production), {\delimit} (union)
and {\emptyP} (empty rule) belong to the BNF notation.
All other symbols are terminals.\\
\begin{tabular}{lll}
{\nonterminal{Prog}} & {\arrow} &{\nonterminal{Typ}} {\nonterminal{Ident}} {\terminal{(}} {\terminal{)}} {\terminal{ \{ }} {\nonterminal{ListStm}} {\terminal{ \} }} \\
\end{tabular}\\
\begin{tabular}{lll}
{\nonterminal{Stm}} & {\arrow} &{\nonterminal{Typ}} {\nonterminal{Ident}} {\terminal{;}} \\
& {\delimit} &{\nonterminal{Ident}} {\terminal{{$=$}}} {\nonterminal{Exp}} {\terminal{;}} \\
& {\delimit} &{\nonterminal{Ident}} {\terminal{{$+$}{$+$}}} {\terminal{;}} \\
& {\delimit} &{\terminal{while}} {\terminal{(}} {\nonterminal{Exp}} {\terminal{)}} {\terminal{ \{ }} {\nonterminal{ListStm}} {\terminal{ \} }} \\
& {\delimit} &{\nonterminal{Stm}} {\terminal{;}} \\
\end{tabular}\\
\begin{tabular}{lll}
{\nonterminal{Exp0}} & {\arrow} &{\nonterminal{Exp1}} {\terminal{{$<$}}} {\nonterminal{Exp1}} \\
& {\delimit} &{\nonterminal{Exp1}} \\
\end{tabular}\\
\begin{tabular}{lll}
{\nonterminal{Exp1}} & {\arrow} &{\nonterminal{Exp1}} {\terminal{{$+$}}} {\nonterminal{Exp2}} \\
& {\delimit} &{\nonterminal{Exp2}} \\
\end{tabular}\\
\begin{tabular}{lll}
{\nonterminal{Exp2}} & {\arrow} &{\nonterminal{Exp2}} {\terminal{*}} {\nonterminal{Exp3}} \\
& {\delimit} &{\nonterminal{Exp3}} \\
\end{tabular}\\
\begin{tabular}{lll}
{\nonterminal{Exp3}} & {\arrow} &{\nonterminal{Ident}} \\
& {\delimit} &{\nonterminal{Integer}} \\
& {\delimit} &{\nonterminal{Double}} \\
& {\delimit} &{\terminal{(}} {\nonterminal{Exp}} {\terminal{)}} \\
\end{tabular}\\
\begin{tabular}{lll}
{\nonterminal{ListStm}} & {\arrow} &{\emptyP} \\
& {\delimit} &{\nonterminal{Stm}} {\nonterminal{ListStm}} \\
\end{tabular}\\
\begin{tabular}{lll}
{\nonterminal{Exp}} & {\arrow} &{\nonterminal{Exp0}} \\
\end{tabular}\\
\begin{tabular}{lll}
{\nonterminal{Typ}} & {\arrow} &{\terminal{int}} \\
& {\delimit} &{\terminal{double}} \\
\end{tabular}\
\end{quote}
\normalsize
% \begin{doublespace}
\subsubsection{The {\tt makefile}}
The makefile is used to run Alex on the lexer, Happy on the parser, and
\LaTeX\ on the document, by simply typing {\tt make}. The {\tt make clean}
command removes the generated files.
\subsubsection{The test bench file {\tt TestJavaletteLight.hs}}
The test bench file can be loaded in the Haskell interpreter Hugs to
run the parser and the pretty-printer on terminal or file input.
The test functions display a syntax tree (or an error message)
and the pretty-printer result from the same tree.
\subsection{An optimization: left-recursive lists}
\label{leftrec}
The BNF representation of lists is right-recursive, following Haskell's
list constructor. Right-recursive rules, however, are an inefficient way of
parsing lists in an LALR parser. The smart programmer would implement a pair
of rules such as JavaletteLight's
\begin{verbatim}
[]. [Stm] ::= ;
(:). [Stm] ::= Stm [Stm] ;
\end{verbatim}
not in the direct way,
\begin{verbatim}
ListStm : {- empty -} { [] }
| Stm ListStm { (:) $1 $2 }
\end{verbatim}
but under a left-recursive transformation:
\begin{verbatim}
ListStm : {- empty -} { [] }
| ListStm Stm { flip (:) $1 $2 }
\end{verbatim}
Then the smart programmer would also be careful to reverse the list
when it is used:
\begin{verbatim}
Prog : Typ Ident '(' ')' '{' ListStm '}' { Fun $1 $2 (reverse $6) }
\end{verbatim}%$
As reported in the Happy manual, this transformation is vital
to avoid running out of stack space with long lists.
Thus we have implemented the transformation in the BNF Converter for
pairs of rules of the form
\begin{verbatim}
[]. [C] ::= ;
(:). [C] ::= C x [C] ;
\end{verbatim}
where {\tt C} is any category and {\tt x} is any sequence of
terminals (possibly empty).
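The effect of the transformed semantic actions can be checked with a small Haskell simulation (a sketch; {\tt buildList} is our own name): each reduction prepends an element with {\tt flip (:)}, so the parser's semantic value is the list in reverse, and the single {\tt reverse} at the use site restores source order.

```haskell
-- Simulates the left-recursive rule's semantic actions: each
-- reduction performs flip (:), building the list back to front.
buildList :: [a] -> [a]
buildList = foldl (flip (:)) []

main :: IO ()
main = do
  print (buildList ["s1", "s2", "s3"])            -- reversed order
  print (reverse (buildList ["s1", "s2", "s3"]))  -- source order restored
```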
There is another important parsing technique, recursive descent,
which cannot tolerate left recursion at all: a recursive-descent parser
loops infinitely on a left-recursive grammar (cf.\ e.g.\ \cite{appel}).
The question sometimes arises if, when designing a grammar,
one should take into account what method will be used for parsing it.
The view we are advocating is that the designer of the grammar
should in the first place think of the abstract syntax, and let
the parser generator perform automatic
grammar transformations that are needed by the parsing method.
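To make the recursive-descent problem concrete, here is an illustrative hand-written parser (our own sketch, not BNF Converter output) for the left-recursive rule {\tt Exp1 ::= Exp1 "+" Exp2 | Exp2}: a naive translation would call {\tt pExp1} first and loop forever, so instead one {\tt Exp2} is parsed and the trailing {\tt "+" Exp2} parts are consumed iteratively, rebuilding the left-nested tree.

```haskell
-- Our own sketch of a recursive-descent parser for
--   Exp1 ::= Exp1 "+" Exp2 | Exp2
-- A naive translation (pExp1 calling pExp1 first) would loop
-- forever; instead we parse one Exp2 and then iterate over the
-- trailing "+" Exp2 parts, rebuilding the left-nested tree.
data Exp = EPlus Exp Exp | EInt Int deriving (Show, Eq)

type Parser a = [String] -> Maybe (a, [String])

pExp2 :: Parser Exp
pExp2 (t:ts) | not (null t), all (`elem` "0123456789") t
             = Just (EInt (read t), ts)
pExp2 _      = Nothing

pExp1 :: Parser Exp
pExp1 ts = do
  (e, ts') <- pExp2 ts
  go e ts'
  where
    go acc ("+" : rest) = do
      (e2, rest') <- pExp2 rest
      go (EPlus acc e2) rest'
    go acc rest = Just (acc, rest)

main :: IO ()
main = print (pExp1 ["1", "+", "2", "+", "3"])
```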
\section{Discussion}
\subsection{Results}
\label{results}
LBNF and the BNF Converter~\cite{bnfc} were introduced as a teaching tool at the
fourth-year compiler course in Spring 2003 at Chalmers.
The goal was, on the one hand, to advocate the use of declarative
and portable language definitions, and on the other hand, to
leave more time for back-end construction in a compiler course.