-
Notifications
You must be signed in to change notification settings - Fork 165
/
glr-bnfc.html
333 lines (264 loc) · 12.6 KB
/
glr-bnfc.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
<HTML>
<HEAD>
<TITLE> Using BNFC with Happy-GLR </TITLE>
</HEAD>
<BODY>
<HR>
<H1>Using BNFC with Happy-GLR </H1>
<H3><A href="http://www.dur.ac.uk/p.c.callaghan">Paul Callaghan</A>,
University of Durham</H3>
<br><br><HR>
<H2> Introduction </H2>
The <A href="http://www.cs.chalmers.se/~markus/BNFC/">BNF Converter tool</A>
provides generation of useful code in various languages from a
simple BNF specification of a grammar, including a parser, pretty-printer,
and abstract syntax.
<P>
BNFC can now output parser files compatible with the GLR mode of the
<A href="http://haskell.org/happy">Happy parser generator</A>
for <A href="http://haskell.org/">Haskell</A>.
This enables ambiguous grammars to be parsed and for all valid parses
to be examined.
<P>
Further information on the GLR mode can be found in Happy's
<A href="http://www.haskell.org/happy/doc/html/sec-glr.html">
GLR documentation</A>, supplemented by information on
Paul Callaghan's <A href="http://www.dur.ac.uk/p.c.callaghan/happy-glr/">
Happy-GLR information</A> page.
<br><br><HR>
<H2> Basic usage </H2>
<PRE>
bnfc -glr [OPTIONS] FILE.cf
</PRE>
Usage is almost identical to normal Happy mode, except for a new flag
<TT>-glr</TT> which causes a few minor changes in creation of the Happy
parser file (mainly additional type annotations) and of the default test code.
The latter involves changing the way the generated parser is called, and
is explained later.
<P>
No other aspect of BNFC is affected, so functionality remains as expected.
For more information, refer to the documentation on the
<A href="http://www.cs.chalmers.se/~markus/BNFC/">BNFC page</A>.
<br><br><HR>
<H2> Compiling the generated parser </H2>
The Happy parser files must first be converted to Haskell with the following
command. The <TT>--glr</TT> flag causes a GLR parser to be generated, and
the <TT>--decode</TT> causes the GLR parser to generate abstract syntax trees
(this flag is required).
<PRE>
happy --glr --decode [--ghc] ParNAME.y
</PRE>
The optional <TT>--ghc</TT> flag turns on certain optimisations for the
GHC compiler. Without this flag, the Happy-generated parsers should be
compatible with Hugs in <TT>-98</TT> mode: although large parsers may
be rejected by Hugs as "too complex", and other BNFC output code may
require conditional compilation (but some hand-editing may suffice here).
<P>
This invocation of <TT>happy</TT> will generate two files, called
<TT>ParNAME.hs</TT> and <TT>ParNAMEData.hs</TT>, which are the parser
driver and parser data respectively. They are in separate files because
the former needs extensive optimization, but the latter may be large
and thus too expensive to optimize. These files can be compiled with
GHC in the normal way, <EM>although sometimes you may need the
<TT>-fglasgow-exts</TT> file to permit certain instance declarations
created in the parser</EM>.
<br><br><HR>
<H2> Testing the generated GLR parser </H2>
File <TT>TestNAME.hs</TT> contains a simple test harness for the parser,
which reads from standard input and displays the resulting tree(s) or some
error message.
<P>
For GLR parsing, there are two possible errors: premature end of input
or (global) syntax error. The former should be clear, but the second may
require explanation. When ambiguity arises, GLR parsers allow several
interpretations to continue in parallel. Some of these may be discarded
when new input is incompatible with the interpretation. This is a syntax
error for this interpretation, but the only action is to discard the
interpretation. No warning or error is generated, because it doesn't
necessarily mean the overall parse will fail. However, if all interpretations
are dropped, then this is a global syntax error - no parses are possible.
<P>
If the parse succeeds and produces exactly one abstract syntax tree, then
this is shown. If more than one tree is produced, these are shown in turn,
numbering each tree to aid identification and to illustrate how many
different parses succeeded.
<br><br><HR>
<H2> Using the generated GLR parser in other applications </H2>
A simplified interface to the Happy-generated GLR parsers is provided
for BNFC. The code for this is currently given at the end of the
<TT>TestNAME.hs</TT> files, and summarized here.
The full, standard interface to is explained elsewhere, in the
<A href="http://www.haskell.org/happy/doc/html/sec-glr.html">
GLR documentation</A>.
<PRE>
type ParseFun a = [[Token]] -> (GLRResult, GLR_Output (Err a))
</PRE>
(Simplified) parsers obey the pattern represented by the <TT>ParseFun</TT>
synonym above. The input is a list of list of tokens, where the outer
list corresponds to input positions and the inner list corresponds to
different interpretations of a particular input symbol. In Natural
Language terms, the inner list allows representation of morphological
ambiguity, for example the noun and verb senses of <EM>run</EM>.
<P>
The output is the standard parsing result <TT>GLRResult</TT> plus
a more convenient form <TT>GLR_Output (Err a)</TT>.
The <TT>GLR_Output</TT> type allows representation of parse fail
(with error message(s)), or a success (with useful results).
It is polymorphic, in order to support reuse, and the occurrence of
<TT>Err</TT> reflects use of that monad within BNFC.
<P>
The fail case of <TT>GLR_Output</TT> has a main error message plus a
second string for supplementary information, e.g. more detailed output.
The main part of the success case is a list of `semantic' results
(which are abstract syntax trees in the standard BNFC setting),
so this provides convenient access to the parse results.
Also provided is a function of type <TT>(Forest -> Forest) -> [a]</TT>,
which allows modification of the parse forest to be done before the
decoding to semantic results. This is useful if some pruning or reordering
of results is required before the final trees are produced; an example of
this is given below.
<PRE>
type Forest = FiniteMap ForestId [Branch]
data GLR_Output a
= GLR_Result { pruned_decode :: (Forest -> Forest) -> [a]
, semantic_result :: [a]
}
| GLR_Fail { main_message :: String
, extra_info :: String
}
</PRE>
Parsers exported by the GLR driver code can be converted to the simple
form with the following lifting operation (the code is provided in the
<TT>TestNAME.hs</TT> files).
<PRE>
lift_parser
:: (TreeDecode a, Show a, Print a)
=> ([[Token]] -> GLRResult) -> ParseFun a
</PRE>
Finally, it is necessary to create the simplified parser object before using
it in your code. The following idiom is recommended, where <TT>TOPTYPE</TT>
is the root type in the abstract syntax trees and <TT>TOPSYMBOL</TT> is the
name of the topmost (or main) rule in the BNFC input grammar.
This causes the main parser to be lifted appropriately, and via the type
signature it fixes the type of objects to be produced by decoding.
(This is required because of the use of type classes.)
<PRE>
the_parser :: ParseFun TOPTYPE
the_parser = lift_parser pTOPSYMBOL
</PRE>
<br><br><HR>
<H2> Graphical output </H2>
It is possible, and often very useful, to view graphs of the forest.
For example, you can easily see where and how ambiguity is arising.
This may also be helpful for debugging your grammars.
See the page on
<A href="http://www.dur.ac.uk/p.c.callaghan/happy-glr/#graphs">
graph conversion information</A>, which contains extra code for
producing graph output.
<br><br><HR>
<H2> Brief example of pruning or reordering </H2>
In applications where there is moderate ambiguity, for example the parsing
of sentences of 10-20 words with a reasonably detailed NL grammar, some control
over the number and order of resulting trees may be required. This can
be done by supplying a function of type <TT>Forest -> Forest</TT>
to the function in the <TT>pruned_decode</TT> slot of <TT>GLR_Output</TT>.
Consider the following ambiguous grammar fragment
<PRE>
Plus . Expr ::= Expr "+" Expr
Lit . Expr ::= Integer
</PRE>
As an artificial example, suppose we wanted to limit the interpretations from
the first rule to the ones where the left tree covers at least as much input
as the right tree (which we can tell by inspecting the forest indices).
This can be achieved with code like the following.
<PRE>
remove_light_lefts :: Forest -> Forest
remove_light_lefts
= mapFM foo
where
foo k bs = [ b
| b <- bs
, case b_nodes b of
[_] -> True -- an Int
[(a,b,_),_,(c,d,_)] -> b - a >= d - c
]
</PRE>
In this context, <TT>mapFM</TT> applies a function of type
<TT>ForestId -> [Branch] -> [Branch]</TT> to each node in the forest,
where the <TT>ForestId</TT> is the node index and the second argument
its current value, and the result is the new value.
The code above only keeps a branch with three children whose left side is
wider or the same as its right side.
Note that forest indices contain start and end positions, so width is
easy to calculate. You can also look at the last item of the index
3-tuples to get information about grammatical categories.
<P>
For brief reference, the relevant types are listed below. <TT>GSymbol</TT>
is an enumeration of names from the grammar, each name prefixed with
<TT>G_</TT> to avoid name clashes, or it is an embedded token value.
Semantic information is held in the <TT>GSem</TT> values but in general
these are functions and not intended for general use (but see comment
below).
<PRE>
type Forest = FiniteMap ForestId [Branch]
type ForestId = (Int,Int,GSymbol)
data Branch = Branch {b_sem :: GSem, b_nodes :: [ForestId]} deriving Show
data GSymbol = {automatically generated by Happy-GLR} deriving Show
data GSem = {automatically generated by Happy-GLR} deriving Show
</PRE>
<P>
Some final comments on deeper technical aspects of this technique:
<UL>
<LI>Dead sections of the forest are not removed at present, but since
tree decoding is top-down, this won't affect the output. But there
is some overhead for the unused data, whilst the forest is stored in
a <TT>FiniteMap</TT> (as opposed to a constant-time array).
<LI>Note that setting one node's value to <TT>[]</TT> will cause failure
to be propagated to all its (non-choice) ancestors. Thus removing
the branches at some node will cause all interpretations which
include that node to be removed too.
<LI>Since the semantic values are currently stored as functions (for
various reasons, as argued elsewhere), it is not easy to access
the information. You can use <TT>decode_b</TT> in class
<TT>TreeDecode</TT> can extract some information (if you know
the expected type of the relevant sematic result, in order to
select a concrete instance for decoding), but this will cause
traversal of parts of the (original) forest and it can be expensive.
In such circumstances, the <EM>label decoding</EM> mode of
Happy-GLR is recommended (see the documentation).
<LI>If you try experiments with this functionality and get interesting
(or strange) results, please email relevant files and comments
to <A HREF="mailto:P.C.Callaghan@durham.ac.uk">Paul Callaghan</A>.
Such information may help refine and extend this aspect of Happy-GLR.
</UL>
<br><br><HR>
<H2> Known issues and deficiencies </H2>
<UL>
<LI>Multiple grammar entry points not supported yet (currently, a
warning is printed).
<LI>See the Happy-GLR documentation for more details on which features
of Happy are supported, and see Paul Callaghan's
<A href="http://www.dur.ac.uk/p.c.callaghan/happy-glr/">
Happy-GLR information</A> page for updates or news.
<LI>The decoding from parse forest to trees is inefficient in the sense
that subgraphs can be re-visited, although it does avoid the
retention of large data structures in memory during decoding.
For small or moderate examples, this is unlikely to be a
significant point.
<LI>Currently the <TT>Err</TT> monad is not essential for the GLR parsers
and could, in principle, be removed.
In particular, <TT>happyError</TT> is bypassed and no semantic rule
in the grammar uses the <TT>{% ... }</TT> form.
<LI>The <TT>GLR_Output</TT> interface may be incorporated into a future
release of Happy, but the interface is not expected to change.
</UL>
<!--
<br><br><HR>
<H2>Miscellaneous</H2>
-->
<br><br><HR>
<I><FONT SIZE=-1>
last changed: 21 mar 2005.
</FONT></I>
</BODY>
</HTML>