bnfc/document/glr-bnfc.html at master · BNFC/bnfc

History

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

242

243

244

245

246

247

248

249

250

251

252

253

254

255

256

257

258

259

260

261

262

263

264

265

266

267

268

269

270

271

272

273

274

275

276

277

278

279

280

281

282

283

284

285

286

287

288

289

290

291

292

293

294

295

296

297

298

299

300

301

302

303

304

305

306

307

308

309

310

311

312

313

314

315

316

317

318

319

320

321

322

323

324

325

326

327

328

329

330

331

332

333

<HTML>

<HEAD>

<TITLE> Using BNFC with Happy-GLR </TITLE>

</HEAD>

<BODY>

<HR>

<H1>Using BNFC with Happy-GLR </H1>

<H3><A href="http://www.dur.ac.uk/p.c.callaghan">Paul Callaghan</A>,

University of Durham</H3>

<H2> Introduction </H2>

The <A href="http://www.cs.chalmers.se/~markus/BNFC/">BNF Converter tool</A>

provides generation of useful code in various languages from a

simple BNF specification of a grammar, including a parser, pretty-printer,

and abstract syntax.

<P>

BNFC can now output parser files compatible with the GLR mode of the

<A href="http://haskell.org/happy">Happy parser generator</A>

for <A href="http://haskell.org/">Haskell</A>.

This enables ambiguous grammars to be parsed and for all valid parses

to be examined.

<P>

Further information on the GLR mode can be found in Happy's

GLR documentation</A>, supplemented by information on

Paul Callaghan's <A href="http://www.dur.ac.uk/p.c.callaghan/happy-glr/">

Happy-GLR information</A> page.

<H2> Basic usage </H2>

<PRE>

bnfc -glr [OPTIONS] FILE.cf

</PRE>

Usage is almost identical to normal Happy mode, except for a new flag

<TT>-glr</TT> which causes a few minor changes in creation of the Happy

parser file (mainly additional type annotations) and of the default test code.

The latter involves changing the way the generated parser is called, and

is explained later.

<P>

No other aspect of BNFC is affected, so functionality remains as expected.

For more information, refer to the documentation on the

<A href="http://www.cs.chalmers.se/~markus/BNFC/">BNFC page</A>.

<H2> Compiling the generated parser </H2>

The Happy parser files must first be converted to Haskell with the following

command. The <TT>--glr</TT> flag causes a GLR parser to be generated, and

the <TT>--decode</TT> causes the GLR parser to generate abstract syntax trees

(this flag is required).

<PRE>

happy --glr --decode [--ghc] ParNAME.y

</PRE>

The optional <TT>--ghc</TT> flag turns on certain optimisations for the

GHC compiler. Without this flag, the Happy-generated parsers should be

compatible with Hugs in <TT>-98</TT> mode: although large parsers may

be rejected by Hugs as "too complex", and other BNFC output code may

require conditional compilation (but some hand-editing may suffice here).

<P>

This invocation of <TT>happy</TT> will generate two files, called

<TT>ParNAME.hs</TT> and <TT>ParNAMEData.hs</TT>, which are the parser

driver and parser data respectively. They are in separate files because

the former needs extensive optimization, but the latter may be large

and thus too expensive to optimize. These files can be compiled with

GHC in the normal way, <EM>although sometimes you may need the

<TT>-fglasgow-exts</TT> file to permit certain instance declarations

created in the parser</EM>.

<H2> Testing the generated GLR parser </H2>

File <TT>TestNAME.hs</TT> contains a simple test harness for the parser,

which reads from standard input and displays the resulting tree(s) or some

error message.

<P>

For GLR parsing, there are two possible errors: premature end of input

or (global) syntax error. The former should be clear, but the second may

require explanation. When ambiguity arises, GLR parsers allow several

interpretations to continue in parallel. Some of these may be discarded

when new input is incompatible with the interpretation. This is a syntax

error for this interpretation, but the only action is to discard the

interpretation. No warning or error is generated, because it doesn't

necessarily mean the overall parse will fail. However, if all interpretations

are dropped, then this is a global syntax error - no parses are possible.

<P>

If the parse succeeds and produces exactly one abstract syntax tree, then

this is shown. If more than one tree is produced, these are shown in turn,

numbering each tree to aid identification and to illustrate how many

different parses succeeded.

<H2> Using the generated GLR parser in other applications </H2>

A simplified interface to the Happy-generated GLR parsers is provided

for BNFC. The code for this is currently given at the end of the

<TT>TestNAME.hs</TT> files, and summarized here.

The full, standard interface to is explained elsewhere, in the

GLR documentation</A>.

<PRE>

type ParseFun a = [[Token]] -> (GLRResult, GLR_Output (Err a))

</PRE>

(Simplified) parsers obey the pattern represented by the <TT>ParseFun</TT>

synonym above. The input is a list of list of tokens, where the outer

list corresponds to input positions and the inner list corresponds to

different interpretations of a particular input symbol. In Natural

Language terms, the inner list allows representation of morphological

ambiguity, for example the noun and verb senses of <EM>run</EM>.

<P>

The output is the standard parsing result <TT>GLRResult</TT> plus

a more convenient form <TT>GLR_Output (Err a)</TT>.

The <TT>GLR_Output</TT> type allows representation of parse fail

(with error message(s)), or a success (with useful results).

It is polymorphic, in order to support reuse, and the occurrence of

<TT>Err</TT> reflects use of that monad within BNFC.

<P>

The fail case of <TT>GLR_Output</TT> has a main error message plus a

second string for supplementary information, e.g. more detailed output.

The main part of the success case is a list of `semantic' results

(which are abstract syntax trees in the standard BNFC setting),

so this provides convenient access to the parse results.

Also provided is a function of type <TT>(Forest -> Forest) -> [a]</TT>,

which allows modification of the parse forest to be done before the

decoding to semantic results. This is useful if some pruning or reordering

of results is required before the final trees are produced; an example of

this is given below.

<PRE>

type Forest = FiniteMap ForestId [Branch]

data GLR_Output a

= GLR_Result { pruned_decode :: (Forest -> Forest) -> [a]

, semantic_result :: [a]

}

| GLR_Fail { main_message :: String

, extra_info :: String

}

</PRE>

Parsers exported by the GLR driver code can be converted to the simple

form with the following lifting operation (the code is provided in the

<TT>TestNAME.hs</TT> files).

<PRE>

lift_parser

:: (TreeDecode a, Show a, Print a)

=> ([[Token]] -> GLRResult) -> ParseFun a

</PRE>

Finally, it is necessary to create the simplified parser object before using

it in your code. The following idiom is recommended, where <TT>TOPTYPE</TT>

is the root type in the abstract syntax trees and <TT>TOPSYMBOL</TT> is the

name of the topmost (or main) rule in the BNFC input grammar.

This causes the main parser to be lifted appropriately, and via the type

signature it fixes the type of objects to be produced by decoding.

(This is required because of the use of type classes.)

<PRE>

the_parser :: ParseFun TOPTYPE

the_parser = lift_parser pTOPSYMBOL

</PRE>

<H2> Graphical output </H2>

It is possible, and often very useful, to view graphs of the forest.

For example, you can easily see where and how ambiguity is arising.

This may also be helpful for debugging your grammars.

See the page on

graph conversion information</A>, which contains extra code for

producing graph output.

<H2> Brief example of pruning or reordering </H2>

In applications where there is moderate ambiguity, for example the parsing

of sentences of 10-20 words with a reasonably detailed NL grammar, some control

over the number and order of resulting trees may be required. This can

be done by supplying a function of type <TT>Forest -> Forest</TT>

to the function in the <TT>pruned_decode</TT> slot of <TT>GLR_Output</TT>.

Consider the following ambiguous grammar fragment

<PRE>

Plus . Expr ::= Expr "+" Expr

Lit . Expr ::= Integer

</PRE>

As an artificial example, suppose we wanted to limit the interpretations from

the first rule to the ones where the left tree covers at least as much input

as the right tree (which we can tell by inspecting the forest indices).

This can be achieved with code like the following.

<PRE>

remove_light_lefts :: Forest -> Forest

remove_light_lefts

= mapFM foo

where

foo k bs = [ b

| b <- bs

, case b_nodes b of

[_] -> True -- an Int

[(a,b,_),_,(c,d,_)] -> b - a >= d - c

]

</PRE>

In this context, <TT>mapFM</TT> applies a function of type

<TT>ForestId -> [Branch] -> [Branch]</TT> to each node in the forest,

where the <TT>ForestId</TT> is the node index and the second argument

its current value, and the result is the new value.

The code above only keeps a branch with three children whose left side is

wider or the same as its right side.

Note that forest indices contain start and end positions, so width is

easy to calculate. You can also look at the last item of the index

3-tuples to get information about grammatical categories.

<P>

For brief reference, the relevant types are listed below. <TT>GSymbol</TT>

is an enumeration of names from the grammar, each name prefixed with

<TT>G_</TT> to avoid name clashes, or it is an embedded token value.

Semantic information is held in the <TT>GSem</TT> values but in general

these are functions and not intended for general use (but see comment

below).

<PRE>

type Forest = FiniteMap ForestId [Branch]

type ForestId = (Int,Int,GSymbol)

data Branch = Branch {b_sem :: GSem, b_nodes :: [ForestId]} deriving Show

data GSymbol = {automatically generated by Happy-GLR} deriving Show

data GSem = {automatically generated by Happy-GLR} deriving Show

</PRE>

<P>

Some final comments on deeper technical aspects of this technique:

<UL>

<LI>Dead sections of the forest are not removed at present, but since

tree decoding is top-down, this won't affect the output. But there

is some overhead for the unused data, whilst the forest is stored in

a <TT>FiniteMap</TT> (as opposed to a constant-time array).

<LI>Note that setting one node's value to <TT>[]</TT> will cause failure

to be propagated to all its (non-choice) ancestors. Thus removing

the branches at some node will cause all interpretations which

include that node to be removed too.

<LI>Since the semantic values are currently stored as functions (for

various reasons, as argued elsewhere), it is not easy to access

the information. You can use <TT>decode_b</TT> in class

<TT>TreeDecode</TT> can extract some information (if you know

the expected type of the relevant sematic result, in order to

select a concrete instance for decoding), but this will cause

traversal of parts of the (original) forest and it can be expensive.

In such circumstances, the <EM>label decoding</EM> mode of

Happy-GLR is recommended (see the documentation).

<LI>If you try experiments with this functionality and get interesting

(or strange) results, please email relevant files and comments

to <A HREF="mailto:P.C.Callaghan@durham.ac.uk">Paul Callaghan</A>.

Such information may help refine and extend this aspect of Happy-GLR.

</UL>

<H2> Known issues and deficiencies </H2>

<UL>

<LI>Multiple grammar entry points not supported yet (currently, a

warning is printed).

<LI>See the Happy-GLR documentation for more details on which features

of Happy are supported, and see Paul Callaghan's

Happy-GLR information</A> page for updates or news.

<LI>The decoding from parse forest to trees is inefficient in the sense

that subgraphs can be re-visited, although it does avoid the

retention of large data structures in memory during decoding.

For small or moderate examples, this is unlikely to be a

significant point.

<LI>Currently the <TT>Err</TT> monad is not essential for the GLR parsers

and could, in principle, be removed.

In particular, <TT>happyError</TT> is bypassed and no semantic rule

in the grammar uses the <TT>{% ... }</TT> form.

<LI>The <TT>GLR_Output</TT> interface may be incorporated into a future

release of Happy, but the interface is not expected to change.

</UL>

<!--

<H2>Miscellaneous</H2>

-->

last changed: 21 mar 2005.

</FONT></I>

</BODY>

</HTML>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

glr-bnfc.html

glr-bnfc.html

Files

glr-bnfc.html

Latest commit

History

glr-bnfc.html

File metadata and controls