\documentclass[10pt]{article}
\usepackage{setspace}
\setlength{\leftmargin}{30mm}
\setlength{\oddsidemargin}{0mm}
%\setlength{\evensidemargin}{0mm}
\setlength{\evensidemargin}{-2mm}
\setlength{\topmargin}{-16mm}
\setlength{\textheight}{240mm}
\setlength{\textwidth}{158mm}
\newcommand{\bequ}{\begin{quote}}
\newcommand{\enqu}{\end{quote}}
\newcommand{\begit}{\begin{itemize}}
\newcommand{\enit}{\end{itemize}}
\newcommand{\LBNF}{LBNF}
% commands from LBNF documentation
\newcommand{\emptyP}{\mbox{$\epsilon$}}
\newcommand{\terminal}[1]{\mbox{{\textbf {#1}}}}
\newcommand{\nonterminal}[1]{\mbox{$\langle \mbox{{\sl #1 }} \! \rangle$}}
\newcommand{\arrow}{\mbox{::=}}
\newcommand{\delimit}{\mbox{$|$}}
\newcommand{\reserved}[1]{\mbox{{\textbf {#1}}}}
\newcommand{\literal}[1]{\mbox{{\textbf {#1}}}}
\newcommand{\symb}[1]{\mbox{{\textbf {#1}}}}
\title{{\Large \bf Labelled BNF: \\
A High-Level Formalism for Defining Well-Behaved Programming Languages} \\}
\author{ Markus Forsberg and Aarne Ranta \\
Department of Computing Science \\
Chalmers University of Technology and the University of Gothenburg \\
SE-412 96 Gothenburg, Sweden\\
\{markus,aarne\}@cs.chalmers.se}
\date{\today}
\begin{document}
\maketitle
%\begin{doublespace}
\begin{abstract}
This paper introduces
the grammar formalism \textit{Labelled BNF} (LBNF),
and the compiler construction tool
\textit{BNF Converter}.
Given a grammar written in LBNF,
the BNF Converter produces a complete compiler
front end (up to, but excluding, type checking),
i.e.\ a lexer, a parser, and an abstract
syntax definition. Moreover, it produces a pretty-printer
and a language specification in \LaTeX, as well as
a template file for the compiler back end.
A language specification in LBNF is completely declarative and therefore
portable. It dramatically reduces the effort of implementing a language.
The price to pay is that the language must be ``well-behaved'', i.e.\ that
its lexical structure must be describable by a regular expression and
its syntax by a context-free grammar.
\end{abstract}
\subsubsection*{Keywords}
\textit{compiler construction, parser generator, grammar, labelled BNF,
abstract syntax, pretty printer, document automation}
\section{Introduction}
This paper defends an old idea:
a programming language is defined by a BNF grammar \cite{algol}.
This idea is usually not followed for two reasons. One reason is that
a language may require more
powerful methods (consider, for example, languages with layout rules). The
other reason is that, when parsing, one often wants to do other things at
the same time (such as type checking).
Hence the idea of extending pure BNF with semantic actions,
written in a general-purpose
programming language. However, such actions destroy declarativity and portability.
To describe the language, it becomes
necessary to write a separate document, since the BNF no longer defines
the language. Also the problem of synchronization arises: how to
guarantee that the different modules---the lexer, the parser,
and the document, etc.---describe the same language and that they fit together?
The idea in LBNF is to use BNF, with construction of syntax trees as
the only semantic action. This gives a unique source for all language-related
modules, and it also solves the problem of synchronization. Thereby it
dramatically reduces the effort of implementing a new language. Generating
syntax trees instead of using more complex semantic actions is
a natural phase of \textit{multi-phase compilation}, which is recommended by most
modern textbooks on compiler construction (e.g.\ Appel \cite{appel}).
BNF grammars are an ingredient of all modern compilers.
When designing LBNF, we tried to keep it so simple and intuitive that
it can be learnt in a few minutes by anyone who knows ordinary BNF.
Of course, there are some drawbacks to our approach.
Not all languages can be completely
defined, although surprisingly many can (see Section~\ref{results}).
Another drawback is that the modules generated are not quite as good
as handwritten. But this is a general problem when generating code instead
of handwriting it: a problem shared by all compilers, including the
standard parser and lexer generation tools.
To use LBNF descriptions as implementations, we have
built the \textit{BNF Converter}\cite{bnfc}.
Given an input LBNF grammar,
the BNF Converter produces a lexer, a parser, and an abstract
syntax definition. Moreover, it produces a pretty-printer
and a language specification in \LaTeX.
Since all this is generated from a \textit{single source}, we can be sure
that the documentation corresponds to the actual
language, and that the lexer, parser and abstract syntax fit
seamlessly together.
The BNF Converter is written in the functional programming language
Haskell\cite{haskell98}, and
its target languages are presently Haskell, the associated
compiler tools Happy\cite{happy} and Alex\cite{alex}, and \LaTeX.
%NEW
Happy is a parser generator tool, similar to YACC\cite{johnson-yacc}, which
from a BNF-like description
builds an LALR(1) parser.
Alex is a lexer generator tool, similar to Lex\cite{lesk-lex},
which converts a regular expression
into a finite-state automaton.
%new
Over the years, Haskell and these tools have
proven to be excellent devices for compiler construction,
%NEW
to a large extent because of Haskell's
algebraic data types and a convenient method of
syntax-directed translation via pattern matching;
%new
yet they do not quite remove the need for repetitive and low-level
coding. The BNF Converter can be seen as a high-level front end
to these tools.
However, due to its declarative nature, LBNF
does not crucially depend on the target language, and
it is therefore possible to redirect the BNF Converter as a front end to
another set of compiler tools. This has in fact recently been done
for Java, CUP \cite{cup}, and
JLex \cite{jlex}\footnote{Work by Michael Pellauer at Chalmers}.
%NEW
The only essential difference between Haskell/Happy/Alex
and Java/CUP/JLex or C/YACC/Lex
is the target language included in the parser and
lexer description.
%new
\section{The LBNF Grammar Formalism}
As the first example of LBNF,
consider a triple of rules defining addition expressions with ``1'':
\begin{verbatim}
EPlus. Exp ::= Exp "+" Num ;
ENum. Exp ::= Num ;
NOne. Num ::= "1" ;
\end{verbatim}
Apart from the \textit{labels}, {\tt EPlus}, {\tt ENum}, and {\tt NOne},
the rules are
ordinary BNF rules, with terminal symbols enclosed in
double quotes and nonterminals written without quotes.
The labels serve as \textit{constructors} for
syntax trees.
From an LBNF grammar, the BNF Converter extracts
an \textit{abstract syntax}\ and
a \textit{concrete syntax}.
The abstract syntax is implemented, in Haskell, as a system of
datatype definitions
\begin{verbatim}
data Exp = EPlus Exp Exp | ENum Num
data Num = NOne
\end{verbatim}
(For other languages, including C and Java, an equivalent
representation can be given in the same way as in the Zephyr
abstract syntax specification tool \cite{zephyr}.)
The concrete syntax is implemented by the
lexer, parser and pretty-printer algorithms,
which are defined in other generated program modules.
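To make the connection concrete, the following Haskell fragment (our illustration, not BNF Converter output) repeats the two generated datatypes and adds a hypothetical evaluator, showing how the abstract syntax is consumed by pattern matching:
\small
\begin{verbatim}
-- Illustration only: the two datatypes as generated above, plus an
-- evaluator of our own to show syntax-directed translation by
-- pattern matching. The tree for "1+1" is EPlus (ENum NOne) (ENum NOne).
data Exp = EPlus Exp Exp | ENum Num
data Num = NOne

eval :: Exp -> Integer
eval (EPlus e1 e2) = eval e1 + eval e2
eval (ENum NOne)   = 1

main :: IO ()
main = print (eval (EPlus (ENum NOne) (ENum NOne)))  -- prints 2
\end{verbatim}
\normalsize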
\subsection{LBNF in a nutshell}
Briefly, an LBNF grammar is a BNF grammar where every rule is given a label.
The label is used for constructing a syntax tree whose subtrees are
given by the nonterminals of the rule, in the same order.
More formally, an LBNF grammar consists of a collection of rules,
which have the following form (expressed by a regular expression;
Appendix A gives a complete BNF definition of the notation):
\bequ
Ident "{\tt .}" Ident "{\tt ::=}" (Ident $\mid$ String)* "{\tt;}" ;
\enqu
The first identifier is the \textit{rule label}, followed by the
\textit{value category}. On the right-hand side of the production
arrow ({\tt ::=}) is the list of production items. An item is either
a quoted string (\textit{terminal}) or a category symbol (\textit{non-terminal}).
A rule whose value category is $C$ is also called a \textit{production}\ for $C$.
Identifiers, that is, rule names and category symbols, can be
chosen {\em ad libitum}, subject to the restrictions imposed by the target language. To
satisfy Haskell, as well as C and Java, the following rule is imposed:
\bequ
An identifier is a nonempty sequence of letters, starting with a
capital letter.
\enqu
LBNF is clearly sufficient for defining any context-free language.
However, the abstract syntax that it generates may often become too
detailed. Without destroying the declarative nature and the simplicity
of LBNF, we have added to it four {\em ad hoc} conventions, which are
described in the following subsection.
\subsection{LBNF conventions}
\subsubsection{Predefined basic types}
The first convention concerns predefined basic types.
Basic types, such as integer and character, can of course be
defined in a labelled BNF, for example:
\begin{verbatim}
Char_a. Char ::= "a" ;
Char_b. Char ::= "b" ;
\end{verbatim}
This is, however, cumbersome and inefficient. Instead, we have decided to
extend our formalism with predefined basic types, and represent their
grammar as a part of lexical structure. These types are the following,
as defined by LBNF regular expressions (see \ref{reg} for
the regular expression syntax):
\bequ
\begit
\item {\tt Integer} of integers, defined
\verb6digit+6
\item {\tt Double} of floating point numbers, defined
\verb6digit+ '.' digit+ ('e' '-'? digit+)?6
\item {\tt Char} of characters (in single quotes), defined
\verb6'\'' ((char - ["'\\"]) | ('\\' ["'\\nt"])) '\''6
\item {\tt String} of strings (in double quotes), defined
\verb6'"' ((char - ["\"\\"]) | ('\\' ["\"\\nt"]))* '"'6
\item {\tt Ident} of identifiers, defined
\verb6letter (letter | digit | '_' | '\'')*6
\enit
\enqu
In the abstract syntax, these types are represented by the corresponding
types of the target language. In Haskell, we also need to define a new type for {\tt Ident}:
\begin{verbatim}
newtype Ident = Ident String
\end{verbatim}
For example, the LBNF rules
\begin{verbatim}
EVar. Exp ::= Ident ;
EInt. Exp ::= Integer ;
EStr. Exp ::= String ;
\end{verbatim}
generate the abstract syntax
\begin{verbatim}
data Exp = EVar Ident | EInt Integer | EStr String
\end{verbatim}
where {\tt Integer} and {\tt String} have their standard Haskell meanings.
The lexer only produces the high-precision variants of integers and
floats; authors of applications can
truncate these numbers later if they want to have
low precision instead.
Predefined categories may not have explicit productions in the
grammar, since this would violate their predefined meanings.
\subsubsection{Semantic dummies}
Sometimes the concrete syntax of a language includes rules that make
no semantic difference. An example is
a BNF rule making the parser accept extra semicolons after statements:
\begin{verbatim}
Stm ::= Stm ";" ;
\end{verbatim}
As this rule is semantically a dummy, we do not want to represent it by a
constructor in the abstract syntax.
Instead, we introduce the following convention:
\bequ
A rule label can be an underscore {\tt \_}, which
does not add anything to the syntax tree.
\enqu
Thus we can write the following rule in LBNF:
\begin{verbatim}
_ . Stm ::= Stm ";" ;
\end{verbatim}
Underscores are of course only meaningful as replacements for
one-argument constructors where the value type is the same as the
argument type.
Semantic dummies leave no trace in the pretty-printer.
Thus, for instance, the pretty-printer ``normalizes away''
extra semicolons.
\subsubsection{Precedence levels}
A common idiom in (ordinary) BNF is to use indexed variants of categories
to express precedence levels:
\begin{verbatim}
Exp3 ::= Integer ;
Exp2 ::= Exp2 "*" Exp3 ;
Exp ::= Exp "+" Exp2 ;
Exp ::= Exp2 ;
Exp2 ::= Exp3 ;
Exp3 ::= "(" Exp ")" ;
\end{verbatim}
The precedence level regulates the order of parsing, including
associativity. Parentheses lift an expression of any level
to the highest level.
A straightforward labelling of the above rules creates a grammar that does
have the desired recognition behavior, but the abstract syntax is cluttered
with type distinctions (between {\tt Exp}, {\tt Exp2}, and {\tt Exp3})
and constructors (from the last three rules) with no semantic content.
The BNF Converter solution is to distinguish among
category symbols those that are just indexed variants of each other:
\bequ
A category symbol can end with an integer index
(i.e.\ a sequence of digits), and is then treated as a type
synonym of the corresponding non-indexed symbol.
\enqu
Thus {\tt Exp2} and {\tt Exp3} are indexed variants of {\tt Exp}.
Transitions between indexed variants are
semantically dummy, and we do not want to represent them by
constructors in the abstract syntax. To do this, we extend the use
of underscores to indexed variants.
The example grammar above can now be labelled as follows:
\begin{verbatim}
EInt. Exp3 ::= Integer ;
ETimes. Exp2 ::= Exp2 "*" Exp3 ;
EPlus. Exp ::= Exp "+" Exp2 ;
_. Exp ::= Exp2 ;
_. Exp2 ::= Exp3 ;
_. Exp3 ::= "(" Exp ")" ;
\end{verbatim}
Thus the datatype of expressions becomes simply
\begin{verbatim}
data Exp = EInt Integer | ETimes Exp Exp | EPlus Exp Exp
\end{verbatim}
and the syntax tree for {\tt 2*(3+1)} is
\begin{verbatim}
ETimes (EInt 2) (EPlus (EInt 3) (EInt 1))
\end{verbatim}
Indexed categories {\em can} be used for purposes other than
precedence, since the only thing we can formally check is the
type skeleton (see Section~\ref{typecheck}).
The parser does not need to know
that the indices mean precedence, but only that indexed
variants have values of the same type.
The pretty-printer, however, assumes that
indexed categories are used for precedence, and may produce
strange results if they are used in some other way.
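For illustration (the evaluator below is ours, not generated), the flattened datatype can be interpreted with no reference to precedence levels at all, since the levels have disappeared from the abstract syntax:
\small
\begin{verbatim}
-- Illustration only: evaluating the flattened Exp datatype.
-- Precedence was a concern of the concrete syntax; it leaves no
-- trace in the abstract syntax, so the evaluator ignores it.
data Exp = EInt Integer | ETimes Exp Exp | EPlus Exp Exp

eval :: Exp -> Integer
eval (EInt n)     = n
eval (ETimes a b) = eval a * eval b
eval (EPlus a b)  = eval a + eval b

-- The tree for 2*(3+1) from the text:
tree :: Exp
tree = ETimes (EInt 2) (EPlus (EInt 3) (EInt 1))

main :: IO ()
main = print (eval tree)  -- prints 8
\end{verbatim}
\normalsize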
\subsubsection{Polymorphic lists}
It is easy to define monomorphic list types in LBNF:
\begin{verbatim}
NilDef. ListDef ::= ;
ConsDef. ListDef ::= Def ";" ListDef ;
\end{verbatim}
However, compiler writers in languages like
Haskell may want to use predefined
polymorphic lists, because of the language support for these constructs.
LBNF permits the use of Haskell's list constructors
as labels, and list brackets in category names:
\begin{verbatim}
[]. [Def] ::= ;
(:). [Def] ::= Def ";" [Def] ;
\end{verbatim}
As a general rule, we have
\bequ
\begit
\item {\tt[}$C${\tt ]}, the category of lists of type $C$;
\item {\tt []} and {\tt (:)}, the Nil and Cons rule labels;
\item {\tt (:[])}, the rule label for one-element lists.
\enit
\enqu
The third rule label is used to place an at-least-one restriction,
but also to permit special treatment of one-element lists
in the concrete syntax.
In the \LaTeX\ document (for stylistic reasons) and in the Happy file (for
syntactic reasons), the category name {\tt [X]} is replaced by {\tt ListX}.
In order for this not to cause clashes, {\tt ListX}
may not be at the same time used explicitly in the grammar.
The list category constructor can be iterated: {\tt [[X]]}, {\tt [[[X]]]}, etc.\
behave in the expected way.
The list notation can also be seen as a variant of the Kleene star and plus, and
hence as an ingredient from Extended BNF.
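Since {\tt [Def]} is implemented as an ordinary Haskell list, the standard list library applies directly to parse results. A small sketch, with a stand-in {\tt Def} type of our own (not taken from any generated module):
\small
\begin{verbatim}
-- Illustration only: Def is a hypothetical stand-in category.
-- A parsed sequence of definitions is a plain Haskell list, so
-- functions like length, map, and filter apply without conversion.
data Def = Def String deriving (Eq, Show)

countDefs :: [Def] -> Int
countDefs = length

main :: IO ()
main = print (countDefs [Def "f", Def "g", Def "h"])  -- prints 3
\end{verbatim}
\normalsize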
\subsection{The type-correctness of LBNF rules}
\label{typecheck}
It is customary in parser generators to delegate the checking of certain
errors to the target language. For instance, a Happy source file that
Happy processes without complaints can still produce a Haskell file
that is rejected by Haskell. In the same way, the BNF Converter
delegates some checking to Happy and Haskell (for instance,
the parser conflict check). However, since it is always
easiest for the programmer to understand error messages
related to the source, the BNF Converter performs some checks of its own,
which are mostly connected with the sanity of the abstract syntax.
The type checker uses a notion of the
\textit{category skeleton} of a rule, which is a pair
\[
(C, A\ldots B)
\]
where $C$ is the unindexed left-hand-side non-terminal and $A\ldots B$
is the sequence of unindexed right-hand-side non-terminals of the rule.
In other words, the category skeleton of a rule expresses the abstract-syntax
type of the semantic action associated to that rule.
We also need the notions of
a \textit{regular category} and
a \textit{regular rule label}.
Briefly, regular labels and categories are the user-defined ones.
More formally,
a regular category is none of
{\tt[}$C${\tt]},{\tt Integer}, {\tt Double}, {\tt Char}, {\tt String}
and {\tt Ident}.
A regular rule label is none of
{\tt \_}, {\tt []}, {\tt (:)}, and {\tt (:[])}.
The type checking rules are now the following:
\bequ
\begit
\item A rule labelled by {\tt \_} must have a category skeleton of form $(C,C)$.
\item A rule labelled by {\tt []} must have a category skeleton of form $([C],)$.
\item A rule labelled by {\tt (:)} must have a category skeleton of form $([C],C[C])$.
\item A rule labelled by {\tt (:[])} must have a category skeleton of form $([C],C)$.
\item Only regular categories may have productions with regular rule labels.
\item Every regular category occurring in the grammar
must have at least one production with a regular rule label.
\item All rules with the same regular rule label must have the same
category skeleton.
\enit
\enqu
The second-last rule corresponds to the absence of empty data types in Haskell.
The last rule could
be strengthened so as to require that all regular rule labels be unique:
this is needed to guarantee error-free pretty-printing.
Violating this strengthened rule currently
generates only a warning, not a type error.
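The skeleton computation can be sketched in Haskell as follows; the names below are ours and do not reflect the BNF Converter's internals:
\small
\begin{verbatim}
-- Illustration only: category skeletons and the check for
-- "_"-labelled rules. Names (Rule, unindex, dummyOk) are ours.
import Data.Char (isDigit)

type Cat = String
data Rule = Rule { label :: String, lhs :: Cat, rhs :: [Cat] }

-- Drop a trailing integer index: "Exp2" and "Exp3" both become "Exp".
unindex :: Cat -> Cat
unindex c = reverse (dropWhile isDigit (reverse c))

-- The skeleton pairs the unindexed LHS with the unindexed RHS
-- non-terminals (terminals are assumed to be filtered out already).
skeleton :: Rule -> (Cat, [Cat])
skeleton (Rule _ l rs) = (unindex l, map unindex rs)

-- A "_"-labelled rule must have skeleton (C, C).
dummyOk :: Rule -> Bool
dummyOk r = case skeleton r of
  (c, [c']) -> c == c'
  _         -> False
\end{verbatim}
\normalsize
For example, {\tt dummyOk} accepts the rule \verb6_. Exp1 ::= Exp2 ;6 (skeleton $(Exp, Exp)$) and rejects \verb6_. Stm ::= Exp ;6.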
\section{LBNF Pragmas}
Even well-behaved languages have features that
cannot be expressed naturally in a BNF grammar.
To take care of them, while preserving the single-source nature of
the BNF Converter, we extend the notation with what we call \textit{pragmas}.
All these pragmas are completely declarative, and they are also
reflected in the generated documentation.
\subsection{Comment pragmas}
The first pragma tells what kinds of \textit{comments} the language has.
Normally we do not want comments to appear in the abstract syntax,
but treat them in the lexical analysis. The comment pragma instructs the
lexer generator (and the document generator!) to
treat certain pieces of text as comments and thus to ignore them
(except for their contribution to the position information used in
parser error messages).
The simplest solution to the comment
problem would be to use some default comments
that are hard-coded into the system, e.g.\ Haskell's comments.
But this definition can hardly be stated as a condition for a language
to be well-behaved, and we could not even
define C or Java or ML then. So we have added a comment pragma, whose
regular-expression syntax is
\bequ
"{\tt comment}" String String? "{\tt ;}"
\enqu
The first string tells how a comment begins.
The second, optional, string marks the end of a comment;
if it is not given, the comment extends to the end of the line.
For instance, to describe the Haskell comment convention,
we write the following lines in our LBNF source file:
\begin{verbatim}
comment "--" ;
comment "{-" "-}" ;
\end{verbatim}
Since comments are treated in the lexical analyzer, they must
be recognized by a finite state automaton.
This excludes the use of nested comments unless defined in
the grammar itself. Discarding nested comments is one aspect
of what we call well-behaved languages.
The length of comment end markers is restricted to two characters,
due to the complexities in the lexer caused by longer end markers.
\subsection{Internal pragmas}
Sometimes we want to include in the abstract syntax
structures that are not part of the concrete syntax,
and hence not parsable.
They can be, for instance, syntax trees that are produced by a
type-annotating type checker.
Even though they are not parsable, we may want to
pretty-print them, for instance, in the type checker's
error messages.
To define such an internal constructor, we use a pragma
\begin{verbatim}
"internal" Rule ";"
\end{verbatim}
where {\tt Rule} is a normal LBNF rule. For instance,
\begin{verbatim}
internal EVarT. Exp ::= "(" Ident ":" Type ")";
\end{verbatim}
introduces a type-annotated variant of a variable expression.
\subsection{Token pragmas}
\label{reg}
The predefined lexical types are sufficient in
most cases, but sometimes we would like to have more control over the
lexer. This is provided by \textit{token pragmas}. They use
regular expressions to define new token types.
If we, for example, want to make a finer distinction for identifiers,
a distinction between lower- and upper-case letters, we can introduce
two new token types, {\tt UIdent}\ and {\tt LIdent}, as follows.
\begin{verbatim}
token UIdent (upper (letter | digit | '\_')*) ;
token LIdent (lower (letter | digit | '\_')*) ;
\end{verbatim}
The regular expression syntax of LBNF is specified in the Appendix.
The abbreviations with strings in brackets need a word of explanation:
\begin{quote}
\verb6["abc7%"]6 denotes the union of the characters \verb6'a' 'b' 'c' '7' '%'6
\verb6{"abc7%"}6 denotes the sequence of the characters \verb6'a' 'b' 'c' '7' '%'6
\end{quote}
The atomic expressions {\tt upper}, {\tt lower}, {\tt letter}, and {\tt digit}
denote the character classes suggested by their names (letters are ISO Latin-1).
The expression {\tt char} matches any Unicode character, and
the ``epsilon'' expression {\tt eps} matches the empty string.
\subsection{Entry point pragmas}
The BNF Converter generates, by default, a parser for every category in
the grammar. This is unnecessarily rich in most cases, and makes the parser
larger than needed. If the size of the parser becomes critical,
the \textit{entry points pragma} enables the user
to define which of the parsers are actually exported:
\begin{verbatim}
entrypoints (Ident ",")* Ident ;
\end{verbatim}
For instance, the following pragma defines {\tt Stm} and {\tt Exp} to be
the only entry points:
\begin{verbatim}
entrypoints Stm, Exp ;
\end{verbatim}
\section{BNF Converter code generation}
\subsection{The files}
Given an LBNF source file {\tt Foo.cf},
the BNF Converter generates the following files:
\begin{itemize}
\item {\tt AbsFoo.hs}: The abstract syntax (Haskell source file)
\item {\tt LexFoo.x}: The lexer (Alex source file)
\item {\tt ParFoo.y}: The parser (Happy source file)
\item {\tt PrintFoo.hs}: The pretty printer (Haskell source file)
\item {\tt SkelFoo.hs}: The case skeleton (Haskell source file)
\item {\tt TestFoo.hs}: A test bench file for the parser and pretty printer (Haskell source file)
\item {\tt DocFoo.tex}: The language document (\LaTeX\ source file)
\item {\tt makefile}: A makefile for the lexer, the parser, and the document
\end{itemize}
In addition to these files, the user needs the Alex runtime file
{\tt Alex.hs} and the error monad definition file {\tt ErrM.hs}, both
included in the BNF Converter distribution.
\subsection{Example: {\tt JavaletteLight.cf}}
The following LBNF grammar defines a small C-like language,
Javalette Light\footnote{It is a fragment of the
language Javalette, used in compiler construction courses at Chalmers University.}.
% \end{doublespace}
\small
\begin{verbatim}
Fun. Prog ::= Typ Ident "(" ")" "{" [Stm] "}" ;
SDecl. Stm ::= Typ Ident ";" ;
SAss. Stm ::= Ident "=" Exp ";" ;
SIncr. Stm ::= Ident "++" ";" ;
SWhile. Stm ::= "while" "(" Exp ")" "{" [Stm] "}" ;
ELt. Exp0 ::= Exp1 "<" Exp1 ;
EPlus. Exp1 ::= Exp1 "+" Exp2 ;
ETimes. Exp2 ::= Exp2 "*" Exp3 ;
EVar. Exp3 ::= Ident ;
EInt. Exp3 ::= Integer ;
EDouble. Exp3 ::= Double ;
TInt. Typ ::= "int" ;
TDouble. Typ ::= "double" ;
[]. [Stm] ::= ;
(:). [Stm] ::= Stm [Stm] ;
-- coercions
_. Stm ::= Stm ";" ;
_. Exp ::= Exp0 ;
_. Exp0 ::= Exp1 ;
_. Exp1 ::= Exp2 ;
_. Exp2 ::= Exp3 ;
_. Exp3 ::= "(" Exp ")" ;
-- pragmas
internal ExpT. Exp ::= Typ Exp ;
comment "/*" "*/" ;
comment "//" ;
entrypoints Prog, Stm, Exp ;
\end{verbatim}
\normalsize
%\begin{doublespace}
\subsubsection{The abstract syntax {\tt AbsJavaletteLight.hs}}
The abstract syntax of Javalette generated by the
BNF Converter is essentially what a Haskell programmer would write by hand:
%\end{doublespace}
\small
\begin{verbatim}
data Prog =
Fun Typ Ident [Stm]
deriving (Eq,Show)
data Stm =
SDecl Typ Ident
| SAss Ident Exp
| SIncr Ident
| SWhile Exp [Stm]
deriving (Eq,Show)
data Exp =
ELt Exp Exp
| EPlus Exp Exp
| ETimes Exp Exp
| EVar Ident
| EInt Integer
| EDouble Double
| ExpT Typ Exp
deriving (Eq,Show)
data Typ =
TInt
| TDouble
deriving (Eq,Show)
\end{verbatim}
\normalsize
%\begin{doublespace}
\subsubsection{The lexer {\tt LexJavaletteLight.x}}
The lexer file (in Alex) consists mostly of standard rules
for literals and identifiers, but has
rules added for reserved words and symbols (i.e.\ terminals
occurring in the grammar) and for comments. Here is
a fragment with the definitions characteristic of
Javalette.
%\end{doublespace}
\small
\begin{verbatim}
{ %s = ^( | ^) | ^{ | ^} | ^; | ^= | ^+ ^+ | ^< | ^+ | ^*}
"tokens_lx"/"tokens_acts":-
<> ::= ^/^/ [.]* ^n
<> ::= ^/ ^* ([^u # ^*] | ^* [^u # ^/])* (^*)+ ^/
<> ::= ^w+
<pTSpec> ::= %s %{ pTSpec p = PT p . TS %}
<ident> ::= ^l ^i* %{ ident p = PT p . eitherResIdent TV %}
<int> ::= ^d+ %{ int p = PT p . TI %}
<double> ::= ^d+ ^. ^d+ (e (^-)? ^d+)? %{ double p = PT p . TD %}
eitherResIdent :: (String -> Tok) -> String -> Tok
eitherResIdent tv s = if isResWord s then (TS s) else (tv s) where
isResWord s = elem s ["double","int","while"]
\end{verbatim}
\normalsize
% \begin{doublespace}
The lexer file moreover defines the token type {\tt Tok} used by
the lexer and the parser.
\subsubsection{The parser {\tt ParJavaletteLight.y}}
The parser file (in Happy) has a large number of token definitions
(which we find it extremely valuable to generate automatically),
followed by parsing rules corresponding closely to the source BNF rules.
Here is a fragment containing examples of both parts:
% \end{doublespace}
\small
\begin{verbatim}
%token
'(' { PT _ (TS "(") }
')' { PT _ (TS ")") }
'double' { PT _ (TS "double") }
'int' { PT _ (TS "int") }
'while' { PT _ (TS "while") }
L_integ { PT _ (TI $$) }
L_doubl { PT _ (TD $$) }
%%
Integer : L_integ { (read $1) :: Integer }
Double : L_doubl { (read $1) :: Double }
Stm :: { Stm }
Stm : Typ Ident ';' { SDecl $1 $2 }
| Ident '=' Exp ';' { SAss $1 $3 }
| Ident '++' ';' { SIncr $1 }
| 'while' '(' Exp ')' '{' ListStm '}' { SWhile $3 (reverse $6) }
| Stm ';' { $1 }
Exp0 :: { Exp }
Exp0 : Exp1 '<' Exp1 { ELt $1 $3 }
| Exp1 { $1 }
\end{verbatim} % $
\normalsize
% \begin{doublespace}
The exported parsers have types of the following
form, for any abstract syntax type {\tt T},
\begin{verbatim}
[Tok] -> Err T
\end{verbatim}
returning either a value of type {\tt T} or an error message, using
a simple error monad. The input is a token list received from the lexer.
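For concreteness, a minimal error monad in the spirit of the distributed {\tt ErrM.hs} can be sketched as follows (the actual module may differ in details):
\small
\begin{verbatim}
-- Sketch of a minimal error monad; the distributed ErrM.hs may differ.
-- A parser of type [Tok] -> Err T yields either Ok t or Bad msg.
data Err a = Ok a | Bad String
  deriving (Eq, Show)

instance Functor Err where
  fmap f (Ok a)  = Ok (f a)
  fmap _ (Bad s) = Bad s

instance Applicative Err where
  pure = Ok
  Ok f  <*> x = fmap f x
  Bad s <*> _ = Bad s

instance Monad Err where
  return      = pure
  Ok a  >>= f = f a
  Bad s >>= _ = Bad s
\end{verbatim}
\normalsize
The monad lets error messages propagate through a chain of parsing and post-processing steps: once a {\tt Bad} value appears, the remaining steps are skipped.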
\subsubsection{The pretty-printer {\tt PrintJavaletteLight.hs}}
The pretty-printer consists of a Haskell class {\tt Print} with instances
for all generated datatypes, taking precedence into account. The class method
{\tt prt}\
generates a list of strings for a syntax tree of any type.
% \end{doublespace}
\small
\begin{verbatim}
instance Print Exp where
prt i e = case e of
ELt exp0 exp -> prPrec i 0 (concat [prt 1 exp0 , ["<"] , prt 1 exp])
EPlus exp0 exp -> prPrec i 1 (concat [prt 1 exp0 , ["+"] , prt 2 exp])
ETimes exp0 exp -> prPrec i 2 (concat [prt 2 exp0 , ["*"] , prt 3 exp])
\end{verbatim}
\normalsize
% \begin{doublespace}
The list is then put into layout (indentation, newlines) by a \textit{rendering}
function, which is generated independently of the grammar,
but written with easy modification in mind.
\subsubsection{The case skeleton {\tt SkelJavaletteLight.hs}}
The case skeleton can be used as a basis when defining the compiler
back end, e.g.\ type checker and code generator. The same skeleton is
actually also used in the pretty printer. The case branches in the skeleton
are initialized to show error messages saying that the case is undefined.
% \end{doublespace}
\small
\begin{verbatim}
transExp :: Exp -> Result
transExp x = case x of
ELt exp0 exp -> failure x
EPlus exp0 exp -> failure x
ETimes exp0 exp -> failure x
\end{verbatim}
\normalsize
%\begin{doublespace}
\subsubsection{The language document {\tt DocJavaletteLight.tex}}
We show the main parts of the generated JavaletteLight document
in a typeset form. The grammar symbols in the document are produced
by \LaTeX\ macros, with easy modification in mind.
%\end{doublespace}
\footnotesize
\begin{quote}
\section*{The lexical structure of JavaletteLight}
\subsection*{Identifiers}
Identifiers \nonterminal{Ident} are unquoted strings beginning with a letter,
followed by any combination of letters, digits, and the characters {\tt \_ '},
reserved words excluded.
\subsection*{Literals}
Integer literals \nonterminal{Int}\ are nonempty sequences of digits.
Double-precision float literals \nonterminal{Double}\ have the structure
indicated by the regular expression $\nonterminal{digit}+ \mbox{{\it `.'}} \nonterminal{digit}+ (\mbox{{\it `e'}} \mbox{{\it `-'}}? \nonterminal{digit}+)?$ i.e.\
two sequences of digits separated by a decimal point, optionally
followed by an unsigned or negative exponent.
\subsection*{Reserved words and symbols}
The set of reserved words is the set of terminals appearing in the grammar. Those reserved words that consist of non-letter characters are called symbols, and they are treated in a different way from those that are similar to identifiers. The lexer follows rules familiar from languages like Haskell, C, and Java, including longest match and spacing conventions.
The reserved words used in JavaletteLight are the following: \\
\begin{tabular}{lll}
{\reserved{double}} &{\reserved{int}} &{\reserved{while}} \\
\end{tabular}\\
The symbols used in JavaletteLight are the following: \\
\begin{tabular}{lll}
{\symb{(}} &{\symb{)}} &{\symb{ \{ }} \\
{\symb{ \} }} &{\symb{;}} &{\symb{{$=$}}} \\
{\symb{{$+$}{$+$}}} &{\symb{{$<$}}} &{\symb{{$+$}}} \\
{\symb{*}} & & \\
\end{tabular}\\
\subsection*{Comments}
Single-line comments begin with {\symb{//}}. \\Multiple-line comments are enclosed with {\symb{/*}} and {\symb{*/}}.
\section*{The syntactic structure of JavaletteLight}
Non-terminals are enclosed between $\langle$ and $\rangle$.
The symbols {\arrow} (production), {\delimit} (union)
and {\emptyP} (empty rule) belong to the BNF notation.
All other symbols are terminals.\\
\begin{tabular}{lll}
{\nonterminal{Prog}} & {\arrow} &{\nonterminal{Typ}} {\nonterminal{Ident}} {\terminal{(}} {\terminal{)}} {\terminal{ \{ }} {\nonterminal{ListStm}} {\terminal{ \} }} \\
\end{tabular}\\
\begin{tabular}{lll}
{\nonterminal{Stm}} & {\arrow} &{\nonterminal{Typ}} {\nonterminal{Ident}} {\terminal{;}} \\
& {\delimit} &{\nonterminal{Ident}} {\terminal{{$=$}}} {\nonterminal{Exp}} {\terminal{;}} \\
& {\delimit} &{\nonterminal{Ident}} {\terminal{{$+$}{$+$}}} {\terminal{;}} \\
& {\delimit} &{\terminal{while}} {\terminal{(}} {\nonterminal{Exp}} {\terminal{)}} {\terminal{ \{ }} {\nonterminal{ListStm}} {\terminal{ \} }} \\
& {\delimit} &{\nonterminal{Stm}} {\terminal{;}} \\
\end{tabular}\\
\begin{tabular}{lll}
{\nonterminal{Exp0}} & {\arrow} &{\nonterminal{Exp1}} {\terminal{{$<$}}} {\nonterminal{Exp1}} \\
& {\delimit} &{\nonterminal{Exp1}} \\
\end{tabular}\\
\begin{tabular}{lll}
{\nonterminal{Exp1}} & {\arrow} &{\nonterminal{Exp1}} {\terminal{{$+$}}} {\nonterminal{Exp2}} \\
& {\delimit} &{\nonterminal{Exp2}} \\
\end{tabular}\\
\begin{tabular}{lll}
{\nonterminal{Exp2}} & {\arrow} &{\nonterminal{Exp2}} {\terminal{*}} {\nonterminal{Exp3}} \\
& {\delimit} &{\nonterminal{Exp3}} \\
\end{tabular}\\
\begin{tabular}{lll}
{\nonterminal{Exp3}} & {\arrow} &{\nonterminal{Ident}} \\
& {\delimit} &{\nonterminal{Integer}} \\
& {\delimit} &{\nonterminal{Double}} \\
& {\delimit} &{\terminal{(}} {\nonterminal{Exp}} {\terminal{)}} \\
\end{tabular}\\
\begin{tabular}{lll}
{\nonterminal{ListStm}} & {\arrow} &{\emptyP} \\
& {\delimit} &{\nonterminal{Stm}} {\nonterminal{ListStm}} \\
\end{tabular}\\
\begin{tabular}{lll}
{\nonterminal{Exp}} & {\arrow} &{\nonterminal{Exp0}} \\
\end{tabular}\\
\begin{tabular}{lll}
{\nonterminal{Typ}} & {\arrow} &{\terminal{int}} \\
& {\delimit} &{\terminal{double}} \\
\end{tabular}\
\end{quote}
\normalsize
% \begin{doublespace}
\subsubsection{The {\tt makefile}}
The makefile is used to run Alex on the lexer, Happy on the parser, and
\LaTeX\ on the document, by simply typing {\tt make}. The {\tt make clean}
command removes the generated files.
\subsubsection{The test bench file {\tt TestJavaletteLight.hs}}
The test bench file can be loaded in the Haskell interpreter Hugs to
run the parser and the pretty-printer on terminal or file input.
The test functions display a syntax tree (or an error message)
and the pretty-printer result from the same tree.
\subsection{An optimization: left-recursive lists}
\label{leftrec}
The BNF representation of lists is right-recursive, following Haskell's
list constructor. Right-recursive rules, however, are an inefficient way of
parsing lists in an LALR parser. The smart programmer would implement a pair
of rules such as JavaletteLight's
\begin{verbatim}
[]. [Stm] ::= ;
(:). [Stm] ::= Stm [Stm] ;
\end{verbatim}
not in the direct way,
\begin{verbatim}
ListStm : {- empty -} { [] }
| Stm ListStm { (:) $1 $2 }
\end{verbatim}
but under a left-recursive transformation:
\begin{verbatim}
ListStm : {- empty -} { [] }
| ListStm Stm { flip (:) $1 $2 }
\end{verbatim}
Then the smart programmer would also be careful to reverse the list
when it is used:
\begin{verbatim}
Prog : Typ Ident '(' ')' '{' ListStm '}' { Fun $1 $2 (reverse $6) }
\end{verbatim}%$
As reported in the Happy manual, this transformation is vital
to avoid running out of stack space with long lists.
Thus we have implemented the transformation in the BNF Converter for
pairs of rules of the form
\begin{verbatim}
[]. [C] ::= ;
(:). [C] ::= C x [C] ;
\end{verbatim}
where {\tt C} is any category and {\tt x} is any sequence of
terminals (possibly empty).
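The effect of the transformed semantic actions can be checked with a small Haskell simulation (a sketch; {\tt buildList} is our own name): each reduction prepends an element with {\tt flip (:)}, so the parser's semantic value is the list in reverse, and the single {\tt reverse} at the use site restores source order.

```haskell
-- Simulates the left-recursive rule's semantic actions: each
-- reduction performs flip (:), building the list back to front.
buildList :: [a] -> [a]
buildList = foldl (flip (:)) []

main :: IO ()
main = do
  print (buildList ["s1", "s2", "s3"])            -- reversed order
  print (reverse (buildList ["s1", "s2", "s3"]))  -- source order restored
```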
There is another important parsing technique, recursive descent,
which cannot tolerate left recursion at all: a recursive-descent parser
loops infinitely on a left-recursive grammar (cf.\ e.g.\ \cite{appel}).
The question sometimes arises if, when designing a grammar,
one should take into account what method will be used for parsing it.
The view we are advocating is that the designer of the grammar
should in the first place think of the abstract syntax, and let
the parser generator perform automatic
grammar transformations that are needed by the parsing method.
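To make the recursive-descent problem concrete, here is an illustrative hand-written parser (our own sketch, not BNF Converter output) for the left-recursive rule {\tt Exp1 ::= Exp1 "+" Exp2 | Exp2}: a naive translation would call {\tt pExp1} first and loop forever, so instead one {\tt Exp2} is parsed and the trailing {\tt "+" Exp2} parts are consumed iteratively, rebuilding the left-nested tree.

```haskell
-- Our own sketch of a recursive-descent parser for
--   Exp1 ::= Exp1 "+" Exp2 | Exp2
-- A naive translation (pExp1 calling pExp1 first) would loop
-- forever; instead we parse one Exp2 and then iterate over the
-- trailing "+" Exp2 parts, rebuilding the left-nested tree.
data Exp = EPlus Exp Exp | EInt Int deriving (Show, Eq)

type Parser a = [String] -> Maybe (a, [String])

pExp2 :: Parser Exp
pExp2 (t:ts) | not (null t), all (`elem` "0123456789") t
             = Just (EInt (read t), ts)
pExp2 _      = Nothing

pExp1 :: Parser Exp
pExp1 ts = do
  (e, ts') <- pExp2 ts
  go e ts'
  where
    go acc ("+" : rest) = do
      (e2, rest') <- pExp2 rest
      go (EPlus acc e2) rest'
    go acc rest = Just (acc, rest)

main :: IO ()
main = print (pExp1 ["1", "+", "2", "+", "3"])
```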
\section{Discussion}
\subsection{Results}
\label{results}
LBNF and the BNF Converter~\cite{bnfc} were introduced as a teaching tool at the
fourth-year compiler course in Spring 2003 at Chalmers.
The goal was, on the one hand, to advocate the use of declarative
and portable language definitions, and on the other hand, to
leave more time for back-end construction in a compiler course.