Chapter 3 - Syntax Analyzer

Compiler Design
Instructor: Mohammed O.
Email: momoumer90@gmail.com
Samara University
Chapter Three
This Chapter Covers:
Syntax Analyzer
Top-Down Parsing
Predictive Parsing
Regular Expression vs Context Free Grammar
Recursive Descent Parsing
Non Recursive Predictive Parsing
Syntax Analyzer
Syntax Analyzer creates the syntactic structure of the given
source program.
This syntactic structure is mostly a parse tree.
Syntax Analyzer is also known as parser.
The syntax of a programming language is described by a
context-free grammar (CFG). We will use BNF (Backus-Naur
Form) notation in the description of CFGs.
The syntax analyzer (parser) checks whether a given
source program satisfies the rules implied by a context-free
grammar or not.
If it satisfies, the parser creates the parse tree of that
program.
Otherwise, the parser reports error messages.
Parser
A context-free grammar:
 gives a precise (accurate) syntactic specification of a
programming language.
 is designed in an initial phase of the design of a compiler.
 can be converted directly into a parser by some tools.
A parser works on a stream of tokens.
The smallest item it handles is a token.
Parsers (cont.)
We categorize the parsers into two groups:
Top-Down Parser
 the parse tree is created top to bottom, starting from the
root.
Bottom-Up Parser
 the parse tree is created bottom to top, starting from the leaves.
Both top-down and bottom-up parsers scan the input from
left to right (one symbol at a time).
Efficient top-down and bottom-up parsers can be
implemented only for sub-classes of context-free grammars.
 LL for top-down parsing
 LR for bottom-up parsing
Context-Free Grammars
CFG is a formal grammar which is used to generate all
possible strings in a given formal language.
A CFG G is defined as a 4-tuple G = (V, T, P, S), where:
 T is a finite set of terminal symbols.
 V is a finite set of non-terminal symbols.
 P is a set of production rules of the form A → α, where
A is a non-terminal and α is a string of terminals and
non-terminals (including the empty string).
 S is the start symbol (one of the non-terminal symbols).
Example:
E → E + E | E - E | E * E | E / E | - E
E → ( E )
E → id
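The 4-tuple can be written down directly as a data structure. The following is only a sketch: the class name Grammar, its field names and the validate method are assumptions made for illustration and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    """G = (V, T, P, S). Each production is (head, body), body a tuple of symbols."""
    nonterminals: frozenset   # V
    terminals: frozenset      # T
    productions: tuple        # P
    start: str                # S

    def validate(self):
        # V and T must not share symbols, and S must be a non-terminal.
        if self.nonterminals & self.terminals:
            raise ValueError("V and T must be disjoint")
        if self.start not in self.nonterminals:
            raise ValueError("start symbol must be a non-terminal")
        symbols = self.nonterminals | self.terminals
        for head, body in self.productions:
            if head not in self.nonterminals:
                raise ValueError(f"head {head!r} is not a non-terminal")
            for sym in body:  # an empty body () stands for the empty string
                if sym not in symbols:
                    raise ValueError(f"unknown symbol {sym!r}")
        return True


# The example expression grammar from the slide.
EXPR = Grammar(
    nonterminals=frozenset({"E"}),
    terminals=frozenset({"+", "-", "*", "/", "(", ")", "id"}),
    productions=(
        ("E", ("E", "+", "E")), ("E", ("E", "-", "E")),
        ("E", ("E", "*", "E")), ("E", ("E", "/", "E")),
        ("E", ("-", "E")), ("E", ("(", "E", ")")), ("E", ("id",)),
    ),
    start="E",
)
```

Calling EXPR.validate() checks that every production uses only symbols from V and T.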
Derivations
In a CFG, the start symbol is used to derive a string. We
derive the string by repeatedly replacing a non-terminal by
the right-hand side of one of its productions, until all
non-terminals have been replaced by terminal symbols.
E ⇒ E+E
E+E derives from E:
we can replace E by E+E.
To be able to do this, we must have a production rule
E → E+E in our grammar.
E ⇒ E+E ⇒ id+E ⇒ id+id

Such a sequence of replacements of non-terminal symbols is
called a derivation of id+id from E.
Derivations (Cont.)
In general, a derivation step is
αAβ ⇒ αγβ if there is a production rule A → γ in our
grammar,
where α and β are arbitrary strings of
terminal and non-terminal symbols.
α1 ⇒ α2 ⇒ ... ⇒ αn (αn derives from α1, or α1 derives αn)
⇒ : derives in one step
⇒* : derives in zero or more steps
⇒+ : derives in one or more steps
Derivation Example
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
OR
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
At each derivation step, we can choose any of the
non-terminals in the sentential form of G for the replacement.
If we always choose the left-most non-terminal in each
derivation step, the derivation is called a left-most
derivation.
If we always choose the right-most non-terminal in each
derivation step, the derivation is called a right-most
derivation.
CFG - Terminology
L(G) is the language of G (the language generated by G),
which is a set of sentences.
A sentence of L(G) is a string of terminal symbols of G.
If S is the start symbol of G, then
ω is a sentence of L(G) iff S ⇒+ ω, where ω is a string of
terminals of G.
If G is a context-free grammar, L(G) is a context-free
language.
Two grammars are equivalent if they produce the same
language.
CFG (Cont.)
S ⇒* α
- If α contains non-terminals, it is called a sentential
form of G.
- If α does not contain non-terminals, it is
called a sentence of G.
Left-Most and Right-Most Derivations
Left-Most Derivation
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
Right-Most Derivation
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
We will see that top-down parsers try to find the
left-most derivation of the given source program.
We will see that bottom-up parsers try to find the
right-most derivation of the given source program in reverse
order.
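The two derivations above can be replayed mechanically. The sketch below is an illustration only: the names GRAMMAR, derive and STEPS are assumptions, and the grammar is restricted to the productions used in this example.

```python
# Non-terminal -> list of production bodies (tuples of symbols).
GRAMMAR = {
    "E": [("E", "+", "E"), ("E", "*", "E"), ("-", "E"), ("(", "E", ")"), ("id",)],
}


def derive(start, bodies, leftmost=True):
    """Apply each chosen body to the left-most (or right-most) non-terminal
    and return every sentential form produced along the way."""
    form = [start]
    forms = ["".join(form)]
    for body in bodies:
        positions = [i for i, sym in enumerate(form) if sym in GRAMMAR]
        if not positions:
            raise ValueError("no non-terminal left to replace")
        pos = positions[0] if leftmost else positions[-1]
        if body not in GRAMMAR[form[pos]]:
            raise ValueError(f"{form[pos]} -> {' '.join(body)} is not a production")
        form[pos:pos + 1] = body  # replace the non-terminal by the body
        forms.append("".join(form))
    return forms


# The same production choices, applied left-most and right-most.
STEPS = [("-", "E"), ("(", "E", ")"), ("E", "+", "E"), ("id",), ("id",)]
left = derive("E", STEPS, leftmost=True)
right = derive("E", STEPS, leftmost=False)
print(" => ".join(left))
print(" => ".join(right))
```

Both calls end in the same sentence -(id+id); they differ only in the fourth sentential form, -(id+E) versus -(E+id).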
Parse Tree
Inner nodes of a parse tree are non-terminal symbols.
The leaves of a parse tree are terminal symbols.
A parse tree can be seen as a graphical representation of a
derivation.
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
[Figure: the parse tree grows with each derivation step above, from the root E to the final tree whose leaves read - ( id + id ).]
Ambiguity
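A grammar that produces more than one parse tree for some
sentence is called an ambiguous grammar.
E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
[Parse tree 1: the root is E + E; the left E is id and the right E expands to E * E.]
E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
[Parse tree 2: the root is E * E; the left E expands to E + E and the right E is id.]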
Ambiguity (cont.)
For most parsers, the grammar must be unambiguous.
An unambiguous grammar gives a unique parse tree for every sentence.
We should eliminate the ambiguity in the grammar during
the design phase of the compiler, by writing an equivalent
unambiguous grammar.
To disambiguate, we choose the parse tree we prefer for each
sentence generated by the ambiguous grammar and rewrite
the grammar so that only this choice remains.
Ambiguity (cont.)
stmt → if expr then stmt |
if expr then stmt else stmt | otherstmts
if E1 then if E2 then S1 else S2
This statement has two parse trees: the else can be attached to either if.

Ambiguity – Operator Precedence
Ambiguous grammars (because of ambiguous operators)
can be disambiguated according to precedence and
associativity rules.
E → E+E | E*E | E^E | id | (E)
⇓ disambiguate the grammar
precedence (highest first):
^ (right to left)
* (left to right)
+ (left to right)
E → E+T | T
T → T*F | F
F → G^F | G
G → id | (E)
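One way to check the effect of the rewriting is to count the parse trees of id+id*id under both grammars. The sketch below is not a parser from the slides: count_parse_trees is a hypothetical helper, and it assumes a grammar without ε-productions and without cycles of unit productions.

```python
from functools import lru_cache


def count_parse_trees(grammar, start, tokens):
    """Count the parse trees of `tokens` (assumes no empty bodies, no unit cycles)."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count_symbol(sym, i, j):
        # Number of ways `sym` can derive exactly tokens[i:j].
        if sym not in grammar:  # terminal symbol
            return 1 if j == i + 1 and tokens[i] == sym else 0
        return sum(count_seq(body, i, j) for body in grammar[sym])

    @lru_cache(maxsize=None)
    def count_seq(body, i, j):
        # Number of ways the symbols in `body` can split tokens[i:j] between them.
        head, rest = body[0], body[1:]
        if not rest:
            return count_symbol(head, i, j)
        total = 0
        # head takes tokens[i:k]; each remaining symbol needs at least one token.
        for k in range(i + 1, j - len(rest) + 1):
            left = count_symbol(head, i, k)
            if left:
                total += left * count_seq(rest, k, j)
        return total

    return count_symbol(start, 0, len(tokens))


AMBIGUOUS = {
    "E": [("E", "+", "E"), ("E", "*", "E"), ("E", "^", "E"), ("id",), ("(", "E", ")")],
}
UNAMBIGUOUS = {
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("G", "^", "F"), ("G",)],
    "G": [("id",), ("(", "E", ")")],
}
SENTENCE = ["id", "+", "id", "*", "id"]
print(count_parse_trees(AMBIGUOUS, "E", SENTENCE))    # 2
print(count_parse_trees(UNAMBIGUOUS, "E", SENTENCE))  # 1
```

The ambiguous grammar gives two trees for the sentence, while the rewritten grammar gives exactly one, the tree in which * binds tighter than +.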
Left Recursion
A grammar is left-recursive if it has a non-terminal A
such that there is a derivation
A ⇒+ Aα for some string α.
Top-down parsing techniques cannot handle left-recursive
grammars.
So, we have to convert our left-recursive grammar into an
equivalent grammar which is not left-recursive.
The left-recursion may appear in a single step of the
derivation (immediate left-recursion), or may appear in
more than one step of the derivation.
Immediate Left-Recursion
A → Aα | β where β does not start with A
⇓ eliminate immediate left recursion
A → β A’
A’ → α A’ | ε (an equivalent grammar)
In general,
A → Aα1 | ... | Aαm | β1 | ... | βn where β1 ... βn do not start
with A
⇓ eliminate immediate left recursion
A → β1 A’ | ... | βn A’
A’ → α1 A’ | ... | αm A’ | ε (an equivalent grammar)
Immediate Left-Recursion -- Example
E → E+T | T
T → T*F | F
F → id | (E)
⇓ eliminate immediate left recursion
E → T E’
E’ → +T E’ | ε
T → F T’
T’ → *F T’ | ε
F → id | (E)
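The rewriting rule for immediate left recursion can be sketched as a small function. The name eliminate_immediate_left_recursion and the representation (bodies as tuples of symbols, the empty tuple standing for ε) are assumptions for illustration; the sketch also assumes that no production has the form A → A.

```python
EPSILON = ()  # the empty body stands for the empty string


def eliminate_immediate_left_recursion(head, bodies):
    """Rewrite A -> A a1 | ... | A am | b1 | ... | bn as
    A -> b1 A' | ... | bn A'  and  A' -> a1 A' | ... | am A' | epsilon."""
    # The alpha parts: bodies that start with A, with the leading A removed.
    recursive = [body[1:] for body in bodies if body[:1] == (head,)]
    # The beta parts: bodies that do not start with A.
    others = [body for body in bodies if body[:1] != (head,)]
    if not recursive:
        return {head: list(bodies)}  # nothing to do
    new_head = head + "'"
    return {
        head: [beta + (new_head,) for beta in others],
        new_head: [alpha + (new_head,) for alpha in recursive] + [EPSILON],
    }


# The example grammar from the slide.
print(eliminate_immediate_left_recursion("E", [("E", "+", "T"), ("T",)]))
print(eliminate_immediate_left_recursion("T", [("T", "*", "F"), ("F",)]))
print(eliminate_immediate_left_recursion("F", [("id",), ("(", "E", ")")]))
```

Applied to E, T and F in turn, this reproduces the grammar shown in the example, with F left unchanged.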
Left-Recursion -- Problem
A grammar may not be immediately left-recursive and
still be left-recursive.
By eliminating only the immediate left-recursion, we may
not get a grammar which is free of left-recursion.
S → Aa | b
A → Sc | d
This grammar is not immediately left-recursive, but it is still left-recursive:
S ⇒ Aa ⇒ Sca or
A ⇒ Sc ⇒ Aac causes a left-recursion.
So, we have to eliminate all left-recursion from our
grammar.
Eliminate Left-Recursion -- Example
S → Aa | b
A → Ac | Sd | f
- Order of non-terminals: S, A
for S:
- there is no earlier non-terminal to substitute, so we do not enter the inner loop.
- there is no immediate left recursion in S.
for A:
- Replace A → Sd with A → Aad | bd
So, we will have A → Ac | Aad | bd | f
- Eliminate the immediate left-recursion in A
A → bdA’ | fA’
A’ → cA’ | adA’ | ε
Cont.
So, the resulting equivalent grammar which is not
left-recursive is:
S → Aa | b
A → bdA’ | fA’
A’ → cA’ | adA’ | ε
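The steps of this example follow the usual textbook procedure: take the non-terminals in a fixed order, substitute the productions of earlier non-terminals into later ones (the inner loop), then remove immediate left recursion. The sketch below is one possible rendering of that procedure; the function name and the data layout are assumptions, and it presumes a grammar without cycles and without ε-productions in the input.

```python
def eliminate_left_recursion(grammar, order):
    """grammar maps each non-terminal to a list of bodies (tuples of symbols);
    order lists the non-terminals in the order they are processed."""
    result = {head: list(bodies) for head, bodies in grammar.items()}
    for i, a_i in enumerate(order):
        # Inner loop: replace A_i -> A_j gamma by A_i -> delta gamma
        # for every current production A_j -> delta, with A_j earlier in the order.
        for a_j in order[:i]:
            expanded = []
            for body in result[a_i]:
                if body[:1] == (a_j,):
                    expanded.extend(delta + body[1:] for delta in result[a_j])
                else:
                    expanded.append(body)
            result[a_i] = expanded
        # Eliminate the immediate left recursion among the A_i productions.
        recursive = [b[1:] for b in result[a_i] if b[:1] == (a_i,)]
        if recursive:
            others = [b for b in result[a_i] if b[:1] != (a_i,)]
            new_head = a_i + "'"
            result[a_i] = [b + (new_head,) for b in others]
            result[new_head] = [a + (new_head,) for a in recursive] + [()]
    return result


# The example from the slide: S -> Aa | b, A -> Ac | Sd | f, order S, A.
GRAMMAR = {
    "S": [("A", "a"), ("b",)],
    "A": [("A", "c"), ("S", "d"), ("f",)],
}
RESULT = eliminate_left_recursion(GRAMMAR, ["S", "A"])
for head, bodies in RESULT.items():
    print(head, "->", " | ".join("".join(b) or "ε" for b in bodies))
```

With the order S, A this yields the same three rules as the slide. A different order of non-terminals can give a different, but still equivalent, grammar.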
Left-Factoring
A predictive parser (a top-down parser without
backtracking) requires the grammar to be left-factored.
grammar → a new equivalent grammar suitable for
predictive parsing
stmt → if expr then stmt else stmt |
if expr then stmt
When we see if, we cannot know which production rule to
choose to rewrite stmt in the derivation.
Left-Factoring (cont.)
In general,
A → αβ1 | αβ2 where α is non-empty and the first
symbols of β1 and β2 (if they have one) are different.
When processing α, we cannot know whether to expand
A to αβ1 or
A to αβ2.
But if we rewrite the grammar as follows
A → αA’
A’ → β1 | β2
we can immediately expand A to αA’.
Left-Factoring -- Algorithm
For each non-terminal A with two or more alternatives
(production rules) with a common non-empty prefix α,
say
A → αβ1 | ... | αβn | γ1 | ... | γm
convert it into
A → αA’ | γ1 | ... | γm
A’ → β1 | ... | βn
Left-Factoring – Example1
A → abB | aB | cdg | cdeB | cdfB
⇓
A → aA’ | cdg | cdeB | cdfB
A’ → bB | B
⇓
A → aA’ | cdA’’
A’ → bB | B
A’’ → g | eB | fB
Left-Factoring – Example2
A → ad | a | ab | abc | b
⇓
A → aA’ | b
A’ → d | ε | b | bc
⇓
A → aA’ | b
A’ → d | ε | bA’’
A’’ → ε | c
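The left-factoring algorithm can be sketched as follows. This is an illustration under assumptions: left_factor and common_prefix are hypothetical names, alternatives are grouped by their first symbol and factored on the longest prefix the whole group shares, and new non-terminals are named by adding primes to the original name.

```python
from collections import deque


def common_prefix(bodies):
    """Longest sequence of symbols shared at the start of all bodies."""
    prefix = []
    for symbols in zip(*bodies):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return tuple(prefix)


def left_factor(head, bodies):
    """Left-factor the alternatives of `head`; bodies are tuples of symbols,
    and the empty tuple stands for the empty string."""
    grammar = {}
    primes = 0
    queue = deque([(head, list(bodies))])
    while queue:
        name, alts = queue.popleft()
        groups = {}  # first symbol -> alternatives, in order of first appearance
        for alt in alts:
            groups.setdefault(alt[:1], []).append(alt)
        new_alts = []
        for first, group in groups.items():
            if first == () or len(group) == 1:
                new_alts.extend(group)  # nothing to factor
                continue
            prefix = common_prefix(group)
            primes += 1
            new_name = head + "'" * primes  # A', A'', ...
            new_alts.append(prefix + (new_name,))
            # The new non-terminal derives what followed the common prefix.
            queue.append((new_name, [alt[len(prefix):] for alt in group]))
        grammar[name] = new_alts
    return grammar


# Example 1 and Example 2 from the slides, one character per symbol.
EXAMPLE1 = left_factor("A", [tuple(s) for s in ["abB", "aB", "cdg", "cdeB", "cdfB"]])
EXAMPLE2 = left_factor("A", [tuple(s) for s in ["ad", "a", "ab", "abc", "b"]])
for result in (EXAMPLE1, EXAMPLE2):
    for name, alts in result.items():
        print(name, "->", " | ".join("".join(a) or "ε" for a in alts))
```

For both examples the output matches the final grammars above. Newly created non-terminals are placed on a queue and factored in turn, which is how A’’ arises from A’ in Example 2.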
