Syntax Analysis

SYNTAX
ANALYSIS
Syntax Analyzer
 Syntax Analyzer creates the syntactic structure of the given
source program.
 This syntactic structure - parse tree.

 Syntax Analyzer is also known as parser.
 The syntax analyzer (parser) checks whether a given source

program satisfies the rules implied by a context-free grammar
or not.
 If it satisfies, the parser creates the parse tree of that program.
 Otherwise the parser gives the error messages.

INTRODUCTION
 Every programming language has precise rules that prescribe the syntactic
structure of well-formed programs.
 Program is made up of functions, a function out of declarations and statements, a

statement out of expressions
 The syntax of programming language constructs can be specified by context-

free grammars
 A context-free grammar
 gives a precise syntactic specification of a programming language.
 the design of the grammar is an initial phase of the design of a

compiler.
 a grammar can be directly converted into a parser by some tools.

Parser
• Parser works on a stream of tokens.
• The smallest item is a token.
source Lexical token parse tree

program Parser
Analyzer get next token
Parsing
 Parsing is the process of determining whether a string of tokens can be
generated by a grammar.
 Parsing methods
 The top-down
 Bottom-up methods.
 Top-down parsing, construction starts at the root and proceeds to the

leaves.
 Bottom-up parsing, construction starts at the leaves and proceeds towards
the root.
 Top-down parsers are easy to build by hand.
 Bottom-up parsing,
 Can handle a larger class of grammars.
 They are not as easy to build, but tools for generating them directly from a grammar are available.
 Both top-down and bottom-up parsers scan the input from left to right (one symbol at a time).
Top- Down Parsing
 Done by starting with the root, labeled with the starting nonterminal stmt,
and repeatedly performing the following two steps.
 At node N, labeled with nonterminal A, select one of the productions for A and
construct children at N for the symbols in the production body.
 Find the next node at which a subtree is to be constructed, typically the leftmost
unexpanded nonterminal of the tree.
 The current terminal being scanned in the input is frequently referred to as

the lookahead symbol.
Top- Down Parsing
Top- Down Parsing
Top- Down Parsing
Top-Down Parsing
 Top-Down Parsing is an attempt to find a left-most
derivation for an input string
 Example:
S  cAd Find a derivation for
A  ab | a for w  cad
S S Backtrack S
/|\  /|\  /|\
cAd cAd cAd
/ \ |
a b a
Predictive Parsing
 Recursive-descent parsing is a top-down method of syntax analysis in

which a set of recursive procedures is used to process the input.
 Simple form of recursive descent – Predictive Parsing

Syntax Error Handling
 Goals in error handling
 Report the presence of errors clearly and accurately.
 Recover from each error quickly enough to detect subsequent errors.
 Add minimal overhead to the processing of correct programs.

Error-Recovery Strategies
 The simplest approach is for the parser to quit with an informative error
message when it detects the first error.
 Panic-mode recovery
 Phrase-level recovery
 Error-productions
 Global-correction.
Panic-Mode Recovery
 The parser discards input symbols one at a time until one of a designated set of
synchronizing tokens is found.
 The synchronizing tokens are usually delimiters, such as ; or }.
 Skips a considerable amount of input without checking for additional errors
 It has the advantage of simplicity, and is guaranteed not to go into an infinite

loop.
Phrase-Level Recovery
 Perform local correction on the remaining input;
 It may replace a prefix of the remaining input by some string that allows the
parser to continue.
 A typical local correction is to replace a comma by a semicolon.
 Delete an extraneous semicolon.
 Insert a missing semicolon.
 Disadvantage in coping with situations in which the actual error has occurred
before the point of detection.
Error Productions
 Expand the grammar for the language at hand with productions that generate the
erroneous constructs.
 The parser can then generate appropriate error diagnostics about the erroneous
construct that has been recognized in the input.
Global Correction
 Compiler to make as few changes as possible in processing an incorrect input
string.
 Given an incorrect input string x and grammar G, algorithms will find a parse
tree for a related string y, such that the number of insertions, deletions, and
changes of tokens required to transform x into y is as small as possible.
 Not implemented.
Syntax Definition
 A grammar describes the hierarchical structure of programming language constructs.
 Eg: if ( expression ) statement else statement
 An if-else statement is the concatenation of the keyword if, an opening parenthesis, an

expression, a closing parenthesis, a statement, the keyword else, and another statement.
 Stmt -> if ( expr ) stmt else stmt
 Rule is called a production.
 In a production, lexical elements if and the parentheses are called terminals.
 Variables like expr and stmt are called nonterminals.

A Context Free Grammar
 A context-free grammar has four components:
 A set of terminal symbols, sometimes referred to as "tokens.“
 A set of nonterminals, sometimes called "syntactic variables."
 A set of productions, where each production consists of a nonterminal,called the head or

left side of the production, an arrow, and a sequence of terminals and/or nonterminals ,
called the body or right side of the production
 A designation of one of the nonterminals as the start symbol.

A Context Free Grammar
The terminal symbols are

Notational Conventions
These symbols are terminals:
 Lowercase letters early in the alphabet, such as a, b, c.
 Operator symbols such as +, *, and so on.
 Punctuation symbols such as parentheses, comma, and so on.
 The digits 0, 1, . . . , 9.
 Boldface strings such as id or if, each of which represents a single terminal

symbol.
These symbols are nonterminals:
 Uppercase letters early in the alphabet, such as A, B, C.
 The letter s, which, when it appears, is usually the start symbol.
 Lowercase, italic names such as expr or stmt.
 Uppercase letters may be used t o represent nonterminals for the constructs.

For example, nonterminals for expressions, terms, and factors are often
represented by E, T, and F, respectively.
 Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar
symbols; that is, either nonterminals or terminals.
 Lowercase letters late in the alphabet , chiefly u, v, ... ,z, represent (possibly
empty) strings of terminals.
 Lowercase greek letters,α, β, γ for example, represent (possibly empty) strings

of grammar symbols.
 A set of productions a -> α 1 , a -> α2, ... , a -> α k with a common head
 A (call them a-productions) , may be written A -> α 1 I α 2 I . , . I α k · call α1 ,

α2 , ... ,αk the alternatives for A.
 Unless stated otherwise, the head of the first production is the start symbol
Derivations
 E  E+E : E+E derives from E
 E  E+E  id+E  id+id
 A sequence of replacements of non-terminal symbols is called a derivation

of id+id from E.
 A   if there is a production rule A in our grammar and  and
 are arbitrary strings of terminal and non-terminal symbols
1  2  ...  n (n derives from 1 or 1 derives n )
 : derives in one step


*
: derives in zero or more steps
+
 : derives in one or more steps
CFG - Terminology
 L(G) is the language of G (the language generated by G) which is a set of
sentences.
 A sentence of L(G) is a string of terminal symbols of G.
 If S is the start symbol of G then

 is a sentence of L(G) iff S   where  is a string of terminals of G
*
 If G is a context-free grammar, L(G) is a context-free language.
 Two grammars are equivalent if they produce the same language.
 S - If  contains non-terminals, it is called as a sentential form of G.

*
- If  does not contain non-terminals, it is called as a sentence of G.
Derivation Example
 E  -E  -(E)  -(E+E)  -(id+E)  -(id+id)
OR
 E  -E  -(E)  -(E+E)  -(E+id)  -(id+id)
 At each derivation step, we can choose any of the non-terminal in the

sentential form of G for the replacement.
 If we always choose the left-most non-terminal in each derivation

step, this derivation is called as left-most derivation.
 If we always choose the right-most non-terminal in each derivation

step, this derivation is called as right-most derivation.
Left-Most and Right-Most Derivations
Left-Most Derivation
E  -E lm-(E)  -(E+E)
lm
 -(id+E)
lm
 -(id+id)
lm
lm
Right-Most Derivation
Erm -E 
rm -(E)  rm
-(E+E)  -(E+id)
rm  -(id+id)
rm
 We will see that the top-down parsers try to find the left-most derivation of the
given source program.
 We will see that the bottom-up parsers try to find the right-most derivation of
the given source program in the reverse order.
Parse Trees and Derivations
 A parse tree is a graphical representation of a derivation that filters out the
order in which productions are applied to replace nonterminals.
 The interior node is labeled with the nonterminal A in the head of the
production;
 The children of the node are labeled, from left to right, by the symbols in the
body of the production
 The leaves of a parse tree are labeled by nonterminals or terminals
 Read from left to right, constitute a sentential form, called the yield or frontier
of the tree.
 There is a many-to-one relationship between derivations and parse trees.
Ambiguity
 a grammar that produces more than one parse tree for some sentence is said
to be ambiguous
1
2
3
4
f
Writing a Grammar
 Grammars are capable of describing most, of the syntax of programming
languages .
 Grammar should be unambiguous.
 Left-recursion elimination and left factoring - are useful for rewriting
grammars .
 From the resulting grammar we can create top down parsers without
backtracking.
 Such parsers are called predictive parsers or recursive-descent parser
Eliminating Ambiguity
 ambiguous grammar can be rewritten to eliminate the ambiguity.

 stmt -> if expr then stmt
|if expr then stmt else stmt
|other
 is ambiguous since the string
 if E1 then if E2 then S1 else S2 has the two parse trees
Two parse trees for an ambiguous sentence
Eliminating Ambiguity
 The general rule is, "Match each else with the closest unmatched then."
Left Recursion
 A grammar is left recursive if it has a non-terminal A such that there is a
derivation.
A  A* for some string 
 Top-down parsing techniques cannot handle left-recursive grammars.
 The left-recursion may appear in a single step of the derivation (immediate left-
recursion), or may appear in more than one step of the derivation.
Immediate Left-Recursion
AA|  where  does not start with A
 eliminate immediate left recursion

A   A’
A’   A’ | 
In general,
A  A 1 | ... | A m | 1 | ... | n where 1 ... n do not start with A
 eliminate immediate left recursion

A  1 A’ | ... | n A’
A’  1 A’ | ... | m A’ |  an equivalent grammar
Left-Recursion -- Problem
• A grammar cannot be immediately left-recursive, but it still can

be left-recursive.
S  Aa | b
A  Sc | d
S  Aa  Sca
A  Sc  Aac causes to a left-recursion
Eliminate Left-Recursion -- Algorithm
- Arrange non-terminals in some order: A1 ... An
- for i from 1 to n do {
- for j from 1 to i-1 do {
replace each production
Ai  Aj 
by
Ai  1  | ... | k 
where Aj  1 | ... | k
}
- eliminate immediate left-recursions among Ai productions
}
Eliminate Left-Recursion
S  Aa | b
A  Ac | Sd | f
- Order of non-terminals: S, A
- A  Ac | Aad | bd | f
- Eliminate the immediate left-recursion in A
A  bdA’ | fA’
A’  cA’ | adA’ | 
So, the resulting equivalent grammar which is not left-recursive is:
S  Aa | b
A  bdA’ | fA’
A’  cA’ | adA’ | 
Eliminate Left-Recursion – Example2
S  Aa | b
A  Ac | Sd | f
- Order of non-terminals: A, S
- Eliminate the immediate left-recursion in A
A  SdA’ | fA’
A’  cA’ | 
- Replace S  Aa with S  SdA’a | fA’a
- Eliminate the immediate left-recursion in S
S  fA’aS’ | bS’
S’  dA’aS’ | 
So, the resulting equivalent grammar which is not left-recursive is:
S  fA’aS’ | bS’
S’  dA’aS’ | 
A  SdA’ | fA’
A’  cA’ | 
Left-Recursive Grammars III
 Here is an example of a (directly) left-recursive grammar:
EE+T|T
TT*F|F
F  ( E ) | id
 This grammar can be re-written as the following non left-

recursive grammar:
E  T E’ E’  + TE’ | є
T  F T’ T’  * F T’ | є
F  (E) | id
Left Factoring
 Left factoring is a grammar transformation that is useful for
producing a grammar suitable for predictive, or top-down,
parsing.
 Stmt -> if expr then stmt else stmt
|if expr then stmt
 A ->α 1 | α 2
 So it should be left factored as
Left-Factoring -- Algorithm
 For each non-terminal A with two or more alternatives (production rules)
with a common non-empty prefix
A  1 | ... | n | 1 | ... | m
convert it into
A  A’ | 1 | ... | m
A’  1 | ... | n
Left-Factoring – Example1
A  abB | aB | cdg | cdeB | cdfB

A  aA’ | cdg | cdeB | cdfB
A’  bB | B

A  aA’ | cdA’’
A’  bB | B
A’’  g | eB | fB
Left-Factoring – Example2
A  ad | a | ab | abc | b

A  aA’ | b
A’  d |  | b | bc

A  aA’ | b
A’  d |  | bA’’
A’’   | c
Top-Down Parsing
 The parse tree is created top to bottom.
 Top-down parser
 Recursive-descent parsing
 Backtracking is needed
 It is a general parsing technique, but not widely used.
 Not efficient
 Predictive parsing
 No backtracking
 Efficient
 Needs a special form of grammars - (LL(1) grammars).
 Recursive predictive parsing is a special form of recursive descent parsing without
backtracking.
 Non-recursive (table driven) predictive parser is also known as LL(1) parser.
Recursive Predictive Parsing
Each non-terminal corresponds to a procedure.
Ex: A  aBb
proc A {
- match the current token with a, and move to the next
token;
- call ‘B’;
- match the current token with b, and move to the next
token;
}
Recursive Predictive Parsing (cont.)
A  aBb | bAB
proc A {
case of the current token {
‘a’: - match the current token with a, and move to the next token;
- call ‘B’;
- match the current token with b, and move to the next token;
‘b’: - match the current token with b, and move to the next token;
- call ‘A’;
- call ‘B’;
}
}
Top-down parse for id + id * id
FIRST and FOLLOW
 FIRST and FOLLOW allow us to choose which production toapply, based on the
next input symbol.
 FIRST(α), where α is any string of grammar symbols, to be the set of terminals that
begin strings derived from α.
 If α => ε, then ε is also in FIRST(α) .
 A => cY, so c is in FIRST(A)
 FOLLOW(A) is the set of the terminals which occur immediately after (follow) the
non-terminal A in the strings derived from the starting symbol.
 a terminal a is in FOLLOW(A) if S  Aa
 $ is in FOLLOW(A) if S  A
*
*
FIRST
1. If X is a terminal, then FIRST(X) = {X}.
2. If X is a nonterminal and X -> YI Y2 ... Yk is a production for some k>=1, then

place a in FIRST(X) if for some i, a is in FIRST(Yi), and ε is in all of
FIRST(YI), ... ,FIRST(Yi-I); that * is , YI Y2 ... Yi-1 => ε. If ε is in FIRST (Yj) for
all j = 1, 2,... ,k, then add ε to FIRST (X). For example, everything in FIRST(Y1)
is surely in FIRST(X) . If Yi does not derive ε then we add nothing more to
FIRST (X) , but if Y1 => ε, then we add FIRST(Y2), and so on.
3. 3. If X => ε is a production, then add ε to FIRST (X).

FOLLOW
LL ( 1 ) Grammars
 L: scanning the input from left to right
 L: producing a leftmost derivation
 1 : one input symbol of lookahead at each step
 A grammar G is LL(1) if and only if whenever A -> α | β are two distinct

productions of G, the following conditions hold:
Construction of a predictive parsing
table.

Syntax Analysis

Uploaded by

Copyright:

Available Formats

Syntax Analysis

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Syntax Analysis

Uploaded by

Copyright:

Available Formats

SYNTAX

 This syntactic structure - parse tree.

 The syntax analyzer (parser) checks whether a given source

 Otherwise the parser gives the error messages.

 Program is made up of functions, a function out of declarations and statements, a

 The syntax of programming language constructs can be specified by context-

 gives a precise syntactic specification of a programming language.

 the design of the grammar is an initial phase of the design of a

 a grammar can be directly converted into a parser by some tools.

• The smallest item is a token.

source Lexical token parse tree

 Top-down parsing, construction starts at the root and proceeds to the

 The current terminal being scanned in the input is frequently referred to as

 Recursive-descent parsing is a top-down method of syntax analysis in

 Simple form of recursive descent – Predictive Parsing

 Report the presence of errors clearly and accurately.

 Recover from each error quickly enough to detect subsequent errors.

 Add minimal overhead to the processing of correct programs.

 The synchronizing tokens are usually delimiters, such as ; or }.

 Skips a considerable amount of input without checking for additional errors

 It has the advantage of simplicity, and is guaranteed not to go into an infinite

 Perform local correction on the remaining input;

 A typical local correction is to replace a comma by a semicolon.

 Delete an extraneous semicolon.

 Insert a missing semicolon.

 Eg: if ( expression ) statement else statement

 An if-else statement is the concatenation of the keyword if, an opening parenthesis, an

 Stmt -> if ( expr ) stmt else stmt

 Rule is called a production.

 In a production, lexical elements if and the parentheses are called terminals.

 Variables like expr and stmt are called nonterminals.

 A set of terminal symbols, sometimes referred to as "tokens.“

 A set of nonterminals, sometimes called "syntactic variables."

 A set of productions, where each production consists of a nonterminal,called the head or

 A designation of one of the nonterminals as the start symbol.

The terminal symbols are

 Lowercase letters early in the alphabet, such as a, b, c.

 Operator symbols such as +, *, and so on.

 Punctuation symbols such as parentheses, comma, and so on.

 Boldface strings such as id or if, each of which represents a single terminal

 Uppercase letters early in the alphabet, such as A, B, C.

 The letter s, which, when it appears, is usually the start symbol.

 Lowercase, italic names such as expr or stmt.

 Uppercase letters may be used t o represent nonterminals for the constructs.

 Lowercase greek letters,α, β, γ for example, represent (possibly empty) strings

 A (call them a-productions) , may be written A -> α 1 I α 2 I . , . I α k · call α1 ,

 E  E+E  id+E  id+id

 A sequence of replacements of non-terminal symbols is called a derivation

1  2  ...  n (n derives from 1 or 1 derives n )

 : derives in one step

 A sentence of L(G) is a string of terminal symbols of G.

 If S is the start symbol of G then

 Two grammars are equivalent if they produce the same language.

 S - If  contains non-terminals, it is called as a sentential form of G.

 E  -E  -(E)  -(E+E)  -(E+id)  -(id+id)

 At each derivation step, we can choose any of the non-terminal in the

 If we always choose the left-most non-terminal in each derivation

 If we always choose the right-most non-terminal in each derivation

 ambiguous grammar can be rewritten to eliminate the ambiguity.