Nothing Special   »   [go: up one dir, main page]

Syntax Analysis

Download as pdf or txt
Download as pdf or txt
You are on page 1of 58

SYNTAX

ANALYSIS
Syntax Analyzer
 Syntax Analyzer creates the syntactic structure of the given
source program.

 This syntactic structure - parse tree.


 Syntax Analyzer is also known as parser.

 The syntax analyzer (parser) checks whether a given source


program satisfies the rules implied by a context-free grammar
or not.
 If it satisfies, the parser creates the parse tree of that program.

 Otherwise the parser gives the error messages.


INTRODUCTION
 Every programming language has precise rules that prescribe the syntactic
structure of well-formed programs.

 Program is made up of functions, a function out of declarations and statements, a


statement out of expressions

 The syntax of programming language constructs can be specified by context-


free grammars

 A context-free grammar

 gives a precise syntactic specification of a programming language.

 the design of the grammar is an initial phase of the design of a


compiler.

 a grammar can be directly converted into a parser by some tools.


Parser
• Parser works on a stream of tokens.

• The smallest item is a token.

source Lexical token parse tree


program Parser
Analyzer get next token
Parsing
 Parsing is the process of determining whether a string of tokens can be
generated by a grammar.
 Parsing methods
 The top-down

 Bottom-up methods.

 Top-down parsing, construction starts at the root and proceeds to the


leaves.
 Bottom-up parsing, construction starts at the leaves and proceeds towards
the root.
 Top-down parsers are easy to build by hand.
 Bottom-up parsing,
 Can handle a larger class of grammars.

 They are not as easy to build, but tools for generating them directly from a grammar are available.

 Both top-down and bottom-up parsers scan the input from left to right (one symbol at a time).
Top- Down Parsing
 Done by starting with the root, labeled with the starting nonterminal stmt,
and repeatedly performing the following two steps.

 At node N, labeled with nonterminal A, select one of the productions for A and
construct children at N for the symbols in the production body.

 Find the next node at which a subtree is to be constructed, typically the leftmost
unexpanded nonterminal of the tree.

 The current terminal being scanned in the input is frequently referred to as


the lookahead symbol.
Top- Down Parsing
Top- Down Parsing
Top- Down Parsing
Top-Down Parsing
 Top-Down Parsing is an attempt to find a left-most
derivation for an input string
 Example:
S  cAd Find a derivation for
A  ab | a for w  cad

S S Backtrack S
/|\  /|\  /|\
cAd cAd cAd
/ \ |
a b a
Predictive Parsing

 Recursive-descent parsing is a top-down method of syntax analysis in


which a set of recursive procedures is used to process the input.

 Simple form of recursive descent – Predictive Parsing


Syntax Error Handling
 Goals in error handling

 Report the presence of errors clearly and accurately.

 Recover from each error quickly enough to detect subsequent errors.

 Add minimal overhead to the processing of correct programs.


Error-Recovery Strategies
 The simplest approach is for the parser to quit with an informative error
message when it detects the first error.

 Panic-mode recovery

 Phrase-level recovery

 Error-productions

 Global-correction.
Panic-Mode Recovery
 The parser discards input symbols one at a time until one of a designated set of
synchronizing tokens is found.

 The synchronizing tokens are usually delimiters, such as ; or }.

 Skips a considerable amount of input without checking for additional errors

 It has the advantage of simplicity, and is guaranteed not to go into an infinite


loop.
Phrase-Level Recovery

 Perform local correction on the remaining input;

 It may replace a prefix of the remaining input by some string that allows the
parser to continue.

 A typical local correction is to replace a comma by a semicolon.

 Delete an extraneous semicolon.

 Insert a missing semicolon.

 Disadvantage in coping with situations in which the actual error has occurred
before the point of detection.
Error Productions

 Expand the grammar for the language at hand with productions that generate the
erroneous constructs.

 The parser can then generate appropriate error diagnostics about the erroneous
construct that has been recognized in the input.
Global Correction
 Compiler to make as few changes as possible in processing an incorrect input
string.

 Given an incorrect input string x and grammar G, algorithms will find a parse
tree for a related string y, such that the number of insertions, deletions, and
changes of tokens required to transform x into y is as small as possible.

 Not implemented.
Syntax Definition
 A grammar describes the hierarchical structure of programming language constructs.

 Eg: if ( expression ) statement else statement

 An if-else statement is the concatenation of the keyword if, an opening parenthesis, an


expression, a closing parenthesis, a statement, the keyword else, and another statement.

 Stmt -> if ( expr ) stmt else stmt

 Rule is called a production.

 In a production, lexical elements if and the parentheses are called terminals.

 Variables like expr and stmt are called nonterminals.


A Context Free Grammar
 A context-free grammar has four components:

 A set of terminal symbols, sometimes referred to as "tokens.“

 A set of nonterminals, sometimes called "syntactic variables."

 A set of productions, where each production consists of a nonterminal,called the head or


left side of the production, an arrow, and a sequence of terminals and/or nonterminals ,
called the body or right side of the production

 A designation of one of the nonterminals as the start symbol.


A Context Free Grammar

The terminal symbols are


Notational Conventions
These symbols are terminals:

 Lowercase letters early in the alphabet, such as a, b, c.

 Operator symbols such as +, *, and so on.

 Punctuation symbols such as parentheses, comma, and so on.

 The digits 0, 1, . . . , 9.

 Boldface strings such as id or if, each of which represents a single terminal


symbol.
Notational Conventions
These symbols are nonterminals:

 Uppercase letters early in the alphabet, such as A, B, C.

 The letter s, which, when it appears, is usually the start symbol.

 Lowercase, italic names such as expr or stmt.

 Uppercase letters may be used t o represent nonterminals for the constructs.


For example, nonterminals for expressions, terms, and factors are often
represented by E, T, and F, respectively.
Notational Conventions
 Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar
symbols; that is, either nonterminals or terminals.

 Lowercase letters late in the alphabet , chiefly u, v, ... ,z, represent (possibly
empty) strings of terminals.

 Lowercase greek letters,α, β, γ for example, represent (possibly empty) strings


of grammar symbols.

 A set of productions a -> α 1 , a -> α2, ... , a -> α k with a common head

 A (call them a-productions) , may be written A -> α 1 I α 2 I . , . I α k · call α1 ,


α2 , ... ,αk the alternatives for A.

 Unless stated otherwise, the head of the first production is the start symbol
Notational Conventions
Derivations
 E  E+E : E+E derives from E

 E  E+E  id+E  id+id

 A sequence of replacements of non-terminal symbols is called a derivation


of id+id from E.

 A   if there is a production rule A in our grammar and  and
 are arbitrary strings of terminal and non-terminal symbols

1  2  ...  n (n derives from 1 or 1 derives n )

 : derives in one step



*
: derives in zero or more steps
+
 : derives in one or more steps
CFG - Terminology
 L(G) is the language of G (the language generated by G) which is a set of
sentences.

 A sentence of L(G) is a string of terminal symbols of G.

 If S is the start symbol of G then


 is a sentence of L(G) iff S   where  is a string of terminals of G
*
 If G is a context-free grammar, L(G) is a context-free language.

 Two grammars are equivalent if they produce the same language.

 S - If  contains non-terminals, it is called as a sentential form of G.


*
- If  does not contain non-terminals, it is called as a sentence of G.
Derivation Example
 E  -E  -(E)  -(E+E)  -(id+E)  -(id+id)

OR

 E  -E  -(E)  -(E+E)  -(E+id)  -(id+id)

 At each derivation step, we can choose any of the non-terminal in the


sentential form of G for the replacement.

 If we always choose the left-most non-terminal in each derivation


step, this derivation is called as left-most derivation.

 If we always choose the right-most non-terminal in each derivation


step, this derivation is called as right-most derivation.
Left-Most and Right-Most Derivations
Left-Most Derivation

E  -E lm-(E)  -(E+E)
lm
 -(id+E)
lm
 -(id+id)
lm
lm

Right-Most Derivation

Erm -E 
rm -(E)  rm
-(E+E)  -(E+id)
rm  -(id+id)
rm

 We will see that the top-down parsers try to find the left-most derivation of the
given source program.

 We will see that the bottom-up parsers try to find the right-most derivation of
the given source program in the reverse order.
Parse Trees and Derivations
 A parse tree is a graphical representation of a derivation that filters out the
order in which productions are applied to replace nonterminals.
 The interior node is labeled with the nonterminal A in the head of the
production;

 The children of the node are labeled, from left to right, by the symbols in the
body of the production
 The leaves of a parse tree are labeled by nonterminals or terminals
 Read from left to right, constitute a sentential form, called the yield or frontier
of the tree.
 There is a many-to-one relationship between derivations and parse trees.
Ambiguity
 a grammar that produces more than one parse tree for some sentence is said
to be ambiguous
1
2
3
4

f
Writing a Grammar
 Grammars are capable of describing most, of the syntax of programming
languages .
 Grammar should be unambiguous.
 Left-recursion elimination and left factoring - are useful for rewriting
grammars .
 From the resulting grammar we can create top down parsers without
backtracking.
 Such parsers are called predictive parsers or recursive-descent parser
Eliminating Ambiguity

 ambiguous grammar can be rewritten to eliminate the ambiguity.


 stmt -> if expr then stmt
|if expr then stmt else stmt
|other
 is ambiguous since the string
 if E1 then if E2 then S1 else S2 has the two parse trees
Two parse trees for an ambiguous sentence
Eliminating Ambiguity
 The general rule is, "Match each else with the closest unmatched then."
Left Recursion
 A grammar is left recursive if it has a non-terminal A such that there is a
derivation.

A  A* for some string 

 Top-down parsing techniques cannot handle left-recursive grammars.

 The left-recursion may appear in a single step of the derivation (immediate left-
recursion), or may appear in more than one step of the derivation.
Immediate Left-Recursion
AA|  where  does not start with A

 eliminate immediate left recursion


A   A’
A’   A’ | 
In general,

A  A 1 | ... | A m | 1 | ... | n where 1 ... n do not start with A

 eliminate immediate left recursion


A  1 A’ | ... | n A’
A’  1 A’ | ... | m A’ |  an equivalent grammar
Left-Recursion -- Problem

• A grammar cannot be immediately left-recursive, but it still can


be left-recursive.

S  Aa | b
A  Sc | d

S  Aa  Sca
A  Sc  Aac causes to a left-recursion
Eliminate Left-Recursion -- Algorithm
- Arrange non-terminals in some order: A1 ... An
- for i from 1 to n do {
- for j from 1 to i-1 do {
replace each production
Ai  Aj 
by
Ai  1  | ... | k 
where Aj  1 | ... | k
}
- eliminate immediate left-recursions among Ai productions
}
Eliminate Left-Recursion
S  Aa | b
A  Ac | Sd | f

- Order of non-terminals: S, A
- A  Ac | Aad | bd | f
- Eliminate the immediate left-recursion in A
A  bdA’ | fA’
A’  cA’ | adA’ | 
So, the resulting equivalent grammar which is not left-recursive is:
S  Aa | b
A  bdA’ | fA’
A’  cA’ | adA’ | 
Eliminate Left-Recursion – Example2
S  Aa | b
A  Ac | Sd | f
- Order of non-terminals: A, S
- Eliminate the immediate left-recursion in A
A  SdA’ | fA’
A’  cA’ | 
- Replace S  Aa with S  SdA’a | fA’a
- Eliminate the immediate left-recursion in S
S  fA’aS’ | bS’
S’  dA’aS’ | 
So, the resulting equivalent grammar which is not left-recursive is:
S  fA’aS’ | bS’
S’  dA’aS’ | 
A  SdA’ | fA’
A’  cA’ | 
Left-Recursive Grammars III
 Here is an example of a (directly) left-recursive grammar:

EE+T|T

TT*F|F

F  ( E ) | id

 This grammar can be re-written as the following non left-


recursive grammar:

E  T E’ E’  + TE’ | є

T  F T’ T’  * F T’ | є

F  (E) | id
Left Factoring
 Left factoring is a grammar transformation that is useful for
producing a grammar suitable for predictive, or top-down,
parsing.
 Stmt -> if expr then stmt else stmt
|if expr then stmt
 A ->α 1 | α 2
 So it should be left factored as
Left-Factoring -- Algorithm
 For each non-terminal A with two or more alternatives (production rules)
with a common non-empty prefix
A  1 | ... | n | 1 | ... | m

convert it into

A  A’ | 1 | ... | m
A’  1 | ... | n
Left-Factoring – Example1
A  abB | aB | cdg | cdeB | cdfB


A  aA’ | cdg | cdeB | cdfB
A’  bB | B


A  aA’ | cdA’’
A’  bB | B
A’’  g | eB | fB
Left-Factoring – Example2
A  ad | a | ab | abc | b


A  aA’ | b
A’  d |  | b | bc


A  aA’ | b
A’  d |  | bA’’
A’’   | c
Top-Down Parsing
 The parse tree is created top to bottom.
 Top-down parser
 Recursive-descent parsing
 Backtracking is needed
 It is a general parsing technique, but not widely used.
 Not efficient
 Predictive parsing
 No backtracking
 Efficient
 Needs a special form of grammars - (LL(1) grammars).
 Recursive predictive parsing is a special form of recursive descent parsing without
backtracking.
 Non-recursive (table driven) predictive parser is also known as LL(1) parser.
Recursive Predictive Parsing
Each non-terminal corresponds to a procedure.
Ex: A  aBb
proc A {
- match the current token with a, and move to the next
token;
- call ‘B’;
- match the current token with b, and move to the next
token;
}
Recursive Predictive Parsing (cont.)
A  aBb | bAB
proc A {
case of the current token {
‘a’: - match the current token with a, and move to the next token;
- call ‘B’;
- match the current token with b, and move to the next token;
‘b’: - match the current token with b, and move to the next token;
- call ‘A’;
- call ‘B’;
}
}
Top-down parse for id + id * id
FIRST and FOLLOW
 FIRST and FOLLOW allow us to choose which production toapply, based on the
next input symbol.

 FIRST(α), where α is any string of grammar symbols, to be the set of terminals that
begin strings derived from α.
 If α => ε, then ε is also in FIRST(α) .

 A => cY, so c is in FIRST(A)

 FOLLOW(A) is the set of the terminals which occur immediately after (follow) the
non-terminal A in the strings derived from the starting symbol.

 a terminal a is in FOLLOW(A) if S  Aa

 $ is in FOLLOW(A) if S  A
*
*
FIRST
1. If X is a terminal, then FIRST(X) = {X}.

2. If X is a nonterminal and X -> YI Y2 ... Yk is a production for some k>=1, then


place a in FIRST(X) if for some i, a is in FIRST(Yi), and ε is in all of
FIRST(YI), ... ,FIRST(Yi-I); that * is , YI Y2 ... Yi-1 => ε. If ε is in FIRST (Yj) for
all j = 1, 2,... ,k, then add ε to FIRST (X). For example, everything in FIRST(Y1)
is surely in FIRST(X) . If Yi does not derive ε then we add nothing more to
FIRST (X) , but if Y1 => ε, then we add FIRST(Y2), and so on.

3. 3. If X => ε is a production, then add ε to FIRST (X).


FOLLOW
LL ( 1 ) Grammars
 L: scanning the input from left to right

 L: producing a leftmost derivation

 1 : one input symbol of lookahead at each step

 A grammar G is LL(1) if and only if whenever A -> α | β are two distinct


productions of G, the following conditions hold:
Construction of a predictive parsing
table.

You might also like