Nothing Special   »   [go: up one dir, main page]

CFL Pumping Lemma

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 11

CFL Pumping Lemma

Similar to regular-language PL, but you have to pump two strings in the middle of the string, in tandem (i.e., the same number of copies of each). Formally:

Outline of Proof of PL
Let there be a Chomsky-normal-form CFG for L with m variables. Pick n = 2m. Because CNF grammars have bodies of no more than 2 symbols, a string z of length n must have some path with at least m + 1 variables. Thus, some variable must appear twice on the path.
Compare with the DFA argument about a path longer than the number of states.

 CFL L  integer n  z in L, with |z| u n  uvwxy = z such that |vwx|  i u 0, uviwxiy is in L.

n and |vx| > 0

Focus on some path that is as long as any path in the tree. In this path, we can find a duplication of some variable A among the bottom m + 1 variables on the path.
Let the lower A derive w and the upper A derive vwx.

CNF guarantees us that |vwx|

vx { I. By repeatedly replacing the lower A's tree by the upper A's tree, we see uviwxiy has a parse tree for all i > 1.
And replacing the upper by the lower shows the case i = 0; i.e., uwy is in L.

n and

S A

A A

S A A

w u y

u v v

x y x

Another formulation
Consider the derivation S* uAy*uvAxy*uvwxy
S

* A vAx
v A x

A path in a binary tree representing a derivation for a grammar in CNF either is empty or consists of a node, one of its descendents, and all nodes in between. Length of the path is the number of nodes it contains Height of a binary tree is the length of its longest path.
LEMMA: For any h u 1, a binary tree having more than 2h leaf nodes, must have a height greater than h + 1

* Aw
w

@ uviwxiy L

Pumping lemma constant for CFLs is 2n+1 where n is the number of variables in the grammar in CNF Why? Because the derivation tree for a sufficiently long string must have a height of at least n + 2
It has more than 2n leaf nodes, and therefore its height is greater than n + 1 If n non-terminals, string must have tree with height n+2 to be sufficiently long for the PL to be applied
Height of n+1 gets us to the level of the bottom-most non-leaf nodes (nodes labeled with a non-terminal)

S C1 C3 C4 A B C2

z = (ab)(b)(ab)(b)(a) =u v w x y

Root of vwx

C5 C6 a C7 C9 b

@ Height of this subtree e n + 2

Cannot overlap w
@|vx| > 0

b A
C8

Consider a leaf and the n + 1 nodes above it: since there are only n variables, one must appear twice.

The classic non-CFL

Example
k2

L = {aibici | i u 0} is not a CFL.


Suppose it were. Then let n be the PL constant for L. Consider z = anbncn. We can write z = uvwxy, with |vwx| e n and |vx| > 0 (i.e., either v or x is non-empty). Two cases to consider: 1. Both v and x contain only one type of alphabet symbol: v does not contain both a's and b's or both b's and c's, and the same holds for x. But in this case uv2wx2y cannot contain equal numbers of a's, b's, and c's. 2. Either v or x contain more than one type of symbol: in this case uv2wx2y may contain equal numbers of a's, b's, and c's but they won't be in the correct order. One of these cases must occur, and both result in contradiction. So the assumption that L is a CFL is false.

Example
L = {0 | k is any integer} is not a CFL.
Suppose it were. Then let n be the PL constant for L. Consider z = 0n2. We can write z = uvwxy, with |vwx| e n and |vx| > 0. Then uvvwxxy should be in L. But n2 < |uvvwxxy| e n2 + n < (n + 1)2, so there is no perfect square that |uvvwxxy| could be. By "proof by contradiction, L is not a CFL.

Example
L = {ww | w {0,1}*} is not a CFL. Suppose it were. Then let n be the PL constant for L. Choosing a string is less obvious for this language.
Try z = 0n10n1. But it can be pumped by dividing as follows:
0n1 0n1 000000 0 1 0 0000001 u v x y z

Try another string: 0n1n0n1n


seems to capture more of the "essence" of the language

Use PL condition that the string can be pumped by dividing into z = uvwxy, where |vwx| e n. vwx must straddle the midpoint of z. Otherwise, if only in the first half of z, pumping up to uv2wx2y moves a 1 into the first position of the second half, so it cannot be of form ww. If in the second half, a 0 is moved into the last position of the first half, so cannot be of form ww. If vwx straddles the midpoint of z, pumping z down to uwy yields 0n1i0j1n, where i and j cannot both be n. This string cannot be of form ww. Contradiction!

Example
L= {aibjck | i < j < k} is not a CFL
Suppose it were. Then let n be the PL constant for L.

Example
L = {aibjck | 0 e i e j e k} is not a CFL
Suppose it were. Then let n be the PL constant for L. Consider z = anbncn. We can write z = uvwxy, with |vwx| e n and |vx| > 0, and uviwxiy L for every i u 0.. When both v and x contain only one type of symbol, v does not contain both as and bs and bs or cs and the same holds for x. Must divide into three sub-cases:
1. 2. No as. Then try pumping down to obtain uv0wx0y = uwy. Contains too few bs or cs. No bs. Then either as or cs must appear in v or x because both cant be the empty string. If as appear, then uv2wx2y contains more as than bs. If cs appear, then uv0wx0y contains more bs than cs. No cs. The string uv2wx2y contains more as or more bs than cs.

Consider z = anbn+1cn+2. We can write z = uvwxy, with |vwx| e n and |vx| > 0, and uviwxiy L for every i u 0.
This time must pump down as well as pump up. First we consider the case where vx contains at least one a. Then since |vwx| e n, vx can contain no c's. Therefore, uv2wx2y has at least n + 1 a's and exactly n + 2 c's, which is impossible for strings in L. If vx contains no a's, then it must contain either b or c. In this case, uv0wx0y = uwy has either fewer than n + 1 b's or fewer than n + 2 c's, but in either case exactly no a's. This is also impossible for strings in L. By "proof by contradiction," L is not a CFL.

3.

When either v or x contain more than one type of symbol, uv2wx2y will not contain symbols in the correct order. By "proof by contradiction," L is not a CFL.

Example
L = {xyx | x,y {a,b}* and |x| 1}is not a CFL
Suppose it were. Then let n be the PL constant for L.

Closure Properties of CFLs


CFLs are closed under union, concatenation, and Kleene star. CFLs are closed under reversal. Proofs are the same as for regular languages: by construction

Let z = anbnanbn. Then z = uvwxy for some u, v, w, x, and y, satisfying |vx| > 0, |vwx| e n, and uviwxiy L for every i u 0.
Suppose that vx contains either only a's from the first group or only b's from the last group. Then uv2wx2y is either an+i bnanbn or anbnanbn+i for some 0 < i e n, and in neither case can this string be in the form xyx for any x with |x| > 0. Otherwise, vx contains either a b from the first group or an a from the second. In this case uv0wx0y is either ai bjakbn or anbi ajbk where in either case i and k are positive and j < n. Neither of these strings can be in the form required for L either. By "proof by contradiction," L is not a CFL.

Start with CFGs G1 = (V1,7,S1,P1) and G2 = (V2,7,S2,P2)

If L1 and L2 are CFLs, then L1 L2 is a CFL


Construct Gu = (Vu,7,Su,Pu) generating L1 L2
rename elements of V2 if necessary so that V1 V2 = * define Vu = V1 V2 {Su}, Su V1 ,V2 define Pu = P1 P2 {Su p S1 | S2}

Non-closure Under Intersection


The following language L = {0i1j2k3l | i = k and j = l} is not a CFL. Intuitively, you need a variable and productions like A p 0A2 | 02 to
generate the matching 0's and 2's, while you need another variable to generate matching 1's and 3's. But these variables would have to generate strings that did not interleave.

If L1 and L2 are CFLs, then L1L2 is a CFL


Construct Gc = (Vc,7,Sc,Pc) generating L1L2 - rename elements of V2 so that V1 V2 = *
- define Vc = V1 V2 {Sc}, Sc V1 ,V2 - define Pc = P1 P2 {Sc p S1S2}

However, the simpler language {0i1j2k3l | i = k} is a CFL.


A grammar: S p S3 | A A p 0A2 |B B p 1B | I

If L1 is a CFL, then L1* is a CFL


Construct G* = (V*,7,S*,P*) generating L1*
- define V* = V1 {S*} - define P* = P1 {S* p S1S* | I}

Likewise the CFL {0i1j2k3l | j = l}. Their intersection is L.

Closure under Complement?


The complements of some CFLs are also CFLs Example: {anbn | n 0} Complement can be accepted by a PDA:
swap accepting states of PDA that recognizes anbn.

Non-closure of CFL's Under Complement


But not always!
The complement of non-CFL L = {0i1j2k3l | i = k and j = l} is a CFL.
Here is a PDA P recognizing it:
Non-deterministically choose whether to check i { k or j { l.
Non-deterministic PDAchecks one or the other, but capable of checking either one

Say we want to check i { k. As long as 0's come in, count them on the stack. Ignore 1's. Pop the stack for each 2. As long as we have not just exposed the bottom-of-stack marker when the first 3 comes in, accept, and keep accepting as long as 3's come in. But we also have to accept, and keep accepting, as soon as we see that the input is not in L(0*1*2*3*).

Another Example
If L1 and L2 are CFLs, then if the CFLs are closed under complementation, then the language

Testing Emptiness of a CFL


As for regular languages, we really take a representation of some language and ask whether it represents *.
In this case, the representation can be a CFG or PDA.
Our choice, since there are algorithms to convert one to the other.

L ! L1 L2
Is context free because the CFLs are closed under union But by DeMorgans Law,

L ! L1 L2

It would then follow that CFLs are closed under intersection. But we have already seen that they are not!

The test: Use a CFG; check if the start symbol is useless

Testing Finiteness of a CFL


Let L be a CFL. Then there is some pumping lemma constant n for L. Test all strings of length between n and 2n - 1 for membership (as in next slide). If there is any such string, it can be pumped, and the language is infinite. If there is no such string, then n - 1 is an upper limit on the length of strings, so the language is finite.
Trick: If there were a string z = uvwxy of length 2n or longer, you can find a shorter string uwy in L, but it's at most n shorter (why?). Thus, if there are any strings of length 2n or more, you can repeatedly cut out vx to get, eventually, a string whose length is in the range n to 2n - 1.

Testing Membership of a String in a CFL


Simulating a PDA for L on string w doesn't quite work, because the PDA can grow its stack indefinitely on I input, and we never finish, even if the PDA is deterministic. There is an O(n3) algorithm (n = length of w) that uses a "dynamic programming" technique.
Called Cocke-Younger-Kasami (CYK) algorithm.

CYK Algorithm
Start with a CNF grammar for L. Build a two-dimensional table:
Row = length of a substring of w. Column = beginning position of the substring. Entry in row i and column j = set of variables that generate the substring of w beginning at position j and extending for i positions. These entries are denoted Xj,i+j-1 i.e., the subscripts are the first and last positions of the string represented, so the first row is X11,X22, ,Xnn, the second row is X12,X23, ,Xn-1,n, and so on.

Table
The horizontal axis corresponds to the positions of the string w = a1a2an Table entry Xij is the set of non* terminals A such that A aiai+1aj
We are particularly interested in whether S is in * Xn1 because that is the same as saying S w (that is, w is in L)

Basis: (row 1) Xii = the set of variables A such that A p a is a production, and a is the symbol at position i of w.
The grammar is in CNF, therefore the only way to derive a terminal is with a production of the form A p a, so Xii is the set of non-terminals such that A p ai is a production of G

Example
We'll use the algorithm to determine if the string w = aabbb is in the language generated by the grammar S p AB A p BB | a B p AB |b Note that w11 = a, so X11 is the set of all variables that immediately derive a, that is X11 = {A}. Since w22 = a, we also have X22 = {A}, and so on to get
X11 = {A}, X22 = {A}, X33 = {B}, X44 = {B}, X55 = {B}

Induction: Suppose we want to compute Xij, which is in row j i +1


We can derive aiai+1 aj from A if there is a production A p BC, B derives any prefix of aiai+1 aj, and C derives the rest. Thus, we must ask if there is any value of k such that i e k < j. B is in Xik. C is in Xk+1,j .

Compute X12 : since X11 = {A} and X22 = {A}, X12 consists of all variables on the left side of a production whose right side is AA. None, so X12 is empty. Next X23 = {A | A pBB, B X22, B X33} so the required right side is AB, thus X23 = {S,B} Rest is easy:
X12 = *, X23 = {S,B}, X34 = {A}, X45 = {A}, X13 = {S,B}, X24 = {A}, X35 = {S,B}, X14 = {A}, X25 = {S,B}, X15 = {S,B}

S p AB A p BB | a B p AB |b

a 1,1 2,2

a 3,3

b 4,4

b 5,5

A
1,2

A
2,3

B
3,4

B
4,5

1,3

2,4

3,5

1,4

2,5

1,5

Since S is in X15, w L(G)

S p AB A p BB | a B p AB |b
1,1 2,2 3,3 4,4 5,5

a a b b b
B
3,4

Another Example
B B

A
1,2

A
2,3

B
4,5

B A A B A S, B S,B S,B

S, B
1,3 2,4

A
3,5

X p aXb | ab Step 1: put into CNF Apply CYK algorithm to aaabbb

S, B
1,4

A
2,5

S, B

A
1,5

S, B

S, B

X p aXb | ab

a 1,1 1,2 1,3 1,4 1,5 1,6

a 2,2 2,3 2,4 2,5 2,6

a 3,3 3,4 3,5 3,6

b 4,4 4,5 4,6

b 5,5 5,6

b 6,6

S p AB | BC A p BA | a B p CC | b C p AB |a
b 1,1 1,2 1,3 1,4 1,5 2,2 2,3 2,4 2,5 a 3,3 3,4 3,5

Another Example
Test for string baaba
a 4,4 4,5 a 5,5 a

CYK as a Parsing Algorithm


Applicability of the CYK algorithm as a parser limited by the computational requirements needed to find a derivation
For an input string of length n, (n2+n)/2 sets need to be constructed to complete the dynamic programming table. Each of these sets may require the consideration of several decompositions of the associated substring

Preview of Undecidable CFL Problems Is a given CFG ambiguous? Is a given CFG inherently ambiguous? Is the intersection of two CFLs empty? Are two CFLs the same? Is a given CFL equal to *, where is the alphabet of the language?

The Chomsky Hierarchy


Turing Machine
Recursively Enumerable Languages Context Sensitive Languages Context Free Languages

Context-Sensitive Grammars
The next grammar type, more powerful than CFGs, is a "somewhat restricted" grammar A grammar is context-sensitive if all productions are of the form x p y, where x,y are in (V T)+ and |x| |y|
Fundamental property:
grammar is non-contracting--i.e., the length of successive sentential forms can never decrease

Linear Bounded Automata

Regular Languages

Push Down Automata

Finite Automata

Why "context-sensitive"?
All productions can be rewritten in a normal form xAy p xvy Effectively, "A can be replaced by v only in the context of a preceding x and a following y"

Example
CSG for {anbncn | n 1} S Ab Ac bB aB p abc | aAbc p bA p Bbcc p Bb p aa | aaA
S aAbc abAc abBbcc aBbbcc aaAbbcc aabAbcc aabbAcc aabbBbccc aabBbbccc aaBbbbccc aaabbbccc

Linear-Bounded Automata
A limited TM in which tape use is restricted
Use only part of the tape occupied by the input I.e., has an unbounded tape, but the amount that can be used is a function of the input
Restrict usable part of tape to exactly the cells taken by the input

Try to derive a3b3c3

A and B are "messengers"- an A is created on the left, travels to the right to the first c, creates another b and c. Then sends B back to create the corresponding a. Similar to the way one would program a TM to accept the language.

LBA is assumed to be nondeterministic

10

Relation between CSLs and LBAs


If a language L is accepted by some linear bounded automaton, then there is a context-sensitive grammar that generates L.
Every step in a derivation from a CSG is a bounded function of |w| because any CSG G is non-contracting

11

You might also like