Finite Automata and Formal Languages

Characters: symbols, usually a, b, c, … or 0 and 1

Alphabet: A finite, non-empty set of characters, usually indicated by Σ = {a, b}

The Greek sigma symbol, Σ, represents the set of symbols that make up the alphabet we
are currently working with.

String: A finite sequence of zero or more characters from a given alphabet.

We often use the letter w to stand for a string, as in w = aaabbb
The Greek lambda symbol, λ, represents the string that consists of no characters, or the
“empty string”. |λ| = 0

Language: A set of strings (this may be a finite or an infinite set); a subset of Σ*

Σ* is the set of strings obtained by concatenating 0 or more symbols from Σ, our alphabet
Σ+ = Σ* − {λ}

Sentence: a string in a language L

You can represent a language by listing the sentences in it:


L = {a, aa, aaa, aaaa, …}
L = {} (the empty language)
L = {λ} (the language containing only the empty string)

Operations on sentences:

Length: The length of a sentence, represented as |w|, is the number of characters in w
(think "cardinality of a set")

Reversal: The reversal of a sentence, represented as w^R, is simply the characters in w
given in reverse order

Substring: if w = abb, then its substrings are: λ, a, b, ab, bb, abb
If u, v, and w are sentences, and w = uv, then u is a prefix of w, and v is a suffix of w

Concatenation: The concatenation of two sentences v and u, represented as vu, is the
sentence consisting of all the characters in v followed by all the characters in u.
The iterated concatenation of a sentence, denoted w^k, is simply k copies of w
concatenated together.
We say that w^0 = λ
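The operations above can be sketched with Python's built-in string type standing in for sentences; the names and values below are illustrative, not taken from the notes:

```python
# Python strings as a stand-in for sentences over an alphabet (illustrative).
w = "aaabbb"

length = len(w)        # |w|
reverse = w[::-1]      # w^R, the characters of w in reverse order
concat = "ab" + "ba"   # concatenation vu
power = "ab" * 3       # iterated concatenation w^3
empty = "ab" * 0       # w^0 is the empty string, λ

print(length, reverse, concat, power, repr(empty))
```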

Deterministic Finite automaton

A finite automaton is a 5-tuple (Q, Σ, δ, q0, F), where

• Q is a finite set of states,
• Σ is a finite alphabet,
• δ : Q × Σ → Q is the transition function,
• q0 is the start state, and
• F ⊆ Q is the set of accept states.

The language accepted by a DFA

Let M = (Q, Σ, δ, q0, F) be an FA.

• A string w ∈ Σ* is accepted by M if
δ*(q0, w) ∈ F

• The language accepted (or recognized) by M is the set of all strings on Σ that are
accepted by M

• Formally:
L(M) = {w ∈ Σ* : δ*(q0, w) ∈ F}
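As a concrete sketch, δ* can be computed by iterating δ over the input. The machine below (which accepts strings over {a, b} with an even number of a's) is invented for illustration:

```python
# A DFA as a Python dict: keys are (state, symbol) pairs, values are states.
def dfa_accepts(delta, start, finals, w):
    """Compute delta*(start, w) and test membership in finals."""
    q = start
    for c in w:
        q = delta[(q, c)]
    return q in finals

# Example machine (invented): accepts strings with an even number of a's.
delta = {("q0", "a"): "q1", ("q0", "b"): "q0",
         ("q1", "a"): "q0", ("q1", "b"): "q1"}

print(dfa_accepts(delta, "q0", {"q0"}, "abab"))  # two a's: accepted
```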

Non Deterministic Finite automaton

A non-deterministic finite accepter (abbreviated NFA or NDFA) is defined by the
quintuple:
M = (Q, Σ, δ, q0, F)
Q is a finite, nonempty set of states
Σ is a finite set of input symbols called the alphabet
δ : Q × (Σ ∪ {λ}) → 2^Q is the transition function
q0 ∈ Q is the initial state
F ⊆ Q is the set of final or “accepting” states

Differences between a DFA and an NFA:

(1) in an NFA, the range of δ is the powerset of Q (instead of just Q), so that from the
current state, upon reading a symbol:
(a) more than one state might be the next state of the NFA, or
(b) no state may be defined as the next state of the NFA, and
(2) λ-moves are possible; that is, a transition from one state to another may occur without
reading a symbol from the input string.
DFA:
transition function examples
δ : Q × Σ → Q          δ(q0, a) = q1

NFA:
transition function examples
δ : Q × (Σ ∪ {λ}) → 2^Q    δ(q1, a) = {q1, q2}
δ(q1, b) = {}
δ(q1, λ) = {q2}

The language L accepted by an NFA

M = (Q, Σ, δ, q0, F) is defined as

L(M) = {w ∈ Σ* : δ*(q0, w) ∩ F ≠ ∅}
That is, the language consists of all strings w for which there is a walk labeled w from the
start state to a final state in the transition graph.
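A sketch of NFA acceptance (without λ-moves, for brevity): track the *set* of states the machine could be in, and accept when that set meets F. The example machine, which accepts strings over {a, b} ending in "ab", is invented for illustration:

```python
# An NFA as a dict mapping (state, symbol) to a SET of next states.
def nfa_accepts(delta, start, finals, w):
    current = {start}
    for c in w:
        # union of all possible next states; missing entries mean "no move"
        current = set().union(*(delta.get((q, c), set()) for q in current))
    return bool(current & finals)   # delta*(q0, w) ∩ F ≠ ∅

# Example machine (invented): accepts strings ending in "ab".
delta = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"},
         ("q1", "b"): {"q2"}}

print(nfa_accepts(delta, "q0", {"q2"}, "aab"))
```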

Advantages of nondeterminism

• an NFA can be smaller, easier to construct and easier to understand than a DFA
that accepts the same language

• useful for proving some theorems

• good introduction to nondeterminism in more powerful computational models,


where nondeterminism plays an important role

Applications of regular expressions

• Validation
– checking that an input string is in valid format
– example 1: checking format of email address on
WWW entry form
– example 2: UNIX regex command
• Search and selection
– looking for strings that match a certain pattern
– example: UNIX grep command
• Tokenization
– converting sequence of characters (a string) into sequence of tokens (e.g.,
keywords, identifiers)
– used in lexical analysis phase of compiler
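The three uses above can be sketched with Python's `re` module; the email pattern is deliberately simplified (not a full address validator), and the token pattern is a toy:

```python
import re

# Validation: a simplified email-shaped pattern (illustrative only).
email = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")
ok = bool(email.match("user@example.com"))

# Search and selection: find all runs of digits in a line of text.
numbers = re.findall(r"\d+", "error in line 12, warning in line 40")

# Tokenization: split a tiny expression into identifiers, numbers, operators.
tokens = re.findall(r"[A-Za-z_]\w*|\d+|[=+]", "x = y + 42")

print(ok, numbers, tokens)
```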

Closure properties of regular languages

Theorem: The class of regular languages is closed under union, concatenation, and
Kleene star.
Proof:
Let L1 and L2 be regular languages over Σ.
Then, there are regular expressions r1 and r2 corresponding to L1 and L2.
By the definition of regular expressions and regular languages, r1 + r2, r1·r2, and r1* are
regular expressions corresponding to L1 ∪ L2, L1·L2, and L1*.
Thus, the class of regular languages is closed under union, concatenation, and Kleene
star.

Theorem: The complement of any regular language is a regular language.

Proof:
Let L be any regular language. By definition there must be some DFA
M = (Q, Σ, δ, q0, F) with L(M) = L. Define a new DFA M' = (Q, Σ, δ, q0, Q−F).
This has the same transition function δ as M, but for any string x ∈ Σ* it accepts x if
and only if M rejects x. Thus L(M') is the complement of L. Because there is a DFA
for it, we conclude that the complement of L is regular.

Theorem: If L1 and L2 are any regular languages, L1 ∩ L2 is also a regular language.

Proof 1:
• Let L1 and L2 be any regular languages
• By definition there must be DFAs for them:
– M1 = (Q, Σ, δ1, q0, F1) with L(M1) = L1
– M2 = (R, Σ, δ2, r0, F2) with L(M2) = L2
• Define a new DFA M3 = (Q×R, Σ, δ, (q0,r0), F1×F2)
• For δ, define it so that for all q ∈ Q, r ∈ R, and a ∈ Σ, we have δ((q,r),a) =
(δ1(q,a), δ2(r,a))
• M3 accepts if and only if both M1 and M2 accept
• So L(M3) = L1 ∩ L2, so that intersection is regular
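The product construction above can be sketched directly; the two example machines (even number of a's; ends in b) and all names are invented for illustration:

```python
# Build M3 from two DFAs given as transition dicts; states of M3 are pairs.
def product_dfa(states1, d1, q01, f1, states2, d2, q02, f2, alphabet):
    delta = {((q, r), a): (d1[(q, a)], d2[(r, a)])
             for q in states1 for r in states2 for a in alphabet}
    finals = {(q, r) for q in f1 for r in f2}   # F1 × F2
    return delta, (q01, q02), finals

def run_dfa(delta, start, finals, w):
    q = start
    for c in w:
        q = delta[(q, c)]
    return q in finals

# M1 accepts even number of a's; M2 accepts strings ending in b.
d1 = {("q0", "a"): "q1", ("q0", "b"): "q0",
      ("q1", "a"): "q0", ("q1", "b"): "q1"}
d2 = {("p0", "a"): "p0", ("p0", "b"): "p1",
      ("p1", "a"): "p0", ("p1", "b"): "p1"}

delta, start, finals = product_dfa({"q0", "q1"}, d1, "q0", {"q0"},
                                   {"p0", "p1"}, d2, "p0", {"p1"}, "ab")
print(run_dfa(delta, start, finals, "aab"))  # even a's AND ends in b
```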

Proof 2:
De Morgan’s Law: L1 ∩ L2 = (L1′ ∪ L2′)′
• L1 and L2 are regular
• So L1′ and L2′ are regular (closure under complementation)
• So L1′ ∪ L2′ is regular (closure under union)
• So (L1′ ∪ L2′)′ is regular (closure under complementation)
• So L1 ∩ L2 is regular.

Closure Under Difference:

Theorem: If L1 and L2 are any regular languages, L1 − L2 is also a regular language.
• L1 − L2 is regular if L1 and L2 are regular
• L1 − L2 = L1 ∩ L2′
• L1 and L2 are regular
• Then L2′ is regular (closure under complementation)
• Then L1 ∩ L2′ is regular (closure under intersection)
• So L1 − L2 is regular

Closure Under Reversal:
Theorem: If L1 and L2 are any regular languages, their reversals are also regular
languages.

Homomorphism:

Let Σ and Δ be alphabets. A function h: Σ → Δ* is a homomorphism.

A homomorphism may be extended uniquely to a function h: Σ* → Δ* by:
h(ε) = ε, and h(xa) = h(x)h(a) for all x ∈ Σ* and a ∈ Σ.
Finally, a homomorphism may be applied to languages, h: p(Σ*) → p(Δ*), by the
elementwise extension (p(S) = power set of S):
h(L) = {h(x) | x ∈ L} for each L ⊆ Σ*.

Let Σ and Δ be alphabets. A function s: Σ → p(Δ*) is a substitution (i.e., each letter of Σ
is associated with a language over Δ). A substitution is called regular if each of the
languages s(a) is regular.
A substitution may also be extended to a function s: Σ* → p(Δ*), inductively defined
by s(ε) = {ε}, and s(xa) = s(x)s(a) for all x ∈ Σ* and a ∈ Σ.
Finally, substitutions may be applied to languages, s: p(Σ*) → p(Δ*), by element-wise
extension: s(L) = ∪ s(x), taken over all x ∈ L.

Every homomorphism h can be regarded as a substitution h′ where, if h(a) = w, then
h′(a) = {w}. Homomorphisms map a letter to one string. Substitutions map a letter to a
set of strings.

Inverse homomorphism:
Let h be a homomorphism from some alphabet Σ to strings over another alphabet T. Let L
be a regular language over T. Then h⁻¹(L) is the set of strings w such that h(w) is in L.
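A sketch of both directions, with an example mapping h invented for illustration; the inverse image is computed here by brute force over short strings, which is only feasible for a toy demonstration:

```python
from itertools import product

# An assumed example homomorphism h: {a, b} -> {0, 1}*.
h = {"a": "01", "b": "1"}

def apply_h(w):
    """Extend h to strings: h(ε) = ε and h(xa) = h(x)h(a)."""
    return "".join(h[c] for c in w)

print(apply_h("ab"))   # h(a)h(b)

# h^-1(L) for L = {"011"}: all w over {a, b} up to length 3 with h(w) in L.
L = {"011"}
inverse = ["".join(p) for n in range(4)
           for p in product("ab", repeat=n) if apply_h("".join(p)) in L]
print(inverse)
```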

Decision Properties:
– Assume
• #symbols = constant
• #states = n.

– Conversion from an ε-NFA to a DFA --- requiring O(n³2ⁿ) time in the worst case
– Conversion from a DFA to an NFA --- requiring O(n) time
– From an automaton (DFA) to an RE --- requiring O(n³4ⁿ) time
– From an RE to an automaton (ε-NFA) --- requiring linear time in the size of the RE
– The table-filling algorithm requires O(n²) time, where n = #states.

Context-free grammar

A context-free grammar (CFG) is a 4-tuple, G = (V, T, S, P), where V and T are
disjoint sets, S ∈ V, and P is a finite set of rules of the form A → x, where A ∈ V and
x ∈ (V ∪ T)*.
V = non-terminals or variables
T = terminals
S = Start symbol
P = Productions or grammar rules

Derivation:
In a leftmost derivation, the leftmost nonterminal is replaced at each step.
In a rightmost derivation, the rightmost nonterminal is replaced at each step.

Parse Tree

Let G = (V, T, S, P) be a context-free grammar. An ordered tree is a derivation tree for G


iff it has the following properties:
1. The root is labeled S
2. Every leaf has a label from T ∪ {λ}
3. Every interior vertex (not a leaf) has a label from V.
4. If a vertex has label A ∈ V, and its children are labeled (from left to right) a1, a2, ...,
an, then P must contain a production of the form A → a1a2...an
5. A leaf labeled λ has no siblings; that is, a vertex with a child labeled λ can have no
other children

A partial derivation tree is one in which property 1 does not necessarily hold and in
which property 2 is replaced by:
2a. Every leaf has a label from V ∪ T ∪ {λ}

The yield of the tree is the string of symbols in the order they are encountered when the
tree is traversed in a depth-first manner, always taking the leftmost unexplored branch.

Ambiguity

A grammar is ambiguous if it can generate a string with two different parse trees.
A string has more than one parse tree if and only if it has more than one leftmost
derivation.

Ambiguous grammar

<statement> := IF < expression> THEN <statement> |


IF <expression> THEN <statement> ELSE <statement> |
<otherstatement>

Unambiguous grammar

<statement> := <st1> | <st2>

<st1> := IF <expression> THEN <st1> ELSE <st1> |
<otherstatement>
<st2> := IF <expression> THEN <statement> |
IF <expression> THEN <st1> ELSE <st2>

Simplification of Context-free Grammars and Normal Forms

λ -productions

Any production of a context-free grammar of the form


A → λ
is called a λ-production.

Any variable A for which the derivation A ⇒* λ is possible is called nullable.

A nullable variable in a context-free grammar G = (V, T, S, P) is defined as follows:

1. Any variable A for which P contains the production A → λ is nullable.

2. If P contains the production A → B1B2…Bn, and B1, B2, …, Bn are all nullable variables,


then A is nullable.

3. No other variables in V are nullable.

Unit Production

Any production of a context-free grammar of the form


A → B,
(where A, B ∈ V) is called a unit production.

Useless Productions

A variable is useless if either:


1. it is not live (i.e., cannot derive a terminal string), or
2. it is not reachable from the start symbol
A production is useless if it involves any useless variables.

Chomsky Normal Form (CNF)


A context-free grammar is in Chomsky Normal Form (CNF) if every production is one of
these two types:

A → BC   or   A → a

where A, B, and C are variables and a is a terminal symbol.

Greibach Normal Form ( GNF)

A context-free grammar is said to be in Greibach Normal Form if all productions have


the form
A → aX
where a ∈ T and X ∈ V*

Pushdown Automata

A nondeterministic pushdown automaton (NPDA) is a 7-tuple M = (Q, Σ, Γ, q0, δ, z, F),

where
Q is a finite set of states
Σ is the input alphabet (a finite set)
Γ is the stack alphabet (a finite set)
δ : Q × (Σ ∪ {λ}) × Γ → (finite subsets of Q × Γ*)
is the transition function
q0 ∈ Q is the start state
z ∈ Γ is the initial stack symbol
F ⊆ Q is the set of accepting states

Instantaneous description

Given the transition function


δ : Q × (Σ ∪ {λ}) × Γ → (finite subsets of Q × Γ*)
a configuration, or instantaneous description, of M is a snapshot of the current status of
the PDA. It consists of a triple:
(q, w, u)
where:
q ∈ Q (q is the current state of the control unit)
w ∈ Σ* (w is the remaining unread part of the input string), and
u ∈ Γ* (u is the current stack contents, with the leftmost symbol indicating the top of the
stack)

Acceptance by final state

If M = (Q, , G, q0, d, z, F), is a push-down automaton and w Î *, the string w is


accepted by M if:
(q0, w, z) |--* (qf, λ , u)
for some u Î G* and some qf Î F.

8
This means that we start at the start state, with the z on stack , and after processing the
string w, we end up in an accepting state, with no more characters left to process in the
original string. We don’t care what is left on the stack.
This is called acceptance by final state.

Acceptance by Empty Stack

If M = (Q, , G, q0, d, z, F), is a push-down automaton and w Î *, the string w is


accepted by M if:
(q0, w, z) |--* (q, λ , λ)
for some u Î G* and some q Î Q.
This means that we start at the start state, with the z on stack , and after processing the
string w, we end up in any state, with no more characters left to process in the original
string. Nothing should be left on the stack. Ie., stack must be empty.
This is called acceptance by final state.
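Empty-stack acceptance can be sketched for a concrete language; the PDA below, for L = {aⁿbⁿ : n ≥ 0}, is invented for illustration (its moves are deterministic enough to simulate directly, with the stack as a Python list, top at the end):

```python
# A PDA for {a^n b^n : n >= 0}, accepting by empty stack (illustrative).
def pda_accepts(w):
    stack = ["z"]                      # initial stack symbol z
    state = "q0"
    for c in w:
        if state == "q0" and c == "a":
            stack.append("A")          # push one A per a read
        elif c == "b" and stack and stack[-1] == "A":
            stack.pop()                # pop one A per b read
            state = "q1"               # once a b is seen, no more a's allowed
        else:
            return False               # no move defined: reject
    # final λ-move (q, λ, z): pop z, leaving the stack empty
    return stack == ["z"]

print(pda_accepts("aabb"), pda_accepts("aab"))
```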

Constructing a PDA from a CFG

Given G = (V, T, S, P),

construct M = (Q, Σ, Γ, δ, q0, z, F), with
Q = {q0, q1, q2}
Γ = V ∪ Σ ∪ {z}
F = {q2}

δ(q0, λ, z) = {(q1, Sz)}

For every A ∈ V,
δ(q1, λ, A) = {(q1, x)}, where A → x is in P
For every a ∈ Σ,
δ(q1, a, a) = {(q1, λ)}
δ(q1, λ, z) = {(q2, λ)}

Closure properties of CFGs

CFLs are closed under Union, Concatenation and Kleene closure.

Proof by construction:
Let
G1 = (V1, T1, S1, P1) and
G2 = (V2, T2, S2, P2)
with
L1 = L(G1) and
L2 = L(G2)

Union

We create grammar Gu = (Vu, T1 ∪ T2, Su, Pu) generating L1 ∪ L2

1. Rename the elements of V2 if necessary so that V1 ∩ V2 = ∅.
2. Create a new start symbol Su, not already in V1 or V2.
3. Set Vu = V1 ∪ V2 ∪ {Su}
4. Set Pu = P1 ∪ P2 ∪ {Su → S1 | S2}.

Construction completed.

Concatenation

We create grammar Gc = (Vc, T1 ∪ T2, Sc, Pc) generating L1·L2

1. Rename the elements of V2 if necessary so that V1 ∩ V2 = ∅.
2. Create a new start symbol Sc, not already in V1 or V2.
3. Set Vc = V1 ∪ V2 ∪ {Sc}
4. Set Pc = P1 ∪ P2 ∪ {Sc → S1S2}

Construction completed.

Kleene Closure

We create grammar G* = (V*, T1, S, P*) generating L1*

1. Create a new start symbol S, not already in V1.
2. Set V* = V1 ∪ {S}
3. Set P* = P1 ∪ {S → S1S | λ}

Construction completed.
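The three constructions can be sketched with a grammar as a dict mapping each variable to its list of right-hand sides. The grammars G1, G2 and all names here are invented for illustration; variables are multi-character names like "S1", a loose encoding that is fine for a sketch (a real implementation would tokenize each right-hand side):

```python
# Two toy grammars (invented): L1 = {a^n : n >= 1}, L2 = {b^n : n >= 1}.
G1 = {"S1": ["aS1", "a"]}
G2 = {"S2": ["bS2", "b"]}

# Union:         Pu = P1 ∪ P2 ∪ {Su -> S1 | S2}
union = {**G1, **G2, "Su": ["S1", "S2"]}
# Concatenation: Pc = P1 ∪ P2 ∪ {Sc -> S1S2}
concat = {**G1, **G2, "Sc": ["S1S2"]}
# Kleene star:   P* = P1 ∪ {S -> S1S | λ}  ("" stands for λ)
star = {**G1, "S": ["S1S", ""]}

print(union["Su"], concat["Sc"], star["S"])
```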

The Standard Turing Machine

A Finite Automaton has only its states to serve as memory - very limited in what
languages it could recognize.
In a Pushdown Automaton we added a stack to a Finite Automaton - this gave it better
memory, and allowed it to recognize more languages. But it was still limited by the
nature of the stack.
We could add another stack to a Pushdown Automaton. This would allow it to accept
some non-context-free languages (e.g., aⁿbⁿcⁿ).
Or we could change the stack to a queue. This would allow other languages to be
accepted (e.g., ww).

Definition

A Turing machine (TM) is a 7-tuple
T = (Q, Σ, Γ, δ, q0, Δ, F), where
Q = a finite set of states not containing h (the halt state)
Σ = a finite set of symbols constituting the input alphabet;
Σ ⊆ Γ − {Δ}
Γ = a finite set of symbols constituting the tape alphabet
δ = the transition function: Q × Γ → Q × Γ × {L, R}
q0 = the start state (q0 ∈ Q)
Δ is a special symbol, the blank symbol
F ⊆ Q is the set of final states

A Turing machine can be thought of as a finite state automaton with slightly different
labels on the arcs, plus a tape as auxiliary memory, and a tape head (or read-write head).

Example tape contents:  Δ 1 0 0 1 Δ Δ

Each cell on the tape on a Turing machine can hold:


• a symbol from the input alphabet, or
• a symbol from the tape alphabet (which may include some symbols which are not
in the input alphabet), or
• a blank symbol Δ, which is distinct from the input alphabet.

The Turing machine has a tape head which reads from and writes to the tape during each
move. The tape head can move right or left, or may stay in the same position. The tape
head can read a character from the cell under which it is positioned, or it may write a
character to the cell.

The standard Turing machine described by our textbook has several features:
The TM has a tape that is unbounded in both directions.
The TM is deterministic in the sense that δ defines at most one move per configuration
There is no special input file; the input is copied onto the tape.
There is no special output device; the tape contains the output, if any.

The TM transition function has the following form:


δ(q, X) = (r, Y, D)

This says that:


• when the TM is in state q, and
• the read-write head reads a certain symbol (X) off the tape,
then the TM will:
• change state to state r,
• write a Y onto the tape, and
• move in direction D.

A TM that accepts {a, b}* {aba} {a, b}*

A TM that accepts {a, b}* {aba}

A TM that accepts pal, the palindrome language:

A TM that accepts strings of L = {ss : s ∈ {a, b}*}:

• The addition of two numbers is as simple as the concatenation of two strings.
• Assume two integers are represented in unary form on the tape of a Turing
machine.
• Assume the two integers are separated by a 0 between them.
• Design a Turing machine that will add these two numbers.
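One standard idea for this exercise: change the separating 0 into a 1, then erase one 1 from the right end, leaving m + n in unary. This is a sketch of that idea simulated in Python rather than a full transition table; the encoding (runs of 1s separated by a single 0) follows the assumptions above:

```python
# Unary addition by rewriting the tape (illustrative simulation, not a
# transition-table TM): "111" + "111" is encoded as "1110111".
def tm_add_unary(tape):
    cells = list(tape)
    i = cells.index("0")   # scan right to the separator
    cells[i] = "1"         # overwrite the 0 with a 1
    cells.pop()            # erase the rightmost 1 (write a blank there)
    return "".join(cells)

print(tm_add_unary("1110111"))  # 3 + 3 in unary
```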

Variations on TMs

Multitape TM

• Consider a TM with k tapes and a separate head for each tape. It reads and writes
on these tapes in parallel.
• We can show that this does not increase the computational power of a TM by
showing that any multi-tape TM can be simulated by a standard, single-tape TM.
• The details of the simulation involve dividing a single tape into multiple tracks --
using an alphabet consisting of tuples, with one element for each track.

Decidability

• A Turing computable language has a characteristic function that is Turing
computable. A Turing computable language is also called a decidable language.
• A semi-decidable language has a TM that outputs 1 (or equivalently, halts) for
every input string in the language, and does not halt for any input string that is not
in the language.
• So, we talk about computability for functions, and decidability for languages. But
it’s the same idea.

Enumerability

• A language is said to be Turing enumerable if there is a TM that lists all the
strings of the language. (Note that the TM never terminates if the language is
infinite.)
• Some facts:
– A language is Turing enumerable if and only if it is semi-decidable.
– If a language and its complement are Turing enumerable, then the
language is decidable.

– If a language is decidable, then its complement is decidable.

Recursive and recursively enumerable languages

Remember that the strings that a TM accepts constitute the language of the Turing
machine. We represent this as L(T).
A Turing machine always accepts the words of its language by stopping in the halting
state.
However, it is allowed to reject strings that don’t belong to its language either by
crashing (in a finite number of steps), or by looping forever.

Infinite loops are bad for us, because if we are trying to decide whether a string belongs
to the language of a TM or not, we can’t tell after waiting a finite amount of time whether
the TM is going to halt on the very next step, or whether it is going to go on forever. We
would prefer to have our TMs crash to reject a string.

It turns out that these distinctions exactly correspond to the last two major classes of
languages that we want to discuss in this course:
Recursively enumerable = accepted by a TM that may loop (or may crash) to
reject
Recursive = accepted by a TM that always crashes to reject

Definition
A language L is recursively enumerable if there is a TM that accepts L.

A language L is recursive if there is a TM that recognizes (decides) L.

This means that a language is recursive iff there exists a membership algorithm for it; a
language that is only recursively enumerable has no such algorithm.

Linear Bounded Automata

The last machine model of computation which we shall examine is the linear bounded
automaton or lba. These were originally developed as models for actual computers rather
than models for the computational process. They have become important in the theory of
computation even though they have not emerged in applications to the extent which
pushdown automata enjoy.
They do not have unending amounts of storage like Turing machines. Thus any actual
computation done on a computer is not as extensive as that which could be completed on
a Turing machine.
Definition. A linear bounded automaton (lba) is a multi-track Turing machine which has
only one tape, and this tape is exactly the same length as the input.

We allow the computing device to use just the storage it was given at the beginning of its
computation. As a safety feature, we shall employ endmarkers (* on the left and # on the

right) on our lba tapes and never allow the machine to go past them. This will ensure that
the storage bounds are maintained and help keep our machines from leaving their tapes.
Let's have linear bounded automata accept just like Turing machines. Thus for lba
halting means accepting. For these new machines computation is restricted to an area
bounded by a constant (the number of tracks) times the length of the input.

An lba configuration is the same as a Turing machine configuration and consists


of:
a) an instruction,
b) the tape head's position, and
c) the content of the tape.

Halting problem
The determination of whether a Turing machine will come to a halt given a particular
input program. The halting problem is solvable for machines with fewer than four states.
However, the four-state case is open, and the five-state case is almost certainly
unsolvable, due to the fact that it includes machines iterating Collatz-like congruential
functions, and such specific problems are currently open. The problem of whether a
general Turing machine halts is undecidable, as first proved by Turing: he showed the
answer is no. Since many questions can be recast as this problem, we can conclude that
some programs are absolutely impossible, although heuristic or partial solutions are
possible.

In this abstract framework, there are no resource limitations of memory or time on the
program's execution; it can take arbitrarily long, and use arbitrarily much storage space,
before halting. The question is simply whether the given program will ever halt on a
particular input.

For example, in pseudocode, the program

while True: continue

does not halt; rather, it goes on forever in an infinite loop. On the other hand, the program

print "Hello World!"

halts very quickly.

The halting problem is famous because it was one of the first problems proven
algorithmically undecidable. This means there is no algorithm which can be applied to
any arbitrary program and input to decide whether the program stops when run with that
input.

Representing the halting problem as a set

The conventional representation of decision problems is the set of objects possessing the
property in question. The halting set

K := { (i, x) | program i will eventually halt if run with input x }

represents the halting problem.

This set is recursively enumerable, which means there is a computable function that lists
all of the pairs (i,x) it contains. However, the complement of this set is not recursively
enumerable.

There are many equivalent formulations of the halting problem; any set whose Turing
degree equals that of the halting problem is such a formulation. Examples of such sets
include:

{ i | program i eventually halts when run with input 0 }

{ i | there is an input x such that program i eventually halts when run with input x }.

Post's correspondence problem


Definition: Given a set of pairs of strings, find a sequence of pairs such that the
concatenation of all first members of the pairs is the same string as the concatenation of
all second members. This is an undecidable problem.

Definition of the problem

The input of the problem consists of two finite lists α1, …, αN and β1, …, βN of
words over some alphabet A having at least two symbols. A solution to this problem is a
sequence of indices i1, …, iK, with K ≥ 1 and 1 ≤ ik ≤ N for all k, such that

αi1 αi2 … αiK = βi1 βi2 … βiK.

The decision problem then is to decide whether such a solution exists or not.

Example 1

Consider the following two lists:

   α1 = a     α2 = ab    α3 = bba
   β1 = baa   β2 = aa    β3 = bb

A solution to this problem would be the sequence (3, 2, 3, 1), because

α3α2α3α1 = bba + ab + bba + a = bbaabbbaa = bb + aa + bb + baa = β3β2β3β1.

Furthermore, since (3, 2, 3, 1) is a solution, so are all of its "repetitions", such as (3, 2, 3,
1, 3, 2, 3, 1), etc.; that is, when a solution exists, there are infinitely many solutions of
this repetitive kind.
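While *finding* a PCP solution is undecidable in general, *checking* a proposed one is trivial. A sketch using the pairs from Example 1 (function name invented; indices 1-based as in the example):

```python
# Verify a proposed PCP solution: concatenate tops and bottoms and compare.
def pcp_check(pairs, seq):
    top = "".join(pairs[i - 1][0] for i in seq)      # alpha_{i1} ... alpha_{iK}
    bottom = "".join(pairs[i - 1][1] for i in seq)   # beta_{i1} ... beta_{iK}
    return top == bottom

# The (alpha_i, beta_i) pairs from Example 1.
pairs = [("a", "baa"), ("ab", "aa"), ("bba", "bb")]

print(pcp_check(pairs, [3, 2, 3, 1]))   # the solution from the text
```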

However, if the two lists had consisted of only α2,α3 and β2,β3, then there would have
been no solution (because then no matching pair would have the same last letter, as must
occur at the end of a solution). A convenient way to view an instance of a Post
correspondence problem is as a collection of blocks of the form

   αi
   --
   βi

there being an unlimited supply of each type of block. Thus the above example is viewed
as

   a     ab    bba
   ---   ---   ---
   baa   aa    bb

   i=1   i=2   i=3

where the solver has an endless supply of each of these three block types. A solution
corresponds to some way of laying blocks next to each other so that the string in the top
cells corresponds to the string in the bottom cells. Then the solution to the above example
corresponds to:

   bba   ab    bba   a
   ---   ---   ---   ---
   bb    aa    bb    baa

   i1=3  i2=2  i3=3  i4=1


Example 2

Again using blocks to represent an instance of the problem, the following is an example
that has infinitely many solutions in addition to the kind obtained by merely "repeating" a
solution. For readability, just the block number is shown below each block:

   bb    ab    c
   ---   ---   ---
   b     ba    bc

   1     2     3

In this instance, every sequence of the form (1, 2, 2, ..., 2, 3) is a solution (in addition to
all their repetitions):

   bb    ab    ab   ...   ab    c
   ---   ---   ---        ---   ---
   b     ba    ba   ...   ba    bc

   1     2     2    ...   2     3
