Nothing Special   »   [go: up one dir, main page]

Edit Dist

Download as pdf or txt
Download as pdf or txt
You are on page 1of 35

Minimum

Edit
Distance

Definition of Minimum Edit


Distance
How similar are two strings?

Spell correction Computational Biology


graffe Align two sequences of nucleotides
Which is closest? AGGCTATCACCTGACCTCCAGGCCGATGCCC
graf TAGCTATCACGACCGCGGTCGATTTGCCCGAC
graft Resulting alignment:
grail
giraffe -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Also for Machine Translation, Information Extraction, Speech Recognition


Edit Distance
The minimum edit distance between two strings
Is the minimum number of editing operations
Insertion
Deletion
Substitution
Needed to transform one into the other
Minimum Edit Distance
Two strings and their alignment:
Minimum Edit Distance

If each operation has cost of 1


Distance between these is 5
If substitutions cost 2 (Levenshtein)
Distance between them is 8
Alignment in Computational Biology

Given a sequence of bases


AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

An alignment:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Given two sequences, align each letter to a letter or gap


Other uses of Edit Distance in NLP

Evaluating Machine Translation and speech recognition


R Spokesman confirms senior government adviser was appointed
H Spokesman said the senior adviser was appointed
S I D I

Named Entity Extraction and Entity Coreference


IBM Inc. announced today
IBM profits
Stanford Professor Jennifer Eberhardt announced yesterday
for Professor Eberhardt
Minimum Edit as Search
But the space of all edit sequences is huge!

Lots of distinct paths wind up at the same state.

Just the shortest path to each of those revisted states.


Defining Min Edit Distance
For two strings
X of length n
Y of length m
We define D(i,j)
the edit distance between X[1..i] and Y[1..j]
i.e., the first i characters of X and the first j characters of Y
The edit distance between X and Y is thus D(n,m)
Minimum
Edit Definition of Minimum Edit
Distance
Distance
Minimum
Edit Computing Minimum Edit
Distance
Distance
Dynamic Programming for Minimum Edit Distance

Dynamic programming: A tabular computation of D(n,m)


Solving problems by combining solutions to subproblems.
Bottom-up
We compute D(i,j) for small i,j
And compute larger D(i,j) based on previously computed smaller
values
i.e., compute D(i,j) for all i (0 < i < n) and j (0 < j < m)
Defining Min Edit Distance (Levenshtein)
Initialization
D(i,0) = i
D(0,j) = j
Recurrence Relation:
For each i

D(i-1,j) + 1
D(i,j)= min D(i,j-1) + 1
D(i-1,j-1) + 2; if X(i) Y(j)
0; if X(i) = Y(j)
Termination:
D(N,M) is distance
The Edit Distance Table
N 9
O 8
I 7

T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
The Edit Distance Table
N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
Edit Distance
N 9
O 8
I 7

T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
The Edit Distance Table
N 9 8 9 10 11 12 11 10 9 8
O 8 7 8 9 10 11 10 9 8 9
I 7 6 7 8 9 10 9 8 9 10
T 6 5 6 7 8 9 8 9 10 11
N 5 4 5 6 7 8 9 10 11 10
E 4 3 4 5 6 7 8 9 10 9
T 3 4 5 6 7 8 7 8 9 8
N 2 3 4 5 6 7 8 7 8 7
I 1 2 3 4 5 6 7 6 7 8
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
Minimum
Edit Computing Minimum Edit
Distance
Distance
Minimum
Edit Backtrace for Computing
Alignments
Distance
Computing alignments

We often need to align each character of the two strings to


each other
backtrace
Every time we enter a cell, remember where we came
from
When we reach the end,
Trace back the path from the upper right corner to read off the
alignment
Edit Distance
N 9
O 8
I 7

T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
MinEdit with Backtrace
Adding Backtrace to Minimum Edit Distance

Base conditions: Termination:


D(i,0) = i D(0,j) = j D(N,M) is distance
Recurrence Relation:
For each i

D(i-1,j) + 1 deletion

D(i,j)= min D(i,j-1) + 1 insertion


D(i-1,j-1) + 2; if X(i) Y(j) substitution
0; if X(i) = Y(j)
LEFT insertion
ptr(i,j)= DOWN deletion
DIAG substitution
The Distance Matrix

xN
Every non-decreasing path

from (0,0) to (M, N)

corresponds to
an alignment
of the two sequences

x0
An optimal alignment is composed
y0 yM of optimal subalignments
SLIDE ADAPTED FROM SERAFIM BATZOGLOU WITH PERMISSION
Result of Backtrace
Two strings and their alignment:
Performance

Time:
O(nm)
Space:
O(nm)
Backtrace
O(n+m)
Minimum
Edit Backtrace for Computing
Alignments
Distance
Minimum
Edit Weighted Minimum Edit
Distance
Distance
Weighted Edit Distance
Why would we add weights to the computation?
Spell Correction: some letters are more likely to be
mistyped than others
Biology: certain kinds of deletions or insertions are more
likely than others
Confusion matrix for spelling errors
Weighted Min Edit Distance
Initialization:
D(0,0) = 0
D(i,0) = D(i-1,0) + del[x(i)]; 1 < i
D(0,j) = D(0,j-1) + ins[y(j)]; 1 < j M
Recurrence Relation:
D(i-1,j) + del[x(i)]
D(i,j)= min D(i,j-1) + ins[y(j)]
D(i-1,j-1) + sub[x(i),y(j)]
Termination:
D(N,M) is distance
Minimum
Edit Weighted Minimum Edit
Distance
Distance
Minimum
Edit Minimum Edit Distance in
Computational Biology
Distance

You might also like