Mathlib.Data.List.EditDistance.Defs

Levenshtein distances

We define the Levenshtein edit distance levenshtein C xs ys between two lists xs : List α and ys : List β, with a customizable cost structure C for the delete, insert, and substitute operations.

As an auxiliary function, we define suffixLevenshtein C xs ys, which gives the list of distances from each suffix of xs to ys. This is defined by recursion on ys, using the internal function Levenshtein.impl, which computes suffixLevenshtein C xs (y :: ys) using xs, y, and suffixLevenshtein C xs ys. (This corresponds to the usual algorithm using the last two rows of the matrix of distances between suffixes.)

After setting up these definitions, we prove lemmas specifying their behaviour, particularly

theorem suffixLevenshtein_eq_tails_map :
    (suffixLevenshtein C xs ys).1 = xs.tails.map fun xs' => levenshtein C xs' ys := ...

and

theorem levenshtein_cons_cons :
    levenshtein C (x :: xs) (y :: ys) =
    min (C.delete x + levenshtein C xs (y :: ys))
      (min (C.insert y + levenshtein C (x :: xs) ys)
        (C.substitute x y + levenshtein C xs ys)) := ...
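As a quick illustration of the definitions summarised above, the following sketch evaluates the unit-cost distance between two words viewed as lists of characters. It assumes `import Mathlib` is available (a narrower import that provides this module together with the `AddZeroClass ℕ` instance would also do); the example words are arbitrary.

```lean
import Mathlib

-- Unit-cost distance between two words, viewed as lists of characters.
#eval levenshtein Levenshtein.defaultCost "kitten".toList "sitting".toList
-- 3

-- With an empty source list, only insertions are possible.
#eval levenshtein Levenshtein.defaultCost ([] : List Char) "abc".toList
-- 3
```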

structure Levenshtein.Cost (α : Type u_4) (β : Type u_5) (δ : Type u_6) :

Type (max (max u_4 u_5) u_6)

A cost structure for Levenshtein edit distance.

delete : α → δ
Cost to delete an element from a list.
insert : β → δ
Cost to insert an element into a list.
substitute : α → β → δ
Cost to substitute one element for another in a list.
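A cost structure is just three functions, so a custom one can be written with `where` syntax. The sketch below is a hypothetical example (the name `caseInsensitiveCost` is not part of Mathlib) in which substituting a character for the same letter in a different case is free. It assumes `import Mathlib`.

```lean
import Mathlib

/-- Hypothetical cost structure: unit costs, but case differences are free. -/
def caseInsensitiveCost : Levenshtein.Cost Char Char ℕ where
  delete _ := 1
  insert _ := 1
  substitute a b := if a.toLower = b.toLower then 0 else 1

#eval levenshtein caseInsensitiveCost "Hello".toList "hello".toList
-- 0
```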

@[simp]

theorem Levenshtein.defaultCost_delete {α : Type u_1} [DecidableEq α] :

∀ (x : α), Levenshtein.defaultCost.delete x = 1

@[simp]

theorem Levenshtein.defaultCost_insert {α : Type u_1} [DecidableEq α] :

∀ (x : α), Levenshtein.defaultCost.insert x = 1

@[simp]

theorem Levenshtein.defaultCost_substitute {α : Type u_1} [DecidableEq α] (a : α) (b : α) :

Levenshtein.defaultCost.substitute a b = if a = b then 0 else 1

def Levenshtein.defaultCost {α : Type u_1} [DecidableEq α] :

Levenshtein.Cost α α ℕ

The default cost structure: deleting or inserting an element costs 1, and substituting a for b costs 1 unless a = b, in which case it costs 0.

Equations

Levenshtein.defaultCost = { delete := fun (x : α) => 1, insert := fun (x : α) => 1, substitute := fun (a b : α) => if a = b then 0 else 1 }
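The three fields of the default cost structure can be evaluated directly. This is a minimal sketch assuming `import Mathlib`; the element type `Char` is chosen only for illustration.

```lean
import Mathlib

#eval (Levenshtein.defaultCost (α := Char)).delete 'a'          -- 1
#eval (Levenshtein.defaultCost (α := Char)).insert 'a'          -- 1
#eval (Levenshtein.defaultCost (α := Char)).substitute 'a' 'a'  -- 0
#eval (Levenshtein.defaultCost (α := Char)).substitute 'a' 'b'  -- 1
```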

instance Levenshtein.instInhabitedCostNat {α : Type u_1} [DecidableEq α] :

Inhabited (Levenshtein.Cost α α ℕ)

Equations

Levenshtein.instInhabitedCostNat = { default := Levenshtein.defaultCost }

@[simp]

theorem Levenshtein.weightCost_substitute {α : Type u_1} (f : α → ℕ) (a : α) (b : α) :

(Levenshtein.weightCost f).substitute a b = max (f a) (f b)

@[simp]

theorem Levenshtein.weightCost_insert {α : Type u_1} (f : α → ℕ) (b : α) :

(Levenshtein.weightCost f).insert b = f b

@[simp]

theorem Levenshtein.weightCost_delete {α : Type u_1} (f : α → ℕ) (a : α) :

(Levenshtein.weightCost f).delete a = f a

def Levenshtein.weightCost {α : Type u_1} (f : α → ℕ) :

Levenshtein.Cost α α ℕ

Cost structure given by a weight function f. Deleting or inserting an element a costs f a, and substituting a for b costs max (f a) (f b).

Equations

Levenshtein.weightCost f = { delete := fun (a : α) => f a, insert := fun (b : α) => f b, substitute := fun (a b : α) => max (f a) (f b) }
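To see how a weight function turns into costs, here is a small sketch with a hypothetical weight `caseWeight` (not part of Mathlib) that charges 2 for upper-case characters and 1 for everything else. It assumes `import Mathlib`. Note that, by the equation above, substituting an element for itself costs `f a` rather than 0.

```lean
import Mathlib

/-- Hypothetical weight: upper-case characters cost 2, all others cost 1. -/
def caseWeight (c : Char) : ℕ := if c.isUpper then 2 else 1

#eval (Levenshtein.weightCost caseWeight).delete 'A'          -- 2
#eval (Levenshtein.weightCost caseWeight).insert 'a'          -- 1
#eval (Levenshtein.weightCost caseWeight).substitute 'a' 'A'  -- 2
#eval (Levenshtein.weightCost caseWeight).substitute 'a' 'a'  -- 1
```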

@[simp]

theorem Levenshtein.stringLengthCost_substitute (a : String) (b : String) :

Levenshtein.stringLengthCost.substitute a b = max (String.length a) (String.length b)

@[simp]

theorem Levenshtein.stringLengthCost_delete (a : String) :

Levenshtein.stringLengthCost.delete a = String.length a

@[simp]

theorem Levenshtein.stringLengthCost_insert (b : String) :

Levenshtein.stringLengthCost.insert b = String.length b

def Levenshtein.stringLengthCost :

Levenshtein.Cost String String ℕ

Cost structure for strings, where cost is the length of the token.

Equations

Levenshtein.stringLengthCost = Levenshtein.weightCost String.length
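With this cost structure the elements being edited are whole strings (tokens), so the distance is taken between lists of strings. The sketch below assumes `import Mathlib`; the tokens are arbitrary.

```lean
import Mathlib

#eval Levenshtein.stringLengthCost.delete "token"          -- 5
#eval Levenshtein.stringLengthCost.substitute "ab" "abcd"  -- 4

-- Deleting the single token "ab" costs its length.
#eval levenshtein Levenshtein.stringLengthCost ["ab"] []      -- 2
-- Substituting a token for itself costs its length, so this is not 0.
#eval levenshtein Levenshtein.stringLengthCost ["ab"] ["ab"]  -- 2
```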

@[simp]

theorem Levenshtein.stringLogLengthCost_insert (b : String) :

Levenshtein.stringLogLengthCost.insert b = Nat.log2 (String.length b + 1)

@[simp]

theorem Levenshtein.stringLogLengthCost_substitute (a : String) (b : String) :

Levenshtein.stringLogLengthCost.substitute a b = max (Nat.log2 (String.length a + 1)) (Nat.log2 (String.length b + 1))

@[simp]

theorem Levenshtein.stringLogLengthCost_delete (a : String) :

Levenshtein.stringLogLengthCost.delete a = Nat.log2 (String.length a + 1)

def Levenshtein.stringLogLengthCost :

Levenshtein.Cost String String ℕ

Cost structure for strings, where the cost of a token s is the base-2 logarithm of its length plus one, Nat.log2 (String.length s + 1).

Equations

Levenshtein.stringLogLengthCost = Levenshtein.weightCost fun (s : String) => Nat.log2 (String.length s + 1)
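A few sample values make the logarithmic scaling concrete. This sketch assumes `import Mathlib`; the strings are arbitrary.

```lean
import Mathlib

-- Nat.log2 (0 + 1) = 0: the empty token is free to insert.
#eval Levenshtein.stringLogLengthCost.insert ""         -- 0
-- Nat.log2 (3 + 1) = 2
#eval Levenshtein.stringLogLengthCost.insert "abc"      -- 2
-- Nat.log2 (7 + 1) = 3
#eval Levenshtein.stringLogLengthCost.delete "abcdefg"  -- 3
-- Substitution takes the larger of the two costs.
#eval Levenshtein.stringLogLengthCost.substitute "abc" "abcdefg"  -- 3
```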

def Levenshtein.impl {α : Type u_1} {β : Type u_2} {δ : Type u_3} [AddZeroClass δ] [Min δ] (C : Levenshtein.Cost α β δ) (xs : List α) (y : β) (d : { r : List δ // 0 < List.length r }) :

{ r : List δ // 0 < List.length r }

(Implementation detail for levenshtein)

Given a list xs and the Levenshtein distances from each suffix of xs to some other list ys, compute the Levenshtein distances from each suffix of xs to y :: ys.

(Note that we don't actually need to know ys itself here, so it is not an argument.)

The return value is a list of length xs.length + 1 (provided the input list of distances has that length), and it is convenient for the recursive calls that we bundle this list with a proof that it is non-empty.
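A single step of this function can be run by hand. In the sketch below, `rowForNil` is a hypothetical name for the distances from the suffixes of "ab" to the empty list under unit costs; one call to `Levenshtein.impl` then yields the distances from the same suffixes to `['a']`. It assumes `import Mathlib`.

```lean
import Mathlib

/-- Distances from the suffixes "ab", "b", "" to the empty list (unit costs):
two deletions, one deletion, no operations. -/
def rowForNil : { r : List ℕ // 0 < r.length } := ⟨[2, 1, 0], by simp⟩

-- One step: distances from the same three suffixes to ['a'].
#eval (Levenshtein.impl Levenshtein.defaultCost "ab".toList 'a' rowForNil).1
-- [1, 1, 1]
```

The result has one entry per suffix of "ab", and none of the remaining elements of the second list were needed to compute it.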

theorem Levenshtein.impl_cons {α : Type u_1} {β : Type u_2} {δ : Type u_3} [AddZeroClass δ] [Min δ] {C : Levenshtein.Cost α β δ} (x : α) (xs : List α) (y : β) (d : δ) (ds : List δ) (w : 0 < List.length (d :: ds)) (w' : 0 < List.length ds) :

Levenshtein.impl C (x :: xs) y { val := d :: ds, property := w } = match Levenshtein.impl C xs y { val := ds, property := w' } with | { val := r, property := w } => { val := min (C.delete x + r[0]) (min (C.insert y + d) (C.substitute x y + ds[0])) :: r, property := ⋯ }

theorem Levenshtein.impl_cons_fst_zero {α : Type u_1} {β : Type u_2} {δ : Type u_3} [AddZeroClass δ] [Min δ] {C : Levenshtein.Cost α β δ} (x : α) (xs : List α) (y : β) (d : δ) (ds : List δ) (w : 0 < List.length (d :: ds)) (h : 0 < List.length ↑(Levenshtein.impl C (x :: xs) y { val := d :: ds, property := w })) (w' : 0 < List.length ds) :

(↑(Levenshtein.impl C (x :: xs) y { val := d :: ds, property := w }))[0] = match Levenshtein.impl C xs y { val := ds, property := w' } with | { val := r, property := w } => min (C.delete x + r[0]) (min (C.insert y + d) (C.substitute x y + ds[0]))

theorem Levenshtein.impl_length {α : Type u_1} {β : Type u_2} {δ : Type u_3} [AddZeroClass δ] [Min δ] {C : Levenshtein.Cost α β δ} (xs : List α) (y : β) (d : { r : List δ // 0 < List.length r }) (w : List.length ↑d = List.length xs + 1) :

List.length ↑(Levenshtein.impl C xs y d) = List.length xs + 1

def suffixLevenshtein {α : Type u_1} {β : Type u_2} {δ : Type u_3} [AddZeroClass δ] [Min δ] (C : Levenshtein.Cost α β δ) (xs : List α) (ys : List β) :

{ r : List δ // 0 < List.length r }

suffixLevenshtein C xs ys computes the Levenshtein distance (using the cost functions provided by a C : Cost α β δ) from each suffix of the list xs to the list ys.

The first element of this list is the Levenshtein distance from xs to ys.

Note that if the cost functions do not satisfy the inequalities

C.delete a + C.insert b ≥ C.substitute a b
C.substitute a b + C.substitute b c ≥ C.substitute a c

(or if any values are negative) then the edit distances calculated here may not agree with the general geodesic distance on the edit graph.
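For a concrete picture of the returned list, the sketch below evaluates the distances from every suffix of "kitten" to "sitting" under unit costs. It assumes `import Mathlib`; the words are arbitrary.

```lean
import Mathlib

-- Suffixes of "kitten": "kitten", "itten", "tten", "ten", "en", "n", "".
#eval (suffixLevenshtein Levenshtein.defaultCost "kitten".toList "sitting".toList).1
-- [3, 3, 4, 5, 6, 6, 7]
```

The head of the list, 3, is the distance between the two full words, and the last entry, 7, is the cost of inserting all of "sitting" into the empty list.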

theorem suffixLevenshtein_length {α : Type u_1} {β : Type u_2} {δ : Type u_3} [AddZeroClass δ] [Min δ] {C : Levenshtein.Cost α β δ} (xs : List α) (ys : List β) :

List.length ↑(suffixLevenshtein C xs ys) = List.length xs + 1

theorem suffixLevenshtein_eq {α : Type u_1} {β : Type u_2} {δ : Type u_3} [AddZeroClass δ] [Min δ] {C : Levenshtein.Cost α β δ} (xs : List α) (y : β) (ys : List β) :

Levenshtein.impl C xs y (suffixLevenshtein C xs ys) = suffixLevenshtein C xs (y :: ys)

def levenshtein {α : Type u_1} {β : Type u_2} {δ : Type u_3} [AddZeroClass δ] [Min δ] (C : Levenshtein.Cost α β δ) (xs : List α) (ys : List β) :

δ

levenshtein C xs ys computes the Levenshtein distance (using the cost functions provided by a C : Cost α β δ) from the list xs to the list ys.

Note that if the cost functions do not satisfy the inequalities

C.delete a + C.insert b ≥ C.substitute a b
C.substitute a b + C.substitute b c ≥ C.substitute a c

(or if any values are negative) then the edit distance calculated here may not agree with the general geodesic distance on the edit graph.
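The caveat about the inequalities can be illustrated with a deliberately skewed cost structure. The sketch below uses a hypothetical `skewedCost` (not part of Mathlib) in which substituting 'a' for 'c' directly costs 10, while going through 'b' would cost only 1 + 1. It assumes `import Mathlib`.

```lean
import Mathlib

/-- Hypothetical cost structure violating
`substitute a b + substitute b c ≥ substitute a c`. -/
def skewedCost : Levenshtein.Cost Char Char ℕ where
  delete _ := 100
  insert _ := 100
  substitute a b :=
    if a = b then 0
    else if a = 'a' ∧ b = 'c' then 10
    else 1

-- The computed distance uses the single direct substitution.
#eval levenshtein skewedCost ['a'] ['c']
-- 10

-- Two successive substitutions 'a' → 'b' → 'c' would be cheaper.
#eval skewedCost.substitute 'a' 'b' + skewedCost.substitute 'b' 'c'
-- 2
```

Here the recursion only ever considers one substitution per aligned pair, so it reports 10 even though a path of total cost 2 exists in the edit graph.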

Equations

levenshtein C xs ys = match suffixLevenshtein C xs ys with | { val := r, property := w } => r[0]

theorem suffixLevenshtein_nil_nil {α : Type u_1} {β : Type u_2} {δ : Type u_3} [AddZeroClass δ] [Min δ] {C : Levenshtein.Cost α β δ} :

↑(suffixLevenshtein C [] []) = [0]

theorem List.eq_of_length_one {α : Type u_1} (x : List α) (w : List.length x = 1) :

let_fun this := ⋯; x = [x[0]]

theorem suffixLevenshtein_nil' {α : Type u_1} {β : Type u_2} {δ : Type u_3} [AddZeroClass δ] [Min δ] {C : Levenshtein.Cost α β δ} (ys : List β) :

↑(suffixLevenshtein C [] ys) = [levenshtein C [] ys]

theorem suffixLevenshtein_cons₂ {α : Type u_1} {β : Type u_2} {δ : Type u_3} [AddZeroClass δ] [Min δ] {C : Levenshtein.Cost α β δ} (xs : List α) (y : β) (ys : List β) :

suffixLevenshtein C xs (y :: ys) = Levenshtein.impl C xs y (suffixLevenshtein C xs ys)

theorem suffixLevenshtein_cons₁_aux {δ : Type u_3} {x : { r : List δ // 0 < List.length r }} {y : { r : List δ // 0 < List.length r }} (w₀ : (↑x)[0] = (↑y)[0]) (w : List.tail ↑x = List.tail ↑y) :

x = y

theorem suffixLevenshtein_cons₁ {α : Type u_1} {β : Type u_2} {δ : Type u_3} [AddZeroClass δ] [Min δ] {C : Levenshtein.Cost α β δ} (x : α) (xs : List α) (ys : List β) :

suffixLevenshtein C (x :: xs) ys = { val := levenshtein C (x :: xs) ys :: ↑(suffixLevenshtein C xs ys), property := ⋯ }

theorem suffixLevenshtein_cons₁_fst {α : Type u_1} {β : Type u_2} {δ : Type u_3} [AddZeroClass δ] [Min δ] {C : Levenshtein.Cost α β δ} (x : α) (xs : List α) (ys : List β) :

↑(suffixLevenshtein C (x :: xs) ys) = levenshtein C (x :: xs) ys :: ↑(suffixLevenshtein C xs ys)

theorem suffixLevenshtein_cons_cons_fst_get_zero {α : Type u_1} {β : Type u_2} {δ : Type u_3} [AddZeroClass δ] [Min δ] {C : Levenshtein.Cost α β δ} (x : α) (xs : List α) (y : β) (ys : List β) (w : 0 < List.length ↑(suffixLevenshtein C (x :: xs) (y :: ys))) :

(↑(suffixLevenshtein C (x :: xs) (y :: ys)))[0] = match suffixLevenshtein C xs (y :: ys) with | { val := dx, property := property } => match suffixLevenshtein C (x :: xs) ys with | { val := dy, property := property_1 } => match suffixLevenshtein C xs ys with | { val := dxy, property := property_2 } => min (C.delete x + dx[0]) (min (C.insert y + dy[0]) (C.substitute x y + dxy[0]))

theorem suffixLevenshtein_eq_tails_map {α : Type u_1} {β : Type u_2} {δ : Type u_3} [AddZeroClass δ] [Min δ] {C : Levenshtein.Cost α β δ} (xs : List α) (ys : List β) :

↑(suffixLevenshtein C xs ys) = List.map (fun (xs' : List α) => levenshtein C xs' ys) (List.tails xs)
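The statement can be checked on a small input and also applied directly as a proof term. This is a sketch assuming `import Mathlib`; the inputs are arbitrary.

```lean
import Mathlib

-- Both sides of the equation on a concrete input.
#eval (suffixLevenshtein Levenshtein.defaultCost "abc".toList "b".toList).1
-- [2, 1, 1, 1]
#eval "abc".toList.tails.map fun xs' => levenshtein Levenshtein.defaultCost xs' "b".toList
-- [2, 1, 1, 1]

-- The general statement, specialised to characters and unit costs.
example (xs ys : List Char) :
    (suffixLevenshtein Levenshtein.defaultCost xs ys).1 =
      xs.tails.map fun xs' => levenshtein Levenshtein.defaultCost xs' ys :=
  suffixLevenshtein_eq_tails_map xs ys
```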

@[simp]

theorem levenshtein_nil_nil {α : Type u_1} {β : Type u_2} {δ : Type u_3} [AddZeroClass δ] [Min δ] {C : Levenshtein.Cost α β δ} :

levenshtein C [] [] = 0

@[simp]

theorem levenshtein_nil_cons {α : Type u_1} {β : Type u_2} {δ : Type u_3} [AddZeroClass δ] [Min δ] {C : Levenshtein.Cost α β δ} (y : β) (ys : List β) :

levenshtein C [] (y :: ys) = C.insert y + levenshtein C [] ys

@[simp]

theorem levenshtein_cons_nil {α : Type u_1} {β : Type u_2} {δ : Type u_3} [AddZeroClass δ] [Min δ] {C : Levenshtein.Cost α β δ} (x : α) (xs : List α) :

levenshtein C (x :: xs) [] = C.delete x + levenshtein C xs []

@[simp]

theorem levenshtein_cons_cons {α : Type u_1} {β : Type u_2} {δ : Type u_3} [AddZeroClass δ] [Min δ] {C : Levenshtein.Cost α β δ} (x : α) (xs : List α) (y : β) (ys : List β) :

levenshtein C (x :: xs) (y :: ys) = min (C.delete x + levenshtein C xs (y :: ys)) (min (C.insert y + levenshtein C (x :: xs) ys) (C.substitute x y + levenshtein C xs ys))
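The recursive characterisation can be instantiated at singleton lists to unfold one step of the distance for an arbitrary cost structure. This is a sketch assuming `import Mathlib`; it only restates the lemma above with `xs` and `ys` set to `[]`.

```lean
import Mathlib

example {α β δ : Type*} [AddZeroClass δ] [Min δ]
    (C : Levenshtein.Cost α β δ) (x : α) (y : β) :
    levenshtein C [x] [y] =
      min (C.delete x + levenshtein C [] [y])
        (min (C.insert y + levenshtein C [x] [])
          (C.substitute x y + levenshtein C [] [])) :=
  levenshtein_cons_cons x [] y []
```

The three remaining calls are covered by levenshtein_nil_cons, levenshtein_cons_nil, and levenshtein_nil_nil, which together with this lemma determine the distance on all inputs.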