Sparse Dynamic Programming for Longest Common Substring
Premise
This is NOT the Longest Common Subsequence problem. Sometimes, people confuse the two.
In the Longest Common Substring problem, given two strings $x$ and $y$, we need to return
a string $s$ such that:
$s$ is a substring of both $x$ and $y$, i.e. a common substring;
for each common substring of $x$ and $y$, let it be $t$, we have that
$t$ is not longer than $s$.
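To make the distinction concrete, here is a small hypothetical example (the helper name is mine): a common substring must occur contiguously in both strings, while a common subsequence need not.

```python
# Hypothetical example distinguishing the two problems.
x, y = "ABCBDAB", "BDCABA"

def is_common_substring(s, x, y):
    # A common substring must occur contiguously in both strings.
    return s in x and s in y

# "AB" occurs contiguously in both, so it is a common substring.
# "BCBA" is a common *subsequence* of x and y, but it does not occur
# contiguously in either, so it is not a common substring.
```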
Introduction
Let $n$ and $m$ denote the length of the strings $x$ and $y$, respectively.
Let
$\#(x,y)$ be the number
of pairs
$(i,j) \in \{1,\,\ldots,\,n\}\!\times\!\{1,\,\ldots,\,m\}$ such that $x[i]=y[j]$.
On this page, I present a sparse Dynamic Programming (DP) algorithm
for Longest Common Substring that I developed.
It requires $\Theta(n+m+\#(x,y))$ time [1] and $\Theta(\min(n,m))$ space.
Two optimizations, discussed in more detail at the end of the page, may further improve its speed.
In the GitHub repo you can find
a benchmark against a suffix tree, which I ran under an i.i.d. model while using one of those optimizations.
While that result is encouraging, I lack the domain
knowledge to conduct meaningful benchmarks in relevant scenarios, so contributions would be welcome.
A defining feature of this algorithm is its conceptual simplicity, so I also wrote this page to serve as an educational
resource. In particular, I follow the roadmap: Brute Force -> Dynamic Programming -> Sparse DP.
Check the preprint for a more complete explanation.
[1] To be precise, $\Theta(n+m+\#(x,y))$ expected amortized time. "Expected" due
to hash map lookups/insertions, "amortized" due to array appends.
This also holds for the other time complexities in the introduction.
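For intuition, $\#(x,y)$ depends only on symbol frequencies, so it can be computed in linear time. A minimal sketch (the function name is mine):

```python
from collections import Counter

def num_matching_pairs(x, y):
    # #(x,y): the number of index pairs (i, j) with x[i] == y[j].
    # Each symbol c contributes (occurrences in x) * (occurrences in y) pairs.
    cx, cy = Counter(x), Counter(y)
    return sum(cx[c] * cy[c] for c in cx)
```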
Contact
Email:
Basic Notations
Bear with this.
We denote by $\varepsilon$ the empty string. Let $u$ be a string. We denote by $|u|$ its length.
For all $i \in \{1,\,\ldots,\,|u|\}$, we denote by $u[i]$ the symbol at position $i$.
For all $i,j \in \{1,\,\ldots,\,|u|\}$, with $i\leq j$, we denote by $u[i \,.\,.\, j]$
the string $u[i]*u[i\!+\!1]* \ldots * u[j]$ (the asterisk denotes the concatenation operator). Let $s$ be a string such that $|s| \leq |u|$.
For all $i \in
\{|s|,\,\ldots,\,|u|\}$, we say that $s$ ends at position $i$ of $u$ iff $u[i\!-\!|s|\!+\!1 \,.\,.\, i] = s$.
We say that $s$ is a substring of $u$ iff it ends at some position $i \in \{|s|,\,\ldots,\,|u|\}$ of $u$.
From now on, let $x$ and $y$ be two strings, and let $n:=|x|$ and $m:=|y|$.
We say that $u$ is a common substring of
$x$ and $y$ iff $u$ is a substring of $x$ and a substring of $y$. We denote by $\mathrm{CS}$ the set of common
substrings of $x$ and $y$. We say that $u$ is a longest common substring of $x$ and $y$ iff $u \in
\mathrm{CS}$ and $|u| \geq |s|$ for all $s \in \mathrm{CS}$. We denote by $\mathrm{LCS}$ the set of
longest common substrings of $x$ and $y$. We define $I := \{1,\,\ldots,\,n\} \! \times \! \{1,\,\ldots,\,m\}$, and let $(i,j) \in I$.
We define $\mathrm{CS}(i,j)$ as the set of common substrings of $x$ and $y$ ending,
respectively, at positions $i$ and $j$. That is:
$$\mathrm{CS}(i,j) \quad := \quad \{s \in \mathrm{CS} \,:\, s=x[i\!-\!|s|\!+\!1 \,.\,.\, i] \, \land \, s=y[j\!-\!|s|\!+\!1 \,.\,.\, j]\}$$
We define $\mathrm{lcs}(i,j)$
as the longest element of $\mathrm{CS}(i,j)$. Let $i \in \{1,\,\ldots,\,n\}$. We define $\mathrm{CS}(i)$ as the set of
common substrings of $x$ and $y$ ending at position $i$ of $x$.
Indexing
For maths/presentation, we use $1$-based indexing. In code, we use $0$-based indexing.
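To make these conventions concrete, here is a small sketch (the helper names are mine) mirroring the "ends at" and "substring" definitions above, using $0$-based indices as the code on this page does:

```python
def ends_at(s, u, i):
    # True iff s ends at 0-based position i of u
    # (i.e. at position i+1 in the 1-based math notation).
    return i + 1 >= len(s) and u[i - len(s) + 1 : i + 1] == s

def is_substring(s, u):
    # s is a substring of u iff it ends at some valid position of u.
    # The empty string is a substring of every string.
    return s == "" or any(ends_at(s, u, i) for i in range(len(u)))
```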
Longest Common Substring problem
Given $x$ and $y$, return an element of $\mathrm{LCS}$.
Brute Force
The following observation allows us to identify a search space.
For each $s \in \mathrm{LCS}$, there exists a pair $(i,j) \in I$ such that $\mathrm{lcs}(i,j) = s$.
Proof
Let $s \in \mathrm{LCS}$. Since $s \in \mathrm{CS}$, it ends
at some position $i \in \{1,\,\ldots,\,n\}$
of $x$ and at some position $j \in \{1,\,\ldots,\,m\}$ of $y$. That is,
there exists $(i,j) \in I$ such that $s \in \mathrm{CS}(i,j)$. Since
$s \in \mathrm{LCS}$, we have $|s| \geq |u|$ for all $u \in
\mathrm{CS}(i,j)$. Therefore, $s = \mathrm{lcs}(i,j)$.
So, $\mathrm{LCS} \,\subseteq\, \{\mathrm{lcs}(i,j) : (i,j) \in I\}$. Let $(i,j) \in I$. How do we know
whether $\mathrm{lcs}(i,j) \in \mathrm{LCS}$?
If $x[i] \neq y[j]$, then
$\mathrm{lcs}(i,j) = \varepsilon$. Suppose that
$x[i] = y[j]$.
If $i = 1 \lor j = 1$, then $\mathrm{lcs}(i,j) = x[i]$.
Otherwise, $\mathrm{lcs}(i,j) =
\mathrm{lcs}(i\!-\!1,j\!-\!1) * x[i]$.
Proof
Suppose that $i > 1 \,\land\, j > 1 \,\land\, x[i]=y[j]$, otherwise it is trivial.
Let $\ell := |\mathrm{lcs}(i,j)|$. We have
$\ell \geq 1$ and
$\mathrm{lcs}(i,j) = x[i\!-\!\ell\!+\!1 \,.\,.\, i]
= y[j\!-\!\ell\!+\!1 \,.\,.\, j]$. Before continuing, we make an important
observation: if $i\!-\!\ell \geq 1\, \land \,j\!-\!\ell \geq 1$,
then $x[i\!-\!\ell]\neq y[j\!-\!\ell]$. In fact, if it were $x[i\!-\!\ell]=y[j\!-\!\ell]$,
then we would have $\mathrm{lcs}(i,j) \neq x[i\!-\!\ell\!+\!1 \,.\,.\, i]$. We continue by distinguishing two cases:
Suppose $\ell=1$. Since $i > 1 \,\land\, j > 1$,
we have $i\!-\!\ell \geq 1 \,\land\, j\!-\!\ell \geq 1$. So,
$\mathrm{lcs}(i\!-\!1,j\!-\!1) = \varepsilon$ and $\mathrm{lcs}(i,j) = x[i]$.
Consequently,
$\mathrm{lcs}(i,j) = \mathrm{lcs}(i\!-\!1,j\!-\!1) * x[i]$ holds.
Suppose $\ell > 1$. As we know,
$x[i\!-\!\ell\!+\!1 \,.\,.\, i\!-\!1]$ equals $y[j\!-\!\ell\!+\!1 \,.\,.\, j\!-\!1]$.
So, regardless of whether $i\!-\!\ell \geq 1 \,\land\, j\!-\!\ell \geq 1$ is true or not,
we have $\mathrm{lcs}(i\!-\!1,j\!-\!1) = x[i\!-\!\ell\!+\!1 \,.\,.\, i\!-\!1]$.
Since $\mathrm{lcs}(i,j) =
x[i\!-\!\ell\!+\!1 \,.\,.\, i] = x[i\!-\!\ell\!+\!1 \,.\,.\, i\!-\!1] * x[i]$, we conclude that
$\mathrm{lcs}(i,j) = \mathrm{lcs}(i\!-\!1,j\!-\!1) * x[i]$ holds.
These observations let us obtain our first algorithm:
def lcs_ij(x, y, i, j):
    if x[i] != y[j]: return ""
    if i == 0 or j == 0: return x[i]
    return lcs_ij(x, y, i-1, j-1) + x[i]

def lcs(x, y):
    n = len(x)
    m = len(y)
    ans = ""
    for i in range(n):
        for j in range(m):
            cur = lcs_ij(x, y, i, j)
            if len(cur) > len(ans):
                ans = cur
    return ans
It requires $O(n \cdot m \cdot \min^2(n,m))$ time and $O(\min(n,m))$ space. If the repeated concatenations inside
$\mathsf{lcs}\_\mathsf{ij}$ were to be removed, then the (worst case) time complexity would be $O(n \cdot m \cdot \min(n,m))$.
While we could remove them by optimizing $\mathsf{lcs}\_\mathsf{ij}$, we can do better. Observe:
Let $(i,j) \in I$ and $\ell := |\mathrm{lcs}(i,j)|$. If $\ell \geq |\mathrm{lcs}(i',j')|$ for all $(i',j') \in I$,
then $x[i\!-\!\ell\!+\!1 \,.\,.\, i] \in \mathrm{LCS}$.
Proof
If $\ell$ satisfies the above hypothesis, then by
2 we have $\mathrm{lcs}(i,j)
\in \mathrm{LCS}$. The string $\mathrm{lcs}(i,j)$ ends at position $i$ of $x$ and has length $\ell$,
so $x[i\!-\!\ell\!+\!1 \,.\,.\, i] \in \mathrm{LCS}$.
So, for each $(i,j) \in I$, it is sufficient to compute $|\mathrm{lcs}(i,j)|$ instead of $\mathrm{lcs}(i,j)$.
We can do that by adapting 3. Why is this better than merely optimizing $\mathsf{lcs}\_\mathsf{ij}$? Because
the time complexity still becomes $O(n \cdot m \cdot \min(n,m))$, but we also achieve $O(1)$ space and (arguably) a cleaner
algorithm. Here is the adapted code:
def len_lcs_ij(x, y, i, j):
    if x[i] != y[j]: return 0
    if i == 0 or j == 0: return 1
    return len_lcs_ij(x, y, i-1, j-1) + 1

def lcs(x, y):
    n = len(x)
    m = len(y)
    (p,q) = (0,0)
    l = 0
    for i in range(n):
        for j in range(m):
            cur_l = len_lcs_ij(x, y, i, j)
            if cur_l > l:
                (p,q) = (i,j)
                l = cur_l
    return x[p-l+1 : p+1] if l > 0 else ""
I named the pair that tracks the solution "$(p,q)$", rather than "$(i,j)$" as 4
would suggest; it reads better to reserve $i$ and $j$ for the loops, since Python does not allow "$i'$" as a variable name.
For clarity,
I stored the entire pair $(p,q)$ instead of just $p$. Later, I will drop the second coordinate.
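As a sanity check for any of the implementations on this page, one can cross-check against a simple (and slow) reference oracle. This harness is mine, not part of the algorithm:

```python
def lcs_len_reference(x, y):
    # Slow but obviously correct: collect all substrings of x,
    # then find the longest substring of y that also occurs in x.
    subs = {x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1)}
    best = 0
    for i in range(len(y)):
        for j in range(i + 1, len(y) + 1):
            if y[i:j] in subs:
                best = max(best, j - i)
    return best
```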
Dynamic Programming
For the remaining explanations (but not in the code), we assume that $n \geq m$; this is purely for clarity.
At the expense of $\Theta(m)$ space, we can reduce the time complexity to $\Theta(nm)$. Observe:
Let $(i,j) \in I$ with $i > 1 \, \land \, j > 1$.
Our approach involves computing both $|\mathrm{lcs}(i\!-\!1,j\!-\!1)|$
and $|\mathrm{lcs}(i,j)|$. However, if $x[i] = y[j]$,
then the latter depends on the former (recall
3). So, if we compute and store
$|\mathrm{lcs}(i\!-\!1,j\!-\!1)|$ before
$|\mathrm{lcs}(i,j)|$, then we can compute $|\mathrm{lcs}(i,j)|$ in constant time.
To employ 5, we introduce an auxiliary array
$\mathsf{row}$ of length $m$; moreover, we reverse the iteration order of the inner loop (that is, we scan $(1,\,\ldots,\,m)$ right to left).
During the first iteration of the outer loop, for each $j \in \{1,\,\ldots,\,m\}$ we store $|\mathrm{lcs}(1,j)|$ in $\mathsf{row}[j]$.
Let $i \in \{2,\,\ldots,\,n\}$, and consider the start of the $i$-th iteration of the outer loop. Suppose that, at this point,
for each $j \in \{1,\,\ldots,\,m\}$ we have $\mathsf{row}[j] = |\mathrm{lcs}(i\!-\!1,j)|$. Let $j \in \{2,\,\ldots,\,m\}$.
Because we scan $(1,\,\ldots,\,m)$ right to left, when we need $|\mathrm{lcs}(i\!-\!1,j\!-\!1)|$ we can still find it in $\mathsf{row}[j\!-\!1]$, which has not yet been overwritten during the current iteration. So, the value
$|\mathrm{lcs}(i,j)|$, which we store in $\mathsf{row}[j]$, can be computed in $O(1)$ time. Here is the new version:
def lcs(x, y):
    if len(y) > len(x): x,y = y,x
    n = len(x) ; m = len(y)
    p = 0 ; l = 0
    row = [0 for _ in range(m)]
    for i in range(n):
        for j in range(m-1, -1, -1):
            if x[i] == y[j]:
                row[j] = row[j-1]+1 if j > 0 else 1
            else:
                row[j] = 0
            if row[j] > l:
                p = i
                l = row[j]
    return x[p-l+1 : p+1] if l > 0 else ""
Note: this algorithm is very similar to one that is frequently found on the internet.
Sparse Dynamic Programming
We now reduce the time complexity to $\Theta(n + m + \#(x,y))$ (albeit in expected amortized terms),
while still requiring $\Theta(m)$ space. Recall that $\#(x,y)$ is the number of pairs $(i,j) \in I$ such that $x[i]=y[j]$.
The following observation exposes a key inefficiency of the dynamic programming algorithm:
Let $i \in \{1,\,\ldots,\,n\}$. At the start of the $i$-th iteration of the outer loop,
the variable $l$
contains a value not smaller than $0$. Thus, for each $j \in \{1,\,\ldots,\,m\}$
such that $|\mathrm{lcs}(i,j)|
= 0$, the check $\mathsf{row}[j] > l$ is guaranteed to fail.
That is, each pair $(i,j) \in I$ such that $x[i] \neq y[j]$ is irrelevant to improving the current best solution.
So, we skip such pairs. To this end:
During the initialization phase, we build a hash map mapping each symbol $\sigma$
of $y$
to the strictly increasing ordering of $\{j \in \{1,\,\ldots,\,m\} : y[j]=\sigma \}$.
It can be built in $\Theta(m)$ expected amortized time, requires $\Theta(m)$ space, and might be replaceable by an array.
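For instance, for y = "banana" the map sends each symbol to the increasing list of its (0-based) positions. A minimal sketch (the function name is mine; the full code below uses an equivalent build_map):

```python
def build_positions(s):
    # Map each symbol of s to the strictly increasing list
    # of its 0-based positions in s.
    d = {}
    for j, c in enumerate(s):
        d.setdefault(c, []).append(j)
    return d
```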
Skipping the pairs $(i,j) \in I$ such that $x[i] \neq y[j]$ introduces an issue. To be clear, we are currently
considering the DP algorithm with the hash map trick. Let $(i,j) \in I$ be such that $i > 1 \,\land\, j > 1 \,\land\, x[i]=y[j]$.
When it considers this pair, the algorithm relies on $\mathsf{row}[j\!-\!1]$ to compute
$|\mathrm{lcs}(i,j)|$. If $x[i\!-\!1]=y[j\!-\!1]$, then
$\mathsf{row}[j\!-\!1]$ was last modified
during the $(i\!-\!1)$-th iteration of the outer loop; so, it currently contains $|\mathrm{lcs}(i\!-\!1,j\!-\!1)|$.
However, if $x[i\!-\!1]\neq y[j\!-\!1]$, then either $\mathsf{row}[j\!-\!1]$
was never modified after its initialization, or it was last modified during the $i'$-th iteration of the outer loop, for some $i' \in \{1,\,\ldots,\,i\!-\!2\}$.
In the second case, it currently contains $|\mathrm{lcs}(i',j\!-\!1)|$, which is greater than $0$.
Since $|\mathrm{lcs}(i\!-\!1,j\!-\!1)| = 0$, relying on $\mathsf{row}[j\!-\!1]$
to compute $|\mathrm{lcs}(i,j)|$ would be wrong.
We solve this issue by storing the last position at which $\mathsf{row}[j\!-\!1]$ was modified:
We augment each element of $\mathsf{row}$
to contain two fields: $\mathsf{len}$ and $\mathsf{upd}$. During the initialization phase, we set to $-1$ the
$\mathsf{upd}$ field of each element. Then, for each $(i,j) \in I$
such that $x[i]=y[j]$, we assign $|\mathrm{lcs}(i,j)|$ to $\mathsf{row}[j]\mathsf{.len}$,
and $i$ to $\mathsf{row}[j]\mathsf{.upd}$.
Let's see why this works. Consider a pair $(i,j) \in I$ such that $i > 1 \,\land\, j > 1 \,\land\, x[i]=y[j]$.
In order to compute $|\mathrm{lcs}(i,j)|$,
it is sufficient to check $\mathsf{row}[j\!-\!1]\mathsf{.upd}$:
Suppose that $\mathsf{row}[j\!-\!1]\mathsf{.upd} \neq i\!-\!1$.
If $x[i\!-\!1] = y[j\!-\!1]$ were true, then $\mathsf{row}[j\!-\!1]\mathsf{.upd}$
would contain $i\!-\!1$.
So, we find that $|\mathrm{lcs}(i\!-\!1,j\!-\!1)| = 0$.
Suppose that $\mathsf{row}[j\!-\!1]\mathsf{.upd} = i\!-\!1$.
By contradiction, suppose that $x[i\!-\!1] \neq y[j\!-\!1]$ holds. The algorithm cannot have considered the pair
$(i\!-\!1,j\!-\!1)$, so $\mathsf{row}[j\!-\!1]$ was not modified during the $(i\!-\!1)$-th iteration of the outer loop.
If no prior iteration had modified $\mathsf{row}[j\!-\!1].\mathsf{upd}$, then it would now contain $-1$. This cannot have
happened, since $i\!-\!1 \geq 1 > -1$. So, it must have been modified during a prior outer loop iteration.
Let $i'$ identify the latest such iteration. During the $i'$-th iteration, $\mathsf{row}[j\!-\!1].\mathsf{upd}$ has been set to $i'$.
So, now it contains $i'$. However, $i'$ necessarily belongs to
$\{1,\,\ldots,\,i\!-\!2\}$. This contradicts the hypothesis $\mathsf{row}[j\!-\!1]\mathsf{.upd} = i\!-\!1$.
Consequently, we establish that $x[i\!-\!1] = y[j\!-\!1]$ is true.
This implies that $\mathsf{row}[j\!-\!1]\mathsf{.len}$ contains
$|\mathrm{lcs}(i\!-\!1,j\!-\!1)|$.
We finally obtain our sparse DP algorithm (I renamed $\mathsf{len}$ to $\mathsf{length}$, since $\mathsf{len}$ is a Python built-in):
from dataclasses import dataclass

@dataclass
class Cell:
    length: int
    upd: int

def build_map(s):
    # Map each symbol to the increasing list of its 0-based positions in s.
    d = dict()
    for i,c in enumerate(s):
        if c not in d: d[c] = []
        d[c].append(i)
    return d

def lcs(x, y):
    if len(y) > len(x): x,y = y,x
    n = len(x) ; m = len(y)
    p = 0 ; l = 0
    mapEnds = build_map(y)
    # upd starts at -2: with 0-based indexing, i-1 can be -1,
    # so -1 would not be a safe "never updated" sentinel here.
    row = [Cell(length=0,upd=-2) for _ in range(m)]
    for i in range(n):
        lstEnds = mapEnds.get(x[i],[])
        # Scan the matching positions right to left, so that row[j-1]
        # still holds its value from a previous outer iteration.
        for j in reversed(lstEnds):
            row[j].upd = i
            if j-1 >= 0 and row[j-1].upd == i-1:
                row[j].length = row[j-1].length+1
            else:
                row[j].length = 1
            if row[j].length > l:
                l = row[j].length
                p = i
    return x[p-l+1 : p+1] if l > 0 else ""
Even Further Beyond!
I hope you got the quote.
Let $k \in \{1,\,\ldots,\,\min(n,m)\}$. As an attempt to "increase sparsity",
we can run our algorithm on the sequences $x^{(k)}$ and $y^{(k)}$, defined as:
$$x^{(k)} \quad := \quad (x[1 \,.\,.\, k],\,\, x[2 \,.\,.\, k\!+\!1],\,\, \ldots,\,\,
x[n\!-\!k\!+\!1 \,.\,.\, n])$$
$$y^{(k)} \quad := \quad (y[1 \,.\,.\, k],\,\, y[2 \,.\,.\, k\!+\!1],\,\, \ldots,\,\, y[m\!-\!k\!+\!1 \,.\,.\, m])$$
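Concretely, $x^{(k)}$ is the sequence of length-$k$ windows of $x$, read left to right. A minimal sketch (the function name is mine):

```python
def windows(s, k):
    # The sequence s^(k): all length-k substrings of s, left to right.
    # Requires 1 <= k <= len(s).
    return [s[i:i + k] for i in range(len(s) - k + 1)]
```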
This optimization requires almost no modification to the code. Its correctness is discussed in the
preprint.
If there exists $s \in \mathrm{CS}$ such that $|s| \geq k$, then the time complexity becomes
$\Theta(k(n+m-k) + \#(x^{(k)},y^{(k)}))$, while the space complexity becomes $\Theta(k(n+m-k))$.
In this case, bit-packing (if feasible) allows us to further reduce the complexities: they become
$\Theta(n+m-k + \#(x^{(k)},y^{(k)}))$ and $\Theta(n+m-k)$, respectively. Both time complexities are still
expressed in expected and amortized terms.
Otherwise, if no suitable $s \in \mathrm{CS}$ exists, then our sparse DP
won't find a solution. So, we will need to search for one by considering the interval of lengths $[0,k\!-\!1]$. To this end,
we can use linear search (or something faster).
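Something faster than linear search is possible because the predicate "there exists a common substring of length at least $k$" is monotone in $k$, so binary search over the interval applies. A hypothetical sketch of the predicate (via a slow $k$-gram check, not the sparse DP):

```python
def has_common_substring_of_length(x, y, k):
    # Monotone predicate: True for all k up to the LCS length, False beyond.
    if k == 0:
        return True  # the empty string is always a common substring
    grams = {x[i:i + k] for i in range(len(x) - k + 1)}
    return any(y[j:j + k] in grams for j in range(len(y) - k + 1))
```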