Sparse Dynamic Programming for Longest Common Substring

Premise

This is NOT the Longest Common Subsequence problem. Sometimes, people confuse the two.
In the Longest Common Substring problem, given two strings $x$ and $y$, we need to return a string $s$ of maximum length such that $s$ is a substring of both $x$ and $y$.

Introduction

Let $n$ and $m$ denote the lengths of the strings $x$ and $y$, respectively. Let $\#(x,y)$ be the number of pairs $(i,j) \in \{1,\,\ldots,\,n\}\!\times\!\{1,\,\ldots,\,m\}$ such that $x[i]=y[j]$. In this page, I present a sparse Dynamic Programming (DP) algorithm for Longest Common Substring that I have developed. It requires $\Theta(n+m+\#(x,y))$ time [1] and $\Theta(\min(n,m))$ space. Two optimizations, discussed in more detail at the end of the page, may allow us to further improve its speed.

In the GitHub repo you can find a benchmark against a suffix tree, which I ran in an i.i.d. model while using one of those optimizations. While that result is encouraging, I lack the domain knowledge to conduct meaningful benchmarks in relevant scenarios. So, contributions would be welcome.

A defining feature of this algorithm is its conceptual simplicity. So, I wrote this page also to provide an educational resource. In particular, I follow the roadmap: Brute Force -> Dynamic Programming -> Sparse DP.
Check the preprint for a more complete explanation.

[1] To be precise, $\Theta(n+m+\#(x,y))$ expected amortized time.
"Expected" due to hash map lookups/insertions, "amortized" due to array appends.
This also holds for the other time complexities in the introduction.

Contact

Email:

Basic Notations

Bear with this.

We denote by $\varepsilon$ the empty string. Let $u$ be a string. We denote by $|u|$ its length. For all $i \in \{1,\,\ldots,\,|u|\}$, we denote by $u[i]$ the symbol at position $i$. For all $i,j \in \{1,\,\ldots,\,|u|\}$, with $i\leq j$, we denote by $u[i \,.\,.\, j]$ the string $u[i]*u[i\!+\!1]* \ldots * u[j]$ (the asterisk denotes the concatenation operator). Let $s$ be a string such that $|s| \leq |u|$. For all $i \in \{|s|,\,\ldots,\,|u|\}$, we say that $s$ ends at position $i$ of $u$ iff $u[i\!-\!|s|\!+\!1 \,.\,.\, i] = s$. We say that $s$ is a substring of $u$ iff it ends at some position $i \in \{|s|,\,\ldots,\,|u|\}$ of $u$.
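To make the "ends at position $i$" predicate concrete, here is a small Python check; the helper name and its 1-based interface are mine, for illustration only (the code later in this page is 0-based).

```python
def ends_at(s, u, i):
	# True iff s ends at (1-based) position i of u,
	# i.e. u[i-|s|+1 .. i] = s in the notation above.
	return len(s) <= i <= len(u) and u[i - len(s):i] == s

# Example: "ana" ends at positions 4 and 6 of "banana", but not at 5.
```

So "ana" is a substring of "banana" because it ends at some position of it.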

From now on, let $x$ and $y$ be two strings, and let $n:=|x|$ and $m:=|y|$. We say that $u$ is a common substring of $x$ and $y$ iff $u$ is a substring of $x$ and a substring of $y$. We denote by $\mathrm{CS}$ the set of common substrings of $x$ and $y$. We say that $u$ is a longest common substring of $x$ and $y$ iff $u \in \mathrm{CS}$ and $|u| \geq |s|$ for all $s \in \mathrm{CS}$. We denote by $\mathrm{LCS}$ the set of longest common substrings of $x$ and $y$. We denote by $I$ the set of pairs $\{1,\,\ldots,\,n\} \! \times \! \{1,\,\ldots,\,m\}$. Let $(i,j) \in I$. We define $\mathrm{CS}(i,j)$ as the set of common substrings of $x$ and $y$ ending, respectively, at positions $i$ and $j$. That is: $$\mathrm{CS}(i,j) \quad := \quad \{s \in \mathrm{CS} \,:\, s=x[i\!-\!|s|\!+\!1 \,.\,.\, i] \, \land \, s=y[j\!-\!|s|\!+\!1 \,.\,.\, j]\}$$ We define $\mathrm{lcs}(i,j)$ as the longest element of $\mathrm{CS}(i,j)$. Let $i \in \{1,\,\ldots,\,n\}$. We define $\mathrm{CS}(i)$ as the set of common substrings of $x$ and $y$ ending at position $i$ of $x$.

Indexing

For maths/presentation, we use $1$-based indexing. In code, we use $0$-based indexing.

Longest Common Substring problem

Given $x$ and $y$, return an element of $\mathrm{LCS}$.

Brute Force

The following observation allows us to identify a search space.
Observation 1
For each $s \in \mathrm{LCS}$, there exists a pair $(i,j) \in I$ such that $\mathrm{lcs}(i,j) = s$.
Proof. Let $s \in \mathrm{LCS}$. Since $s \in \mathrm{CS}$, it ends at some position $i \in \{1,\,\ldots,\,n\}$ of $x$ and at some position $j \in \{1,\,\ldots,\,m\}$ of $y$. That is, there exists $(i,j) \in I$ such that $s \in \mathrm{CS}(i,j)$. Since $s \in \mathrm{LCS}$, we have $|s| \geq |u|$ for all $u \in \mathrm{CS}(i,j)$. Therefore, $s = \mathrm{lcs}(i,j)$.
So, $\mathrm{LCS} \,\subseteq\, \{\mathrm{lcs}(i,j) : (i,j) \in I\}$. Let $(i,j) \in I$. How do we know whether $\mathrm{lcs}(i,j) \in \mathrm{LCS}$?
Observation 2
If $|\mathrm{lcs}(i,j)| \geq |\mathrm{lcs}(i',j')|$ for all $(i',j') \in I$, then $\mathrm{lcs}(i,j) \in \mathrm{LCS}$.
Proof. Suppose that $|\mathrm{lcs}(i,j)| \geq |\mathrm{lcs}(i',j')|$ for all $(i',j') \in I$. By definition of $\mathrm{lcs}(i',j')$: $$\forall (i',j') \in I \quad \quad |\mathrm{lcs}(i,j)| \,\, \geq \,\, \max \{|s| : s \in \mathrm{CS}(i',j')\}$$ Noting that: $$\bigcup_{(i',j') \in I}{\mathrm{CS}(i',j')} \quad = \quad \mathrm{CS}$$ we establish that $|\mathrm{lcs}(i,j)| \geq \max \{|s| : s \in \mathrm{CS}\}$. So, $\mathrm{lcs}(i,j) \in \mathrm{LCS}$.
How do we compute $\mathrm{lcs}(i,j)$ ?
Observation 3
If $x[i] \neq y[j]$, then $\mathrm{lcs}(i,j) = \varepsilon$. Suppose that $x[i] = y[j]$. If $i = 1 \lor j = 1$, then $\mathrm{lcs}(i,j) = x[i]$.
Otherwise, $\mathrm{lcs}(i,j) = \mathrm{lcs}(i\!-\!1,j\!-\!1) * x[i]$.
Proof. Suppose that $i > 1 \,\land\, j > 1 \,\land\, x[i]=y[j]$, otherwise the claim is trivial. Let $\ell := |\mathrm{lcs}(i,j)|$. We have $\ell \geq 1$ and $\mathrm{lcs}(i,j) = x[i\!-\!\ell\!+\!1 \,.\,.\, i] = y[j\!-\!\ell\!+\!1 \,.\,.\, j]$. Before continuing, we make an important observation: if $i\!-\!\ell \geq 1\, \land \,j\!-\!\ell \geq 1$, then $x[i\!-\!\ell]\neq y[j\!-\!\ell]$. In fact, if it were $x[i\!-\!\ell]=y[j\!-\!\ell]$, then we would have $\mathrm{lcs}(i,j) \neq x[i\!-\!\ell\!+\!1 \,.\,.\, i]$. We continue by distinguishing two cases:
  • Suppose $\ell=1$. Since $i > 1 \,\land\, j > 1$, we have $i\!-\!\ell \geq 1 \,\land\, j\!-\!\ell \geq 1$. So, $\mathrm{lcs}(i\!-\!1,j\!-\!1) = \varepsilon$ and $\mathrm{lcs}(i,j) = x[i]$. Consequently, $\mathrm{lcs}(i,j) = \mathrm{lcs}(i\!-\!1,j\!-\!1) * x[i]$ holds.
  • Suppose $\ell > 1$. As we know, $x[i\!-\!\ell\!+\!1 \,.\,.\, i\!-\!1]$ equals $y[j\!-\!\ell\!+\!1 \,.\,.\, j\!-\!1]$. So, regardless of whether $i\!-\!\ell \geq 1 \,\land\, j\!-\!\ell \geq 1$ is true or not, we have $\mathrm{lcs}(i\!-\!1,j\!-\!1) = x[i\!-\!\ell\!+\!1 \,.\,.\, i\!-\!1]$. Since $\mathrm{lcs}(i,j) = x[i\!-\!\ell\!+\!1 \,.\,.\, i] = x[i\!-\!\ell\!+\!1 \,.\,.\, i\!-\!1] * x[i]$, we conclude that $\mathrm{lcs}(i,j) = \mathrm{lcs}(i\!-\!1,j\!-\!1) * x[i]$ holds.
These observations let us obtain our first algorithm:
def lcs_ij(x, y, i, j):
	# Computes lcs(i, j) via the recursion of Observation 3 (0-based indices).
	if x[i] != y[j]: return ""
	if i == 0 or j == 0: return x[i]
	return lcs_ij(x, y, i-1, j-1) + x[i]

def lcs(x, y):
	n = len(x)
	m = len(y)
	ans = ""
	for i in range(n):
		for j in range(m):
			cur = lcs_ij(x, y, i, j)
			if len(cur) > len(ans):
				ans = cur
	return ans
It requires $O(n \cdot m \cdot \min^2(n,m))$ time and $O(\min(n,m))$ space. If the repeated concatenations inside $\mathsf{lcs}\_\mathsf{ij}$ were removed, the (worst case) time complexity would drop to $O(n \cdot m \cdot \min(n,m))$. While we could remove them by optimizing $\mathsf{lcs}\_\mathsf{ij}$, we can do better. Observe:
Observation 4
Let $(i,j) \in I$ and $\ell := |\mathrm{lcs}(i,j)|$. If $\ell \geq |\mathrm{lcs}(i',j')|$ for all $(i',j') \in I$, then $x[i\!-\!\ell\!+\!1 \,.\,.\, i] \in \mathrm{LCS}$.
Proof. If $\ell$ satisfies the above hypothesis, then by Observation 2 we have $\mathrm{lcs}(i,j) \in \mathrm{LCS}$. The string $\mathrm{lcs}(i,j)$ ends at position $i$ of $x$ and has length $\ell$, so $x[i\!-\!\ell\!+\!1 \,.\,.\, i] = \mathrm{lcs}(i,j) \in \mathrm{LCS}$.
So, for each $(i,j) \in I$, it is sufficient to compute $|\mathrm{lcs}(i,j)|$ instead of $\mathrm{lcs}(i,j)$. We can do that by adapting Observation 3. Why is this better than optimizing $\mathsf{lcs}\_\mathsf{ij}$? Because the time complexity becomes $O(n \cdot m \cdot \min(n,m))$ as well, but we also achieve $O(1)$ space (ignoring the recursion stack) and (arguably) a cleaner algorithm. Here is the adapted code:
def len_lcs_ij(x, y, i, j):
	# Computes |lcs(i, j)| via the recursion of Observation 3 (0-based indices).
	if x[i] != y[j]: return 0
	if i == 0 or j == 0: return 1
	return len_lcs_ij(x, y, i-1, j-1) + 1

def lcs(x, y):
	n = len(x)
	m = len(y)
	(p,q) = (0,0)
	l = 0
	for i in range(n):
		for j in range(m):
			cur_l = len_lcs_ij(x, y, i, j)
			if cur_l > l:
				(p,q) = (i,j)
				l = cur_l
	return x[p-l+1 : p+1] if l > 0 else ""
I named "$(p,q)$" the pair for tracking the solution, instead of "$(i,j)$" as Observation 4 would suggest; this is because it reads better to use $i$ and $j$ for the loops, since Python does not allow "$i'$" as a variable name. For clarity, I stored the entire pair $(p,q)$ instead of just $p$. Later, I will drop the second coordinate.

Dynamic Programming

For the remaining explanations, but not in the code, we assume that $n \geq m$; this is purely for clarity. At the expense of $\Theta(m)$ space, we can reduce the time complexity to $\Theta(nm)$. Observe:
Observation 5
Let $(i,j) \in I$ with $i > 1 \, \land \, j > 1$. Our approach involves computing both $|\mathrm{lcs}(i\!-\!1,j\!-\!1)|$ and $|\mathrm{lcs}(i,j)|$. However, if $x[i] = y[j]$, then the latter depends on the former (recall Observation 3). So, if we compute and store $|\mathrm{lcs}(i\!-\!1,j\!-\!1)|$ before $|\mathrm{lcs}(i,j)|$, then we can compute $|\mathrm{lcs}(i,j)|$ in constant time.
To employ Observation 5, we introduce an auxiliary array $\mathsf{row}$ of length $m$; moreover, we reverse the iteration order of the inner loop (that is, we scan $(1,\,\ldots,\,m)$ right to left, so that each $\mathsf{row}[j\!-\!1]$ is read before it is overwritten). During the first iteration of the outer loop, for each $j \in \{1,\,\ldots,\,m\}$ we store $|\mathrm{lcs}(1,j)|$ in $\mathsf{row}[j]$. Let $i \in \{2,\,\ldots,\,n\}$, and consider the start of the $i$-th iteration of the outer loop. Suppose that, at this point, for each $j \in \{1,\,\ldots,\,m\}$ we have $\mathsf{row}[j] = |\mathrm{lcs}(i\!-\!1,j)|$. Let $j \in \{2,\,\ldots,\,m\}$. Because we scan $(1,\,\ldots,\,m)$ right to left, if we need $|\mathrm{lcs}(i\!-\!1,j\!-\!1)|$, then we can find it in $\mathsf{row}[j\!-\!1]$. So, the value $|\mathrm{lcs}(i,j)|$, which we store in $\mathsf{row}[j]$, can be computed in $O(1)$ time. Here is the new version:
def lcs(x, y):
	# Ensure y is the shorter string, so that row uses Theta(min(n,m)) space.
	if len(y) > len(x): x,y = y,x
	n = len(x) ; m = len(y)
	p = 0 ; l = 0
	row = [0 for _ in range(m)]
	for i in range(n):
		# Scan right to left: row[j-1] still holds |lcs(i-1, j-1)|.
		for j in range(m-1, -1, -1):
			if x[i] == y[j]:
				row[j] = row[j-1]+1 if j > 0 else 1
			else:
				row[j] = 0
			if row[j] > l:
				p = i
				l = row[j]
	return x[p-l+1 : p+1] if l > 0 else ""
Note: this algorithm is very similar to one that is frequently found on the internet.

Sparse Dynamic Programming

We now reduce the time complexity to $\Theta(n + m + \#(x,y))$ (again in expected amortized terms), while still requiring $\Theta(m)$ space. Recall that $\#(x,y)$ is the number of pairs $(i,j) \in I$ such that $x[i]=y[j]$.
The following observation exposes a key inefficiency of the dynamic programming algorithm:
Observation 6
Let $i \in \{1,\,\ldots,\,n\}$. At the start of the $i$-th iteration of the outer loop, the variable $l$ holds a value not smaller than $0$. Thus, for each $j \in \{1,\,\ldots,\,m\}$ such that $|\mathrm{lcs}(i,j)| = 0$, the check $\mathsf{row}[j] > l$ is guaranteed to fail.
That is, each pair $(i,j) \in I$ such that $x[i] \neq y[j]$ is irrelevant to improving the current best solution.
So, we skip them. To this end:
Observation 7
During the initialization phase, we build a hash map mapping each symbol $\sigma$ of $y$ to the strictly increasing ordering of $\{j \in \{1,\,\ldots,\,m\} : y[j]=\sigma \}$ .
It can be built in $\Theta(m)$ expected amortized time, requires $\Theta(m)$ space, and might be replaceable by an array.
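This map can be illustrated with a concrete string. The sketch below is essentially the $\mathsf{build}\_\mathsf{map}$ used in the final code on this page (0-based positions, per the indexing note above); the use of $\mathsf{setdefault}$ is my stylistic choice.

```python
def build_map(s):
	# Maps each symbol of s to the increasing list of its (0-based) positions.
	d = {}
	for j, c in enumerate(s):
		d.setdefault(c, []).append(j)
	return d

# build_map("banana") -> {'b': [0], 'a': [1, 3, 5], 'n': [2, 4]}
```

Since positions are appended in increasing order of $j$, each list comes out sorted for free.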
Skipping the pairs $(i,j) \in I$ such that $x[i] \neq y[j]$ introduces an issue. To be clear, we are currently considering the DP algorithm with the hash map trick. Let $(i,j) \in I$ be such that $i > 1 \,\land\, j > 1 \,\land\, x[i]=y[j]$. When it considers this pair, the algorithm relies on $\mathsf{row}[j\!-\!1]$ to compute $|\mathrm{lcs}(i,j)|$. If $x[i\!-\!1]=y[j\!-\!1]$, then $\mathsf{row}[j\!-\!1]$ was last modified during the $(i\!-\!1)$-th iteration of the outer loop; so, it currently contains $|\mathrm{lcs}(i\!-\!1,j\!-\!1)|$. However, if $x[i\!-\!1]\neq y[j\!-\!1]$, then either $\mathsf{row}[j\!-\!1]$ was never modified after its initialization, or it was last modified during the $i'$-th iteration of the outer loop, for some $i' \in \{1,\,\ldots,\,i\!-\!2\}$. In the second case, it currently contains $|\mathrm{lcs}(i',j\!-\!1)|$, which is greater than $0$. Since $|\mathrm{lcs}(i\!-\!1,j\!-\!1)| = 0$, relying on $\mathsf{row}[j\!-\!1]$ to compute $|\mathrm{lcs}(i,j)|$ would be wrong.
We solve this issue by storing the last position at which $\mathsf{row}[j\!-\!1]$ was modified:
Observation 8
We augment each element of $\mathsf{row}$ to contain two fields: $\mathsf{len}$ and $\mathsf{upd}$. During the initialization phase, we set to $-1$ the $\mathsf{upd}$ field of each element. Then, for each $(i,j) \in I$ such that $x[i]=y[j]$, we assign $|\mathrm{lcs}(i,j)|$ to $\mathsf{row}[j]\mathsf{.len}$, and $i$ to $\mathsf{row}[j]\mathsf{.upd}$.
Let's see why this works. Consider a pair $(i,j) \in I$ such that $i > 1 \,\land\, j > 1 \,\land\, x[i]=y[j]$. In order to compute $|\mathrm{lcs}(i,j)|$, it is sufficient to check $\mathsf{row}[j\!-\!1]\mathsf{.upd}$: if $\mathsf{row}[j\!-\!1]\mathsf{.upd} = i\!-\!1$, then $\mathsf{row}[j\!-\!1]\mathsf{.len} = |\mathrm{lcs}(i\!-\!1,j\!-\!1)|$ and so $|\mathrm{lcs}(i,j)| = \mathsf{row}[j\!-\!1]\mathsf{.len} + 1$; otherwise, $x[i\!-\!1] \neq y[j\!-\!1]$, hence $|\mathrm{lcs}(i\!-\!1,j\!-\!1)| = 0$ and $|\mathrm{lcs}(i,j)| = 1$. We finally obtain our sparse DP algorithm (in the code, $\mathsf{len}$ is renamed to $\mathsf{length}$):
from dataclasses import dataclass

@dataclass
class Cell:
	length: int
	upd: int

def build_map(s):
	d = dict()
	for i,c in enumerate(s):
		if c not in d: d[c] = []
		d[c].append(i)
	return d

def lcs(x, y):
	if len(y) > len(x): x,y = y,x
	n = len(x) ; m = len(y)
	p = 0 ; l = 0
	mapEnds = build_map(y)
	# upd = -2 is a sentinel: with 0-based indexing, it never equals i-1.
	row = [Cell(length=0,upd=-2) for _ in range(m)]
	for i in range(n):
		lstEnds = mapEnds.get(x[i],[])
		for j in reversed(lstEnds):
			row[j].upd = i
			if j-1 >= 0 and row[j-1].upd == i-1:
				row[j].length = row[j-1].length+1
			else:
				row[j].length = 1
			if row[j].length > l:
				l = row[j].length
				p = i
	return x[p-l+1 : p+1] if l > 0 else ""
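To gain confidence in an implementation like the one above, a useful habit is to cross-check it against a slow but obviously correct reference on many random inputs. The reference below is my own sketch (the function name is mine); it returns only the length, which suffices for comparing against $\mathsf{len}(\mathsf{lcs}(x, y))$.

```python
def naive_lcs_len(x, y):
	# O(n * m * min(n, m)) reference: for every pair of starting positions,
	# extend the match as far as possible and keep the best length.
	best = 0
	for i in range(len(x)):
		for j in range(len(y)):
			t = 0
			while i + t < len(x) and j + t < len(y) and x[i + t] == y[j + t]:
				t += 1
			best = max(best, t)
	return best

# naive_lcs_len("ababc", "abcba") -> 3 (the common substring "abc")
```

For random strings over a small alphabet, asserting `len(lcs(x, y)) == naive_lcs_len(x, y)` catches most off-by-one mistakes quickly.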

Even Further Beyond !

I hope you got the quote. Let $k \in \{1,\,\ldots,\,\min(n,m)\}$. As an attempt to "increase sparsity", we can run our algorithm on the sequences $x^{(k)}$ and $y^{(k)}$, defined as: $$x^{(k)} \quad := \quad (x[1 \,.\,.\, k],\,\, x[2 \,.\,.\, k\!+\!1],\,\, \ldots,\,\, x[n\!-\!k\!+\!1 \,.\,.\, n])$$ $$y^{(k)} \quad := \quad (y[1 \,.\,.\, k],\,\, y[2 \,.\,.\, k\!+\!1],\,\, \ldots,\,\, y[m\!-\!k\!+\!1 \,.\,.\, m])$$ This optimization requires almost no modification to the code. Its correctness is discussed in the preprint.
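A minimal sketch of the $k$-gram construction (the helper name is mine): each element of $x^{(k)}$ is a string of length $k$, and the algorithm is then run on these two sequences, treating each $k$-gram as a single symbol.

```python
def kgrams(s, k):
	# The sequence s^(k): all overlapping substrings of length k, left to right.
	return [s[i:i + k] for i in range(len(s) - k + 1)]

# kgrams("abcde", 2) -> ['ab', 'bc', 'cd', 'de']
```

Note that hashing a $k$-gram costs $\Theta(k)$ time, which is where the $k$ factor in the complexities below comes from.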

If there exists $s \in \mathrm{CS}$ such that $|s| \geq k$, then the time complexity becomes $\Theta(k(n+m-k) + \#(x^{(k)},y^{(k)}))$, while the space complexity becomes $\Theta(k(n+m-k))$. In this case, bit-packing (if feasible) allows us to further reduce both complexities: they become $\Theta(n+m-k + \#(x^{(k)},y^{(k)}))$ and $\Theta(n+m-k)$, respectively. Both time complexities are still expressed in expected amortized terms.

Otherwise, if no such $s \in \mathrm{CS}$ exists, then our sparse DP won't find a solution. So, we will need to search for one by considering the interval of lengths $[0,k\!-\!1]$. To this end, we can use linear search (or something faster).
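"Something faster" can be a binary search, since the existence of a common substring of length $t$ implies the existence of one of every shorter length (take a prefix), so the predicate is monotone in $t$. The sketch below is mine and self-contained: for simplicity it uses a naive set-of-$t$-grams predicate rather than the sparse DP above, and it returns only the optimal length.

```python
def has_common(x, y, t):
	# True iff x and y share a substring of length t.
	if t == 0:
		return True
	grams = {x[i:i + t] for i in range(len(x) - t + 1)}
	return any(y[j:j + t] in grams for j in range(len(y) - t + 1))

def lcs_length_by_bisection(x, y):
	# Binary search over the candidate lengths, valid by monotonicity.
	lo, hi = 0, min(len(x), len(y))
	while lo < hi:
		mid = (lo + hi + 1) // 2
		if has_common(x, y, mid):
			lo = mid  # a common substring of length mid exists
		else:
			hi = mid - 1
	return lo
```

This performs $O(\log \min(n,m))$ predicate evaluations instead of the $O(k)$ of a linear search.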