KMP Algorithm

How to search a text in O(N)?

KMP Algorithm

Given strings s and w, return the positions in s that match to w. Let the length of s and w be N and M, respectively. The time complexity of the naive approach is $O(NM)$.

An example run of the naive algorithm
======================================

Check 5 letters to find the 5th character does not match. Slide w by 1.
    0123456789012
s = AABAAABAABAAA
w = AABAAB
    012345

Check 2 letters to find the 2nd character does not match. Slide w by 1.
    0123456789012
s = AABAAABAABAAA
w =  AABAAB
     012345

Check 1 letter to find it does not match. Slide w by 1.
    0123456789012
s = AABAAABAABAAA
w =   AABAAB
      012345

......
The naive algorithm repeatedly checks M letters for N times, in the worst case.

On the other hand, the KMP (Knuth-Morris-Pratt) algorithm runs in $O(N)$ to do the same task.

An example run of the the KMP algorithm
=======================================

Check 5 letters and find the 5th character does not match.
Because w[3:5] = w[0:2] = AA, we slide w by 5-2 = 3.
    0123456789012
s = AABAAABAABAAA
w = AABAAB
    012345

    0123456789012
s = AABAAABAABAAA
w =    AABAAB
       012345
We know that w[0:2] matches to s[3:5].
We start to match w againt s from s[5] and w[2]. They do not match.
Because w[1:2] = w[0:1] = A, we slide w by 2-1 = 1.

    0123456789012
s = AABAAABAABAAA
w =     AABAAB
        012345
We know that w[0:1] matches to s[4:5].
We start to match w againt s from s[5] and w[1]. They match.
Add index 4 to the answer list.
Because w[3:6] = w[0:4] = AAB, we slide w by 6-3 = 3.

    0123456789012
s = AABAAABAABAAA
w =        AABAAB
           012345

......
The KMP algorithm skips unnecessary checks.

As we can see from the above example, the KMP algorithm utilizes the structure of w. Regardless of whether it finds a match or a mismatch in an iteration, it shifts w using that knowledge.

Let's take a closer look at the knowledge here.

Check 5 letters and find the 5th character does not match.
Because w[3:5] = w[0:2] = AA, we slide w by 5-2 = 3.
    0123456789012
s = AABAAABAABAAA
w = AABAAB
    012345

Matching has proceeded to the 5th letter in w means s and w matched up to the 4th letter in w. Examining w[0:5] (=AABAA), we can see that the longest proper prefix of w[0:5] that is also a suffix of w[0:5] is AA. That teaches us that if we slide w by 5-2=3, we can have w[0:2] matched and start matching from w[2].

Creating LPS Array

As we have seen, we need to prepare an array of the length of the longest proper prefix that is also a suffix (LPS) for all w[0:i] $(0 \lt i \le M)$. Here is how we calculate the LPS array by hand.

How to create an LPS array in O(M)
===================================

w[0:1] = A
There is no proper prefix in A.
LPS[0] = 0

w[0:2] = AA
The suffix A is also a proper prefix of AA.
LPS[1] = 1

w[0:3] = AAB
There is no suffix of AAB that is also a proper prefix of AAB.
LPS[2] = 0

W[0:4] = AABA
The suffix A is also a proper prefix of AABA.
LPS[3] = 1

W[0:5] = AABAA
The suffix AA is also a proper prefix of AABAA.
LPS[4] = 2

W[0:6] = AABAAB
The suffix AAB is also a proper prefix of AABAAB.
LPS[5] = 3

LPS = [0, 1, 0, 1, 2, 3]

Implementation

Here is my implementation of the KMP algorithm in TypeScript.

Proof of len=lps[len-1]

Time Complexity

The preparation of the LPS array takes $O(M)$ and the search takes $O(N)$. Assuming $N > M$, we can say that the whole algorithm runs in $O(N)$.