String Matching | How algorithms use to search for pattern in texts.

5 min readJul 25, 2024

In here what we try to do is we have large text and we find if given string is in the text. We have Text(T) contains large number of characters. We have a Pattern(P) contains set of chars.

Goal : We are finding whether we have given patter in out text or how many occurrences is there.

Naive String Matching

This is the basic string matching techique. We slide our pattern through Text and check if the patter match the text.

Complexity : O(mn), m = Text.size n= Pattern.size

We shift the patter by 1 and do the check

Boyer Moore Algorithm

The basic is as like Naive String search. But in here string search is improved. Lets see how it improved using Character jump heuristic and Last occurrence theorem.

Character Jump heuristic

Scenario 1 : Lets say there is Unique character (Z) in the Pattern but it is not in the text. So we can just say there is no match.

Scenario 2 : There is a unique character in the Text but not in the Pattern. So when we come near to that unique character we can have a full jump.

When we meet that unique character we do full jump. Ignore few steps until we pass that unique character.

Scenario 3 : Lets say unique character is in the Text and Pattern both. We can align those letters and check.

That the simplified explanation of Character Jump heuristic.

Last occurrence

Lets see how to create Last occurance table first.

Rule 1 : We match pattern by starting from last element

Rule2 : If we have mismatch we get the mismatch char of the text and get our last occurrence table and align to match and check from last again.

Rule 3 : If the last occurrence already passed we shift patter by 1 and check by last again.

Lets see how it works

Algorithm :

Preprocessing : Create last occurance table
Process : Match by back to Front. If mismatch found check for the last occurrence and if last occurrence in left align it to last occurrence else shift by 1 and repeat process.

Advantages : Works well in large alphabet patterns such as english. Not works well in Binary match or DNA which has small character alphabet.

KMP Algorithm

In here we use prefix and suffix to make our match easy. Lets first identify what is prefix and suffix.

ABAC

Prefix : A, AB, ABA, ABAC, null
Suffix : C, AC, BAC, ABAC, null

Proper prefix is without null and the string. Proper suffix also same

Proper prefix : A, AB, ABA
Proper suffix : C, AC, BAC

IF we have Pattern We divide it into prefixes.

ABABACA : A, AB, ABA, ABAB, ABABA, ABABAC, ABABACA

In the prefixes we find what is the length of longest proper prefix that match proper suffix of the prefix.

Preprocess : Prefix table

Process : We match characters from front to Back. When mismartch happens we get the last match position and get the number from the table and according to that number we align pattern to matching char of the text.

Lets see how it works

Hint : If Prefix table says 0 We pass the pattern until that character passes.

Advantages : Performs well with Strings that has large repitition such as DNA, Binary strings

Rabin Karp Algorithm

In here what we do is we get the hash of the pattern. Next we get the substring from text with the size of Pattern and check the hash matches the pattern hash. If matches we do character match. This mechanism is called Rolling Hash.

So it works as like this

SO that is the end of this chapter.

Summary :

Naive String Matching Algorithm : Very simple and easy to understand and implement., Poor Performance on Repetitive Patterns, Inefficient for Large Texts and Patterns

Boyer-Moore Algorithm : Efficient for Large Alphabets, Last Occurrence table need to do as preprocess, Poor Performance on Repetitive Patterns. O(n×m)

KMP : Efficient for Repetitive Patterns, Prefix table need to do as preprocess. O(n+m)

Rabin-Karp Algorithm : Good for Multiple Pattern Matching, O(n×m)