String Matching | How algorithms use to search for pattern in texts.
In here what we try to do is we have large text and we find if given string is in the text. We have Text(T) contains large number of characters. We have a Pattern(P) contains set of chars.
Goal : We are finding whether we have given patter in out text or how many occurrences is there.
Naive String Matching
This is the basic string matching techique. We slide our pattern through Text and check if the patter match the text.
Complexity : O(mn), m = Text.size n= Pattern.size
We shift the patter by 1 and do the check
Boyer Moore Algorithm
The basic is as like Naive String search. But in here string search is improved. Lets see how it improved using Character jump heuristic and Last occurrence theorem.
Character Jump heuristic
Scenario 1 : Lets say there is Unique character (Z) in the Pattern but it is not in the text. So we can just say there is no match.
Scenario 2 : There is a unique character in the Text but not in the Pattern. So when we come near to that unique character we can have a full jump.
When we meet that unique character we do full jump. Ignore few steps until we pass that unique character.
Scenario 3 : Lets say unique character is in the Text and Pattern both. We can align those letters and check.
That the simplified explanation of Character Jump heuristic.
Last occurrence
Lets see how to create Last occurance table first.
Rule 1 : We match pattern by starting from last element
Rule2 : If we have mismatch we get the mismatch char of the text and get our last occurrence table and align to match and check from last again.
Rule 3 : If the last occurrence already passed we shift patter by 1 and check by last again.
Lets see how it works
Algorithm :
- Preprocessing : Create last occurance table
- Process : Match by back to Front. If mismatch found check for the last occurrence and if last occurrence in left align it to last occurrence else shift by 1 and repeat process.
Advantages : Works well in large alphabet patterns such as english. Not works well in Binary match or DNA which has small character alphabet.
KMP Algorithm
In here we use prefix and suffix to make our match easy. Lets first identify what is prefix and suffix.
ABAC
- Prefix : A, AB, ABA, ABAC, null
- Suffix : C, AC, BAC, ABAC, null
Proper prefix is without null and the string. Proper suffix also same
- Proper prefix : A, AB, ABA
- Proper suffix : C, AC, BAC
IF we have Pattern We divide it into prefixes.
ABABACA : A, AB, ABA, ABAB, ABABA, ABABAC, ABABACA
In the prefixes we find what is the length of longest proper prefix that match proper suffix of the prefix.
Preprocess : Prefix table
Process : We match characters from front to Back. When mismartch happens we get the last match position and get the number from the table and according to that number we align pattern to matching char of the text.
Lets see how it works
Hint : If Prefix table says 0 We pass the pattern until that character passes.
Advantages : Performs well with Strings that has large repitition such as DNA, Binary strings
Rabin Karp Algorithm
In here what we do is we get the hash of the pattern. Next we get the substring from text with the size of Pattern and check the hash matches the pattern hash. If matches we do character match. This mechanism is called Rolling Hash.
So it works as like this
SO that is the end of this chapter.
Summary :
Naive String Matching Algorithm : Very simple and easy to understand and implement., Poor Performance on Repetitive Patterns, Inefficient for Large Texts and Patterns
Boyer-Moore Algorithm : Efficient for Large Alphabets, Last Occurrence table need to do as preprocess, Poor Performance on Repetitive Patterns. O(n×m)
KMP : Efficient for Repetitive Patterns, Prefix table need to do as preprocess. O(n+m)
Rabin-Karp Algorithm : Good for Multiple Pattern Matching, O(n×m)