The length of the longest proper prefix that matches a proper suffix...
Knuth-Morris-Pratt Algorithm is indeed a hard nut to crack. I tried to understand it and now trying to explain it the way I understood.
Before we go on to understand the algorithm, let us have a look at the complicated wikipedia explanation:
In computer science, the Knuth–Morris–Pratt string searching algorithm (or KMP algorithm) searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.
If you do not understand the above lines in first few attempts, it is ok. Let us try to understand it together.
In order to understand the KMP algorithm better, let us divide it into two parts:
- Creating Partial Match Table.
- Using Partial Match Table.
Creating Partial Match Table
Partial Match Table (aka Failure Function) is the key to KMP.
Let us create a sample Partial Match table for pattern "abababca".
character: | a | b | a | b | a | b | c | a | index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |
Before we understand the creation of this partial match table, let us understand few things:
- Pattern "abababca" has 8 characters. Hence, pattern match table has 8 cells. Thing to understand here is that if you are looking for a character in 6th cell i.e. b, then you are interested in only 6 characters of the pattern as per KMP. Content of last two cell do not matter. This implies that every time we look at a cell, we essentially create a sub-pattern. For 6th cell, the sub-pattern is "ababab".
- Proper Prefix: If our pattern is "Harry" then its proper prefixes are: "H", "Ha", "Har" and "Harr". So, proper prefix is all characters in the string with one or more characters removed from the end.
- Proper Suffix: If our pattern is "Harry" then proper suffixes are: "y", "ry", "rry" and "arry". So proper prefix is all characters in the string with one or more characters removed from the beginning.
Keeping, these three points in mind, we can deduce the values in the Pattern Match Table based on following notion:
The length of the longest proper prefix in the (sub)pattern that matches a proper suffix in the same (sub)pattern.
Let us try it for cell 5. Sub-pattern for cell 5 is "ababa"
Proper Prefix: a, ab, aba, abab
Proper Suffix: a, ba, aba, baba
Let us find out the common string in proper prefix and proper suffix. We get "a" and "aba".
Length of longest one (i.e. "aba") is 3.
Therefore, value of cell 5 is 3.
This is how we created the pattern match table
Using the Pattern Match Table
Once we have created the pattern match table, we can use its values to skip ahead when we find partial matches. It works in the following manner:
If a partial match of length partial_match_length is found and table[partial_match_length - 1] > 1, we may skip ahead partial_match_length - table[partial_match_length - 1] characters.
Let us match the pattern "abababca" against the text "bacbababaabcbab". We already have the Pattern Match table defined above.
We find the first partial match in following situation:
bacbababaabcbab | abababca
partial_match_length = 1
table[partial_match_length - 1] = table[0] = 0 which is not >1.
Therefore, we do not get to skip ahead at all.
Next partial match is found at:
bacbababaabcbab ||||| abababca
partial_match_length = 5
table[partial_match_length - 1] = table[4] = 3 which is >1.
Therefore we get to skip ahead by,
partial_match_length - table[partial_match_length - 1] = 5 -3 = 2 characters.
Hence, now matching will start as below:
//x means skip bacbababaabcbab xx||| abababca
Conclusion
This is the basic idea of how the algorithm works. To read the algorithm, please read Introduction to Algorithms by Thomas H. Cormen.
This is the basic idea of how the algorithm works. To read the algorithm, please read Introduction to Algorithms by Thomas H. Cormen.
No comments:
Post a Comment