Build a Basic Regex Parser
Implement a regular expression function isMatch that supports the '.' and '*' symbols. The function receives two strings - text and pattern - and should return true if the text matches the pattern as a regular expression. For simplicity, assume that the actual symbols '.' and '*' do not appear in the text string and are used as special symbols only in the pattern string.
If you need a quick refresher on regular expressions, check out our refresher [link to strings lesson here]*
Regular expression functions determine if a given string matches a pattern where:
'.'is treated as a single a character wildcard (see the third example below), and'*'is matched for a sequence of zero or more characters of the previous letter (see fourth and fifth examples).
Explain your algorithm, and analyze its time and space complexities.
Examples:
input: text = "aa", pattern = "a" output: false input: text = "aa", pattern = "aa" output: true input: text = "abc", pattern = "a.c" output: true input: text = "abbb", pattern = "ab*" output: true input: text = "acd", pattern = "ab*c." output: true
Take a moment to refresh your knowledge of regular expressions.
Letter-to-letter matching won't work here as the symbol '*' induces many options for a correct matching, but it may help to consider how you'd implement a simple solution (without special characters) first.
Stuck? Take a step back. How would you check whether a specific index in the text matches the pattern? Consider recursion.
Have you considered the many edge-cases in this question? For example, empty strings and the '.*' pattern. Can you think of more?
Have you generated test cases?
The solution for this kind of pattern matching is usually done by recursion. Review first the algorithm below and then read the explanation that follows.
Pseudocode:
function isMatch(text, pattern): return isMatchHelper(text, pattern, 0, 0) # Input: # text - the text to check, # pattern - the regular expression to be checked, # textIndex - the index the text is checked from # patIndex - the index the pattern is checked from # Output: # true if the text from the index textIndex matches # the regular expression pattern from the pattern Index. # E.g. isMatchHelper(“aaabb”,”cab*”,2, 1) since the text # from index 2 (“abb”) matches the pattern from index 1 (“ab*”) function isMatchHelper(text, pattern, textIndex, patIndex): # base cases - one of the indexes reached the end of text or pattern if (textIndex >= text.length): if (patIndex >= pattern.length): return true else: if (patIndex+1 < pattern.length) AND (pattern[patIndex+1] == '*'): return isMatchHelper(text, pattern, textIndex, patIndex + 2) else: return false else if (patIndex >= pattern.length) AND (textIndex < text.length): return false # string matching for character followed by '*' else if (patIndex+1 < pattern.length) AND (pattern[patIndex+1] == '*'): if (pattern[patIndex] == '.') OR (text[textIndex] == pattern[patIndex]): return (isMatchHelper(text, pattern, textIndex, patIndex + 2) OR isMatchHelper(text, pattern, textIndex + 1, patIndex)) else: return isMatchHelper(text, pattern, textIndex, patIndex + 2) # string matching for '.' or ordinary char. else if (pattern[patIndex] == '.') OR (pattern[patIndex] == text[textIndex]): return isMatchHelper(text, pattern, textIndex + 1, patIndex + 1) else: return false
Note: in the code, a text of length N begins in position 0 and ends in position N-1.
As we can see, this is a recursive function that has many cases to cover:
First the base cases - if textIndex points to the end of the text:
- If the pattern also points to the end of the text, then both are the empty string and both match.
- If the pattern has some character followed by
'*', then the only possibility that the text matches is if the character before the'*'is matched exactly 0 times, thus incrementing the pattern index forward by two (after the'*'symbol). - Otherwise, the text has ended but the pattern has not, thus the strings don't match. Another similar base case is if the pattern has ended but the text hasn't, which also means that the two don't match.
Next, we skip to the last four lines in the code, to the cases where pattern[patIndex + 1] is not '*', i.e. the current character is not followed by '*':
- If the
pattern[patIndex]is'.', or the(pattern[patIndex] == text[textIndex])we proceed to check the next indexes, as the current indexes match. - Otherwise, the text and pattern don't match.
Finally, the case where the current character in the pattern is followed by the '*' sign:
- If the current characters match or the pattern in the current index is
'.'then:- The
'x*'in the pattern (where x is the current character in the pattern) is matched as a sequence of zero characters - in which case we simply incrementpatIndexby2. - The
'x*'is matched to the current character along with the next character - in which case we increment thetextIndexonly.
- The
Time Complexity: in the worst case, the solution above runs in time exponential in the size of the pattern. Patterns that
take the most time, are the ones with multiple '*' symbols, that may provoke an exponential number of recursion calls:
For example, for the text “bbbbbbbb” and the pattern “.*.*.*.*a”, this function will open a new recursive call to itself for every single way to divide the text in four (the number of “.*”).
Space Complexity: the space required is also exponential because of the number of recursion calls filling up the stack. There are, in fact, algorithms that solve the matching problem in polynomial time and space. They are based on Nondeterministic Finite State Machines, which we didn't provide here due to the fact that it requires more knowledge in theoretical computer science.
def is_match(text, pattern):
def is_match_helper(text, pattern, text_index, pat_index):
if text_index >= len(text):
if pat_index >= len(pattern):
return True
elif pat_index + 1 < len(pattern) and pattern[pat_index + 1] == '*':
return is_match_helper(text, pattern, text_index, pat_index + 2)
else:
return False
elif pat_index >= len(pattern) and text_index < len(text):
return False
elif pat_index + 1 < len(pattern) and pattern[pat_index + 1] == '*':
if pattern[pat_index] == '.' or text[text_index] == pattern[pat_index]:
return (is_match_helper(text, pattern, text_index, pat_index + 2) or
is_match_helper(text, pattern, text_index + 1, pat_index))
else:
return is_match_helper(text, pattern, text_index, pat_index + 2)
elif pattern[pat_index] == '.' or pattern[pat_index] == text[text_index]:
return is_match_helper(text, pattern, text_index + 1, pat_index + 1)
else:
return False
return is_match_helper(text, pattern, 0, 0)
func isMatch(text: String, pattern: String) -> Bool { // Convert strings to arrays for easier indexing let s = Array(text.characters) let p = Array(pattern.characters) guard !s.isEmpty && !p.isEmpty else { return true } // Create DP table: dp[i][j] represents if s[0...i-1] matches p[0...j-1] var dp = Array(repeating: Array(repeating: false, count: p.count + 1), count: s.count + 1) // Empty pattern matches empty string dp[0][0] = true // Handle patterns with leading stars for j in 1...p.count { if p[j-1] == "*" { dp[0][j] = dp[0][j-2] } } // Fill the dp table for i in 1...s.count { for j in 1...p.count { if p[j-1] == "." || p[j-1] == s[i-1] { // Direct match or dot dp[i][j] = dp[i-1][j-1] } else if p[j-1] == "*" { // Star pattern dp[i][j] = dp[i][j-2] // Zero occurrence if p[j-2] == "." || p[j-2] == s[i-1] { dp[i][j] = dp[i][j] || dp[i-1][j] // One or more occurrences } } } } return dp[s.count][p.count] }