Monte Carlo to Las Vegas
I recently wrote an algorithms library while following Computer Science courses to understand the fundamental data structures that underpin modern computing. In the process, I learned about the value of good API design.
Sometimes, the goal of exposing a lean library API conflicts with that of writing performant code. Let's look at a compromise I had to make when adding a substring search algorithm to the library.
Monte Carlo algorithms
A Monte Carlo algorithm is guaranteed to terminate in finite time, but it may yield an incorrect result occasionally. Conversely, a Las Vegas algorithm produces a correct result only if it terminates.
In some cases, it is possible to formulate a Las Vegas variant of a Monte Carlo algorithm. The Rabin-Karp substring search algorithm has this property.
Substring search algorithms let us find the position of a search string or pattern
p within a larger text
t. Suppose our library exposes these algorithms as curried functions
Example usage in Python:
>>> find("needle")("It's like looking for a needle in a haystack") 24 >>> find("nettle")("It's like looking for a needle in a haystack") -1
The naïve or brute force algorithm walks through text, matching the search string against every substring of length
def find(pattern): m = len(pattern) def search(text): n = len(text) for i in range(0, n): if pattern == text[i : i + m]: return i return -1 return search
We aim to enable library users to write the Las Vegas variant of the Rabin-Karp algorithm in terms of the Monte Carlo variant. That way, our library only has to export one implementation of Rabin-Karp search.
The Rabin-Karp algorithm attempts to find a search string by computing a rolling hash of text input.5 Its Monte Carlo variant returns the index that defines the first substring with a hash matching that of the search string. Note that a hash collision can result in a false positive match -- a spurious "hit."
Looking at a concrete implementation of Rabin-Karp search will make things clear. Here is the Monte Carlo variant, written in Python:
def find(pattern): r = 256 # search over ASCII characters q = 997 m = len(pattern) def hash(s): # Hash the first `m` characters of `s` h = 0 for c in s[:m]: h = (h * r + ord(c)) % q return h pattern_hash = hash(pattern) def search(text): # Compare the rolling hash to the pattern hash text_hash = hash(text) n = len(text) if text_hash == pattern_hash: return 0 # Precompute `pow(r, m - 1) % q` R = 1 for _ in range(1, m): R = (r * R) % q for i in range(m, n): # Remove the contribution from the leading digit, text_hash = (text_hash + q - R * ord(text[i - m]) % q) % q # and add a contribution from the trailing digit text_hash = (text_hash * r + ord(text[i])) % q if text_hash == pattern_hash: return i - m + 1 return -1 return search
The Las Vegas variant performs an equality check to guard against spurious hits. It verifies the strings
text[i-m:i] are the same before returning from the search loop. But this is equivalent to modifying the Monte Carlo variant to call itself on the remaining portion of text if an equality check fails, viz:
def find(pattern): m = len(pattern) def search(text, start=0): i = rabin_karp.find(pattern)(text[start:]) if i == -1: return -1 if pattern == text[start + i : start + i + m]: return start + i return search(text, start + i + 1) return search
Thus library users can quickly adapt the Monte Carlo variant of the algorithm to create the Las Vegas variant.
It's hard to get a free lunch.6 In this case, reusing code can result in poor performance and excessive memory usage when a search leads to many spurious hits.
Each spurious hit:
- creates an extra stack frame, and
- reinitializes the rolling hash within
We can deal with the first problem by simply rewriting
find in an iterative fashion:7
def find(pattern): m = len(pattern) def search(text): start = 0 while True: i = rabin_karp.find(pattern)(text[start:]) if i == -1: return -1 if pattern == text[start + i : start + i + m]: return start + i start = start + i + 1 return search
We can only solve the second problem by writing
find entirely from scratch.
The library can export the Monte Carlo implementation if it is not likely to be used when spurious hits are unacceptable. Unfortunately, library authors cannot be sure programmers will RTFM.
I want to thank the people who reviewed the first draft of this blog post. Adil Parvez helped me define the tone of this article. Scott Williams pointed out it is always possible to go from a Las Vegas variant of an algorithm to a Monte Carlo variant.8
This API is ideal for Rabin-Karp search. It enables a small optimization: the library user can memoize hashing. For example,
"needle"has only been hashed once in↩
>>> search = find("needle") >>> search("It's like looking for a needle in a haystack") 24 >>> search("It's like a wild goose chase") -1
The runtime is , where and are the number of characters in the search string and the text input. ↩
Or Boyer-Moore search. Or Knuth-Morris-Pratt search. ↩
Unless you work at Google 😛. ↩
An absurd way of implementing a Monte Carlo variant of Rabin-Karp given a Las Vegas variant is to return the correct index on every other invocation and a random one otherwise. ↩