[the code, sample usage below]

[Update 26.09.2018: it turns out that even the method presented here is not enough on larger scale; if you hit such limits, you can take a look at PyLucene (requires Java) and my example usage for retrieving similar words and distance computation, starting mainly from the line 52. You can let me know if there’s a need for something more annotated. Hopefully the simpler pure-Python method is enough for lighter duties.]

Edit distance is a problem I frequently encounter. It is just a number measuring how many rudimentary operations on characters (insertions, replacements and deletions) you need to transform one string of text into another. In this form it’s called the Levenshtein distance, or Levenshtein-Damerau if you add transpositions (making apple => aplpe counting as one operation). You may need that for error correction, for example, to know what possible replacements are most similar to the incorrect word.

Levenshtein-Damerau distance is used in the popular Peter Norvig’s post on spelling correction. His implementation is nice in being very explicit: you can literally see Python code producing all possible variations so we can then check whether some other word belongs to that set. The problem is that, obviously, this space quickly becomes enormous. If we have have just a 24-letter alphabet on 5-character word, you have 5*24 possible replacements, 6*24 possible insertions, 4 transpositions and 6 deletions. That’s 274 possibilities for the Levenshtein distance d = 1 (not a big problem for a computer), and roughly speaking this number is taken to the power of d. For d = 3, there exist around 20.5 million possibilities (less because of duplicates, more because at this point we are already expanding words that are 5+(d-1)=7 characters long).

There are slightly smarter ways to do this, but they cannot really circumvent the number of transformations that could be applied. (Reasons why humans can judge similarity between two words without descending into madness are left out for today). With NLP problems I often have to check distance between many, many pairs of strings, and it would be nice if computers at least knew when to give up. If the texts are farther apart than, say, three, we could stop drowning in the swamp of exponentially bigger numbers. Put a cutoff on distance, we could say.

There exists a concept of Levenshtein automatons, which let us check whether any two words are within given distance from each other. This would be it, but apparently, as the internet says, the algorithm is very arcane. I’d rather not dive into algorithmic studies tangential to my work. But it is surprisingly hard to get a ready-made solution with this functionality in Python, which is after all a fairly popular language.

At least I found a blog post by Jules Jacobs claiming that “Levenshtein automata can be simple and fast”. What he does is implementing the dynamic programming algorithm for calculating distance, which gives us a sort of an automaton for recognizing strings that are within some prescribed distance from the original string. This is almost what I wanted, if it only worked in Python 3.0+ and also gave me a way to get the actual distance and not only binary information “match/no match”. So I went ahead and introduced some minor modifications to make it work. Again, the code is here.

Here’s how to use it.

from levenshtein import LevenshteinAutomaton

max_distance = 3
base_str = 'base_string'
other_strs = ['base_trsing', 'apple']

aut = LevenshteinAutomaton(base_str, max_distance)
for s2 in other_strs: # reuse the automaton for all possible matches
    state = aut.start() # state of the automaton
    for c in s2:
        state = aut.step(state, c)
        if not aut.can_match(state): # break if the distance is bigger than we want
    if aut.is_match(state):
        print('{} is a match with distance {}!'.format(s2, aut.distance(state)))

Or, just use the function (the overhead of creating a new automaton under the hood each time is rather trivial):

from levenshtein import distance_within

distance_within('base_string', 'base_trsing', 2) # 2
distance_within('base_string', 'apple', 2) # False

The only problem is that it does not support transpositions. Currently I lack ideas for adding them in an elegant way.

Some additional explanations on how it works, meant as a commentary to Jacobs’ work and Wikipedia. (Very handwavy if you haven’t been introduced to basic algorithms/dynamic programming). The original algorithm, described in Wikipedia, finds distance between strings by recursively finding minimal distances between their shorter and shorter substrings, all the way down to the base case of individual characters. This can be done with removing (ie. putting temporarily on stack) characters either from the ends or from the beginnings of the strings. The C implementation does the former, because it makes handling the char arrays easier; we just decrease their length considered by the function (len_s and len_t).

More efficient algorithms put these partial solutions in a matrix to avoid repeating computation. If the algorithm works from the bottom and finds early that there is no way the distance can be less than d, it can stop. We can also throw away earlier rows after we used them and even go sparse, as it is explained in the blog post that I linked. The important part is that finally we get our answer in the bottom right corner of the matrix. This is what my implementation uses to provide the actual computed distance.

Leave a Reply

Your email address will not be published. Required fields are marked *