Fuzzy Matching with Cosine Similarity

A simple algorithm to tell when things are just a LITTLE different.

How many times have we tried to match up items from two systems, and found they don’t match perfectly, so we can’t match them directly?


WARNING: Fuzzy matching is probabilistic, and will probably eat your face off if you trust it too much. That said, I recently found the science on an effective, simple solution: cosine similarity.

I got the deets from Grant Ingersoll’s book Taming Text, and from Richard Clayton’s posts. Joel also found this post, which goes into more detail, with more math and code.

Cosine similarity is a way to measure how similar two things are. It’s a linear algebra trick:

  • two things are similar if their properties are similar
  • if you can express each of these properties as a number, you can think of those numbers as coordinate values in a grid - i.e., as a vector
  • it’s straightforward to calculate the angle between two vectors (this involves their “dot product”, and their length, or “magnitude”, which is just their Euclidean distance from the origin)
  • if the angle is small, they’re similar; if it’s large, they’re dissimilar
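
In terms of the Ruby below, the cosine of that angle is just the dot product divided by the product of the two magnitudes: dot(a, b) / (mag(a) * mag(b)).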

This is a useful idea! (Linear algebra is the math I always wish I’d paid more attention to, and stuff like this is why.)

I implemented it. The code is below, but first, here’s how it scores some pairs of strings (scores truncated to 4 decimal places for readability):

One String                       Another String                    Cosine Similarity
string                           gnirts                            1.0
Radiohead                        Carly Rae Jepsen                  0.4717
The Beatles                      Taylor Swift                      0.4905
identical strings                identical strings                 1.0
First National Bank of Alaska    Fst Nat’l Bank of Alaska          0.9534
First National Bank of Alaska    First National Bank of Kentucky   0.8964

Notice that “string” and “gnirts” (“string” backwards) are considered identical. Remember: it’ll eat your face off if you trust it too much. That said, “The Beatles” and “Taylor Swift” are reassuringly dissimilar.

Here’s the Ruby:

def cosine(a, b)
  ac = a.downcase.split('')
  bc = b.downcase.split('')

  # Every character that appears in either string is a dimension.
  set = ac | bc

  afreq = count(ac)
  bfreq = count(bc)

  # Each string's vector: how often it uses each character.
  av = set.map { |x| afreq.fetch(x, 0) }
  bv = set.map { |x| bfreq.fetch(x, 0) }

  dot(av, bv) / (mag(av) * mag(bv))
end

def dot(a, b)
  # Now that I've written it in Ruby, I think
  # I might have a chance of remembering it.
  a.zip(b).map { |m, n| m * n }.reduce(:+)
end

def mag(v)
  Math.sqrt(v.map { |x| x ** 2 }.reduce(:+))
end

def count(items)
  # Frequency hash: { item => number of occurrences }
  items.group_by { |x| x }.map { |x, xs| [x, xs.size] }.to_h
end

There’s not much to it.
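
For instance, running it on a couple of pairs from the table above:

cosine('string', 'gnirts')               # => 1.0
cosine('Radiohead', 'Carly Rae Jepsen')  # => 0.4717...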

Bi-grams

It’s annoying that “string” and “gnirts” count as identical. But it’s a good reminder that you have to be careful how you express an item’s properties as numbers. Here, the only properties are how many times each character appears.

But before you shout “Levenshtein edit distance,” we can improve the matches by counting not characters, but character bi-grams. Don’t count %w(s t r i n g), count %w(st tr ri in ng). Since “string” and “gnirts” have no bi-grams in common, their cosine similarity drops to 0. That’s pretty good.

Here are the scores, side-by-side:

One String                       Another String                    Cosine, Letters   Cosine, Bi-grams
string                           gnirts                            1.0               0.0
Radiohead                        Carly Rae Jepsen                  0.4717            0.1400
The Beatles                      Taylor Swift                      0.4905            0.0801
identical strings                identical strings                 1.0               1.0
First National Bank of Alaska    Fst Nat’l Bank of Alaska          0.9534            0.8232
First National Bank of Alaska    First National Bank of Kentucky   0.8964            0.7647

Notice that “The Beatles” and “Taylor Swift” are now even more reassuringly dissimilar.

Here’s the extra code that implements the bi-gram comparison:

def ngrams(s)
  # Add spaces, front & back, so we count which letters start & end words.
  # ngrams("string") # => [[" ", "s"], ["s", "t"], ["t", "r"],
  #                        ["r", "i"], ["i", "n"], ["n", "g"], ["g", " "]]
  letters = [' '] + s.downcase.split('') + [' ']
  letters.each_cons(2).to_a
end

def cosine_ngram(a, b)
  a_ngrams = ngrams(a)
  b_ngrams = ngrams(b)

  # Same as cosine, but each dimension is a bi-gram, not a character.
  set = a_ngrams | b_ngrams

  afreq = count(a_ngrams)
  bfreq = count(b_ngrams)

  av = set.map { |x| afreq.fetch(x, 0) }
  bv = set.map { |x| bfreq.fetch(x, 0) }

  dot(av, bv) / (mag(av) * mag(bv))
end
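
Swap cosine for cosine_ngram, and the backwards-string trick stops working:

cosine_ngram('string', 'gnirts')  # => 0.0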

It’s Not Just For Strings!

The reason I’m more excited about cosine similarity than something like Levenshtein edit distance is that it’s not specific to comparing bits of text. Remember: if you can express the properties you care about as numbers, you can use cosine similarity to calculate the similarity between two items.

At Continuity, we clearly think a lot about Banks and Credit Unions. Suppose we’re trying to tell whether a record from the FDIC’s BankFind service matches a record in our database. We can cosine-similarity compare the names, of course, but both the FDIC and Continuity store other data: the FDIC Certificate number, the asset size, the number of branches…each of these properties can be expressed as a dimension of the vector, which means they can also help us decide whether two records represent the same Bank or Credit Union.
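
Here’s a sketch of what that could look like. The field names are hypothetical, invented for illustration, not the FDIC’s or Continuity’s actual schema; in practice you’d also want to scale each dimension, so a big number like asset size doesn’t drown out the others.

def bank_vector(bank)
  # Hypothetical fields, for illustration only.
  [
    bank[:fdic_certificate],  # FDIC Certificate number
    bank[:asset_size],
    bank[:branch_count],
  ]
end

def bank_similarity(a, b)
  av = bank_vector(a)
  bv = bank_vector(b)
  dot(av, bv) / (mag(av) * mag(bv))
end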

This generality is what makes cosine similarity so useful! We’re already reconsidering some engineering problems in light of it, and it’s helping us through situations where we can’t expect exact matches. A simple, general, useful idea is a good thing to find.

:wq