Matching Similar Strings

I recently needed some code to compare the descriptions of bank transactions, and look for “matches”. On the surface this sounds pretty simple, but on closer inspection there are many variations of any bank transaction description. Many transactions include a date in their description which will always be different. So: strip the numbers right? Except many transactions are just a date and an account number. Okay, so ust remove the dates? But then: what about transactions marked PENDING, or transactions that are truncated, or transactions that change some random bit of their description with a transaction ID, or…

Well, we had a lot of rules and things still weren’t ideal. I spent days thinking about it, and didn’t really get anywhere except creating more rules and exceptions. Until on the way home one night, I had a thought: “Character Ngrams”. If you don’t know it already, an ngram is just a “group of something” but the way they’re generated is special. For example: the string "Character Ngrams" if tokenized into “bigrams” (two characters each) look like this:

1
"Ch", "ha", "ar", "ra", "ac", "ct", "te", "er", "r ", " N", "Ng", "gr", "am", "ms"

What is great about these is that they help you reconstruct them, add more characters and they become easier to reconstruct. You’ll see why this is useful later on. So the first thing I wrote was a tokenize function that returned 3 character ngrams of a string, but as a Set since this code is about comparing the differences (and therefore similarity) of two strings:

1
2
3
4
5
fun tokenize(description: String, tokenSize: Int = 3) =
(0..description.length - tokenSize)
.asSequence() // no need for the intermediary List object
.map { idx -> description.substring(idx, idx + tokenSize) }
.toSet()

We also needed a simple function to run some basic transformation on each string. Here I’ll use a regex, but it can be done faster using more manual processes (I’m going for readability here, not speed):

1
2
3
4
5
fun sanitize(description: String) =
description
.toLowerCase()
.replace("[^A-Za-z0-1]", "")
.replace("\\s+", " ")

Finally, here is the actual isMatch implementation. I’ve marked out where you might want to tweek the values for your own uses:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
fun isMatch(s1: String, s2: String): Boolean {
val sanitized1 = sanitize(s1)
val sanitized2 = sanitize(s2)

// do a quick prefix / equality check to cover simple cases
if((sanitized1.length > sanitized2.length && sanitized1.startsWith(sanitized2))
|| (sanitized2.length > sanitized1.length && sanitized2.startsWith(sanitized1))
|| sanitized1 == sanitized2)
return true

val tokens1 = tokenize(sanitized1)
val tokens2 = tokenize(sanitized2)

// subtract the one set from the other, only the differences are left
val common = tokens2 - tokens1

// how much commonality do the two sets have?
// 0.4 - 0.6 are good values, tweek it to suit your needs
return (common.size.toDouble() < floor(tokens2.size.toDouble()) * 0.4)
}

This code should work on all variations of Kotlin (JVM, JS or Native) and is very easy to port to other languages. I suggest writing a trueth list of strings you want to match, and others you want to !match and then tune things until you’re happy.

The advantage of using ngrams instead of just comparing the characters is that it makes the algorithm order-aware while still handling spelling mistakes and similar issues. A character-by-character implementation would consider “abc123” an exact-match to “321cba” (or any other combination of those 6 characters).

Finally: you can make this a String extension function very easily, but I wanted to keep the example simple.