Rant on fuzzy searching
I spent too much time trying to solve a small problem in a good-enough solution
published: Tue 16 June 2026(This is related to my To Type a Tale game, but I'm keeping it in unsorted thoughts for now.)
For anyone who has commented on or wondered about the highlighted text not matching what you expect as you type: it is a remarkably difficult task to figure out where in a large section of text someone is trying to type. Consider the sentence "You are your own worst enemy." If the player has just typed "you", should we highlight "You", the start of "your", or are they a really bad typist and they meant to type "worst"? The game can't know for sure, so it guesses the first two (because it would take more corrections to match the third), but it allows itself the possibility of the third later on. I felt this was... fine for the game I was trying to make.
However, three feedback comments really started getting under my skin and I wanted to see if I could fix the issues and get a smoother typing experience:
- The first is that pausing your typing to wait for the highlighting to reset so you could type the other side of spiky words (thus, defusing them) is an annoying disruption to gameflow (aside: yes, that's kind of the point).
- The second is that the game greys-out collected words, but does not allow you to skip them when typing over that section again (e.g. if the player has already typed "worst", they can't then type "own enemy" and have the game successfully skip "worst").
- Finally, the number of errors you can make while typing is fixed per typing session, which means that trying to type a full paragraph has the same error tolerance as typing a single word.
The third is something I could potentially solve with my current algorithm by slowly increasing the error limit as you type a longer string. But the first two just won't work with the way my fuzzy-search algorithm works right now. So... I spent nearly 12 hours researching, experimenting, and trying new algorithms. And my conclusion is that I just have to stick with the one I already have.
The current algorithm could be summarized as "linear forward search with fixed error allowance". The fixed error allowance is needed to let the algorithm run in a reasonable amount of time. Without it, the game grinds to a halt after typing about 10 characters because the possibility space has just grown too large. Here's an example: you have just typed a "y". The possibilities are:
- it's a typo and you could be anywhere in the text
- start of "You"
- start of "your"
- end of "enemy"
Then you type an "o". Now the possibilities are:
- both are typos and you could be anywhere
- start of "You"
- start of "your"
- the o is a typo and it's the end of "enemy"
- the y is a typo and it's the middle of "worst"
You can see how this could blow up fast? There are some opportunities for improvement (e.g. my algorithm currently also considers the possibility that the o is a typo, but it's still the start of "you" or "your"), but it needs a firm hand in the form of fixed error allowances.
My research on Google, etc. was largely fruitless until Monday morning. This is because I kept searching for "fuzzy search." There are two general types of fuzzy search and neither is what I want: 1) comparing two roughly-equal-length strings and finding the "distance" between them (see: Levenshtein); 2) finding a short string in a large collection of other strings (see: most "search" websites, fzf). The main problem is that most algorithms of both types just return the confidence score of the matches, rather than the matching string indexes, which is what I really need. Some do return that, but the performance tanks when trying to run unequal strings of moderate size.
Monday morning, I realized there's a third type: diff. So I went on a mad quest to find the algorithms used to calculate diffs. Lo and behold, the Myers diff algorithm looks like it could be exactly what I need! It doesn't tell me the indices, but I could use some simple logic to split the source string based on which letters the diff told me are "kept" and then do a direct in-string find (relatively cheap to perform). Unfortunately, Myers was a mathematician, meaning the main paper was written with math notation and poorly explained. I love my fellow mathematicians, but someone needs to explain to them that using m, n, k, and d as your variables in an algorithm is extremely frustrating. Variable names should be self-descriptive! Fine, whatever. I looked up a Python implementation, because Python devs usually know how to write things more legibly. I got the algorithm reimplemented in GDScript and... it's way too damn slow. I could tweak it stop once it reaches a certain number of inserts (i.e. the target text is way larger than the source text), but it had other issues and was failing to find easy phrases. So this clearly wasn't the vimdiff algorithm that works so well.
Prior to Monday morning, I also made several attempts to create my own algorithms: confidence-calculating; heat-map; incremental; trigram; etc. The only promising one of the lot was the confidence-calculating one, but that also hallucinated a bunch.
Between the repeated failures of my own algorithms, the inscrutable text of the diff algorithm papers, and the licensing on the vimdiff code, I decided to just give up and continue using the algorithm I know works reasonably well.