Friday, December 20, 2013

URL- and Text-suggestion & correction using n-grams (Fuzzy Text Search)

by Hasan Karagülmez


First things first, the code accompanying this article is hosted on Github here:

For now, it's created with Visual Studio 2012, targeting .NET 4.0, though .NET 3.5 (which is the .NET 2.0 CLR) should also work.

You can download pre-build assemblies of the code that's used in this article from my Google Drive. It's ready to run, so you can see what it does first-hand.
Important: In Windows, after you've downloaded the zip, first right-click on it, choose Properties, and choose "Unblock" in the lower section of the properties window.

The below article describes what the code does, and how you can use it.

Why would you use this?

Well, the use-case for this program is trying to help users in case of a 404-page. Often times, you'd just get a sterile 404-page, which basically tells you that you're lost, 
and you should take a look at some enormous sitemap (yay...) or use the search-functionality which hopefully does contain what you're looking for.

But the reason that we got on the 404-page in the first place is that we've entered an URL which does not exist. 
So, why don't we try a bit harder and try and help the user and give a suggestion to the page which they might have meant?

We do have some input for this, namely the URL that caused us to come to the 404-page in the first place.

Wikipedia tries to do something like this on their 404 page, try this non-existing page:

You´ll get redirected to a page where you can create the page (I guess that means it's a non-existing page, probably not what users meant).

Google also has something which tries to do this, using a 404-widget, you can check that out here:
It does come with some caveats, but it does show that there could be a valid use-case here.

Also, a benefit of doing this ourselves is that we can work on this offline (handy for developing websites) and that we can exert influence on the results.

Note that, even though this use-case uses fuzzy-text matching for url-matching, the referenced program will actually work for plain text as well. In fact, a demo-project which does both is included in the Github-repository:

Interested? Let's see how we can implement such a system using fuzzy-text searching.

The what?

Fuzzy Text Search kinda sounds like Fuzzy Logic, but it actually falls more in the line of "Approximate Text Matching".

I think the Wikipedia-article describes this pretty well, check it out here:

So what are n-grams? N-grams simply are a variation of fuzzy-text-search, other famous ones are calculating the levensthein distance and its variations.
A very spiffy way of doing fuzzy-text searching is by using a Levensthein Automaton, but it doesn't seem that easy to implement :)

What is easy to understand and implement is using n-grams. Basically n-grams split up a piece of text (a "string") into partitions, and then tries and do text-matching against a dictionary.

If you really want to know how n-grams work, and why wouldn't you, take a look at the n-gram Wikipedia article. It explains it actually pretty well.

What is handy to know, is that we call it "bigrams" where n=2, and "trigrams" where n=3.

Btw, the code referenced in this article is just a variation of an n-gram implementation, but uses several optimizations to speed up the process, even making use of multi-core processors. Also, it returns a matching-percentage which you can use as a metric. 
For example, if you've got a 90% percent match, you might want to redirect to an url. Or in case of text-matching, go ahead and auto-correct words.

In case you've missed it, you can view and the code (in C#) from GitHib here:

Using plugins for maximum flexablility

Whilst the use-case initially was to create something that can handle URL-suggestions, it would be great if it could handle plain old regular text as well, wouldn't it?
N-grams certainly aren't inheritly couple to URL's, so the program that's created shouldn't either.

We do want to have influence on *WHAT* we want to use for doing our fuzzy-text search, so that hints towards a plugin-system, right?

If we would do this, all that the user of this program would have to do is to create a small plugin which adheres to some specified interfaces. This makes sure that the plugin implements code to supply the n-gram engine with the data it is supposed to use.

If we have n-gram-engine, the interfaces, and the user-supplied data-providing plugin, we have effectively decoupled our data-gathering and our fuzzy-text-search-engine.
That means that we're so flexable now, that we can work against any data-source - we don't put any constraints on that. 

Why is that handy?
Well, it means that we're not coupled to any specific system. For example, say that you want to provide data based on published pages in your CMS? Sure, make a plugin which does that.
Want to use a plain-text file? Sure, go ahead. 
Want to use morse-code smoke-signals read out by a laser as a data-source? Uhm... not sure why you would do that, but be my guest, write a plugin :)

So, basically, we could create something like this:

Actually, this is exactly the principle what the code in the Github library uses, see how there are four components?

Note that the *ONLY* thing you have to do is to create a plugin to supply data. 

Thus, the complexity of the plugin-code entirely depends on you.

The Github-code is a console-application that consumes the FuzzyTextSearch-assembly, and includes two sample plugins.

One which reads from a Sitemap.xml, which is actually the sitemap.xml of which contains a staggering 17.500 (!!!) urls.
Nonetheless, we can still perform suggestions in milliseconds - about 20ms on my quad-core i7 with hyperthreading using bigrams, about 7ms using trigrams.

If your site has got less than 17.500 urls (which is most of you I reckon), results will be returned even faster. 
Another sample I've got, based on a sitemap of 1500 url's, returns results in about 3milliseonds.

The other, which works against a text file which contains the text-contents of the Wikipedia-article about DNA and creates a dictionary of all the words in that article. Afterall, there's quite a chance that you'll mistype "Deoxyribonucleic" or "phosphodiester". I know I would :)

If you would take this line further, you could make something that indexes the textual contents of your site, build a dictionary off of that and actually make corrections based on all the site-content.
That could be handy for example when people mistype things into a search box. That would be pretty cool, wouldn't it?

Note also how at runtime the used plugin is switched, without reloading, and without instantiating a new FuzzyTextSearch-class.

Both plugins are actually in the same assembly, called Acme.MySiteItems, to indicate that it's YOUR assembly which should perform this. 

Nonetheless, the next blog will go into more detail about how to create a plugin.

Feel free to look at the code, use it, write a plugin for your own use-case. If you've got any questions, please don't hesitate to fire away.

Feedback and constructive criticism is also always appreciated :)

No comments:

Post a Comment