Quality isn't always better than quantity

Reading about data processing with MapReduce I was astonished when I first encountered the Stupid backoff algorithm’s tale.

The story is pretty simple: the Kneser-Ney smoothing strategy was a state-of-the-art way for processing data, but it had an heavy computational cost.

We introduce a new smoothing method, dubbed Stupid Backoff, that is inexpensive to train on large data sets and approaches the quality of Kneser-Ney Smoothing as the amount of training data increases.

In 2007 Thorsten Brants developed a new smoothing algorithm, simpler than the Kneser-Ney one, which was very lean and appliable to large amounts of data.

The result?

These algorithms were heavily used in machine translations, and you can already figure out what happened: with small datasets the backoff was generating less-accurate translations but, as the amount of data analized growed, it was able to extract more valid translations, eventually beating Kneser-Ney’s score1.

I’d like you to read a few notes about the stupid backoff’s introductory paper:

  1. This was possible, in the machine-translation scenario, thanks to the fact that the algorithm could be “trained” to perform better translations as the dataset grew

Hi there! I recently wrote an ebook on web application security, currently sold on leanpub, the Amazon Kindle store and gumroad.

It contains 160+ pages of content dedicated to securing web applications and improving your security awareness when building web apps, with chapters ranging from explaining how to secure HTTP cookies with the right flags to understanding why it is important to consider joining a bug bounty program.

Feel free to skim through some of the free chapters published on this blog and, if the content seems interesting enough to you, grab a copy on leanpub, the Amazon Kindle store, gumroad or simply checkout right down below!

Buy the Web Application Security ebook for $9.99

In the mood for some more reading?

...or check the archives.