Quality isn't always better than quantity

While reading about data processing with MapReduce, I was astonished when I first came across the tale of the Stupid Backoff algorithm.

The story is pretty simple: Kneser-Ney smoothing was the state-of-the-art technique for smoothing n-gram language models, but it came with a heavy computational cost.

"We introduce a new smoothing method, dubbed Stupid Backoff, that is inexpensive to train on large data sets and approaches the quality of Kneser-Ney Smoothing as the amount of training data increases." (from the paper's abstract)

In 2007 Thorsten Brants and his colleagues developed a new smoothing algorithm, simpler than Kneser-Ney, that was very lean and applicable to large amounts of data.
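To make the idea concrete, here's a minimal sketch in Python of the Stupid Backoff score as described in the paper: use the plain relative frequency when the n-gram has been seen, otherwise back off to the shorter context and multiply by a fixed penalty (the paper uses 0.4). The result is a score, not a normalized probability, which is exactly what makes it so cheap. The `stupid_backoff` function and the toy counts below are illustrative, not taken from the paper.

```python
from collections import Counter

def stupid_backoff(ngram, counts, alpha=0.4):
    """Score the last word of `ngram` given its context with Stupid Backoff.

    ngram  -- tuple of words, e.g. ("the", "quick", "fox")
    counts -- Counter mapping word tuples (of any order) to raw frequencies
    alpha  -- fixed backoff penalty; the paper uses 0.4
    """
    if len(ngram) == 1:
        # Base case: relative frequency of the unigram over all unigram counts.
        total = sum(c for t, c in counts.items() if len(t) == 1)
        return counts[ngram] / total if total else 0.0

    context = ngram[:-1]
    if counts[ngram] > 0 and counts[context] > 0:
        # Seen n-gram: plain relative frequency (no discounting, no normalization).
        return counts[ngram] / counts[context]

    # Unseen n-gram: recurse on the shorter context, applying the fixed penalty.
    return alpha * stupid_backoff(ngram[1:], counts, alpha)


# Toy counts for illustration only.
counts = Counter({
    ("the",): 3, ("quick",): 1, ("fox",): 2,
    ("the", "quick"): 1, ("quick", "fox"): 1,
    ("the", "quick", "fox"): 1,
})
print(stupid_backoff(("the", "quick", "fox"), counts))  # seen trigram -> 1/1 = 1.0
print(stupid_backoff(("fox", "the"), counts))           # unseen bigram -> 0.4 * (3/6) = 0.2
```

Skipping normalization and discounting is the whole point: each score is just a lookup or two plus a multiplication, so the model scales to web-sized corpora where Kneser-Ney's bookkeeping becomes prohibitively expensive.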

The result?

These algorithms were heavily used in machine translation, and you can already figure out what happened: with small datasets Stupid Backoff generated less accurate translations but, as the amount of data analyzed grew, it extracted more valid translations, eventually beating Kneser-Ney's score [1].

I'd like you to read a few notes about Stupid Backoff's introductory paper:

Notes
  1. This was possible, in the machine-translation scenario, because the algorithm could be "trained" to perform better translations as the dataset grew.
