
Saving Big Data from the Zeros

Because of the hype cycle, Big Data inevitably attracts dissenters who want to deflate the lofty expectations built around new technologies that appear mystifying to those outside the Silicon Valley machine. The first response is generally “so what?”, followed by the claim that there is nothing new here, just a rehash of efforts like grid computing, Beowulf, and whatnot. This skepticism is generally a healthy inoculation against aggrandizement and any kind of hangover from unmet expectations. Hence, the NY Times op-ed from April 6th, “Eight (No, Nine!) Problems with Big Data,” should be embraced for enumerating eight or nine different ways that Big Data technologies, algorithms, and thinking might be stretching the balloon of hope towards a loud, but ineffectual, pop.

The eighth item on the list bears some scrutiny, though. The authors, with whom I am not familiar, focus on the overuse of trigrams in building statistical language models. They note that language is very productive and that even a short phrase from Rob Lowe, “dumbed-down escapist fare,” doesn’t appear in Google’s indexed corpus. Shades of “colorless green ideas…” from Chomsky, but an important lesson in how to manage the composition of meaning. “Dumbed-down escapist fare” doesn’t translate well back and forth through German via Google Translate. For the authors, that shows the failure of the statistical translation methodology linked to Big Data, and it ties in to their other concerns about predicting rare occurrences or even, in the case of Lowe’s quote, zero occurrences.

In reality, though, these methods of statistical translation through parallel text learning date to the late 1980s and reflect a distinct journey through ways of thinking about natural language and computing. Throughout the field’s history, phrasal bracketing and the alignment of those phrases to build a statistical concordance have been driven by more than trigrams. And where higher-order ngrams get sparse (or go to zero probability, like Rob Lowe’s phrase), the better estimate is based on composing the probabilities of each sub-phrase or word:


Google search                  Hits
“dumbed-down”                  690,000
escapist                       1,860,000
fare                           132,000,000
“dumbed-down escapist fare”    110 (all about the NY Times article)
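
To make the composition idea concrete, here is a minimal Python sketch that treats the three words as independent and sums their log relative frequencies. The hit counts come from the table; the index size is a made-up placeholder (`ASSUMED_INDEX_SIZE` is hypothetical, not a figure from the article or from Google), and page hits are only a rough stand-in for word frequencies. The point is simply that the composed estimate is small but not zero.

```python
import math

# Hit counts taken from the table above.
hits = {
    "dumbed-down": 690_000,
    "escapist": 1_860_000,
    "fare": 132_000_000,
}

# ASSUMPTION: a hypothetical number of indexed pages, chosen only for illustration.
ASSUMED_INDEX_SIZE = 50_000_000_000


def independent_log_prob(words, counts, total):
    """Sum of log relative frequencies, treating the words as independent."""
    return sum(math.log(counts[w] / total) for w in words)


phrase = ["dumbed-down", "escapist", "fare"]
log_p = independent_log_prob(phrase, hits, ASSUMED_INDEX_SIZE)

print(f"log P(phrase) under independence: {log_p:.2f}")
print("estimate is nonzero:", math.exp(log_p) > 0.0)
```

Independence is the crudest possible composition; it ignores word order entirely, which is why real systems prefer to fall back through shorter contexts instead.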



Indeed, reweighting language models to accommodate the complexities of unknowns has been around for a long time. The idea is called “back-off probability” and can be used for ngram text modeling or even for rather elegant multi-length compression models like Prediction by Partial Matching (PPM). In both cases, the composition of the substrings becomes important when the phrasal whole has no reference. And when an unknown word is present in a sequence of knowns, the combined semantics of the knowns, together with an estimate of the unknown’s part of speech based on syntactic regularities, provides clues; “dumbed-down engorssian fare” could be about the arts or food but likely not about space travel or delivery vans.
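
As a rough illustration of backing off, the sketch below scores a word from the longest context that has actually been seen and falls back to shorter contexts otherwise. It uses the simplified “stupid backoff” scheme (a fixed penalty per back-off step, producing scores rather than normalized probabilities), not the classic discounted back-off estimators. The toy corpus, the function names, and the penalty `alpha=0.4` are all assumptions made for the example.

```python
from collections import Counter


def ngram_counts(tokens, max_n=3):
    """Count every n-gram of length 1..max_n in the token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def backoff_score(word, context, counts, total_tokens, alpha=0.4):
    """Score `word` after `context`, shortening the context until a match is found."""
    context = tuple(context)
    penalty = 1.0
    while context:
        full = context + (word,)
        if counts[full] > 0:
            return penalty * counts[full] / counts[context]
        context = context[1:]   # drop the oldest context word
        penalty *= alpha        # fixed penalty for each back-off step
    return penalty * counts[(word,)] / total_tokens


def phrase_score(words, counts, total_tokens, order=3):
    """Multiply per-word back-off scores across the phrase."""
    score = 1.0
    for i, word in enumerate(words):
        context = words[max(0, i - order + 1):i]
        score *= backoff_score(word, context, counts, total_tokens)
    return score


# Hypothetical toy corpus: the full trigram never occurs in it.
tokens = ("dumbed-down television fare is escapist fare "
          "and escapist television is dumbed-down").split()
counts = ngram_counts(tokens)

phrase = ["dumbed-down", "escapist", "fare"]
print("trigram count:", counts[tuple(phrase)])
print("back-off score:", phrase_score(phrase, counts, len(tokens)))
```

Note that a genuinely unseen word such as “engorssian” still scores zero here; handling it takes an additional unknown-word estimate of the kind described above, which this sketch leaves out.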

And so we rescue Big Data from the threat of zeros.