Category: Big Data

Inequality and Big Data Revolutions

I had some interesting new talking points in my Rock Stars of Big Data talk this week. On the same day, MIT Technology Review published Technology and Inequality by David Rotman, which surveys the link between a growing wealth divide and technological change. Part of my motivating argument for Big Data is that intelligent systems are likely the next industrial revolution, a claim I borrow from Paul Krugman of Nobel Prize and New York Times fame. Krugman builds on Robert Gordon’s analysis of past industrial revolutions, which reached some dire conclusions about slowing economic growth in America. The consequences of intelligent systems for everyday life will be enormous, disrupting everything from low-wage workers through to knowledge workers. And how does Big Data lead to that disruption?

Krugman’s optimism was built on the presumption that the brittleness of intelligent systems so far can be overcome by more and more data. There are some examples where we are seeing incremental improvements due to data volumes. For instance, having larger sample corpora to use for modeling spoken language enhances automatic speech recognition. Google Translate builds on work that I had the privilege to be involved with in the 1990s, which used “parallel texts” (essentially line-by-line translations) to build automatic translation systems based on phrasal lookup. The more examples of how things are translated, the better the system gets. But what else improves with Big Data? Maybe instrumenting many cars and crowdsourcing driving behaviors through city streets would provide the best data-driven approach to self-driving cars. Maybe instrumenting individuals will help us automate some of the things we do effortlessly that are strangely difficult for machines, like folding towels and understanding complex visual scenes.

But regardless of the methods, the consequences need to be considered. Our current fascination with Big Data may not lead to Industrial Revolution 4 in five years or twenty, but unless there is some magical barrier that we are not aware of, IR4 seems inevitable. And the impacts will perhaps be more profound than those of past revolutions because, unlike those transitions, the direct displacement of workers is a key component of the IR4 plan. In Rotman’s article, Thomas Piketty’s r > g is invoked to explain how an excess return on capital (r) over the economic growth rate (g) leads to a concentration of wealth among the richest members of our society. The result is a barbell distribution of economic opportunities, with a middle class dismantled by (per Gordon) the equalization of labor costs through outsourcing to low-cost nations. But at least there remains a left bell to that barbell: it is largely impossible to eliminate the service jobs that are critical to retail, restaurants, logistics, health care, and a raft of other economic sectors.

All that changes in IR4: the barbell turns into the hammer from the Olympic hammer throw as the owners of capital take over the entire cost structure for a huge range of economic activities. The middle may not be gone initially, as maintenance of the machinery will still require a skilled workforce. But even this will become a target of Big Data optimization as predictive maintenance and self-healing systems optimize against their failure modes over usage cycles.

So let’s go back to Gordon’s pessimism (economics is, after all, the “dismal science”). What headwinds and tailwinds are left in IR4? Perhaps the most cogent remedy is the recommended use of redistributive methods to accelerate educational opportunities while reducing the debt load of American students. The other areas discussed include unlimited immigration to offset the decline in hours worked per capita due to retirement and demographic effects, but Gordon’s prescription is not necessarily valid in IR4, where low-skilled immigration would cease for lack of economic opportunities and even higher-skilled workers might find themselves displaced.

One lesson learned from past industrial revolutions is that they created more opportunities than worker displacements. Steam power displaced animal labor and the workers needed to shoe and train and feed those animals. Diesel trains displaced steam engine builders and mechanics. Cars and aircraft displaced trains. But in each case there were new jobs that accompanied the shift. We might be equally optimistic about IR4, speculating about robot trainers and knowledge engineers, massive extraction industries and materials production, or enhanced creative and entertainment systems like Michael Crichton’s dystopian Westworld of the early 70s. Is this enough to buffer against the headwind of the loss of the service sector? Perhaps, but it will not come without enormous global disruption.

Profiled Against a Desert Ribbon

Catch a profile of me in this month’s IEEE Spectrum Magazine. Note Yggdrasil in the background! It’s been great working with IEEE’s Cloud Computing Initiative (CCI) these last two years. CCI will be ending soon, but its impact will live on in, for instance, the Intercloud Interoperability Standard and in other ways. Importantly, I’ll be at the IEEE Big Data Initiative Workshop in Hoboken, NJ, at the end of the month, working on the next initiative in support of advanced data analytics. Note that Hoboken and Jersey City have better views of Manhattan than Manhattan itself!

“Animal” was the name of the program and it built simple decision trees based on yes/no answers (does it have hair? does it have feathers?). If it didn’t guess your animal it added a layer to the tree with the correct answer. Incremental learning at its most elementary, but it left an odd impression on me: how do we overcome the specification of rules to create self-specifying (occasionally, maybe) intelligence? I spent days wandering the irrigation canals of the lower New Mexico Rio Grande trying to overcome this fatal flaw that I saw in such simplified ideas about intelligence. And I didn’t really go home for days, it seemed, given the freedom to drift through my pre-teen and then teen years in a way I can’t imagine today, creating myself among my friends and a penumbra of ideas, the green chile and cotton fields a thin ribbon surrounded by stark Chihuahuan desert.
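The "Animal" program described above is simple enough to sketch in a few lines. This is a hypothetical reconstruction, not the original program: a binary tree of yes/no questions whose leaves are guesses, grown by one node each time a guess fails.

```python
# Minimal sketch (hypothetical, not the original program) of "Animal":
# a yes/no decision tree whose leaves are animal guesses. When a guess
# fails, the failed leaf is replaced by a new question that separates
# the old animal from the new one -- incremental learning at its most
# elementary.

class Node:
    def __init__(self, text, yes=None, no=None):
        self.text = text          # a question at internal nodes, an animal at leaves
        self.yes, self.no = yes, no

    def is_leaf(self):
        return self.yes is None and self.no is None

def play(node, oracle):
    """Walk the tree using oracle(question) -> bool; return the leaf reached."""
    while not node.is_leaf():
        node = node.yes if oracle(node.text) else node.no
    return node

def learn(leaf, new_animal, question, answer_for_new):
    """Grow the tree: the failed leaf becomes a question with two animal leaves."""
    old = Node(leaf.text)
    new = Node(new_animal)
    leaf.text = question
    if answer_for_new:
        leaf.yes, leaf.no = new, old
    else:
        leaf.yes, leaf.no = old, new

# Start knowing only one animal, then teach the program a second one.
root = Node("dog")
guess = play(root, lambda q: True)
if guess.text != "duck":
    learn(guess, "duck", "does it have feathers?", True)
```

After the single learning step, the tree answers "duck" when told the animal has feathers and "dog" otherwise; every wrong guess deepens the tree by one question.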

Action on Hadoop

The back rooms of everyone from Pandora to the NSA are filled with machines working in parallel to enrich and analyze data. And mostly at the core is Doug Cutting’s Hadoop, which provides an open source implementation of Google’s MapReduce framework combined with a distributed file system for replication and failover. With Hadoop Summit arriving this week (the 6th I’ve been to and the 7th ever), the importance and impact of these technologies continues to grow.

I hope to see you there and I’ll take this opportunity to announce that I am co-authoring Hadoop in Action, 2nd Edition with the original author, Chuck Lam. The new version will provide updates to this best-selling book and introduce all of the newest animals in the Hadoop zoo.

Inching Towards Shannon’s Oblivion

Following Bill Joy’s concerns over the future world of nanotechnology, biological engineering, and robotics in 2000’s Why the Future Doesn’t Need Us, it has become fashionable to worry over “existential threats” to humanity. Nuclear power and weapons used to be dreadful enough, and clearly remain in the top five, but these rapidly developing technologies, asteroids, and global climate change have joined Oppenheimer’s misquoted “destroyer of all things” in portending our doom. Here’s Max Tegmark, Stephen Hawking, and others in the Huffington Post warning again about artificial intelligence:

One can imagine such technology outsmarting financial markets, out-inventing human researchers, out-manipulating human leaders, and developing weapons we cannot even understand. Whereas the short-term impact of AI depends on who controls it, the long-term impact depends on whether it can be controlled at all.

I almost always begin my public talks on Big Data and intelligent systems with a presentation on industrial revolutions that progresses through Robert Gordon’s phases and then highlights Paul Krugman’s argument that Big Data and the intelligent-systems improvements we are seeing potentially represent a next industrial revolution. I am usually less enthusiastic about the timeline than nonspecialists, but after giving a talk at PASS Business Analytics on Friday in San Jose, I stuck around to listen in on a highly technical talk concerning statistical regularization and deep learning, and I found myself enthused about the topic once again. Deep learning uses artificial neural networks to classify information, but is distinct from traditional ANNs in that the systems are pre-trained using auto-encoders to acquire a general knowledge of the data domain. To be clear, though, most of the problems that have been tackled are “subsymbolic” image recognition and speech problems. Still, the improvements have been fairly impressive given some pretty simple ideas. First, pre-training is accompanied by systematically bottlenecking the number of nodes available for learning. Second, the activation of each node is kept low to avoid overfitting to nodes with dominating magnitudes. Together, these let the auto-encoders learn the patterns without labeled training and then be trained faster and more easily to associate those patterns with classes.
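The bottleneck idea can be made concrete with a toy example. The sketch below is illustrative only (invented dimensions, linear units, no sparsity penalty or stacking, none of it from any cited system): a network is forced to reconstruct 4-dimensional inputs through only 2 hidden nodes, with no class labels involved.

```python
import numpy as np

# Toy sketch of auto-encoder pre-training via a bottleneck: reconstruct
# 4-dimensional inputs through a 2-node hidden layer by gradient descent
# on the squared reconstruction error. Real deep-learning systems add
# nonlinearities, sparsity constraints, and stacked layers on top of this.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=200)  # redundant features give the
X[:, 3] = X[:, 1] + 0.1 * rng.normal(size=200)  # bottleneck structure to find

W1 = 0.1 * rng.normal(size=(4, 2))  # encoder: input -> bottleneck code
W2 = 0.1 * rng.normal(size=(2, 4))  # decoder: code -> reconstruction
lr = 0.01

def loss(W1, W2):
    """Mean squared reconstruction error."""
    return float(np.mean((X @ W1 @ W2 - X) ** 2))

before = loss(W1, W2)
for _ in range(500):
    H = X @ W1                      # hidden code
    G = 2 * (H @ W2 - X) / len(X)   # gradient of the reconstruction error
    gW2 = H.T @ G
    gW1 = X.T @ (G @ W2.T)
    W1 -= lr * gW1
    W2 -= lr * gW2
after = loss(W1, W2)
```

Because two of the four input dimensions are nearly redundant, the 2-node code can capture most of the signal, and the reconstruction error drops without any labels ever being used; supervised training would then start from these learned codes.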

I still have my doubts concerning the threat timeline, however. For one, these are mostly sub-symbolic systems that are not capable of the kinds of self-directed system modifications that many fear could lead to exponential self-improvement. Second, the tasks seeing improvements are not new, just relatively well-known classification problems. Finally, the improvements, while impressive, are incremental. There is probably a meaningful threat profile that can be converted into a decision tree for when action is needed. For global climate change there are consensus estimates about sea level changes, for instance. For Evil AI, I think we need to wait for a single act of out-of-control machine intelligence before spending excessively on containment, policy, or regulation. In the meantime, though, keep a close eye on your laptop.

And then there’s the mild misanthropy of Claude Shannon, possibly driven by living too long in New Jersey:

I visualize a time when we will be to robots what dogs are to humans, and I’m rooting for the machines.

Computing the Madness of People

The best paper I’ve read so far this year has to be Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-sample Performance by David Bailey, Jonathan Borwein, Marcos López de Prado, and Qiji Jim Zhu. The title should ring alarm bells for anyone who has ever puzzled over the disclaimer made by mutual funds and investment strategists that “past performance is not a guarantee of future performance.” No, but we have nothing except past performance on which to judge the fund or firm; otherwise we could just pick based on vague investment “philosophies” like those the heroizing profiles in Kiplinger’s seem to promote, or trust that all the arbitraging has squeezed the markets into perfect equilibria and simply use index funds.

The paper’s core tenets extend well beyond financial charlatanism, however. The authors point out that the same problem arises in drug discovery, where main effects of novel compounds may be due to pure randomness in the sample population in a way that is masked by the sample selection procedure. The history of mental illness research has similar failures, with the head of NIMH remarking that clinical trials based on the DSM’s treatment of psychiatric symptoms are too often “shooting in the dark.”

The core suggestion of the paper is remarkably simple: use held-out data to validate models. Remarkably simple, but apparently rarely done in quantitative financial analysis. The researchers show how simple random walks can look like a seasonal price pattern, and how, by sending binary signals about market performance to clients (market will rise/market will fall), investment advisors can create a subpopulation that thinks they are geniuses while other clients walk away due to losses. These examples rise to the level of charlatanism, but overfitting is just one form of pseudo-mathematics in which insufficient care is used in managing the data. And the fix is simple: do what we do in machine learning. Divide the training data into 10 buckets, train on 9 of them, and validate on the last one. Then rotate through another division/training cycle. Anomalies in the data start popping out very quickly, and ensemble-based methods can help cope with breakdowns of independence assumptions and stability.
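The rotate-the-buckets procedure takes only a few lines. This sketch uses a deliberately trivial "model" (predict the training mean) and made-up data, since the point is the rotation of held-out folds, not the model itself.

```python
import random

# Minimal sketch of 10-fold validation: shuffle, split into k buckets,
# train on k-1 of them, score on the held-out bucket, and rotate. The
# "model" (predict the training mean) and the data are placeholders.

def k_fold_scores(data, k=10, seed=0):
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]     # k roughly equal buckets
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        mean = sum(train) / len(train)         # "train" the trivial model
        mse = sum((x - mean) ** 2 for x in held_out) / len(held_out)
        scores.append(mse)                     # out-of-sample error, fold i
    return scores

scores = k_fold_scores([float(i % 7) for i in range(100)], k=10)
```

A strategy that only looks good on the bucket it was fit to will show it immediately: its held-out scores scatter wildly across the k rotations instead of staying consistently low.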

Perhaps the best quote in the paper is from Sir Isaac Newton, who lamented that he could not calculate “the madness of people” after losing a small fortune in the South Sea Bubble of 1720. If we are going to start computing that madness, it is important to do it right.

Saving Big Data from the Zeros

Because of the hype cycle, Big Data inevitably attracts dissenters who want to deflate a bit the lofty expectations built around new technologies that appear mystifying to those outside the Silicon Valley machine. The first response is generally “so what?”: there is nothing new here, just a rehashing of efforts like grid computing and Beowulf clusters. This skepticism is generally a healthy inoculation against aggrandizement and against any hangover from unmet expectations. Hence, the NY Times op-ed from April 6th, Eight (No, Nine!) Problems with Big Data, should be embraced for enumerating eight or nine different ways that Big Data technologies, algorithms, and thinking might be stretching the balloon of hope toward a loud, but ineffectual, pop.

The eighth item on the list bears some scrutiny, though. The authors, with whom I am not familiar, focus on the overuse of trigrams in building statistical language models. They note that language is very productive, and that even a short phrase from Rob Lowe, “dumbed-down escapist fare,” doesn’t appear in Google’s indexed corpus. Shades of Chomsky’s “colorless green ideas…,” but an important lesson in how to manage the composition of meaning. “Dumbed-down escapist fare” doesn’t translate well back and forth through German via Google Translate. For the authors, this shows the failure of the statistical translation methodology linked to Big Data, and it ties in to their other concerns about predicting rare occurrences or even, in the case of Lowe’s phrase, zero occurrences.

In reality, though, these methods of statistical translation through parallel-text learning date to the late 1980s and reflect a distinct journey through ways of thinking about natural language and computing. Throughout the field’s history, phrasal bracketing and the alignment of those phrases to build a statistical concordance have been driven by more than trigrams. And where higher-order n-grams get sparse (or go to zero probability, like Rob Lowe’s phrase), a better estimate is based on composing the probabilities of each sub-phrase or word:


Google Search                     Hits
"dumbed-down"                     690,000
escapist                          1,860,000
fare                              132,000,000
"dumbed-down escapist fare"       110 (all about the NY Times article)



Indeed, reweighting language models to accommodate the complexities of unknowns has been around for a long time. The idea is called “back-off probability” and can be used for n-gram text modeling or even for rather elegant multi-length compression models like Prediction by Partial Matching (PPM). In both cases, the composition of the substrings becomes important when the phrasal whole has no reference. And when an unknown word is present in a sequence of knowns, the combined semantics of the knowns, together with an estimate of the part of speech of the unknown based on syntactic regularities, provides clues; “dumbed-down engorssian fare” could be about the arts or food, but likely not about space travel or delivery vans.
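A back-off estimate is easy to sketch. The version below is a simplified "stupid backoff" style scorer, a cruder cousin of the smoothed back-off schemes used in real language models, over an invented toy corpus: when a higher-order n-gram has zero counts, it falls back to a discounted lower-order estimate instead of assigning zero.

```python
from collections import Counter

# Sketch of back-off scoring ("stupid backoff" style, with a fixed
# discount alpha rather than properly normalized back-off weights):
# unseen bigrams fall back to alpha * unigram frequency instead of
# going to zero. The corpus is invented for illustration.

corpus = "the escapist fare was dumbed-down escapist fun".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def score(word, prev=None, alpha=0.4):
    """Bigram relative frequency if the bigram was seen, else a
    discounted unigram estimate."""
    if prev is not None and bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * unigrams[word] / total

seen = score("fare", prev="escapist")       # bigram occurs in the corpus
unseen = score("fare", prev="dumbed-down")  # zero-count bigram: backs off
```

The unseen pair still gets a positive score, just a smaller one than the attested pair, which is exactly the rescue from zeros the paragraph above describes.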

And so we rescue Big Data from the threat of zeros.


Signals and Apophenia

The central theme in Signals and Noise is that of the inverse problem and its consequences: given an ocean of data, how does one uncover the true signals hidden in the noise? Is there even such a thing? There’s an obsessive balance between apophenia and modeling built somewhere into our skulls.

The cover art for Signals and Noise reflects those tendencies. There is a QR Code that encodes a passage from the book, and then there is a distortion of the content of the QR Code. The distortion, in turn, creates a compelling image. Is it a fly creeping to the left or a lion’s head tilted to the right?


A free hard-cover copy of Signals and Noise to anyone who decodes the QR Code. Post a copy of the text to claim your reward.

A Paradigm of Guessing

The most interesting thing I’ve read this week comes from Jürgen Schmidhuber’s paper, Algorithmic Theories of Everything, which should be provocative enough to pique the most jaded of interests. And the quote is from deep into the paper:

The first number is 2, the second is 4, the third is 6, the fourth is 8. What is the fifth? The correct answer is “250,” because the nth number is n^5 − 5n^4 − 15n^3 + 125n^2 − 224n + 120. In certain IQ tests, however, the answer “250” will not yield maximal score, because it does not seem to be the “simplest” answer consistent with the data (compare [73]). And physicists and others favor “simple” explanations of observations.
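Schmidhuber's quintic is easy to check directly, which makes his point nicely: a perfectly consistent rule can still feel like the "wrong" answer.

```python
# Verify the quintic from Schmidhuber's IQ-test example: it matches the
# "obvious" sequence 2, 4, 6, 8 exactly, then departs to 250 at n = 5.

def f(n):
    return n**5 - 5 * n**4 - 15 * n**3 + 125 * n**2 - 224 * n + 120

values = [f(n) for n in range(1, 6)]
print(values)  # [2, 4, 6, 8, 250]
```

Both f(n) and the linear rule 2n fit the four observations; only a preference for simplicity, a value judgment, picks between them.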

And this is the beginning and the end of logical positivism. How can we assign truth to inductive judgments without crossing from fact to value, and what should that value system be?