Category: Big Data

Industrial Revolution #4

Paul Krugman at the New York Times consumes Robert Gordon’s analysis of economic growth and the role of technology and comes away more hopeful than Gordon. The kernel of Krugman’s hope is that Big Data analytics can provide a shortcut to intelligent machines by bypassing the specification and programming that were once assumed to be a requirement for artificial intelligence. Instead, we don’t specify but use “data-intensive ways” to achieve a better result. And we might get to IR#4, following Gordon’s taxonomy, where IR stands for “industrial revolution.” IR#1 was steam and locomotives. IR#2 was everything up to computers. IR#3 is computers and cell phones and whatnot.

Krugman implies that IR#4 might spur the typical economic consequences of grand technological change, including the massive displacement of workers, but, as in previous revolutions, he also assumes that economic growth built from new industries will ultimately eclipse the negatives. This is not new, of course. Robert Anton Wilson argued decades ago for the R.I.C.H. economy (Rising Income through Cybernetic Homeostasis). Wilson may have been on acid, but Krugman wasn’t yet tuned in, man. (A brief aside: the Krugman/Wilson notions probably break down over extraction and agribusiness/land rights issues. Even if labor is completely replaced by intelligent machines, the land and the ingredients it contains remain a bottleneck for economic growth. Look at the global demand for copper and rare earth materials, for instance.)

But why the particular focus on Big Data technologies? Krugman’s hope teeters on the assumption that data-intensive algorithms possess a fundamentally different scale and capacity than human-engineered approaches. Having risen through the computational linguistics and AI community working on data-driven methods for approaching intelligence, I can certainly sympathize with the motivation, but there are really only modest results to report at this time. For instance, statistical machine translation is still pretty poor quality, and is arguably no better than the rules-based methods from the 70s and 80s in anything other than scale and the diversity of the languages covered. Recent achievements like the DARPA Grand Challenge for self-driving vehicles were not achieved through data-intensive methods but through careful examination of the limits of the baseline system. In that case, baseline meant a system that used a scanning laser rangefinder to avoid obstacles while following a map, and an improvement meant marginally outrunning the distance limitations of the rangefinder by using optical image recognition to support a modest speedup. Speech recognition is better due to accumulating many examples of labeled, categorized text, true. And we can certainly guess that the relevance of advertising placed on a web page is better than it once was, if only because it is an easy problem to attack without the necessity of deep considerations of human understanding, unless you take our buying behavior to be a deep indicator of our beings. We can also see some glimmers of data-intensive methods in the IBM Watson system, though the Watson team will be the first to tell you that they dealt with only medium-scale data (Wikipedia) in the design of their system.

Still, there is a clear economic-growth argument for the upshot of replacing workers in everything from manual drudgery straight through to fairly intelligent drudgery, which gives an economist like Krugman reason for hope. Now, if the limitations of energy and resource requirements can just be overcome, we can all retire to RICH, creative lives.

Keep Suspicious and Carry On

I’ve previously argued that it is unlikely that resource-constrained simulations can achieve levels of fidelity adequate to account for what we observe around us. This argument was a combination of computational irreducibility and assumptions about the complexity of evolutionary trajectories of living beings. The observed contingency of the evolutionary process may also argue against any kind of “intelligent” organizing principle, though not against simulation itself.

Leave it to physicists to envision a test of the Bostrom hypothesis that we are living in a computer simulation. Martin Savage and his colleagues look at quantum chromodynamics (QCD) and current simulation methods for QCD. They conclude that if we are, in fact, living in a simulation, then we might observe specific inconsistencies that arise from finite computing power for the universe as a whole. Those inconsistencies would be observed specifically in the distribution of cosmic ray energies. Note that if the distribution is not unusual, the universe could either be a simulation (just a sophisticated one) or could be a truly physical one (free running and not on another entity’s computational framework). It is only if the distribution is unusual that it might be a simulation.

Sparse Grokking

Jeff Hawkins of Palm fame shows up in the New York Times hawking his Grok for Big Data predictions. Interestingly, if one drills down into the details of Grok, we see once again that randomized sparse representations are the core of the system. That is, if we assign symbols random representational vectors that are sparse, we can sum the vectors for co-occurring symbols and, following J.R. Firth’s pithy “you shall know a word by the company it keeps,” start to develop a theory of meaning that would not offend Wittgenstein.
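
To make the idea concrete, here is a minimal sketch of that kind of random sparse representation in Python. It is not Grok's or Numenta's code. The dimensionality, the number of nonzero entries, the context window, and every function name are assumptions chosen for illustration.

```python
import math
import random
from collections import defaultdict

DIM = 1000     # vector dimensionality (assumed value)
NONZERO = 10   # nonzero entries per random vector (assumed value)

def index_vector(symbol, dim=DIM, nonzero=NONZERO):
    """Sparse random vector for a symbol, stored as {position: +1 or -1}."""
    rng = random.Random(symbol)  # seeding with the symbol makes it repeatable
    positions = rng.sample(range(dim), nonzero)
    return {p: rng.choice((-1, 1)) for p in positions}

def context_vectors(sentences, window=2):
    """Sum the random vectors of the symbols that co-occur with each symbol."""
    contexts = defaultdict(lambda: defaultdict(int))
    for words in sentences:
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    for pos, sign in index_vector(words[j]).items():
                        contexts[word][pos] += sign
    return contexts

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(val * v.get(pos, 0) for pos, val in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus, invented for this example.
sentences = [
    ["the", "cat", "chased", "the", "ball"],
    ["the", "dog", "chased", "the", "ball"],
    ["the", "stock", "market", "fell", "sharply"],
]
vectors = context_vectors(sentences)
cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_stock = cosine(vectors["cat"], vectors["stock"])
print(cat_dog, cat_stock)
```

In this toy corpus "cat" and "dog" keep exactly the same company, so their summed vectors coincide, while "stock" shares only "the" with them and lands further away.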

Is there anything new to Hawkins’ effort? For certain types of time-series prediction, the approach parallels artificial neural network designs. Those designs rely on shifting, multi-epoch training regimens that, in effect, build the high-dimensional distances between co-occurring events by gradually moving time-correlated data together and uncorrelated data apart. Grok replaces that with an end-run around all the computational complexity. But then there is Random Indexing, which I’ve previously discussed. If one restricts Random Indexing to operating on temporal patterns, or on spatial patterns, then the results start to look like Numenta’s offering.

While there is a bit of opportunism in Hawkins latching onto Big Data to promote an application of methods he has been working on for years, there are very real opportunities for trying to mine leading indicators to help with everything from ecommerce to research and development. Many flowers will bloom, grok, die, and be reborn.

Pressing Snobs into Hell

Paul Vitanyi has been a deep advocate for Kolmogorov complexity for many years. His book with Ming Li, An Introduction to Kolmogorov Complexity and Its Applications, remains on my bookshelf (and was a bit of an investment in grad school).

I came across a rather interesting paper by Vitanyi with Rudi Cilibrasi called “Clustering by Compression” that illustrates, perhaps more easily and clearly than almost any other recent work, the tight connections between meaning, repetition, and informational structure. Rather than describing the paper, however, I wanted to conduct an experiment that demonstrates their results. To do this, I asked the question: are the writings of Dante more similar to other writings of Dante than to those of Thackeray? And is the same true of Thackeray relative to Dante?

Now, we could pursue these questions at many different levels. We might ask scholars, well-versed in the works of each, to compare and contrast the two authors. They might invoke cultural factors, the memes of their respective eras, and their writing styles. Ultimately, though, the scholars would have to get down to some textual analysis, looking at the words on the page. And in so doing, they would draw distinctions by lifting features of the text, comparing and contrasting grammatical choices, word choices, and other basic elements of the prose and poetry on the page. We might very well be able to take parts of the knowledge of those experts and distill it into some kind of a logical procedure or algorithm that would parse the texts and draw distinctions based on the distributions of words and other structural cues. If asked, we might say that a similar method might work for the so-called language of life, DNA, but that it would require a different kind of background knowledge to build the analysis, much less create an algorithm to perform the same task. And perhaps a similar procedure would apply to music, finding both simple similarities between features like tempos, as well as complex stylistic and content-based alignments.

Yet, what if we could invoke some kind of universal approach to finding exactly what features of each domain are important for comparisons? The universal approach would need to infer the appropriate symbols, grammars, relationships, and even meaning in order to do the task. And this is where compression comes in. In order to efficiently compress a data stream, an algorithm must infer repetitive patterns that are both close together like the repetition of t-h-e in English, as well as further apart, like verb agreement and other grammatical subtleties in the same language. By inferring those patterns, the compressor can then create a dictionary and emit just a reference to the pattern in the dictionary, rather than the pattern itself.
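
The dictionary-and-reference idea can be sketched with a toy LZ78-style coder in Python. This is only an illustration under my own assumptions (the function names are made up). It is a different algorithm from bzip2, which is built on the Burrows-Wheeler transform and Huffman coding, but it shows directly how repeated patterns turn into short references.

```python
def lz78_encode(text):
    """Encode text as (dictionary index, next character) pairs."""
    dictionary = {"": 0}   # phrase -> index; index 0 is the empty phrase
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            # Keep extending the match while the pattern is already known.
            phrase += ch
        else:
            # Emit a reference to the known phrase plus the one new character,
            # then remember the longer phrase for later reuse.
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Flush a trailing match: reference its prefix plus its last character.
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

def lz78_decode(pairs):
    """Rebuild the text by replaying the same dictionary construction."""
    phrases = [""]
    pieces = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        pieces.append(phrase)
    return "".join(pieces)

# Sample text invented for this example.
sample = "the cat and the dog and the bird " * 20
pairs = lz78_encode(sample)
print(len(sample), "characters became", len(pairs), "pairs")
```

The more often a pattern such as t-h-e recurs, the longer the phrases each pair can point to, so repetitive input needs far fewer pairs than it has characters.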

Cilibrasi and Vitanyi, in their paper, conduct a number of studies on an innovative application of compression to finding similarities and, conversely, differences. To repeat their experiment, I used a standard data compressor called bzip2 and grabbed two cantos from Dante’s Divine Comedy at random, and then two chapters from Thackeray’s The Book of Snobs. I then followed their procedure and compressed each file individually using bzip2, and also concatenated the files pairwise and compressed each combined document. The idea is that when you concatenate two documents, the similarities between them should manifest as improved compression (smaller files) because the pattern dictionary will be more broadly applicable. The compressed sizes need to be normalized a bit, however, because the files themselves vary in length, so following Cilibrasi and Vitanyi’s procedure, I took the compressed size of the combined file, subtracted the minimum of the compressed sizes of the two independent files, and divided by the maximum of the same.
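
For anyone who wants to try this at home, the procedure can be sketched in a few lines of Python using the standard library's bz2 module. This is a reconstruction under my own assumptions, not the script used for the experiment, and the function and file names are hypothetical.

```python
import bz2

def compressed_size(data: bytes) -> int:
    """Size in bytes of the bzip2-compressed data."""
    return len(bz2.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings.

    Combined compressed size, minus the smaller of the two individual
    compressed sizes, divided by the larger of the two.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)   # concatenated in the order x then y
    return (cxy - min(cx, cy)) / max(cx, cy)

# Example (file names are hypothetical):
#   with open("d1.txt", "rb") as f1, open("t1.txt", "rb") as f2:
#       print(ncd(f1.read(), f2.read()))
```

Lower values mean the two inputs share more structure that the compressor can exploit.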

The results were perfect:

X1    X1 Size    X2    X2 Size    X1X2 Size    NCD
d1    2828       d2    3030       5408         4933.748232
d1    2828       t1    3284       5834         6201.203678
d1    2828       t2    2969       5529         5280.703324
d2    3030       t1    3284       6001         5884.148845
d2    3030       t2    2969       5692         5000.694389
t1    3284       t2    2969       5861         4599.207369


Note that d1 has the lowest distance (NCD) to d2. t1 is also closest to t2. In this table, d1 and d2 are the Dante cantos, and t1 and t2 are the Thackeray chapters. X1 Size is the compressed size of X1 in bytes, and X2 Size is the compressed size of X2. X1X2 Size is the compressed size of the combined original text (combined by concatenation in the order X1 followed by X2). NCD is the Normalized Compression Distance, calculated per the paper.
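
As a cross-check, the compressed sizes in the table can be plugged into the normalization as described in the text: combined size minus the smaller individual size, divided by the larger. The short Python sketch below does that. It is my own reconstruction with made-up names. The values it produces fall between 0 and 1, so they are on a different scale from the NCD column above, but the nearest neighbor of each document comes out the same.

```python
# Compressed sizes in bytes, copied from the table.
sizes = {"d1": 2828, "d2": 3030, "t1": 3284, "t2": 2969}
combined = {
    ("d1", "d2"): 5408,
    ("d1", "t1"): 5834,
    ("d1", "t2"): 5529,
    ("d2", "t1"): 6001,
    ("d2", "t2"): 5692,
    ("t1", "t2"): 5861,
}

def ncd_from_sizes(cx, cy, cxy):
    """(combined - min(individual)) / max(individual)."""
    return (cxy - min(cx, cy)) / max(cx, cy)

distances = {
    (a, b): ncd_from_sizes(sizes[a], sizes[b], cxy)
    for (a, b), cxy in combined.items()
}

def nearest(doc):
    """Return the document with the smallest distance to doc."""
    others = {}
    for (a, b), d in distances.items():
        if doc == a:
            others[b] = d
        elif doc == b:
            others[a] = d
    return min(others, key=others.get)

for (a, b), d in distances.items():
    print(a, b, f"{d:.4f}")
# d1 d2 0.8515
# d1 t1 0.9153
# d1 t2 0.9097
# d2 t1 0.9047
# d2 t2 0.8987
# t1 t2 0.8806

for doc in sizes:
    print(doc, "is closest to", nearest(doc))
```

Each Dante canto is nearest to the other canto, and each Thackeray chapter is nearest to the other chapter.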

Fish eating fish eating fish

Decompressing in NorCal following a vibrant Hadoop World. More press mentions:

· Big Data, Big News: 10 Things To See At Hadoop World, CRN, October 23, 2012 – (Circulation 53,397)

· Quest Software Announces Hadoop-Centric Software Analytics, CloudNewsDaily, October 23, 2012 – coverage of Hadoop product announcements.

· Quest Launches New Analytics Software for Hadoop, SiliconANGLE, October 23, 2012 – coverage of Hadoop product.

· Continuing its M&A software push, Dell moves into ‘big data’ analytics via Kitenga buy, 451 Research

· Cisco Updates Schedule to Automate Hadoop Big Data Analysis Systems, Eweek, October 24, 2012 – mention of Kitenga product announcement at Hadoop. (Circulation 196,157)

· Quest Launches New Analytics Software for Hadoop, DABBC, October 24, 2012

And what about fish? Dell == Big Fish, Quest == Medium Fish, Kitenga == Happy Minnow.

Dell Acquires Kitenga

Dell Inc. : Quest Software Expands Its Big Data Solution with New Hadoop-Centric Software Capabilities for Business Analytics

10/23/2012| 08:05am US/Eastern

  • Complete solution includes application development, data replication, and data analysis

Hadoop World 2012-Quest Software, Inc. (now part of Dell) announced three significant product releases today aimed at helping customers more quickly adopt Hadoop and exploit their Big Data. When used together, the three products offer a complete solution that addresses the greatest challenge with Hadoop: the shortage of technical and analytical skills needed to gain meaningful business insight from massive volumes of captured data. Quest builds on its long history in data and database management to open the world of Big Data to more than just the data scientist.

News Facts:

  • Kitenga Analytics: Based on the recent acquisition of Kitenga, Quest Software now enables customers to analyze structured, semi-structured and unstructured data stored in Hadoop. Available immediately, Kitenga Analytics delivers sophisticated capabilities, including text search, machine learning, and advanced visualizations, all from an easy-to-use interface that does not require understanding of complex programming or the Hadoop stack itself. With Kitenga Analytics and the Quest Toad® Business Intelligence Suite, an organization has a complete self-service analysis environment that empowers business and systems analysts across a variety of backgrounds and job roles.

An Exit to a New Beginning

I am thrilled to note that my business partner and I sold our Big Data analytics startup to a large corporation yesterday. I am currently unemployed but start anew doing the same work on Monday.

Thrilled is almost too tame a word. Ecstatic better describes the mood around here and the excitement we have over having triumphed in Sili Valley. There are many war stories that we’ve been swapping over the last 24 hours, including how we nearly shut down/rebooted at the start of 2012. But now it is over and we have just a bit of cleanup work left to dissolve the existing business structures and a short vacation to attend to.