Category: Big Data

Keep Suspicious and Carry On

I’ve previously argued that it is unlikely that resource-constrained simulations can achieve adequate levels of fidelity to be sufficient for what we observe around us. This argument was a combination of computational irreducibility and assumptions about the complexity of evolutionary trajectories of living beings. There may also be an argument about the observed contingency of the evolutionary process that is an argument against any kind of “intelligent” organizing principle though not against simulation itself.

Leave it to physicists to envision a test of the Bostrom hypothesis that we are living in a computer simulation. Martin Savage and his colleagues look at Quantum Chromodynamic (QCD) theory and current simulation methods for QCD. They conclude that if we are, in fact, living in a simulation, then we might observe specific inconsistencies that arise from finite computing power for the universe as a whole. Those inconsistencies would be observed in looking at the distribution of cosmic ray energies, specifically. Note that if the distribution is not unusual the universe could either be a simulation (just a sophisticated one) or could be a truly physical one (free running and not on another entity’s computational framework). It is only if the distribution is unusual that it might be a simulation.

Sparse Grokking

Jeff Hawkins of Palm fame shows up in the New York Times hawking his Grok for Big Data predictions. Interestingly, if one drills down into the details of Grok, we see once again that randomized sparse representations are the core of the system. That is, if we assign symbols random representational vectors that are sparse, we can sum the vectors for co-occurring symbols and, following J.R. Firth’s pithy “words shall be known by the company that they keep” start to develop a theory of meaning that would not offend Wittgenstein.

Is there anything new to Hawkins’ effort? For certain types of time-series prediction, the approach parallels artificial neural network designs, replacing the complexity of shifting, multi-epoch training regimens that, in effect, build the high-dimensional distances between co-occurring events by gradually moving time-correlated data together and uncorrelated data apart with an end-run around all the computational complexity. But then there is Random Indexing, which I’ve previously discussed, here. If one restricts Random Indexing to operating on temporal patterns, or on spatial patterns, then the results start to look like Numenta’s offering.

While there is a bit of opportunism in Hawkins latching onto Big Data to promote an application of methods he has been working on for years, there are very real opportunities for trying to mine leading indicators to help with everything from ecommerce to research and development. Many flowers will bloom, grok, die, and be reborn.

Pressing Snobs into Hell

Paul Vitanyi has been a deep advocate for Kolmogorov complexity for many years. His book with Ming Li, An Introduction to Kolmogorov Complexity and Its Applications, remains on my book shelf (and was a bit of an investment in grad school).

I came across a rather interesting paper by Vitanyi with Rudi Cilibrasi called “Clustering by Compression” that illustrates perhaps more easily and clearly than almost any other recent work the tight connections between meaning, repetition, and informational structure. Rather than describing the paper, however, I wanted to conduct an experiment that demonstrates their results. To do this, I asked the question: are the writings of Dante more similar to other writings of Dante than to Thackeray? And is the same true of Thackeray relative to Dante?

Now, we could pursue these questions at many different levels. We might ask scholars, well-versed in the works of each, to compare and contrast the two authors. They might invoke cultural factors, the memes of their respective eras, and their writing styles. Ultimately, though, the scholars would have to get down to some textual analysis, looking at the words on the page. And in so doing, they would draw distinctions by lifting features of the text, comparing and contrasting grammatical choices, word choices, and other basic elements of the prose and poetry on the page. We might very well be able to take parts of the knowledge of those experts and distill it into some kind of a logical procedure or algorithm that would parse the texts and draw distinctions based on the distributions of words and other structural cues. If asked, we might say that a similar method might work for the so-called language of life, DNA, but that it would require a different kind of background knowledge to build the analysis, much less create an algorithm to perform the same task. And perhaps a similar procedure would apply to music, finding both simple similarities between features like tempos, as well as complex stylistic and content-based alignments.

Yet, what if we could invoke some kind of universal approach to finding exactly what features of each domain are important for comparisons? The universal approach would need to infer the appropriate symbols, grammars, relationships, and even meaning in order to do the task. And this is where compression comes in. In order to efficiently compress a data stream, an algorithm must infer repetitive patterns that are both close together like the repetition of t-h-e in English, as well as further apart, like verb agreement and other grammatical subtleties in the same language. By inferring those patterns, the compressor can then create a dictionary and emit just a reference to the pattern in the dictionary, rather than the pattern itself.

Cilibrasi and Vitanyi, in their paper, conduct a number of studies on an innovative application of compression to finding similarities and, conversely, differences. To repeat their experiment, I used a standard data compressor called bzip2 and grabbed two cantos from Dante’s Divine Comedy at random, and then two chapters from Thackeray’s The Book of Snobs. I then followed their procedure and compressed each individually using bzip2, as well as concatenating each together (pairwise) and compressing the combined document. The idea is that when you concatenate them together, the similarities present between the two documents should manifest as improved compression (smaller sized files) because the pattern dictionary will be more broadly applicable. The length of the files needs to be normalized a bit, however, because the files themselves vary in length, so following Cilibrasi and Vitanyi’s procedure, I subtracted the minimum of the compressed sizes of the two independent files and divided by the maximum of the same.

The results were perfect:

X1 X1 Size X2 X2 Size X1X2 Size NCD
d1 2828 d2 3030 5408 4933.748232
d1 2828 t1 3284 5834 6201.203678
d1 2828 t2 2969 5529 5280.703324
d2 3030 t1 3284 6001 5884.148845
d2 3030 t2 2969 5692 5000.694389
t1 3284 t2 2969 5861 4599.207369


Note that d1 has the lowest distance (NCD) to d2. t1 is also closest to t2. In this table, d1 and d2 are the Dante cantos. t1 and t2 are the Thackeray chapters. NCD is the Normalized Compression Distance. X1 Size is the compressed size of X1 in bytes. X2 Size is the compressed size for X2. Combined size is the compressed size of the combined original text (combined by concatenation in the order X1 followed by X2). NCD is calculated per the paper.

Fish eating fish eating fish

Decompressing in NorCal following a vibrant Hadoop World. More press mentions:

· Big Data, Big News: 10 Things To See At Hadoop World, CRN, October 23, 2012 – (Circulation 53,397)

· Quest Software Announces Hadoop-Centric Software Analytics, CloudNewsDaily, October 23, 2012-coverage of Hadoop product announcements.

· Quest Launches New Analytics Software for Hadoop, SiliconANGLE, October 23, 2012- coverage of Hadoop Product.

· Continuing its M&A software push, Dell moves into ‘big data’ analytics via Kitenga buy, 451 Research

· Cisco Updates Schedule to Automate Hadoop Big Data Analysis Systems, Eweek, October 24, 2012- mention of Kitenga product announcement at Hadoop. (Circulation 196,157)

· Quest Launches New Analytics Software for Hadoop, DABBC, October 24, 2012

And what about fish? Dell == Big Fish, Quest == Medium Fish, Kitenga == Happy Minnow.

Dell Acquires Kitenga

Dell Inc. : Quest Software Expands Its Big Data Solution with New Hadoop-Centric Software Capabilities for Business Analytics

10/23/2012| 08:05am US/Eastern

  • Complete solution includes application development, data replication, and data analysis

Hadoop World 2012-Quest Software, Inc. (now part of Dell) announced three significant product releases today aimed at helping customers more quickly adopt Hadoop and exploit their Big Data. When used together, the three products offer a complete solution that addresses the greatest challenge with Hadoop: the shortage of technical and analytical skills needed to gain meaningful business insight from massive volumes of captured data. Quest builds on its long history in data and database management to open the world of Big Data to more than just the data scientist.

News Facts:

  • Kitenga Analytics: Based on the recent acquisition of Kitenga, Quest Software now enables customers to analyze structured, semi-structured and unstructured data stored in Hadoop. Available immediately, Kitenga Analytics delivers sophisticated capabilities, including text search, machine learning, and advanced visualizations, all from an easy-to-use interface that does not require understanding of complex programming or the Hadoop stack itself. With Kitenga Analytics and the Quest Toad®Business Intelligence Suite, an organization has a complete self-service analysis environment that empowers business and systems analysts across a variety of backgrounds and job roles.

An Exit to a New Beginning

I am thrilled to note that my business partner and I sold our Big Data analytics startup to a large corporation yesterday. I am currently unemployed but start anew doing the same work on Monday.

Thrilled is almost too tame a word. Ecstatic does better describing the mood around here and the excitement we have over having triumphed in Sili Valley. There are many war stories that we’ve been swapping over the last 24 hours, including how we nearly shut down/rebooted at the start of 2012. But now it is over and we have just a bit of cleanup work left to dissolve the existing business structures and a short vacation to attend to.

Semantic Zooming

I’ve been pushing hard for a demo at Hadoop Summit this week, waking unexpectedly at 5 AM this morning with spherical trigonometry percolating through my head. The topic is “semantic zooming” and it is not a complicated concept to understand because we have a common example that many of us use daily: Google Maps. All the modern, online mapping systems do semantic zooming to a degree when they change the types of information that are displayed on the map depending on the zoom level. Thus, the “semantics” or “meaning” of the displayed information changes with zooming, revealing states, then rivers, then major roads, then minor roads, and then all the way down to local businesses. The goal of semantic zooming is to manage information overload by managing semantics.

In my case, I’m using a semantic zooming interface to apply different types of information visualizations to data resources in a distributed file system (a file system that spans many disk drives in many computers) related to the “big data” technology, Hadoop. A distributed file system can have many data types (numerical data, text, PDFs, log files from web servers, scientific data) and the only way to interact with the data is through a command-line or through fairly simple web-based user interfaces that act like crude file system browsers. Making use of the data in the system, analyzing it, requires running analysis processes on it, then pulling the data out and importing it into other technologies like Excel or business intelligence systems to bind charting and visualization tools to it. With semantic zooming operating directly on the data, however, the structure of the data can be probed directly and the required background processes launch automatically to create new aggregate views of the data. The end result is an entirely new way of looking at complex data resources.

How does it work? The system uses a spatial metaphor to represent the tree-like structure of the distributed file system. It runs in a browser using the new WebGL standard for high-performance 3D graphics that supports combining the native capabilities of graphics cards with web page rendering. WebGL isn’t pervasive yet, but works well in Google Chrome and acceptably in Firefox. I even tried a hack for making it work on my iPad 3, recently, with mixed success. More work is needed before most WebGL systems will run on iPads. To enable changing semantics, the browser is constantly requesting information from a proxy web service about the distributed file system. As users apply different data lenses (pie chart, bar chart, social network graph, etc.) to the data elements, the distributed processing engine generates new aggregate views of the underlying data.