Category: Big Data

Sparse Grokking

Jeff Hawkins of Palm fame shows up in the New York Times hawking his Grok for Big Data predictions. Interestingly, if one drills down into the details of Grok, we see once again that randomized sparse representations are the core of the system. That is, if we assign symbols random representational vectors that are sparse, we can sum the vectors for co-occurring symbols and, following J.R. Firth’s pithy “words shall be known by the company that they keep” start to develop a theory of meaning that would not offend Wittgenstein.

Is there anything new to Hawkins’ effort? For certain types of time-series prediction, the approach parallels artificial neural network designs, replacing the complexity of shifting, multi-epoch training regimens that, in effect, build the high-dimensional distances between co-occurring events by gradually moving time-correlated data together and uncorrelated data apart with an end-run around all the computational complexity. But then there is Random Indexing, which I’ve previously discussed, here. If one restricts Random Indexing to operating on temporal patterns, or on spatial patterns, then the results start to look like Numenta’s offering.

While there is a bit of opportunism in Hawkins latching onto Big Data to promote an application of methods he has been working on for years, there are very real opportunities for trying to mine leading indicators to help with everything from ecommerce to research and development. Many flowers will bloom, grok, die, and be reborn.

Pressing Snobs into Hell

Paul Vitanyi has been a deep advocate for Kolmogorov complexity for many years. His book with Ming Li, An Introduction to Kolmogorov Complexity and Its Applications, remains on my book shelf (and was a bit of an investment in grad school).

I came across a rather interesting paper by Vitanyi with Rudi Cilibrasi called “Clustering by Compression” that illustrates perhaps more easily and clearly than almost any other recent work the tight connections between meaning, repetition, and informational structure. Rather than describing the paper, however, I wanted to conduct an experiment that demonstrates their results. To do this, I asked the question: are the writings of Dante more similar to other writings of Dante than to Thackeray? And is the same true of Thackeray relative to Dante?

Now, we could pursue these questions at many different levels. We might ask scholars, well-versed in the works of each, to compare and contrast the two authors. They might invoke cultural factors, the memes of their respective eras, and their writing styles. Ultimately, though, the scholars would have to get down to some textual analysis, looking at the words on the page. And in so doing, they would draw distinctions by lifting features of the text, comparing and contrasting grammatical choices, word choices, and other basic elements of the prose and poetry on the page. We might very well be able to take parts of the knowledge of those experts and distill it into some kind of a logical procedure or algorithm that would parse the texts and draw distinctions based on the distributions of words and other structural cues. If asked, we might say that a similar method might work for the so-called language of life, DNA, but that it would require a different kind of background knowledge to build the analysis, much less create an algorithm to perform the same task. And perhaps a similar procedure would apply to music, finding both simple similarities between features like tempos, as well as complex stylistic and content-based alignments.

Yet, what if we could invoke some kind of universal approach to finding exactly what features of each domain are important for comparisons? The universal approach would need to infer the appropriate symbols, grammars, relationships, and even meaning in order to do the task. And this is where compression comes in. In order to efficiently compress a data stream, an algorithm must infer repetitive patterns that are both close together like the repetition of t-h-e in English, as well as further apart, like verb agreement and other grammatical subtleties in the same language. By inferring those patterns, the compressor can then create a dictionary and emit just a reference to the pattern in the dictionary, rather than the pattern itself.

Cilibrasi and Vitanyi, in their paper, conduct a number of studies on an innovative application of compression to finding similarities and, conversely, differences. To repeat their experiment, I used a standard data compressor called bzip2 and grabbed two cantos from Dante’s Divine Comedy at random, and then two chapters from Thackeray’s The Book of Snobs. I then followed their procedure and compressed each individually using bzip2, as well as concatenating each together (pairwise) and compressing the combined document. The idea is that when you concatenate them together, the similarities present between the two documents should manifest as improved compression (smaller sized files) because the pattern dictionary will be more broadly applicable. The length of the files needs to be normalized a bit, however, because the files themselves vary in length, so following Cilibrasi and Vitanyi’s procedure, I subtracted the minimum of the compressed sizes of the two independent files and divided by the maximum of the same.

The results were perfect:

X1 X1 Size X2 X2 Size X1X2 Size NCD
d1 2828 d2 3030 5408 4933.748232
d1 2828 t1 3284 5834 6201.203678
d1 2828 t2 2969 5529 5280.703324
d2 3030 t1 3284 6001 5884.148845
d2 3030 t2 2969 5692 5000.694389
t1 3284 t2 2969 5861 4599.207369


Note that d1 has the lowest distance (NCD) to d2. t1 is also closest to t2. In this table, d1 and d2 are the Dante cantos. t1 and t2 are the Thackeray chapters. NCD is the Normalized Compression Distance. X1 Size is the compressed size of X1 in bytes. X2 Size is the compressed size for X2. Combined size is the compressed size of the combined original text (combined by concatenation in the order X1 followed by X2). NCD is calculated per the paper.

Fish eating fish eating fish

Decompressing in NorCal following a vibrant Hadoop World. More press mentions:

· Big Data, Big News: 10 Things To See At Hadoop World, CRN, October 23, 2012 – (Circulation 53,397)

· Quest Software Announces Hadoop-Centric Software Analytics, CloudNewsDaily, October 23, 2012-coverage of Hadoop product announcements.

· Quest Launches New Analytics Software for Hadoop, SiliconANGLE, October 23, 2012- coverage of Hadoop Product.

· Continuing its M&A software push, Dell moves into ‘big data’ analytics via Kitenga buy, 451 Research

· Cisco Updates Schedule to Automate Hadoop Big Data Analysis Systems, Eweek, October 24, 2012- mention of Kitenga product announcement at Hadoop. (Circulation 196,157)

· Quest Launches New Analytics Software for Hadoop, DABBC, October 24, 2012

And what about fish? Dell == Big Fish, Quest == Medium Fish, Kitenga == Happy Minnow.

Dell Acquires Kitenga

Dell Inc. : Quest Software Expands Its Big Data Solution with New Hadoop-Centric Software Capabilities for Business Analytics

10/23/2012| 08:05am US/Eastern

  • Complete solution includes application development, data replication, and data analysis

Hadoop World 2012-Quest Software, Inc. (now part of Dell) announced three significant product releases today aimed at helping customers more quickly adopt Hadoop and exploit their Big Data. When used together, the three products offer a complete solution that addresses the greatest challenge with Hadoop: the shortage of technical and analytical skills needed to gain meaningful business insight from massive volumes of captured data. Quest builds on its long history in data and database management to open the world of Big Data to more than just the data scientist.

News Facts:

  • Kitenga Analytics: Based on the recent acquisition of Kitenga, Quest Software now enables customers to analyze structured, semi-structured and unstructured data stored in Hadoop. Available immediately, Kitenga Analytics delivers sophisticated capabilities, including text search, machine learning, and advanced visualizations, all from an easy-to-use interface that does not require understanding of complex programming or the Hadoop stack itself. With Kitenga Analytics and the Quest Toad®Business Intelligence Suite, an organization has a complete self-service analysis environment that empowers business and systems analysts across a variety of backgrounds and job roles.

An Exit to a New Beginning

I am thrilled to note that my business partner and I sold our Big Data analytics startup to a large corporation yesterday. I am currently unemployed but start anew doing the same work on Monday.

Thrilled is almost too tame a word. Ecstatic does better describing the mood around here and the excitement we have over having triumphed in Sili Valley. There are many war stories that we’ve been swapping over the last 24 hours, including how we nearly shut down/rebooted at the start of 2012. But now it is over and we have just a bit of cleanup work left to dissolve the existing business structures and a short vacation to attend to.

Semantic Zooming

I’ve been pushing hard for a demo at Hadoop Summit this week, waking unexpectedly at 5 AM this morning with spherical trigonometry percolating through my head. The topic is “semantic zooming” and it is not a complicated concept to understand because we have a common example that many of us use daily: Google Maps. All the modern, online mapping systems do semantic zooming to a degree when they change the types of information that are displayed on the map depending on the zoom level. Thus, the “semantics” or “meaning” of the displayed information changes with zooming, revealing states, then rivers, then major roads, then minor roads, and then all the way down to local businesses. The goal of semantic zooming is to manage information overload by managing semantics.

In my case, I’m using a semantic zooming interface to apply different types of information visualizations to data resources in a distributed file system (a file system that spans many disk drives in many computers) related to the “big data” technology, Hadoop. A distributed file system can have many data types (numerical data, text, PDFs, log files from web servers, scientific data) and the only way to interact with the data is through a command-line or through fairly simple web-based user interfaces that act like crude file system browsers. Making use of the data in the system, analyzing it, requires running analysis processes on it, then pulling the data out and importing it into other technologies like Excel or business intelligence systems to bind charting and visualization tools to it. With semantic zooming operating directly on the data, however, the structure of the data can be probed directly and the required background processes launch automatically to create new aggregate views of the data. The end result is an entirely new way of looking at complex data resources.

How does it work? The system uses a spatial metaphor to represent the tree-like structure of the distributed file system. It runs in a browser using the new WebGL standard for high-performance 3D graphics that supports combining the native capabilities of graphics cards with web page rendering. WebGL isn’t pervasive yet, but works well in Google Chrome and acceptably in Firefox. I even tried a hack for making it work on my iPad 3, recently, with mixed success. More work is needed before most WebGL systems will run on iPads. To enable changing semantics, the browser is constantly requesting information from a proxy web service about the distributed file system. As users apply different data lenses (pie chart, bar chart, social network graph, etc.) to the data elements, the distributed processing engine generates new aggregate views of the underlying data.

Randomness and Meaning

The impossibility of the Chinese Room has implications across the board for understanding what meaning means. Mark Walker’s paper “On the Intertranslatability of all Natural Languages” describes how the translation of words and phrases may be achieved:

  1. Through a simple correspondence scheme (word for word)
  2. Through “syntactic” expansion of the languages to accommodate concepts that have no obvious equivalence (“optometrist” => “doctor for eye problems”, etc.)
  3. Through incorporation of foreign words and phrases as “loan words”
  4. Through “semantic” expansion where the foreign word is defined through its coherence within a larger knowledge network.

An example for (4) is the word “lepton” where many languages do not have a corresponding concept and, in fact, the concept is dependent on a bulwark of advanced concepts from particle physics. There may be no way to create a superposition of the meanings of other words using (2) to adequately handle “lepton.”

These problems present again for trying to understand how children acquire meaning in learning a language. As Walker points out, language learning for a second language must involve the same kinds of steps as learning translations, so any simple correspondence theory has to be supplemented.

So how do we make adequate judgments about meanings and so rapidly learn words, often initially with a course granularity but later with increasingly sharp levels of focus? What procedure is required for expanding correspondence theories to operate in larger networks? Methods like Latent Semantic Analysis and Random Indexing show how this can be achieved in ways that are illuminating about human cognition. In each case, the methods provide insights into how relatively simple transformations of terms and their occurrence contexts can be viewed as providing a form of “triangulation” about the meaning of words. And, importantly, this level of triangulation is sufficient for these methods to do very human-like things. Both methods can pass the TOEFL exam, for instance, and Latent Semantic Analysis is at the heart of automatic essay grading approaches that have sufficiently high success rates that they are widely used by standardized test makers.

How do they work? I’ll just briefly describe Random Indexing, since I recently presented the concept at the Big Data Science meetup at SGI in Fremont, California. In Random Indexing, we simply create a randomized sparse vector for each word we encounter in a large collection of texts. The vector can be binary as a first approximation, so something like:

The: 0000000000000100000010000000000000000001000000000000000…

quick: 000100000000000010000000000001000000000110000000000000…

fox: 0000000000000000000000100000000000000000000000000100100…

Now, as I encountered a given word in the text, I just add up the random vectors for the words around it to create a new “context” vector that is still sparse, but less so than the component parts. What is interesting about this approach is that if you consider the vectors as representing points in a hyperspace with the same dimensionality as the vectors are long, then words that have similar meanings tend to cluster in that space. Latent Semantic Analysis achieves a similar clustering using some rather complex linear algebra. A simple approximation of the LSA approach is also at the heart of Google’s PageRank algorithm, though operating on link structure rather than word co-occurrences.

So how do we solve the TOEFL test using an approach like Random Indexing? A large collection of texts are analyzed to create a Random Index, then for a sample question like:

In line 5, the word “pronounced” most closely means

  1. evident
  2. spoken
  3. described
  4. unfortunate

The question and the question text are converted into a context vector using the same random vectors for the index and then the answers vectors are compared to see which is closest in the index space. This is remarkably inexpensive to compute, requiring just an inner product between the context vectors for question and answer.

A method for compact coding using Algorithmic Information Theory can also be used to achieve similar results, demonstrating the wide applicability of context-based analysis to helping understand how intertranslateability and language learning are dependent on the rich contexts of word usage.