# The Goldilocks Complexity Zone

Since my time in the early 90s at Santa Fe Institute, I’ve been fascinated by the informational physics of complex systems. What are the requirements of an abstract system that is capable of complex behavior? How do our intuitions about complex behavior or form match up with mathematical approaches to describing complexity? For instance, we might consider a snowflake complex, but it is also regular in it’s structure, driven by an interaction between crystal growth and the surrounding air. The classic examples of coastlines and fractal self-symmetry also seem complex but are not capable of complex behavior.

So what is a good way of thinking about complexity? There is actually a good range of ideas about how to characterize complexity. Seth Lloyd rounds up many of them, here. The intuition that drives many of them is that complexity seems to be associated with distributions of relationships and objects that are somehow juxtapositioned between a single state and a uniformly random set of states. Complex things, be they living organisms or computers running algorithms, should exist in a Goldilocks zone when each part is examined and those parts are somehow summed up to a single measure.

We can easily construct a complexity measure that captures some of these intuitions. Let’s look at three strings of characters:

x = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

y = menlqphsfyjubaoitwzrvcgxdkbwohqyxplerz

z = the fox met the hare and the fox saw the hare

Now we would likely all agree that y and z are more complex than x, and I suspect most would agree that y looks like gibberish compared with z. Of course, y could be a sequence of weirdly coded measurements or something, or encrypted such that the message appears random. Let’s ignore those possibilities for our initial attempt at defining a complexity measure. We can see right away that an approach using basic information theory doesn’t help much. Algorithmic informational complexity will be highest for y, as will entropy:

$H(s)=-\sum_{i=0}^{i=|s|}P(s_i)log_2P(s_i)$

for each sequence composed out of an alphabet with counts, s. So we get: H(x) = 0, H(y) = 3.199809, and H(z) = 2.3281. Here’s some sample R code using the “entropy” package if you want to calculate yourself:

> z = "the fox met the hare and the fox saw the hare"
> zt = table(strsplit(z, '')[[1]])
> entropy(zt, method="ML")



Note that the alphabet of each string is slightly different, but the missing characters between them don’t matter since their probabilities are 0.

We can just arbitrarily scale entropy by the maximum entropy possible for the same length string like this:

$H_m(s)=-\frac{-\sum_{i=0}^{i=|s|}P(s_i)logP(s_i)}{-\sum_{i=0}^{i=|s|}\frac{1}{|s|}log(\frac{1}{|s|})}$

$H_m(s)=\frac{\sum_{i=0}^{i=|s|}P(s_i)logP(s_i)}{log(\frac{1}{|s|})}$

This is somewhat like channel efficiency in communications theory, I think. And then just turn this into a parabolically-scaled measure that centers at 0.5:

$C(s)=\frac{1}{(1/2-H_m(s))^2+\epsilon}$

where $\epsilon$ is an arbitrary non-zero scaling parameter.

But this calculation is only considering the individual character frequencies, not the composition of the characters into groupings. So we can consider pairs of characters in this same calculation, or triples, etc. And also, just looking at these n-gram sequences doesn’t capture potentially longer range repetitious structures. So we can gradually ladle on grammars as the counting mechanism. Now, if our measure of complexity is really going to capture what we intuitively consider to be complex, all of these different levels of connections within the string or other organized piece of information must be present.

This general program is present in every one of Seth Lloyd’s complexity metrics in various ways and even comes into play in discussions of consciousness, though many use mutual information rather than entropy per se. Here’s Max Tegmark using a variation on Giulio Tinoni’s Phi concept from Integrated Information Theory to demonstrate that integration is a key component of consciousness and how that might be calculated for general physical systems.

# Entanglement and Information

Research can flow into interesting little eddies that cohere into larger circulations that become transformative phase shifts. That happened to me this morning between a morning drive in the Northern California hills and departing for lunch at one of our favorite restaurants in Danville.

The topic I’ve been working on since my retirement is whether there are preferential representations for optimal automated inference methods. We have this grab-bag of machine learning techniques that use differing data structures but that all implement some variation on fitting functions to data exemplars; at the most general they all look like some kind of gradient descent on an error surface. Getting the right mix of parameters, nodes, etc. falls to some kind of statistical regularization or bottlenecking for the algorithms. Or maybe you perform a grid search in the hyperparameter space, narrowing down the right mix. Or you can throw up your hands and try to evolve your way to a solution, suspecting that there may be local optima that are distracting the algorithms from global success.

Yet, algorithmic information theory (AIT) gives us, via Solomonoff, a framework for balancing parameterization of an inference algorithm against the error rate on the training set. But, first, it’s all uncomputable and, second, the AIT framework just uses strings of binary as the coded Turing machines, so I would have to flip 2^N bits and test each representation to get anywhere with the theory. Yet, I and many others have had incremental success at using variations on this framework, whether via Minimum Description Length (MDL) principles, it’s first cousin Minimum Message Length (MML), and other statistical regularization approaches that are somewhat proxies for these techniques. But we almost always choose a model (ANNs, compression lexicons, etc.) and then optimize the parameters around that framework. Can we do better? Is there a preferential model for time series versus static data? How about for discrete versus continuous?

So while researching model selection in this framework, I come upon a mention of Shannon’s information theory and its application to quantum decoherence. Of course I had to investigate. And here is the most interesting thing I’ve seen in months from the always interesting Max Tegmark at MIT:

Particles entangle and then quantum decoherence causes them to shed entropy into one another during interaction. But, most interesting, is the quantum Bayes’ theory section around 00:35:00 where Shannon entropy as a classical measure of improbability gets applied to the quantum indeterminacy through this decoherence process.

I’m pretty sure it sheds no particular light on the problem of model selection but when cosmology and machine learning issues converge it gives me mild shivers of joy.

# Machine Learning and the Coming Robot Apocalypse

Slides from a talk I gave today on current advances in machine learning are available in PDF, below. The agenda is pretty straightforward: starting with some theory about overfitting based on algorithmic information theory, we proceed on through a taxonomy of ML types (not exhaustive), then dip into ensemble learning and deep learning approaches. An analysis of the difficulty and types of performance we get from various algorithms and problems is presented. We end with a discussion of whether we should be frightened about the progress we see around us.

Note: click on the gray square if you don’t see the embedded PDF…browsers vary.

# Learning around the Non Sequiturs

If Solomonoff Induction and its conceptual neighbors have not yet found application in enhancing human reasoning, there are definitely areas where they have potential value.  Automatic, unsupervised learning of sequential patterns is an intriguing area of application. It also fits closely with the sequence inferencing problem that is at the heart of algorithmic information theory.

Pragmatically, the problem of how children learn the interchangeability of words that is the basic operation of grammaticality is one area where this kind of system might be useful. Given a sequence of words or symbols, what sort of information is available for figuring out the grammatical groupings? Not much beyond memories of repetitions, often inferred implicitly.

Could we apply some variant of Solomonoff Induction at this point? Recall that we want to find the most compact explanation for the observed symbol stream. Recall also that the form of the explanation is a computer program of some sort that consists of logical functions. It turns out that creating a program that, for every possible sequence, finds the absolutely most compact program is uncomputable. The notion of what is “uncomputable” (or incomputable) is a mathematical result that has to do with how many different potential programs must be investigated to try to find the shortest one. If that number grows faster than the length of a program, it becomes uncomputable. Being uncomputable is not a death sentence, however. We can come up with approximate methods that try to follow the same procedure because any method that incrementally compresses the explanatory program will get closer to the hypothetical best program.

Sequitur by Nevill-Manning and Witten is an example of a procedure that approximates Algorithmic Information Theory optimization for string sequences. In the Sequitur algorithm, when repetitions occur in a sequence, a new rule is created that encompasses the repeated grouping. New rules can be abandoned as well when they no longer cover more than one occurrence due to some exemplars being absorbed by other examples. A slightly more sophisticated version of Sequitur involves creating new rules only when the number of bits required to represent both the rules and the encoded string is lower than without the rule. This approach comes even closer to Solomonoff Induction but requires more calculation of probabilities and the information content of the rules and sequences.

I’ve applied this latter approach in various contexts, including using the method to try to explain human guessing performance on predicting symbol sequences. Neural correlates of this capability likely exist since there are abstractly related capabilities that are used for edge detection in visual processing (an edge is a compact “guess” drawn from a noisy region of a visual field). Finding a grammar inferencing brain region (likely in Wernicke’s area) but that is used for a range of sequencing problems would support the argument that Solomonoff-like capabilities have a biological reality.