# The Goldilocks Complexity Zone

Since my time in the early 90s at Santa Fe Institute, I’ve been fascinated by the informational physics of complex systems. What are the requirements of an abstract system that is capable of complex behavior? How do our intuitions about complex behavior or form match up with mathematical approaches to describing complexity? For instance, we might consider a snowflake complex, but it is also regular in it’s structure, driven by an interaction between crystal growth and the surrounding air. The classic examples of coastlines and fractal self-symmetry also seem complex but are not capable of complex behavior.

So what is a good way of thinking about complexity? There is actually a good range of ideas about how to characterize complexity. Seth Lloyd rounds up many of them, here. The intuition that drives many of them is that complexity seems to be associated with distributions of relationships and objects that are somehow juxtapositioned between a single state and a uniformly random set of states. Complex things, be they living organisms or computers running algorithms, should exist in a Goldilocks zone when each part is examined and those parts are somehow summed up to a single measure.

We can easily construct a complexity measure that captures some of these intuitions. Let’s look at three strings of characters:

x = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

y = menlqphsfyjubaoitwzrvcgxdkbwohqyxplerz

z = the fox met the hare and the fox saw the hare

Now we would likely all agree that y and z are more complex than x, and I suspect most would agree that y looks like gibberish compared with z. Of course, y could be a sequence of weirdly coded measurements or something, or encrypted such that the message appears random. Let’s ignore those possibilities for our initial attempt at defining a complexity measure. We can see right away that an approach using basic information theory doesn’t help much. Algorithmic informational complexity will be highest for y, as will entropy:

$H(s)=-\sum_{i=0}^{i=|s|}P(s_i)log_2P(s_i)$

for each sequence composed out of an alphabet with counts, s. So we get: H(x) = 0, H(y) = 3.199809, and H(z) = 2.3281. Here’s some sample R code using the “entropy” package if you want to calculate yourself:

> z = "the fox met the hare and the fox saw the hare"
> zt = table(strsplit(z, '')[[1]])
> entropy(zt, method="ML")



Note that the alphabet of each string is slightly different, but the missing characters between them don’t matter since their probabilities are 0.

We can just arbitrarily scale entropy by the maximum entropy possible for the same length string like this:

$H_m(s)=-\frac{-\sum_{i=0}^{i=|s|}P(s_i)logP(s_i)}{-\sum_{i=0}^{i=|s|}\frac{1}{|s|}log(\frac{1}{|s|})}$

$H_m(s)=\frac{\sum_{i=0}^{i=|s|}P(s_i)logP(s_i)}{log(\frac{1}{|s|})}$

This is somewhat like channel efficiency in communications theory, I think. And then just turn this into a parabolically-scaled measure that centers at 0.5:

$C(s)=\frac{1}{(1/2-H_m(s))^2+\epsilon}$

where $\epsilon$ is an arbitrary non-zero scaling parameter.

But this calculation is only considering the individual character frequencies, not the composition of the characters into groupings. So we can consider pairs of characters in this same calculation, or triples, etc. And also, just looking at these n-gram sequences doesn’t capture potentially longer range repetitious structures. So we can gradually ladle on grammars as the counting mechanism. Now, if our measure of complexity is really going to capture what we intuitively consider to be complex, all of these different levels of connections within the string or other organized piece of information must be present.

This general program is present in every one of Seth Lloyd’s complexity metrics in various ways and even comes into play in discussions of consciousness, though many use mutual information rather than entropy per se. Here’s Max Tegmark using a variation on Giulio Tinoni’s Phi concept from Integrated Information Theory to demonstrate that integration is a key component of consciousness and how that might be calculated for general physical systems.

# Entanglement and Information

Research can flow into interesting little eddies that cohere into larger circulations that become transformative phase shifts. That happened to me this morning between a morning drive in the Northern California hills and departing for lunch at one of our favorite restaurants in Danville.

The topic I’ve been working on since my retirement is whether there are preferential representations for optimal automated inference methods. We have this grab-bag of machine learning techniques that use differing data structures but that all implement some variation on fitting functions to data exemplars; at the most general they all look like some kind of gradient descent on an error surface. Getting the right mix of parameters, nodes, etc. falls to some kind of statistical regularization or bottlenecking for the algorithms. Or maybe you perform a grid search in the hyperparameter space, narrowing down the right mix. Or you can throw up your hands and try to evolve your way to a solution, suspecting that there may be local optima that are distracting the algorithms from global success.

Yet, algorithmic information theory (AIT) gives us, via Solomonoff, a framework for balancing parameterization of an inference algorithm against the error rate on the training set. But, first, it’s all uncomputable and, second, the AIT framework just uses strings of binary as the coded Turing machines, so I would have to flip 2^N bits and test each representation to get anywhere with the theory. Yet, I and many others have had incremental success at using variations on this framework, whether via Minimum Description Length (MDL) principles, it’s first cousin Minimum Message Length (MML), and other statistical regularization approaches that are somewhat proxies for these techniques. But we almost always choose a model (ANNs, compression lexicons, etc.) and then optimize the parameters around that framework. Can we do better? Is there a preferential model for time series versus static data? How about for discrete versus continuous?

So while researching model selection in this framework, I come upon a mention of Shannon’s information theory and its application to quantum decoherence. Of course I had to investigate. And here is the most interesting thing I’ve seen in months from the always interesting Max Tegmark at MIT:

Particles entangle and then quantum decoherence causes them to shed entropy into one another during interaction. But, most interesting, is the quantum Bayes’ theory section around 00:35:00 where Shannon entropy as a classical measure of improbability gets applied to the quantum indeterminacy through this decoherence process.

I’m pretty sure it sheds no particular light on the problem of model selection but when cosmology and machine learning issues converge it gives me mild shivers of joy.

# Predicting Black Swans

Nasim Taleb’s 2nd Edition of The Black Swan argues—not unpersuasively—that rare, cataclysmic events dominate ordinary statistics. Indeed, he notes that almost all wealth accumulation is based on long-tail distributions where a small number of individuals reap unexpected rewards. The downsides are also equally challenging, where he notes that casinos lose money not in gambling where the statistics are governed by Gaussians (the house always wins), but instead when tigers attack, when workers sue, and when other external factors intervene.

Black Swan Theory adds an interesting challenge to modern inference theories like Algorithmic Information Theory (AIT) that anticipate predictability to the universe. Even variant coding approaches like Minimum Description Length theory modify the anticipatory model based on relatively smooth error functions rather than high “kurtosis” distributions of variable change. And for the most part, for the regular events of life and our sensoriums, that is adequate. It is only where we start to look at rare existential threats that we begin to worry about Black Swans and inference.

How might we modify the typical formulations of AIT and the trade-offs between model complexity and data to accommodate the exceedingly rare? Several approaches are possible. First, if we are combining a predictive model with a resource accumulation criteria, we can simply pad out the model memory by reducing kurtosis risk through additional resource accumulation; any downside is mitigated by the storing of nuts for a rainy day. Good strategy for moderately rare events like weather change, droughts and whatnot. But what about even rarer events like little ice ages and dinosaur extinction-level meteorite hits? An alternative strategy is to maintain sufficient diversity in the face of radical unknowns that coping becomes a species-level achievement.

Any underlying objective function for these radical events has to sacrifice fidelity to temporally local conditions in order to cope with the outliers. Even a simple model based on populations of inductive machines with variant parameterizations wins when the population converges on the best outcomes. There is only medium term losing to focusing exclusively on the rare downside. Yet Taleb claims the rare dominates the smoothly predictable. Using the alternative strategy, that means that there might be an addition to AIT-based decision theory that adds a longer-term, multi-horizon valuation that promotes diversity against the short term gains.

# Methodical Play

My fourteen-year-old interviewed a physicist yesterday. I had the privilege of being home over the weekend and listened in; my travel schedule has lately been brutal, with the only saving grace being moments like right now en route to Chicago when I can collapse into reading and writing for a few whitenoise-washed moments. And the physicist who was once his grandfather said some remarkable things:

• Physics consists of empirical layers of untruth
• The scientific method is never used as formulated
• Schools, while valuable, won’t teach how to be a scientist
• The institutions of physics don’t support the creativity required to be a scientist

Yet there was no sense of anger or disillusionment in these statements, just a framing of the distinctions between the modern social model surrounding what scientists do and the complex reality of how they really do their work.

The positives were that play is both the essential ingredient and the missing determinant of the real “scientific method.” Mess around, try to explain, mess around some more. And what is all that play getting this remarkable octogenarian? Possible insights into the unification of electromagnetism and the strong nuclear force. The interview journey passed from alignment of quarks to the beams of neutron stars, igniting the imaginations of all the minds on the call.

But if there is no real large-scale method to this madness, what might we conclude about the rationality of the process of science? I would advocate that the algorithmic model of inference is perhaps the best (and least biased) way of approaching the issue of scientific method. By constantly reshuffling the available parameters and testing the compressibility of models, play is indistinguishable from science when the play pivots on best explanation. An hypothesis is a short range consequence of play, not a prerequisite.

So play and play some more, and enlighten the world. That’s the lesson of an 81-year-old for a a young, inquisitive mind.

# Curiouser and Curiouser

Jürgen Schmidhuber’s work on algorithmic information theory and curiosity is worth a few takes, if not more, for the researcher has done something that is both flawed and rather brilliant at the same time. The flaws emerge when we start to look deeply into the motivations for ideas like beauty (is symmetry and noncomplex encoding enough to explain sexual attraction? Well-understood evolutionary psychology is probably a better bet), but the core of his argument is worth considering.

If induction is an essential component of learning (and we might suppose it is for argument’s sake), then why continue to examine different parameterizations of possible models for induction? Why be creative about how to explain things, like we expect and even idolize of scientists?

So let us assume that induction is explained by the compression of patterns into better and better models using an information theoretic-style approach. Given this, Schmidhuber makes the startling leap that better compression and better models are best achieved by information harvesting behavior that involves finding novelty in the environment. Thus curiosity. Thus the implementation of action in support of ideas.

I proposed a similar model to explain aesthetic preferences for mid-ordered complex systems of notes, brush-strokes, etc. around 1994, but Schmidhuber’s approach has the benefit of not just characterizing the limitations and properties of aesthetic systems, but also justifying them. We find interest because we are programmed to find novelty, and we are programmed to find novelty because we want to optimize our predictive apparatus. The best optimization is actively seeking along the contours of the perceivable (and quantifiable) universe, and isolating the unknown patterns to improve our current model.

# Multitudes and the Mathematics of the Individual

The notion that there is a path from reciprocal altruism to big brains and advanced cognitive capabilities leads us to ask whether we can create “effective” procedures that shed additional light on the suppositions that are involved, and their consequences. Any skepticism about some virulent kind of scientism then gets whisked away by the imposition of a procedure combined with an earnest interest in careful evaluation of the outcomes. That may not be enough, but it is at least a start.

I turn back to Marcus Hutter, Solomonoff, and Chaitin-Kolmogorov at this point.  I’ll be primarily referencing Hutter’s Universal Algorithmic Intelligence (A Top-Down Approach) in what follows. And what follows is an attempt to break down how three separate factors related to intelligence can be explained through mathematical modeling. The first and the second are covered in Hutter’s paper, but the third may represent a new contribution, though perhaps an obvious one without the detail work that is needed to provide good support.

First, then, we start with a core requirement of any goal-seeking mechanism: the ability to predict patterns in the environment external to the mechanism. This is well-covered since Solomonoff in the 60s who formalized the implicit arguments in Kolmogorov algorithmic information theory (AIT), and that were subsequently expanded on by Greg Chaitin. In essence, given a range of possible models represented by bit sequences of computational states, the shortest sequence that predicts the observed data is also the optimal predictor for any future data also produced by the underlying generator function. The shortest sequence is not computable, but we can keep searching for shorter programs and come up with unique optimizations for specific data landscapes. And that should sound familiar because it recapitulates Occam’s Razor and, in a subset of cases, Epicurus’ Principle of Multiple Explanations. This represents the floor-plan of inductive inference, but it is only the first leg of the stool.

We should expect that evolutionary optimization might work according to this abstract formulation, but reality always intrudes. Instead, evolution is saddled by historical contingency that channels the movements through the search space. Biology ain’t computer science, in short, if for no other reason than it is tied to the physics and chemistry of the underlying operating system. Still the desire is there to identify such provable optimality in living systems because evolution is an optimizing engine, if not exactly an optimal one.

So we come to the second factor: optimality is not induction alone. Optimality is the interaction between the predictive mechanism and the environment. The “mechanism” might very well provide optimal or near optimal predictions of the past through a Solomonoff-style model, but applying those predictions introduces perturbations to the environment itself. Hutter elegantly simplifies this added complexity by abstracting the environment as a computing machine (a logical device; we assume here that the universe behaves deterministically even where it may have chaotic aspects) and running the model program at a slightly slower rate than the environmental program (it lags). Optimality is then a utility measure that combines prediction with resource allocation according to some objective function.

But what about the third factor that I promised and is missing? We get back to Fukuyama and the sociobiologists with this one: social interaction is the third factor. The exchange of information and the manipulation of the environment by groups of agents diffuses decision theory over inductive models of environments into a group of “mechanisms” that can, for example, transmit the location of optimal resource availability among the clan as a whole, increasing the utility of the individual agents with little cost to others. It seems appealing to expand Hutter’s model to include a social model, an agent model, and an environment within the purview of the mathematics. We might also get to the level where the social model overrides the agent model for a greater average utility, or where non-environmental signals from the social model interrupt function of the agent model, representing an irrational abstraction with group-selective payoff.