
The IQ of Machines

Perhaps idiosyncratic to some is my focus in the previous post on the theoretical background to machine learning that derives predominantly from algorithmic information theory and, in particular, Solomonoff’s theory of induction. I do note that there are other theories that can be brought to bear, including Vapnik’s Structural Risk Minimization and Valiant’s PAC-learning theory. Moreover, perceptrons, vector quantization methods, and so forth derive from completely separate principles that can then be cast into more fundamental problems in information geometry and physics.

Artificial General Intelligence (AGI) is then perhaps the hard problem on the horizon, one that I claim has not seen significant progress in the past twenty years or so. That is not to say that I am not an enthusiastic student of the topic and field, just that I don’t see risk levels from intelligent AIs rising to what we should consider a real threat. This topic of how to grade threats deserves deeper treatment, of course, and is at the heart of everything from so-called “nanny state” interventions in food and product safety to how to construct policy around global warming. Luckily–and unlike both those topics–killer AIs don’t threaten us at all quite yet.

But what about simply characterizing what AGIs might look like and how we can even tell when they arise? Mildly interesting is Shane Legg and Joel Veness’ idea of an Artificial Intelligence Quotient or AIQ that they expand on in An Approximation of the Universal Intelligence Measure. This measure is derived from, voilà, exactly the kind of algorithmic information theory (AIT) and compression arguments that I led with in the slide deck. Is this the only theory around for AGI? Pretty much, but different perspectives tend to lead to slightly different focuses. For instance, there is little need to discuss AIT when dealing with Deep Learning Neural Networks. We just instead discuss statistical regularization and bottlenecking, which can be thought of as proxies for model compression.

So how can intelligent machines be characterized by something like AIQ? Well, the conclusion won’t be surprising. Intelligent machines are those machines that function well in terms of achieving goals over a highly varied collection of environments. This allows for tractable mathematical treatments insofar as the complexity of the landscapes can be characterized, but doesn’t really give us a good handle on what the particular machines might look like. They can still be neural networks or support vector machines, or maybe even something simpler, and through some selection and optimization process have the best performance over a complex topology of reward-driven goal states.
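To make that characterization concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: Legg and Veness sample random programs as environments, while I substitute trivial pattern-echo games, with pattern length standing in for Kolmogorov complexity in the 2^-K weighting.

```python
import random

# Hedged sketch of the Legg/Veness AIQ idea: score an agent by its average
# reward over many environments, weighting simpler environments more heavily
# (a 2^-K prior, with pattern length standing in for Kolmogorov complexity).
# The actual paper samples random programs as environments; the pattern-echo
# games below are invented stand-ins.

def make_env(pattern):
    """Environment that pays 1 point per step for echoing its pattern."""
    def step(action, t):
        return 1.0 if action == pattern[t % len(pattern)] else 0.0
    return step

def aiq(agent, patterns, steps=20):
    total, norm = 0.0, 0.0
    for pattern in patterns:
        weight = 2.0 ** -len(pattern)       # shorter pattern ~ simpler environment
        env = make_env(pattern)
        reward = sum(env(agent(t, pattern), t) for t in range(steps))
        total += weight * (reward / steps)  # per-environment reward in [0, 1]
        norm += weight
    return total / norm                     # a perfect agent scores 1.0

# An agent that has learned each pattern versus one that guesses at random.
smart = lambda t, pattern: pattern[t % len(pattern)]
dumb = lambda t, pattern: random.choice("01")

patterns = ["0", "01", "011", "0101"]
print(aiq(smart, patterns))                        # 1.0
print(aiq(dumb, patterns) < aiq(smart, patterns))  # True
```

The point of the weighting is exactly the universal-prior intuition: performance in simple environments counts for more, so memorizing one baroque world buys little AIQ.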

So still no reason to panic, but some interesting ideas that shed greater light on the still mysterious idea of intelligence and the human condition.

Pressing Snobs into Hell

Paul Vitanyi has been a deep advocate for Kolmogorov complexity for many years. His book with Ming Li, An Introduction to Kolmogorov Complexity and Its Applications, remains on my bookshelf (and was a bit of an investment in grad school).

I came across a rather interesting paper by Vitanyi with Rudi Cilibrasi called “Clustering by Compression” that illustrates perhaps more easily and clearly than almost any other recent work the tight connections between meaning, repetition, and informational structure. Rather than describing the paper, however, I wanted to conduct an experiment that demonstrates their results. To do this, I asked the question: are the writings of Dante more similar to other writings of Dante than to Thackeray? And is the same true of Thackeray relative to Dante?

Now, we could pursue these questions at many different levels. We might ask scholars, well-versed in the works of each, to compare and contrast the two authors. They might invoke cultural factors, the memes of their respective eras, and their writing styles. Ultimately, though, the scholars would have to get down to some textual analysis, looking at the words on the page. And in so doing, they would draw distinctions by lifting features of the text, comparing and contrasting grammatical choices, word choices, and other basic elements of the prose and poetry on the page. We might very well be able to take parts of the knowledge of those experts and distill it into some kind of a logical procedure or algorithm that would parse the texts and draw distinctions based on the distributions of words and other structural cues. If asked, we might say that a similar method might work for the so-called language of life, DNA, but that it would require a different kind of background knowledge to build the analysis, much less create an algorithm to perform the same task. And perhaps a similar procedure would apply to music, finding both simple similarities between features like tempos, as well as complex stylistic and content-based alignments.

Yet, what if we could invoke some kind of universal approach to finding exactly what features of each domain are important for comparisons? The universal approach would need to infer the appropriate symbols, grammars, relationships, and even meaning in order to do the task. And this is where compression comes in. In order to efficiently compress a data stream, an algorithm must infer repetitive patterns that are both close together like the repetition of t-h-e in English, as well as further apart, like verb agreement and other grammatical subtleties in the same language. By inferring those patterns, the compressor can then create a dictionary and emit just a reference to the pattern in the dictionary, rather than the pattern itself.

Cilibrasi and Vitanyi, in their paper, conduct a number of studies on an innovative application of compression to finding similarities and, conversely, differences. To repeat their experiment, I used a standard data compressor called bzip2 and grabbed two cantos from Dante’s Divine Comedy at random, and then two chapters from Thackeray’s The Book of Snobs. I then followed their procedure and compressed each individually using bzip2, as well as concatenating each together (pairwise) and compressing the combined document. The idea is that when you concatenate them together, the similarities present between the two documents should manifest as improved compression (smaller sized files) because the pattern dictionary will be more broadly applicable. The length of the files needs to be normalized a bit, however, because the files themselves vary in length, so following Cilibrasi and Vitanyi’s procedure, I subtracted the minimum of the compressed sizes of the two independent files from the compressed size of the combined document, and divided by the maximum of the compressed sizes of the two independent files.
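For anyone who wants to replicate the experiment, the whole procedure reduces to a few lines of Python around the bz2 module. The file names in the usage sketch are placeholders, not my actual files:

```python
import bz2
import os

# Normalized Compression Distance per Cilibrasi and Vitanyi:
#   NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
# where C(.) is the size in bytes after compression (bzip2 here).

def csize(data: bytes) -> int:
    return len(bz2.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Usage sketch; the file names are hypothetical:
# d1 = open("dante_canto1.txt", "rb").read()
# t1 = open("snobs_ch1.txt", "rb").read()
# print(ncd(d1, t1))

# Sanity check: a text is far closer to itself than to random noise.
a = b"the quick brown fox jumps over the lazy dog " * 50
b = os.urandom(len(a))
print(ncd(a, a) < ncd(a, b))  # True
```

Nothing in the function knows anything about Italian, English, poetry, or prose; whatever regularities the compressor can exploit across the concatenation are what drive the distance down.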

The results were perfect:

X1    X1 Size    X2    X2 Size    X1X2 Size    NCD
d1    2828       d2    3030       5408         0.4933748232
d1    2828       t1    3284       5834         0.6201203678
d1    2828       t2    2969       5529         0.5280703324
d2    3030       t1    3284       6001         0.5884148845
d2    3030       t2    2969       5692         0.5000694389
t1    3284       t2    2969       5861         0.4599207369

In this table, d1 and d2 are the Dante cantos, and t1 and t2 are the Thackeray chapters. X1 Size and X2 Size are the compressed sizes of X1 and X2 in bytes, X1X2 Size is the compressed size of the combined original text (combined by concatenation in the order X1 followed by X2), and NCD is the Normalized Compression Distance, calculated per the paper. Note that d1 has the lowest distance (NCD) to d2, and t1 is likewise closest to t2.

Multitudes and the Mathematics of the Individual

The notion that there is a path from reciprocal altruism to big brains and advanced cognitive capabilities leads us to ask whether we can create “effective” procedures that shed additional light on the suppositions that are involved, and their consequences. Any skepticism about some virulent kind of scientism then gets whisked away by the imposition of a procedure combined with an earnest interest in careful evaluation of the outcomes. That may not be enough, but it is at least a start.

I turn back to Marcus Hutter, Solomonoff, and Chaitin-Kolmogorov at this point.  I’ll be primarily referencing Hutter’s Universal Algorithmic Intelligence (A Top-Down Approach) in what follows. And what follows is an attempt to break down how three separate factors related to intelligence can be explained through mathematical modeling. The first and the second are covered in Hutter’s paper, but the third may represent a new contribution, though perhaps an obvious one without the detail work that is needed to provide good support.

First, then, we start with a core requirement of any goal-seeking mechanism: the ability to predict patterns in the environment external to the mechanism. This has been well covered since Solomonoff in the 60s, whose work paralleled the arguments implicit in Kolmogorov’s algorithmic information theory (AIT) and was subsequently expanded on by Greg Chaitin. In essence, given a range of possible models represented by bit sequences of computational states, the shortest sequence that predicts the observed data is also the optimal predictor for any future data also produced by the underlying generator function. The shortest sequence is not computable, but we can keep searching for shorter programs and come up with unique optimizations for specific data landscapes. And that should sound familiar because it recapitulates Occam’s Razor and, in a subset of cases, Epicurus’ Principle of Multiple Explanations. This represents the floor plan of inductive inference, but it is only the first leg of the stool.
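The true shortest program is uncomputable, but the flavor of the argument can be captured with a toy machine, sketched below. Everything in it is illustrative: the machine’s “programs” are bit patterns repeated forever, program length stands in for Kolmogorov complexity, and searching shortest-first is Occam’s Razor in action.

```python
from itertools import product

# Toy machine in the Solomonoff/MDL spirit (purely illustrative): its
# "programs" are bit patterns repeated forever, and program length stands in
# for Kolmogorov complexity.

def run(pattern, n):
    """Output of the toy machine: the pattern repeated out to length n."""
    return "".join(pattern[i % len(pattern)] for i in range(n))

def shortest_program(data, max_len=8):
    for length in range(1, max_len + 1):      # enumerate shortest first: Occam
        for bits in product("01", repeat=length):
            p = "".join(bits)
            if run(p, len(data)) == data:
                return p
    return None

def predict_next(data):
    """In the toy, the shortest consistent program is the predictor."""
    p = shortest_program(data)
    return p[len(data) % len(p)]

print(shortest_program("010101010"))  # 01
print(predict_next("010101010"))      # 1
```

Note that the program "01010101010101010" would also reproduce the data, but the two-bit program compresses it, and it is the compressed model we trust to continue the sequence.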

We should expect that evolutionary optimization might work according to this abstract formulation, but reality always intrudes. Instead, evolution is saddled by historical contingency that channels its movements through the search space. Biology ain’t computer science, in short, if for no other reason than that it is tied to the physics and chemistry of the underlying operating system. Still, the desire is there to identify such provable optimality in living systems, because evolution is an optimizing engine, if not exactly an optimal one.

So we come to the second factor: optimality is not induction alone. Optimality is the interaction between the predictive mechanism and the environment. The “mechanism” might very well provide optimal or near optimal predictions of the past through a Solomonoff-style model, but applying those predictions introduces perturbations to the environment itself. Hutter elegantly simplifies this added complexity by abstracting the environment as a computing machine (a logical device; we assume here that the universe behaves deterministically even where it may have chaotic aspects) and running the model program at a slightly slower rate than the environmental program (it lags). Optimality is then a utility measure that combines prediction with resource allocation according to some objective function.

But what about the third factor that I promised, the one that is missing? We get back to Fukuyama and the sociobiologists with this one: social interaction is the third factor. The exchange of information and the manipulation of the environment by groups of agents diffuses decision theory over inductive models of environments into a group of “mechanisms” that can, for example, transmit the location of optimal resource availability among the clan as a whole, increasing the utility of the individual agents with little cost to others. It seems appealing to expand Hutter’s model to include a social model, an agent model, and an environment within the purview of the mathematics. We might also get to the level where the social model overrides the agent model for a greater average utility, or where non-environmental signals from the social model interrupt the function of the agent model, representing an irrational abstraction with group-selective payoff.
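A crude numerical sketch of that third factor, with every parameter invented for illustration: one agent knows where the resource is, and broadcasting that knowledge raises the group’s average utility at essentially no cost to the sender.

```python
import random

# Toy model of the information-sharing claim (all numbers invented):
# agent 0 knows which of n_sites holds the resource; when the group shares
# that knowledge, everyone pays the minimal search cost, raising average
# utility while costing the knower nothing.

random.seed(0)

def forage(n_agents=10, n_sites=20, share=False, trials=1000):
    """Average per-agent utility: 1 for the resource, minus a small cost
    per site inspected along the way."""
    total = 0.0
    for _ in range(trials):
        for agent in range(n_agents):
            if share or agent == 0:       # knows the location outright
                cost = 1
            else:                         # blind search over the sites
                cost = random.randrange(n_sites) + 1
            total += 1.0 - 0.02 * cost
    return total / (trials * n_agents)

print(forage(share=False))  # solitary search
print(forage(share=True))   # shared knowledge: strictly higher on average
```

The gap between the two numbers is the group-level payoff that a social model layered over Hutter’s agent/environment pair would have to account for.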

Solomonoff Induction, Truth, and Theism

LukeProg of CommonSenseAtheism fame created a bit of a row when he declared that Solomonoff Induction largely rules out theism, continuing on to expand on the theme:

If I want to pull somebody away from magical thinking, I don’t need to mention atheism. Instead, I teach them Kolmogorov complexity and Bayesian updating. I show them the many ways our minds trick us. I show them the detailed neuroscience of human decision-making. I show them that we can see (in the brain) a behavior being selected up to 10 seconds before a person is consciously aware of ‘making’ that decision. I explain timelessness.

There were several reasons for the CSA community to get riled up about these statements and they took on several different forms:

  • The focus on Solomonoff Induction/Kolmogorov Complexity is obscurantist in using radical technical terminology.
  • The author is ignoring deductive arguments that support theist claims.
  • The author has joined a cult.
  • Inductive claims based on Solomonoff/Kolmogorov are no different from Reasoning to the Best Explanation.

I think all of these critiques are partially valid (though I don’t think there are any good reasons for thinking theism is true), but the fourth one, which I contributed, was a personal realization for me. Though I have been fascinated with the topics related to Kolmogorov since the early 90s, I don’t think they are directly applicable to the topic of theism/atheism. Whether we are discussing the historical validity of Biblical claims or the logical consistency of extensions to notions of omnipotence or omniscience, I can’t think of a way that these highly mathematical concepts have direct application.

But what are we talking about? Solomonoff Induction, Kolmogorov Complexity, Minimum Description Length, Algorithmic Information Theory, and related ideas are formalizations of the idea of William of Occam (variously Ockham) known as Occam’s Razor: given multiple explanations of a given phenomenon, one should prefer the simpler explanation. This notion that the most parsimonious explanation was preferable to other explanations existed as a heuristic until the 20th Century, when statistics began to be merged with computational theory through information theory. I’m not aware of any scientist describing a trade-off between contending theories that was resolved by an appeal to Occam’s Razor. Yet the intuition that the principle was important remained, until being formalized by people like Kolmogorov as part of the mathematical and scientific zeitgeist.

The concepts are admittedly deep in their mathematical formulations, but at heart is the notion that all logical procedures can be reduced to a computational model. And a computational model running on a standardized computer called a Turing Machine can be expressed as a string of numbers that can be reduced to binary numbers. One can imagine that there are many programs that can produce the same output given the same input. In fact, we can just add an arbitrary number of random no-ops (no operations) to the bit stream and still get the same output from the computer. Moreover, we can guess that the structure of the program for a given string is essentially a model of the output string that compresses the underlying data into the form of the program. So, among all of the programs for a string, the shortest program is, as Occam’s Razor predicts, the most parsimonious way of generating the string.

What comes next is rather impressive: the shortest program among all of the possible programs is also the most likely to continue to produce the “right” output if the string is continued.  In other words, as a friend coined (and as appears in my book Teleology), “compression is truth” in that the most compressed and compact program is also the best predictor of the future based on the existing evidence. The formalization of these concepts across statistics, computational theory, and recently into philosophy, represents a crowning achievement of information theory in the 20th Century.
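“Compression is truth” can even be demonstrated crudely with an off-the-shelf compressor: among candidate continuations of a string, prefer the one that makes the whole compress smallest. A sketch, again with bzip2:

```python
import bz2

# Crude demo of "compression is truth": among candidate continuations of the
# evidence, prefer the one the compressor finds most redundant with it, i.e.
# the one that makes the whole string compress smallest.

def best_continuation(evidence: bytes, candidates):
    return min(candidates, key=lambda c: len(bz2.compress(evidence + c)))

evidence = b"abc" * 100   # the observed data: a simple repetition
good = b"abc" * 30        # continues the established pattern
bad = b"xyz" * 30         # same length, but breaks the pattern
print(best_continuation(evidence, [good, bad]) == good)  # True
```

The compressor prefers the continuation whose regularities it has already modeled, which is exactly the prediction-by-compression intuition in miniature.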

I use these ideas regularly in machine learning, and related ideas inform concepts like Support Vector Machines, yet I don’t see a direct connection to human argumentation about complex ideas. Moreover, and I am hesitant to admit this, I am not convinced that human neural anatomy implements much more than vague approximations of these notions (and primarily in relatively low-level perceptual processes).

So does Solomonoff Induction rule out theism? Only indirectly in that it may help us feel confident about a solid theoretical basis for other conceptual processes that more directly interact with the evidence for and against.

I plan on elaborating on algorithmic information theory and its implications in future posts.