
A Paradigm of Guessing

The most interesting thing I’ve read this week comes from Jürgen Schmidhuber’s paper, Algorithmic Theories of Everything, whose title should be provocative enough to pique the most jaded of interests. The quote below comes from deep in the paper:

The first number is 2, the second is 4, the third is 6, the fourth is 8. What is the fifth? The correct answer is “250,” because the nth number is n^5 − 5n^4 − 15n^3 + 125n^2 − 224n + 120. In certain IQ tests, however, the answer “250” will not yield maximal score, because it does not seem to be the “simplest” answer consistent with the data (compare [73]). And physicists and others favor “simple” explanations of observations.
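As a quick sanity check (my own addition, not from the paper), the polynomial in the quote really does fit the sequence and then swerve:

```python
# Verify Schmidhuber's polynomial: it reproduces 2, 4, 6, 8 and then yields 250.
def f(n):
    return n**5 - 5*n**4 - 15*n**3 + 125*n**2 - 224*n + 120

print([f(n) for n in range(1, 6)])  # [2, 4, 6, 8, 250]
```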

And this is the beginning and the end of logical positivism. How can we assign truth to inductive judgments without crossing from fact to value, and what should that value system be?

Universal Artificial Social Intelligence

Continuing to develop the idea that social reasoning adds to Hutter’s Universal Artificial Intelligence model, let’s start with his basic layout for agents and environments:

A few definitions: the agent (p) is a Turing machine that consists of a working tape and an algorithm that can move the tape left or right, read a symbol from the tape, write a symbol to the tape, and transition through a finite number of internal states held in a table. That is all that is needed to be a Turing machine, and Turing machines can compute anything our everyday notion of a computer can compute. Formally, there are bounds to what they can compute: for instance, no Turing machine can decide, for every program written as symbols on its tape, whether that program will stop at some point or will run forever (the so-called “halting problem”). But it suffices to think of the Turing machine as a general-purpose logical machine, in that all of its outputs are determined by a sequence of state changes that follow from the sequence of inputs and the transformations expressed in the state table. There is no magic here.
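To make those components concrete, here is a minimal Turing machine sketch in Python; the state table is a toy program of my own devising, not anything from Hutter:

```python
# A minimal Turing machine: a finite state table, a tape, and a read/write head.
# The toy table below fills the tape with 1s and halts at the first 1 it finds;
# it exists only to make the pieces of the definition concrete.
def run_tm(table, tape, state="start", head=0, max_steps=1000):
    cells = dict(enumerate(tape))                # sparse tape; blanks default to 0
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = cells.get(head, 0)
        write, move, state = table[(state, symbol)]
        cells[head] = write
        head += 1 if move == "R" else -1
    return [cells[i] for i in sorted(cells)]

# (state, symbol read) -> (symbol to write, head move, next state)
table = {
    ("start", 0): (1, "R", "start"),
    ("start", 1): (1, "R", "halt"),
}
print(run_tm(table, [0, 0, 1]))  # [1, 1, 1]
```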

Hutter then couples the agent to a representation of the environment, also expressed by a Turing machine (after all, the environment is likely deterministic). The output symbols of the agent (y) are consumed by the environment, which, in turn, outputs the results of the agent’s interaction with it as a series of rewards (r) and environment signals (x) that are consumed by the agent once again.
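A hedged sketch of that interaction cycle, with toy stand-ins for both machines (the y/x/r naming follows the description above; the specific classes are invented for illustration, not drawn from Hutter):

```python
# Toy rendering of the agent/environment coupling: the agent emits an action
# symbol y, the environment consumes it and returns a perception made of an
# observation x and a reward r. Both "machines" here are simple stand-ins.
class AlternatingEnvironment:
    """Deterministic toy environment: emits 0,1,0,1,... and rewards the agent
    for echoing the observation it most recently produced."""
    def __init__(self):
        self.x = 0
    def step(self, y):
        r = 1 if y == self.x else 0   # reward for matching the last observation
        self.x = 1 - self.x           # next observation in the alternating signal
        return self.x, r

class CopyAgent:
    """Toy agent: its whole 'model' is to replay the last observation."""
    def act(self, history):
        return history[-1][1] if history else 0

def run(agent, env, steps=10):
    history = []                      # the accumulated (y, x, r) record
    for _ in range(steps):
        y = agent.act(history)
        x, r = env.step(y)
        history.append((y, x, r))
    return sum(r for _, _, r in history)

print(run(CopyAgent(), AlternatingEnvironment()))  # total reward: 10
```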

Where this gets interesting is that the agent is trying to maximize the reward signal, which implies that the combined predictive model must convert all of the history accumulated up to a given point in time into an optimal predictor. This is accomplished by minimizing behavioral error, and behavioral error is best minimized by choosing the shortest program that also predicts the history. By doing so, you simultaneously reduce the brittleness of the model to future changes in the environment.
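For readers who want the formal core, the standard statement (this is textbook Solomonoff, nothing novel) is the universal prior, in which the shortest consistent program dominates:

```latex
% Solomonoff's universal prior, for a universal prefix machine U: the
% probability of a sequence x sums 2^{-length(p)} over every program p whose
% output begins with x, so the shortest consistent program dominates the sum.
M(x) = \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)}
```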

So far, so good. But this is just one agent coupled to the environment. If we have two agents competing against one another, we can treat each as the environment for the other and the mathematics is largely unchanged (see Hutter, pp. 36-37 for the treatment of strategic games via Nash equilibria and minimax). However, for non-competitive multi-agent simulations operating against the same environment there is a unique opportunity if the agents are sampling different parts of the environmental signal. So, let’s change the model to look as follows:

Now, each agent is sampling different parts of the output symbols generated by the environment (as well as the utility score, r). We assume that there is a rate difference between the agents’ input symbols and the environmental output symbols, but this is not particularly hard to model: as part of the input process, each agent’s state table simply passes over N symbols, where N is, for instance, the index of the agent (see the sketch below). The resulting agents will still be Hutter-optimal with regard to the symbol sequences that they do process, and will generate outputs over time that maximize the additive utility of the reward signal, but they are no longer each maximizing over the complete signal. Indeed, the relative quality of the individual agents is directly proportional to how much of the input symbol stream they can consume.
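A minimal sketch of that strided sampling; the `agent_view` helper is my own toy rendering, not Hutter’s formalism:

```python
# Agent i reads only every num_agents-th symbol of the environment's output
# stream, offset by its own index, so no agent sees the complete signal.
def agent_view(stream, agent_index, num_agents):
    return stream[agent_index::num_agents]

env_output = "abcdefghij"
for i in range(2):
    print(i, agent_view(env_output, i, 2))
# 0 acegi
# 1 bdfhj
```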

This limitation has an obvious fix: share the working tape between the individual agents:

Then, each agent can record not only its own states on the tape but can also consume the states of the other agents. The tape becomes an extended memory or a shared theory about the environment. Formally, I don’t believe there is any difference between this and the single-agent model because, as with multi-head Turing machines, the sequence of moves and actions can be collapsed to a single table and a single tape insofar as the entire environmental signal is available to all of the agents (or their concatenated form). Instead, the value lies in considering what a multi-agent system implies about shared meaning and the value of coordination: for any real environment, perfect coupling between a single agent and that environment is an unrealistic simplification. Shared communication and shared modeling expand the individual agent’s model of the universe, translating into greater individual reward and, as a side effect, group reward as well.
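One way to picture the shared tape is as a blackboard that agents publish to and read from; the `Blackboard` class below is a hypothetical illustration of mine, not a formal construction:

```python
# Hypothetical sketch of the shared-tape idea: each agent publishes its state
# to a common blackboard and can read what every other agent has published.
class Blackboard:
    def __init__(self):
        self.tape = {}                  # agent id -> last published state

    def publish(self, agent_id, state):
        self.tape[agent_id] = state

    def read(self, agent_id):
        # everything published by the *other* agents
        return {k: v for k, v in self.tape.items() if k != agent_id}

board = Blackboard()
board.publish("agent-0", {"resource_at": 17})
board.publish("agent-1", {"resource_at": 42})
print(board.read("agent-0"))  # {'agent-1': {'resource_at': 42}}
```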

Multitudes and the Mathematics of the Individual

The notion that there is a path from reciprocal altruism to big brains and advanced cognitive capabilities leads us to ask whether we can create “effective” procedures that shed additional light on the suppositions that are involved, and their consequences. Any skepticism about some virulent kind of scientism then gets whisked away by the imposition of a procedure combined with an earnest interest in careful evaluation of the outcomes. That may not be enough, but it is at least a start.

I turn back to Marcus Hutter, Solomonoff, and Chaitin-Kolmogorov at this point. I’ll be primarily referencing Hutter’s Universal Algorithmic Intelligence (A Top-Down Approach) in what follows: an attempt to break down how three separate factors related to intelligence can be explained through mathematical modeling. The first and second are covered in Hutter’s paper; the third may represent a new contribution, though perhaps an obvious one that still lacks the detail work needed to support it well.

First, then, we start with a core requirement of any goal-seeking mechanism: the ability to predict patterns in the environment external to the mechanism. This has been well covered since Solomonoff, who in the 60s formalized the arguments implicit in Kolmogorov’s algorithmic information theory (AIT), subsequently expanded on by Greg Chaitin. In essence, given a range of possible models represented by bit sequences of computational states, the shortest sequence that predicts the observed data is also the optimal predictor for any future data produced by the same underlying generator function. The shortest sequence is not computable, but we can keep searching for shorter programs and come up with optimizations tailored to specific data landscapes. That should sound familiar, because it recapitulates Occam’s Razor and, in a subset of cases, Epicurus’ Principle of Multiple Explanations. This is the floor plan of inductive inference, but it is only the first leg of the stool.
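The standard formalization, for reference (nothing here beyond the textbook definition):

```latex
% Kolmogorov complexity of a string x relative to a universal machine U:
% the length of the shortest program that outputs x. K is uncomputable, but
% it can be approached from above by finding ever-shorter programs for x.
K(x) = \min \{\, \ell(p) : U(p) = x \,\}
```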

We should expect that evolutionary optimization might work according to this abstract formulation, but reality always intrudes. Instead, evolution is saddled with historical contingency that channels its movements through the search space. Biology ain’t computer science, in short, if for no other reason than that it is tied to the physics and chemistry of the underlying operating system. Still, the desire is there to identify such provable optimality in living systems, because evolution is an optimizing engine, if not exactly an optimal one.

So we come to the second factor: optimality is not induction alone. Optimality is the interaction between the predictive mechanism and the environment. The “mechanism” might very well provide optimal or near-optimal predictions of the past through a Solomonoff-style model, but applying those predictions perturbs the environment itself. Hutter elegantly simplifies this added complexity by abstracting the environment as a computing machine (a logical device; we assume here that the universe behaves deterministically even where it may have chaotic aspects) and running the model program at a slightly slower rate than the environmental program (it lags). Optimality is then a utility measure that combines prediction with resource allocation according to some objective function.
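Schematically, and suppressing Hutter’s full indexing, the optimality criterion couples prediction to reward like this (a hedged paraphrase of the AIXI expectimax, not a verbatim transcription):

```latex
% The agent chooses the current action y_k to maximize expected total reward
% r_k + ... + r_m over futures weighted by the universal mixture M over
% computable environments (indices simplified relative to Hutter's notation).
y_k = \arg\max_{y_k} \sum_{x_k r_k} \cdots \max_{y_m} \sum_{x_m r_m}
      (r_k + \cdots + r_m)\, M(x_k r_k \ldots x_m r_m \mid y_k \ldots y_m)
```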

But what about the third factor that I promised, the one that is missing? We get back to Fukuyama and the sociobiologists with this one: social interaction is the third factor. The exchange of information and the manipulation of the environment by groups of agents diffuses decision theory over inductive models of environments into a group of “mechanisms” that can, for example, transmit the location of optimal resource availability among the clan as a whole, increasing the utility of the individual agents at little cost to the others. It seems appealing to expand Hutter’s model to include a social model, an agent model, and an environment within the purview of the mathematics. We might also get to the level where the social model overrides the agent model for a greater average utility, or where non-environmental signals from the social model interrupt the function of the agent model, representing an irrational abstraction with a group-selective payoff.

Solomonoff Induction, Truth, and Theism

LukeProg of CommonSenseAtheism fame stirred up a bit of a row when he declared that Solomonoff Induction largely rules out theism, going on to expand on the theme:

If I want to pull somebody away from magical thinking, I don’t need to mention atheism. Instead, I teach them Kolmogorov complexity and Bayesian updating. I show them the many ways our minds trick us. I show them the detailed neuroscience of human decision-making. I show them that we can see (in the brain) a behavior being selected up to 10 seconds before a person is consciously aware of ‘making’ that decision. I explain timelessness.

The CSA community got riled up about these statements for several reasons, and the critiques took several different forms:

  • The focus on Solomonoff Induction/Kolmogorov Complexity is obscurantist in using radical technical terminology.
  • The author is ignoring deductive arguments that support theist claims.
  • The author has joined a cult.
  • Inductive claims based on Solomonoff/Kolmogorov are no different from Reasoning to the Best Explanation.

I think all of these critiques are partially valid (though I don’t think there are any good reasons for thinking theism is true), but the fourth one, which I contributed, was a personal realization for me. Though I have been fascinated with the topics related to Kolmogorov since the early 90s, I don’t think they are directly applicable to the topic of theism/atheism. Whether we are discussing the historical validity of Biblical claims or the logical consistency of extensions to notions of omnipotence or omniscience, I can’t think of a way that these highly mathematical concepts apply directly.

But what are we talking about? Solomonoff Induction, Kolmogorov Complexity, Minimum Description Length, Algorithmic Information Theory, and related ideas are formalizations of the idea of William of Occam (variously Ockham), known as Occam’s Razor: given multiple explanations of a given phenomenon, one should prefer the simpler explanation. This notion that the most parsimonious explanation is preferable existed as a heuristic until the 20th Century, when statistics began to merge with computational theory through information theory. I’m not aware of any scientist describing a trade-off between contending theories that was resolved by an appeal to Occam’s Razor, yet the intuition that the principle was important persisted in the mathematical and scientific zeitgeist until it was formalized by people like Kolmogorov.

The concepts are admittedly deep in their mathematical formulations, but at heart is the notion that all logical procedures can be reduced to a computational model, and a computational model running on a standardized computer called a Turing machine can be expressed as a string of numbers, reducible in turn to binary. One can imagine many programs that produce the same output given the same input: we can, for example, pad the program with an unlimited number of no-ops (no operations) and still get the same output from the computer. Moreover, the structure of a program for a given string is essentially a model of the output string, one that compresses the underlying data into the form of the program. So, among all of the programs for a string, the shortest program is, as Occam’s Razor predicts, the most parsimonious way of generating the string.
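The no-op point is easy to demonstrate directly; the snippet below (my illustration) pads a trivial Python program with do-nothing statements and gets the same output from a much longer “program”:

```python
# Padding a program with no-ops changes its length but not its output,
# so the same string has generating programs of many different lengths.
short_program = "print('ab' * 10)"
padded_program = "pass\n" * 100 + short_program

exec(short_program)                             # abababab...
exec(padded_program)                            # identical output
print(len(short_program), len(padded_program))  # 16 vs. 516
```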

What comes next is rather impressive: the shortest program among all of the possible programs is also the most likely to continue to produce the “right” output if the string is continued. In other words, as a friend coined it (and as appears in my book Teleology), “compression is truth”: the most compressed and compact program is also the best predictor of the future based on the existing evidence. The formalization of these concepts across statistics, computational theory, and, more recently, philosophy represents a crowning achievement of information theory in the 20th Century.
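A rough, computable stand-in for these uncomputable quantities is ordinary compression; the sketch below (my illustration, using Python’s zlib) shows that a patterned string compresses far better than a random one, reflecting that it has a short generating “program” behind it:

```python
import random
import zlib

# Compressed length as a crude proxy for Kolmogorov complexity: the patterned
# string has a short "model" behind it, the random bytes do not.
patterned = ("2468" * 250).encode()                        # 1000 highly regular bytes
random.seed(0)
noise = bytes(random.randrange(256) for _ in range(1000))  # 1000 incompressible bytes

print(len(zlib.compress(patterned)))  # small: the pattern admits a short model
print(len(zlib.compress(noise)))      # near 1000 (or a bit above): no short model
```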

I use these ideas regularly in machine learning, and related ideas inform concepts like Support Vector Machines, yet I don’t see a direct connection to human argumentation about complex ideas. Moreover, and I am hesitant to admit this, I am not convinced that human neural anatomy implements much more than vague approximations of these notions (and primarily in relatively low-level perceptual processes).

So does Solomonoff Induction rule out theism? Only indirectly, in that it may help us feel confident about a solid theoretical basis for other conceptual processes that interact more directly with the evidence for and against.

I plan on elaborating on algorithmic information theory and its implications in future posts.