Category: Cognitive Science

Singularity and its Discontents

If a machine-based process can outperform a human being, is it significant? That weighty question hung in the background as I reviewed Jürgen Schmidhuber’s work on traffic sign classification. Similar results have emerged from IBM’s Watson competition and even on the TOEFL test. In each case, machines beat people.

But is that fact significant? There are a couple of ways we can look at these kinds of comparisons. First, we can draw analogies to other capabilities that were once out of reach of mechanical aid and note that machines outperforming humans there was not overly profound. The wheel quickly outperformed human legs for moving heavy objects. The cup outperformed cupped hands for drinking water. This invites the realization that extending these physical comparisons leads to extraordinary juxtapositions: the airliner vastly outperformed human legs for transport, and so on. And this, in turn, justifies the claim that since we are now just beginning to outperform human mental processes, we can expect only exponential improvements moving forward.

But this may be a category mistake in more than the obvious differentiator of the mental and the physical. Instead, the category mismatch is between levels of complexity. The number of parts in a Boeing 747 is around 6 million versus one moving human as the baseline (we could enumerate the cells and organelles, etc., but then we would need to enumerate the crystal lattices of the aircraft steel, so that level of granularity is a wash). The number of memory addresses in a big server computer is 64 x 10^9 or higher, with disk storage in the TBs (10^12). Meanwhile, the human brain has 100 x 10^9 neurons and 10^14 connections. So, with just 2 orders of magnitude between computers and brains versus 6 between humans and planes, we find ourselves approaching Kurzweil’s argument that we have to wait until 2040. I’m more pessimistic and figure 2080, but then no one expects the Spanish Inquisition, either, to quote the esteemed philosophers, Monty Python.
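As a sanity check on those gaps, here is a minimal sketch using the rough figures quoted above (they are order-of-magnitude estimates, nothing more):

```python
import math

# Rough figures from the paragraph above (order-of-magnitude estimates only)
boeing_747_parts = 6e6       # parts in a 747
human_parts = 1              # one moving human as the baseline
server_storage_units = 1e12  # disk storage in the terabyte range
brain_connections = 1e14     # synaptic connections in a human brain

def om_gap(a, b):
    """Gap between two quantities in orders of magnitude."""
    return abs(math.log10(a) - math.log10(b))

print(f"human vs. 747:      {om_gap(human_parts, boeing_747_parts):.1f} orders of magnitude")
print(f"computer vs. brain: {om_gap(server_storage_units, brain_connections):.1f} orders of magnitude")
# Prints about 6.8 and 2.0, in line with the rough comparison in the text.
```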

We might push that date back even further, though, because we still lack a theory of the large-scale control of the collected software modules needed to operate on that massive neural simulation. At least Schmidhuber’s work used an artificial neural network. The others were looser about any affiliation to actual human information processing, though the LSI work is mathematically similar to some kinds of ANNs in terms of outcomes.

So if analogies only serve to support a mild kind of techno-optimism, we can still think about the problem in other ways, by inverting the comparisons or emphasizing the risk of superintelligent machines. Thus is born the existential risk school of technological singularities. But such concerns and planning don’t really address the question of whether superintelligent machines are actually possible, or whether current achievements are significant.

And that brings us to the third perspective: the focus on competitive outcomes in AI research leads to only mild advances in the state of the art, but it does lead to important social outcomes. These are Apollo moon shots, in other words. Even absent significant scientific advances, they stir the mind and the soul. That may transform the mild techno-optimism into moderate techno-optimism. And that’s OK, because the alternative is stationary fear.

Curiouser and Curiouser

Jürgen Schmidhuber’s work on algorithmic information theory and curiosity is worth a few takes, if not more, for the researcher has done something that is both flawed and rather brilliant at the same time. The flaws emerge when we start to look deeply into the motivations for ideas like beauty (are symmetry and noncomplex encoding enough to explain sexual attraction? Well-understood evolutionary psychology is probably a better bet), but the core of his argument is worth considering.

If induction is an essential component of learning (and we might suppose it is, for argument’s sake), then why continue to examine different parameterizations of possible models for induction? Why be creative about how to explain things, as we expect, and even idolize, of scientists?

So let us assume that induction is explained by the compression of patterns into better and better models using an information theoretic-style approach. Given this, Schmidhuber makes the startling leap that better compression and better models are best achieved by information harvesting behavior that involves finding novelty in the environment. Thus curiosity. Thus the implementation of action in support of ideas.

I proposed a similar model to explain aesthetic preferences for mid-ordered complex systems of notes, brush-strokes, etc. around 1994, but Schmidhuber’s approach has the benefit of not just characterizing the limitations and properties of aesthetic systems, but also justifying them. We find interest because we are programmed to find novelty, and we are programmed to find novelty because we want to optimize our predictive apparatus. The best optimization is actively seeking along the contours of the perceivable (and quantifiable) universe, and isolating the unknown patterns to improve our current model.
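A crude, minimal sketch of the compression-progress idea, using zlib as a stand-in for the agent’s adaptive model (Schmidhuber’s formal treatment uses an improving compressor; taking the per-byte compressed size of the accumulated history as the model’s quality is my simplification):

```python
import zlib

def bits_per_byte(data: bytes) -> float:
    """Compressed size per raw byte: a crude proxy for how well the
    'model' (zlib's adaptive dictionary) currently explains the history."""
    return 8 * len(zlib.compress(data, 9)) / len(data)

# Curiosity reward as compression progress: the drop in per-byte code
# length when a new observation is folded into the accumulated history.
history = b""
observations = [b"abab" * 20,                                   # novel, learnable pattern
                b"abab" * 20,                                   # same pattern: old news
                bytes((i * 37 + 11) % 256 for i in range(80))]  # incompressible noise
for step, obs in enumerate(observations):
    before = bits_per_byte(history) if history else 8.0
    history += obs
    reward = before - bits_per_byte(history)
    print(f"step {step}: reward = {reward:+.2f}")
# Reward is large for a learnable novelty, decays as the pattern becomes
# familiar, and goes negative for noise: boredom at both extremes, interest
# in the middle, which is the contour Schmidhuber's account predicts.
```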

Industrial Revolution #4

Paul Krugman at the New York Times consumes Robert Gordon’s analysis of economic growth and the role of technology and comes up more hopeful than Gordon. The kernel of Krugman’s hope is that Big Data analytics can provide a shortcut to intelligent machines by bypassing the explicit specification and programming once assumed to be a requirement for artificial intelligence. Instead, we don’t specify but use “data-intensive ways” to achieve a better result. And we might get to IR#4, following Gordon’s taxonomy, where IR stands for “industrial revolution”: IR#1 was steam and locomotives; IR#2 was everything up to computers; IR#3 is computers and cell phones and whatnot.

Krugman implies that IR#4 might spur the typical economic consequences of grand technological change, including the massive displacement of workers, but, as in previous revolutions, it is also assumed that economic growth built from new industries will ultimately eclipse the negatives. This is not new, of course. Robert Anton Wilson argued decades ago for the R.I.C.H. economy (Rising Income through Cybernetic Homeostasis). Wilson may have been on acid, but Krugman wasn’t yet tuned in, man. (A brief aside: the Krugman/Wilson notions probably break down over extraction and agribusiness/land rights issues. If labor is completely replaced by intelligent machines, the land and the ingredients it contains nevertheless remain a bottleneck for economic growth. Look at the global demand for copper and rare earth materials, for instance.)

But why the particular focus on Big Data technologies? Krugman’s hope teeters on the assumption that data-intensive algorithms possess a fundamentally different scale and capacity than human-engineered approaches. Having risen through the computational linguistics and AI community working on data-driven methods for approaching intelligence, I can certainly sympathize with the motivation, but there are really only modest results to report at this time.

For instance, statistical machine translation is still of fairly poor quality, and is arguably no better than the rules-based methods of the 70s and 80s in anything other than scale and the diversity of languages covered. Recent achievements like the DARPA Grand Challenge for self-driving vehicles were achieved not through data-intensive methods but through careful examination of the limits of the baseline system. In that case, the baseline was a system that used a scanning laser rangefinder to avoid obstacles while following a map, and the improvement marginally outran the distance limitations of the rangefinder by using optical image recognition to support a modest speedup. Speech recognition is better due to the accumulation of many examples of labeled, transcribed speech, true. And we can certainly guess that the relevance of advertising placed on a web page is better than it once was, if only because it is an easy problem to attack without the necessity of deep considerations of human understanding (unless you take our buying behavior to be a deep indicator of our beings). We can also see some glimmers of data-intensive methods in the IBM Watson system, though the Watson team will be the first to tell you that they dealt with only medium-scale data (Wikipedia) in the design of their system.

Still, there is a clear economic-growth argument for replacing workers in everything from manual drudgery straight through to fairly intelligent drudgery, which gives an economist like Krugman reason for hope. Now, if the limitations of energy and resource requirements can just be overcome, we can all retire to RICH, creative lives.

Sparse Grokking

Jeff Hawkins of Palm fame shows up in the New York Times hawking his Grok for Big Data predictions. Interestingly, if one drills down into the details of Grok, we see once again that randomized sparse representations are the core of the system. That is, if we assign symbols random representational vectors that are sparse, we can sum the vectors for co-occurring symbols and, following J.R. Firth’s pithy “you shall know a word by the company it keeps,” start to develop a theory of meaning that would not offend Wittgenstein.
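A minimal sketch of that mechanism (my toy illustration of random indexing, not Numenta’s actual Grok internals): every word gets a fixed sparse random index vector, the index vectors of its neighbors are summed into a context vector, and cosine similarity between context vectors stands in for similarity of meaning.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, NONZERO = 1024, 8  # high-dimensional, very sparse

def index_vector():
    """A fixed random sparse vector: a few +/-1 entries, the rest zeros."""
    v = np.zeros(DIM)
    idx = rng.choice(DIM, NONZERO, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], NONZERO)
    return v

def train(tokens, window=2):
    index = {t: index_vector() for t in set(tokens)}
    context = {t: np.zeros(DIM) for t in set(tokens)}
    for i, t in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                context[t] += index[tokens[j]]  # sum neighbors' index vectors
    return context

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

toks = "the cat sat on the mat . the dog sat on the rug .".split()
ctx = train(toks)
# 'cat' and 'dog' keep similar company, so their context vectors align
# more closely than those of unrelated word pairs.
print("cat~dog:", round(cosine(ctx["cat"], ctx["dog"]), 2))
print("cat~rug:", round(cosine(ctx["cat"], ctx["rug"]), 2))
```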

Is there anything new in Hawkins’ effort? For certain types of time-series prediction, the approach parallels artificial neural network designs: it makes an end-run around the complexity of shifting, multi-epoch training regimens that, in effect, build the high-dimensional distances between co-occurring events by gradually moving time-correlated data together and uncorrelated data apart. But then there is Random Indexing, which I’ve previously discussed here. If one restricts Random Indexing to operating on temporal patterns, or on spatial patterns, then the results start to look like Numenta’s offering.

While there is a bit of opportunism in Hawkins’ latching onto Big Data to promote an application of methods he has been working on for years, there are very real opportunities for trying to mine leading indicators to help with everything from ecommerce to research and development. Many flowers will bloom, grok, die, and be reborn.

Bats and Belfries

Thomas Nagel proposes a radical form of skepticism in his new book, Mind and Cosmos, continuing his trajectory through subjective experience and moral realism first begun with bats zigging and zagging among the homunculi of dualism reimagined in the form of qualia. The skepticism involves disputing materialistic explanations and proposing, instead, that teleological ones of an unspecified form will likely apply, for how else could his subtitle, which paints the “Neo-Darwinian Conception of Nature” as likely false, hold true?

Nagel is searching for a non-religious explanation, of course, because just animating nature through fiat is hardly an explanation at all; any sort of powerful, non-human entelechy could be gaming us and the universe in a non-coherent fashion. But what parameters might support his argument? Since he apparently requires a “significant likelihood” argument to hold sway in support of the origins of life, for instance, we might imagine what kind of thinking could begin with inanimate matter and lead to goal-directed behavior while supporting a significant likelihood of that outcome. The parameters might involve the conscious coordination of the events leading towards the emergence of goal-directed life, thus presupposing a consciousness that is not our own. We are back, then, to our non-human entelechy looming like an alien or like a strange creator deity (which is not desirable to Nagel).

We might also consider the possibility that there are properties of the universe itself that result in self-organization and that we either don’t yet know or are only beginning to understand. Elliott Sober’s critique suggests that the 2nd Law of Thermodynamics results in what I might call “patterned” behavior while not becoming “goal-directed” per se. Yet it is precisely the capacity for self-organization, beginning at the borderline of energy harvesting mediated by the 2nd Law, that produces some of the clearest examples of physical structures emerging from simpler forms, and in an inevitable way. That is, with a “significant likelihood” of occurrence. Can we, in fact, draw a meaningful distinction between an inevitable self-organizing crystal process and an evolutionary one that involves large populations of entities interacting together? It seems to me that if we can conceive of the first, we can attribute an equal or better weight of probability to the second.

Are there other options? Could the form of this “new teleology” that is non-materialistic in nature achieve other insights that are significantly likely? One possibility would be if a physical property or process showed a unique affinity for goal-directed behavior that could not be explained by bridging rules straddling known neuropsychological and evolutionary models. Such a phenomenon would be recognized by its resilience to explanation, I think, in claims like: there are no effective explanations for creativity; it is a uniquely human quality. Yet we just don’t have any compelling examples. Creativity does not appear to be completely resilient to explanation, nor does any human mental process.

Our bats, our belfries, remain uniquely our own.

Universal Artificial Social Intelligence

Continuing to develop the idea that social reasoning adds to Hutter’s Universal Artificial Intelligence model, below is his basic layout for agents and environments:

A few definitions: the Agent (p) is a Turing machine that consists of a working tape and an algorithm that can move the tape left or right, read a symbol from the tape, write a symbol to the tape, and transition through a finite number of internal states as held in a table. That is all that is needed to be a Turing machine, and Turing machines can compute anything our everyday notion of a computer can. Formally, there are bounds to what they can compute; for instance, whether any given program consisting of the symbols on the tape will stop at some point or will run forever without stopping (this is the so-called “halting problem”). But it suffices to think of the Turing machine as a general-purpose logical machine in that all of its outputs are determined by a sequence of state changes that follow from the sequence of inputs and transformations expressed in the state table. There is no magic here.
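To make “no magic here” concrete, here is a minimal sketch of a Turing machine simulator (the toy state table is mine, invented for illustration):

```python
# A minimal Turing machine: tape + head + state-transition table.
# The table maps (state, symbol) -> (new_symbol, move, new_state).
def run_tm(table, tape, state="start", halt="halt", max_steps=10_000):
    tape = dict(enumerate(tape))  # sparse tape; blank cells read as "_"
    head = 0
    for _ in range(max_steps):
        if state == halt:
            break
        sym = tape.get(head, "_")
        new_sym, move, state = table[(state, sym)]
        tape[head] = new_sym
        head += 1 if move == "R" else -1
    return "".join(tape[i] for i in sorted(tape))

# Toy table: walk right, flipping 0 <-> 1, and halt at the first blank.
flipper = {
    ("start", "0"): ("1", "R", "start"),
    ("start", "1"): ("0", "R", "start"),
    ("start", "_"): ("_", "R", "halt"),
}
print(run_tm(flipper, "0110"))  # -> 1001_
```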

Hutter then couples the agent to a representation of the environment, also expressed by a Turing machine (after all, the environment is likely deterministic), and has the output symbols of the agent (y) consumed by the environment, which, in turn, outputs the results of the agent’s interaction with it as a series of rewards (r) and environment signals (x) that are consumed by the agent once again.

Where this gets interesting is that the agent is trying to maximize the reward signal, which implies that the combined predictive model must convert all the history accumulated to a point in time into an optimal predictor. This is accomplished by minimizing behavioral error, and behavioral error is best minimized by choosing the shortest program that also predicts the history. By doing so, you simultaneously reduce the brittleness of the model to future changes in the environment.
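That shortest-program preference can be written compactly in Hutter’s notation (a paraphrase of the standard AIXI formulation, compressed for exposition rather than quoted from the paper): environments are programs q weighted by 2 to the minus their length, and the agent picks the action that maximizes expected summed reward under that weighting.

```latex
% Shortest-program weighting (a Solomonoff-style prior over environments q):
%   w(q) = 2^{-\ell(q)}, where \ell(q) is the length of program q.
% The agent's action choice at cycle k, looking ahead to horizon m:
y_k \;=\; \arg\max_{y_k} \sum_{x_k} \cdots \max_{y_m} \sum_{x_m}
  \big( r_k + \cdots + r_m \big)
  \sum_{q \,:\, q(y_{1:m}) \,=\, x_{1:m}} 2^{-\ell(q)}
```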

So far, so good. But this is just one agent coupled to the environment. If we have two agents competing against one another, we can treat each as the environment for the other and the mathematics is largely unchanged (see Hutter, pp. 36-37 for the treatment of strategic games via Nash equilibria and minimax). However, for non-competitive multi-agent simulations operating against the same environment there is a unique opportunity if the agents are sampling different parts of the environmental signal. So, let’s change the model to look as follows:

Now, each agent is sampling different parts of the output symbols generated by the environment (as well as the utility score, r). We assume that there is a rate difference between the agents’ input symbols and the environmental output symbols, but this is not particularly hard to model: as part of the input process, each agent’s state table just passes over N symbols, where N is the agent’s number, for instance. The resulting agents will still be Hutter-optimal with regard to the symbol sequences that they do process, and will generate outputs over time that maximize the additive utility of the reward signal, but they are no longer each maximizing the complete signal. Indeed, the relative quality of the individual agents is directly proportional to the quantity of the input symbol stream that they can consume.
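A toy rendering of that decimated sampling (my sketch, not Hutter’s formalism): the environment emits one symbol stream, and agent i only reads every Nth symbol starting at offset i, so each agent models a thinned view of the same signal.

```python
# One environment stream, K agents, each reading a decimated slice.
def striped_views(stream, num_agents):
    """Agent i sees stream[i], stream[i+K], stream[i+2K], ... (K = num_agents)."""
    return [stream[i::num_agents] for i in range(num_agents)]

stream = "abcabcabcabcabcabc"
for i, view in enumerate(striped_views(stream, 3)):
    print(f"agent {i} sees: {view}")
# Each agent sees a constant stream ('aaaaaa', 'bbbbbb', 'cccccc'): it models
# its own slice perfectly while missing the period-3 structure of the whole,
# which is the sense in which no single agent maximizes the complete signal.
```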

This limitation has an obvious fix: share the working tape between the individual agents:

Then, each agent can record not only its states on the tape, but can also consume the states of the other agents. The tape becomes an extended memory, or a shared theory about the environment. Formally, I don’t believe there is any difference between this and the single-agent model because, as with multi-head Turing machines, the sequence of moves and actions can be collapsed to a single table and a single tape insofar as the entire environmental signal is available to all of the agents (or their concatenated form). Instead, the value lies in considering what a multi-agent system implies about shared meaning and the value of coordination: for any real environment, perfect coupling between a single agent and that environment is an unrealistic simplification. And shared communication and shared modeling translate into an expansion of the individual agent’s model of the universe, yielding greater individual reward and, as a side effect, group reward as well.

Multitudes and the Mathematics of the Individual

The notion that there is a path from reciprocal altruism to big brains and advanced cognitive capabilities leads us to ask whether we can create “effective” procedures that shed additional light on the suppositions that are involved, and their consequences. Any skepticism about some virulent kind of scientism then gets whisked away by the imposition of a procedure combined with an earnest interest in careful evaluation of the outcomes. That may not be enough, but it is at least a start.

I turn back to Marcus Hutter, Solomonoff, and Chaitin-Kolmogorov at this point.  I’ll be primarily referencing Hutter’s Universal Algorithmic Intelligence (A Top-Down Approach) in what follows. And what follows is an attempt to break down how three separate factors related to intelligence can be explained through mathematical modeling. The first and the second are covered in Hutter’s paper, but the third may represent a new contribution, though perhaps an obvious one without the detail work that is needed to provide good support.

First, then, we start with a core requirement of any goal-seeking mechanism: the ability to predict patterns in the environment external to the mechanism. This has been well covered since the 1960s, when Solomonoff formalized the arguments implicit in Kolmogorov’s algorithmic information theory (AIT), later expanded on by Greg Chaitin. In essence, given a range of possible models represented by bit sequences of computational states, the shortest sequence that predicts the observed data is also the optimal predictor for any future data produced by the same underlying generator function. The shortest sequence is not computable, but we can keep searching for shorter programs and come up with unique optimizations for specific data landscapes. And that should sound familiar because it recapitulates Occam’s Razor and, in a subset of cases, Epicurus’ Principle of Multiple Explanations. This represents the floor plan of inductive inference, but it is only the first leg of the stool.
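The practical cousin of this idea is minimum description length (MDL); the bit-accounting below is a common textbook simplification of the Solomonoff ideal, not the incomputable thing itself. Each candidate model is scored by the bits needed to state it plus the bits needed to encode its residual errors, and the shortest total wins.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = 3 * x - 1 + rng.normal(0, 0.05, 50)  # truly linear data plus noise

BITS_PER_PARAM = 32  # crude cost to state one coefficient

def description_length(x, y, degree):
    """Two-part code: bits to state the model + bits to encode residuals."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    var = max(np.var(resid), 1e-12)
    model_bits = BITS_PER_PARAM * (degree + 1)
    # Gaussian code length for residuals: 0.5 * n * log2(2*pi*e*var)
    data_bits = 0.5 * len(y) * np.log2(2 * np.pi * np.e * var)
    return model_bits + data_bits

for d in (0, 1, 3, 5):
    print(f"degree {d}: {description_length(x, y, d):.0f} bits")
# The linear model (degree 1) should win: higher degrees fit noise but pay
# more bits to state themselves than they save on residuals.
```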

We should expect that evolutionary optimization might work according to this abstract formulation, but reality always intrudes. Instead, evolution is saddled by historical contingency that channels the movements through the search space. Biology ain’t computer science, in short, if for no other reason than it is tied to the physics and chemistry of the underlying operating system. Still the desire is there to identify such provable optimality in living systems because evolution is an optimizing engine, if not exactly an optimal one.

So we come to the second factor: optimality is not induction alone. Optimality is the interaction between the predictive mechanism and the environment. The “mechanism” might very well provide optimal or near optimal predictions of the past through a Solomonoff-style model, but applying those predictions introduces perturbations to the environment itself. Hutter elegantly simplifies this added complexity by abstracting the environment as a computing machine (a logical device; we assume here that the universe behaves deterministically even where it may have chaotic aspects) and running the model program at a slightly slower rate than the environmental program (it lags). Optimality is then a utility measure that combines prediction with resource allocation according to some objective function.

But what about the third factor that I promised and is missing? We get back to Fukuyama and the sociobiologists with this one: social interaction is the third factor. The exchange of information and the manipulation of the environment by groups of agents diffuses decision theory over inductive models of environments into a group of “mechanisms” that can, for example, transmit the location of optimal resource availability among the clan as a whole, increasing the utility of the individual agents with little cost to others. It seems appealing to expand Hutter’s model to include a social model, an agent model, and an environment within the purview of the mathematics. We might also get to the level where the social model overrides the agent model for a greater average utility, or where non-environmental signals from the social model interrupt function of the agent model, representing an irrational abstraction with group-selective payoff.

Reciprocity and Abstraction

Fukuyama’s suggestion is intriguing but needs further development and empirical support before it can be considered more than a hypothesis. To be mildly repetitive, ideology derived from scientific theories should be subject to even more scrutiny than religious-political ideologies if for no other reason than it can be. But in order to drill down into the questions surrounding how reciprocal altruism might enable the evolution of linguistic and mental abstractions, we need to simplify the problems down to basics, then work outward.

So let’s start with reciprocal altruism as a mere mathematical game. The iterated prisoner’s dilemma is a case study: you and a compatriot are accused of a heinous crime and put in separate rooms. If you deny involvement and so does your friend, you will each get 3 years in prison. If you admit to the crime and so does your friend, you will both get 1 year (cooperative behavior). But if one of you denies involvement while fingering the other, the fingerer walks free while the other gets 6 years (defection strategy). Joint fingering is equivalent to two denials, at 3 years each, since the evidence is equivocal. What does one do as a “rational actor” in order to minimize penalization? The only solution is to betray your friend while denying involvement (deny, deny, deny): you get either 3 years (he also denies), or you walk free (you finger him and he denies), or he fingers you too, which is the same as dual denials at 3 years each. Assuming each outcome is equally likely, the average years served are 1/3*3 + 1/3*0 + 1/3*3 = 2 years versus 1/2*1 + 1/2*6 = 3.5 years for admitting to the crime.

In other words it doesn’t pay to cooperate.

But that isn’t the “iterated” version of the game. In the iterated prisoner’s dilemma the game is played over and over again. What strategy is best then? An early empirical result showed that “tit for tat” worked impressively well between two actors. In tit-for-tat you don’t need much memory of your co-conspirator’s past behavior: it suffices to simply do in the current round what they did in the last round. If they defected, you defect to punish them. If they cooperated, you cooperate.
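A minimal tournament sketch, using the years-served payoffs from the setup above (lower totals are better; the strategy functions and loop structure are my own illustration):

```python
# Payoffs in years served (me, them), from the passage above:
# C = admit jointly (cooperate), D = deny-and-finger (defect).
YEARS = {("C", "C"): (1, 1), ("C", "D"): (6, 0),
         ("D", "C"): (0, 6), ("D", "D"): (3, 3)}

def tit_for_tat(my_hist, their_hist):
    return their_hist[-1] if their_hist else "C"  # start nice, then mirror

def always_defect(my_hist, their_hist):
    return "D"

def play(s1, s2, rounds=100):
    h1, h2, y1, y2 = [], [], 0, 0
    for _ in range(rounds):
        m1, m2 = s1(h1, h2), s2(h2, h1)
        p1, p2 = YEARS[(m1, m2)]
        y1, y2 = y1 + p1, y2 + p2
        h1.append(m1); h2.append(m2)
    return y1, y2

print("TfT vs TfT:      ", play(tit_for_tat, tit_for_tat))    # (100, 100)
print("TfT vs Defector: ", play(tit_for_tat, always_defect))  # (303, 297)
print("Defector pair:   ", play(always_defect, always_defect))# (300, 300)
# Mutual tit-for-tat serves far fewer total years than mutual defection:
# in the repeated game, cooperation pays after all.
```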

But this is just two actors and fixed payoff matrices. What if we expand the game to include hundreds of interacting agents who are all competing for mating privileges and access to resources? Fukuyama’s claim is being applied to human prehistory, after all. How does a more complex competitive-cooperative landscape change these simple games and lead to an upward trajectory of abstraction, induction, abduction, or other mechanisms that feed into cognitive processes and then into linguistic ones? We can bound the problem in the following way: the actors need at least as many bits as there are interacting actors to be able to track everyone’s defection status as of the last interaction. And, since there are observable limitations to identifying defection (cheating) with regard to mating opportunities or other complex human behaviors, we can expand the bits requirement to floating-point representations that cast past behavior as an estimate of the likelihood of future defections. Next, you have to maintain individual statistical models of each participant to better estimate their likelihood of defection versus cooperation (hundreds of estimates and variables). You also need a vast array of predictive neural structures that are tuned to various social cues (Did he just flirt with my girlfriend? Did he just suck up to the headman?).

We do seem to end up with big brains, just like Vonnegut predicted and lamented in Galapagos, though contra-Vonnegut whether those big brains translate into species-wide destruction is less about prediction and more about policy choices. Still, Fukuyama is better than most historians in that he neither succumbs to atheoretical reporting (ODTAA: history is just “one damn thing after another”) nor to fixating on the support of a central theory that forces the interpretation of the historical record (OMEX: “one more example of X”).

Science, Pre-science, and Religion

Francis Fukuyama in The Origins of Political Order: From Prehuman Times to the French Revolution draws a bright line from reciprocal altruism to abstract reasoning, and then through to religious belief:

Game theory…suggests that individuals who interact with one another repeatedly tend to gravitate toward cooperation with those who have shown themselves to be honest and reliable, and shun those who have behaved opportunistically. But to do this effectively, they have to be able to remember each other’s past behavior and to anticipate likely future behavior based on an interpretation of other people’s motives.

Then, language allows transmission of historical patterns (largely gossip in tight-knit social groups) and abstractions about ethical behaviors until, ultimately:

The ability to create mental models and to attribute causality to invisible abstractions is in turn the basis for the emergence of religion.

But this can’t be the end of the line. Insofar as abstract beliefs can attribute repetitive weather patterns to Olympian gods, or consolidate moral reasoning to a monotheistic being, the same mechanisms of abstraction must be the basis for scientific reasoning as well. Either that or the cognitive capacities for linguistic abstraction and game theory are not cross-applicable to scientific thinking, which seems unlikely.

So the irony of assertions that science is just another religion is that they certainly share a similar initial cognitive evolution, while nevertheless diverging in their dependence on faith and supernatural expectations, on the one hand, and channeling the predictive models along empirical contours on the other.

On the Structure of Brian Eno

I recently came across an ancient document, older than my son, dating to 1994, when I had a brief FAX-based exchange of communiques with Brian Eno, the English eclectic electronic musician and producer of everything from Bowie’s Low through to U2’s Joshua Tree and Jane Siberry. The editor of the Whole Earth Catalog, who had seen one of my colleagues (Eric in the FAXes, below) present at an Artificial Life conference, pointed Eno at Eric’s efforts at using models of RNA replication to create music. I was doing other, somewhat related work, and Eric allowed me to correspond with Mr. Eno. I did, resulting in a brief round of FAXes (email was fairly new to the non-specialist in 1994).

I later dropped off a copy of a research paper I had written at his London office and he was summoned down from an office/loft and shook his head in the negative about me. I was shown the door by the receptionist.

Below is my last part of the FAX interchange. Due to copyright and privacy concerns, I’ll only show my part of the exchange (and, yes, I misspelled “Britain”). Notably, Brian still talks about the structure of music and art in recent interviews.