Category: AI

Pressing Snobs into Hell

Paul Vitanyi has been a deep advocate for Kolmogorov complexity for many years. His book with Ming Li, An Introduction to Kolmogorov Complexity and Its Applications, remains on my bookshelf (and was a bit of an investment in grad school).

I came across a rather interesting paper by Vitanyi with Rudi Cilibrasi called “Clustering by Compression” that illustrates perhaps more easily and clearly than almost any other recent work the tight connections between meaning, repetition, and informational structure. Rather than describing the paper, however, I wanted to conduct an experiment that demonstrates their results. To do this, I asked the question: are the writings of Dante more similar to other writings of Dante than to Thackeray? And is the same true of Thackeray relative to Dante?

Now, we could pursue these questions at many different levels. We might ask scholars, well-versed in the works of each, to compare and contrast the two authors. They might invoke cultural factors, the memes of their respective eras, and their writing styles. Ultimately, though, the scholars would have to get down to some textual analysis, looking at the words on the page. And in so doing, they would draw distinctions by lifting features of the text, comparing and contrasting grammatical choices, word choices, and other basic elements of the prose and poetry on the page. We might very well be able to take parts of the knowledge of those experts and distill it into some kind of a logical procedure or algorithm that would parse the texts and draw distinctions based on the distributions of words and other structural cues. If asked, we might say that a similar method might work for the so-called language of life, DNA, but that it would require a different kind of background knowledge to build the analysis, much less create an algorithm to perform the same task. And perhaps a similar procedure would apply to music, finding both simple similarities between features like tempos, as well as complex stylistic and content-based alignments.

Yet, what if we could invoke some kind of universal approach to finding exactly what features of each domain are important for comparisons? The universal approach would need to infer the appropriate symbols, grammars, relationships, and even meaning in order to do the task. And this is where compression comes in. In order to efficiently compress a data stream, an algorithm must infer repetitive patterns that are both close together like the repetition of t-h-e in English, as well as further apart, like verb agreement and other grammatical subtleties in the same language. By inferring those patterns, the compressor can then create a dictionary and emit just a reference to the pattern in the dictionary, rather than the pattern itself.

Cilibrasi and Vitanyi, in their paper, conduct a number of studies on an innovative application of compression to finding similarities and, conversely, differences. To repeat their experiment, I used a standard data compressor called bzip2 and grabbed two cantos from Dante’s Divine Comedy at random, and then two chapters from Thackeray’s The Book of Snobs. Following their procedure, I compressed each text individually using bzip2, and then compressed each pairwise concatenation as well. The idea is that when you concatenate two documents, the similarities between them should manifest as improved compression (smaller files) because the pattern dictionary will be more broadly applicable. The sizes need to be normalized, however, because the files themselves vary in length, so following Cilibrasi and Vitanyi’s procedure, I subtracted the minimum of the two individual compressed sizes from the compressed size of the concatenation and divided the result by the maximum of the two individual compressed sizes.
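The procedure is easy to reproduce. Below is a minimal sketch using Python’s standard bz2 module in place of the bzip2 command line; the texts here are toy stand-ins for the actual cantos and chapters:

```python
import bz2
import random

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance per Cilibrasi and Vitanyi:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size in bytes."""
    cx = len(bz2.compress(x))
    cy = len(bz2.compress(y))
    cxy = len(bz2.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy stand-ins: two similar English texts and one block of
# incompressible noise (deterministic pseudo-random bytes).
a = b"In the middle of the journey of our life I found myself in a dark wood. " * 40
b = b"In the middle of the journey of her life she found herself in a dark wood. " * 40
noise = bytes(random.Random(0).getrandbits(8) for _ in range(2000))

# Similar texts land closer together than text and noise.
print(ncd(a, b) < ncd(a, noise))  # -> True
```

The same three-compression recipe, applied to the six pairings of the four texts, produces the table below.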

The results were perfect:

X1    X1 Size    X2    X2 Size    X1X2 Size    NCD
d1    2828       d2    3030       5408         4933.748232
d1    2828       t1    3284       5834         6201.203678
d1    2828       t2    2969       5529         5280.703324
d2    3030       t1    3284       6001         5884.148845
d2    3030       t2    2969       5692         5000.694389
t1    3284       t2    2969       5861         4599.207369


Note that d1 has the lowest distance (NCD) to d2, and t1 is likewise closest to t2. In this table, d1 and d2 are the Dante cantos and t1 and t2 are the Thackeray chapters. NCD is the Normalized Compression Distance, calculated per the paper. X1 Size and X2 Size are the compressed sizes of X1 and X2 in bytes, and X1X2 Size is the compressed size of the concatenated original texts (X1 followed by X2).

Intelligence versus Motivation

Nick Bostrom adds to the dialog on desire, intelligence, and intentionality with his recent paper, The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents. The argument is largely a deconstruction of the general assumption that there is somehow an inexorable linkage between intelligence and moral goodness. Indeed, he even proposes that intelligence and motivation are essentially orthogonal (“The Orthogonality Thesis”) but that there may be a particular subset of possible trajectories towards any goal that are common (self-preservation, etc.). The latter is scoped by his “instrumental convergence thesis,” under which there might be convergence towards central tenets that look an awful lot like the vagaries of human moral sentiments. But they remain vagaries and should not be taken to mean that advanced artificial agents will act in a predictable manner.

Universal Artificial Social Intelligence

Continuing to develop the idea that social reasoning adds to Hutter’s Universal Artificial Intelligence model, below is his basic layout for agents and environments:

A few definitions: The Agent (p) is a Turing machine that consists of a working tape and an algorithm that can move the tape left or right, read a symbol from the tape, write a symbol to the tape, and transition through a finite number of internal states as held in a table. That is all that is needed to be a Turing machine, and Turing machines can compute anything that matches our everyday notion of a computer. Formally, there are bounds to what they can compute; for instance, no algorithm can decide whether any given program consisting of the symbols on the tape will stop at some point or will run forever (this is the so-called “halting problem”). But it suffices to think of the Turing machine as a general-purpose logical machine in that all of its outputs are determined by a sequence of state changes that follow from the sequence of inputs and transformations expressed in the state table. There is no magic here.
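As a concrete toy, here is a minimal sketch of such a table-driven machine (the names and the example machine are my own invention, not Hutter’s notation):

```python
def run_tm(table, tape, state="start", blank="_", max_steps=1000):
    """Simulate a one-tape Turing machine.

    table maps (state, symbol) -> (symbol_to_write, move, next_state),
    where move is -1 (left) or +1 (right); the machine halts when it
    reaches a (state, symbol) pair with no entry in the table."""
    cells = dict(enumerate(tape))  # sparse tape: position -> symbol
    head = 0
    for _ in range(max_steps):
        symbol = cells.get(head, blank)
        if (state, symbol) not in table:
            break
        write, move, state = table[(state, symbol)]
        cells[head] = write
        head += move
    return "".join(cells[i] for i in sorted(cells)).strip(blank)

# A machine that inverts a binary string, moving right until the blank.
invert = {
    ("start", "0"): ("1", +1, "start"),
    ("start", "1"): ("0", +1, "start"),
}
print(run_tm(invert, "1011"))  # -> 0100
```

The state table is the entire “algorithm”: every output follows mechanically from the inputs and the transitions, which is exactly the point of the “no magic here” remark.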

Hutter then couples the agent to a representation of the environment, also expressed by a Turing machine (after all, the environment is likely deterministic), and has the output symbols (y) of the agent consumed by the environment, which, in turn, outputs the results of the agent’s interaction with it as a series of rewards (r) and environment signals (x) that are consumed by the agent once again.

Where this gets interesting is that the agent is trying to maximize the reward signal, which implies that the combined predictive model must convert all the history accumulated up to a point in time into an optimal predictor. This is accomplished by minimizing the behavioral error, and behavioral error is best minimized by choosing the shortest program that also predicts the history. By doing so, you simultaneously reduce the brittleness of the model to future changes in the environment.

So far, so good. But this is just one agent coupled to the environment. If we have two agents competing against one another, we can treat each as the environment for the other and the mathematics is largely unchanged (see Hutter, pp. 36-37 for the treatment of strategic games via Nash equilibria and minimax). However, for non-competitive multi-agent simulations operating against the same environment there is a unique opportunity if the agents are sampling different parts of the environmental signal. So, let’s change the model to look as follows:

Now, each agent is sampling different parts of the output symbols generated by the environment (as well as the utility score, r). We assume that there is a rate difference between the agents’ input symbols and the environmental output symbols, but this is not particularly hard to model: as part of the input process, an agent’s state table just passes over N symbols, where N is the number of the agent, for instance. The resulting agents will still be Hutter-optimal with regard to the symbol sequences that they do process, and will generate outputs over time that maximize the additive utility of the reward signal, but they are no longer each maximizing over the complete signal. Indeed, the relative quality of the individual agents is directly proportional to the quantity of the input symbol stream that they can consume.
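The symbol-skipping amounts to each agent decimating the environment’s output stream. A trivial sketch of that sampling (the names are my own, not Hutter’s formalism):

```python
def agent_view(stream, agent_index, num_agents):
    """Agent agent_index reads every num_agents-th symbol, starting at
    its own offset, passing over the symbols read by the other agents."""
    return stream[agent_index::num_agents]

env_output = "abcdefgh"  # stand-in for the environment's symbol stream
views = [agent_view(env_output, i, 2) for i in range(2)]
print(views)  # -> ['aceg', 'bdfh']
```

Each agent sees only half the signal, which is precisely why its model of the environment is weaker than a single agent consuming the whole stream.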

Overcoming this limitation has an obvious fix: share the working tape between the individual agents:

Then, each agent can record not only its own states on the tape but can also consume the states of the other agents. The tape becomes an extended memory, or a shared theory about the environment. Formally, I don’t believe there is any difference between this and the single-agent model because, as with multihead Turing machines, the sequence of moves and actions can be collapsed to a single table and a single tape insofar as the entire environmental signal is available to all of the agents (or their concatenated form). Instead, the value lies in considering what a multi-agent system implies concerning shared meaning and the value of coordination: for any real environment, perfect coupling between a single agent and that environment is an unrealistic simplification. Shared communication and shared modeling translate into an expansion of the individual agent’s model of the universe, into greater individual reward and, as a side effect, group reward as well.

Multitudes and the Mathematics of the Individual

The notion that there is a path from reciprocal altruism to big brains and advanced cognitive capabilities leads us to ask whether we can create “effective” procedures that shed additional light on the suppositions that are involved, and their consequences. Any skepticism about some virulent kind of scientism then gets whisked away by the imposition of a procedure combined with an earnest interest in careful evaluation of the outcomes. That may not be enough, but it is at least a start.

I turn back to Marcus Hutter, Solomonoff, and Chaitin-Kolmogorov at this point. I’ll be primarily referencing Hutter’s Universal Algorithmic Intelligence (A Top-Down Approach) in what follows. And what follows is an attempt to break down how three separate factors related to intelligence can be explained through mathematical modeling. The first and the second are covered in Hutter’s paper, but the third may represent a new contribution, though perhaps an obvious one that still lacks the detail work needed to provide good support.

First, then, we start with a core requirement of any goal-seeking mechanism: the ability to predict patterns in the environment external to the mechanism. This has been well covered since the 1960s, when Solomonoff formalized the arguments implicit in Kolmogorov’s algorithmic information theory (AIT), later expanded on by Gregory Chaitin. In essence, given a range of possible models represented by bit sequences of computational states, the shortest sequence that predicts the observed data is also the optimal predictor for any future data produced by the same underlying generator function. The shortest sequence is not computable, but we can keep searching for shorter programs and come up with unique optimizations for specific data landscapes. And that should sound familiar because it recapitulates Occam’s Razor and, in a subset of cases, Epicurus’ Principle of Multiple Explanations. This represents the floor plan of inductive inference, but it is only the first leg of the stool.

We should expect that evolutionary optimization might work according to this abstract formulation, but reality always intrudes. Instead, evolution is saddled by historical contingency that channels its movements through the search space. Biology ain’t computer science, in short, if for no other reason than it is tied to the physics and chemistry of the underlying operating system. Still, the desire is there to identify such provable optimality in living systems, because evolution is an optimizing engine, if not exactly an optimal one.

So we come to the second factor: optimality is not induction alone. Optimality is the interaction between the predictive mechanism and the environment. The “mechanism” might very well provide optimal or near optimal predictions of the past through a Solomonoff-style model, but applying those predictions introduces perturbations to the environment itself. Hutter elegantly simplifies this added complexity by abstracting the environment as a computing machine (a logical device; we assume here that the universe behaves deterministically even where it may have chaotic aspects) and running the model program at a slightly slower rate than the environmental program (it lags). Optimality is then a utility measure that combines prediction with resource allocation according to some objective function.

But what about the third factor that I promised and that is missing? We get back to Fukuyama and the sociobiologists with this one: social interaction is the third factor. The exchange of information and the manipulation of the environment by groups of agents diffuses decision theory over inductive models of environments into a group of “mechanisms” that can, for example, transmit the location of optimal resource availability among the clan as a whole, increasing the utility of the individual agents at little cost to others. It seems appealing to expand Hutter’s model to include a social model, an agent model, and an environment within the purview of the mathematics. We might also get to the level where the social model overrides the agent model for a greater average utility, or where non-environmental signals from the social model interrupt the function of the agent model, representing an irrational abstraction with a group-selective payoff.

Bostrom on the Hardness of Evolving Intelligence

At 38,000 feet somewhere above Missouri, returning from a one-day trip to Washington, D.C., it is easy to take Nick Bostrom’s point that bird flight is not the end-all of what is possible for airborne objects and mechanical contrivances like airplanes in his paper, How Hard is Artificial Intelligence? Evolutionary Arguments and Selection Effects. His efforts to try to bound and distinguish the evolution of intelligence as either Hard or Not-Hard run up against significant barriers, however. As a practitioner of the art, finding similarities between a purely physical phenomenon like flying and something as complex as human intelligence falls flat for me.

But Bostrom is not taking flying as more than a starting point for arguing that there is an engineer-able possibility for intelligence. And that possibility might be bounded by a number of current and foreseeable limitations, not least of which is that computer simulations of evolution require a certain amount of computing power and representational detail in order to be a sufficient simulation. His conclusion is that we may need as much as another 100 years of improvements in computing technology just to get to a point where we might succeed at a massive-scale evolutionary simulation (I’ll leave to the reader to investigate his additional arguments concerning convergent evolution and observer selection effects).

Bostrom dismisses as pessimistic the assumption that a sufficient simulation would, in fact, require a highly detailed emulation of some significant portion of the real environment and the history of organism-environment interactions:

A skeptic might insist that an abstract environment would be inadequate for the evolution of general intelligence, believing instead that the virtual environment would need to closely resemble the actual biological environment in which our ancestors evolved … However, such extreme pessimism seems unlikely to be well founded; it seems unlikely that the best environment for evolving intelligence is one that mimics nature as closely as possible. It is, on the contrary, plausible that it would be more efficient to use an artificial selection environment, one quite unlike that of our ancestors, an environment specifically designed to promote adaptations that increase the type of intelligence we are seeking to evolve.

Unfortunately, I don’t see any easy way to bound the combined complexity of the needed substrate for evolutionary action (be it artificial organisms or just artificial neuronal networks) and the complexity of defining the necessary artificial environment for achieving the requested goal. It just makes it at least as hard and perhaps harder in that we can define a physical system much more easily than an abstract adaptive landscape designed to “promote…abstract reasoning and general problem-solving skills.”

Randomness and Meaning

The impossibility of the Chinese Room has implications across the board for understanding what meaning means. Mark Walker’s paper “On the Intertranslatability of all Natural Languages” describes how the translation of words and phrases may be achieved:

  1. Through a simple correspondence scheme (word for word)
  2. Through “syntactic” expansion of the languages to accommodate concepts that have no obvious equivalence (“optometrist” => “doctor for eye problems”, etc.)
  3. Through incorporation of foreign words and phrases as “loan words”
  4. Through “semantic” expansion where the foreign word is defined through its coherence within a larger knowledge network.

An example for (4) is the word “lepton” where many languages do not have a corresponding concept and, in fact, the concept is dependent on a bulwark of advanced concepts from particle physics. There may be no way to create a superposition of the meanings of other words using (2) to adequately handle “lepton.”

These problems present again for trying to understand how children acquire meaning in learning a language. As Walker points out, language learning for a second language must involve the same kinds of steps as learning translations, so any simple correspondence theory has to be supplemented.

So how do we make adequate judgments about meanings and so rapidly learn words, often initially with a coarse granularity but later with increasingly sharp levels of focus? What procedure is required for expanding correspondence theories to operate in larger networks? Methods like Latent Semantic Analysis and Random Indexing show how this can be achieved in ways that are illuminating about human cognition. In each case, the methods provide insights into how relatively simple transformations of terms and their occurrence contexts can be viewed as providing a form of “triangulation” about the meaning of words. And, importantly, this level of triangulation is sufficient for these methods to do very human-like things. Both methods can pass the TOEFL exam, for instance, and Latent Semantic Analysis is at the heart of automatic essay grading approaches that have sufficiently high success rates that they are widely used by standardized test makers.

How do they work? I’ll just briefly describe Random Indexing, since I recently presented the concept at the Big Data Science meetup at SGI in Fremont, California. In Random Indexing, we simply create a randomized sparse vector for each word we encounter in a large collection of texts. The vector can be binary as a first approximation, so something like:

The: 0000000000000100000010000000000000000001000000000000000…

quick: 000100000000000010000000000001000000000110000000000000…

fox: 0000000000000000000000100000000000000000000000000100100…

Now, as we encounter a given word in the text, we just add up the random vectors for the words around it to create a new “context” vector that is still sparse, but less so than its component parts. What is interesting about this approach is that if you consider the vectors as representing points in a hyperspace whose dimensionality equals the vector length, then words that have similar meanings tend to cluster in that space. Latent Semantic Analysis achieves a similar clustering using some rather complex linear algebra, and the eigenvector machinery underlying LSA is related to that of Google’s PageRank algorithm, though PageRank operates on link structure rather than word co-occurrences.
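A minimal sketch of the index and context vectors, assuming toy sizes (real Random Indexing systems use thousands of dimensions; the seeding trick here is just a convenient way to make each word’s random vector reproducible):

```python
import random

DIM, NONZERO = 1024, 8  # toy sizes; production indexes are larger

def index_vector(word):
    """Sparse random binary vector for a word, deterministically
    seeded by the word so every occurrence maps to the same vector."""
    rng = random.Random(word)
    vec = [0] * DIM
    for pos in rng.sample(range(DIM), NONZERO):
        vec[pos] = 1
    return vec

# The context vector for "fox" in "the quick brown fox jumps" is just
# the sum of its neighbors' index vectors: still sparse, but less so.
context = [0] * DIM
for neighbor in ["the", "quick", "brown", "jumps"]:
    context = [a + b for a, b in zip(context, index_vector(neighbor))]

print(sum(index_vector("the")))            # each word vector has 8 ones
print(sum(1 for x in context if x) <= 32)  # at most 4 * 8 nonzero entries
```

Because the index vectors are nearly orthogonal, the sum preserves which neighbors contributed, and words that keep similar company accumulate similar context vectors.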

So how do we solve the TOEFL test using an approach like Random Indexing? A large collection of texts is analyzed to create a Random Index; then, for a sample question like:

In line 5, the word “pronounced” most closely means

  1. evident
  2. spoken
  3. described
  4. unfortunate

The question text is converted into a context vector using the same random index vectors, and then the answer vectors are compared to see which is closest to it in the index space. This is remarkably inexpensive to compute, requiring just an inner product between the context vectors of the question and each answer.
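Under stated assumptions (a tiny hand-built corpus, toy vector sizes, and helper names of my own invention), the whole selection pipeline can be sketched end to end:

```python
import math
import random
from collections import defaultdict

DIM, NONZERO, WINDOW = 1024, 8, 2  # toy sizes

def index_vector(word):
    """Sparse random binary vector, deterministically seeded by the word."""
    rng = random.Random(word)
    vec = [0] * DIM
    for pos in rng.sample(range(DIM), NONZERO):
        vec[pos] = 1
    return vec

def context_vectors(tokens):
    """Sum each word's neighbors' index vectors into its context vector."""
    ctx = defaultdict(lambda: [0] * DIM)
    for i in range(len(tokens)):
        for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
            if j != i:
                ctx[tokens[i]] = [a + b for a, b in
                                  zip(ctx[tokens[i]], index_vector(tokens[j]))]
    return ctx

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# "pronounced" and "evident" occur in identical contexts in this tiny
# corpus, so their context vectors land close together in index space.
corpus = ("the effect was pronounced in the data "
          "the effect was evident in the data "
          "the word was spoken aloud today").split()
ctx = context_vectors(corpus)
answers = ["evident", "spoken", "data"]
best = max(answers, key=lambda w: cosine(ctx["pronounced"], ctx[w]))
print(best)  # -> evident
```

A real system would, of course, build the index from millions of words rather than three sentences, but the selection step is exactly this cheap: one inner product (here normalized to a cosine) per candidate answer.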

A method for compact coding using Algorithmic Information Theory can also be used to achieve similar results, demonstrating the wide applicability of context-based analysis to understanding how intertranslatability and language learning depend on the rich contexts of word usage.

On the Soul-Eyes of Polar Bears

I sometimes reference a computational linguistics factoid that appears to be now lost in the mists of early DoD Tipster program research: Chinese linguists only agree on the segmentation of texts into words about 80% of the time. We can find some qualitative agreement on the problematic nature of the task, but the 80% is widely smeared out among the references that I can now find. It should be no real surprise, though, because even English with white-space tokenization resists easy characterization of words versus phrases: “New York” and “New York City” are almost words in themselves, though given white-space tokenization they are also phrases. Phrases lift out with common and distinct usage, however, and become more than the sum of their parts; it would be ridiculously noisy to match a search for “York” against “New York” because no one in the modern world attaches semantic significance to the “York” part of the phrase. It exists as a whole and the nature of the parts has dissolved against this holism.

John Searle’s Chinese Room argument came up again today. My son was waxing, as he does, in a discussion about mathematics and order, and suggested a poverty of our considerations of the world as being purely and completely natural. He meant in the sense of “materialism” and “naturalism” meaning that there are no mystical or magical elements to the world in a metaphysical sense. I argued that there may nonetheless be something that is different and indescribable by simple naturalistic calculi: there may be qualia. It led, in turn, to a qualification of what is unique about the human experience and hence on to Searle’s Chinese Room.

And what happens in the Chinese Room? Well, without knowledge of Chinese, you are trapped in a room with a large collection of rules for converting Chinese questions into Chinese answers. As slips of Chinese questions arrive, you consult the rule book and spit out responses. Searle’s point was that it is silly to argue that the algorithm embodied by the room really understands Chinese and that the notion of “Strong AI” (artificial intelligence is equivalent to human intelligence insofar as there is behavioral equivalence between the two) falls short of the meaning of “strong.” This is a correlate to the Turing Test in a way, which also posits a thought experiment with computer and human interlocutors who are remotely located.

The arguments against the Chinese Room range from complaints that there is no other way to establish intelligence to the claim that given sensory-motor relationships with the objects the symbols represent, the room could be considered intentional. I don’t dispute any of these arguments, however. Instead, I would point out that the initial specification of the gedankenexperiment fails in the assumption that the Chinese Room is actually able to produce adequate outputs for the range of possible inputs. In fact, while even the linguists disagree about the nature of Chinese words, every language can be used to produce utterances that have never been uttered before. Chomsky’s famous “colorless green ideas sleep furiously” shows the problem with clarity. It is the infinitude of language and its inherent ambiguity that makes the Chinese Room an inexact metaphor. A Chinese questioner could ask how the “soul-eyes of polar bears beam into the hearts of coal miners,” and the system would fail like enjambing precision German machinery fed tainted oil. Yeah, German machinery enjambs just like polar bears beam.

So the argument stands in its opposition to Strong AI given its initial assumptions, but fails given real qualifications of those assumptions.

NOTE: There is possibly a formal argument embedded in here, in that a Chomsky grammar that is recursively enumerable has infinitely many possible productions, but an algorithm can be devised to accommodate those productions given Turing completeness. Such an algorithm exists in principle only, however, and does require a finite symbol alphabet. While the Chinese characters may be finite, the semantic and pragmatic metadata are not clearly so.

Teleology, Chapter 5

Harry spent most of that summer involved in the Santa Fe Sangre de Cristo Church, first with the church summer camp, then with the youth group. He seemed happy and spent the evenings text messaging with his new friends. I was jealous in a way, but refused to let it show too much. Thursdays he was picked up by the church van and went to watch movies in a recreation center somewhere. I looked out one afternoon as the van arrived and could see Sarah’s bright hair shining through the high back window of the van.

Mom explained that they seemed to be evangelical, meaning that they liked to bring as many new worshippers into the religion as possible through outreach and activities. Harry didn’t talk much about his experiences. He was too much in the thick of things to be concerned with my opinions, I think, and snide comments were brushed aside with a beaming smile and a wave. “You just don’t understand,” Harry would dismissively tell me.

I was reading so much that Mom would often demand that I get out of the house on weekend evenings after she had encountered me splayed on the couch straight through lunch and into the shifting evening sunlight passing through the high windows of our thick-walled adobe. I would walk then, often for hours, snaking up the arroyos towards the mountains, then wend my way back down, traipsing through the thick sand until it was past dinner time.

It was during this time period that I read cyberpunk authors and became intrigued with the idea that someday, one day, perhaps computing machines would “wake up” and start to think on their own. I knew enough about computers that I could not even conceive of how that could possibly come about. My father had once described for me a simple guessing game that learned. If the system couldn’t guess your choice of animal, it would concede and use the correct answer to expand its repertoire. I had called it “learning by asking” at the time but only saw it as a simple game and never connected it to the problem of human learning.

Yet now the concept made some sense as an example of how an intelligent machine could escape from the confines of just producing the outputs that it was programmed to produce. But there were still confines; the system could never just reconfigure the rules system or decide to randomly guess when it got bored (or even get bored). There was something profound missing from our understanding of human intelligence.

Purposefulness seemed to be the missing attribute that we had and that machines did not. We were capable of making choices by a mechanism of purposefulness that transcended simple programmable rules systems, I hypothesized, and also traced that purpose back to more elementary programming that was part of our instinctive, animal core. There was a philosophical problem with this scheme, though, that I recognized early on; if our daily systems of learning and thought were just elaborations of logical games like that animal learning game, and the purpose was embedded more deeply, what natural rules governed that deeper thing, and how could it be fundamentally different than the higher-order rules?

I wanted to call this core “instinct” and even hypothesized that if it could be codified it would bridge the gap between truly thinking and merely programmed machines. But the alternative to instinct being a logical system seemed to be assigning it supernatural status and that wasn’t right for several reasons.

First, the commonsense notion of instinct associated with doing primitive things like eating, mating and surviving seemed far removed from the effervescent and transcendent ideas about souls that were preached by religions. I wanted to understand the animating principle behind simple ideas like wanting to eat and strategizing about how to do it—hardly the core that ascends to heaven in Christianity and other religions I was familiar with. It was also shared across all animals and even down to the level of truly freaky things like viruses and prions.

The other problem was that any answer of supernaturalism struck me as leading smack into an intellectual brick wall because we could explain and explain until we get to the core of our beings and then just find this billiard ball of God-light. Somehow, though, that billiard ball had to emanate energy or little logical arms to affect the rules systems by which we made decisions; after all, purposefulness can’t just be captive in the billiard ball but has to influence the real world, and at that point we must be able to characterize those interactions and guess a bit at the structure of the billiard ball.

So the simplest explanation seemed to be that the core, instinct, was a logically describable system shaped by natural processes and equipped with rules that governed how to proceed. Those rules didn’t need to be simple or even easily explainable, but they needed to be capable of explanation. Any other scheme I could imagine involved a problem of recursion, with little homunculi trapped inside other homunculi and ultimately powered by a billiard ball of cosmic energy.

I tried to imagine what the religious thought about this scheme of explanation but found what I had heard from Harry to be largely incompatible with any sort of explanation. Instead, the discussion was devoid of any sort of detailed analysis or arguments concerning human intelligence. There was a passion play between good and evil forces, the notion of betraying or denying the creator god, and an unexplained transmigration of souls, being something like our personalities or identities. If we wanted to ask a question about why someone had, say, committed a crime, it was due to supernatural influences that acted through their personalities. More fundamental questions like how someone learned to speak a language, which I thought was pretty amazing, were not apparently subject to the same supernatural processes, but might be explained with a simple recognition of the eminence of God’s creation. So moral decisions were subject to evil while basic cognition was just an example of supernatural good in this scheme of things, with the latter perhaps subject to the arbitrary motivations of the creator being.

Supernaturalism was an appeal to non-explanation driven by a conscious desire to not look for answers. “God’s Will” was the refrain for this sort of reasoning and it was counterproductive to understanding how intelligence worked or had come about.

God was the end of all thought. The terminus of bland emptiness. A void.

But if natural processes were responsible, then the source of instinct was evolutionary in character. Evolution led to purpose, but in a kind of secondhand way. The desire to reproduce did not directly result in complex brains or those elaborate plumes on birds that showed up in biology textbooks. It was a distal effect built on a basic platform of sensing and reacting and controlling the environment. That seemed obvious enough but was just the beginning of the puzzle for me. It also made the possibility of machines “waking up” seem far too distant, since evolution worked very slowly in the biological world.

I suddenly envisioned computer programs competing with each other to solve specific problems in massive phalanxes. Each program varied slightly from the others in minute details. One could print “X” while another could print “Y”. The programs that did better would then be replicated into the next generation. Gradually the programs would converge on solving a problem using a simple evolutionary scheme. There was an initial sense of elegant simplicity, though the computing system to carry the process out seemed at least as large as the internet itself. There was a problem, however. The system required a central governor to carry out the replication and mutation of the programs and to measure their success. It would also have to kill off, to reap, the losers. That governor struck me as remarkably god-like in its powers, sitting above the population of actors and defining the world in which they acted. It was also inevitable that the solutions at which the programs would arrive would be completely shaped by the adaptive landscape that they were presented with; though they were competing against one another, their behavior was mediated through an outside power. It was like a game show in a way and didn’t have the kind of direct competition that real evolutionary processes inherently have.
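The governor-driven scheme imagined here is essentially a classic genetic algorithm. A minimal sketch, with strings standing in for programs and an externally imposed target string standing in for the objective function (all names and parameters below are illustrative, not from the original):

```python
import random

# The "governor" loop: replicate, mutate, score, and reap the losers.
# Each "program" is just a string; the external objective is to match
# TARGET -- the imposed adaptive landscape the text describes.

TARGET = "METHINKS IT IS LIKE A WEASEL"
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "

def fitness(candidate):
    """The governor's externally defined objective function."""
    return sum(a == b for a, b in zip(candidate, TARGET))

def mutate(candidate, rate=0.02):
    """Each program varies slightly from the others in minute details."""
    return "".join(random.choice(ALPHABET) if random.random() < rate else c
                   for c in candidate)

random.seed(0)
population = ["".join(random.choice(ALPHABET) for _ in TARGET)
              for _ in range(200)]

for generation in range(1000):
    population.sort(key=fitness, reverse=True)
    if population[0] == TARGET:
        break
    # Reap the losers; replicate mutated copies of the winners.
    survivors = population[:50]
    population = [mutate(random.choice(survivors)) for _ in range(200)]

print(generation, population[0])
```

Note that every step here depends on the governor: it alone copies, mutates, scores, and culls, which is exactly the god-like role the paragraph above objects to.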

A solution required that the governor process go away, that the individual programs replicate themselves and that even that replication process be subject to variation and selection. Moreover, the selection process had to be very broadly defined based on harvesting resources in order to replicate, not based on an externally defined objective function. Under those circumstances, the range of replicating machines—automata—could be as vast as the types of flora and fauna on Earth itself.

As I trudged up the arroyo, I tried to imagine the number of insects, bacteria, spores, plants and vines in even this relatively sparse desert. A cricket began singing in a nearby mesquite bush, joining the chorus of other crickets in the late summer evening. The light of the moon was beginning to glow behind a mountain ridge. Darkness was coming fast and I could hear coyotes start calling further up the wash towards St. John’s College.

As I returned home, I felt as though I were the only person walking in the desert that night, isolated in the dark spaces that separated the haphazard Santa Fe roads, yet I also was warmed by the idea that there was a solution to the problem of purpose embedded deeply in our biology, one that could be recreated in a laboratory of sorts, given a vastly complex computing system larger than the internet itself. That connection to a deep truth seemed satisfying in a way that the weird language of religion had never felt. We could know and understand our own nature through reason, through experiments and through simulation, and even perhaps create a completely new form of intelligence that had its own kind of soul derived from surviving generations upon generations of replications.

But did we, like gods, have the capacity to apprehend this? I recalled my Hamlet: The paragon of animals, indeed. A broad interpretation of the Biblical Fall as a desire to be like God lent a metaphorical flavor to this nascent notion. Were we reaching out to try to become like a creator god of sorts through the development of intelligent technologies and biological manipulation? If we did create a self-aware machine that seemed fully human-like, it would certainly support the idea that we were creators of new souls.

I was excited about this line of thinking as I slipped into the living room where Mom and Harry were watching a crime drama on TV. Harry would not understand this, I realized, and would lash out at me for being terrifically weird if I tried to discuss it with him. The distance between us had widened to the point that I would avoid talking directly to him. It felt a bit like the sense of loss after Dad died, though without the sense of finality that death brought with it. Harry and I could recover, I thought, reconnecting later on in life and reconciling our divergent views.

A commercial came and I stared at the back of his head like I had done so often, trying to burrow into his skull with my mind. “Harry, Harry!” I called in my thoughts. He suddenly turned around with his eyes bulging and a crooked smile erupting across his face.

“What?” he asked.

It still worked.

On the Non-Simulation of Human Intelligence

There is a curious dilemma that pervades much machine learning research. The solutions that we are trying to devise are supposed to minimize behavioral error by formulating the best possible model (or collection of competing models). This is also the assumption of evolutionary optimization, whether natural or artificial: optimality is the key to efficiently outcompeting alternative structures, alternative alleles, and alternative conceptual models. The dilemma is whether such optimality is applicable to the notoriously error-prone, conceptually flexible, and inefficient reasoning of people. In other words, is machine learning at all like human learning?

I came across a paper called “Multi-Armed Bandit Bayesian Decision Making” while trying to understand what Ted Dunning is planning to talk about at the Big Data Science Meetup at SGI in Fremont, CA a week from Saturday (I’ll be talking, as well) that has a remarkable admission concerning this point:

Human behaviour is after all heavily influenced by emotions, values, culture and genetics; as agents operating in a decentralised system humans are notoriously bad at coordination. It is this fact that motivates us to develop systems that do coordinate well and that operate outside the realms of emotional biasing. We use Bayesian Probability Theory to build these systems specifically because we regard it as common sense expressed mathematically, or rather “the right thing to do”.

The authors go on to suggest that such systems should therefore be seen as corrective assistants for the limitations of human cognitive processes! Machines can put the rational back into reasoned decision-making. But that is really not what machine learning is used for today. Instead, machine learning is used where human decision-making processes are unavailable due to the physical limitations of including humans “in the loop,” or the scale of the data involved, or the tediousness of the tasks at hand.
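The Bayesian machinery behind the multi-armed bandit setting can be sketched with Thompson sampling over Bernoulli arms. This is a standard construction for the problem, not necessarily the exact algorithm in the paper discussed above, and the payout rates below are invented for illustration:

```python
import random

# Thompson sampling: keep a Beta posterior over each arm's unknown
# payout rate, sample a plausible rate from each posterior, and pull
# the arm whose sample is highest -- "common sense expressed
# mathematically."

true_rates = [0.2, 0.5, 0.7]   # hidden payout rate of each arm
successes = [1, 1, 1]          # Beta(1, 1) uniform priors
failures = [1, 1, 1]

random.seed(1)
pulls = [0, 0, 0]
for _ in range(2000):
    samples = [random.betavariate(s, f)
               for s, f in zip(successes, failures)]
    arm = samples.index(max(samples))
    pulls[arm] += 1
    if random.random() < true_rates[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1

# Play concentrates on the best arm as its posterior sharpens.
print(pulls)
```

The appeal is exactly the one the quoted passage makes: the posterior update is mechanical and unemotional, balancing exploration and exploitation without any hand-tuned rules.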

For example, automatic part-of-speech tagging could be done by row after row of professional linguists who mark up the text with the correct parts of speech. Where great ambiguity occasionally arises, they would have meetings to reach agreement on the correct assignment of the tag. This kind of thing is still done. I worked with a company that creates conceptual models of the biological results expressed in research papers. The models are created by PhD biologists who are trained in the conceptual ontology they have developed over the years through a process of arguing and consensus development. Yahoo! originally used teams of ontologists to classify web pages. Automatic machine translation is still unacceptable for most professional translation tasks, though it can be useful for gisting.

So the argument that the goal of these systems is to overcome the cognitive limitations of people is mostly incorrect, I think. Instead, the real reason why we explore topics like Bayesian probability theory for machine learning is that the mathematics gives us traction against the problems. For instance, we could try to study the way experts make decisions about parts of speech and create a rules system that contained every little rule. This would be an “expert system,” but even the creation of such a system requires careful assessment of massive amounts of detail. That scalability barrier rises again, and emotional biases are not much at play except where they result in boredom and ennui due to sheer tedium.
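The "traction" point can be made concrete with the part-of-speech example: instead of eliciting every little rule from linguists, a statistical tagger simply estimates, from annotated text, which tag each word most often receives. A toy sketch, with a tiny invented corpus (real taggers condition on context as well):

```python
from collections import Counter, defaultdict

# Learn a most-frequent-tag baseline from annotated examples rather
# than hand-written expert rules. The corpus below is invented purely
# for illustration.

tagged = [
    ("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
    ("the", "DET"), ("run", "NOUN"), ("was", "VERB"),
    ("dogs", "NOUN"), ("run", "VERB"), ("run", "VERB"),
]

counts = defaultdict(Counter)
for word, pos in tagged:
    counts[word][pos] += 1

def tag(word):
    """Pick the tag most often seen with this word; default to NOUN."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "NOUN"

print([tag(w) for w in ["the", "run", "cat"]])  # ['DET', 'VERB', 'NOUN']
```

The counts scale with data rather than with meetings of experts, which is exactly where the scalability barrier for the hand-built rules system falls away.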

Eusociality, Errors, and Behavioral Plasticity

I encountered an error in E.O. Wilson’s The Social Conquest of Earth where Wilson intended to assert an alternative to “kin selection” but instead repeated “multilevel selection,” which is precisely what he wanted to draw a distinction with. I am sympathetic, however, if for no other reason than I keep finding errors and issues with my own books and papers.

The critical technical discussion from Nature concerning the topic is available here. As technical discussion, the issues debated are fraught with details like how halictid bees appear to live socially, but are in fact solitary animals that co-exist in tunnel arrangements.

Despite the focus on “spring-loaded traits” as determiners for haplodiploid animals like bees and wasps, the problem of big-brained behavioral plasticity keeps coming up in Wilson’s book. Humanity is a pinnacle because of taming fire, because of the relative levels of energy available in animal flesh versus plant matter, and because of our ability to outrun prey over long distances (yes, our identity emerges from marathon running). But these are solutions that correlate with the rapid growth of our craniums.

So if behavioral plasticity is so very central to who we are, we are faced with an awfully complex problem in trying to simulate that behavior. We can expect that there must be phalanxes of genes involved in setting our developmental path (our nature and the substrate for our nurture). We should, indeed, expect that almost no cognitive capacity is governed by a small set of genes, and that all the relevant genes work in networks through polygeny, epistasis, and related effects (pleiotropy). And we can expect no easy answers as a result, except to assert that AI is exactly as hard as we should have expected, and progress will be inevitably slow in understanding the mind, the brain, and the way we interact.