Category: Cognitive

Bayesianism and Properly Basic Belief

Xu and Tenenbaum, in Word Learning as Bayesian Inference (Psychological Review, 2007), develop a very simple Bayesian model of how children (and even adults) build semantic associations based on accumulated evidence. In short, they find that contrastive-elimination approaches as well as connectionist methods are unable to explain the patterns that are observed. The most salient problem with these other methods is that they lack the rapid transition seen when three exemplars of a class of objects are presented for a word versus just one exemplar. Adults and kids (the former even more so) simply converge on word meanings faster than those other models can easily show. Moreover, a space of contending hypotheses weighted by their Bayesian posteriors provides an escape from the all-or-nothing character of hypothesis elimination while retaining some of the “soft” commitment properties that connectionist models offer.

The mathematical trick behind the rapid transition is rather interesting. They formulate a “size principle” that weights the likelihood of a given hypothesis (this object is most similar to a “feb,” for instance, rather than to the many other object sets that are available) by a factor that shrinks with the size of the hypothesis’s extension and compounds exponentially with the number of exposures. Hence the rapid transition:

Hypotheses with smaller extensions assign greater probability than do larger hypotheses to the same data, and they assign exponentially greater probability as the number of consistent examples increases.
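
Here is a minimal sketch of that size principle in action, with an invented toy hypothesis space (the extensions, priors, and object labels below are illustrative, not taken from the paper): three consistent exemplars are enough to collapse the posterior onto the narrowest consistent hypothesis.

def posterior(hypotheses, examples):
    # hypotheses: name -> (extension set, prior); returns the normalized posterior
    scores = {}
    for name, (extension, prior) in hypotheses.items():
        if all(x in extension for x in examples):
            # size principle: n consistent examples get likelihood ~ (1 / |extension|)^n
            scores[name] = prior * (1.0 / len(extension)) ** len(examples)
        else:
            scores[name] = 0.0
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

# "feb" might pick out dalmatians (narrow), dogs (medium), or animals (broad)
hyps = {
    "dalmatians": ({"dal1", "dal2", "dal3"}, 1 / 3),
    "dogs": ({"dal1", "dal2", "dal3", "lab1", "pug1", "pug2"}, 1 / 3),
    "animals": ({"dal1", "dal2", "dal3", "lab1", "pug1", "pug2",
                 "cat1", "cat2", "pig1", "cow1", "cow2", "cow3"}, 1 / 3),
}

print(posterior(hyps, ["dal1"]))                  # one exemplar: broad hypotheses remain plausible
print(posterior(hyps, ["dal1", "dal2", "dal3"]))  # three exemplars: sharp shift to the narrowest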

It should be noted that they don’t claim that the psychological or brain machinery implements exactly this algorithm. As is usual in these matters, it is instead likely that whatever machinery is involved, it simply has at least these properties. It may very well be that connectionist architectures can do the same but that existing approaches to connectionism simply don’t do it quite the right way. So other methods may need to be tweaked to get closer to the observed learning of people in these word tasks.

So what can this tell us about epistemology and belief? Classical foundationalism might be formulated as follows: a belief is “basic” or “justified” if it is self-evident or evident to our senses, and other beliefs may then be grounded in those basic beliefs. A more modern reformulation might substitute “incorrigible” for “justified,” where incorrigibility carries the further requirement that if the proposition is believed, it is in fact true.

Here’s Alvin Plantinga laying out a case for why justification and incorrigibility have a range of problems, problems serious enough for Plantinga that he suspects god belief could just as easily be a basic belief, allowing for the kind of presuppositional Natural Theology (think: I look around me and the hand of God is obvious) that is at the heart of some of the loftier claims concerning the viability, or at least non-irrationality, of god belief. It even provides a kind of coherent interpretative framework for reading history.

Plantinga then positions the problem of properly basic belief as an inductive one:

And hence the proper way to arrive at such a criterion is, broadly speaking, inductive. We must assemble examples of beliefs and conditions such that the former are obviously properly basic in the latter, and examples of beliefs and conditions such that the former are obviously not properly basic in the latter. We must then frame hypotheses as to the necessary and sufficient conditions of proper basicality and test these hypotheses by reference to those examples. Under the right conditions, for example, it is clearly rational to believe that you see a human person before you: a being who has thoughts and feelings, who knows and believes things, who makes decisions and acts. It is clear, furthermore, that you are under no obligation to reason to this belief from others you hold; under those conditions that belief is properly basic for you.

He goes on to conclude that this opens up the god hypothesis as providing this kind of coherence mechanism:

By way of conclusion then: being self-evident, or incorrigible, or evident to the senses is not a necessary condition of proper basicality. Furthermore, one who holds that belief in God is properly basic is not thereby committed to the idea that belief in God is groundless or gratuitous or without justifying circumstances. And even if he lacks a general criterion of proper basicality, he is not obliged to suppose that just any or nearly any belief—belief in the Great Pumpkin, for example—is properly basic. Like everyone should, he begins with examples; and he may take belief in the Great Pumpkin as a paradigm of irrational basic belief.

So let’s assume that the word learning mechanism based on this Bayesian scaling is representative of our human inductive capacities. This may or may not be broadly true; it is possible that it holds for words but not for other domains of perceptual phenomena. Nevertheless, given this scaling property, the relative inductive truth of a given proposition (a meaning hypothesis) is strictly Bayesian. Moreover, this doesn’t succumb to the problems of verificationism because it only claims relative truth. What is properly basic, or basic at all, is then a matter of how the contending explanatory hypotheses are scaled, and the god hypothesis has to compete with other explanations: evolutionary theory (for human origins), the empirical evidence for materialism (against supernatural explanations), perceptual mistakes (ditto), myth scholarship, textual analysis, the influence of parental belief exposure, the psychology of wish fulfillment, the pragmatic triumph of science, and so on.

And so we can stick to a relative scaling of hypotheses as to what constitutes basicality or justified true belief. That’s fine. We can continue to argue the previous points as to whether they support or override one hypothesis or another. But the question Plantinga raises about what ethics to apply in making those decisions is important. He distinguishes different reasons why one might want to believe more true things than false ones (broadly), or to take some things as properly basic rather than others, or, more pointedly, why philosophers feel the need to pin god belief as irrational. But we succumb to a kind of unsatisfying relativism insofar as the space of these hypotheses is not, in fact, weighted in a manner that best reflects the known facts. The relativism gets deeper when the weighting is washed out by wish fulfillment, pragmatism, aspirations, and personal insights that lack falsifiability. That is at least distasteful, maybe aretaically so (in Plantinga’s framework), but probably more teleologically so in that it influences other decision-making and the conflicts and real harms societies may cause.

A Soliloquy for Volcanoes and Nearest Neighbors

Tongariro National Park: Emerald Lake

A German kid caught me talking to myself yesterday. It was my fault, really. I was trying to break a hypnotic, trance-like repetition of exactly what I was going to say to the tramper’s hut warden about two hours away. OK, more specifically, I had left the Waihohonu campsite in Tongariro National Park at 7:30AM and was planning to walk out that day. To put this into perspective, it’s 28.8 km (17.9 miles) with elevation changes of around 900m, including a ridiculous final assault above Red Crater at something like 60 degrees along a stinking volcanic ridge line. And, to make things extra lovely, there was hail, then snow, then torrential downpours punctuated by hail again—a lovely tramp in the New Zealand summer—all in a full pack.

But anyway, enough bragging about my questionable judgement. I was driven by thoughts of a hot shower and the duck à l’orange at Chateau Tongariro while my hands numbed to unfeeling as I arrested myself with trekking poles down through muddy canyons. I was talking to myself. I was trying to stop repeating to myself why I didn’t want the campsite I had reserved for the night. This is the opposite of glorious runner’s high. This is when all the extra blood from one’s brain is obsessed with either making leg muscles go or watching how the feet will fall. I also had the hood of my rain fly up over my little Marmot ball cap. I was in full regalia, too, with the shifting rub of my Gore-Tex rain pants a constant presence throughout the day. I didn’t notice him easing up on me as I carried on about one-shot learning as some kind of trance-breaking ritual.

We exchanged pleasantries and he meandered on. With his tiny little day pack it was clear he had just come up from the car park at Mangatepopo for a little jaunt. Eurowimp. I caught up with him later slathering some kind of meat product on white bread trailside and pushed by, waiting on my own lunch of jerky, chili-tuna, crackers, and glorious spring water, gulp after gulp, an hour onward. He didn’t bring up the glossolalic soliloquy incident.

My mantra was simple: artificial neural networks, including deep learning approaches, require massive numbers of learning cycles and huge numbers of exemplars to learn. In the classic MNIST test, tens of thousands of handwritten digit images (0 through 9) are categorized as to which number they depict. Deep learning systems have reached over 99% accuracy on that problem, actually besting average human performance. Yet they require a huge training corpus to pull this off, combined with many CPU hours to optimize the models on that corpus. We humans do much better than that with our neural systems.

So we get this recently lauded effort, One-Shot Learning of Visual Concepts, which uses an extremely complicated Bayesian mixture modeling approach that combines stroke exemplars to classify foreign and never-before-seen characters (like Bengali or Ethiopic) after only one exposure to the stimulus. In other words, if I show you some weird character with some curves and arcs and a vertical bar in it, you can find similar ones in a test set quite handily, but machines really can’t. A deep learning model could be trained on every possible example known in a long, laborious process, but when exposed to a new script like Amharic or a Cherokee syllabary, the generalizations break down. A simple comparison approach is to use a nearest-neighbor match or vote. That is, simply create vectors of the image pixels starting at the top left and compare the distance between the new image vector and each example using an inner product. Similar things look the same and have similar pixel patterns, right? Well, except they are rotated. They are shifted. They are enlarged and shrunken.
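
To make that brittleness concrete, here is a minimal sketch of the naive pixel-vector nearest-neighbor baseline (not the paper's model): flatten each bitmap, compare by normalized inner product, and watch the similarity collapse under a small shift.

# Naive nearest-neighbor baseline for character matching: flatten the pixel
# grid into a vector (scanning from the top left) and compare by normalized
# inner product. This is the brittle baseline discussed above, not the
# Bayesian stroke model from the paper.
import numpy as np

def pixel_vector(image):
    # image: 2D array of grayscale pixels -> unit-length vector
    v = image.astype(float).ravel()
    return v / (np.linalg.norm(v) + 1e-12)

def nearest_neighbor(query, gallery):
    # gallery: list of (label, image); returns the label with the best alignment
    q = pixel_vector(query)
    scores = [(float(q @ pixel_vector(img)), label) for label, img in gallery]
    return max(scores)[1]

# A two-pixel shift is enough to defeat this kind of matcher:
glyph = np.zeros((8, 8)); glyph[2:6, 3] = 1.0     # a short vertical bar
shifted = np.roll(glyph, 2, axis=1)               # the same bar, two pixels right
print(float(pixel_vector(glyph) @ pixel_vector(shifted)))   # 0.0: no pixel overlap at all
print(nearest_neighbor(shifted, [("bar", glyph), ("flat", np.full((8, 8), 0.1))]))
# the shifted bar matches a featureless gray patch better than its own template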

And then it hit me that the crazy-complex stroke model could be simplified quite radically by building a similar collection of stroke primitives as splines and then looking at the K nearest neighbors in the stroke space. So a T is two strokes drawn from the primitives collection with a central junction and the horizontal lying atop the vertical. This builds on the stroke-based intuition of the paper’s authors (basically, all written scripts have strokes as a central feature, and we as writers and readers understand the line-ness of them from experience with our own scripts).
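
A rough sketch of what I have in mind, with invented stroke features (orientation, relative length, rough center position) standing in for real spline primitives, and with only a greedy stroke matching:

# Rough sketch of the stroke-space idea: describe each character as a small
# set of stroke primitives and classify a new character by k-nearest neighbors
# under a set-matching distance. Features and examples are illustrative
# placeholders, not a worked-out spline representation.
from math import hypot

def stroke_distance(a, b):
    # a, b: (orientation_deg, length, cx, cy) tuples for single strokes
    return abs(a[0] - b[0]) / 90.0 + abs(a[1] - b[1]) + hypot(a[2] - b[2], a[3] - b[3])

def char_distance(strokes_a, strokes_b):
    # greedy one-sided matching of strokes; penalize unmatched strokes
    remaining = list(strokes_b)
    total = 0.0
    for s in strokes_a:
        if not remaining:
            total += 1.0
            continue
        best = min(remaining, key=lambda t: stroke_distance(s, t))
        total += stroke_distance(s, best)
        remaining.remove(best)
    return total + len(remaining)

def knn_label(query, gallery, k=1):
    # gallery: list of (label, strokes); vote among the k closest characters
    ranked = sorted(gallery, key=lambda item: char_distance(query, item[1]))[:k]
    labels = [label for label, _ in ranked]
    return max(set(labels), key=labels.count)

# A "T": a horizontal stroke atop a vertical stroke meeting near the center.
T = [(0, 1.0, 0.5, 0.1), (90, 1.0, 0.5, 0.5)]
L_shape = [(90, 1.0, 0.1, 0.5), (0, 0.7, 0.4, 0.9)]
print(knn_label([(0, 0.95, 0.5, 0.12), (90, 1.0, 0.5, 0.55)], [("T", T), ("L", L_shape)]))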

I may have to try this out. I should note, also in critique of this antithesis of runner’s high (tramping doldrums?), that I was also deeply concerned that there were so many damn contending voices and thoughts racing around my head in the face of such incredible scenery. Why did I feel the need to distract my mind from its obsessions over something so humanly trivial? At least, I suppose, the distraction was interesting enough that it was worth the effort.

Lucifer on the Beach

I picked up a whitebait pizza while stopped along the West Coast of New Zealand tonight. Whitebait are tiny little swarming immature fish that can be scooped out of estuarial river flows using big-mouthed nets. They run, they dart, and it is illegal to change river exit points to try to channel them for capture. Hence, whitebait is semi-precious, commanding NZD70-130/kg, which explains why there was a size limit on my pizza: only the small one was available.

By the time I was finished, the sky had aged from cinereal to iron in a satire of the vivid, watch-me colors of CNN International flashing Donald Trump’s linguistic indirection across the television. I crept out, setting my headlamp to the red LEDs designed to minimally interfere with night vision. Just up away from the coast, hidden in the impossible tangle of cold rainforest, there was a glow worm dell. A few tourists conjured with flashlights facing the ground to avoid upsetting the tiny Arachnocampa luminosa that clung to the walls inside the dark garden. They were like faint stars composed into irrelevant constellations, with only the human mind to blame for any observed patterns.

And the light, what light: like the white-light LEDs only recently invented, but a light that doesn’t flicker or change, steady under the calmest observation. Driven by luciferin and luciferase, these tiny creatures lure a few scant light-seeking insects to their doom as food, until they emerge to mate, briefly, lay eggs, and then die.

Lucifer again: the name comes properly from the Latin for light-bringer, and the chemical basis for bioluminescence was largely isolated in the middle of the 20th Century. Yet there is this biblical stigma hanging over the term, one that really makes no sense at all. The translation of “morning star” or some other such nonsense into Latin got corrupted into a proper name by a process of word conversion (this isn’t metonymy or something like that; I’m not sure there is a word for it other than “mistake”). So much for some kind of divine literalism-tracking mechanism that preserves perfection. Even Jesus got rendered as lucifer in some passages.

But nothing new, here. Demon comes from the Greek daemon and Christianity tried to, well, demonize all the ancient spirits during the monolatry to monotheism transition. The spirits of the air that were in a constant flux for the Hellenists, then the Romans, needed to be suppressed and given an oppositional position to the Christian soteriology. Even “Satan” may have been borrowed from Persian court drama as a kind of spy or informant after the exile.

Oddly, we are left with a kind of naming magic for the truly devout who might look at those indifferent little glow worms with some kind of castigating eye, corrupted by a semantic chain that is as kinked as the popular culture epithets of Lucifer himself.

The IQ of Machines

Perhaps idiosyncratic to some is my focus in the previous post on the theoretical background to machine learning that derives predominantly from algorithmic information theory and, in particular, Solomonoff’s theory of induction. I do note that there are other theories that can be brought to bear, including Vapnik’s Structural Risk Minimization and Valiant’s PAC-learning theory. Moreover, perceptrons and vector quantization methods and so forth derive from completely separate principles that can then be cast into more fundamental problems in information geometry and physics.

Artificial General Intelligence (AGI) is then perhaps the hard problem on the horizon, one for which I claim there has not been significant progress in the past twenty years or so. That is not to say that I am not an enthusiastic student of the topic and field, just that I don’t see risk levels from intelligent AIs rising to what we should consider a real threat. This topic of how to grade threats deserves deeper treatment, of course, and is at the heart of everything from so-called “nanny state” interventions in food and product safety to how to construct policy around global warming. Luckily, and unlike both those topics, killer AIs don’t threaten us at all quite yet.

But what about simply characterizing what AGIs might look like and how we can even tell when they arise? Mildly interesting is Shane Legg and Joel Veness’ idea of an Artificial Intelligence Quotient, or AIQ, that they expand on in An Approximation of the Universal Intelligence Measure. This measure is derived from, voilà, exactly the kind of algorithmic information theory (AIT) and compression arguments that I led with in the slide deck. Is this the only theory around for AGI? Pretty much, but different perspectives tend to lead to slightly different focuses. For instance, there is little need to discuss AIT when dealing with Deep Learning Neural Networks. We just instead discuss statistical regularization and bottlenecking, which can be thought of as proxies for model compression.

So how can intelligent machines be characterized by something like AIQ? Well, the conclusion won’t be surprising. Intelligent machines are those machines that function well in terms of achieving goals over a highly varied collection of environments. This allows for tractable mathematical treatments insofar as the complexity of the landscapes can be characterized, but doesn’t really give us a good handle on what the particular machines might look like. They can still be neural networks or support vector machines, or maybe even something simpler, and through some selection and optimization process have the best performance over a complex topology of reward-driven goal states.
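
That intuition is what the underlying Legg-Hutter universal intelligence measure formalizes. Schematically (in LaTeX, with K the Kolmogorov complexity of an environment mu and V the expected reward a policy pi earns in it):

% Universal intelligence of an agent (policy) \pi: environments are weighted
% by their Kolmogorov complexity K(\mu), so simple, compressible environments
% dominate the sum.
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}

Simpler, more compressible environments dominate the sum, which is exactly the compression flavor mentioned above.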

So still no reason to panic, but some interesting ideas that shed greater light on the still mysterious idea of intelligence and the human condition.

Trees of Lives

With a brief respite between vacationing in the canyons of Colorado and leaving tomorrow for Australia, I’ve open-sourced an eight-year-old computer program for converting one’s DNA sequences into an artistic rendering. The inputs to the program are the allelic patterns from standard DNA analysis services that use the Short Tandem Repeat polymorphisms familiar from forensic analysis, as well as poetry reflecting one’s ethnic heritage. The output is generative art: a tree that overlays the sequences with the poetry and a background rendered from the sequences.

Generative art is perhaps one of the greatest aesthetic achievements of the late 20th Century. Generative art is, fundamentally, a recognition that the core of our humanity can be understood and converted into meaningful aesthetic products–it is the parallel of effective procedures in cognitive science, and developed in lock-step with the constructive efforts to reproduce and simulate human cognition.

To use Tree of Lives, install Java 1.8, unzip the package, and edit the supplied markconfig.txt to enter your STRs and the allele variant numbers in sequence on line 15 of the configuration file. Lines 16+ are for lines of poetry that will be rendered on the limbs of the tree. Other configuration parameters can be discerned by examining the file, and involve colors, paths, etc. Execute the program with:

java -cp treeoflives.jar:iText-4.2.0-com.itextpdf.jar com.treeoflives.CAlleleRenderer markconfig.txt

Parsimonious Portmanteaus

Meaning is a problem. We think we might know what something means but we keep being surprised by the facts, research, and logical difficulties that surround the notion of meaning. Putnam’s Representation and Reality runs through a few different ways of thinking about meaning, though without reaching any definitive conclusions beyond what meaning can’t be.

Children are a useful touchstone concerning meaning because we know that they acquire linguistic skills and consequently at least an operational understanding of meaning. And how they do so is rather interesting: first, presume that whole objects are the first topics for naming; next, assume that syntactic differences lead to semantic differences (“the dog” refers to the class of dogs while “Fido” refers to the instance); finally, prefer that linguistic differences point to semantic differences. Paul Bloom slices and dices the research in his Précis of How Children Learn the Meanings of Words, calling into question many core assumptions about the learning of words and meaning.

These preferences become useful if we want to try to formulate an algorithm that assigns meaning to objects or groups of objects. Probabilistic Latent Semantic Analysis, for example, assumes that words are signals from underlying probabilistic topic models and then derives those models by estimating all of the probabilities from the available signals. The outcome lacks labels, however: the “meaning” is expressed purely in terms of co-occurrences of terms. Reconciling an approach like PLSA with the observations about children’s meaning acquisition presents some difficulties. The process seems too slow, for example, which was always a complaint about connectionist architectures of artificial neural networks as well. As Bloom points out, kids don’t make many errors concerning meaning and when they do, they rapidly compensate.
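
For reference, the PLSA model mentioned above treats each document-word co-occurrence as a mixture over latent topics z, fit by EM. Schematically, in LaTeX:

% PLSA: each word w in a document d is generated via a latent topic z.
P(w \mid d) = \sum_{z} P(w \mid z) \, P(z \mid d)

% EM alternates a posterior step with re-estimation of the two distributions,
% where n(d, w) counts occurrences of w in d:
P(z \mid d, w) \propto P(w \mid z) \, P(z \mid d)
P(w \mid z) \propto \sum_{d} n(d, w) \, P(z \mid d, w), \qquad
P(z \mid d) \propto \sum_{w} n(d, w) \, P(z \mid d, w)

The “topics” are just these fitted distributions over co-occurring terms, which is why the outcome lacks labels.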

I’ve previously proposed a model for lexical acquisition that uses a coding hierarchy based on co-occurrence or other features. As new terms are observed, the hierarchy builds, in an unsupervised manner, by making local swaps and consolidations based on minimum description length principles. Thus, it bears a close relationship to Nevill-Manning’s SEQUITUR approach to sequence learning. There is a limitation to the approach: in a tree-like grammar, the complexity of examining all possible rearrangements of the grammar when new symbols arrive would put a massive burden on any cognitive correlates that we might claim exist. Thus the system uses only local swaps and consolidations.
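
As a toy stand-in for that kind of decision (not the actual implementation from the paper), a local consolidation can be accepted only when it shrinks a crude description-length proxy, in the spirit of SEQUITUR's digram replacement:

# Illustrative MDL-style test for a local consolidation: introduce a new
# nonterminal for the most frequent adjacent pair only if the rewritten
# sequence plus the grammar gets shorter under a crude symbol-count proxy.
from collections import Counter

def description_length(sequence, grammar):
    # crude proxy: symbols in the sequence plus rule header and body symbols
    return len(sequence) + sum(1 + len(rhs) for rhs in grammar.values())

def try_consolidate(sequence, grammar, next_symbol="N1"):
    pairs = Counter(zip(sequence, sequence[1:]))
    (a, b), count = pairs.most_common(1)[0]
    if count < 2:
        return sequence, grammar
    rewritten, i = [], 0
    while i < len(sequence):
        if i + 1 < len(sequence) and (sequence[i], sequence[i + 1]) == (a, b):
            rewritten.append(next_symbol); i += 2
        else:
            rewritten.append(sequence[i]); i += 1
    candidate = dict(grammar); candidate[next_symbol] = [a, b]
    if description_length(rewritten, candidate) < description_length(sequence, grammar):
        return rewritten, candidate          # consolidation pays for itself
    return sequence, grammar                 # otherwise keep things as they were

seq = ["the", "bank", "teller", "the", "bank", "loan",
       "the", "bank", "teller", "the", "bank", "loan"]
print(try_consolidate(seq, {}))  # ("the", "bank") is folded into a new nonterminal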

It’s worth considering how such an approach might solve the cluster labeling problem. If we cluster things together based on the parsimonious coding approach, the objects and their grammatical coordinations move higher up the tree. What is missing is a preference for adding new, distinctive terms that differentiate one grouping from another. For instance, in the toy sample given in my paper, “Financial Institution” or “Retail Bank” is not applied to the appropriate bank cluster, nor is “River Bank” applied to the other bank cluster. Instead we are just left with the shared context terms. I think this might be correctable in a larger grouping, however, by allowing a distinguishing series of portmanteaus to be constructed by composition from nearby (in the semantic region) concepts. So, as the co-occurrences of bank and teller and ATM and loan pile up and get coded into groupings, the nearby finance, bank, retail bank, investment bank grouping is used to create a common portmanteau out of the most distinctive terms in the set, chosen so that they most distinguish it from the river semantic set.
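
A toy version of that labeling step might look like the following, with smoothed log-odds standing in for whatever distinctiveness score the full system would use, and with the word lists invented for illustration:

# Toy version of the portmanteau-labeling idea: score terms by how strongly
# they distinguish one cluster's contexts from a nearby cluster's, then
# compose the top few into a compound label.
from collections import Counter
from math import log

def distinctive_label(cluster, neighbor, top_n=2):
    c, n = Counter(cluster), Counter(neighbor)
    vocab = set(c) | set(n)
    def score(term):
        # smoothed log-odds of the term under the cluster versus the neighbor
        p_c = (c[term] + 1) / (sum(c.values()) + len(vocab))
        p_n = (n[term] + 1) / (sum(n.values()) + len(vocab))
        return log(p_c / p_n)
    best = sorted(vocab, key=score, reverse=True)[:top_n]
    return "-".join(best)

finance_bank = ["bank", "teller", "ATM", "loan", "deposit", "bank", "loan"]
river_bank = ["bank", "river", "mud", "shore", "bank", "water"]
print(distinctive_label(finance_bank, river_bank))  # e.g. a "loan-teller" style compound
print(distinctive_label(river_bank, finance_bank))  # e.g. a "river-mud" style compound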

In Like Flynn

The exceptionally interesting James Flynn explains the cognitive history of the past century and what it means in terms of human intelligence in this TED talk:

What does the future hold? While we might decry the “twitch” generation and their inundation by social media, gaming stimulation, and instant interpersonal engagement, the slowing observed in the Flynn Effect might be getting ready for another ramp-up over the next 100 years.

Perhaps most intriguing is the discussion of the ability to think in terms of hypotheticals as a core component of ethical reasoning. Ethics is about gaming outcomes and also about empathizing with others. The influence of media as a delivery mechanism for narratives about others emerged just as those changes in cognitive capabilities were beginning to mature in the 20th Century. Widespread media had a compounding effect on the core abstract thinking capacity, and with the expansion of smartphones and informational flow, we may only have a few generations to go before the necessary ingredients for good ethical reasoning are widespread even in hard-to-reach areas of the world.

Contingency and Irreducibility

Thomas Nagel returns to defend his doubt concerning the completeness—if not the efficacy—of materialism in the explanation of mental phenomena in the New York Times. He quickly lays out the possibilities:

  1. Consciousness is an easy product of neurophysiological processes
  2. Consciousness is an illusion
  3. Consciousness is a fluke side-effect of other processes
  4. Consciousness is a divine property supervened on the physical world

Nagel arrives at a conclusion that all four are incorrect and that a naturalistic explanation is possible that isn’t “merely” (1), but that is at least (1), yet something more. I previously commented on the argument, here, but the refinement of the specifications requires a more targeted response.

Let’s call Nagel’s new perspective Theory 1+ for simplicity. What form might 1+ take on? For Nagel, the notion seems to be a combination of Chalmers-style qualia combined with a deep appreciation for the contingencies that factor into the personal evolution of individual consciousness. The latter is certainly redundant in that individuality must be absolutely tied to personal experiences and narratives.

We might be able to get some traction on this concept by looking to biological evolution, though “ontogeny recapitulates phylogeny” is about as close as we can get to the topic, because any kind of evolutionary psychology must be looking for patterns that reinforce the interpretation of basic aspects of cognitive evolution (sex, reproduction, etc.) rather than exploring the more numinous aspects of conscious development. So we might instead look for parallel theories that focus on the uniqueness of outcomes, that reify the temporal evolution without reference to controlling biology, and we get to ideas like uncomputability as a backstop. More specifically, we can explore ideas like computational irreducibility to support the development of Nagel’s new theory: insofar as the environment lapses towards weak predictability, a consciousness that self-observes, regulates, and builds many complex models and metamodels is superior to one that does not.

I think we already knew that, though. Perhaps Nagel has been too much a philosopher and too little involved in the sciences that surround and animate modern theories of learning and adaptation to see the movement towards the exits?



Singularity and its Discontents

If a machine-based process can outperform a human being, is it significant? That weighty question hung in the background as I reviewed Jürgen Schmidhuber’s work on traffic sign classification. Similar results have emerged from IBM’s Watson competition and even on the TOEFL test. In each case, machines beat people.

But is that fact significant? There are a couple of ways we can look at these kinds of comparisons. First, we can draw analogies to other capabilities that were once inaccessible to mechanical aid and note that when machines came to outperform humans at them, the fact was not overly profound. The wheel quickly outperformed human legs for moving heavy objects. The cup outperformed the hands for drinking water. This then invites the realization that the extension of these physical comparisons leads to extraordinary juxtapositions: the airliner really outperformed human legs for transport, and so on. And this, in turn, justifies the claim that since we are now just beginning to outperform human mental processes, we can only expect exponential improvements moving forward.

But this may be a category mistake in more than the obvious differentiator of the mental and the physical. Instead, the category mismatch is between levels of complexity. The number of parts in a Boeing 747 is 6 million versus one moving human as the baseline (we could enumerate the cells and organelles, etc., but then we would need to enumerate the crystal lattices of the aircraft steel, so that level of granularity is a wash). The number of memory addresses in a big server computer is 64 x 10^9 or higher, with disk storage in the TBs (10^12). Meanwhile, the human brain has 100 x 10^9 neurons and 10^14 connections. So, with just 2 orders of magnitude between computers and brains versus 6 between humans and planes, we find ourselves approaching Kurzweil’s argument that we have to wait until 2040. I’m more pessimistic and figure 2080, but then no one expects the Inquisition, either, to quote the esteemed philosophers, Monty Python.
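
For what it's worth, here is the arithmetic behind those order-of-magnitude figures, using only the numbers quoted above:

# Orders of magnitude separating the artifact from its baseline,
# using the figures quoted in the paragraph above.
from math import log10

plane_parts, human_baseline = 6e6, 1            # 747 parts vs. one moving human
brain_connections, server_storage = 1e14, 1e12  # synaptic connections vs. TB-scale storage

print(round(log10(plane_parts / human_baseline), 1))        # ~6.8 orders of magnitude
print(round(log10(brain_connections / server_storage), 1))  # 2.0 orders of magnitude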

We might move that back even further, though, because we still lack a theory of the large-scale control of the collected software modules needed to operate on that massive neural simulation. At least Schmidhuber’s work used an artificial neural network. The others were looser about any affiliation to actual human information processing, though the LSI work is mathematically similar to some kinds of ANNs in terms of outcomes.

So if analogies only serve to support a mild kind of techno-optimism, we can still think about the problem in other ways by inverting the comparisons or emphasizing the risk of superintelligent machines. Thus is born the existential risk school of technological singularities. But such concerns and planning don’t really address the question of whether superintelligent machines are actually possible, or whether current achievements are significant.

And that brings us to the third perspective: the focus on competitive outcomes in AI research leads to only mild advances in the state of the art, but it does lead to important social outcomes. These are Apollo moon shots, in other words. Even absent dramatic scientific advances, they stir the mind and the soul. That may transform the mild techno-optimism into moderate techno-optimism. And that’s OK, because the alternative is stationary fear.

Universal Artificial Social Intelligence

Continuing to develop the idea that social reasoning adds to Hutter’s Universal Artificial Intelligence model, below is his basic layout for agents and environments:

A few definitions: the Agent (p) is a Turing machine that consists of a working tape and an algorithm that can move the tape left or right, read a symbol from the tape, write a symbol to the tape, and transition through a finite number of internal states as held in a table. That is all that is needed to be a Turing machine, and Turing machines can compute anything our everyday notion of a computer can. Formally, there are bounds to what they can compute (for instance, no Turing machine can decide in general whether a given program consisting of the symbols on the tape will stop at some point or run forever; this is the so-called “halting problem”). But it suffices to think of the Turing machine as a general-purpose logical machine in that all of its outputs are determined by a sequence of state changes that follow from the sequence of inputs and transformations expressed in the state table. There is no magic here.
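
To make the “no magic” point concrete, here is a minimal Turing-machine interpreter; the state table is an invented example that just flips bits until it reaches a blank and halts:

# A minimal Turing machine interpreter: behavior is fully determined by the
# state table, the tape contents, and the head position.
def run_turing_machine(table, tape, state="start", head=0, max_steps=100):
    tape = dict(enumerate(tape))                      # sparse tape, blank = "_"
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = tape.get(head, "_")
        write, move, state = table[(state, symbol)]   # deterministic lookup
        tape[head] = write
        head += {"L": -1, "R": 1}[move]
    return "".join(tape[i] for i in sorted(tape))

flip_bits = {
    ("start", "0"): ("1", "R", "start"),
    ("start", "1"): ("0", "R", "start"),
    ("start", "_"): ("_", "R", "halt"),
}
print(run_turing_machine(flip_bits, "1011"))  # -> "0100_"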

Hutter then couples the agent to a representation of the environment, also expressed as a Turing machine (after all, the environment is likely deterministic), and has the output symbols of the agent (y) consumed by the environment, which, in turn, outputs the results of the agent’s interaction with it as a series of rewards (r) and environment signals (x) that are consumed by the agent once again.

Where this gets interesting is that the agent is trying to maximize the reward signal, which implies that the combined predictive model must convert all the history accumulated up to a point in time into an optimal predictor. This is accomplished by minimizing behavioral error, and behavioral error is best minimized by choosing the shortest program that also predicts the history. By doing so, you simultaneously reduce the brittleness of the model against future changes in the environment.
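
Schematically, and in roughly Hutter's notation, the resulting action selection weights candidate environment programs q by two to the minus their length, so shorter programs consistent with the history dominate the prediction:

% Each percept x_i carries its reward r_i, U is a universal Turing machine,
% and \ell(q) is the length of candidate environment program q.
a_k \;:=\; \arg\max_{a_k} \sum_{x_k} \cdots \max_{a_m} \sum_{x_m}
\big( r_k + \cdots + r_m \big)
\sum_{q \,:\, U(q, a_{1:m}) = x_{1:m}} 2^{-\ell(q)}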

So far, so good. But this is just one agent coupled to the environment. If we have two agents competing against one another, we can treat each as the environment for the other and the mathematics is largely unchanged (see Hutter, pp. 36-37 for the treatment of strategic games via Nash equilibria and minimax). However, for non-competitive multi-agent simulations operating against the same environment there is a unique opportunity if the agents are sampling different parts of the environmental signal. So, let’s change the model to look as follows:

Now, each agent is sampling different parts of the output symbols generated by the environment (as well as the utility score, r). We assume that there is a rate difference between the agents’ input symbols and the environmental output symbols, but this is not particularly hard to model: as part of the input process, each agent’s state table just passes over N symbols, where N is the number of the agent, for instance. The resulting agents will still be Hutter-optimal with regard to the symbol sequences that they do process, and will generate outputs over time that maximize the additive utility of the reward signal, but they are no longer each maximizing the complete signal. Indeed, the relative quality of the individual agents is directly proportional to the quantity of the input symbol stream that they can consume.
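
A minimal sketch of that partitioning, with a placeholder environment stream rather than Hutter's formal machinery; each agent consumes only its own interleaved slice of the symbols and rewards:

# Sketch of the partitioned-observation setup: each agent only consumes every
# k-th environment symbol (offset by its index), plus the shared reward, so no
# single agent sees the whole signal.
def run_partitioned(environment_symbols, rewards, num_agents):
    histories = [[] for _ in range(num_agents)]
    for t, (x, r) in enumerate(zip(environment_symbols, rewards)):
        agent = t % num_agents           # every other agent "passes over" this symbol
        histories[agent].append((x, r))  # each history is a strict subset of the signal
    return histories

xs = list("abcdefgh")
rs = [0, 1, 0, 0, 1, 1, 0, 1]
for i, h in enumerate(run_partitioned(xs, rs, num_agents=2)):
    print(f"agent {i} sees {h}")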

Overcoming this limitation has an obvious fix: share the working tape between the individual agents:

Then, each agent can not only record its states on the tape but can also consume the states of the other agents. The tape becomes an extended memory or a shared theory about the environment. Formally, I don’t believe there is any difference between this and the single-agent model because, like multihead Turing machines, the sequence of moves and actions can be collapsed to a single table and a single tape insofar as the entire environmental signal is available to all of the agents (or their concatenated form). Instead, the value lies in consideration of what a multi-agent system implies concerning shared meaning and the value of coordination: for any real environment, perfect coupling between a single agent and that environment is an unrealistic simplification. And shared communication and shared modeling translate into an expansion of the individual agent’s model of the universe, yielding greater individual reward and, as a side effect, greater group reward as well.
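
Extending the same sketch to the shared tape, each agent publishes its observations to a common record that the others can fold into their own models; everything here is an illustrative placeholder rather than the formal construction:

# Sketch of the shared-tape variant: agents still consume their own slice of
# the environment signal, but they also write to, and read from, a common
# tape, so each agent's model can draw on what the others observed.
def run_shared_tape(environment_symbols, rewards, num_agents):
    shared_tape = []                                   # the common working tape
    models = [dict() for _ in range(num_agents)]       # per-agent symbol counts
    for t, (x, r) in enumerate(zip(environment_symbols, rewards)):
        agent = t % num_agents
        models[agent][x] = models[agent].get(x, 0) + 1
        shared_tape.append((agent, x, r))              # publish the observation
        # every other agent folds the shared record into its own model
        for i, model in enumerate(models):
            if i != agent:
                model[x] = model.get(x, 0) + 1
    return models, shared_tape

models, tape = run_shared_tape(list("abab"), [1, 0, 1, 0], num_agents=2)
print(models)  # both agents end up with counts over the full signal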