Category: Cognitive Science

Black and Gray Boxes with Autonomous Meta-Cognition

Vijay Pande of VC Andreessen Horowitz (who passed on my startups twice but, hey, it’s just business!) has a relevant article in New York Times concerning fears of the “black box” of deep learning and related methods: is the lack of explainability and limited capacity for interrogation of the underlying decision making a deal-breaker for applications to critical areas like medical diagnosis or parole decisions? His point is simple, and related to the previous post’s suggestion of the potential limitations of our capacity to truly understand many aspects of human cognition. Even the doctor may only be able to point to a nebulous collection of clinical experiences when it comes to certain observational aspects of their jobs, like in reading images for indicators of cancer. At least the algorithm has been trained on a significantly larger collection of data than the doctor could ever encounter in a professional lifetime.

So the human is almost as much a black box (maybe a gray box?) as the algorithm. One difference that needs to be considered, however, is that the deep learning algorithm might make unexpected errors when confronted with unexpected inputs. The classic example from the early history of artificial neural networks involved a DARPA test of detecting military tanks in photographs. The apocryphal to legendary formulation of the story is that there was a difference in the cloud cover between the tank images and the non-tank images. The end result was that the system performed spectacularly on the training and test data sets but then failed miserably on new data that lacked the cloud cover factor. I recalled this slightly differently recently and substituted film grain for the cloudiness. In any case, it became a discussion point about the limits of data-driven learning that showed how radically incorrect solutions could be created without careful understanding of how the systems work.

How can the fears of radical failure be reduced? In medicine we expect automated decision making to be backed up by a doctor who serves as a kind of meta-supervisor. When the diagnosis or prognosis looks dubious, the doctor will order more tests or countermand the machine. When the parole board sees the parole recommendation, they always have the option of ignoring it based on special circumstances. In each case, it is the presence of some anomaly in the recommendation or the input data that would lead to reconsideration. Similarly, it is certainly possible to automate that scrutiny at a meta-level. In machine learning, statistical regularization is used to reduce or eliminate outliers in data sets in an effort to prevent overtraining or overfitting on noisy data elements. In much the same way, the regularization process can provide warnings and clues about data viability. And, in turn, unusual outputs that are statistically unlikely given the history of the machine’s decisions can trigger warnings about anomalous results.

Deep Simulation in the Southern Hemisphere

I’m unusually behind in my postings due to travel. I’ve been prepping for and now deep inside a fresh pass through New Zealand after two years away. The complexity of the place seems to have a certain draw for me that has lured me back, yet again, to backcountry tramping amongst the volcanoes and glaciers, and to leasurely beachfront restaurants painted with eruptions of summer flowers fueled by the regular rains.

I recently wrote a technical proposal that rounded up a number of the most recent advances in deep learning neural networks. In each case, like with Google’s transformer architecture, there is a modest enhancement that is based on a realization of a deficit in the performance of one of two broad types of networks, recurrent and convolutional.

An old question is whether we learn anything about human cognition if we just simulate it using some kind of automatically learning mechanism. That is, if we use a model acquired through some kind of supervised or unsupervised learning, can we say we know anything about the original mind and its processes?

We can at least say that the learning methodology appears to be capable of achieving the technical result we were looking for. But it also might mean something a bit different: that there is not much more interesting going on in the original mind. In this radical corner sits the idea that cognitive processes in people are tactical responses left over from early human evolution. All you can learn from them is that they may be biased and tilted towards that early human condition, but beyond that things just are the way they turned out.

If we take this position, then, we might have to discard certain aspects of the social sciences. We can no longer expect to discover some hauntingly elegant deep structure to our minds, language, or personalities. Instead, we are just about as deep as we are shallow, nothing more.

I’m alright with this, but think it is premature to draw the more radical conclusion. In the same vein that there are tactical advances—small steps—that improve simple systems in solving more complex problems, the harder questions of self-reflectivity and “general AI” still have had limited traction. It’s perfectly reasonable that upon further development, the right approaches and architectures that do reveal insights into deep cognitive “simulation” will, in turn, uncover some of that deep structure.

Ambiguously Slobbering Dogs

I was initially dismissive of this note from Google Research on improving machine translation via Deep Learning Networks by adding in a sentence-level network. My goodness, they’ve rediscovered anaphora and co-reference resolution! Next thing they will try is some kind of network-based slot-filler ontology to carry gender metadata. But their goal was to add a framework to their existing recurrent neural network architecture that would support a weak, sentence-level resolution of translational ambiguities while still allowing the TPU/GPU accelerators they have created to function efficiently. It’s a hack, but one that potentially solves yet another corner of the translation problem and might result in a few percent further improvements in the quality of the translation.

But consider the following sentences:

The dog had the ball. It was covered with slobber.

The dog had the ball. It was thinking about lunch while it played.

In these cases, the anaphora gets resolved by semantics and the resolution seems largely an automatic and subconscious process to us as native speakers. If we had to translate these into a second language, however, we would be able to articulate that there are specific reasons for correctly assigning the “It” to the ball in the first two sentences. Well, it might be possible for the dog to be covered with slobber, but we would guess the sentence writer would intentionally avoid that ambiguity. The second set of sentences could conceivably be ambiguous if, in the broader context, the ball was some intelligent entity controlling the dog. Still, when our guesses are limited to the sentence pairs in isolation we would assign the obvious interpretations. Moreover, we can resolve giant, honking passage-level ambiguities with ease, where the author is showing off in not resolving the co-referents until obscenely late in the text.

In combination, we can see the obvious problem with sentence-level “attention” calculations. The context has to be moving and fairly long.

Brain Gibberish with a Convincing Heart

Elon Musk believes that direct brain interfaces will help people better transmit ideas to one another in addition to just allowing thought-to-text generation. But there is a fundamental problem with this idea. Let’s take Hubert Dreyfus’ conception of the way meaning works as being tied to a more holistic view of our social interactions with others. Hilary Putnam would probably agree with this perspective, though now I am speaking for two dead philosphers of mind. We can certainly conclude that my mental states when thinking about the statement “snow is white” are, borrowing from Putnam who borrows from Quine, different from a German person thinking “Schnee ist weiß.” The orthography, grammar, and pronunciation are different to begin with. Then there is what seems to transpire when I think about that statement: mild visualizations of white snow-laden rocks above a small stream for instance, or, just now, Joni Mitchell’s “As snow gathers like bolts of lace/Waltzing on a ballroom girl.” The centrality or some kind of logical ground that merely asserts that such a statement is a propositional truth that is shared in some kind of mind interlingua doesn’t bear much fruit to the complexities of what such a statement entails.

Religious and political terminology is notoriously elastic. Indeed, for the former, it hardly even seems coherent to talk about the concept of supernatural things or events. If they are detectable by any other sense than some kind of unverifiable gnosis, then they are at least natural in that they are manifesting in the observable world. So supernatural imposes a barrier that seems to preclude any kind of discussion using ordinary language. The only thing left is a collection of metaphysical assumptions that, in lacking any sort of reference, must merely conform to the patterns of synonymy, metonymy, and other language games that we ordinarily reserve for discernible events and things. And, of course, where unverifiable gnosis holds sway, it is not public knowledge and therefore seems to mainly serve as a social mechanism for attracting attention to oneself.

Politics takes on a similar quality, with it often said to be a virtue if a leader can translate complex policies into simple sound bites. But, as we see in modern American politics, what instead happens is that abstract fear signaling is the primary currency to try to motivate (and manipulate) the voter. The elasticity of a concept like “freedom” is used to polarize the sides of political negotiation that almost always involves the management of winners and losers and the dividing line between them. Fear mixes with complex nostalgia about times that never were, or were more nuanced than most recall, and jeremiads serve to poison the well of discourse.

So, if I were to have a brain interface, it might be trainable to write words for me by listening to the regular neural firing patterns that accompany my typing or speaking, but I doubt it would provide some kind of direct transmission or telepathy between people that would have any more content than those written or spoken forms. Instead, the inscrutable and non-referential abstractions about complex ideas would be tied together and be in contrast with the existing holistic meaning network. And that would just be gibberish to any other mind. Worst still, such a system might also be able to convey raw emotion from person to person, thus just amplifying the fear or joy component of the idea without being able to transmit the specifics of the thoughts. And that would be worse than mere gibberish, it would be gibberish with a convincing heart.

The Obsessive Dreyfus-Hawking Conundrum

I’ve been obsessed lately. I was up at 5 A.M. yesterday and drove to Ruidoso to do some hiking (trails T93 to T92, if interested). The San Augustin Pass was desolate as the sun began breaking over, so I inched up into triple digit speeds in the M6. Because that is what the machine is made for. Booming across White Sands Missile Range, I recalled watching base police work with National Park Rangers to chase oryx down the highway while early F117s practiced touch-and-gos at Holloman in the background, and then driving my carpool truck out to the high energy laser site or desert ship to deliver documents.

I settled into Starbucks an hour and a half later and started writing on ¡Reconquista!, cranking out thousands of words before trying to track down the trailhead and starting on my hike. (I would have run the thing but wanted to go to lunch later and didn’t have access to a shower. Neither restaurant nor diners deserve an après-run moi.) And then I was on the trail and I kept stopping and taking plot and dialogue notes, revisiting little vignettes and annotating enhancements that I would later salt in to the main text over lunch. And I kept rummaging through the development of characters, refining and sifting the facts of their lives through different sets of sieves until they took on both a greater valence within the story arc and, often, more comedic value.

I was obsessed and remain so. It is a joyous thing to be in this state, comparable only to working on large-scale software systems when the hours melt away and meals slip as one cranks through problem after problem, building and modulating the subsystems until the units begin to sing together like a chorus. In English, the syntax and semantics are less constrained and the pragmatics more pronounced, but the emotional high is much the same.

With the recent death of Hubert Dreyfus at Berkeley it seems an opportune time to consider the uniquely human capabilities that are involved in each of these creative ventures. Uniquely, I suggest, because we can’t yet imagine what it would be like for a machine to do the same kinds of intelligent tasks. Yet, from Stephen Hawking through to Elon Musk, influential minds are worried about what might happen if we develop machines that rise to the level of human consciousness. This might be considered a science fiction-like speculation since we have little basis for conjecture beyond the works of pure imagination. We know that mechanization displaces workers, for instance, and think it will continue, but what about conscious machines?

For Dreyfus, the human mind is too embodied and situational to be considered an encodable thing representable by rules and algorithms. Much like the trajectory of a species through an evolutionary landscape, the mind is, in some sense, an encoded reflection of the world in which it lives. Taken further, the evolutionary parallel becomes even more relevant in that it is embodied in a sensory and physical identity, a product of a social universe, and an outgrowth of some evolutionary ping pong through contingencies that led to greater intelligence and self-awareness.

Obsession with whatever cultivars, whatever traits and tendencies, lead to this riot of wordplay and software refinement is a fine example of how this moves away from the fears of Hawking and towards the impossibilities of Dreyfus. We might imagine that we can simulate our way to the kernel of instinct and emotion that makes such things possible. We might also claim that we can disconnect the product of the effort from these internal states and the qualia that defy easy description. The books and the new technologies have only desultory correspondence to the process by which they are created. But I doubt it. It’s more likely that getting from great automatic speech recognition or image classification to the general AI that makes us fearful is a longer hike than we currently imagine.

Tweak, Memory

Artificial Neural Networks (ANNs) were, from early on in their formulation as Threshold Logic Units (TLUs) or Perceptrons, mostly focused on non-sequential decision-making tasks. With the invention of back-propagation training methods, the application to static presentations of data became somewhat fixed as a methodology. During the 90s Support Vector Machines became the rage and then Random Forests and other ensemble approaches held significant mindshare. ANNs receded into the distance as a quaint, historical approach that was fairly computationally expensive and opaque when compared to the other methods.

But Deep Learning has brought the ANN back through a combination of improvements, both minor and major. The most important enhancements include pre-training of the networks as auto-encoders prior to pursuing error-based training using back-propagation or  Contrastive Divergence with Gibbs Sampling. The critical other enhancement derives from Schmidhuber and others work in the 90s on managing temporal presentations to ANNs so the can effectively process sequences of signals. This latter development is critical for processing speech, written language, grammar, changes in video state, etc. Back-propagation without some form of recurrent network structure or memory management washes out the error signal that is needed for adjusting the weights of the networks. And it should be noted that increased compute fire-power using GPUs and custom chips has accelerated training performance enough that experimental cycles are within the range of doable.

Note that these are what might be called “computer science” issues rather than “brain science” issues. Researchers are drawing rough analogies between some observed properties of real neuronal systems (neurons fire and connect together) but then are pursuing a more abstract question as to how a very simple computational model of such neural networks can learn. And there are further analogies that start building up: learning is due to changes in the strength of neural connections, for instance, and neurons fire after suitable activation. Then there are cognitive properties of human minds that might be modeled, as well, which leads us to a consideration of working memory in building these models.

It is this latter consideration of working memory that is critical to holding stimuli presentations long enough that neural connections can process them and learn from them. Schmidhuber et. al.’s methodology (LSTM) is as ad hoc as most CS approaches in that it observes a limitation with a computational architecture and the algorithms that operate within that architecture and then tries to remedy the limitation by architectural variations. There tends to be a tinkering and tweaking that goes on in the gradual evolution of these kinds of systems until something starts working. Theory walks hand-in-hand with practice in applied science.

Given that, however, it should be noted that there are researchers who are attempting to create a more biologically-plausible architecture that solves some of the issues with working memory and training neural networks. For instance, Frank, Loughry, and O’Reilly at University of Colorado have been developing a computational model that emulates the circuits that connect the frontal cortex and the basal ganglia. The model uses an elaborate series of activating and inhibiting connections to provide maintenance of perceptual stimuli in working memory. The model shows excellent performance on specific temporal presentation tasks. In its attempt to preserve a degree of fidelity to known brain science, it does lose some of the simplicity that purely CS-driven architectures provide, but I think it has a better chance of helping overcome another vexing problem for ANNs. Specifically, the slow learning properties of ANNs have only scant resemblance to much human learning. We don’t require many, many presentations of a given stimulus in order to learn it; often, one presentation is sufficient. Reconciling the slow tuning of ANN models, even recurrent ones, with this property of human-like intelligence remains an open issue, and more biology may be the key.

Apprendre à traduire

Google’s translate has always been a useful tool for awkward gists of short texts. The method used was based on building a phrase-based statistical translation model. To do this, you gather up “parallel” texts that are existing, human, translations. You then “align” them by trying to find the most likely corresponding phrases in each sentence or sets of sentences. Often, between languages, fewer or more sentences will be used to express the same ideas. Once you have that collection of phrasal translation candidates, you can guess the most likely translation of a new sentence by looking up the sequence of likely phrase groups that correspond to that sentence. IBM was the progenitor of this approach in the late 1980’s.

It’s simple and elegant, but it always was criticized for telling us very little about language. Other methods that use techniques like interlingual transfer and parsers showed a more linguist-friendly face. In these methods, the source language is parsed into a parse tree and then that parse tree is converted into a generic representation of the meaning of the sentence. Next a generator uses that representation to create a surface form rendering in the target language. The interlingua must be like the deep meaning of linguistic theories, though the computer science versions of it tended to look a lot like ontological representations with fixed meanings. Flexibility was never the strong suit of these approaches, but their flaws were much deeper than just that.

For one, nobody was able to build a robust parser for any particular language. Next, the ontology was never vast enough to accommodate the rich productivity of real human language. Generators, being the inverse of the parser, remained only toy projects in the computational linguistic community. And, at the end of the day, no functional systems were built.

Instead, the statistical methods plodded along but had their own limitations. For instance, the translation of a never-before-seen sentence consisting of never-before-seen phrases, is the null set. Rare and strange words in the data have problems too, because they have very low probabilities and are swamped by well-represented candidates that lack the nuances of the rarer form. The model doesn’t care, of course; the probabilities rule everything. So you need more and more data. But then you get noisy data mixed in with the good data that distorts the probabilities. And you have to handle completely new words and groupings like proper nouns and numbers that are due to the unique productivity of these classes of forms.

So, where to go from here? For Google and its recent commitment to Deep Learning, the answer was to apply Deep Learning Neural Network approaches. The approach threw every little advance of recent history at the problem to pretty good effect. For instance, to cope with novel and rare words, they broke the input text up into sub-word letter groupings. The segmentation of the groupings was based, itself, on a learned model of the most common break-ups of terms, though they didn’t necessarily correspond to syllables or other common linguistic expectations. Sometimes they also used character-level models. The models were then combined into an ensemble, which is a common way of overcoming brittleness and overtraining on subsets of the data set. They used GPUs in some cases as well as reduced-precision arithmetic to speed-up the training of the models. They also used an attention-based intermediary between the encoder layers and the decoder layers to limit the influence of the broader context within a sentence.

The results improved translation quality by as much as 60% over the baseline phrase-based approach and, interestingly, showed a close approach to the average human translator’s performance. Is this enough? Not at all. You are not going to translate poetry this way any time soon. The productiveness of human language and the open classes of named entities remain a barrier. The subtleties of pragmatics might still vex any data driven approach—at least until there are a few examples in the corpora. And there might need to be a multi-sensory model somehow merged with the purely linguistic one to help manage some translation candidates. For instance, knowing the way in which objects fall could help move a translation from “plummeted” to “settled” to the ground.

Still, data-driven methods continue to reshape the intelligent machines of the future.

Boredom and Being a Decider

tds_decider2_v6Seth Lloyd and I have rarely converged (read: absolutely never) on a realization, but his remarkable 2013 paper on free will and halting problems does, in fact, converge on a paper I wrote around 1986 for an undergraduate Philosophy of Language course. I was, at the time, very taken by Gödel, Escher, Bach: An Eternal Golden Braid, Douglas Hofstadter’s poetic excursion around the topic of recursion, vertical structure in ricercars, and various other topics that stormed about in his book. For me, when combined with other musings on halting problems, it led to a conclusion that the halting problem could be probabilistically solved by an observer who decides when the recursion is too repetitive or too deep. Thus, it prescribes an overlay algorithm that guesses about the odds of another algorithm when subjected to a time or resource constraint. Thus we have a boredom algorithm.

I thought this was rather brilliant at the time and I ended up having a one-on-one with my prof who scoffed at GEB as a “serious” philosophical work. I had thought it was all psychedelically transcendent and had no deep understanding of more serious philosophical work beyond the papers by Kripke, Quine, and Davidson that we had been tasked to read. So I plead undergraduateness. Nevertheless, he had invited me to a one-on-one and we clashed over the concept of teleology and directedness in evolutionary theory. How we got to that from the original decision trees of halting or non-halting algorithms I don’t recall.

But now we have an argument that essentially recapitulates that original form, though with the help of the Hartmanis-Stearns theorem to support it. Whatever the algorithm that runs in our heads, it needs to simulate possible outcomes and try to determine what the best course of action might be (or the worst course, or just some preference). That algorithm is in wetware and is therefore perfectly deterministic. And, importantly, quantum indeterminacy doesn’t rescue us from the free-will implications of that determinism at all; randomness is just random, not decision-making. Instead, the impossibility of assessing the possible outcomes comes from one algorithm monitoring another. In a few narrow cases, it may be possible to enumerate all the stopping results of the enclosed algorithm, but in general, all you can do is greedily terminate branches in the production tree based on some kind of temporal or resource-based criteria,

Free will is neither random nor classically deterministic, but is an algorithmic constraint on the processing power to simulate reality in a conscious, but likely deterministic, head.

Desire and Other Matters

From the frothy mind of Jeff Koons
From the frothy mind of Jeff Koons

“What matters?” is a surprisingly interesting question. I think about it constantly since it weighs-in whenever plotting future choices, though often I seem to be more autopilot than consequentialist in these conceptions. It is an essential first consideration when trying to value one option versus another. I can narrow the question a bit to “what ideas matter?” This immediately externalizes the broad reality of actions that meaningfully improve lives, like helping others, but still leaves a solid core of concepts that are valued more abstractly. Does the traditional Western liberal tradition really matter? Do social theories? Are less intellectually-embellished virtues like consistency and trust more relevant and applicable than notions like, well, consequentialism?

Maybe it amounts to how to value certain intellectual systems against others?

Some are obviously more true than others. So “dowsing belief systems” are less effective in a certain sense than “planetary science belief systems.” Yet there are a broader range of issues at work.

But there are some areas of the liberal arts that have a vexing relationship with the modern mind. Take linguistics. The field ranges from catalogers of disappearing languages to theorists concerned with how to structure syntactic trees. Among the latter are the linguists who have followed Noam Chomsky’s paradigm that explains language using a hierarchy of formal syntactic systems, all of which feature recursion as a central feature. What is interesting is that there have been very few impacts of this theory. It is very simple at its surface: languages are all alike and involve phrasal groups that embed in deep hierarchies. The specific ways in which the phrases and their relative embeddings take place may differ among languages, but they are alike in this abstract way.

And likewise we have to ask what the impact is of scholarship like René Girard’s theory of mimesis. The theory has a Victorian feel about it: a Freudian/Jungian essential psychological tendency girds all that we know, experience, and see. Violence is the triangulation of wanton desire as we try to mimic one another. That triangulation was suppressed—sublimated, if you will—by sacrifice that refocused the urge to violence on the sacrificial object. It would be unusual for such a theory to rise above the speculative scholarship that only queasily embraces empiricism without some prodding.

But maybe it is enough that ideas are influential at some level. So we have Ayn Rand, liberally called-out by American economic conservatives, at least until they are reminded of Rand’s staunch atheism. And we have Peter Thiel, from PayPal mafia to recent Gawker lawsuits, justifying his Facebook angel round based on Girard’s theory of mimesis. So we are all slaves of our desires to like, indirectly, a bunch of crap on the internet. But at least it is theoretically sound.

Motivation, Boredom, and Problem Solving

shatteredIn the New York Times Stone column, James Blachowicz of Loyola challenges the assumption that the scientific method is uniquely distinguishable from other ways of thinking and problem solving we regularly employ. In his example, he lays out how writing poetry involves some kind of alignment of words that conform to the requirements of the poem. Whether actively aware of the process or not, the poet is solving constraint satisfaction problems concerning formal requirements like meter and structure, linguistic problems like parts-of-speech and grammar, semantic problems concerning meaning, and pragmatic problems like referential extension and symbolism. Scientists do the same kinds of things in fitting a theory to data. And, in Blachowicz’s analysis, there is no special distinction between scientific method and other creative methods like the composition of poetry.

We can easily see how this extends to ideas like musical composition and, indeed, extends with even more constraints that range from formal through to possibly the neuropsychology of sound. I say “possibly” because there remains uncertainty on how much nurture versus nature is involved in the brain’s reaction to sounds and music.

In terms of a computational model of this creative process, if we presume that there is an objective function that governs possible fits to the given problem constraints, then we can clearly optimize towards a maximum fit. For many of the constraints there are, however, discrete parameterizations (which part of speech? which word?) that are not like curve fitting to scientific data. In fairness, discrete parameters occur there, too, especially in meta-analyses of broad theoretical possibilities (Quantum loop gravity vs. string theory? What will we tell the children?) The discrete parameterizations blow up the search space with their combinatorics, demonstrating on the one hand why we are so damned amazing, and on the other hand why a controlled randomization method like evolutionary epistemology’s blind search and selective retention gives us potential traction in the face of this curse of dimensionality. The blind search is likely weakened for active human engagement, though. Certainly the poet or the scientist would agree; they are using learned skills, maybe some intellectual talent of unknown origin, and experience on how to traverse the wells of improbability in finding the best fit for the problem. This certainly resembles pre-training in deep learning, though on a much more pervasive scale, including feedback from categorical model optimization into the generative basis model.

But does this extend outwards to other ways in which we form ideas? We certainly know that motivated reasoning is involved in key aspects of our belief formation, which plays strongly into how we solve these constraint problems. We tend to actively look for confirmations and avoid disconfirmations of fit. We positively bias recency of information, or repeated exposures, and tend to only reconsider in much slower cycles.

Also, as the constraints of certain problem domains become, in turn, extensions that can result in change—where there is a dynamic interplay between belief and success—the fixity of the search space itself is no longer guaranteed. Broad human goals like the search for meaning are an example of that. In come complex human factors, like how boredom correlates with motivation and ideological extremism (overview, here, journal article, here).

This latter data point concerning boredom crosses from mere bias that might preclude certain parts of a search space into motivation that focuses it, and that optimizes for novelty seeking and other behaviors.