One Shot, Few Shot, Radical Shot

Exunoplura is back up after a sad excursion through the challenges of hosting providers. To be blunt, they mostly suck. Between systems that just don’t work right (SSL certificate provisioning, in this case) and support experiences ranging from bad to counterproductive, it’s enough to make one want to host it oneself. But hosting is mostly, as they say of war, long boring periods punctuated by moments of terror as things go frustratingly sideways. Still, we are back up again after two hosting-provider side trips!

Honestly, I’d like to see an AI agent effectively navigate these technological challenges. Where even human performance is fleeting and imperfect, the notion that an AI could learn to deal with the uncertain corners of the process strikes me as currently unthinkable. But there are some interesting recent developments worth noting and discussing in the journey towards what is called “general AI”: a framework that is as flexible as people are, rather than narrowly tied to a specific task like visually inspecting welds or answering a few questions about weather, music, and so forth.

First, there is the work by the OpenAI folks on testing massive language models against one-shot and few-shot learning problems. In these learning problems, the number of presentations of the training cases is limited, rather than presenting huge numbers of exemplars and “fine-tuning” the response of the model. What is a language model? Well, it varies across approaches, but typically it is a weighted context of words of varying length, with the weights reflecting the probabilities of those words in those contexts over a massive collection of text corpora. For the OpenAI model, GPT-3, the total number of parameters (the weights over those words and contexts) is an astonishing 175 billion, trained on roughly 45 TB of text.… Read the rest
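GPT-3 itself is a neural network, not a table of counts, but the “weighted context of words” intuition is easiest to see in a count-based model. Here is a minimal bigram sketch in Python; the toy corpus, function name, and lack of smoothing are my own illustrative assumptions, not anything from the GPT-3 paper.

```python
from collections import defaultdict, Counter

def train_bigram_model(corpus_sentences):
    """Count how often each word follows each context word."""
    counts = defaultdict(Counter)
    for sentence in corpus_sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    # Convert raw counts into conditional probabilities P(word | prev).
    return {
        prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
        for prev, ctr in counts.items()
    }

# Toy corpus -- a real model is trained on terabytes of text, not three lines.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
model = train_bigram_model(corpus)
print(model["the"])   # P(next word | "the") estimated from the toy counts
```

Scaling that intuition from a handful of sentences to terabytes of text, and from counted bigrams to billions of learned weights, is roughly the distance between this toy and GPT-3.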

Ensembles Against Abominables

It seems obvious to me that when we face existential threats we should make the best possible decisions. I do this with respect to investment decisions, as well. I don’t rely on “guts” or feelings or luck or hope or faith or hunches or trends. All of those ideas are proxies for some sense of incompleteness in our understanding of probabilities and future outcomes.

So how can we cope with those kinds of uncertainties given existential threats? The core methodology is based on ensembles of predictions. We don’t actually want to trust an expert per se, but want instead to trust a basket of expert opinions—an ensemble of predictions. Ideally, those experts who have been more effective in the past should be given greater weight than those who have made poorer predictions. We most certainly should not rely on gut calls by abominable narcissists in what Chauncey DeVega at Salon disturbingly characterizes as a “pathological kakistocracy.”
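As a toy of weighting experts by track record, here is one possible scheme, inverse-error weighting with invented numbers; it is a sketch of the idea rather than any standard forecasting method.

```python
import numpy as np

# Hypothetical past absolute errors for three forecasters (lower is better).
past_errors = np.array([0.10, 0.25, 0.40])

# Weight each expert by inverse past error, normalized to sum to one.
weights = 1.0 / past_errors
weights /= weights.sum()

# Their current predictions for some quantity of interest.
predictions = np.array([2.1, 2.6, 3.4])

ensemble_forecast = float(np.dot(weights, predictions))
print(weights.round(3), ensemble_forecast)   # best past performer dominates
```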

Investment decision-making takes exactly this form when carried out rationally. Index funds adjust their security holdings in relation to an index like the S&P 500. Because stock markets have risen since their inceptions, with setbacks along the way, an index is a reliable ensemble approach to growth. Ensembles smooth predictions and reduce brittleness.

Ensemble methods are also core to predictive improvements in machine learning. While a single decision tree trained on data may overweight portions of the data set, an ensemble of trees (which we call a forest, of course) smooths the decision-making by having each tree contribute only part of the final vote for a prediction. The training of each individual tree is based on a randomized subset of the data, allowing for specialization of stands of trees while preserving the overall effectiveness of the system.… Read the rest
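The forest idea can be sketched by hand around scikit-learn decision trees: train each tree on a bootstrap sample and let the trees vote. The dataset, tree count, and depth below are arbitrary choices for illustration (scikit-learn’s own RandomForestClassifier wraps this up, with additional feature subsampling).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Train each tree on a bootstrap sample (a randomized subset, with replacement).
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(max_depth=4).fit(X[idx], y[idx]))

# Each tree casts one vote; the forest's prediction is the majority.
votes = np.stack([t.predict(X) for t in trees])        # shape: (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("forest training accuracy:", (forest_pred == y).mean())
```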

A Personal Computing Revolution

I’m writing this on a 2018 iPad Pro (11” with 512GB storage and LTE). I’m also using an external Apple Magic Keyboard 2 and Magic Trackpad 2. The iPad is plugged into an LG USB-C monitor at my sit-stand desk overlooking a forested canyon in Sedona. And it is, well, almost perfect. Almost, because there are remaining limitations (I’ll get to them), though they are well balanced by the capabilities and I suspect will be remedied soon.

Overall, though, it feels like a compute revolution where a small, extremely light (1 pound or so) device is all I need to occupy much of my day. I’ll point out that I am not by nature an Apple fanboi. I have an HP laptop that dual boots Ubuntu Linux and Windows in addition to a MacBook Pro with Parallels hosting two Linux distributions for testing and continuing education purposes. I know I can live my online life in Chrome on Linux well enough, using Microsoft Office 365, Google Mail, 1Password, Qobuz, Netflix, etc., while still being able to build enterprise and startup software ecosystems via the Eclipse IDE, Java/J2EE, Python, MySQL, AWS, Azure, etc. Did I forget anyone in there? Oh, of course there are Bitbucket, Git, Maven, Confluence, and all those helpers. All are just perfect on Linux once you fight your way through the package managers and occasional consultations of Stack Overflow. I think I first installed Linux on a laptop in 1993, and it remains not for the weak of geek, but is constantly improving.

But what are the positives of the iPad Pro? First is the lightness and more-than-sufficient power. Photo editing via Affinity Photo is actually faster than on my MacBook Pro (2016), and video editing works well, though without quite the professional completeness of Final Cut.… Read the rest

Forever Uncanny

Quanta has a fair roundup of recent advances in deep learning. Most interesting is the recent performance on natural language understanding tests, which approaches or exceeds mean human performance. Inevitably, John Searle’s Chinese Room argument is brought up, though the author of the Quanta article suggests that inferring the Chinese translational rule book from the data itself is slightly different from the original thought experiment. In the Chinese Room there is a person who knows no Chinese but has a collection of translational reference books. She receives texts through a slot and dutifully looks up the translation of the text and passes out the result. “Is this intelligence?” is the question, and it serves as a challenge to the Strong AI hypothesis. With statistical machine translation methods (and their alternative mechanistic implementation, deep learning), the rule books have been inferred by looking at translated texts (“parallel” texts, as we say in the field). A large enough corpus of parallel texts yields greater coverage of translated variants, as well as some inference of pragmatic issues in translation and corner cases.
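A toy sketch of how a “rule book” can be inferred from parallel texts: count co-occurrences of source and target words across aligned sentence pairs and keep the most strongly associated pairing. Real systems (IBM alignment models, neural MT) are far more sophisticated; the tiny corpus here is invented for illustration.

```python
from collections import defaultdict, Counter

# Invented aligned sentence pairs (source, target) -- purely illustrative.
parallel = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
    ("the cat eats", "le chat mange"),
]

cooc = defaultdict(Counter)   # cooc[source_word][target_word] = count
tgt_totals = Counter()        # overall frequency of each target word
for src, tgt in parallel:
    tgt_words = tgt.split()
    tgt_totals.update(tgt_words)
    for s in src.split():
        for t in tgt_words:
            cooc[s][t] += 1

# Crude "rule book": pick the target word whose co-occurrence is highest
# relative to its overall frequency (a rough association score).
rules = {
    s: max(ctr, key=lambda t: ctr[t] / tgt_totals[t])
    for s, ctr in cooc.items()
}
print(rules)  # {'the': 'le', 'cat': 'chat', 'sleeps': 'dort', 'dog': 'chien', 'eats': 'mange'}
```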

As a practical matter, it should be noted that modern, professional translators often use translation memory systems that contain idiomatic—or just challenging—phrases that they can reference when translating new texts. The understanding resides in the original translator’s head, we suppose, and in the correct application of the rule to the new text by checking for applicability according to, well, some other criteria that the translator brings to bear on the task.
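A sketch of the lookup side of a translation memory: fuzzy-match a new segment against stored source segments and return the stored translation of the best match. The memory entries and similarity threshold are invented; real TM tools add much more (segmentation, placeholders, terminology management).

```python
from difflib import SequenceMatcher

# Invented translation memory: previously translated segments.
memory = {
    "The device must be powered off before servicing.":
        "L'appareil doit être éteint avant toute intervention.",
    "Keep out of reach of children.":
        "Tenir hors de portée des enfants.",
}

def tm_lookup(new_segment, memory, threshold=0.75):
    """Return (stored translation, similarity) for the closest stored segment."""
    best_src, best_score = None, 0.0
    for src in memory:
        score = SequenceMatcher(None, new_segment.lower(), src.lower()).ratio()
        if score > best_score:
            best_src, best_score = src, score
    if best_score >= threshold:
        return memory[best_src], best_score
    return None, best_score

print(tm_lookup("The device must be switched off before servicing.", memory))
```

The translator still has to judge whether the retrieved translation actually applies, which is exactly the “some other criteria” the paragraph above points at.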

In the General Language Understanding Evaluation (GLUE) tests described in the Quanta article, the systems are inferring how to answer Wh-style queries (who, what, where, when, and how) as well as how to identify similar texts.… Read the rest

Deep Learning with Quantum Decoherence

Getting back to metaphors in science, Wojciech Zurek’s so-called Quantum Darwinism is in the news due to a series of experimental tests. In Quantum Darwinism (QD), the collapse of the wave function (more properly the “extinction” of states) is a result of decoherence from environmental entanglement. There is a kind of replication in QD, where pointer states are multiplied, and then a kind of environmental selection as well. There is no variation per se, however, though some might argue that the pointer states imprinted by the environment are variants of the originals. That makes the metaphor a bit thin at the edges, but it is close enough for the core idea to fit most of the floor plan of Darwinism. Indeed, some champion it as part of a more general model for everything. Even selection among viable multiverse bubbles has a similar feel to it: some survive while others perish.

I’ve been simultaneously studying quantum computing and complexity theories that are getting impressively well developed. Richard Cleve’s An Introduction to Quantum Complexity Theory and John Watrous’s Quantum Computational Complexity are notable in their bridging from traditional computational complexity to this newer world of quantum computing using qubits, wave functions, and even decoherence gates.

Decoherence sucks for quantum computing in general, but there may be a way to make use of it. For instance, an artificial neural network (ANN) also has some interesting Darwinian-like properties. The initial weights in an ANN are typically random real values, designed to simulate the relative strengths of neural connections. Real neural connections are much more complex than this, exhibiting cyclic behavior, saturating and suppressing based on neurotransmitter availability, and so forth, but assuming just a straightforward pattern of connectivity has allowed for significant progress.… Read the rest
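A minimal sketch of that random starting point: a tiny feed-forward network with Gaussian-initialized weights and a single forward pass. The layer sizes, weight scale, and activation function are arbitrary illustration choices, not a claim about any particular architecture.

```python
import numpy as np

rng = np.random.default_rng(42)

# Randomly initialized weights for a tiny two-layer network (sizes are arbitrary).
W1 = rng.normal(0.0, 0.1, size=(4, 8))   # input -> hidden
W2 = rng.normal(0.0, 0.1, size=(8, 2))   # hidden -> output

def forward(x):
    """One forward pass: weighted sums followed by simple nonlinearities."""
    h = np.tanh(x @ W1)       # hidden activations
    return np.tanh(h @ W2)    # output activations

x = rng.normal(size=(1, 4))   # a made-up input vector
print(forward(x))             # outputs before any training: pure random initialization
```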

Metaphors as Bridges to the Future

David Lewis’s (I’m coming to accept this new convention with s-ending possessives!) solution to Putnam’s semantic indeterminacy is that we have a network of concepts that interrelate in a manner that is consistent under probing. We know from cognitive psychology that, as we read, texts that bridge unfamiliar concepts from paragraph to paragraph help us settle those ideas into the network, sometimes tentatively, and sometimes requiring some kind of theoretical reorganization as we learn more. Then there are some concepts that have special referential magnetism and serve as piers for the bridges.

You can see these same kinds of bridging semantics being applied in the quest to solve some of our most difficult and unresolved scientific conundrums. Quantum physics has presented strangeness from its very beginning, and the various interpretations of that strangeness, along with efforts to reconcile the strange with our everyday logic, remain incomplete. So it is not surprising that efforts to unravel the strange in quantum physics often appeal to Einstein’s descriptive approach to deciphering the strange problems of electromagnetic wave propagation that ultimately led to Special and then General Relativity.

Two recent approaches that borrow from the Einstein model are Carlo Rovelli’s Relational Quantum Mechanics and David Albert’s How to Teach Quantum Mechanics. Both are quite explicit in drawing comparisons to the relativity approach: Einstein, in merging space and time, and in realizing that inertial and gravitational frames of reference were indistinguishable, introduced an explanation that defied our expectations of ordinary, Newtonian physical interactions. Time was no longer a fixed universal but became locked to observers and their relative motion, and to space itself.

Yet the two quantum approaches are decidedly different, as well. For Rovelli, there is no observer-independent state to quantum affairs.… Read the rest

Theoretical Reorganization

Sean Carroll of Caltech takes on the philosophy of science in his paper, Beyond Falsifiability: Normal Science in a Multiverse, as part of a larger conversation on modern theoretical physics and experimental methods. Carroll breaks down the problems with Popper’s falsification criterion and arrives at a more pedestrian Bayesian formulation for how to view science. Theories arise, theories get their priors amplified or deflated, that prior support changes due to—often, for Carroll—reasons of coherence with other theories and considerations, and, in the best case, the posterior support improves with better experimental data.
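A minimal sketch of that Bayesian bookkeeping, with made-up theories, priors, and likelihoods rather than anything from Carroll’s paper: posterior support is just prior times likelihood, renormalized across the competitors.

```python
import numpy as np

# Two made-up rival theories with prior credences.
priors = np.array([0.7, 0.3])          # P(theory A), P(theory B)

# Likelihood each theory assigns to the observed experimental outcome.
likelihoods = np.array([0.2, 0.6])     # P(data | A), P(data | B)

# Bayes: posterior is proportional to prior times likelihood, then normalize.
unnormalized = priors * likelihoods
posteriors = unnormalized / unnormalized.sum()
print(posteriors)   # theory B's support grows once the data come in
```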

Continuing with the previous posts’ work on expanding Bayes via AIT considerations, the non-continuous changes to a group of scientific theories that arrive with new theories or data require some better model than just adjusting priors. How exactly does coherence play a part in theory formation? If we treat each theory as a binary string that encodes a Turing machine, then the best theory, inductively, is the shortest machine that accepts the data. But we know that there is no machine that can compute that shortest machine, so there needs to be an algorithm that searches through the state space to try to locate the minimal machine. Meanwhile, the data may be varying and the machine may need to incorporate other machines that help improve the coverage of the original machine or are driven by other factors, as Carroll points out:

We use our taste, lessons from experience, and what we know about the rest of physics to help guide us in hopefully productive directions.

The search algorithm is clearly not just brute force in examining every micro variation in the consequences of changing bits in the machine. Instead, large reusable blocks of subroutines get reparameterized or reused with variation.… Read the rest
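For contrast, here is what the purely brute-force baseline looks like: enumerate candidate expressions in order of length and keep the first (hence shortest) one that reproduces the data. The toy expression grammar and target sequence are my own; in general the truly minimal machine is uncomputable, which is why real theory search leans on reusable structure rather than exhaustive enumeration.

```python
from itertools import product

data = [0, 1, 4, 9, 16]          # target observations for n = 0..4
tokens = ["n", "1", "2", "+", "*"]

def candidates(max_len):
    """Enumerate token strings in order of increasing length."""
    for length in range(1, max_len + 1):
        for combo in product(tokens, repeat=length):
            yield " ".join(combo)

def explains(expr, data):
    """Does the expression reproduce every observation?"""
    try:
        return all(eval(expr, {}, {"n": n}) == v for n, v in enumerate(data))
    except (SyntaxError, ZeroDivisionError, TypeError):
        return False

# The first (hence shortest) expression that reproduces the data wins.
best = next(expr for expr in candidates(5) if explains(expr, data))
print(best)   # "n * n"
```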

Free Will and Algorithmic Information Theory (Part II)

Bad monkey

So we get some mild form of source determinism out of Algorithmic Information Complexity (AIC), but we haven’t addressed the form of free will that deals with moral culpability at all. That form of free will requires that we, as moral agents, are capable of making choices that have moral consequences. Another way of saying it is that, given the same circumstances, we could have done otherwise. After all, all we have is a series of if/then statements that must be implemented in wetware, and they still respond to known stimuli in deterministic ways. Just responding in model-predictable ways to new stimuli doesn’t amount directly to making choices.

Let’s expand the problem a bit, however. Instead of a lock-and-key recognition of integer “foodstuffs” we have uncertain patterns of foodstuffs and fallible recognition systems. Suddenly we have a probability problem with P(food|n) [or even P(food|q(n)) where q is some perception function] governed by Bayesian statistics. Clearly we expect evolution to optimize towards better models, though we know that all kinds of historical and physical contingencies may derail perfect optimization. Still, if we did have perfect optimization, we know what that would look like for certain types of statistical patterns.
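A sketch of that P(food|q(n)) setup with invented numbers: a fallible detector with a known hit rate and false-alarm rate, and a single Bayesian update on seeing the detector fire.

```python
# Invented prior and noise model for a fallible "foodstuff" detector.
p_food = 0.3                      # prior: how often encountered items are food
p_signal_given_food = 0.9         # detector fires when the item really is food
p_signal_given_not_food = 0.2     # false-alarm rate on non-food items

# Bayes: P(food | signal) = P(signal | food) P(food) / P(signal)
p_signal = (p_signal_given_food * p_food
            + p_signal_given_not_food * (1 - p_food))
p_food_given_signal = p_signal_given_food * p_food / p_signal

print(round(p_food_given_signal, 3))   # ~0.659 with these made-up numbers
```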

What is an optimal induction machine? AIC and variants have been used to define that machine. First, we have Solomonoff induction from around 1960. But we also have Jorma Rissanen’s Minimum Description Length (MDL) theory from 1978, which casts the problem more in terms of continuous distributions. Variants are available too, from Minimum Message Length to Akaike’s Information Criterion (AIC, confusingly again), the Bayesian Information Criterion (BIC), and on to Structural Risk Minimization via Vapnik–Chervonenkis learning theory.
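As a sketch of the shared trade-off these criteria encode, here is BIC (one representative of the family) used to choose a polynomial degree on synthetic data; the data-generating process, noise level, and candidate degrees are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 1.5 * x**2 - 0.5 * x + rng.normal(0, 0.1, size=x.size)   # made-up quadratic data

def bic(degree):
    """Fit a polynomial and score it: fit quality plus a complexity penalty."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    n, k = x.size, degree + 1
    return n * np.log(np.mean(residuals**2)) + k * np.log(n)

scores = {d: round(bic(d), 1) for d in range(1, 8)}
best = min(scores, key=scores.get)
print(best, scores)   # the penalty should favor degree 2 over higher degrees
```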

All of these theories involve some kind of trade-off between the model parameters, their relative complexity, and the success of the model on the training exemplars.… Read the rest

Free Will and Algorithmic Information Theory

I was recently looking for examples of applications of algorithmic information theory, also commonly called algorithmic information complexity (AIC). After all, for a theory to be sound is one thing, but when it is sound and valuable it moves to another level. So, first, let’s review the broad outline of AIC. AIC begins with the problem of randomness, specifically random strings of 0s and 1s. We can readily see that given any sort of encoding in any base, strings of characters can be reduced to a binary sequence. Likewise integers.

Now, AIC states that there are often many Turing machines that could generate a given string and, since we can represent those machines also as a bit sequence, there is at least one machine that has the shortest bit sequence while still producing the target string. In fact, if that shortest machine is as long as the string itself or a bit longer (given some machine encoding requirements), then the string is said to be AIC-random. In other words, no compression of the string is possible.
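Kolmogorov complexity itself is uncomputable, but a general-purpose compressor gives a rough, hedged stand-in for the intuition: a patterned string compresses dramatically, while a (pseudo)random one barely compresses at all.

```python
import os
import zlib

patterned = b"01" * 5000            # a highly regular 10,000-byte string
random_ish = os.urandom(10000)      # pseudorandom bytes from the OS

for name, s in [("patterned", patterned), ("random-ish", random_ish)]:
    compressed = zlib.compress(s, level=9)
    print(name, len(s), "->", len(compressed), "bytes")
# The patterned string shrinks to a tiny fraction; the random one does not.
```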

Moreover, we can generalize this generator machine idea to claim that, given some set of strings that represent the data of a given phenomenon (let’s say natural occurrences), the smallest generator machine that covers all the data is a “theoretical model” of the data and the underlying phenomenon. An interesting outcome of this theory is that it can be shown that there is, in fact, no algorithm (or meta-machine) that can find the smallest generator for any given sequence. This is related to the halting problem and Gödel incompleteness.

In terms of applications, Gregory Chaitin, who is one of the originators of the core ideas of AIC, has proposed that the theory sheds light on questions of meta-mathematics and specifically that it demonstrates that mathematics is a quasi-empirical pursuit capable of producing new methods rather than being idealistically derived from analytic first-principles.… Read the rest

The Elusive in Art and Artificial Intelligence

Deep Dream (deepdreamgenerator.com) of my elusive inner Van Gogh.

How exactly deep learning models do what they do is at least elusive. Take image recognition as a task. We know that there are decision-making criteria inferred by the hidden layers of the networks. In Convolutional Neural Networks (CNNs), we have further knowledge that locally-receptive fields (or their simulated equivalent) provide a collection of filters that emphasize image features in different ways, from edge detection to rotation-invariant reductions, prior to being subjected to a learned categorizer. Yet the dividing lines between a chair and a small loveseat, or between two faces, are hidden within some non-linear equation composed of these field representations with weights tuned by exemplar presentation.
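A minimal sketch of the “filters” idea: sliding a hand-written edge-detection kernel over a tiny synthetic image, so that each output cell depends only on a local receptive field of the input. CNNs learn such kernels from exemplars rather than taking them as given; the image and kernel here are invented for illustration.

```python
import numpy as np

# A made-up 6x6 "image": dark on the left, bright on the right.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A hand-written vertical-edge kernel; a CNN would learn such filters instead.
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)

# Slide the kernel over every 3x3 patch: each output cell sees only a local
# receptive field of the input, which is the core CNN idea.
h, w = image.shape
k = kernel.shape[0]
feature_map = np.zeros((h - k + 1, w - k + 1))
for i in range(h - k + 1):
    for j in range(w - k + 1):
        feature_map[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)

print(feature_map)   # strong responses along the dark/bright boundary
```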

This elusiveness was at least part of the reason that neural networks and, more generally, machine learning-based approaches have had a complicated position in AI research: if you can’t explain how they work, or even fairly characterize their failure modes, maybe we should work harder to understand the support for those decision criteria rather than just building black boxes to execute them?

So when groups use deep learning to produce visual artworks like the recently auctioned work sold by Christie’s for USD 432K, we can be reassured that the murky issue of aesthetics in art appreciation is at least paired with elusiveness in the production machine.

Or is it?

Let’s take Wittgenstein’s ideas about aesthetics as a perhaps slightly murky point of comparison. In Wittgenstein, we are almost always looking at what are effectively games played between and among people. In language, the rules are shared in a culture, a community, and even between individuals. These are semantic limits, dialogue considerations, standardized usages, linguistic pragmatics, expectations, allusions, and much more.… Read the rest