Apprendre à traduire

Google Translate has always been a useful tool for getting awkward gists of short texts. Its original method was to build a phrase-based statistical translation model. To do this, you gather up “parallel” texts, which are existing human translations. You then “align” them by trying to find the most likely corresponding phrases in each sentence or set of sentences, since languages often use fewer or more sentences to express the same ideas. Once you have that collection of phrasal translation candidates, you can guess the most likely translation of a new sentence by looking up the sequence of likely phrase groups that correspond to that sentence. IBM was the progenitor of this approach in the late 1980s.
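To make the lookup step concrete, here is a minimal sketch of monotone phrase-based decoding. The phrase table, its probabilities, and the function name `translate` are all invented for illustration; a real system would learn the table from aligned parallel text and would also handle reordering and a target-side language model, which this toy omits.

```python
import math

# Hypothetical phrase table: source phrase -> [(target phrase, probability)].
# The entries and numbers are made up for illustration only.
PHRASE_TABLE = {
    ("le",): [("the", 0.9)],
    ("chat",): [("cat", 0.8)],
    ("noir",): [("black", 0.7)],
    ("chat", "noir"): [("black cat", 0.9)],
    ("dort",): [("sleeps", 0.8)],
}

def translate(sentence, table=PHRASE_TABLE, max_len=3):
    """Pick the most probable sequence of phrase translations, left to right."""
    words = sentence.split()
    n = len(words)
    # best[i] holds (log-probability, target pieces) covering the first i words.
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prev_score, prev_out = best[start]
            if prev_score == -math.inf:
                continue  # the first `start` words cannot be covered
            for target, prob in table.get(tuple(words[start:end]), []):
                score = prev_score + math.log(prob)
                if score > best[end][0]:
                    best[end] = (score, prev_out + [target])
    pieces = best[n][1]
    return " ".join(pieces) if pieces else None

print(translate("le chat noir dort"))  # the black cat sleeps
```

The two-word phrase wins here because one confident phrase pair (0.9) beats the product of two single-word pairs (0.8 × 0.7), which is also how the phrase table captures the local reordering of "chat noir".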

It’s simple and elegant, but it always was criticized for telling us very little about language. Other methods that use techniques like interlingual transfer and parsers showed a more linguist-friendly face. In these methods, the source language is parsed into a parse tree and then that parse tree is converted into a generic representation of the meaning of the sentence. Next a generator uses that representation to create a surface form rendering in the target language. The interlingua must be like the deep meaning of linguistic theories, though the computer science versions of it tended to look a lot like ontological representations with fixed meanings. Flexibility was never the strong suit of these approaches, but their flaws were much deeper than just that.

For one, nobody was able to build a robust parser for any particular language. Next, the ontology was never vast enough to accommodate the rich productivity of real human language. Generators, being the inverse of the parser, remained only toy projects in the computational linguistic community.…

Boredom and Being a Decider

Seth Lloyd and I have rarely converged (read: absolutely never) on a realization, but his remarkable 2013 paper on free will and halting problems does, in fact, converge on a paper I wrote around 1986 for an undergraduate Philosophy of Language course. I was, at the time, very taken by Gödel, Escher, Bach: An Eternal Golden Braid, Douglas Hofstadter’s poetic excursion around the topic of recursion, vertical structure in ricercars, and various other topics that stormed about in his book. For me, when combined with other musings on halting problems, it led to a conclusion that the halting problem could be probabilistically solved by an observer who decides when the recursion is too repetitive or too deep. This prescribes an overlay algorithm that guesses the odds that another algorithm will halt when subjected to a time or resource constraint. In effect, we have a boredom algorithm.
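A minimal sketch of such an overlay, under stated assumptions: the watched computation is modeled as a step function that returns the next state, or `None` when it halts. The names `bored_observer`, `patience`, and the three verdict strings are hypothetical, not taken from the original paper or from Lloyd's.

```python
def collatz_step(n):
    """One Collatz step; returns None once the sequence reaches 1 (a halt)."""
    if n == 1:
        return None
    return n // 2 if n % 2 == 0 else 3 * n + 1

def bored_observer(step, state, patience=1000):
    """Watch a computation and give a verdict within a fixed step budget."""
    seen = set()
    for _ in range(patience):
        if state in seen:
            return "loops"   # too repetitive: an exact state repeat never halts
        seen.add(state)
        state = step(state)
        if state is None:
            return "halts"
    return "bored"           # too deep: budget spent, guess that it runs on

print(bored_observer(collatz_step, 27))            # halts
print(bored_observer(lambda x: (x + 1) % 5, 0))    # loops
print(bored_observer(lambda x: x + 1, 0))          # bored
```

Only the "bored" verdict is a guess. A larger `patience` trades time for fewer wrong guesses, which is the resource constraint the paragraph describes.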

I thought this was rather brilliant at the time and I ended up having a one-on-one with my prof who scoffed at GEB as a “serious” philosophical work. I had thought it was all psychedelically transcendent and had no deep understanding of more serious philosophical work beyond the papers by Kripke, Quine, and Davidson that we had been tasked to read. So I plead undergraduateness. Nevertheless, he had invited me to a one-on-one and we clashed over the concept of teleology and directedness in evolutionary theory. How we got to that from the original decision trees of halting or non-halting algorithms I don’t recall.

But now we have an argument that essentially recapitulates that original form, though with the help of the Hartmanis-Stearns theorem to support it. Whatever the algorithm that runs in our heads, it needs to simulate possible outcomes and try to determine what the best course of action might be (or the worst course, or just some preference).…

Startup Next

I’m thrilled to announce my new startup, Like Human. The company is focused on making significant new advances to the state of the art in cognitive computing and artificial intelligence. We will remain a bit stealthy for another six months or so and then will open up shop for early adopters.

I’m also pleased to share with you Like Human’s logo that goes by the name Logo McLogoface, or LM for short. LM combines imagery from nuclear warning signs, Robby the Robot from Forbidden Planet, and Leonardo da Vinci’s Vitruvian Man. I think you will agree about Mr. McLogoface’s agreeability:

[Image: Like Human logo]

You can follow developments at @likehumancom on Twitter, and I will make a few announcements here as well.…

Motivation, Boredom, and Problem Solving

In the New York Times column The Stone, James Blachowicz of Loyola challenges the assumption that the scientific method is uniquely distinguishable from other ways of thinking and problem solving we regularly employ. In his example, he lays out how writing poetry involves some kind of alignment of words that conforms to the requirements of the poem. Whether actively aware of the process or not, the poet is solving constraint satisfaction problems concerning formal requirements like meter and structure, linguistic problems like parts-of-speech and grammar, semantic problems concerning meaning, and pragmatic problems like referential extension and symbolism. Scientists do the same kinds of things in fitting a theory to data. And, in Blachowicz’s analysis, there is no special distinction between scientific method and other creative methods like the composition of poetry.

We can easily see how this extends to ideas like musical composition and, indeed, extends with even more constraints that range from formal through to possibly the neuropsychology of sound. I say “possibly” because there remains uncertainty on how much nurture versus nature is involved in the brain’s reaction to sounds and music.

In terms of a computational model of this creative process, if we presume that there is an objective function that governs possible fits to the given problem constraints, then we can clearly optimize towards a maximum fit. For many of the constraints there are, however, discrete parameterizations (which part of speech? which word?) that are not like curve fitting to scientific data. In fairness, discrete parameters occur there, too, especially in meta-analyses of broad theoretical possibilities (Loop quantum gravity vs. string theory? What will we tell the children?) The discrete parameterizations blow up the search space with their combinatorics, demonstrating on the one hand why we are so damned amazing, and on the other hand why a controlled randomization method like evolutionary epistemology’s blind search and selective retention gives us potential traction in the face of this curse of dimensionality.…
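As a rough illustration of blind variation with selective retention over discrete choices, the sketch below fills five word slots to satisfy a meter constraint and a rhyme constraint. Everything in it is an assumption made for the example: the tiny lexicon, the syllable counts, the target of eight syllables, the rhyme ending, and the objective function `fit`.

```python
import random

# Hypothetical lexicon: word -> syllable count (illustrative values only).
SYLLABLES = {"sea": 1, "moon": 1, "light": 1, "silver": 2, "river": 2,
             "shiver": 2, "wandering": 3, "delicate": 3}
TARGET_SYLLABLES = 8   # assumed meter constraint
RHYME = "iver"         # assumed rhyme constraint on the last word

def fit(line):
    """Objective: reward the rhyme, penalize distance from the target meter."""
    meter_error = abs(sum(SYLLABLES[w] for w in line) - TARGET_SYLLABLES)
    rhyme_bonus = 1 if line[-1].endswith(RHYME) else 0
    return rhyme_bonus - meter_error

def blind_search(slots=5, steps=2000, seed=0):
    """Mutate one random slot at a time; keep the variant if it is no worse."""
    rng = random.Random(seed)
    words = sorted(SYLLABLES)
    line = [rng.choice(words) for _ in range(slots)]
    start = list(line)
    for _ in range(steps):
        variant = list(line)
        variant[rng.randrange(slots)] = rng.choice(words)  # blind variation
        if fit(variant) >= fit(line):                      # selective retention
            line = variant
    return start, line

start, final = blind_search()
print(" ".join(start), fit(start))
print(" ".join(final), fit(final))
```

The search never evaluates more than a few thousand of the 8^5 possible lines, and the retention rule guarantees the final line scores at least as well as the starting one. Accepting equal-scoring variants lets the search drift across plateaus instead of stalling.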

Local Minima and Coatimundi

Even given the basic conundrum of how deep learning neural networks might cope with temporal presentations or linear sequences, there is another oddity to deep learning that only seems obvious in hindsight. One of the main enhancements to traditional artificial neural networks is a phase of unsupervised pre-training that forces each layer to try to create a generative model of the input pattern. The deep learning networks then learn a discriminative model after the initial pre-training is done, focusing on the error relative to classification rather than simply recognizing the phrase or image per se.

Why this makes a difference has been the subject of some investigation. In general, there is an interplay between the smoothness of the error function and the ability of the optimization algorithms to cope with local minima. Visualize it this way: for any machine learning problem that needs to be solved, there are answers and better answers. Take visual classification. If the system (or you) gets shown an image of a coatimundi and a label that says coatimundi (heh, I’m running in New Mexico right now…), learning that image-label association involves adjusting weights assigned to different pixels in the presentation image down through multiple layers of the network that provide increasing abstractions about the features that define a coatimundi. And, importantly, that define a coatimundi versus all the other animals and non-animals.

These weight choices define an error function that is the optimization target for the network as a whole, and this error function can have many local minima. That is, by enhancing the weights supporting a coati versus a dog or a raccoon, the algorithm inadvertently leans towards a non-optimal assignment for all of them by focusing instead on a balance between them that is predestined by the previous dog and raccoon classifications (or, in general, the order of presentation).…
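A one-dimensional toy can show the effect, with the caveat that it is only an analogy: the "network" here is a single weight `w`, and the error function is an arbitrary quartic chosen because it has two minima of different depth. Where plain gradient descent ends up depends entirely on where it starts, much as the order of presentation biases the weights above.

```python
def loss(w):
    """Toy error surface with a shallow minimum near 1.13 and a deeper one near -1.30."""
    return w**4 - 3 * w**2 + w

def grad(w):
    """Derivative of the toy error surface."""
    return 4 * w**3 - 6 * w + 1

def descend(w, rate=0.01, steps=2000):
    """Plain gradient descent from the starting weight w."""
    for _ in range(steps):
        w -= rate * grad(w)
    return w

right = descend(2.0)    # settles in the shallow, local minimum
left = descend(-2.0)    # settles in the deeper, global minimum
print(round(right, 2), round(loss(right), 2))
print(round(left, 2), round(loss(left), 2))
```

Both runs stop at a point where the gradient vanishes, but only the run started on the left reaches the better answer.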

New Behaviorism and New Cognitivism

Deep Learning now dominates discussions of intelligent systems in Silicon Valley. Jeff Dean’s discussion of its role in the Alphabet product lines and initiatives shows the dominance of the methodology. Pushing the limits of what Artificial Neural Networks have been able to do has been driven by certain algorithmic enhancements and the ability to process weight training algorithms at much higher speeds and over much larger data sets. Google even developed specialized hardware to assist.

Broadly, though, we see mostly pattern recognition problems like image classification and automatic speech recognition being impacted by these advances. Natural language parsing has also recently had some improvements from Fernando Pereira’s team. The incremental improvements using these methods should not be minimized but, at the same time, the methods don’t emulate key aspects of what we observe in human cognition. For instance, the networks train incrementally and lack the kinds of rapid transitions that we observe in human learning and thinking.

In a strong sense, the models that Deep Learning uses can be considered Behaviorist in that they rely almost exclusively on feature presentation with a reward signal. The internal details of how modularity or specialization arise within the network layers are interesting but secondary to the broad use of back-propagation or Gibbs sampling combined with autoencoding. This is a critique that goes back to the early days of connectionism, of course, and why it was somewhat sidelined after an initial heyday in the late eighties. Then came statistical NLP, then came hybrid methods, then a resurgence of corpus methods, all the while with image processing getting more and more into the hand-crafted modular space.

But we can see some interesting developments that start to stir more Cognitivism into this stew.…

Evolving Visions of Chaotic Futures

Most artificial intelligence researchers think it unlikely that a robot apocalypse or some kind of technological singularity is coming anytime soon. I’ve said as much, too. Guessing about the likelihood of distant futures is fraught with uncertainty; current trends are almost impossible to extrapolate.

But if we must, what are the best ways for guessing about the future? In the late 1950s the Delphi method was developed. Get a group of experts on a given topic and have them answer questions anonymously. Then iteratively publish back the group results and ask for feedback and revisions. Similar methods have been developed for face-to-face group decision making, like Kevin O’Connor’s approach to generating ideas in The Map of Innovation: generate ideas and give participants votes equaling a third of the number of unique ideas. Keep iterating until there is a consensus. More broadly, such methods are called “nominal group techniques.”

Most recently, the notion of prediction markets has been applied to internal and external decision making. In prediction markets, a similar voting strategy is used but based on either fake or real money, forcing participants towards a risk-averse allocation of assets.

Interestingly, we know that optimal inference based on past experience can be codified using algorithmic information theory, but the fundamental problem with any kind of probabilistic argument is that much of the change we observe in society is non-linear with respect to its underlying drivers and that the signals needed are imperfect. As the mildly misanthropic Nassim Taleb pointed out in The Black Swan, the only place where prediction takes on smooth statistical regularity is in Las Vegas, which is why one shouldn’t bother to gamble.…

The Retiring Mind, Part III: Autonomy

Robert Gordon’s book on the end of industrial revolutions recently came out. I’ve been arguing for a while that the coming robot apocalypse might be Industrial Revolution IV. But the Dismal Science continues to point out uncomfortable facts in opposition to my suggestion.

So I had to test the beginning of the end (or the beginning of the beginning?) when my Tesla P90D with autosteer, summon mode, automatic parking, and ludicrous mode arrived to take the place of my three-year-old P85:…

The Goldilocks Complexity Zone

Since my time in the early 90s at the Santa Fe Institute, I’ve been fascinated by the informational physics of complex systems. What are the requirements of an abstract system that is capable of complex behavior? How do our intuitions about complex behavior or form match up with mathematical approaches to describing complexity? For instance, we might consider a snowflake complex, but it is also regular in its structure, driven by an interaction between crystal growth and the surrounding air. The classic examples of coastlines and fractal self-similarity also seem complex but are not capable of complex behavior.

So what is a good way of thinking about complexity? There is actually a good range of ideas about how to characterize complexity. Seth Lloyd rounds up many of them, here. The intuition that drives many of them is that complexity seems to be associated with distributions of relationships and objects that sit somewhere between a single state and a uniformly random set of states. Complex things, be they living organisms or computers running algorithms, should exist in a Goldilocks zone when each part is examined and those parts are somehow summed up to a single measure.

We can easily construct a complexity measure that captures some of these intuitions. Let’s look at three strings of characters:

x = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

y = menlqphsfyjubaoitwzrvcgxdkbwohqyxplerz

z = the fox met the hare and the fox saw the hare

Now we would likely all agree that y and z are more complex than x, and I suspect most would agree that y looks like gibberish compared with z. Of course, y could be a sequence of weirdly coded measurements or something, or encrypted such that the message appears random.…
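The excerpt breaks off before the measure itself is given, so the following is only one possible sketch and not necessarily the measure the post goes on to construct. It computes the Shannon entropy of each string's character frequencies, then applies a hypothetical `goldilocks` score (an assumption of this example) that is zero at both extremes, a single repeated symbol and all symbols distinct, and peaks in between.

```python
import math
from collections import Counter

X = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
Y = "menlqphsfyjubaoitwzrvcgxdkbwohqyxplerz"
Z = "the fox met the hare and the fox saw the hare"

def entropy_bits(s):
    """Shannon entropy, in bits per character, of the character frequencies in s."""
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())

def goldilocks(s):
    """Hypothetical score: 0 for fully regular or fully distinct strings, 1 midway."""
    if len(s) < 2:
        return 0.0
    # Normalize by the largest entropy a string of this length could have.
    h = entropy_bits(s) / math.log2(len(s))
    return 4 * h * (1 - h)

for name, s in (("x", X), ("y", Y), ("z", Z)):
    print(name, round(entropy_bits(s), 2), round(goldilocks(s), 2))
```

Raw entropy ranks y above z above x, which tracks randomness rather than complexity. The Goldilocks score ranks z highest, matching the intuition that z is structured while y looks like gibberish. It only sees single-character frequencies, though, so it cannot tell z from a shuffled copy of z.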

Entanglement and Information

Research can flow into interesting little eddies that cohere into larger circulations that become transformative phase shifts. That happened to me this morning between a morning drive in the Northern California hills and departing for lunch at one of our favorite restaurants in Danville.

The topic I’ve been working on since my retirement is whether there are preferential representations for optimal automated inference methods. We have this grab-bag of machine learning techniques that use differing data structures but that all implement some variation on fitting functions to data exemplars; at the most general they all look like some kind of gradient descent on an error surface. Getting the right mix of parameters, nodes, etc. falls to some kind of statistical regularization or bottlenecking for the algorithms. Or maybe you perform a grid search in the hyperparameter space, narrowing down the right mix. Or you can throw up your hands and try to evolve your way to a solution, suspecting that there may be local optima that are distracting the algorithms from global success.

Yet, algorithmic information theory (AIT) gives us, via Solomonoff, a framework for balancing parameterization of an inference algorithm against the error rate on the training set. But, first, it’s all uncomputable and, second, the AIT framework just uses strings of binary as the coded Turing machines, so I would have to flip 2^N bits and test each representation to get anywhere with the theory. Yet, I and many others have had incremental success at using variations on this framework, whether via Minimum Description Length (MDL) principles, its first cousin Minimum Message Length (MML), or other statistical regularization approaches that are somewhat proxies for these techniques.…
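The balance between parameterization and error can be sketched with a crude two-part MDL code, a computable proxy rather than the Solomonoff framework itself. The model class (order-k Markov models over a binary string), the charge of half of log2(n) bits per parameter, and the function names are assumptions chosen for this example.

```python
import math
from collections import Counter

def description_length(bits, order):
    """Two-part code length: bits for the model plus bits for the data given it."""
    n = len(bits)
    # Part 1: about (1/2) log2(n) bits per parameter, one parameter per context.
    model_bits = (2 ** order) * 0.5 * math.log2(n)
    # Part 2: the first `order` symbols are sent raw, one bit each.
    data_bits = float(order)
    context_counts = Counter()
    pair_counts = Counter()
    for i in range(order, n):
        ctx = bits[i - order:i]
        context_counts[ctx] += 1
        pair_counts[ctx, bits[i]] += 1
    for (ctx, _sym), c in pair_counts.items():
        p = (c + 0.5) / (context_counts[ctx] + 1.0)  # smoothed frequency estimate
        data_bits -= c * math.log2(p)
    return model_bits + data_bits

def best_order(bits, max_order=3):
    """Pick the Markov order with the shortest total description."""
    return min(range(max_order + 1), key=lambda k: description_length(bits, k))

print(best_order("0" * 100))   # 0
print(best_order("01" * 50))   # 1
```

A constant string needs no context at all, while the alternating string is almost free to encode once the model remembers one previous symbol. Higher orders fit no better and only pay for extra parameters, which is the trade-off the paragraph describes.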