Category: AI

Computing the Madness of People

The best paper I’ve read so far this year has to be Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-sample Performance by David Bailey, Jonathan Borwein, Marcos López de Prado, and Qiji Jim Zhu. The title should ring alarm bells with anyone who has ever puzzled over the disclaimer made by mutual funds and investment strategists that “past performance is not a guarantee of future performance.” No, but past performance is all we have to judge a fund or firm on; the alternatives are to pick based on vague investment “philosophies” like those the heroizing profiles in Kiplinger’s seem to promote, or to trust that all the arbitraging has squeezed the markets into perfect equilibria and simply buy index funds.

The paper’s core lesson extends well beyond financial charlatanism, however. The authors point out that the same problem arises in drug discovery, where the main effects of novel compounds may be due to pure randomness in the sample population in a way that is masked by the sample selection procedure. The history of mental illness research has similar failures, with the head of NIMH remarking that clinical trials and the DSM for treating psychiatric symptoms are too often “shooting in the dark.”

The paper’s central suggestion is remarkably simple: use held-out data to validate models. Remarkably simple, but apparently rarely done in quantitative financial analysis. The researchers show how simple random walks can look like seasonal price patterns, and how, by sending binary signals about market performance to clients (market will rise/market will fall), investment advisors can create a subpopulation that thinks they are geniuses while the other clients walk away from their losses. Those examples rise to the level of charlatanism, but overfitting is just one form of pseudo-mathematics that creeps in whenever insufficient care is taken with the data. And the fix is simple: do what we do in machine learning. Divide the training data into 10 buckets, train on 9 of them, and verify on the last one. Then rotate, or do another division/training cycle. Anomalies in the data start popping out very quickly, and ensemble-based methods can help cope with breakdowns of independence and stability assumptions.
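
A minimal sketch of the idea, using nothing but noise (my own illustration, not the paper’s procedure): generate purely random “returns,” search a few hundred random trading rules for the best in-sample performer across nine folds, and then score that winner on the held-out fold. The sizes, the interleaved fold assignment, and the unannualized Sharpe-style score are all toy assumptions; real market data would want blocked, time-ordered folds.

```python
# Illustration only: expose backtest overfitting with a held-out fold.
import numpy as np

rng = np.random.default_rng(0)
n_days, n_strategies, n_folds = 2500, 500, 10

returns = rng.normal(0.0, 0.01, n_days)                     # pure-noise daily "returns"
signals = rng.choice([-1, 1], size=(n_strategies, n_days))  # random long/short rules

fold_ids = np.arange(n_days) % n_folds                      # toy interleaved folds
selected_in_sample, held_out = [], []

for k in range(n_folds):
    train, test = fold_ids != k, fold_ids == k
    # "Backtest" every random strategy on the nine training folds.
    train_pnl = signals[:, train] * returns[train]
    train_sharpe = train_pnl.mean(axis=1) / train_pnl.std(axis=1)
    best = int(np.argmax(train_sharpe))                     # the apparent genius rule
    # Score that same winner on the fold it never saw.
    test_pnl = signals[best, test] * returns[test]
    selected_in_sample.append(train_sharpe[best])
    held_out.append(test_pnl.mean() / test_pnl.std())

print("in-sample Sharpe of the selected strategy:", np.mean(selected_in_sample))
print("held-out Sharpe of the same strategy:    ", np.mean(held_out))
# The held-out figure collapses toward zero: the in-sample edge was an artifact
# of searching many random rules, which is exactly what validation exposes.
```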

Perhaps the best quote in the paper is from Sir Isaac Newton, who lamented that he could not calculate “the madness of people” after losing a minor fortune in the South Sea Bubble of 1720. If we are going to start computing that madness, it is important to do it right.

Saving Big Data from the Zeros

Because of the hype cycle, Big Data inevitably attracts dissenters who want to deflate the lofty expectations built around new technologies that appear mystifying to those outside the Silicon Valley machine. The first response is generally “so what?”: there is nothing new here, just a rehashing of efforts like grid computing and Beowulf clusters and whatnot. This skepticism is generally a healthy inoculation against aggrandizement and against any hangover from unmet expectations. Hence, the NY Times op-ed from April 6th, Eight (No, Nine!) Problems with Big Data, should be embraced for enumerating eight or nine different ways that Big Data technologies, algorithms, and thinking might be stretching the balloon of hope toward a loud, but ineffectual, pop.

The eighth item on the list bears some scrutiny, though. The authors, with whom I am not familiar, focus on the overuse of trigrams in building statistical language models. They note that language is very productive and that even a short phrase from Rob Lowe, “dumbed-down escapist fare,” doesn’t appear in Google’s indexed corpus. Shades of Chomsky’s “colorless green ideas…,” but an important lesson in how to manage the composition of meaning. “Dumbed-down escapist fare” doesn’t translate well back and forth through German via Google Translate. For the authors, that shows the failure of the statistical translation methodology linked to Big Data, and it ties into their other concerns about predicting rare occurrences or even, in the case of Lowe’s phrase, zero occurrences.

In reality, though, these methods of statistical translation through parallel-text learning date to the late 1980s and reflect a distinct journey through ways of thinking about natural language and computing. Throughout the field’s history, phrasal bracketing and the alignment of those phrases to build a statistical concordance have been driven by more than trigrams. And where higher-order n-grams get sparse (or go to zero probability, like Rob Lowe’s phrase), the better estimate is based on composing the probabilities of the subphrases or individual words:

Google Search                          Hits
“dumbed-down”                          690,000
escapist                               1,860,000
fare                                   132,000,000
“dumbed-down escapist fare”            110 (all about the NY Times article)

Indeed, reweighting language models to accommodate the complexities of unknowns has been around for a long time. The idea is called “back-off probability” and can be used for n-gram text modeling or even for rather elegant multi-length compression models like Prediction by Partial Matching (PPM). In both cases, the composition of the substrings becomes important when the phrasal whole has no reference. And when an unknown word appears in a sequence of knowns, the combined semantics of the knowns, along with an estimate of the unknown’s part of speech based on syntactic regularities, provides clues; “dumbed-down engorssian fare” could be about the arts or food, but it is likely not about space travel or delivery vans.
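
A minimal sketch of the back-off idea (a “stupid backoff” scoring scheme rather than full Katz back-off with discounting, and with an invented toy corpus and scaling factor): when the trigram has zero counts, fall back to the bigram, and then to the unigram, so a phrase like Lowe’s never bottoms out at zero just because the whole was never seen.

```python
# Illustration only: "stupid backoff" scoring for an unseen trigram.
from collections import Counter

corpus = ("the dumbed-down show was escapist fare and "
          "the escapist fare was dumbed-down").split()

unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = sum(unigrams.values())

def score(w1, w2, w3, alpha=0.4):
    """Back off from trigram to bigram to unigram, scaling by alpha each step."""
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    if bigrams[(w2, w3)] > 0:
        return alpha * bigrams[(w2, w3)] / unigrams[w2]
    return alpha * alpha * unigrams[w3] / total     # still nonzero for known words

print(score("dumbed-down", "escapist", "fare"))     # backs off, but not to zero
```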

And so we rescue Big Data from the threat of zeros.


Parsimonious Portmanteaus

Meaning is a problem. We think we might know what something means, but we keep being surprised by the facts, research, and logical difficulties that surround the notion of meaning. Putnam’s Representation and Reality runs through a few different ways of thinking about meaning, though without reaching any definitive conclusion beyond what meaning can’t be.

Children are a useful touchstone concerning meaning because we know that they acquire linguistic skills and consequently at least an operational understanding of meaning. And how they do so is rather interesting: first, presume that whole objects are the first topics for naming; next, assume that syntactic differences lead to semantic differences (“the dog” refers to the class of dogs while “Fido” refers to the instance); finally, prefer that linguistic differences point to semantic differences. Paul Bloom slices and dices the research in his Précis of How Children Learn the Meanings of Words, calling into question many core assumptions about the learning of words and meaning.

These preferences become useful if we want to try to formulate an algorithm that assigns meaning to objects or groups of objects. Probabilistic Latent Semantic Analysis, for example, assumes that words are signals from underlying probabilistic topic models and then derives those models by estimating all of the probabilities from the available signals. The outcome lacks labels, however: the “meaning” is expressed purely in terms of co-occurrences of terms. Reconciling an approach like PLSA with the observations about children’s meaning acquisition presents some difficulties. The process seems too slow, for example, which was always a complaint about connectionist architectures of artificial neural networks as well. As Bloom points out, kids don’t make many errors concerning meaning and when they do, they rapidly compensate.
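
For concreteness, here is a minimal sketch of PLSA’s EM updates on a toy document-term count matrix. The matrix, the number of topics, and the iteration count are invented for illustration; a real system would want tempered EM or a move to LDA. Note that the output is exactly what the paragraph above complains about: each topic is just a distribution over words, with no label attached.

```python
# Illustration only: PLSA via EM on a toy count matrix.
import numpy as np

rng = np.random.default_rng(1)
N = rng.integers(0, 5, size=(8, 20)).astype(float)   # documents x words counts
D, W = N.shape
K = 3                                                # number of latent topics

p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)  # p(w|z)
p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)  # p(z|d)

for _ in range(50):
    # E-step: responsibilities p(z|d,w) proportional to p(z|d) * p(w|z)
    r = p_z_d[:, :, None] * p_w_z[None, :, :]            # shape (D, K, W)
    r /= r.sum(axis=1, keepdims=True) + 1e-12
    # M-step: re-estimate both distributions from expected counts
    nz = N[:, None, :] * r                               # expected counts per (d, z, w)
    p_w_z = nz.sum(axis=0); p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
    p_z_d = nz.sum(axis=2); p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12

# Each "meaning" is still just a distribution over word indices -- no labels.
print(np.argsort(p_w_z, axis=1)[:, -5:])                 # top word indices per topic
```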

I’ve previously proposed a model for lexical acquisition that uses a coding hierarchy based on co-occurrence or other features. As new terms are observed, the hierarchy is built, in an unsupervised manner, by making local swaps and consolidations guided by minimum description length principles; it thus bears a close relationship to Nevill-Manning’s SEQUITUR approach to sequence learning. The approach has a limitation: in a tree-like grammar, examining every possible rearrangement of the grammar each time a new symbol arrives would place a massive computational burden on any cognitive correlates we might claim exist. Hence the system restricts itself to local swaps and consolidations.
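
To make the consolidation idea concrete, here is a simplified, offline sketch of digram consolidation: repeatedly replace the most frequent adjacent pair with a new rule. This is closer to Re-Pair than to the incremental, MDL-guided procedure described above, and it is purely illustrative rather than the model from my paper.

```python
# Illustration only: offline digram consolidation (Re-Pair-style simplification).
from collections import Counter

def build_grammar(seq):
    rules, next_id = {}, 0
    seq = list(seq)
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:                          # consolidating a unique pair saves nothing
            break
        sym = f"R{next_id}"; next_id += 1
        rules[sym] = pair                      # new rule: sym -> pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(sym); i += 2        # replace the digram with the rule symbol
            else:
                out.append(seq[i]); i += 1
        seq = out
    return seq, rules

compressed, rules = build_grammar("the bank teller at the bank".split())
print(compressed)   # ['R0', 'teller', 'at', 'R0']
print(rules)        # {'R0': ('the', 'bank')}
```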

It’s worth considering how such an approach might solve the cluster-labeling problem. If we cluster things together based on the parsimonious coding approach, the objects and their grammatical coordinations move higher up the tree. What is missing is a preference for adding new, distinctive terms that differentiate one grouping from another. For instance, in the toy sample given in my paper, neither “Financial Institution” nor “Retail Bank” is applied to the appropriate bank cluster, nor is “River Bank” applied to the other bank cluster. Instead we are just left with the shared context terms. I think this might be correctable in a larger grouping, however, by allowing a distinguishing series of portmanteaus to be constructed by composition from nearby (in the semantic region) concepts. So, as the co-occurrences of bank, teller, ATM, and loan pile up and get coded into groupings, the nearby finance/bank/retail bank/investment bank grouping is used to create a common portmanteau out of the most distinctive terms in the set, chosen so that they best distinguish it from the river semantic set. A crude sketch of that term-selection step follows.
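
The sketch below is only a crude stand-in for that selection step, not the method from the paper: score each term by how much more frequent it is in its own cluster than in the contrasting one, then join the top scorers into a label. The toy term counts are invented, and the scoring rule is just one of many plausible choices.

```python
# Illustration only: label a cluster with its most "distinctive" terms.
from collections import Counter

finance_bank = Counter({"bank": 9, "teller": 4, "ATM": 3, "loan": 5, "retail": 2})
river_bank   = Counter({"bank": 8, "river": 6, "erosion": 2, "fishing": 3})

def label(cluster, contrast, k=2):
    def distinctiveness(term):
        return cluster[term] / (1 + contrast[term])   # penalize terms shared with the contrast set
    top = sorted(cluster, key=distinctiveness, reverse=True)[:k]
    return " ".join(top)                              # a simple composed label

print(label(finance_bank, river_bank))   # 'loan teller'
print(label(river_bank, finance_bank))   # 'river fishing'
```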

Algorithmic Aesthetics

Tarbell art

Jared Tarbell’s work in algorithmic composition continues to amaze me. See more here. The relatively compact descriptions of complex landscapes lend themselves to treatment as aesthetic phenomena, where the scale of the grammars versus the complexity of the results raises the question of what art is and how it relates to human neurosystems.



Novelty in the Age of Criticism

Lower Manhattan panorama because I am in Jersey City, NJ as I write this, with an awesomely aesthetic view.

Gary Gutting of Notre Dame and the New York Times knows how to incite an intellectual riot, as demonstrated by his most recent piece for The Stone, Mozart vs. the Beatles. “High art” is superior to “low art” because of its “stunning intellectual and emotional complexity.” He sums up:

My argument is that this distinctively aesthetic value is of great importance in our lives and that works of high art achieve it much more fully than do works of popular art.

But what makes up these notions of complexity and distinctive aesthetic value? One might try to enumerate those values in a list. Alternatively, one might claim that time serves as a sieve for the values that Gutting says make one work of art superior to another, leaving open the possibility that the enumerated list is incomplete yet still a useful retrospective system of valuation.

I previously argued in a 1994 paper (published in 1997), Complexity Formalisms, Order and Disorder in the Structure of Art, that simplicity and random chaos exist in a careful balance in art, a balance that reflects the underlying grammatical systems we use to predict the environment. And Jürgen Schmidhuber took the approach further by applying algorithmic information theory to the novelty-seeking behavior that leads, in turn, to aesthetically pleasing models. The reflection of this behavioral optimization in our sideline preoccupations emerges as art, with the ultimate causation machine of evolution driving the proximate consequences for men and women.

But let’s get back to the flaw I see in Gutting’s argument, one that fits better with Schmidhuber’s approach: much of what is important in art is cultural novelty. Picasso is not aesthetically superior to the detailed hyper-reality of the Dutch Masters, for instance, but is notable for his cultural deconstruction of the role of art as photography and reproduction took hold. And the simplicity and unstructured chaos of the Abstract Expressionists is culturally significant as well. Importantly, changes in technology are essential to changes in artistic outlook, from the aforementioned role of photography in diminishing the aesthetic value of hand renderings to the application of electronic instruments in Philip Glass symphonies. Is Mozart better than Glass or Stravinsky? Using this newer standard for aesthetics, no, because Mozart was working skillfully (and perhaps brilliantly) but within the harmonic model of Classical composition and Classical forms. He was one of many. Wagner and Debussy, by comparison, changed the aural landscape, and by the time of tone rows and aleatoric composition, conventional musical aesthetics had been largely abandoned, if only fleetingly.

Modernism and postmodernism in prose and poetry follow similar trajectories, but I think there may have been a counter-opposing force to novelty seeking in much prose literature. That force is the requirement for narrative stories about human experiences, which is not a critical component of music or visual art. Human experience has a temporal flow and a spatial unity. When novelists break these requirements in complex ways, the writing becomes increasingly difficult to comprehend (perhaps a bit like aleatoric music?), so novelists more often cling to convention while using other prose tools and stylistic fireworks to enhance the reader’s aesthetic valuations. Novelty arrives less often, but when it does, it often brings greater challenges. Poetry has, by comparison, been more experimental in form and concept.

And architecture? Gutting’s Chartres versus Philip Johnson?

So, returning to Gutting: I have largely been arguing about the difficulty of calling one piece of what Gutting might declare high art aesthetically superior to another piece of high art. But my point is that if we use cultural novelty as the primary yardstick, then we need to reorder the valuations. Early rock and roll pioneers, early blues artists, early modern jazz impresarios—all the legends we can think of—get top billing alongside Debussy. Heavy metal, rap, and electronica inventors live proudly alongside the Baroque masters. And they will likely survive the test-of-time criterion, too, thanks to the invention of recording technologies that were not available to the Baroque composers.

Singularity and its Discontents

If a machine-based process can outperform a human being, is it significant? That weighty question hung in the background as I reviewed Jürgen Schmidhuber’s work on traffic sign classification. Similar results have emerged from IBM’s Watson competition and even on the TOEFL test. In each case, machines beat people.

But is that fact significant? There are a couple of ways to look at these kinds of comparisons. First, we can draw analogies to other human capabilities that machines eventually took over and note that their outperforming us was not overly profound. The wheel quickly outperformed human legs for moving heavy objects. The cup outperformed cupped hands for drinking water. This invites the realization that extending these physical comparisons leads to extraordinary juxtapositions: the airliner really outperformed human legs for transport, and so on. And this, in turn, is taken to justify the claim that since we are now outperforming human mental processes, we can only expect exponential improvements moving forward.

But this may be a category mistake in more than the obvious sense of confusing the mental with the physical. Instead, the category mismatch is between levels of complexity. The number of parts in a Boeing 747 is about 6 million, versus one moving human as the baseline (we could enumerate the cells and organelles, etc., but then we would need to enumerate the crystal lattices of the aircraft’s steel, so that level of granularity is a wash). The number of memory addresses in a big server is 64 x 10^9 or higher, with disk storage in the terabytes (10^12 bytes). Meanwhile, the human brain has about 100 x 10^9 neurons and 10^14 connections. So, with just 2 orders of magnitude between computers and brains, versus 6 between humans and planes, we find ourselves approaching Kurzweil’s argument that we have to wait until 2040. I’m more pessimistic and figure 2080, but then no one expects the Inquisition, either, to quote those esteemed philosophers, Monty Python.
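
Restating the paragraph’s own rough figures as a back-of-the-envelope check of those order-of-magnitude gaps:

```python
# The paragraph's own rough figures, restated.
import math

jet_parts      = 6e6    # parts in a Boeing 747
human_baseline = 1      # one moving human, at the same coarse granularity
server_storage = 1e12   # bytes of disk in a large server (terabytes)
brain_synapses = 1e14   # connections in a human brain

print(math.floor(math.log10(jet_parts / human_baseline)))       # 6 orders of magnitude
print(math.floor(math.log10(brain_synapses / server_storage)))  # 2 orders of magnitude
```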

We might push that date back even further, though, because we still lack a theory of the large-scale control of the collection of software modules needed to operate on that massive neural simulation. At least Schmidhuber’s work used an artificial neural network. The others were looser about any affiliation with actual human information processing, though the LSI work is mathematically similar to some kinds of ANNs in terms of outcomes.

So if analogies only serve to support a mild kind of techno-optimism, we can still think about the problem in other ways, by inverting the comparisons or emphasizing the risk of superintelligent machines. Thus is born the existential-risk school of technological singularities. But such concerns and planning don’t really address the question of whether superintelligent machines are actually possible, or whether current achievements are significant.

And that brings us to the third perspective: the focus on competitive outcomes in AI research leads to only mild advances in the state of the art, but it does lead to important social outcomes. These are Apollo moon shots, in other words. Whether or not they produce significant scientific advances, they stir the mind and the soul. That may transform the mild techno-optimism into moderate techno-optimism. And that’s OK, because the alternative is stationary fear.

Methodical Play

My fourteen-year-old interviewed a physicist yesterday. I had the privilege of being home over the weekend and listened in; my travel schedule has lately been brutal, with the only saving grace being moments like right now, en route to Chicago, when I can collapse into reading and writing for a few white-noise-washed moments. And the physicist, who happens to be his grandfather, said some remarkable things:

  • Physics consists of empirical layers of untruth
  • The scientific method is never used as formulated
  • Schools, while valuable, won’t teach how to be a scientist
  • The institutions of physics don’t support the creativity required to be a scientist

Yet there was no sense of anger or disillusionment in these statements, just a framing of the distinctions between the modern social model surrounding what scientists do and the complex reality of how they really do their work.

The positive message was that play is both the essential ingredient of and the missing determinant in the real “scientific method.” Mess around, try to explain, mess around some more. And what is all that play getting this remarkable octogenarian? Possible insights into the unification of electromagnetism and the strong nuclear force. The interview journeyed from the alignment of quarks to the beams of neutron stars, igniting the imaginations of all the minds on the call.

But if there is no real large-scale method to this madness, what might we conclude about the rationality of the process of science? I would argue that the algorithmic model of inference is perhaps the best (and least biased) way of approaching the question of scientific method. By constantly reshuffling the available parameters and testing the compressibility of the resulting models, play becomes indistinguishable from science when the play pivots on best explanation. A hypothesis is a short-range consequence of play, not a prerequisite.

So play and play some more, and enlighten the world. That’s the lesson of an 81-year-old for a young, inquisitive mind.

Curiouser and Curiouser

Jürgen Schmidhuber’s work on algorithmic information theory and curiosity is worth a few takes, if not more, for the researcher has done something that is both flawed and rather brilliant at the same time. The flaws emerge when we look deeply into the motivations for ideas like beauty (are symmetry and noncomplex encoding enough to explain sexual attraction? Well-understood evolutionary psychology is probably a better bet), but the core of his argument is worth considering.

If induction is an essential component of learning (and we might suppose it is, for argument’s sake), then why continue to examine different parameterizations of possible models for induction? Why be creative about how to explain things, as we expect of, and even idolize in, scientists?

So let us assume that induction is explained by the compression of patterns into better and better models using an information theoretic-style approach. Given this, Schmidhuber makes the startling leap that better compression and better models are best achieved by information harvesting behavior that involves finding novelty in the environment. Thus curiosity. Thus the implementation of action in support of ideas.

I proposed a similar model to explain aesthetic preferences for mid-ordered complex systems of notes, brush-strokes, etc. around 1994, but Schmidhuber’s approach has the benefit of not just characterizing the limitations and properties of aesthetic systems, but also justifying them. We find interest because we are programmed to find novelty, and we are programmed to find novelty because we want to optimize our predictive apparatus. The best optimization is actively seeking along the contours of the perceivable (and quantifiable) universe, and isolating the unknown patterns to improve our current model.
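
Here is a minimal sketch of the compression-progress idea, with a deliberately crude compressor standing in for the learner: the intrinsic reward for each observation is how much the model of the whole history improves, measured as the drop in code length (negative log-likelihood) after updating on that observation. The symbol stream and the Laplace-smoothed unigram model are toy assumptions; Schmidhuber’s formulation allows any improving compressor.

```python
# Illustration only: curiosity as compression progress over an observation stream.
import math
from collections import Counter

ALPHABET = 26   # assumed symbol inventory for Laplace smoothing

def code_length(history, counts, total):
    """Bits to encode the whole history under a Laplace-smoothed frequency model."""
    return -sum(math.log2((counts[s] + 1) / (total + ALPHABET)) for s in history)

counts, total, history = Counter(), 0, []
for t, symbol in enumerate("ab" * 15):
    history.append(symbol)
    before = code_length(history, counts, total)   # old model, extended history
    counts[symbol] += 1
    total += 1
    after = code_length(history, counts, total)    # model after learning the symbol
    reward = before - after                        # compression progress = curiosity
    if t % 5 == 0:
        print(t, symbol, round(reward, 3))
# The reward decays toward zero as the a/b regularity is absorbed: once a pattern
# is fully modeled it stops being interesting, which is the heart of the account.
```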

Industrial Revolution #4

Paul Krugman at the New York Times digests Robert Gordon’s analysis of economic growth and the role of technology and comes away more hopeful than Gordon. The kernel of Krugman’s hope is that Big Data analytics can provide a shortcut to intelligent machines by bypassing the specification and programming once assumed to be a requirement for artificial intelligence. Instead of specifying, we use “data-intensive ways” to achieve a better result. And we might get to IR#4, following Gordon’s taxonomy, where IR stands for “industrial revolution”: IR#1 was steam and locomotives; IR#2 was everything up to computers; IR#3 is computers and cell phones and whatnot.

Krugman implies that IR#4 might spur the typical economic consequences of grand technological change, including the massive displacement of workers, but like in previous revolutions it is also assumed that economic growth built from new industries will ultimately eclipse the negatives. This is not new, of course. Robert Anton Wilson argued decades ago for the R.I.C.H. economy (Rising Income through Cybernetic Homeostasis). Wilson may have been on acid, but Krugman wasn’t yet tuned in, man. (A brief aside: the Krugman/Wilson notions probably break down over extraction and agribusiness/land rights issues. If labor is completely replaced by intelligent machines, the land and the ingredients it contains nevertheless remain a bottleneck for economic growth. Look at the global demand for copper and rare earth materials, for instance.)

But why the particular focus on Big Data technologies? Krugman’s hope teeters on the assumption that data-intensive algorithms possess a fundamentally different scale and capacity than human-engineered approaches. Having risen through the computational linguistics and AI community working on data-driven methods for approaching intelligence, I can certainly sympathize with the motivation, but there are really only modest results to report at this time. For instance, statistical machine translation is still pretty poor quality, and is arguably no better than the rules-based methods of the 70s and 80s in anything other than scale and the diversity of languages covered. Recent achievements like the DARPA grand challenge for self-driving vehicles were not achieved through data-intensive methods but through careful examination of the limits of the baseline system. In that case, the baseline was a system that used a scanning laser rangefinder to avoid obstacles while following a map, and the improvement was marginally outrunning the distance limitations of the rangefinder by using optical image recognition to support a modest speedup. Speech recognition is better thanks to the accumulation of many examples of labeled, transcribed speech, true. And we can certainly guess that the relevance of advertising placed on a web page is better than it once was, if only because it is an easy problem to attack without the necessity of deep consideration of human understanding, unless you take our buying behavior to be a deep indicator of our beings. We can also see some glimmers of data-intensive methods in the IBM Watson system, though the Watson team will be the first to tell you that they dealt with only medium-scale data (Wikipedia) in the design of their system.

Still, there is a clear economic-growth argument for replacing workers in everything from manual drudgery straight through to fairly intelligent drudgery, which gives an economist like Krugman reason for hope. Now, if the limits on energy and resources can just be overcome, we can all retire to R.I.C.H., creative lives.