# The Deep Computing Lessons of Apollo

With the arrival of the Apollo 11 mission’s 45th anniversary, and occasional planning and dreaming about a manned mission to Mars, the role of information technology comes again into focus. The next great mission will include a phalanx of computing resources, sensors, radars, hyper spectral cameras, laser rangefinders, and information fusion visualization and analysis tools to knit together everything needed for the astronauts to succeed. Some of these capabilities will be autonomous, predictive, and knowledgable.

But it all began with the Apollo Guidance Computer or AGC, the rather sophisticated for-its-time computer that ran the trigonometric and vector calculations for the original moonshot. The AGC was startlingly simple in many ways, made up exclusively of NOR gates to implement Arithmetic Logic Unit-like functionality, shifts, and register opcodes combined with core memory (tiny ferromagnetic loops) in both RAM and ROM forms (the latter hand-woven by graduate students).

Using NOR gates to create the entire logic of the central processing unit is guided by a few simple principles. A NOR gate combines both NOT and OR functionality together and has the following logical functionality:

[table id=1 /]

The NOT-OR logic can be read as “if INPUT1 or INPUT2 is set to 1, then the OUTPUT should be 1, but then take the logical inversion (NOT) of that”. And, amazingly, circuits built from NORs can create any Boolean logic. NOT A is just NOR(A,A), which you can see from the following table:

[table id=2 /]

AND and OR can similarly be constructed by layering NORs together. For Apollo, the use of just a single type of integrated circuit that packaged NORs into chips improved reliability.

This level of simplicity has another important theoretical result that bears on the transition from simple guidance systems to potentially intelligent technologies for future Mars missions: a single layer of Boolean functions can only compute simple things. And as you layer on the functions you get increased complexity but complexity that is bounded by the depth of the logical function network. In fact, it can be proved that there are functions that can be represented in a k-depth network that can only be represented in a k-1 depth network if that network has exponentially many hidden units relative to the input size.

This is a startling theoretical discovery and motivates much of the deep learning research: functions for classification of Martian hyper spectral imagery need deep networks precisely because the complexity of the classification task rules out the use of shallower ones. Now mostly we are using artificial neural node simplifications to do this rather than Boolean primitives, but the motivations are the same.

But back to the crawling that predates the running: besides basic logical operations, how can we do something more usefully complex using NORs? Here’s an example logic circuit from circuitstoday.com that shows an adding circuit for adding together bits:

where each of the little half-moons are NORs with their inputs on the left and their outputs to the right. A and B are the inputs while S is the output, and C is the “carry” to the next significant bit. By combining these together, they can add arbitrarily large binary representations of integers with a circuit depth of 7 per 2 bit adder.

# Trees of Lives

With a brief respite between vacationing in the canyons of Colorado and leaving tomorrow for Australia, I’ve open-sourced an eight-year-old computer program for converting one’s DNA sequences into an artistic rendering. The input to the program are the allelic patterns from standard DNA analysis services that use the Short Tandem Repeat Polymorphisms from forensic analysis, as well as poetry reflecting one’s ethnic heritage. The output is generative art: a tree that overlays the sequences with the poetry and a background rendered from the sequences.

Generative art is perhaps one of the greatest aesthetic achievements of the late 20th Century. Generative art is, fundamentally, a recognition that the core of our humanity can be understood and converted into meaningful aesthetic products–it is the parallel of effective procedures in cognitive science, and developed in lock-step with the constructive efforts to reproduce and simulate human cognition.

To use Tree of Lives, install Java 1.8, unzip the package, and edit the supplied markconfig.txt to enter in your STRs and the allele variant numbers in sequence per line 15 of the configuration file. Lines 16+ are for lines of poetry that will be rendered on the limbs of the tree. Other configuration parameters can be discerned by examining com.treeoflives.CTreeConfig.java, and involve colors, paths, etc. Execute the program with:

`java -cp treeoflives.jar:iText-4.2.0-com.itextpdf.jar com.treeoflives.CAlleleRenderer markconfig.txt`

# Inching Towards Shannon’s Oblivion

Following Bill Joy’s concerns over the future world of nanotechnology, biological engineering, and robotics in 2000’s Why the Future Doesn’t Need Us, it has become fashionable to worry over “existential threats” to humanity. Nuclear power and weapons used to be dreadful enough, and clearly remain in the top five, but these rapidly developing technologies, asteroids, and global climate change have joined Oppenheimer’s misquoted “destroyer of all things” in portending our doom. Here’s Max Tegmark, Stephen Hawking, and others in Huffington Post warning again about artificial intelligence:

One can imagine such technology outsmarting financial markets, out-inventing human researchers, out-manipulating human leaders, and developing weapons we cannot even understand. Whereas the short-term impact of AI depends on who controls it, the long-term impact depends on whether it can be controlled at all.

I almost always begin my public talks on Big Data and intelligent systems with a presentation on industrial revolutions that progresses through Robert Gordon’s phases and then highlights Paul Krugman’s argument that Big Data and the intelligent systems improvements we are seeing potentially represent a next industrial revolution. I am usually less enthusiastic about the timeline than nonspecialists, but after giving a talk at PASS Business Analytics Friday in San Jose, I stuck around to listen in on a highly technical talk concerning statistical regularization and deep learning and I found myself enthused about the topic once again. Deep learning is using artificial neural networks to classify information, but is distinct from traditional ANNs in that the systems are pre-trained using auto-encoders to have a general knowledge about the data domain. To be clear, though, most of the problems that have been tackled are “subsymbolic” for image recognition and speech problems. Still, the improvements have been fairly impressive based on some pretty simple ideas. First, the pre-training is accompanied by systematic bottlenecking of the number of nodes that can be used for learning. Second, the amount that each fires is kept low to avoid overfitting to nodes with dominating magnitudes. Together, the auto-encoders learn the patterns without training and can then be trained faster and easier to associate those patterns with classes.

I still have my doubts concerning the threat timeline, however. For one, these are mostly sub-symbolic systems that are not capable of the kinds of self-directed system modifications that many fear can lead to exponential self-improvement. Second, the tasks that are seeing improvements are not new but just relatively well-known classification problems. Finally, the improvements, while impressive, are incremental improvements. There is probably a meaningful threat profile that can convert into a decision tree for when action is needed. For global climate change there are consensus estimates about sea level changes for instance. For Evil AI I think we need to wait for a single act of machine intelligence out-of-control before spending excessively on containment, policy, or regulation. In the meantime, though, keep a close eye on your laptop.

And then there’s the mild misanthropy of Claude Shannon, possibly driven by living too long in New Jersey:

I visualize a time when we will be to robots what dogs are to humans, and I’m rooting for the machines.

# Parsimonious Portmanteaus

Meaning is a problem. We think we might know what something means but we keep being surprised by the facts, research, and logical difficulties that surround the notion of meaning. Putnam’s Representation and Reality runs through a few different ways of thinking about meaning, though without reaching any definitive conclusions beyond what meaning can’t be.

Children are a useful touchstone concerning meaning because we know that they acquire linguistic skills and consequently at least an operational understanding of meaning. And how they do so is rather interesting: first, presume that whole objects are the first topics for naming; next, assume that syntactic differences lead to semantic differences (“the dog” refers to the class of dogs while “Fido” refers to the instance); finally, prefer that linguistic differences point to semantic differences. Paul Bloom slices and dices the research in his Précis of How Children Learn the Meanings of Words, calling into question many core assumptions about the learning of words and meaning.

These preferences become useful if we want to try to formulate an algorithm that assigns meaning to objects or groups of objects. Probabilistic Latent Semantic Analysis, for example, assumes that words are signals from underlying probabilistic topic models and then derives those models by estimating all of the probabilities from the available signals. The outcome lacks labels, however: the “meaning” is expressed purely in terms of co-occurrences of terms. Reconciling an approach like PLSA with the observations about children’s meaning acquisition presents some difficulties. The process seems too slow, for example, which was always a complaint about connectionist architectures of artificial neural networks as well. As Bloom points out, kids don’t make many errors concerning meaning and when they do, they rapidly compensate.

I’ve previously proposed a model for lexical acquisition that uses a coding hierarchy based on co-occurrence or other features. As new terms are observed, the hierarchy builds, in an unsupervised manner, by making local swaps and consolidations based on minimum description length principles. Thus, it bears a close relationship to Nevill-Manning’s SEQUITUR approach to sequence learning. There is a limitation to the approach in that in a tree-like grammar the complexity of examining all possible re-arrangements of the grammar when new symbols arrive seems to put a massive burden on any cognitive correlates that we might claim exist. Thus the system just uses local swaps and consolidations.

It’s worth considering how such an approach might solve the cluster labeling problem. If we cluster things together based on the parsimonious coding approach, the objects and their grammatical coordinations move higher up the tree. What is missing is a preference for adding new, distinctive terms that differentiate one grouping from another. For instance, in the toy sample given in my paper, “Financial Institution” or “Retail Bank” are not applied to the appropriate bank cluster, nor is “River Bank” applied to the other bank cluster. Instead we are just left with the shared context terms. I think this might be correctable in a larger grouping, however, by allowing for a distinguishing series of portmanteaus to be constructed by composition from nearby (in the semantic region) concepts. So, as the co-occurrences of bank and teller and ATM and loan pile up and get coded into groupings, the nearby finance, bank, retail bank, investment bank grouping is used to create a common portmanteau out of the most distinctive terms out of the set, and such that they most distinguish from the river semantic set.

# In Like Flynn

The exceptionally interesting James Flynn explains the cognitive history of the past century and what it means in terms of human intelligence in this TED talk:

What does the future hold? While we might decry the “twitch” generation and their inundation by social media, gaming stimulation, and instant interpersonal engagement, the slowing observed in the Flynn Effect might be getting ready for another ramp-up over the next 100 years.

Perhaps most intriguing is the discussion of the ability to think in terms of hypotheticals as a a core component of ethical reasoning. Ethics is about gaming outcomes and also about empathizing with others. The influence of media as a delivery mechanism for narratives about others emerged just as those changes in cognitive capabilities were beginning to mature in the 20th Century. Widespread media had a compounding effect on the core abstract thinking capacity, and with the expansion of smartphones and informational flow, we may only have a few generations to go before the necessary ingredients for good ethical reasoning are widespread even in hard-to-reach areas of the world.

# Contingency and Irreducibility

Thomas Nagel returns to defend his doubt concerning the completeness—if not the efficacy—of materialism in the explanation of mental phenomena in the New York Times. He quickly lays out the possibilities:

1. Consciousness is an easy product of neurophysiological processes
2. Consciousness is an illusion
3. Consciousness is a fluke side-effect of other processes
4. Consciousness is a divine property supervened on the physical world

Nagel arrives at a conclusion that all four are incorrect and that a naturalistic explanation is possible that isn’t “merely” (1), but that is at least (1), yet something more. I previously commented on the argument, here, but the refinement of the specifications requires a more targeted response.

Let’s call Nagel’s new perspective Theory 1+ for simplicity. What form might 1+ take on? For Nagel, the notion seems to be a combination of Chalmers-style qualia combined with a deep appreciation for the contingencies that factor into the personal evolution of individual consciousness. The latter is certainly redundant in that individuality must be absolutely tied to personal experiences and narratives.

We might be able to get some traction on this concept by looking to biological evolution, though “ontogeny recapitulates phylogeny” is about as close as we can get to the topic because any kind of evolutionary psychology must be looking for patterns that reinforce the interpretation of basic aspects of cognitive evolution (sex, reproduction, etc.) rather than explore the more numinous aspects of conscious development. So we might instead look for parallel theories that focus on the uniqueness of outcomes, that reify the temporal evolution without reference to controlling biology, and we get to ideas like uncomputability as a backstop. More specifically, we can explore ideas like computational irreducibility to support the development of Nagel’s new theory; insofar as the environment lapses towards weak predictability, a consciousness that self-observes, regulates, and builds many complex models and metamodels is superior to those that do not.

I think we already knew that, though. Perhaps Nagel has been too much a philosopher and too little involved in the sciences that surround and enervate modern theories of learning and adaption to see the movement towards the exits?

# Singularity and its Discontents

If a machine-based process can outperform a human being is it significant? That weighty question hung in the background as I reviewed Jürgen Schmidhuber’s work on traffic sign classification. Similar results have emerged from IBM’s Watson competition and even on the TOEFL test. In each case, machines beat people.

But is that fact significant? There are a couple of ways we can look at these kinds of comparisons. First, we can draw analogies to other capabilities that were not accessible by mechanical aid and show that the fact that they outperformed humans was not overly profound. The wheel quickly outperformed human legs for moving heavy objects. The cup outperformed the hands for drinking water. This then invites the realization that the extension of these physical comparisons leads to extraordinary juxtapositions: the airline really outperformed human legs for transport, etc. And this, in turn, justifies the claim that since we are now just outperforming human mental processes, we can only expect exponential improvements moving forward.

But this may be a category mistake in more than the obvious differentiator of the mental and the physical. Instead, the category mismatch is between levels of complexity. The number of parts in a Boeing 747 is 6 million versus one moving human as the baseline (we could enumerate the cells and organelles, etc., but then we would need to enumerate the crystal lattices of the aircraft steel, so that level of granularity is a wash). The number of memory addresses in a big server computer is 64 x 10^9 or higher, with disk storage in the TBs (10^12). Meanwhile, the human brain has 100 x 10^9 neurons and 10^14 connections. So, with just 2 orders of magnitude between computers and brains versus 6 between humans and planes, we find ourselves approaching Kurzweil’s argument that we have to wait until 2040. I’m more pessimistic and figure 2080, but then no one expects the Inquisition, either, to quote the esteemed philosophers, Monty Python.

We might move that back even further, though, because we still lack a theory of the large scale control of the collected software modules needed to operate on that massive neural simulation. At least Schmidhuber’s work used an artifical neural network. The others were looser about any affiliation to actual human information processing, though the LSI work is mathematically similar to some kinds of ANNs in terms of outcomes.

So if analogies only serve to support a mild kind of techno-optimism, we still can think about the problem in other ways by inverting the comparisons or emphasizing the risk of superintelligent machines. Thus is born the existential risk school of technological singularities. But such concerns and planning doesn’t really address the question of whether superintelligent machines are actually possible, or whether current achievements are significant.

And that brings us to the third perspective: the focus on competitive outcomes in AI research leads to only mild advances in the state-of-the-art, but does lead to important social outcomes. These are Apollo moon shots, in other words. Regardless of significant scientific advances, they stir the mind and the soul. It may transform the mild techno-optimism into moderate techo-optimism. And that’s OK, because the alternative is stationary fear.

# Curiouser and Curiouser

Jürgen Schmidhuber’s work on algorithmic information theory and curiosity is worth a few takes, if not more, for the researcher has done something that is both flawed and rather brilliant at the same time. The flaws emerge when we start to look deeply into the motivations for ideas like beauty (is symmetry and noncomplex encoding enough to explain sexual attraction? Well-understood evolutionary psychology is probably a better bet), but the core of his argument is worth considering.

If induction is an essential component of learning (and we might suppose it is for argument’s sake), then why continue to examine different parameterizations of possible models for induction? Why be creative about how to explain things, like we expect and even idolize of scientists?

So let us assume that induction is explained by the compression of patterns into better and better models using an information theoretic-style approach. Given this, Schmidhuber makes the startling leap that better compression and better models are best achieved by information harvesting behavior that involves finding novelty in the environment. Thus curiosity. Thus the implementation of action in support of ideas.

I proposed a similar model to explain aesthetic preferences for mid-ordered complex systems of notes, brush-strokes, etc. around 1994, but Schmidhuber’s approach has the benefit of not just characterizing the limitations and properties of aesthetic systems, but also justifying them. We find interest because we are programmed to find novelty, and we are programmed to find novelty because we want to optimize our predictive apparatus. The best optimization is actively seeking along the contours of the perceivable (and quantifiable) universe, and isolating the unknown patterns to improve our current model.

# Industrial Revolution #4

Paul Krugman at New York Times consumes Robert Gordon’s analysis of economic growth and the role of technology and comes up more hopeful than Gordon. The kernel in Krugman’s hope is that Big Data analytics can provide a shortcut to intelligent machines by bypassing the requirement for specification and programming that was once assumed to be a requirement for artificial intelligence. Instead, we don’t specify but use “data-intensive ways” to achieve a better result. And we might get to IR#4, following Gordon’s taxonomy where IR stands for “industrial revolution.” IR#1 was steam and locomotives  IR#2 was everything up to computers. IR#3 is computers and cell phones and whatnot.

Krugman implies that IR#4 might spur the typical economic consequences of grand technological change, including the massive displacement of workers, but like in previous revolutions it is also assumed that economic growth built from new industries will ultimately eclipse the negatives. This is not new, of course. Robert Anton Wilson argued decades ago for the R.I.C.H. economy (Rising Income through Cybernetic Homeostasis). Wilson may have been on acid, but Krugman wasn’t yet tuned in, man. (A brief aside: the Krugman/Wilson notions probably break down over extraction and agribusiness/land rights issues. If labor is completely replaced by intelligent machines, the land and the ingredients it contains nevertheless remain a bottleneck for economic growth. Look at the global demand for copper and rare earth materials, for instance.)

But why the particular focus on Big Data technologies? Krugman’s hope teeters on the assumption that data-intensive algorithms possess a fundamentally different scale and capacity than human-engineered approaches. Having risen through the computational linguistics and AI community working on data-driven methods for approaching intelligence, I can certainly sympathize with the motivation, but there are really only modest results to report at this time. For instance, statistical machine translation is still pretty poor quality, and is arguably not of better quality than the rules-based methods from the 70s and 80s in anything other than scale and diversity of the languages that are being used. Recent achievements like the DARPA grand challenge for self-driving vehicles were not achieved through data-intensive methods but through careful examination of the limits of the baseline system. In that case, baseline meant a system that used a scanning laser rangefinder to avoid obstacles while following a map and an improvement meant marginally outrunning the distance limitations of the rangefinder by using optical image recognition to support a modest speedup. Speech recognition is better due to accumulating many examples of labeled, categorized text, true. And we can certainly guess that the relevance of advertising placed on a web page is better than it once was, if only because it is an easy problem to attack without the necessity of deep considerations of human understanding–unless you take our buying behavior to be a deep indicator of our beings. We can also see some glimmers of data-intensive methods in the IBM Watson system, though the Watson team will be the first to tell you that they dealt with only medium-scale data (wikipedia) in the design of their system.

Still, there is a clear economic-growth argument for the upshot of replacing workers in manual drudgery straight through to fairly intelligent drudgery, which gives an economist like Krugman reason for hope. Now, if the limitations of energy and resource requirements can just be replaced, we can all retire to RICH, creative lives.