On top of Chomsky’s Managua Lectures and the Rolling Stone Special Edition on Kurt Cobain, and always an antidote to other pressing matters:
Rest in peace, Hilary Putnam.
On top of Chomsky’s Managua Lectures and the Rolling Stone Special Edition on Kurt Cobain, and always an antidote to other pressing matters:
Rest in peace, Hilary Putnam.
Michael Shermer and Sam Harris got together with an audience at Caltech to beat up on Deepak Chopra and a “storyteller” named Jean Houston in The Future of God debate hosted by ABC News. And Deepak got uncharacteristically angry back behind his crystal-embellished eyewear, especially at Shermer’s assertion that Deepak is just talking “woo-woo.”
But is there any basis for the woo-woo that Deepak is weaving? As it turns out, he is building on some fairly impressive work by Stuart Hameroff, MD, of University of Arizona and Sir Roger Penrose of Oxford University. Under development for more than 25 years, this work has most recently been summed up in their 2014 paper, “Consciousness in the universe: A review of the ‘Orch OR’ theory” available for free (but not the commentaries, alas). Deepak was even invited to comment on the paper in Physics of Life Reviews, though the content of his commentary was challenged as being somewhat orthogonal or contradictory to the main argument.
To start somewhere near the beginning, Penrose became obsessed with the limits of computation in the late 80s. The Halting Problem sums up his concerns about the idea that human minds can possibly be isomorphic with computational devices. There seems to be something that allows for breaking free of the limits of “mere” Turing Complete computation to Penrose. Whatever that something is, it should be physical and reside within the structure of the brain itself. Hameroff and Penrose would also like that something to explain consciousness and all of its confusing manifestations, for surely consciousness is part of that brain operation.
Now, to get at some necessary and sufficient sorts of explanations for this new model requires looking at Hameroff’s medical speciality: anesthesiology. Anyone who has had surgery has had the experience of consciousness going away while body function continues on, still mediated by brain activities. So certain drugs like halothane erase consciousness through some very targeted action. Next, consider that certain prokaryotes have internally coordinated behaviors without the presence of a nervous system. Finally, consider that it looks like most neurons do not integrate and fire like the classic model (and the model that artificial neural networks emulate), but instead have some very strange and random activation behaviors in the presence of the same stimuli.
How do these relate? Hameroff has been very focused on one particular component to the internal architecture of neural cells: microtubules or MTs. These are very small (compared to cellular scale) and there are millions in neurons (10^9 or so). They are just cylindrical polymers with some specific chemical properties. They also are small enough (25nm in diameter) that it might be possible that quantum effects are present in their architecture. There is some very recent evidence to this effect based on strange reactions of MTs to tiny currents of varying frequencies used to probe them. Also, anesthetics appear to bind to MTs, Indeed, they could also provide a memory strata that is orders of magnitude greater than the traditional interneuron concept of how memories form.
But what does this have to do with consciousness beyond the idea that MTs get interfered with by anesthetics and therefore might be around or part of the machinery that we label conscious? They also appear to be related to Alzheimer’s disease, but this could be just related to the same machinery. Well, this is where we get woo-woo-ey. If consciousness is not just an epiphenomena arising from standard brain function as a molecular computer, and it is also not some kind of dualistic soul overlay, then maybe it is something that is there but is not a classical computer. Hence quantum effects.
So Sir Penrose has been promoting a rather wild conjecture called the Diósi-Penrose theory that puts an upper limit on the amount of time a quantum superposition can survive. It does this based on some arguments I don’t fully understand but that integrate gravity with quantum phenomena to suggest that the mass displaced by the superposed wave functions snaps the superposition into wave collapse. So Schrödinger’s cat dies or lives very quickly even without an observer because there are a lot of superposed quantum particles in a big old cat and therefore very rapid resolution of the wave function evolution (10^-24s). Single particles can live in superposition for much longer because the mass difference between their wave functions is very small.
Hence the OR in “Orch OR” stands for Objective Resolution: wave functions are subject to collapse by probing but they also collapse just because they are unstable in that state. The resolution is objective and not subjective. The “Orch” stands for “Orchestrated.” And there is the seat of consciousness in the Hameroff-Penrose theory. In MTs little wave function collapses are constantly occurring and the presence of superposition means quantum computing can occur. And the presence of quantum computing means that non-classical computation can take place and maybe even be more than Turing Complete.
Now the authors are careful to suggest that these are actually proto-conscious events and that only their large-scale orchestration leads to what we associate with consciousness per se. Otherwise they are just quantum superpositions that collapse, maybe with 1 qubit of resolution under the right circumstances.
At least we know the cat has a fate now. That fate is due to an objective event, too, and not some added woo-woo from the strange world of quantum phenomena. And the cat’s curiosity is part of the same conscious machinery.
Xu and Tenebaum, in Word Learning as Bayesian Inference (Psychological Review, 2007), develop a very simple Bayesian model of how children (and even adults) build semantic associations based on accumulated evidence. In short, they find contrastive elimination approaches as well as connectionist methods unable to explain the patterns that are observed. Specifically, the most salient problem with these other methods is that they lack the rapid transition that is seen when three exemplars are presented for a class of objects associated with a word versus one exemplar. Adults and kids (the former even more so) just get word meanings faster than those other models can easily show. Moreover, a space of contending hypotheses that are weighted according to their Bayesian statistics, provides an escape from the all-or-nothing of hypothesis elimination and some of the “soft” commitment properties that connectionist models provide.
The mathematical trick for the rapid transition is rather interesting. They formulate a “size principle” that weights the likelihood of a given hypothesis (this object is most similar to a “feb,” for instance, rather than the many other object sets that are available) according to a scaling that is exponential in the number of exposures. Hence the rapid transition:
Hypotheses with smaller extensions assign greater probability than do larger hypotheses to the same data, and they assign exponentially greater probability as the number of consistent examples increases.
It should be noted that they don’t claim that the psychological or brain machinery implements exactly this algorithm. As is usual in these matters, it is instead likely that whatever machinery is involved, it simply has at least these properties. It may very well be that connectionist architectures can do the same but that existing approaches to connectionism simply don’t do it quite the right way. So other methods may need to be tweaked to get closer to the observed learning of people in these word tasks.
So what can this tell us about epistemology and belief? Classical foundationalism might be formulated as something is a “basic” or “justified” belief if it is self-evident or evident to our senses. Other beliefs may therefore be grounded by those basic beliefs. And a more modern reformulation might substitute “incorrigible” for “justified” with the layered meaning of incorrigibility built on the necessity that given the proposition it is in fact true.
Here’s Alvin Plantinga laying out a case for why justified and incorrigibility have a range of problems, problems serious enough for Plantinga that he suspects that god belief could just as easily be a basic belief, allowing for the kinds of presuppositional Natural Theology (think: I look around me and the hand of God is obvious) that is at the heart of some of the loftier claims concerning the viability or non-irrationality of god belief. It even provides a kind of coherent interpretative framework for historical interpretation.
Plantinga positions the problem of properly basic belief then as an inductive problem:
And hence the proper way to arrive at such a criterion is, broadly speaking, inductive. We must assemble examples of beliefs and conditions such that the former are obviously properly basic in the latter, and examples of beliefs and conditions such that the former are obviously not properly basic in the latter. We must then frame hypotheses as to the necessary and sufficient conditions of proper basicality and test these hypothesis by reference to those examples. Under the right conditions, for example, it is clearly rational to believe that you see a human person before you: a being who has thoughts and feelings, who knows and believes things, who makes decisions and acts. It is clear, furthermore, that you are under no obligation to reason to this belief from others you hold; under those conditions that belief is properly basic for you.
He goes on to conclude that this opens up the god hypothesis as providing this kind of coherence mechanism:
By way of conclusion then: being self-evident, or incorrigible, or evident to the senses is not a necessary condition of proper basicality. Furthermore, one who holds that belief in God is properly basic is not thereby committed to the idea that belief in God is groundless or gratuitous or without justifying circumstances. And even if he lacks a general criterion of proper basicality, he is not obliged to suppose that just any or nearly any belief—belief in the Great Pumpkin, for example—is properly basic. Like everyone should, he begins with examples; and he may take belief in the Great Pumpkin as a paradigm of irrational basic belief.
So let’s assume that the word learning mechanism based on this Bayesian scaling is representative of our human inductive capacities. Now this may or may not be broadly true. It is possible that it is true of words but not other domains of perceptual phenomena. Nevertheless, given this scaling property, the relative inductive truth of a given proposition (a meaning hypothesis) is strictly Bayesian. Moreover, this doesn’t succumb to problems of verificationalism because it only claims relative truth. Properly basic or basic is then the scaled contending explanatory hypotheses and the god hypothesis has to compete with other explanations like evolutionary theory (for human origins), empirical evidence of materialism (for explanations contra supernatural ones), perceptual mistakes (ditto), myth scholarship, textual analysis, influence of parental belief exposure, the psychology of wish fulfillment, the pragmatic triumph of science, etc. etc.
And so we can stick to a relative scaling of hypotheses as to what constitutes basicality or justified true belief. That’s fine. We can continue to argue the previous points as to whether they support or override one hypothesis or another. But the question Plantinga raises as to what ethics to apply in making those decisions is important. He distinguishes different reasons why one might want to believe more true things than others (broadly) or maybe some things as properly basic rather than others, or, more correctly, why philosophers feel the need to pin god-belief as irrational. But we succumb to a kind of unsatisfying relativism insofar as the space of these hypotheses is not, in fact, weighted in a manner that most reflects the known facts. The relativism gets deeper when the weighting is washed out by wish fulfillment, pragmatism, aspirations, and personal insights that lack falsifiability. That is at least distasteful, maybe aretetically so (in Plantinga’s framework) but probably more teleologically so in that it influences other decision-making and the conflicts and real harms societies may cause.
Research can flow into interesting little eddies that cohere into larger circulations that become transformative phase shifts. That happened to me this morning between a morning drive in the Northern California hills and departing for lunch at one of our favorite restaurants in Danville.
The topic I’ve been working on since my retirement is whether there are preferential representations for optimal automated inference methods. We have this grab-bag of machine learning techniques that use differing data structures but that all implement some variation on fitting functions to data exemplars; at the most general they all look like some kind of gradient descent on an error surface. Getting the right mix of parameters, nodes, etc. falls to some kind of statistical regularization or bottlenecking for the algorithms. Or maybe you perform a grid search in the hyperparameter space, narrowing down the right mix. Or you can throw up your hands and try to evolve your way to a solution, suspecting that there may be local optima that are distracting the algorithms from global success.
Yet, algorithmic information theory (AIT) gives us, via Solomonoff, a framework for balancing parameterization of an inference algorithm against the error rate on the training set. But, first, it’s all uncomputable and, second, the AIT framework just uses strings of binary as the coded Turing machines, so I would have to flip 2^N bits and test each representation to get anywhere with the theory. Yet, I and many others have had incremental success at using variations on this framework, whether via Minimum Description Length (MDL) principles, it’s first cousin Minimum Message Length (MML), and other statistical regularization approaches that are somewhat proxies for these techniques. But we almost always choose a model (ANNs, compression lexicons, etc.) and then optimize the parameters around that framework. Can we do better? Is there a preferential model for time series versus static data? How about for discrete versus continuous?
So while researching model selection in this framework, I come upon a mention of Shannon’s information theory and its application to quantum decoherence. Of course I had to investigate. And here is the most interesting thing I’ve seen in months from the always interesting Max Tegmark at MIT:
Particles entangle and then quantum decoherence causes them to shed entropy into one another during interaction. But, most interesting, is the quantum Bayes’ theory section around 00:35:00 where Shannon entropy as a classical measure of improbability gets applied to the quantum indeterminacy through this decoherence process.
I’m pretty sure it sheds no particular light on the problem of model selection but when cosmology and machine learning issues converge it gives me mild shivers of joy.
A German kid caught me talking to myself yesterday. It was my fault, really. I was trying to break a hypnotic trance-like repetition of exactly what I was going to say to the tramper’s hut warden about two hours away. OK, more specifically, I had left the Waihohonu camp site in Tongariro National Park at 7:30AM and was planning to walk out that day. To put this into perspective, it’s 28.8 km (17.9 miles) with elevation changes of around 900m, including a ridiculous final assault above red crater at something like 60 degrees along a stinking volcanic ridge line. And, to make things extra lovely, there was hail, then snow, then torrential downpours punctuated by hail again—a lovely tramp in the New Zealand summer—all in a full pack.
But anyway, enough bragging about my questionable judgement. I was driven by thoughts of a hot shower and the duck l’orange at Chateau Tongariro while my hands numbed to unfeeling arresting myself with trekking poles down through muddy canyons. I was talking to myself. I was trying to stop repeating to myself why I didn’t want my campsite for the night that I had reserved. This is the opposite of glorious runner’s high. This is when all the extra blood from one’s brain is obsessed with either making leg muscles go or watching how the feet will fall. I also had the hood of my rain fly up over my little Marmot ball cap. I was in full regalia, too, with the shifting rub of my Gortex rain pants a constant presence throughout the day. I didn’t notice him easing up on me as I carried on about one-shot learning as some kind of trance-breaking ritual.
We exchanged pleasantries and he meandered on. With his tiny little day pack it was clear he had just come up from the car park at Mangatepopo for a little jaunt. Eurowimp. I caught up with him later slathering some kind of meat product on white bread trailside and pushed by, waiting on my own lunch of jerky, chili-tuna, crackers, and glorious spring water, gulp after gulp, an hour onward. He didn’t bring up the glossolalic soliloquy incident.
My mantra was simple: artificial neural networks, including deep learning approaches, require massive learning cycles and huge numbers of exemplars to learn. In a classic test, scores of handwritten digit images (0 to 9) are categorized as to which number they are. Deep learning systems have gotten to 99% accuracy on that problem, actually besting average human performance. Yet they require a huge training corpus to pull this off, combined with many CPU hours to optimize the models on that corpus. We humans can do much better than that with our neural systems.
So we get this recently lauded effort, One-Shot Learning of Visual Concepts, that uses an extremely complicated Bayesian mixture modeling approach that combines stroke exemplars together for trying to classify foreign and never-before-seen characters (like Bengali or Ethiopic) after only one exposure to the stimulus. In other words, if I show you some weird character with some curves and arcs and a vertical bar in it, you can find similar ones in a test set quite handily, but machines really can’t. A deep learning model could be trained on every possible example known in a long, laborious process, but when exposed to a new script like Amharic or a Cherokee syllabary, the generalizations break down. A simple comparison approach is to use a nearest neighbor match or vote. That is, simply create vectors of the image pixels starting at the top left and compare the distance between the new image vector and the example using an inner vector product. Similar things look the same and have similar pixel patterns, right? Well, except they are rotated. They are shifted. They are enlarged and shrunken.
And then it hit me that the crazy-complex stroke model could be simplified quite radically by simply building a similar collection of stroke primitives as splines and then looking at the K nearest neighbors in the stroke space. So a T is two strokes drawn from the primitives collection with a central junction and the horizontal laying atop the vertical. This builds on the stroke-based intuition of the paper’s authors (basically, all written scripts have strokes as a central feature and we as writers and readers understand the line-ness of them from experience with our own script).
I may have to try this out. I should note, also in critique of this antithesis of runner’s high (tramping doldrums?), that I was also deeply concerned that there were so many damn contending voices and thoughts racing around my head in the face of such incredible scenery. Why did I feel the need to distract my mind from it’s obsessions over something so humanly trivial? At least, I suppose, the distraction was interesting enough that it was worth the effort.
I picked up a whitebait pizza while stopped along the West Coast of New Zealand tonight. Whitebait are tiny little swarming immature fish that can be scooped out of estuarial river flows using big-mouthed nets. They run, they dart, and it is illegal to change river exit points to try to channel them for capture. Hence, whitebait is semi-precious, commanding NZD70-130/kg, which explains why there was a size limit on my pizza: only the small one was available.
By the time I was finished the sky had aged from cinereal to iron in a satire of the vivid, watch-me colors of CNN International flashing Donald Trump’s linguistic indirection across the television. I crept out, setting my headlamp to red LEDs designed to minimally interfere with night vision. Just up away from the coast, hidden in the impossible tangle of cold rainforest, there was a glow worm dell. A few tourists conjured with flashlights facing the ground to avoid upsetting the tiny arachnocampa luminosa that clung to the walls inside the dark garden. They were like faint stars composed into irrelevant constellations, with only the human mind to blame for any observed patterns.
And the light, what light, like white-light LEDs recently invented, but a light that doesn’t flicker or change, and is steady under the calmest observation. Driven by luciferin and luciferase, these tiny creatures lure a few scant light-seeking creatures to their doom and as food for absorption until they emerge to mate, briefly, lay eggs, and then die.
Lucifer again, named properly from the Latin as the light bringer, the chemical basis for bioluminescence was largely isolated in the middle of the 20th Century. Yet there is this biblical stigma hanging over the term—one that really makes no sense at all. The translation of morning star or some other such nonsense into Latin got corrupted into a proper name by a process of word conversion (this isn’t metonymy or something like that; I’m not sure there is a word for it other than “mistake”). So much for some kind of divine literalism tracking mechanism that preserves perfection. Even Jesus got rendered as lucifer in some passages.
But nothing new, here. Demon comes from the Greek daemon and Christianity tried to, well, demonize all the ancient spirits during the monolatry to monotheism transition. The spirits of the air that were in a constant flux for the Hellenists, then the Romans, needed to be suppressed and given an oppositional position to the Christian soteriology. Even “Satan” may have been borrowed from Persian court drama as a kind of spy or informant after the exile.
Oddly, we are left with a kind of naming magic for the truly devout who might look at those indifferent little glow worms with some kind of castigating eye, corrupted by a semantic chain that is as kinked as the popular culture epithets of Lucifer himself.
Slides from a talk I gave today on current advances in machine learning are available in PDF, below. The agenda is pretty straightforward: starting with some theory about overfitting based on algorithmic information theory, we proceed on through a taxonomy of ML types (not exhaustive), then dip into ensemble learning and deep learning approaches. An analysis of the difficulty and types of performance we get from various algorithms and problems is presented. We end with a discussion of whether we should be frightened about the progress we see around us.
Note: click on the gray square if you don’t see the embedded PDF…browsers vary.
Sitting at JFK reading Scholarpedia’s comprehensive entry on Recurrent Neural Networks, waiting for my flight to Iceland with my kid while the sun drops just to the north of Freedom Tower…priceless.
Carl Schulman and Nick Bostrom argue about anthropic principles in “How Hard is Artificial Intelligence? Evolutionary Arguments and Selection Effects” (Journal of Consciousness Studies, 2012, 19:7-8), focusing on specific models for how the assumption of human-level intelligence should be easy to automate are built upon a foundation of assumptions of what easy means because of observational bias (we assume we are intelligent, so the observation of intelligence seems likely).
Yet the analysis of this presumption is blocked by a prior consideration: given that we are intelligent, we should be able to achieve artificial, simulated intelligence. If this is not, in fact, true, then the utility of determining whether the assumption of our own intelligence being highly probable is warranted becomes irrelevant because we may not be able to demonstrate that artificial intelligence is achievable anyway. About this, the authors are dismissive concerning any requirement for simulating the environment that is a prerequisite for organismal and species optimization against that environment:
In the limiting case, if complete microphysical accuracy were insisted upon, the computational requirements would balloon to utterly infeasible proportions. However, such extreme pessimism seems unlikely to be well founded; it seems unlikely that the best environment for evolving intelligence is one that mimics nature as closely as possible. It is, on the contrary, plausible that it would be more efficient to use an artificial selection environment, one quite unlike that of our ancestors, an environment specifically designed to promote adaptations that increase the type of intelligence we are seeking to evolve (say, abstract reasoning and general problem-solving skills as opposed to maximally fast instinctual reactions or a highly optimized visual system).
Why is this “unlikely”? The argument is that there are classes of mental function that can be compartmentalized away from the broader, known evolutionary provocateurs. For instance, the Red Queen argument concerning sexual optimization in the face of significant parasitism is dismissed as merely a distraction to real intelligence:
And as mentioned above, evolution scatters much of its selection power on traits that are unrelated to intelligence, such as Red Queen’s races of co-evolution between immune systems and parasites. Evolution will continue to waste resources producing mutations that have been reliably lethal, and will fail to make use of statistical similarities in the effects of different mutations. All these represent inefficiencies in natural selection (when viewed as a means of evolving intelligence) that it would be relatively easy for a human engineer to avoid while using evolutionary algorithms to develop intelligent software.
Inefficiencies? Really? We know that sexual dimorphism and competition are essential to the evolution of advanced species. Even the growth of brain size and creative capabilities are likely tied to sexual competition, so why should we think that they can be uncoupled? Instead, we are left with a blocker to the core argument that states instead that simulated evolution may, in fact, not be capable of producing sufficient complexity to produce intelligence as we know it without, in turn, a sufficiently complex simulated fitness function to evolve against. Observational effects, aside, if we don’t get this right, we need not worry about the problem of whether there are 10 or ten billion planets suitable for life out there.
Deep Learning methods that use auto-associative neural networks to pre-train (with bottlenecking methods to ensure generalization) have recently been shown to perform as well and even better than human beings at certain tasks like image categorization. But what is missing from the proposed methods? There seem to be a range of challenges that revolve around temporal novelty and sequential activation/classification problems like those that occur in natural language understanding. The most recent achievements are more oriented around relatively static data presentations.
Jürgen Schmidhuber revisits the history of connectionist research (dating to the 1800s!) in his October 2014 technical report, Deep Learning in Neural Networks: An Overview. This is one comprehensive effort at documenting the history of this reinvigorated area of AI research. What is old is new again, enhanced by achievements in computing that allow for larger and larger scale simulation.
The conclusions section has an interesting suggestion: what is missing so far is the sensorimotor activity loop that allows for the active interrogation of the data source. Human vision roams over images while DL systems ingest the entire scene. And the real neural systems have energy constraints that lead to suppression of neural function away from the active neural clusters.