Category: Computers

Learning to Translate

Google Translate has long been a useful tool for getting awkward gists of short texts. The method was based on building a phrase-based statistical translation model. To do this, you gather up “parallel” texts, that is, existing human translations. You then “align” them by finding the most likely corresponding phrases in each sentence or set of sentences. Between languages, fewer or more sentences will often be used to express the same ideas. Once you have that collection of phrasal translation candidates, you can guess the most likely translation of a new sentence by looking up the sequence of likely phrase groups that corresponds to that sentence. IBM was the progenitor of this approach in the late 1980s.
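The lookup step can be sketched in a few lines of Python. This is a deliberately toy version: the phrase table holds a single invented best candidate per source phrase, and decoding is a greedy longest-match walk rather than the full probabilistic search a real system would use.

```python
# Toy phrase-based translation: greedily match the longest known source
# phrase at each position and emit its best target phrase.
# The phrase table below is invented purely for illustration.
phrase_table = {
    ("the", "house"): ("la", "maison"),
    ("is",): ("est",),
    ("green",): ("verte",),
    ("the",): ("le",),
}

def translate(sentence):
    words = sentence.split()
    out = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, backing off to shorter ones
        for length in range(len(words) - i, 0, -1):
            phrase = tuple(words[i:i + length])
            if phrase in phrase_table:
                out.extend(phrase_table[phrase])
                i += length
                break
        else:
            out.append(words[i])  # pass unknown words through unchanged
            i += 1
    return " ".join(out)

print(translate("the house is green"))  # la maison est verte
```

A real decoder would weigh competing candidates by translation and language-model probabilities; this sketch only shows the table-lookup core of the idea.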

It’s simple and elegant, but it was always criticized for telling us very little about language. Other methods, using techniques like interlingual transfer and parsers, showed a more linguist-friendly face. In these methods, the source language is parsed into a parse tree, which is then converted into a generic representation of the meaning of the sentence. A generator then uses that representation to create a surface-form rendering in the target language. The interlingua was supposed to resemble the deep meaning of linguistic theories, though the computer science versions of it tended to look a lot like ontological representations with fixed meanings. Flexibility was never the strong suit of these approaches, but their flaws ran much deeper than that.

For one, nobody was able to build a robust parser for any particular language. Next, no ontology was ever vast enough to accommodate the rich productivity of real human language. Generators, the inverse of parsers, remained toy projects in the computational linguistics community. And, at the end of the day, no functional systems were built.

Instead, the statistical methods plodded along, but with their own limitations. For instance, the translation of a never-before-seen sentence consisting of never-before-seen phrases is the null set. Rare and strange words in the data cause problems too: they have very low probabilities and are swamped by well-represented candidates that lack the nuances of the rarer form. The model doesn’t care, of course; the probabilities rule everything. So you need more and more data. But then noisy data gets mixed in with the good data and distorts the probabilities. And you have to handle completely new words and groupings, like proper nouns and numbers, that arise from the unique productivity of these classes of forms.

So, where to go from here? For Google and its recent commitment to Deep Learning, the answer was to apply Deep Learning Neural Network approaches. The approach threw every recent advance at the problem, to pretty good effect. For instance, to cope with novel and rare words, they broke the input text up into sub-word letter groupings. The segmentation of the groupings was itself based on a learned model of the most common break-ups of terms, though the pieces didn’t necessarily correspond to syllables or other common linguistic expectations. Sometimes they also used character-level models. The models were then combined into an ensemble, a common way of overcoming brittleness and overtraining on subsets of the data set. They used GPUs in some cases, as well as reduced-precision arithmetic, to speed up the training of the models. They also used an attention-based intermediary between the encoder layers and the decoder layers to let the decoder weight the most relevant parts of the broader context within a sentence.
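The sub-word segmentation idea can be illustrated with a byte-pair-encoding-style sketch in Python. Google's actual segmentation model (wordpiece) differs in its details, and the tiny word-frequency list here is made up; the point is only how frequent letter groupings get merged into learned sub-word units.

```python
from collections import Counter

def learn_merges(word_freqs, n_merges):
    """Learn byte-pair-style merges from a {word: frequency} table."""
    # Start with each word split into individual characters
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(n_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Merge the most frequent pair everywhere it occurs
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_merges({"low": 7, "lower": 5, "lowest": 3}, 2)
print(merges)  # the two most frequent merges: ('l','o') then ('lo','w')
```

Note how the learned units ("lo", then "low") need not align with syllables or any other linguistic expectation; they are just the statistically most common break-ups.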

The results improved translation quality by as much as 60% over the baseline phrase-based approach and, interestingly, came close to the performance of the average human translator. Is this enough? Not at all. You are not going to translate poetry this way any time soon. The productiveness of human language and the open classes of named entities remain a barrier. The subtleties of pragmatics might still vex any data-driven approach, at least until there are a few examples in the corpora. And there might need to be a multi-sensory model somehow merged with the purely linguistic one to help manage some translation candidates. For instance, knowing the way objects fall could help move a translation from “plummeted” to “settled” to the ground.

Still, data-driven methods continue to reshape the intelligent machines of the future.

Boredom and Being a Decider

Seth Lloyd and I have rarely converged (read: absolutely never) on a realization, but his remarkable 2013 paper on free will and halting problems does, in fact, converge on a paper I wrote around 1986 for an undergraduate Philosophy of Language course. I was, at the time, very taken by Gödel, Escher, Bach: An Eternal Golden Braid, Douglas Hofstadter’s poetic excursion around the topic of recursion, vertical structure in ricercars, and various other topics that stormed about in his book. For me, when combined with other musings on halting problems, it led to the conclusion that the halting problem could be probabilistically solved by an observer who decides when the recursion is too repetitive or too deep. In other words, an overlay algorithm guesses at the odds that another algorithm will halt when it is subjected to a time or resource constraint. Thus we have a boredom algorithm.
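A minimal sketch of such a boredom algorithm, in Python: the monitored processes are written as generators so the observer can count their steps, and the patience threshold (the point at which the observer gets "bored") is, of course, arbitrary.

```python
def countdown(n):
    # A process that provably halts after n steps
    while n > 0:
        n -= 1
        yield
    # falling off the end means the process halted

def loop_forever():
    # A process that never halts
    while True:
        yield

def bored_observer(process, patience):
    """Step another computation; report 'halted' if it finishes within
    `patience` steps, otherwise guess that it probably never halts."""
    for step, _ in enumerate(process):
        if step >= patience:
            return "probably never halts"
    return "halted"

print(bored_observer(countdown(10), patience=1000))
print(bored_observer(loop_forever(), patience=1000))
```

The observer cannot solve the halting problem, of course; it only converts it into a probabilistic guess under a resource constraint, which is exactly the point.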

I thought this was rather brilliant at the time and I ended up having a one-on-one with my prof who scoffed at GEB as a “serious” philosophical work. I had thought it was all psychedelically transcendent and had no deep understanding of more serious philosophical work beyond the papers by Kripke, Quine, and Davidson that we had been tasked to read. So I plead undergraduateness. Nevertheless, he had invited me to a one-on-one and we clashed over the concept of teleology and directedness in evolutionary theory. How we got to that from the original decision trees of halting or non-halting algorithms I don’t recall.

But now we have an argument that essentially recapitulates that original form, though with the help of the Hartmanis-Stearns theorem to support it. Whatever the algorithm that runs in our heads, it needs to simulate possible outcomes and try to determine what the best course of action might be (or the worst course, or just some preference). That algorithm is in wetware and is therefore perfectly deterministic. And, importantly, quantum indeterminacy doesn’t rescue us from the free-will implications of that determinism at all; randomness is just random, not decision-making. Instead, the impossibility of assessing the possible outcomes comes from one algorithm monitoring another. In a few narrow cases, it may be possible to enumerate all the stopping results of the enclosed algorithm, but in general, all you can do is greedily terminate branches in the production tree based on some kind of temporal or resource-based criteria.

Free will is neither random nor classically deterministic, but is an algorithmic constraint on the processing power to simulate reality in a conscious, but likely deterministic, head.

A Big Data Jeremiad and the Moral Health of America

The average of the polls was wrong. The past-performance-weighted, hyper-parameterized, stratified-sampled, Monte Carlo-ized collaborative predictions fell as critically short in the general election as they had in the Republican primary. There will be much soul searching to establish why that might have been; from ground-game engagement to voter turnout, from pollster bias to sampling defects, the hit list will continue to grow.

Things were less predictable than they seemed. During the 2008 and 2012 elections, the losing party proxies held that the polls were inherently flawed, though they were ultimately predictive. Now, in 2016, they were inherently flawed and not at all predictive.

But what the polls showed was instructive even if their numbers were not quite right. Specifically, there was a remarkable turn-out for Trump among white, less-educated voters who long for radical change to their economic lives. The Democratic candidate was less clearly engaging.

Another difference emerged, however. Despite efforts to paint Hillary Clinton as corrupt or a liar, objective fact checkers concluded that she was, in fact, one of the most honest candidates in recent history, and that Donald Trump was one of the worst, approximated only by Michele Bachmann in utter mendacity. We can couple that with his race-baiting, misogyny, hostility, divorces, anti-immigrant scapegoating, and other childish antics. Yet these moral failures did not prevent his supporters from voting for him in large numbers.

But his moral failures may be precisely why his supporters found him appealing. Evangelicals decided for him because Clinton was a threat to overturning Roe v. Wade, while he was an unknown who said a few contradictory things in opposition. His other moral issues were less important—even forgivable. In reality, though, this particular divide is an exemplar for a broader division in the moral fabric of America. The white working class has been struggling in post-industrial America for decades. Coal mining gives way to fracked, super-abundant natural gas. A freer labor market moves assembly overseas. The continuous rise in productivity shifts value away from labor in the service of innovation to disintermediated innovation itself.

The economic results are largely a consequence of freedom, a value that becomes suffused in the polarized economy where factories close on egghead economic restructuring. Other values come into question as well. Charles Murray’s Coming Apart: The State of White America, 1960-2010, brought a controversial conservative lens to the loss of traditional values for working-class America. In this world, marriage, church, and hard work have dissolved due to the influence of the ’60s’ pernicious counter-cultural deconstruction, which was revolutionary for the college-educated elite but destructive to the working class. What is left is a vacuum of virtues where the downtrodden lash out at the eggheads from the coasts. The moral failings of a scion of wealth are themselves recognizable and forgivable because at least there is a sense of change and some simple diagnostics about what is wrong with our precious state.

So we are left with pussy grabbing, with the Chinese hoax of climate change, with impossible border walls, with a fornicator-in-chief misogynist, with a gloomy jeremiad of a divided America being exploited into oblivion. Even the statisticians were eggheaded speculators who were manipulating the world with their crazy polls. But at least it wasn’t her.

Startup Next

I’m thrilled to announce my new startup, Like Human. The company is focused on making significant new advances to the state of the art in cognitive computing and artificial intelligence. We will remain a bit stealthy for another six months or so and then will open up shop for early adopters.

I’m also pleased to share with you Like Human’s logo that goes by the name Logo McLogoface, or LM for short. LM combines imagery from nuclear warning signs, Robby the Robot from Forbidden Planet, and Leonardo da Vinci’s Vitruvian Man. I think you will agree about Mr. McLogoface’s agreeability:


You can follow developments at @likehumancom on Twitter, and I will make a few announcements here as well.

Motivation, Boredom, and Problem Solving

In the New York Times’ The Stone column, James Blachowicz of Loyola challenges the assumption that the scientific method is uniquely distinguishable from the other ways of thinking and problem solving we regularly employ. In his example, he lays out how writing poetry involves some kind of alignment of words that conforms to the requirements of the poem. Whether actively aware of the process or not, the poet is solving constraint satisfaction problems concerning formal requirements like meter and structure, linguistic problems like parts of speech and grammar, semantic problems concerning meaning, and pragmatic problems like referential extension and symbolism. Scientists do the same kinds of things in fitting a theory to data. And, in Blachowicz’s analysis, there is no special distinction between the scientific method and other creative methods like the composition of poetry.

We can easily see how this extends to ideas like musical composition and, indeed, extends with even more constraints that range from formal through to possibly the neuropsychology of sound. I say “possibly” because there remains uncertainty on how much nurture versus nature is involved in the brain’s reaction to sounds and music.

In terms of a computational model of this creative process, if we presume that there is an objective function that governs possible fits to the given problem constraints, then we can clearly optimize towards a maximum fit. For many of the constraints there are, however, discrete parameterizations (which part of speech? which word?) that are not like curve fitting to scientific data. In fairness, discrete parameters occur there, too, especially in meta-analyses of broad theoretical possibilities (quantum loop gravity vs. string theory? what will we tell the children?). The discrete parameterizations blow up the search space with their combinatorics, demonstrating on the one hand why we are so damned amazing, and on the other hand why a controlled randomization method like evolutionary epistemology’s blind search and selective retention gives us potential traction in the face of this curse of dimensionality. The blind search is likely tempered by active human engagement, though. Certainly the poet or the scientist would agree; they are using learned skills, maybe some intellectual talent of unknown origin, and experience in how to traverse the wells of improbability in finding the best fit for the problem. This certainly resembles pre-training in deep learning, though on a much more pervasive scale, including feedback from categorical model optimization into the generative basis model.
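Blind variation and selective retention can be sketched as a one-candidate evolutionary search over a discrete space. The target string, alphabet, and mutation rate below are arbitrary illustrative choices; the point is that random variation plus a keep-if-no-worse rule finds a fit without any gradient to follow.

```python
import random

TARGET = "the best fit for the problem"
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def fitness(candidate):
    # Count positions that already match the target
    return sum(a == b for a, b in zip(candidate, TARGET))

def evolve(seed=0, rate=0.05, max_steps=200000):
    rng = random.Random(seed)
    current = [rng.choice(ALPHABET) for _ in TARGET]
    for step in range(max_steps):
        # Blind variation: randomly mutate a copy of the current candidate
        child = [rng.choice(ALPHABET) if rng.random() < rate else c
                 for c in current]
        # Selective retention: keep the variant only if it fits at least as well
        if fitness(child) >= fitness(current):
            current = child
        if fitness(current) == len(TARGET):
            return "".join(current), step
    return "".join(current), max_steps

result, steps = evolve()
print(result, "found after", steps, "variations")
```

Exhaustive search over this space would need 27^28 candidates; the variation/retention loop gets there in a few thousand steps, which is the traction against the curse of dimensionality that the paragraph describes.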

But does this extend outwards to other ways in which we form ideas? We certainly know that motivated reasoning is involved in key aspects of our belief formation, which plays strongly into how we solve these constraint problems. We tend to actively look for confirmations and avoid disconfirmations of fit. We positively bias recency of information, or repeated exposures, and tend to only reconsider in much slower cycles.

Also, as the constraints of certain problem domains become, in turn, extensions that can result in change—where there is a dynamic interplay between belief and success—the fixity of the search space itself is no longer guaranteed. Broad human goals like the search for meaning are an example of that. In come complex human factors, like how boredom correlates with motivation and ideological extremism (overview, here, journal article, here).

This latter data point concerning boredom crosses from mere bias that might preclude certain parts of a search space into motivation that focuses it, and that optimizes for novelty seeking and other behaviors.

Local Minima and Coatimundi

Even given the basic conundrum of how deep learning neural networks might cope with temporal presentations or linear sequences, there is another oddity to deep learning that only seems obvious in hindsight. One of the main enhancements to traditional artificial neural networks is a phase of unsupervised pre-training that forces each layer to try to create a generative model of the input pattern. The deep learning networks then learn a discriminative model after the initial pre-training is done, focusing on the error relative to classification rather than simply recognizing the phrase or image per se.

Why this makes a difference has been the subject of some investigation. In general, there is an interplay between the smoothness of the error function and the ability of the optimization algorithms to cope with local minima. Visualize it this way: for any machine learning problem that needs to be solved, there are answers and better answers. Take visual classification. If the system (or you) gets shown an image of a coatimundi and a label that says coatimundi (heh, I’m running in New Mexico right now…), learning that image-label association involves adjusting weights assigned to different pixels in the presentation image down through multiple layers of the network that provide increasing abstractions about the features that define a coatimundi. And, importantly, that define a coatimundi versus all the other animals and non-animals.

These weight choices define an error function that is the optimization target for the network as a whole, and this error function can have many local minima. That is, by enhancing the weights supporting a coati versus a dog or a raccoon, the algorithm inadvertently leans towards a non-optimal assignment for all of them by focusing instead on a balance between them that is predestined by the previous dog and raccoon classifications (or, in general, the order of presentation).
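A toy picture of the problem: gradient descent on a one-dimensional error function with two valleys ends up in different minima depending on where it starts. The function, learning rate, and starting points below are invented for illustration; a real network's error surface has millions of dimensions, but the trap is the same.

```python
def f(x):
    # An "error function" with two valleys; the 0.3*x tilt makes the
    # left valley the global minimum.
    return (x * x - 1) ** 2 + 0.3 * x

def grad(x):
    # Analytic derivative of f
    return 4 * x * (x * x - 1) + 0.3

def descend(x, lr=0.01, steps=2000):
    # Plain gradient descent from a given starting point
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = descend(-2.0)   # settles near x = -1.03, the global minimum
right = descend(2.0)   # trapped near x = 0.96, a higher local minimum
print(left, f(left))
print(right, f(right))
```

Both runs stop where the gradient vanishes, but only one finds the better valley; pre-training, on this picture, amounts to starting the descent somewhere more favorable.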

Improvements require “escaping” these local optima in favor of a global solution that accords the best overall outcome to all the animals and a minimization of the global error. And pre-training seems to do that. It likely moves each discriminative category closer to the global possibilities because those global possibilities are initially encoded by the pre-training phase.

This has the added benefit of regularizing or smoothing out the noise that is inherent in any real data set. Indeed, the two approaches appear to be closely allied in their impact on the overall machine learning process.

New Behaviorism and New Cognitivism

Deep Learning now dominates discussions of intelligent systems in Silicon Valley. Jeff Dean’s discussion of its role in the Alphabet product lines and initiatives shows the dominance of the methodology. Pushing the limits of what Artificial Neural Networks have been able to do has been driven by certain algorithmic enhancements and the ability to process weight training algorithms at much higher speeds and over much larger data sets. Google even developed specialized hardware to assist.

Broadly, though, we see mostly pattern recognition problems like image classification and automatic speech recognition being impacted by these advances. Natural language parsing has also recently had some improvements from Fernando Pereira’s team. The incremental improvements using these methods should not be minimized but, at the same time, the methods don’t emulate key aspects of what we observe in human cognition. For instance, the networks train incrementally and lack the kinds of rapid transitions that we observe in human learning and thinking.

In a strong sense, the models that Deep Learning uses can be considered Behaviorist in that they rely almost exclusively on feature presentation with a reward signal. The internal details of how modularity or specialization arise within the network layers are interesting but secondary to the broad use of back-propagation or Gibbs sampling combined with autoencoding. This is a critique that goes back to the early days of connectionism, of course, and why it was somewhat sidelined after an initial heyday in the late eighties. Then came statistical NLP, then came hybrid methods, then a resurgence of corpus methods, all the while with image processing getting more and more into the hand-crafted modular space.

But we can see some interesting developments that start to stir more Cognitivism into this stew. Recurrent Neural Networks provided interesting temporal behavior that might be lacking in some feedforward NNs, and Long Short-Term Memory (LSTM) NNs help to overcome some specific limitations of recurrent NNs, like the disconnection between temporally distant signals and the reward patterns.

Still, the modularity and rapid learning transitions elude us. While these methods are enhancing the ability to learn the contexts around specific events (and even the unique variability of contexts), that learning still requires many exposures to get right. We might consider our language or vision modules to be learned over evolutionary history and so not expect learning within a lifetime from scratch to result in similarly structured modules, but the differences remain not merely quantitative but significantly qualitative. A New Cognitivism requires more work to rise from this New Behaviorism.

Evolving Visions of Chaotic Futures

Most artificial intelligence researchers consider it unlikely that a robot apocalypse or some kind of technological singularity is coming anytime soon. I’ve said as much, too. Guessing about the likelihood of distant futures is fraught with uncertainty; current trends are almost impossible to extrapolate.

But if we must, what are the best ways for guessing about the future? In the late 1950s the Delphi method was developed. Get a group of experts on a given topic and have them answer questions anonymously. Then iteratively publish back the group results and ask for feedback and revisions. Similar methods have been developed for face-to-face group decision making, like Kevin O’Connor’s approach to generating ideas in The Map of Innovation: generate ideas and give participants votes equaling a third of the number of unique ideas. Keep iterating until there is a consensus. More broadly, such methods are called “nominal group techniques.”
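The Delphi iteration can be sketched as repeated anonymous feedback against the published group result. The choice of the median as the published statistic, the degree to which each participant moves toward it, and the convergence tolerance are all assumptions of this sketch rather than part of any canonical formulation.

```python
import statistics

def delphi_round(estimates, pull=0.5):
    # One anonymous feedback round: publish the group median and let each
    # participant revise part way toward it
    anchor = statistics.median(estimates)
    return [e + pull * (anchor - e) for e in estimates]

def delphi(estimates, tolerance=0.01, max_rounds=50):
    """Iterate feedback rounds until the estimates effectively agree."""
    rounds = 0
    while max(estimates) - min(estimates) >= tolerance and rounds < max_rounds:
        estimates = delphi_round(estimates)
        rounds += 1
    return statistics.median(estimates), rounds

consensus, rounds = delphi([10, 20, 100])
print(consensus, "after", rounds, "rounds")
```

Real Delphi panels revise judgments for substantive reasons, not by mechanical averaging; the sketch only shows the iterate-publish-revise loop that nominal group techniques share.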

Most recently, the notion of prediction markets has been applied to internal and external decision making. In prediction markets, a similar voting strategy is used, but based on either fake or real money, forcing participants towards a risk-averse allocation of assets.

Interestingly, we know that optimal inference based on past experience can be codified using algorithmic information theory, but the fundamental problem with any kind of probabilistic argument is that much change that we observe in society is non-linear with respect to its underlying drivers and that the signals needed are imperfect. As the mildly misanthropic Nassim Taleb pointed out in The Black Swan, the only place where prediction takes on smooth statistical regularity is in Las Vegas, which is why one shouldn’t bother to gamble. Taleb’s approach is to look instead at minimizing the impact of shocks (or hedging them in financial markets).

But maybe we can learn something from philosophical circles. For instance, Evolutionary Epistemology (EE), as formulated by Donald Campbell, Sir Karl Popper, and others, posits that central to knowledge formation is blind variation and selective retention. Combined with optimal induction, this leads to random processes being injected into any kind of predictive optimization. We do this in evolutionary algorithms like Genetic Algorithms, Evolutionary Programming, Genetic Programming, and Evolutionary Strategies, as well as in related approaches like Simulated Annealing. But EE also suggests that there are several levels of learning by variation/retention, from the phylogenetic learning of species through to the mental processes of higher organisms. We speculate and trial-and-error continuously, repeating loops of what-ifs in our minds in an effort to optimize our responses in the future. It’s confounding as hell, but we do remarkable things that machines can’t yet do, like folding towels or learning to bake bread.

This noosgeny-recapitulates-ontogeny-recapitulates-phylogeny (just made that up) can be exploited in a variety of ways for abductive inference about the future. We can, for instance, use evolutionary optimization with a penalty for complexity that simulates the informational trade-off of AIT-style inductive optimality. Further, the noosgeny component (by which I mean the internalized mental trial-and-error) can reduce phylogenetic waste in simulations by providing speculative modeling that retains the “parental” position on the fitness landscape before committing to a next generation of potential solutions, allowing for further probing of complex adaptive landscapes.

The Linguistics of Hate

Right-wing authoritarianism (RWA) and Social dominance orientation (SDO) are measures of personality traits and tendencies. To measure them, you ask people to rate statements like:

Superior groups should dominate inferior groups

The withdrawal from tradition will turn out to be a fatal fault one day

People rate their opinions on these questions using a 1 to 5 scale from Definitely Disagree to Strongly Agree. These scales have their detractors but they also demonstrate some useful and stable reliability across cultures.

Note that while both of these measures tend to be higher in American self-described “conservatives,” they also can be higher for leftist authoritarians and they may even pop up for subsets of attitudes among Western social liberals about certain topics like religion. Haters abound.

I used the R packages twitteR, tm, wordcloud, SnowballC, and a few others, and grabbed a few thousand tweets that contained the #DonaldJTrump hashtag. A quick scan of them showed the standard properties of tweets, like repetition through retweeting, heavy use of hashtags, and, of course, the use of #DonaldJTrump as part of anti-Trump sentiments (something about a cocaine-use video). But, filtering them down, there were definite standouts that seemed to support an RWA/SDO orientation. Here are some examples:

The last great leader of the White Race was #trump #trump2016 #donaldjtrump #DonaldTrump2016 #donaldtrump”

Just a wuss who cant handle the defeat so he cries to GOP for brokered Convention. # Trump #DonaldJTrump

I am a PROUD Supporter of #DonaldJTrump for the Highest Office in the land. If you don’t like it, LEAVE!

#trump army it’s time, we stand up for family, they threaten trumps family they threaten us, lock and load, push the vote…

Not surprising, but the density of them shows a real aggressiveness that somewhat shocked me. So let’s assume that Republicans make up around 29% of the US population and that Trump is getting around 40% of their votes in the primary season; then we have an angry RWA/SDO-focused subpopulation of around 12% of the US population.
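The arithmetic behind that estimate is just the product of the two assumed shares:

```python
# Back-of-the-envelope: share of the US population that is Republican
# times Trump's assumed share of Republican primary votes.
republican_share = 0.29
trump_primary_share = 0.40
subpopulation = republican_share * trump_primary_share
print(round(subpopulation, 3))  # roughly 0.116, i.e. about 12%
```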

That seems to fit with results from an online survey of RWA, reported here. An interesting open question is whether there is a spectrum of personality types that is genetically predisposed, or whether childhood exposures to ideas and modes of childrearing are more likely the cause of these patterns (and their cross-cultural applicability).

Here are some interesting additional resources:

Bilewicz, Michal, et al. “When Authoritarians Confront Prejudice. Differential Effects of SDO and RWA on Support for Hate‐Speech Prohibition.” Political Psychology (2015).

Sylwester K, Purver M (2015) Twitter Language Use Reflects Psychological Differences between Democrats and Republicans. PLoS ONE 10(9): e0137422. doi:10.1371/journal.pone.0137422

The latter has a particularly good overview of RWA/SDO, other measures like openness, etc., and Twitter as an analytics tool.

Finally, below is some R code for Twitter analytics that I am developing. It is derivative of sample code like here and here, but reorients the function structure and adds deletion of Twitter hashtags to focus on the supporting language. There are some other enhancements like codeset normalization. All uses and reuses are welcome. I am starting to play with building classifiers and using Singular Value Decomposition to pull apart various dominating factors and relationships in the term structure. Ultimately, however, human intervention is needed to identify pro vs. anti tweets, as well as phrasal patterns that are more indicative of RWA/SDO than bags-of-words can indicate.

Also, here are wordclouds generated for #hillaryclinton and #DonaldJTrump, respectively. The Trump wordcloud was distorted by some kind of repetitive robotweeting that dominated the tweets.





# Required packages
library(twitteR)
library(tm)
library(wordcloud)

# Helper functions to strip noise out of raw tweet text
RemoveDots <- function(tweet) {
  gsub("[\\.\\,\\;]+", " ", tweet)
}

RemoveLinks <- function(tweet) {
  tweet <- gsub("http:[^ $]+", "", tweet)
  gsub("https:[^ $]+", "", tweet)
}

RemoveAtPeople <- function(tweet) {
  gsub("@\\w+", "", tweet)
}

RemoveHashtags <- function(tweet) {
  gsub("#\\w+", "", tweet)
}

FixCharacters <- function(tweet) {
  # Codeset normalization: drop characters that can't map to ASCII
  iconv(tweet, to = "ASCII", sub = "")
}

CleanTweets <- function(tweet) {
  s1 <- RemoveLinks(tweet)
  s2 <- RemoveAtPeople(s1)
  s3 <- RemoveDots(s2)
  s4 <- RemoveHashtags(s3)
  FixCharacters(s4)
}

# Grab the tweets and return them as a cleaned character vector
tweets.grabber <- function(searchTerm, num, verbose = FALSE) {
  djtTweets <- searchTwitter(searchTerm, num)
  # Use a handy helper function to put the tweets into a dataframe
  tw.df <- twListToDF(djtTweets)
  tweets <- as.vector(sapply(tw.df$text, CleanTweets))
  if (verbose) print(tweets)
  tweets
}

generateCorpus <- function(df, pstopwords = c()) {
  tw.corpus <- Corpus(VectorSource(df))
  tw.corpus <- tm_map(tw.corpus, content_transformer(removePunctuation))
  tw.corpus <- tm_map(tw.corpus, content_transformer(tolower))
  tw.corpus <- tm_map(tw.corpus, removeWords, stopwords('english'))
  if (length(pstopwords) > 0)
    tw.corpus <- tm_map(tw.corpus, removeWords, pstopwords)
  tw.corpus
}

# Build the term-document matrix and the word frequency table
corpus.stats <- function(tweets) {
  corpus <- generateCorpus(tweets)
  doc.m <- TermDocumentMatrix(corpus, control = list(minWordLength = 1))
  dm <- as.matrix(doc.m)
  # Calculate the frequency of words
  v <- sort(rowSums(dm), decreasing = TRUE)
  data.frame(word = names(v), freq = v)
}

# Generate the wordcloud from the frequency table
wordcloud.generate <- function(d, min.freq) {
  wordcloud(d$word, d$freq, scale = c(4, 0.3), min.freq = min.freq,
            colors = brewer.pal(8, "Paired"))
}

djttweets <- tweets.grabber("#DonaldJTrump", 2000, verbose = TRUE)
djtcorpus <- corpus.stats(djttweets)
wordcloud.generate(djtcorpus, 3)

The Retiring Mind, Part III: Autonomy

Robert Gordon’s book on the end of industrial revolutions recently came out. I’ve been arguing for a while that the coming robot apocalypse might be Industrial Revolution IV. But the Dismal Science continues to point out uncomfortable facts in opposition to my suggestion.

So I had to test the beginning of the end (or the beginning of the beginning?) when my Tesla P90D with autosteer, summon mode, automatic parking, and ludicrous mode arrived to take the place of my three-year-old P85: