Category: Big Data

Apprendre à traduire (Learning to Translate)

Google Translate has long been a useful tool for getting awkward gists of short texts. The original method built a phrase-based statistical translation model. To do this, you gather up “parallel” texts, which are existing human translations. You then “align” them by finding the most likely corresponding phrases in each sentence or set of sentences; languages often use fewer or more sentences to express the same ideas. Once you have that collection of phrasal translation candidates, you can guess the most likely translation of a new sentence by looking up the sequence of likely phrase groups that corresponds to it. IBM was the progenitor of this approach in the late 1980s.
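The phrasal lookup at the core of this approach can be sketched with a toy example. The phrase table and probabilities below are invented for illustration; a real system learns millions of entries from aligned corpora and also weighs a language model and reordering costs rather than greedily taking the best candidate:

```python
# Toy sketch of phrase-based lookup: a hypothetical phrase table maps
# source phrases to (translation, probability) candidates; we greedily
# segment the input, preferring longer phrases, and pick the most
# probable candidate for each.
phrase_table = {
    ("le", "chat"): [("the cat", 0.9), ("the kitty", 0.1)],
    ("est",): [("is", 0.95), ("be", 0.05)],
    ("noir",): [("black", 0.8), ("dark", 0.2)],
}

def translate(words):
    out, i = [], 0
    while i < len(words):
        # Prefer the longest known phrase starting at position i
        for span in range(len(words) - i, 0, -1):
            phrase = tuple(words[i:i + span])
            if phrase in phrase_table:
                best = max(phrase_table[phrase], key=lambda c: c[1])
                out.append(best[0])
                i += span
                break
        else:
            out.append(words[i])  # pass unknown words through untranslated
            i += 1
    return " ".join(out)

print(translate(["le", "chat", "est", "noir"]))  # → the cat is black
```

The pass-through branch for unknown words is exactly where the rare-word and proper-noun problems discussed below show up.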

It’s simple and elegant, but it was always criticized for telling us very little about language. Other methods that use techniques like interlingual transfer and parsers showed a more linguist-friendly face. In these methods, the source language is parsed into a parse tree, and then that parse tree is converted into a generic representation of the meaning of the sentence. Next, a generator uses that representation to create a surface-form rendering in the target language. The interlingua was meant to be like the deep meaning of linguistic theories, though the computer science versions of it tended to look a lot like ontological representations with fixed meanings. Flexibility was never the strong suit of these approaches, but their flaws ran much deeper than that.

For one, nobody was able to build a robust parser for any particular language. Next, the ontologies were never vast enough to accommodate the rich productivity of real human language. Generators, being the inverse of parsers, remained only toy projects in the computational linguistics community. And, at the end of the day, no functional systems were built.

Instead, the statistical methods plodded along but had their own limitations. For instance, the translation of a never-before-seen sentence consisting of never-before-seen phrases is the null set. Rare and strange words in the data have problems too, because they have very low probabilities and are swamped by well-represented candidates that lack the nuances of the rarer form. The model doesn’t care, of course; the probabilities rule everything. So you need more and more data. But then you get noisy data mixed in with the good data, and it distorts the probabilities. And you have to handle completely new words and groupings, like proper nouns and numbers, that arise from the unique productivity of these classes of forms.

So, where to go from here? For Google and its recent commitment to Deep Learning, the answer was to apply Deep Learning Neural Network approaches. The approach threw every recent advance at the problem to pretty good effect. For instance, to cope with novel and rare words, they broke the input text up into sub-word letter groupings. The segmentation of the groupings was itself based on a learned model of the most common break-ups of terms, though these didn’t necessarily correspond to syllables or other common linguistic expectations. Sometimes they also used character-level models. The models were then combined into an ensemble, which is a common way of overcoming brittleness and overtraining on subsets of the data set. They used GPUs in some cases, as well as reduced-precision arithmetic, to speed up the training of the models. They also used an attention-based intermediary between the encoder layers and the decoder layers to let the decoder focus on the most relevant parts of the source sentence rather than the whole of the broader context.
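The sub-word segmentation can be sketched as a greedy longest-match against a learned vocabulary of pieces. The vocabulary below is invented for illustration; in the real systems it is derived from corpus statistics (wordpiece or byte-pair-encoding style merges):

```python
# Sketch of subword segmentation: greedily take the longest vocabulary
# entry at each position, falling back to single characters. The
# vocabulary here is invented; real systems learn it from data.
vocab = {"trans", "lat", "ion", "un", "believ", "able"}

def segment(word):
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # longest match first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to the raw character
            i += 1
    return pieces

print(segment("translation"))   # → ['trans', 'lat', 'ion']
print(segment("unbelievable"))  # → ['un', 'believ', 'able']
```

The character fallback is what keeps the model from ever facing a truly out-of-vocabulary token, which is the point of the whole maneuver.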

The results reduced translation errors by as much as 60% relative to the baseline phrase-based approach and, interestingly, came close to the average human translator’s performance. Is this enough? Not at all. You are not going to translate poetry this way any time soon. The productiveness of human language and the open classes of named entities remain a barrier. The subtleties of pragmatics might still vex any data-driven approach, at least until there are a few examples in the corpora. And there might need to be a multi-sensory model somehow merged with the purely linguistic one to help manage some translation candidates. For instance, knowing the way in which objects fall could help move a translation from “plummeted” to “settled” to the ground.

Still, data-driven methods continue to reshape the intelligent machines of the future.

A Big Data Jeremiad and the Moral Health of America

The average of polls was wrong. The past-performance-weighted, hyper-parameterized, stratified-sampled, Monte Carlo-ized collaborative predictions fell as critically short in the general election as they had in the Republican primary. There will be much soul searching to establish why that might have been; from ground-game engagement to voter turnout, from pollster bias to sampling defects, the hit list will continue to grow.

Things were less predictable than they seemed. During the 2008 and 2012 elections, the losing party’s proxies held that the polls were inherently flawed, though the polls were ultimately predictive. Now, in 2016, they were inherently flawed and not at all predictive.

But what the polls showed was instructive even if their numbers were not quite right. Specifically, there was a remarkable turn-out for Trump among white, less-educated voters who long for radical change to their economic lives. The Democratic candidate was less clearly engaging.

Another difference emerged, however. Despite efforts to paint Hillary Clinton as corrupt or a liar, objective fact checkers concluded that she was, in fact, one of the most honest candidates in recent history, and that Donald Trump was one of the least, approached in utter mendacity only by Michele Bachmann. We can couple that with his race-baiting, misogyny, hostility, divorces, anti-immigrant scapegoating, and other childish antics. Yet these moral failures did not prevent his supporters from voting for him in large numbers.

But his moral failures may be precisely why his supporters found him appealing. Evangelicals decided for him because Clinton was a threat to the hope of overturning Roe v. Wade, while he was an unknown who had said a few contradictory things in opposition. His other moral issues were less important, even forgivable. In reality, though, this particular divide is an exemplar of a broader division in the moral fabric of America. The white working class has been struggling in post-industrial America for decades. Coal mining gives way to fracked, super-abundant natural gas. A freer labor market moves assembly overseas. The continuous rise in productivity shifts value away from labor in the service of innovation and toward disintermediated innovation itself.

The economic results are largely a consequence of freedom, a value that becomes strained in a polarized economy where factories close on egghead economic restructuring. Other values come into question as well. Charles Murray’s Coming Apart: The State of White America, 1960-2010, brought a controversial conservative lens to the loss of traditional values in working-class America. In this world, marriage, church, and hard work have dissolved under the influence of the ’60s’ pernicious counter-cultural deconstruction, which was revolutionary for the college-educated elite but destructive to the working class. What is left is a vacuum of virtues where the downtrodden lash out at the eggheads from the coasts. The moral failings of a scion of wealth are recognizable and forgivable because at least there is a sense of change and some simple diagnostics about what is wrong with our precious state.

So we are left with pussy grabbing, with the Chinese hoax of climate change, with impossible border walls, with a fornicator-in-chief misogynist, with a gloomy Jeremiad of divided America being exploited into oblivion. Even the statisticians were eggheaded speculators who were manipulating the world with their crazy polls. But at least it wasn’t her.

Startup Next

I’m thrilled to announce my new startup, Like Human. The company is focused on making significant new advances to the state of the art in cognitive computing and artificial intelligence. We will remain a bit stealthy for another six months or so and then will open up shop for early adopters.

I’m also pleased to share with you Like Human’s logo that goes by the name Logo McLogoface, or LM for short. LM combines imagery from nuclear warning signs, Robby the Robot from Forbidden Planet, and Leonardo da Vinci’s Vitruvian Man. I think you will agree about Mr. McLogoface’s agreeability:


You can follow developments at @likehumancom on Twitter, and I will make a few announcements here as well.

Local Minima and Coatimundi

Even given the basic conundrum of how deep learning neural networks might cope with temporal presentations or linear sequences, there is another oddity to deep learning that only seems obvious in hindsight. One of the main enhancements to traditional artificial neural networks is a phase of unsupervised pre-training that forces each layer to try to create a generative model of the input pattern. The deep learning networks then learn a discriminative model after the initial pre-training is done, focusing on the error relative to classification rather than simply recognizing the phrase or image per se.

Why this makes a difference has been the subject of some investigation. In general, there is an interplay between the smoothness of the error function and the ability of the optimization algorithms to cope with local minima. Visualize it this way: for any machine learning problem that needs to be solved, there are answers and better answers. Take visual classification. If the system (or you) gets shown an image of a coatimundi and a label that says coatimundi (heh, I’m running in New Mexico right now…), learning that image-label association involves adjusting weights assigned to different pixels in the presentation image down through multiple layers of the network that provide increasing abstractions about the features that define a coatimundi. And, importantly, that define a coatimundi versus all the other animals and non-animals.

These weight choices define an error function that is the optimization target for the network as a whole, and this error function can have many local minima. That is, by enhancing the weights supporting a coati versus a dog or a raccoon, the algorithm inadvertently leans towards a non-optimal assignment for all of them by focusing instead on a balance between them that is predestined by the previous dog and raccoon classifications (or, in general, the order of presentation).
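A one-dimensional caricature makes the trap concrete. Plain gradient descent on a double-well “error function” (invented here purely for illustration) settles into whichever valley it starts near, whether or not that valley is the global minimum:

```python
def error(x):
    # A double-well "error surface": minima near x = -1 (global, deeper)
    # and x = +1 (local, shallower), invented for illustration.
    return (x * x - 1) ** 2 + 0.3 * x

def grad(x):
    # Derivative of the error surface above
    return 4 * x * (x * x - 1) + 0.3

def descend(x, lr=0.01, steps=2000):
    # Plain gradient descent: follow the local slope downhill
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(descend(0.5))   # settles in the local minimum near +1
print(descend(-0.5))  # settles in the global minimum near -1
```

Starting position is everything here, which is exactly the leverage pre-training provides on the real, high-dimensional version of this surface.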

Improvements require “escaping” these local optima in favor of a global solution that accords the best overall outcome to all the animals and a minimization of the global error. And pre-training seems to do that. It likely moves each discriminative category closer to the global possibilities because those global possibilities are initially encoded by the pre-training phase.

This has the added benefit of regularizing or smoothing out the noise that is inherent in any real data set. Indeed, the two approaches appear to be closely allied in their impact on the overall machine learning process.

The Linguistics of Hate

Right-wing authoritarianism (RWA) and social dominance orientation (SDO) are measures of personality traits and tendencies. To measure them, you ask people to rate statements like:

Superior groups should dominate inferior groups

The withdrawal from tradition will turn out to be a fatal fault one day

People rate their agreement with these statements on a 1-to-5 scale from Strongly Disagree to Strongly Agree. These scales have their detractors, but they also demonstrate useful and stable reliability across cultures.

Note that while both of these measures tend to be higher in American self-described “conservatives,” they also can be higher for leftist authoritarians and they may even pop up for subsets of attitudes among Western social liberals about certain topics like religion. Haters abound.

I used the R packages twitteR, tm, wordcloud, SnowballC, and a few others and grabbed a few thousand tweets that contained the #DonaldJTrump hashtag. A quick scan of them showed the standard properties of tweets, like repetition through retweeting, heavy use of hashtags, and, of course, the use of #DonaldJTrump as part of anti-Trump sentiments (something about a cocaine-use video). But, filtering them down, there were definite standouts that seemed to support an RWA/SDO orientation. Here are some examples:

The last great leader of the White Race was #trump #trump2016 #donaldjtrump #DonaldTrump2016 #donaldtrump

Just a wuss who cant handle the defeat so he cries to GOP for brokered Convention. # Trump #DonaldJTrump

I am a PROUD Supporter of #DonaldJTrump for the Highest Office in the land. If you don’t like it, LEAVE!

#trump army it’s time, we stand up for family, they threaten trumps family they threaten us, lock and load, push the vote…

Not surprising, but the density of them shows a real aggressiveness that somewhat shocked me. So let’s assume that Republicans make up around 29% of the US population and that Trump is getting around 40% of their votes in the primary season; then we have an angry RWA/SDO-focused subpopulation of around 12% of the US population.
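The back-of-the-envelope estimate is just the product of the two assumed shares:

```python
# Both shares are the rough assumptions from the text, not measured values
republican_share = 0.29   # assumed share of the US population
trump_share = 0.40        # assumed share of Republican primary votes
subpopulation = republican_share * trump_share
print(f"{subpopulation:.0%}")  # → 12%
```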

That seems to fit with results from an online survey of RWA, reported here. An interesting open question is whether there is a spectrum of personality types that is genetically predisposed, or whether childhood exposures to ideas and modes of childrearing are more likely the cause of these patterns (and their cross-cultural applicability).

Here are some interesting additional resources:

Bilewicz, Michal, et al. “When Authoritarians Confront Prejudice: Differential Effects of SDO and RWA on Support for Hate-Speech Prohibition.” Political Psychology (2015).

Sylwester, Karolina, and Matthew Purver. “Twitter Language Use Reflects Psychological Differences between Democrats and Republicans.” PLoS ONE 10(9): e0137422 (2015). doi:10.1371/journal.pone.0137422.

The latter has a particularly good overview of RWA/SDO, other measures like openness, etc., and Twitter as an analytics tool.

Finally, below is some R code for Twitter analytics that I am developing. It is derivative of sample code like here and here, but reorients the function structure and adds deletion of Twitter hashtags to focus on the supporting language. There are some other enhancements, like codeset normalization. All uses and reuses are welcome. I am starting to play with building classifiers and using Singular Value Decomposition to pull apart various dominating factors and relationships in the term structure. Ultimately, however, human intervention is needed to identify pro versus anti tweets, as well as phrasal patterns that are more indicative of RWA/SDO than bags-of-words can capture.
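The factor-pulling idea can be sketched without the R stack. Here is a pure-Python toy (the term-document matrix and its counts are invented) that uses power iteration, the workhorse behind SVD, to extract the dominant factor in the term structure:

```python
# Power iteration on X·Xᵀ converges to the leading left singular vector,
# i.e., the term weights of the strongest factor in a term-document
# matrix. Terms and counts below are invented for illustration.
terms = ["great", "leader", "wuss", "defeat", "proud", "vote"]
X = [
    [2, 0, 0, 1],   # rows are terms, columns are tweets
    [1, 0, 0, 1],
    [0, 2, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def dominant_term_factor(X, iters=100):
    Xt = transpose(X)
    u = [1.0] * len(X)
    for _ in range(iters):
        u = matvec(X, matvec(Xt, u))      # multiply by X·Xᵀ
        n = sum(x * x for x in u) ** 0.5  # renormalize each step
        u = [x / n for x in u]
    return u

u = dominant_term_factor(X)
ranked = sorted(zip(terms, u), key=lambda t: -abs(t[1]))
print([t for t, _ in ranked[:2]])  # terms loading heaviest on the top factor
```

A full SVD repeats this kind of extraction for each successive factor after deflating the previous ones, which is how the dominating relationships get pulled apart.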

Also, here are wordclouds generated for #hillaryclinton and #DonaldJTrump, respectively. The Trump wordcloud was distorted by some kind of repetitive robotweeting that dominated the tweets.





# Grab tweets matching a search term and clean them into a character vector
library(twitteR)
library(tm)
library(wordcloud)
library(RColorBrewer)

RemoveDots <- function(tweet) gsub("[\\.\\,\\;]+", " ", tweet)
RemoveLinks <- function(tweet) gsub("https?:[^ $]+", "", tweet)
RemoveAtPeople <- function(tweet) gsub("@\\w+", "", tweet)
RemoveHashtags <- function(tweet) gsub("#\\w+", "", tweet)
FixCharacters <- function(tweet) iconv(tweet, to = "ASCII//TRANSLIT", sub = "")

CleanTweets <- function(tweet) {
  s1 <- RemoveLinks(tweet)
  s2 <- RemoveAtPeople(s1)
  s3 <- RemoveDots(s2)
  s4 <- RemoveHashtags(s3)
  FixCharacters(s4)
}

tweets.grabber <- function(searchTerm, num, verbose = FALSE) {
  djtTweets <- searchTwitter(searchTerm, num)
  # Use a handy helper function to put the tweets into a data frame
  tw.df <- twListToDF(djtTweets)
  tweets <- as.vector(sapply(tw.df$text, CleanTweets))
  if (verbose) print(tweets)
  tweets
}

generateCorpus <- function(df, pstopwords = c("amp", "rt")) {
  tw.corpus <- Corpus(VectorSource(df))
  tw.corpus <- tm_map(tw.corpus, content_transformer(removePunctuation))
  tw.corpus <- tm_map(tw.corpus, content_transformer(tolower))
  tw.corpus <- tm_map(tw.corpus, removeWords, stopwords("english"))
  tm_map(tw.corpus, removeWords, pstopwords)
}

corpus.stats <- function(tweets) {
  corpus <- generateCorpus(tweets)
  doc.m <- TermDocumentMatrix(corpus, control = list(minWordLength = 1))
  dm <- as.matrix(doc.m)
  # Calculate the frequency of words
  v <- sort(rowSums(dm), decreasing = TRUE)
  data.frame(word = names(v), freq = v)
}

# Generate the wordcloud from the term-frequency data frame
wordcloud.generate <- function(d, min.freq) {
  wordcloud(d$word, d$freq, scale = c(4, 0.3), min.freq = min.freq,
            colors = brewer.pal(8, "Paired"))
}

djttweets <- tweets.grabber("#DonaldJTrump", 2000, verbose = TRUE)
djtcorpus <- corpus.stats(djttweets)
wordcloud.generate(djtcorpus, 3)

The IQ of Machines

Perhaps idiosyncratic to some is my focus in the previous post on the theoretical background to machine learning that derives predominantly from algorithmic information theory and, in particular, Solomonoff’s theory of induction. I do note that there are other theories that can be brought to bear, including Vapnik’s Structural Risk Minimization and Valiant’s PAC-learning theory. Moreover, perceptrons, vector quantization methods, and so forth derive from completely separate principles that can then be cast into more fundamental problems in information geometry and physics.

Artificial General Intelligence (AGI) is then perhaps the hard problem on the horizon, one that, I contend, has not seen significant progress in the past twenty years or so. That is not to say that I am not an enthusiastic student of the topic and field, just that I don’t see risk levels from intelligent AIs rising to what we should consider a real threat. This topic of how to grade threats deserves deeper treatment, of course, and is at the heart of everything from so-called “nanny state” interventions in food and product safety to how to construct policy around global warming. Luckily, and unlike both those topics, killer AIs don’t threaten us at all quite yet.

But what about simply characterizing what AGIs might look like and how we can even tell when they arise? Mildly interesting is Shane Legg and Joel Veness’ idea of an Artificial Intelligence Quotient, or AIQ, that they expand on in An Approximation of the Universal Intelligence Measure. This measure is derived from, voilà, exactly the kind of algorithmic information theory (AIT) and compression arguments that I lead with in the slide deck. Is this the only theory around for AGI? Pretty much, but different perspectives tend to lead to slightly different focuses. For instance, there is little need to discuss AIT when dealing with Deep Learning Neural Networks. We just instead discuss statistical regularization and bottlenecking, which can be thought of as proxies for model compression.

So how can intelligent machines be characterized by something like AIQ? Well, the conclusion won’t be surprising. Intelligent machines are those machines that function well in terms of achieving goals over a highly varied collection of environments. This allows for tractable mathematical treatments insofar as the complexity of the landscapes can be characterized, but doesn’t really give us a good handle on what the particular machines might look like. They can still be neural networks or support vector machines, or maybe even something simpler, and through some selection and optimization process have the best performance over a complex topology of reward-driven goal states.
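A toy flavor of that definition can be written down directly. Everything below is invented for illustration: the “environments” are stand-ins with a complexity score and a rewarded action, and an agent is just a function from what it observes to an action. The AIT ingredient is the 2^-complexity weighting, which favors simpler environments in the spirit of the algorithmic-probability prior:

```python
import random

def make_env(rng):
    # Invented stand-in environment: a complexity score (a proxy for the
    # length of the environment's program) and the action it rewards.
    complexity = rng.randint(1, 8)
    target = complexity % 2  # simple hidden rule
    return complexity, target

def aiq(agent, trials=10000, seed=0):
    # Average reward over sampled environments, weighted by 2^-complexity
    rng = random.Random(seed)
    total_w = total_r = 0.0
    for _ in range(trials):
        complexity, target = make_env(rng)
        w = 2.0 ** -complexity
        total_w += w
        total_r += w * (1.0 if agent(complexity) == target else 0.0)
    return total_r / total_w

print(aiq(lambda c: c % 2))        # an agent that infers the rule scores 1.0
print(round(aiq(lambda c: 0), 2))  # a constant agent scores about 1/3 here
```

The ranking is the point: the measure says nothing about what the better agent looks like inside, only that it achieves goals across a weighted spread of environments.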

So still no reason to panic, but some interesting ideas that shed greater light on the still mysterious idea of intelligence and the human condition.

Machine Learning and the Coming Robot Apocalypse

Slides from a talk I gave today on current advances in machine learning are available in PDF, below. The agenda is pretty straightforward: starting with some theory about overfitting based on algorithmic information theory, we proceed through a taxonomy of ML types (not exhaustive), then dip into ensemble learning and deep learning approaches. An analysis of the difficulty and types of performance we get from various algorithms and problems is presented. We end with a discussion of whether we should be frightened about the progress we see around us.

Note: click on the gray square if you don’t see the embedded PDF…browsers vary.

Download the PDF file.

Intelligence Augmentation and a Frictionless Economy

The ever-present Tom Davenport weighs in in the Harvard Business Review on the topic of artificial intelligence (AI) and its impact on knowledge workers of the future. The theme is intelligence augmentation (IA), where knowledge workers improve their productivity and create new business opportunities using technology. And those new opportunities don’t displace others, per se, but introduce new efficiencies. This was also captured in the New York Times in a round-up of the role of talent and service marketplaces that reduce the costs of acquiring skills and services, creating more efficient exchanges and disintermediating the sources of friction in economic interactions.

I’ve noticed the proliferation of services for connecting home improvement contractors to customers lately, and have benefited from them in several renovation/construction projects I have ongoing. Meanwhile, Amazon Prime has absorbed an increasingly large portion of our shopping, even cutting out Whole Foods runs, with often next day deliveries. Between pricing transparency and removing barriers (delivery costs, long delays, searching for reliable contractors), the economic impacts might be large enough to be considered a revolution, though perhaps a consumer revolution rather than a worker productivity one.

Here’s the concluding paragraph from an IEEE article I just wrote that will appear in the San Francisco Chronicle in the near future:

One of the most interesting risks also carries with it the potential for enhanced reward. Don’t they always? That is, some economists see economic productivity largely stabilizing if not stagnating. Industrial revolutions driven by steam engines, electrification, telephony, and even connected computing led to radical reshapings of our economy and leaps in the productivity of workers, but there is no clear candidate for those kinds of changes in the near future. Big data feeding into more intelligent systems may be the driver of the next economic wave, though revolutions are always messier than anyone expected.

But maybe it will be simpler and less messy than I imagine, just intelligence augmentation helping with our daily engagement with a frictionless economy.

Inequality and Big Data Revolutions

I had some interesting new talking points in my Rock Stars of Big Data talk this week. On the same day, MIT Technology Review published Technology and Inequality by David Rotman, which surveys the link between a growing wealth divide and technological change. Part of my motivating argument for Big Data comes via Paul Krugman of Nobel Prize and New York Times fame: that intelligent systems are likely the next industrial revolution. Krugman builds on Robert Gordon’s analysis of past industrial revolutions, which reached some dire conclusions about slowing economic growth in America. The consequences of intelligent systems for everyday life will have enormous impact and will disrupt everything from low-wage workers through to knowledge workers. And how does Big Data lead to that disruption?

Krugman’s optimism was built on the presumption that the brittleness of intelligent systems so far can be overcome by more and more data. There are some examples where we are seeing incremental improvements due to data volumes. For instance, having larger sample corpora to use for modeling spoken language enhances automatic speech recognition. Google Translate builds on work that I had the privilege to be involved with in the 1990s that used “parallel texts” (essentially line-by-line translations) to build automatic translation systems based on phrasal lookup. The more examples of how things are translated, the better the system gets. But what else improves with Big Data? Maybe instrumenting many cars and crowdsourcing driving behaviors through city streets would provide the best data-driven approach to self-driving cars. Maybe instrumenting individuals will help us overcome some of the things we do effortlessly that are strangely difficult to automate, like folding towels and understanding complex visual scenes.

But regardless of the methods, the consequences need to be considered. Our current fascination with Big Data may not lead to Industrial Revolution 4 in five years or twenty, but unless there is some magical barrier that we are not aware of, IR4 seems to be inevitable. And the impacts will perhaps be more profound than the past revolutions because, unlike those transitions, the direct displacement of workers is a key component of the IR4 plan. In Rotman’s article, Thomas Piketty’s r > g is invoked to explain how the excess return on capital (r) versus the economic growth rate (g) leads to a concentration of wealth among the richest members of our society, creating a barbell distribution of economic opportunities where the middle class has been dismantled due to (per Gordon) the equalization of labor costs through outsourcing to low-cost nations. But at least there remains a left bell to that barbell, in that it is largely impossible to eliminate the service jobs that are critical to retail, restaurants, logistics, health care, and a raft of other economic sectors.
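The r > g dynamic compounds, which is what makes it corrosive over decades. A toy sketch, with rates invented purely for illustration:

```python
# If returns on capital (r) outpace economic growth (g), capital's share
# of total output compounds upward. The rates here are invented.
r, g = 0.05, 0.02          # assumed return on capital vs. growth rate
capital = economy = 1.0
for year in range(50):
    capital *= 1 + r
    economy *= 1 + g
print(round(capital / economy, 1))  # capital grows over 4x relative to the economy
```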

All that changes in IR4 and the barbell turns into the hammer from the Olympic hammer throw as the owners of the capital take over the entire cost structure for a huge range of economic activities. The middle may not initially be gone, however, as maintenance of the machinery will require a skilled workforce. Even this will be a point of Big Data optimization, however, as predictive maintenance and self-healing systems optimize against their failure modes over usage cycles.

So let’s go back to Gordon’s pessimism (economics is, after all, the “dismal science”). What headwinds and tailwinds are left in IR4? Perhaps the most cogent is the recommended use of redistributive methods for accelerating educational opportunities while reducing the debt load of American students. The other areas that are discussed include unlimited immigration to try to offset hours per capita declines due to retirement and demographic effects, but Gordon’s application of this is not necessarily valid in IR4 where low-skilled immigration would cease because of a lack of economic opportunities and even higher-skilled workers might find themselves displaced.

One lesson learned from past industrial revolutions is that they created more opportunities than worker displacements. Steam power displaced animal labor and the workers needed to shoe and train and feed those animals. Diesel trains displaced steam engine builders and mechanics. Cars and aircraft displaced trains. But in each case there were new jobs that accompanied the shift. We might be equally optimistic about IR4, speculating about robot trainers and knowledge engineers, massive extraction industries and materials production, or enhanced creative and entertainment systems like Michael Crichton’s dystopian Westworld of the early 70s. Is this enough to buffer against the headwind of the loss of the service sector? Perhaps, but it will not come without enormous global disruption.