If you were listening to NPR’s “All Things Considered” broadcast on January 18, you might have heard a brief report on research that reveals regional differences (“dialects”) in word usage, spellings, slang, and abbreviations in Twitter postings. For example, Northern and Southern California use the spelling variants koo and coo to mean “cool.”

Finding regional differences in these written expressions is interesting in its own right, but I’ve just finished reading the paper describing this research and there’s a lot more going on here than simply counting and comparing expressions across different geographic regions.  The paper is an excellent example of what market researchers might do to analyze social media.

The study authors–Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, and Eric P. Xing–are affiliated with the School of Computer Science at Carnegie Mellon University (Eisenstein, who was interviewed for the ATC broadcast, is a postdoctoral fellow). They set out to develop a latent variable model that predicts an author’s geographic location from the characteristics of text messages. As they point out, their work is unique in that they use raw text data (although “tokenized”) as input to the modeling. They develop and compare a few different models, including a “geographic topic model” that incorporates the interaction between base topics (such as sports) and an author’s geographic location, as well as two additional latent variable models: a “mixture of unigrams” (which assumes a single topic) and “supervised latent Dirichlet allocation.” If you have not yet figured it out, the models, as described, use statistical machine learning methods. That means some of the terminology may be unfamiliar to market researchers, but the algorithm described for the geographic topic model resembles the hierarchical Bayesian methods using the Gibbs sampler that have come into fairly wide use in market research (especially for choice-based conjoint analysis).
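To give a flavor of the simplest of these, a “mixture of unigrams” classifier can be sketched as a naive Bayes model over word counts: score each candidate region by how well its word distribution explains a message. This is a drastic simplification, not the authors’ actual model, and the regions, tokens, and counts below are invented for illustration:

```python
import math
from collections import Counter

# Toy "mixture of unigrams" sketch: score each region by the (smoothed)
# probability of a message's words under that region's word distribution.
# Regions, tokens, and counts are invented for illustration.
train = {
    "north": ["koo", "koo", "hella", "coo"],
    "south": ["coo", "coo", "coo", "koo"],
}
vocab = {w for words in train.values() for w in words}

def log_score(tokens, region):
    counts = Counter(train[region])
    total = sum(counts.values())
    # add-one smoothing so unseen words don't zero out the probability
    return sum(math.log((counts[w] + 1) / (total + len(vocab))) for w in tokens)

message = ["koo", "hella"]
best = max(train, key=lambda r: log_score(message, r))
print(best)  # the region whose word distribution best explains the message
```

The real models add latent topics and their interaction with geography, but the core move is the same: infer the most probable hidden label from observed word counts.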

This research is important for market research because it demonstrates a method for estimating characteristics of individual authors from the characteristics of their social media postings.  While we have not exhausted the potential of simpler methods (frequency and sentiment analyses, for example), this looks like the future of social media analysis for marketing.

Copyright 2011 by David G. Bakken.  All rights reserved.

The Predictioneer’s Game: Using the Logic of Brazen Self-Interest to See and Shape the Future by Bruce Bueno de Mesquita makes a pretty strong case for using models to make critical decisions, whether in business or international policy. To anyone involved in prediction science, Bueno de Mesquita’s claim of 90% accuracy (“According to a declassified CIA assessment…”) might seem an exaggeration. But the author has two things in his favor. He limits his efforts at prediction to a specific type of problem, and he’s predicting outcomes for which there is usually a limited set of possibilities (for example, whether or not a bank will issue a fraudulent financial report in a given year).

Nassim Nicholas Taleb introduced a new term into the lexicon of business forecasting: the “black swan event.” The metaphor comes from the long-held European belief that all swans were white, a belief that held until black swans turned up in Australia. In THE BLACK SWAN: The Impact of the Highly Improbable, Taleb expounds for 366 pages on what is, for the most part, a single idea: the normal (bell-shaped) distribution is pretty much worthless for predicting the likelihood of any random occurrence. Taleb augments this idea in various, occasionally entertaining ways, acquaints the reader with power law and “fat tail” distributions, and takes excursions through fractal geometry and chaos theory.

Taleb tells us he aspires to erudition, and he introduces the reader to plenty of “great thinkers” that history has failed to credit. You can come away from this book feeling that it is mostly about showing us how erudite Taleb is. For me, one of the key shortcomings is Taleb’s tendency to write as if we should accept his arguments on faith. There are plenty of concepts, especially involving numbers, that would benefit from concrete examples. There’s just a little too much “take my word for it” in his writing. Still, if you’ve got time to kill, this is not an unrewarding read.

David Orrell tackles the very same subject–our inability to predict the future–in The Future of Everything: The Science of Prediction (which has a sub-subtitle: “From Wealth and Weather to Chaos and Complexity”). For a mathematician, Orrell has an entertaining style and writes with clarity. This book is far more focused than THE BLACK SWAN, which is sort of meandering. The book is divided into three main parts: past, present, and future. The past provides a history of forecasting, beginning with the Greeks and the Oracle at Delphi. The present considers the challenges of prediction in three key areas: weather, health (via genetics), and finance. Orrell did his dissertation research on weather forecasting, and after reading this book, I think you’ll agree that it’s a great case study for testing everything we think we know about the “science of prediction.”

Orrell’s main point is that a key problem in prediction is model error (the basis of his dissertation), which far outweighs the influence of chaos and other random disturbances. In a nutshell, the complexity of these systems exceeds our ability to specify and parameterize models (models are subject to specification error, parameter error, and stochastic error). Weather is a great example. While there are only a few components to the system (temperature, humidity, air pressure, and such), the interactions between these components are almost impossible to predict. Another problem is the resolution of the model; conditions are extremely local, but it is very difficult to develop a model that resolves to a volume small enough to predict local conditions.
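A toy sketch (not Orrell’s model, and with made-up numbers) shows how parameter error compounds over a forecast horizon: a growth rate that is off by only about three percent produces roughly a nine percent forecast error after a hundred steps, before any chaos or noise enters the picture.

```python
def forecast(x0, r, steps):
    # simple compound-growth model: x_{t+1} = x_t * (1 + r)
    x = x0
    for _ in range(steps):
        x *= (1 + r)
    return x

true_r = 0.031    # the "real" growth rate (invented for illustration)
model_r = 0.030   # our estimate, off by only about 3%
truth = forecast(100.0, true_r, 100)
pred = forecast(100.0, model_r, 100)
rel_error = abs(truth - pred) / truth
print(f"relative forecast error after 100 steps: {rel_error:.1%}")  # roughly 9%
```

In a real weather or financial model the specification itself is also wrong, so the compounding is even less forgiving.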

Orrell educates.  The reader comes away with an understanding of the logic and mechanics of forecasting, as well as the seemingly intractable challenges.  Orrell provides clear explanations of many important forecasting concepts and does a good job of making the math accessible to a general reader.  There are a couple of shortcomings.  Orrell gives only passing notice to agent-based simulation and similar computational approaches to complexity.  And, in the third part of the book (the “future”), after spending the preceding two parts on the near futility of prediction (but for different reasons than Taleb), Orrell offers his “best guesses” for the future in areas such as climate change.

While I embrace the basic premises of these books, some new developments are cause for optimism. In one example, economists using an agent-based model of credit markets were able to simulate the fall off the cliff that we’ve experienced in the real world. While not truly “predictive,” these models can help us understand the conditions that are likely to produce extreme outcomes.

THE BLACK SWAN has its rewards, but The Future of Everything has far more value for the forecasting professional.  As a chaser, you might try Why Most Things Fail:  Evolution, Extinction and Economics by Paul Ormerod.

Copyright 2010 by David G. Bakken.  All rights reserved.

Steve Lohr reported in The New York Times on July 28 that two teams appear to have tied for the $1 million prize offered by Netflix to anyone who could improve its movie recommendation system (target: a 10% reduction in a measure of prediction error).  This is certainly a triumph for the field of predictive modeling, and, perhaps, for “crowdsourcing” (at least when accompanied by a big monetary carrot) as an effective method for finding innovative solutions to difficult problems.

Predictive modeling has been used to target customers and to determine their creditworthiness for at least a couple of decades, but it’s been receiving a lot more attention lately, in part thanks to books like Supercrunchers (by Ian Ayres, Bantam, 2007) and Competing on Analytics (by Thomas H. Davenport and Jeanne G. Harris, Harvard Business School Press, 2007). The basic idea behind predictive modeling, as most of you will know, is that variation in some as-yet-unobserved outcome variable (such as whether a consumer will respond to a direct mail offer, spend a certain amount on a purchase, or give a movie a rating of four out of five stars) can be predicted from the relationship between that outcome and one or more variables that we can observe in advance. We learn about such relationships by looking at cases where we can observe both the outcome and the “predictors.” The workhorse method for uncovering such relationships is regression analysis.
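As a minimal sketch of that workhorse, here is an ordinary least squares fit of a single observed predictor to an outcome, followed by a prediction for a new, not-yet-observed case. The numbers are invented for illustration:

```python
# Toy example: fit y = a + b*x by least squares, then predict a new case.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # observed predictor (e.g., past purchases)
ys = [2.1, 3.9, 6.2, 8.1, 9.8]   # observed outcome (e.g., spend)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# slope = covariance of x and y divided by variance of x
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x  # intercept

new_x = 6.0  # a case where we observe the predictor but not yet the outcome
print(f"predicted outcome for x={new_x}: {a + b * new_x:.2f}")
```

Real applications use many predictors and fancier estimators, but the logic is the same: learn the relationship from cases where both sides are observed, then apply it to cases where only the predictors are.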

In many respects, the Netflix Prize is a textbook example of the development of a predictive model for business applications. In the first place, prediction accuracy is important for Netflix, which operates a long tail business, making a lot of money from its “backlist” of movie titles. Recommendation engines like Cinematch and those used by Amazon and other online retailers make the long tail possible to the extent that they bring backlist titles to the attention of buyers who otherwise would not discover them. Second, Netflix has a lot of data consisting of ratings of movies by its many customers that can be used as fodder in developing the model. All entrants had access to a dataset consisting of more than 100 million ratings from over 480,000 randomly chosen Netflix customers (that’s roughly 200 ratings per customer). In all, these customers rated about 18,000 different titles (about 5,500 ratings per title). That is a lot of data for developing a predictive model by almost any standard. And, following the textbook approach, Netflix provided a second dataset to be used for testing the model, because the goal of the modeling is to predict cases not yet encountered. The judging was based on how accurately a model predicted the ratings in this dataset (ratings that were not provided to the contestants).
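The contest’s accuracy measure was root mean squared error (RMSE) on those held-out ratings. A sketch of that computation, with invented ratings and predictions, looks like this:

```python
import math

# Hypothetical held-out "true" ratings and a model's predictions for them.
actual = [4, 3, 5, 2, 4, 1]
predicted = [3.8, 3.4, 4.5, 2.6, 3.9, 1.7]

# RMSE: square the errors, average them, take the square root.
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
rmse = math.sqrt(sum(squared_errors) / len(actual))
print(f"RMSE on the holdout set: {rmse:.3f}")
```

Scoring on data the contestants never saw is what keeps the competition honest: a model that merely memorizes the training ratings gains nothing on the holdout set.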

There were a couple of unusual challenges in this competition.  First, despite the sheer quantity of data, it is potentially “sparse” in terms of the number of individuals who rated exactly the same sets of movies.  A second challenge came in the form of what Clive Thompson, in an article in the Sunday Times Magazine (“If You Liked This, You’re Sure to Love That,” November 23, 2008), called the “Napoleon Dynamite” problem.  In a nutshell, it’s really hard to predict how much someone will like “Napoleon Dynamite” based on how much they like other films.  Other problem films identified by Len Bertoni, one of the contestants Thompson interviewed for the article, include “Lost in Translation” (which I liked) and “The Life Aquatic with Steve Zissou” (which I hated, even though both films star Bill Murray).
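The sparsity problem is easy to see in miniature: with a sparse user-by-title ratings matrix, any two customers may share very few rated films, which starves neighbor-based methods of evidence. The names and ratings below are hypothetical:

```python
# Hypothetical sparse ratings: most user/title pairs are simply missing.
ratings = {
    "alice": {"Napoleon Dynamite": 5, "Lost in Translation": 4},
    "bob":   {"The Life Aquatic": 2, "Lost in Translation": 5},
    "carol": {"Napoleon Dynamite": 1},
}

def shared_titles(u, v):
    # titles both users have rated -- the only evidence for comparing them
    return set(ratings[u]) & set(ratings[v])

print(shared_titles("alice", "bob"))    # a single film in common
print(shared_titles("alice", "carol"))  # one film, rated very differently
```

With so little overlap, any similarity score between two users rests on a handful of films, which is exactly where polarizing titles like “Napoleon Dynamite” do their damage.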

I’m eager to see the full solutions that the winning teams employed.  After reading about the “Napoleon Dynamite” problem, I began to think that a hierarchical Bayesian solution might work by capturing some of the unique variability in these problem films, but there are likely other machine learning approaches that would work.

It’s possible that the achievements of these two teams will translate to real advances for predictive modeling based on the kinds of behavioral and attitudinal data that companies can gather from or about their customers. If that’s the case, then we’ll probably see companies turning to ever more sophisticated predictive models.  But better predictive models do not necessarily improve our understanding of the drivers of customer behavior.  What’s missing in many data-driven predictive modeling systems like Cinematch is a theory of movie preferences.  This is one reason why the algorithms came up short in predicting the ratings for films like “Napoleon Dynamite”–the data do not contain all the information needed to explain or understand movie preferences.  If you looked across my ratings for a set of films similar to “The Life Aquatic” in important respects (cast, director, quirkiness factor), you would predict that I’d give this movie a four or a five.  Same thing for “The Duchess,” which I sent back to Netflix without even watching the entire movie.

These minor inaccuracies may not matter much to Netflix, which should seek to optimize across as many customers and titles as possible.  Still, if I follow the recommendations of Cinematch and I’m disappointed too often, I may just discontinue Netflix altogether. (NOTE: Netflix incorporates some additional information into its Cinematch algorithm, but for purposes of the contest, it restricted the data available to the contestants.)

In my view, predictive models can be powerful business tools, but they have the potential to lead us into a false belief that because we can predict something on the basis of mathematical relationships, we understand what we’re predicting.  We might also lapse into an expectation that “prediction” based on past behavior is in fact destiny.  We need to remind ourselves that correlation or association is a necessary but not a sufficient condition for showing a causal relationship.

Copyright 2009 by David G. Bakken.  All rights reserved.