In November of last year David Leonhardt, an economics writer for The New York Times, created an “interactive puzzle” that enabled readers to create a solution for reducing the federal deficit by $1.3 trillion (or thereabouts) in 2030.  A variety of options involving either spending cuts or tax increases that reflected the recommendations of the deficit reduction commission were offered, along with the size of the reduction associated with each option.  Visitors to the puzzle simply selected various options until they achieved the targeted reduction.

The options represented trade-offs, the simplest being that between cutting programs or raising revenues.  Someone has to suffer, and suffering was not evenly distributed across the options.  Nearly seven thousand Twitter users completed the puzzle, and Leonhardt has summarized the choices.  You might still be able to access the puzzle online at

Leonhardt was able to group the solutions according to whether they seemed to consist mostly of spending cuts or a mix of spending cuts and tax increases.  He admits that the “sample” is not scientific and, given that it consists of Twitter users, may skew young.  Unfortunately, no personal data were collected from those who completed the puzzle, so we’re left to speculate about the patterns of choices.  Perhaps a little data mining would shed some additional light on the clustering of responses.
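
To make the grouping idea concrete, here is a minimal sketch in Python of one way to label a solution as “mostly cuts” or “mixed” based on the share of its savings that comes from spending cuts.  The option names, dollar amounts, and the 75% threshold are all hypothetical; I don’t know the rule Leonhardt actually applied, and these are not the puzzle’s real options.

```python
# Hypothetical menu: option name -> (kind, savings in $ billions).
# These are illustrative stand-ins, not the puzzle's actual options.
OPTIONS = {
    "cut_program_a": ("spending", 300),
    "cut_program_b": ("spending", 500),
    "raise_tax_x": ("tax", 400),
    "raise_tax_y": ("tax", 600),
}


def spending_share(chosen):
    """Fraction of a solution's total savings that comes from spending cuts."""
    total = sum(OPTIONS[name][1] for name in chosen)
    if total == 0:
        return 0.0
    cuts = sum(OPTIONS[name][1] for name in chosen
               if OPTIONS[name][0] == "spending")
    return cuts / total


def label_solution(chosen, threshold=0.75):
    """Label a solution 'mostly cuts' when cuts supply at least `threshold`
    of the savings (an assumed cutoff), otherwise 'mixed'."""
    return "mostly cuts" if spending_share(chosen) >= threshold else "mixed"


# Two hypothetical completed puzzles.
print(label_solution(["cut_program_a", "cut_program_b"]))
print(label_solution(["cut_program_a", "raise_tax_y"]))
```

With demographic or political orientation data attached to each solution, the same labels could be cross-tabulated against respondent characteristics.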

Even though this is not survey research in the way that we know it, there may be much value in using this type of puzzle to measure public opinion about the tough choices that the U.S. is facing.  The typical opinion survey might ask respondents whether they “favor” one course of action or another (“Do you favor spending cuts or tax increases for reducing the deficit?”).  The options presented in Leonhardt’s puzzle represent real policy choices, and the differences between them force you to consider the trade-offs you are willing to make.  While the choices were comprehensive, they were not contrived in the way that conjoint analysis structures choices; that might present a problem if we are trying to develop a model to predict or explain preferences.

There’s no reason this technique cannot be used with the same kinds of samples that we obtain for much online survey research.  Add a few demographic and political orientation questions and you have what I think could be a powerful way to capture the trade-offs that the public is willing to make.

Copyright 2011 by David G. Bakken.  All rights reserved.


The New York Times is one of the more interesting innovators when it comes to using data visualization to tell a story or make a point.  In particular, the Business section employs a variety of chart forms to reveal what is happening in financial markets.  The Weather Report uses “small multiples” to show 10-day temperature trends for major U.S. cities.

Even more interesting are the occasional illustrations that appear under the heading of “Op-Chart.”  For a few years now the Times has periodically presented on the Op-Ed page a comparative table that tracks “progress” in Iraq on a number of measures such as electric power generation.

Another impressive chart appeared in “Sunday Opinion” on January 10, 2010.  Titled “A Year in Iraq and Afghanistan,” this full page illustration provides a detailed look at the 489 American and allied deaths that occurred in Afghanistan and the 141 deaths in Iraq.  At first glance, the chart resembles the Periodic Table of Elements.  Deaths in Iraq take up the top one-fourth or so of the chart (along with the legend); deaths in Afghanistan occupy the bulk of the illustration.

Each death is represented by a figure, and each figure appears in a box representing the date on which the death occurred.  One figure shape represents American forces, and a slightly different shape signifies a member of the coalition forces.  For coalition forces, the color of the figure indicates nationality.  A small symbol indicates the cause of each death (homemade bomb, mortar, hostile fire, bomb, suicide bomb, or non-combat related).  Multiple deaths from the same event or cause on a date occupy the same box.

Most dates have only a single death, but a few days stand out as particularly tragic:  seven U.S. troops dying due to a non-combat related cause in Afghanistan on October 26; eight killed by hostile fire on October 3; seven killed by a homemade bomb on October 27; six Italians killed by a homemade bomb on September 17; five Americans killed by a suicide bomber in Mosul, Iraq, on April 10.

The deaths are linked to specific locations on maps of Iraq and Afghanistan.  Helmand Province was the deadliest place, with 79 of the 489 deaths in Afghanistan.  In Iraq, Baghdad was the most dangerous place, accounting for 42 of the 141 deaths in that country.  While Americans are the largest number, 112 of the dead in Afghanistan were British troops.

There is a wealth of information in this chart, with four pieces of information on every death, but in some ways there is too much detail.  To get at the numbers I provided above, I had to manually count the pictures.  There are no summary statistics.  The picture grabs our attention, and immediately conveys the magnitude of the price the U.S. and our allies are paying in Afghanistan.  But if we want to act on data, we need a little more than just a very clever visual display.  Summaries of the numbers would help here.  It’s useful to know, for example, that 65 of the 141 deaths in Iraq (46%) were due to non-combat related causes, compared to 48 (10%) of the deaths in Afghanistan.  Eighty percent of the fatalities in deadly Helmand province were due to hostile fire; 57% in other parts of Afghanistan were caused by homemade bombs (in Iraq there were 19 deaths, or 13% of the total, from homemade bombs).
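
The kind of summary I have in mind is nothing more than shares of the totals.  As a minimal sketch in Python, using only the hand-counted figures quoted in this post (the dictionary layout and the `share` helper are my own illustrative names, not anything from the chart):

```python
# Hand-counted totals quoted in this post (possibly off by one or two).
totals = {"Iraq": 141, "Afghanistan": 489}

# (country, description) -> count, all taken from the text of this post.
counts = {
    ("Iraq", "non-combat causes"): 65,
    ("Afghanistan", "non-combat causes"): 48,
    ("Iraq", "homemade bombs"): 19,
    ("Iraq", "Baghdad"): 42,
    ("Afghanistan", "Helmand Province"): 79,
    ("Afghanistan", "British troops"): 112,
}


def share(count, total):
    """Return count as a percentage of total, rounded to a whole percent."""
    return round(100 * count / total)


# Percentage of each country's deaths accounted for by each category.
summary = {key: share(n, totals[key[0]]) for key, n in counts.items()}

for (country, label), pct in summary.items():
    n = counts[(country, label)]
    print(f"{country}: {label} = {n} of {totals[country]} ({pct}%)")
```

A small table like the one this prints, placed in a corner of the chart, would give readers the totals without any counting.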

Two of the creators of this chart, Adriana Lins de Albuquerque (a doctoral student in political science at Columbia) and Alicia Cheng of, produced a slightly different version of this chart summarizing the death toll in Iraq for 2007.  That earlier version did not have as much detail about each individual death (location information is not included, for example) but includes some additional causes, like torture and beheading that, thankfully, appear to have disappeared.

The advantage to displaying data in this fashion lies in the ability of our brains to form patterns quickly.  The use of color to designate coalition members makes the contributions of our allies apparent in a way that a simple tally might not.  Even without a year-to-year comparison, we can see that Iraq has become, at least for U.S. troops and our allies, a much safer place than Afghanistan.  Additionally, this one chart presents data that, in other forms, might require several PowerPoint slides to communicate: deaths by date, deaths by city or province, deaths by nationality, number killed per incident, and cause of death.

Any complex visual display of data requires making trade-offs.  In this case, for example, the creators arranged the deaths chronologically (oldest first) within each geographic block.  That means that patterns in other variables, such as cause of death or nationality of troops, may be harder to detect at first glance.  The chronological ordering has layout implications, since on some dates there were multiple casualties.

All in all, it’s a great piece of data visualization that to my mind would be even better with the addition of a few summary statistics.

A disclaimer–I counted twice to get each of the numbers I provide above, but I offer no guarantee that I am not off by one or two deaths in any of those numbers.

Copyright 2010 by David G. Bakken.  All rights reserved.

In my last post on predictive modeling (4 August 2009) I used the recent announcement that the Netflix Prize appears to have been won to make two points.  First, predictive modeling based on huge amounts of consumer/customer data is becoming more important and more prevalent throughout business (and other aspects of life as well).  Second, the power of predictive modeling to deliver improved results may seduce us into believing that just because we can predict something, we understand it.

Perhaps because it fuses popular culture with predictive modeling, Cinematch (Netflix’s recommendation engine) seemed like a good example to use in making these points.  For one thing, if predicting movie viewers’ preferences were easy, the motion picture industry would probably have figured out how to do it at some early stage in the production process–not that they haven’t tried.  A recent approach uses neural network modeling to predict box office success from the characteristics of the screenplay (you can read Malcolm Gladwell’s article in The New Yorker titled “The Formula” for a narrative account of this effort).  The market is segmented by product differentiation (e.g., genres) as well as preferences.  At the same time, moviegoers’ preferences are somewhat fluid, and there is a lot of “cross-over” with fans of foreign and independent films also flocking to the most Hollywood of blockbuster films.

This brings to mind a paradox of predictive modeling (PM).  PM can work pretty well in the aggregate (perhaps allowing Netflix to do a good job of estimating demand for different titles in the backlist) but not so well when it comes to predicting a given individual’s preferences.  I tend to be reminded of this every time I look at the list of movies that Cinematch predicts I’ll love.  For each recommended film, there’s a list of one or more other films that form the basis for the recommendation.  I’m struck by the often wide disparities between the recommended film and the films that led to the recommendation.  One example:  Cinematch recommended “Little Miss Sunshine” (my predicted rating is 4.9, compared to an average of 3.9) because I also liked “There Will Be Blood,” “Fargo,” and “Syriana.”  It would be hard to find three films more different from “Little Miss Sunshine.”  “Mostly Martha” is another example.  This is a German film in the “foreign romance” genre that was remade as “No Reservations” in the U.S. with Catherine Zeta-Jones.  Cinematch based its recommendation on the fact that I liked “The Station Agent.”  These two films have almost no objective elements in common.  They are in different languages, set in different countries, with very different story lines, cast and so forth.  But they share many subjective elements (great acting, characters you care about, and humor, among others) and it’s easy to imagine that someone who likes one of these will enjoy the other.  On the other hand, Cinematch made a lot of strange recommendations (such as “Amelie,” a French romantic comedy) based on the fact that I enjoyed “Gandhi,” the Oscar-winning 1982 biopic that starred Ben Kingsley.

Steve Lohr reported in The New York Times on July 28 that two teams appear to have tied for the $1 million prize offered by Netflix to anyone who could improve its movie recommendation system (target: a 10% reduction in a measure of prediction error).  This is certainly a triumph for the field of predictive modeling, and, perhaps, for “crowdsourcing” (at least when accompanied by a big monetary carrot) as an effective method for finding innovative solutions to difficult problems.

Predictive modeling has been used to target customers and to determine their creditworthiness for at least a couple of decades, but it’s been receiving a lot more attention lately, in part thanks to books like Super Crunchers (by Ian Ayres, Bantam, 2007) and Competing on Analytics (by Thomas H. Davenport and Jeanne G. Harris, Harvard Business School Press, 2007).  The basic idea behind predictive modeling, as most of you will know, is that variation in some as yet unobserved outcome variable (such as whether a consumer will respond to a direct mail offer, spend a certain amount on a purchase, or give a movie a rating of four out of five stars) can be predicted based on knowledge of the relationship between one or more variables that we can observe in advance and the outcome of interest.  And we learn about such relationships by looking at cases where we can observe both the outcome and the “predictors.”  The workhorse method for uncovering such relationships is regression analysis.
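
For readers who have not seen it spelled out, here is a minimal sketch in Python of that basic idea: fit a one-predictor least-squares regression on cases where both predictor and outcome are observed, then predict the outcome for a new case.  The data are made up for illustration, and the function names are my own; real applications use many predictors and far more cases.

```python
# Hypothetical training cases where both predictor and outcome are observed:
# x = a customer's average rating of similar films, y = rating of a target film.
observed = [(2.0, 2.5), (3.0, 3.0), (3.5, 3.5), (4.0, 4.5), (4.5, 4.5)]


def fit_simple_regression(pairs):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """Predicted outcome for a case whose outcome has not been observed."""
    intercept, slope = model
    return intercept + slope * x


model = fit_simple_regression(observed)
print("intercept, slope:", model)
print("predicted rating for x = 5.0:", predict(model, 5.0))
```

Note that the fitted slope describes an association in the observed cases; nothing in the arithmetic says why the two ratings move together.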

In many respects, the Netflix Prize is a textbook example of the development of a predictive model for business applications.  In the first place, prediction accuracy is important for Netflix, which operates a long tail business, making a lot of money from its “backlist” of movie titles.  Recommendation engines like Cinematch and those used by Amazon and other online retailers make the long tail possible to the extent that they bring backlist titles to the attention of buyers who otherwise would not discover them.  Second, Netflix has a lot of data consisting of ratings of movies by its many customers that can be used as fodder in developing the model.  All entrants had access to a dataset consisting of more than 100 million ratings from over 480,000 randomly chosen Netflix customers (that’s roughly 200 ratings per customer).  In all, these customers rated about 18,000 different titles (for about 5,500 ratings per title).  That is a lot of data for developing a predictive model by almost any standard.  And, following the textbook approach, Netflix provided a second dataset to be used for testing the model, because the goal of the modeling is to predict cases not yet encountered, and the judging was based on how accurately a model predicted the ratings in this dataset (and those ratings were not provided to the contestants).
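
The error measure used in the contest was root mean squared error (RMSE), computed on the held-out ratings.  The sketch below, in Python, reproduces the back-of-the-envelope averages quoted above and shows how a 10% improvement target would be checked.  The held-out ratings and both sets of predictions are made-up numbers, not contest data, and the function names are my own.

```python
import math

# Approximate figures quoted in this post.
n_ratings = 100_000_000
n_customers = 480_000
n_titles = 18_000
ratings_per_customer = n_ratings / n_customers  # a little over 200
ratings_per_title = n_ratings / n_titles        # a little over 5,500


def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists of ratings."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def relative_improvement(baseline_error, new_error):
    """Fractional reduction in error relative to the baseline."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical held-out ratings and two sets of predictions.
held_out = [4, 3, 5, 2, 4, 1]
baseline_pred = [3.5, 3.5, 4.0, 3.0, 3.5, 2.5]
candidate_pred = [3.8, 3.2, 4.5, 2.5, 3.8, 1.8]

gain = relative_improvement(rmse(held_out, baseline_pred),
                            rmse(held_out, candidate_pred))
print("meets a 10% target:", gain >= 0.10)
```

Scoring on ratings the modelers never saw is what keeps a model from being rewarded for simply memorizing the training data.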

There were a couple of unusual challenges in this competition.  First, despite the sheer quantity of data, it is potentially “sparse” in terms of the number of individuals who rated exactly the same sets of movies.  A second challenge came in the form of what Clive Thompson, in an article in the Sunday Times Magazine (“If You Liked This, You’re Sure to Love That,” November 23, 2008), called the “Napoleon Dynamite” problem.  In a nutshell, it’s really hard to predict how much someone will like “Napoleon Dynamite” based on how much they like other films.  Other problem films identified by Len Bertoni, one of the contestants Thompson interviewed for the article, include “Lost in Translation” (which I liked) and “The Life Aquatic with Steve Zissou” (which I hated, even though both films star Bill Murray).

I’m eager to see the full solutions that the winning teams employed.  After reading about the “Napoleon Dynamite” problem, I began to think that a hierarchical Bayesian solution might work by capturing some of the unique variability in these problem films, but there are likely other machine learning approaches that would work.

It’s possible that the achievements of these two teams will translate to real advances for predictive modeling based on the kinds of behavioral and attitudinal data that companies can gather from or about their customers.  If that’s the case, then we’ll probably see companies turning to ever more sophisticated predictive models.  But better predictive models do not necessarily improve our understanding of the drivers of customer behavior.  What’s missing in many data-driven predictive modeling systems like Cinematch is a theory of movie preferences.  This is one reason why the algorithms came up short in predicting the ratings for films like “Napoleon Dynamite”–the data do not contain all the information needed to explain or understand movie preferences.  If you looked across my ratings for a set of films similar to “The Life Aquatic” in important respects (cast, director, quirkiness factor) you would predict that I’d give this movie a four or a five.  Same thing for “The Duchess,” which I sent back to Netflix without even watching the entire movie.

These minor inaccuracies may not matter much to Netflix, which should seek to optimize across as many customers and titles as possible.  Still, if I follow the recommendations of Cinematch and I’m disappointed too often, I may just discontinue Netflix altogether.  (NOTE:  Netflix incorporates some additional information into their Cinematch algorithm, but for purposes of the contest, they restricted the data available to the contestants.)

In my view, predictive models can be powerful business tools, but they have the potential to lead us into a false belief that because we can predict something on the basis of mathematical relationships, we understand what we’re predicting.  We might also lapse into an expectation that “prediction” based on past behavior is in fact destiny.  We need to remind ourselves that correlation or association is a necessary but not a sufficient condition to show a causal relationship.

Copyright 2009 by David G. Bakken.  All rights reserved.

Keith Devlin, NPR’s “math guy,” and Gary Lorden, the math consultant on the hit CBS television series “NUMB3RS(tm),” have written an enjoyable and informative book that explains the math behind many of the more sophisticated analytic methods finding their way into the customer knowledge business these days.  If you’ve been perched at the edge of the “supercrunching” pool, wanting to dip your toes but afraid you’ll end up over your head, this book might be just what you need.  Because the context is crime fighting (at least as depicted on network TV), the applications are fun as well as informative.  Topics include geographic profiling, data mining, changepoint detection, Bayesian inference, the math of networks, and a bit of game theory (Chapter 11:  The Prisoner’s Dilemma, Risk Analysis, and Counter-terrorism).

The Numbers Behind NUMB3RS:  Solving Crime With Mathematics.  PLUME (Penguin) 2007.  ISBN 978-0-452-28857-7.