If you were listening to NPR’s “All Things Considered” broadcast on January 18, you might have heard a brief report on research that reveals regional differences (“dialects”) in word usage, spellings, slang and abbreviations in Twitter postings. For example, Northern and Southern California use the spelling variants “koo” and “coo” to mean “cool.”

Finding regional differences in these written expressions is interesting in its own right, but I’ve just finished reading the paper describing this research and there’s a lot more going on here than simply counting and comparing expressions across different geographic regions.  The paper is an excellent example of what market researchers might do to analyze social media.

The study authors (Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, and Eric P. Xing) are affiliated with the School of Computer Science at Carnegie Mellon University; Eisenstein, who was interviewed for the ATC broadcast, is a postdoctoral fellow. They set out to develop a latent variable model to predict an author’s geographic location from the characteristics of text messages. As they point out, their work is unique in that they use raw text data (although “tokenized”) as input to the modeling. They develop and compare a few different models, including a “geographic topic model” that incorporates the interaction between base topics (such as sports) and an author’s geographic location, as well as additional latent variable models: a “mixture of unigrams” (a model that assumes a single topic) and a “supervised latent Dirichlet allocation.” If you have not yet figured it out, the models, as described, use statistical machine learning methods. That means that some of the terminology may be unfamiliar to market researchers, but the description of the algorithm for the geographic topic model resembles the hierarchical Bayesian methods using the Gibbs sampler that have come into fairly wide use in market research (especially for choice-based conjoint analysis).
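To make the idea of predicting location from tokenized text a little more concrete, here is a minimal Python sketch. It is not the authors' geographic topic model: it is a much simpler supervised unigram classifier (naive Bayes with add-one smoothing) in which the region is an observed label rather than a latent variable. The toy posts, the region names "north" and "south," and the function names are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a crude stand-in for real tokenization)."""
    return text.lower().split()


def train(labeled_posts):
    """Count tokens per region; labeled_posts is a list of (region, text) pairs."""
    token_counts = defaultdict(Counter)
    post_counts = Counter()
    for region, text in labeled_posts:
        post_counts[region] += 1
        token_counts[region].update(tokenize(text))
    vocab = {tok for counts in token_counts.values() for tok in counts}
    return token_counts, post_counts, vocab


def predict(text, token_counts, post_counts, vocab, alpha=1.0):
    """Return the region with the highest smoothed log-probability for the text."""
    total_posts = sum(post_counts.values())
    scores = {}
    for region in post_counts:
        counts = token_counts[region]
        denom = sum(counts.values()) + alpha * len(vocab)
        score = math.log(post_counts[region] / total_posts)  # region prior
        for tok in tokenize(text):
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((counts[tok] + alpha) / denom)
        scores[region] = score
    return max(scores, key=scores.get)


# Invented toy data; which spelling belongs to which region is an assumption.
TOY_POSTS = [
    ("north", "that show was koo"),
    ("north", "koo see you later"),
    ("south", "that show was coo"),
    ("south", "coo see you later"),
]

model = train(TOY_POSTS)
print(predict("this is koo", *model))
```

With real data, the labeled posts would come from geotagged messages, and a single-spelling cue like this would be only one of many signals. The models in the paper go further by treating topics and regions as latent variables and letting them interact.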

This research is important for market research because it demonstrates a method for estimating characteristics of individual authors from the characteristics of their social media postings.  While we have not exhausted the potential of simpler methods (frequency and sentiment analyses, for example), this looks like the future of social media analysis for marketing.

Copyright 2011 by David G. Bakken.  All rights reserved.

Looking back over the last year in market research offers an opportunity to consider just which transformations, new ideas, industry trends, and emerging techniques might shape MR over the next few years.  Here’s a list of eight topics I’ve been following, with thoughts on the potential impact each might have on MR over the next two or three years. (more…)

In my last post on predictive modeling (4 August 2009) I used the recent announcement that the Netflix Prize appears to have been won to make two points.  First, predictive modeling based on huge amounts of consumer/customer data is becoming more important and more prevalent throughout business (and other aspects of life as well).  Second, the power of predictive modeling to deliver improved results may seduce us into believing that just because we can predict something, we understand it.

Perhaps because it fuses popular culture with predictive modeling, Cinematch (Netflix’s recommendation engine) seemed like a good example to use in making these points. For one thing, if predicting movie viewers’ preferences were easy, the motion picture industry would probably have figured out how to do it at some early stage in the production process (not that they haven’t tried). A recent approach uses neural network modeling to predict box office success from the characteristics of the screenplay (you can read Malcolm Gladwell’s article in The New Yorker titled “The Formula” for a narrative account of this effort). The market is segmented by product differentiation (e.g., genres) as well as preferences. At the same time, moviegoers’ preferences are somewhat fluid, and there is a lot of “cross-over,” with fans of foreign and independent films also flocking to the most Hollywood of blockbuster films.

This brings to mind a paradox of predictive modeling (PM). PM can work pretty well in the aggregate (perhaps allowing Netflix to do a good job of estimating demand for different titles in the backlist) but not so well when it comes to predicting a given individual’s preferences. I tend to be reminded of this every time I look at the list of movies that Cinematch predicts I’ll love. For each recommended film, there’s a list of one or more other films that form the basis for the recommendation. I’m struck by the often wide disparities between the recommended film and the films that led to the recommendation. One example: Cinematch recommended “Little Miss Sunshine” (my predicted rating is 4.9, compared to an average of 3.9) because I also liked “There Will Be Blood,” “Fargo,” and “Syriana.” It would be hard to find three films more different from “Little Miss Sunshine.” “Mostly Martha” is another example. This is a German film in the “foreign romance” genre that was remade as “No Reservations” in the U.S. with Catherine Zeta-Jones. Cinematch based its recommendation on the fact that I liked “The Station Agent.” These two films have almost no objective elements in common. They are in different languages, set in different countries, with very different story lines, casts and so forth. But they share many subjective elements (great acting, characters you care about, and humor, among others) and it’s easy to imagine that someone who likes one of these will enjoy the other. On the other hand, Cinematch made a lot of strange recommendations (such as “Amelie,” a French romantic comedy) based on the fact that I enjoyed “Gandhi,” the Oscar-winning 1982 biopic that starred Ben Kingsley. (more…)