Thinking Analytically

If you were listening to NPR’s “All Things Considered” broadcast on January 18, you might have heard a brief report on research that reveals regional differences (“dialects”) in word usage, spellings, slang and abbreviations in Twitter postings.  For example, Northern and Southern California use the spelling variants koo and coo to mean “cool.”

Finding regional differences in these written expressions is interesting in its own right, but I’ve just finished reading the paper describing this research and there’s a lot more going on here than simply counting and comparing expressions across different geographic regions.  The paper is an excellent example of what market researchers might do to analyze social media.

The study authors–Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, and Eric P. Xing–are affiliated with the School of Computer Science at Carnegie Mellon University (Eisenstein, who was interviewed for the ATC broadcast, is a postdoctoral fellow).  They set out to develop a latent variable model to predict an author’s geographic location from the characteristics of text messages.  As they point out, their work is unique in that they use raw text data (although “tokenized”) as input to the modeling.  They develop and compare a few different models, including a “geographic topic model” that incorporates the interaction between base topics (such as sports) and an author’s geographic location, as well as additional latent variable models:  a “mixture of unigrams” (a model that assumes a single topic) and a “supervised latent Dirichlet allocation.”  If you have not yet figured it out, the models, as described, use statistical machine learning methods.  That means that some of the terminology may be unfamiliar to market researchers, but the description of the algorithm for the geographic topic model resembles the hierarchical Bayesian methods using the Gibbs sampler that have come into fairly wide use in market research (especially for choice-based conjoint analysis).
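To make the idea of predicting location from raw tokens concrete, here is a minimal sketch.  It is not the authors’ geographic topic model; it is a much simpler supervised unigram (naive Bayes) classifier with add-one smoothing, and the function names, region labels, and training messages are all invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train(messages):
    """messages: list of (region, text) pairs. Returns token and message counts per region."""
    token_counts = defaultdict(Counter)
    region_counts = Counter()
    for region, text in messages:
        region_counts[region] += 1
        token_counts[region].update(text.lower().split())
    return token_counts, region_counts

def predict(text, token_counts, region_counts):
    """Return the region with the highest smoothed log-probability for the text."""
    vocab = {tok for counts in token_counts.values() for tok in counts}
    n_messages = sum(region_counts.values())
    best_region, best_score = None, -math.inf
    for region, counts in token_counts.items():
        total = sum(counts.values())
        # Log prior for the region plus add-one smoothed log-likelihood of each token.
        score = math.log(region_counts[region] / n_messages)
        for tok in text.lower().split():
            score += math.log((counts[tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_region, best_score = region, score
    return best_region

# Toy training data (made up, not taken from the paper).
toy = [("north", "that show was koo"), ("south", "that show was coo")]
token_counts, region_counts = train(toy)
print(predict("so koo", token_counts, region_counts))
```

The geographic topic model in the paper goes well beyond this by letting latent topics interact with location, but the input is the same kind of tokenized raw text.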

This research is important for market research because it demonstrates a method for estimating characteristics of individual authors from the characteristics of their social media postings.  While we have not exhausted the potential of simpler methods (frequency and sentiment analyses, for example), this looks like the future of social media analysis for marketing.

Copyright 2011 by David G. Bakken.  All rights reserved.


There’s an interesting article by Jonah Lehrer in the Dec. 13 issue of The New Yorker, “The Truth Wears Off:  Is there something wrong with the scientific method?” Lehrer reports that a growing number of scientists are concerned about what psychologist Joseph Banks Rhine termed the “decline effect.”  In a nutshell, the “decline effect” is the tendency for the size of an observed effect to shrink over the course of studies attempting to replicate that effect.  Lehrer cites examples from studies of the clinical outcomes for a class of once-promising antipsychotic drugs as well as from more theoretical research.  This is a scary situation given the inferential nature of most scientific research.  Each set of observations represents an opportunity to disconfirm a hypothesis.  As long as subsequent observations don’t lead to disconfirmation, our confidence in the hypothesis grows.  The decline effect suggests that replication is more likely, over time, to disconfirm a hypothesis than not.  Under those circumstances, it’s hard to develop sound theory.

Given that market researchers apply much of the same reasoning as scientists in deciding what’s an effect and what isn’t, the decline effect is a serious threat to creating customer knowledge and making evidence-based marketing decisions.

As I noted in my last post, the American Marketing Association’s Advanced Research Techniques Forum took place in San Francisco the second week in June (June 6-9).  The program is an intentional mix of presentations from academic researchers and market research practitioners.  While the practitioner presentations are often more interesting, at least from the standpoint of a fellow practitioner, this year the best and most useful presentations either came from the academic side or had significant contributions from one or more academic researchers.  In that last post I wrote about three papers that explored different aspects of social media.  Three more papers from this year’s ART Forum make my list of the most worthwhile presentations.

The New York Times is one of the more interesting innovators when it comes to using data visualization to tell a story or make a point.  In particular, the Business section employs a variety of chart forms to reveal what is happening in financial markets.  The Weather Report uses “small multiples” to show 10-day temperature trends for major U.S. cities.

Even more interesting are the occasional illustrations that appear under the heading of “Op-Chart.”  For a few years now the Times has periodically presented on the Op-Ed page a comparative table that tracks “progress” in Iraq on a number of measures such as electric power generation.

Another impressive chart appeared in “Sunday Opinion” on January 10, 2010.  Titled “A Year in Iraq and Afghanistan,” this full-page illustration provides a detailed look at the 489 American and allied deaths that occurred in Afghanistan and the 141 deaths in Iraq.  At first glance, the chart resembles the Periodic Table of Elements.  Deaths in Iraq take up the top one-fourth or so of the chart (along with the legend); deaths in Afghanistan occupy the bulk of the illustration.

Each death is represented by a figure, and each figure appears in a box representing the date on which the death occurred.  One figure shape represents American forces, and a slightly different shape signifies a member of the coalition forces.  For coalition forces, the color of the figure indicates nationality.  A small symbol indicates the cause of each death (homemade bomb, mortar, hostile fire, bomb, suicide bomb, or non-combat related).  Multiple deaths from the same event or cause on a date occupy the same box.

Most dates have only a single death, but a few days stand out as particularly tragic:  seven U.S. troops dying due to a non-combat related cause in Afghanistan on October 26; eight killed by hostile fire on October 3; seven killed by a homemade bomb on October 27; six Italians killed by a homemade bomb on September 17; five Americans killed by a suicide bomber in Mosul, Iraq, on April 10.

The deaths are linked to specific locations on maps of Iraq and Afghanistan.  Helmand Province was the deadliest place, with 79 of the 489 deaths in Afghanistan.  In Iraq, Baghdad was the most dangerous place, accounting for 42 of the 141 deaths in that country.  While Americans are the largest number, 112 of the dead in Afghanistan were British troops.

There is a wealth of information in this chart, with four pieces of information on every death, but in some ways there is too much detail.  To get at the numbers I provided above, I had to manually count the pictures.  There are no summary statistics.  The picture grabs our attention and immediately conveys the magnitude of the price the U.S. and our allies are paying in Afghanistan.  But if we want to act on data, we need a little more than just a very clever visual display.  Summaries of the numbers would help here.  It’s useful to know, for example, that 65 of the 141 deaths in Iraq (46%) were due to non-combat related causes, compared to 48 (10%) of the deaths in Afghanistan.  Eighty percent of the fatalities in deadly Helmand province were due to hostile fire; 57% in other parts of Afghanistan were caused by homemade bombs (in Iraq there were 19 deaths, or 13% of the total, from homemade bombs).
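The kind of summary I have in mind takes only a few lines to produce once the counts exist.  The sketch below simply recomputes percentages from my hand counts quoted above; the `share` helper and the dictionary layout are my own illustration, not anything published with the chart.

```python
# Hand-counted figures quoted in this post (approximate; see the disclaimer below).
totals = {"Iraq": 141, "Afghanistan": 489}
non_combat = {"Iraq": 65, "Afghanistan": 48}

def share(part, whole):
    """Return part as a whole-number percentage of whole."""
    return round(100 * part / whole)

for country, total in totals.items():
    count = non_combat[country]
    print(f"{country}: {count} of {total} deaths non-combat related ({share(count, total)}%)")

# Homemade-bomb deaths in Iraq as a share of all deaths there.
print(f"Iraq homemade bombs: {share(19, 141)}%")
```

A small table of figures like these, printed alongside the chart, would let a reader act on the data without counting pictures.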

Two of the creators of this chart, Adriana Lins de Albuquerque (a doctoral student in political science at Columbia) and Alicia Cheng of, produced a slightly different version of this chart summarizing the death toll in Iraq for 2007.  That earlier version did not have as much detail about each individual death (location information is not included, for example) but includes some additional causes, like torture and beheading, that, thankfully, appear to have disappeared.

The advantage of displaying data in this fashion lies in the ability of our brains to form patterns quickly.  The use of color to designate coalition members makes the contributions of our allies apparent in a way that a simple tally might not.  Even without a year-to-year comparison, we can see that Iraq has become, at least for US troops and our allies, a much safer place than Afghanistan.  Additionally, this one chart presents data that, in other forms, might require several PowerPoint slides to communicate: deaths by date, deaths by city or province, deaths by nationality, number killed per incident, and cause of death.

Any complex visual display of data requires making trade-offs.  In this case, for example, the creators arranged the deaths chronologically (oldest first) within each geographic block.  That means that patterns in other variables, such as cause of death or nationality of troops, may be harder to detect on first glance.  The chronological ordering has layout implications, since on some dates there were multiple casualties.

All in all, it’s a great piece of data visualization that to my mind would be even better with the addition of a few summary statistics.

A disclaimer–I counted twice to get each of the numbers I provide above, but I offer no guarantee that I am not off by one or two deaths in any of those numbers.

Copyright 2010 by David G. Bakken.  All rights reserved.

Looking back over the last year in market research offers an opportunity to consider just which transformations, new ideas, industry trends, and emerging techniques might shape MR over the next few years.  Here’s a list of eight topics I’ve been following, with thoughts on the potential impact each might have on MR over the next two or three years.

The debate over the accuracy–and quality–of survey research conducted online is flaring at the moment, at least partly in response to a paper by Yeager, Krosnick, Chang, Javitz, Levendusky, Simpson and Wang: “Comparing the accuracy of RDD telephone surveys and Internet surveys conducted with probability and non-probability samples.”  Gary Langer, director of polling at ABC News, wrote about the paper in his blog “The Numbers” on September 1.  In a nutshell, the paper compares survey results obtained via random-digit dialing (RDD) with those from an Internet panel where panelists were recruited originally by means of RDD and from a number of “opt-in” Internet panels where panelists were “sourced” in a variety of ways.  The results produced by the probability sampling methods are, according to the authors, more accurate than those obtained from the non-probability Internet samples.  You can find a response from Doug Rivers, CEO of YouGov/Polimetrix (and Professor of Political Science at Stanford) at “The Numbers,” as well as some other comments.

The analysis presented in the paper is based on surveys conducted in 2004/5.  In recent years the coverage of the RDD sampling frame has deteriorated as the number of cellphone-only users has increased (to 20% currently).  In response to concerns of several major advertisers about the quality of online panel data, the Advertising Research Foundation (ARF) established an Online Research Quality Council and just this past year conducted new research comparing online panels with RDD telephone samples.  Joel Rubinson, Chief Research Officer of The ARF, has summarized some of the key findings in a blog post.  According to Rubinson, this study reveals no clear pattern of greater accuracy for the RDD sample.  There are, of course, differences in the two studies, both in purpose and method, but it seems that we can no longer assume that RDD samples represent the best benchmark against which to compare all other samples.

Have you heard about the Facebook Gross National Happiness Index?  On Monday, October 12, the Times ran an article (by Noam Cohen) reporting some of the findings based on analysis of two years’ worth of Facebook status updates from 100 million users in the U.S.  The index was created by Adam D. I. Kramer, a doctoral candidate in social psychology at the University of Oregon, and is based on counts of positive and negative words in status updates.  According to the article, classification of words as positive or negative is based on the Linguistic Inquiry and Word Count dictionary.
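For readers who have not seen this kind of word counting, here is a minimal sketch.  The two tiny word lists are invented stand-ins for the Linguistic Inquiry and Word Count dictionary, and the scoring rule (share of positive words minus share of negative words) is my assumption; the article does not give the actual formula behind the index.

```python
import re

# Hypothetical mini-lexicon for illustration only; not the LIWC dictionary.
POSITIVE = {"happy", "great", "love", "awesome"}
NEGATIVE = {"sad", "awful", "hate", "tragic"}

def word_counts(text):
    """Return (positive, negative, total) word counts for one status update."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    return pos, neg, len(tokens)

def daily_score(updates):
    """Assumed scoring rule: (positive words - negative words) / all words for the day."""
    pos = neg = total = 0
    for update in updates:
        p, n, t = word_counts(update)
        pos += p
        neg += n
        total += t
    return (pos - neg) / total if total else 0.0

print(daily_score(["I love Fridays", "so sad today"]))
```

Computing a score like this for each day and comparing across days is all that a simple frequency-based index requires.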

Among the researchers’ conclusions:  we’re happier on Fridays than on Mondays; holidays also make Americans happy.  The premature death of a celebrity may make us sad.  According to a post by Mr. Kramer on the Facebook blog, the two “saddest” days–days with the highest numbers of negative words–were the days on which actor Heath Ledger and pop icon Michael Jackson died.  Mr. Kramer points out that, coincidentally, Mr. Ledger died on the day of the Asian stock market crash, which might have contributed to the degree of negativity.

We’re going to see a lot more of this kind of thing as researchers delve into the rich trove of information generated by users of search engines and web-enabled social networking.  The happiness index, based as it is on simple frequency analysis of words, is the tip of the iceberg.  At the moment, “social media”–I’m not exactly sure what that label means–is getting incredible attention in the marketing and marketing research community.  The question that has yet to be posed, let alone answered, is, “what exactly do we learn from all this information?”
