Query Bias in Social Media Research
Posted by Vovici Blog on Thu, Jan 20, 2011
One of the difficulties in conducting social media market research is the sheer quantity of information available. The traditional way to weed through it is to use search queries. For instance, if you were interested in what people thought of the video game Sid Meier's Civilization V, you would just search on “Civilization V”, right?
But what about all those people who typed “Civilization 5”? Or “Civ V” or “Civ 5”? Or used the hashtags “#CivV” or “#Civ5” on Twitter? (Hashtags are a convention for marking a tweet as being about a specific subject.) Heck, some people might be writing about “the new Civ” or “the new Civilization” or “the new Firaxis game” or “the new Sid Meier game”. Where do you draw the line? Does it really matter?
Sadly, it matters a tremendous amount.
To demonstrate this, I downloaded 100 comments from Twitter for six variations of “Civilization V” and did a manual sentiment analysis, scoring positive sentiment a +1, negative sentiment a -1 and mixed or neutral sentiment a 0. Sentiment varied widely by search query, with only “#CivV” having negative average sentiment. If someone used that as their only query, they’d be missing much of the picture.
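For anyone who wants to repeat the tally, here is a minimal sketch of the averaging step in Python. It is not the spreadsheet or script behind the numbers above; the function name `average_sentiment` and the scores in `example_scores` are made up purely for illustration.

```python
from statistics import mean

# Manual sentiment codes, following the +1 / 0 / -1 scheme described above.
POSITIVE, NEUTRAL, NEGATIVE = 1, 0, -1


def average_sentiment(scores_by_query):
    """Return the mean sentiment score for each query term.

    Query terms with no scored tweets are skipped rather than reported as 0,
    so "no data" is not mistaken for "neutral".
    """
    return {
        query: mean(scores)
        for query, scores in scores_by_query.items()
        if scores
    }


# Placeholder scores for illustration only; NOT the data from this post.
example_scores = {
    "Civilization V": [POSITIVE, POSITIVE, NEUTRAL, NEGATIVE],
    "#CivV": [NEGATIVE, NEGATIVE, NEUTRAL, POSITIVE],
}

averages = average_sentiment(example_scores)
for query, score in averages.items():
    print(f"{query}: {score:+.2f}")
```

Keeping the scores grouped by query term, rather than pooled, is what lets you see whether one term is pulling the overall average up or down.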

Why the wide variance? Well, the types of tweets returned differ with the formality of the language in the query. The following chart shows the percent of tweets by query term that were “Newsy” (i.e., they contained a link) and the percent that were “Conversational” (i.e., they were replies to another Twitter user).
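The two definitions above are simple enough to sketch in code. The version below is a rough approximation that works on the tweet text alone: it assumes a link shows up as an `http://` or `https://` URL and that a reply starts with an @username. Both rules, the function names and the sample tweets are my assumptions for illustration, not how the chart was actually produced.

```python
import re

# Assumption: a "newsy" tweet contains an http:// or https:// URL.
LINK_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)
# Assumption: a "conversational" tweet (a reply) starts with an @username.
REPLY_PATTERN = re.compile(r"^\s*@\w+")


def is_newsy(text):
    """True if the tweet text contains a link."""
    return LINK_PATTERN.search(text) is not None


def is_conversational(text):
    """True if the tweet text opens with an @mention, i.e. looks like a reply."""
    return REPLY_PATTERN.match(text) is not None


def tweet_type_percentages(tweets):
    """Return the percent of tweets that are newsy and that are conversational."""
    total = len(tweets)
    if total == 0:
        return {"newsy": 0.0, "conversational": 0.0}
    newsy = sum(is_newsy(t) for t in tweets)
    conversational = sum(is_conversational(t) for t in tweets)
    return {
        "newsy": 100.0 * newsy / total,
        "conversational": 100.0 * conversational / total,
    }


# Made-up tweets for illustration only.
sample_tweets = [
    "Civilization V review: one more turn http://example.com/review",
    "@friend are you playing Civ 5 tonight?",
    "@friend read this http://example.com/preview",
    "One more turn...",
]

percentages = tweet_type_percentages(sample_tweets)
print(percentages)
```

Note that the two categories are not mutually exclusive: a reply that includes a link counts toward both percentages.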

The phrase “Civilization V” was newsy and unconversational, reflecting its level of formality; sadly, this is the term you first think to query on, since it is closest to the full product name. Such a search returns tweets that echo the headlines of game reviews and link to those reviews. The phrase “Civilization 5” was less newsy and more conversational, since it’s a slight departure from the formal name. The “Civ” abbreviations were much more conversational and much less likely to contain links.
As Andrew Jeavons points out, there were 25 billion tweets on Twitter last year – there’s no way to read them all and see which apply to the subject you are researching. You have to use queries to narrow your search. Just be aware that your choice of query terms introduces its own bias.
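One way to keep that bias visible is to record which query variant each tweet matched, so results can be compared term by term instead of being pooled. Here is a minimal sketch; the list `QUERY_VARIANTS` is taken from the spellings mentioned earlier in this post, and the function names and matching rule (case-insensitive, whole term only) are assumptions for illustration.

```python
import re

# Spellings mentioned earlier in this post; extend the list for your own subject.
QUERY_VARIANTS = [
    "Civilization V",
    "Civilization 5",
    "Civ V",
    "Civ 5",
    "#CivV",
    "#Civ5",
]


def variant_pattern(variant):
    """Build a case-insensitive pattern that matches the variant as a whole term.

    The lookarounds stop "Civilization V" from matching inside
    "Civilization VI", and also work for hashtags, where \\b would not.
    """
    return re.compile(r"(?<!\w)" + re.escape(variant) + r"(?!\w)", re.IGNORECASE)


def matching_variants(text, variants=QUERY_VARIANTS):
    """Return the variants that appear in the text, in list order."""
    return [v for v in variants if variant_pattern(v).search(text)]


# Made-up tweet for illustration only.
matched = matching_variants("Loving Civ 5 so far #civ5")
print(matched)
```

Tagging each tweet this way makes it easy to report sentiment per variant and to spot when a single term is dominating the sample.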