Social Media Research is Not the Same as Verbatim Coding
Posted by Jeffrey Henning on Tue, Apr 27, 2010

At a few conferences recently, speakers have described social media research as basically the same as survey coding. I understand the analogy -- you have hundreds or thousands of comments scraped from Twitter, blogs and other social media sites that you are manually or automatically structuring and categorizing, just like verbatim responses to surveys. However, the techniques differ pretty dramatically.
Sampling
With a survey, you have set questions you are analyzing, though perhaps you are collecting new responses on a regular basis. Your set of responses is fixed at any point in time when you are doing your analysis. With social media research, you can redefine your search to grab new data, and you can supplement your search with other sources. For instance, for my research into iPad returns, I collected tweets that matched the search "iPad (return OR returned)". I could have gathered more comments using other search terms, such as "returning iPad" or "iPad eBay". In retrospect, it's probably better to cast a wider net to get around any bias introduced by narrow searches.
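As a rough illustration of casting a wider net, the sketch below merges the results of several searches and drops tweets that more than one search returned. It assumes each search yields (tweet_id, text) pairs; the function name, the IDs and the sample tweets are hypothetical, not taken from the iPad research.

```python
def merge_search_results(*result_sets):
    """Merge lists of (tweet_id, text) pairs from several searches.

    A tweet matched by more than one search is kept only once,
    keyed by its ID (the first copy seen wins).
    """
    merged = {}
    for results in result_sets:
        for tweet_id, text in results:
            merged.setdefault(tweet_id, text)
    return merged


# Hypothetical results from a narrow search and a wider one.
narrow = [(1, "Might return my iPad"), (2, "Returned the iPad today")]
wider = [(2, "Returned the iPad today"), (3, "Putting my iPad on eBay")]

corpus = merge_search_results(narrow, wider)
print(len(corpus))  # three distinct tweets
```

Keying on the tweet ID rather than the text keeps two different people who posted identical wording from being collapsed into one.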
In a survey, when a particular verbatim response is confusing or ambiguous you can read the respondent's other responses to determine context. In social media research, you can read other tweets written by the same author and review their blog or social network profile if they've linked to it (a lot of Twitter users list their LinkedIn profile). On an ad hoc basis, you can search for additional information about an individual. For my iPad research, I used Twitter to read the conversations these authors had with other Twitter users, and I searched for each username plus "iPad" to see what else they had to say about the device. I reviewed their blogs and even watched several video posts. Data collection can easily be expanded during analysis.
Data Cleaning
A big problem with social media analysis is the need to clean the data. Take tweets, for instance. You need to filter out the links and news stories people are sharing that happen to use your keywords; you have to remove all the retweets, which add noise. For instance, one of the most retweeted items for iPad returns was about single tasking, yet no other user mentioned this as the reason for their return.
Filters will get you down to more basic conversation, which still needs to be screened by hand. Some comments might be spam messages or from shills (I encountered a few spam iPad messages pushing apps). Others might be false hits: quite a few people were spending their U.S. income-tax "return" (refund) on an iPad; others were discussing how Guy Kawasaki's iPad was returned to him after he left it on a plane.
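A minimal sketch of that first filtering pass might look like the following. The patterns are assumptions based on the examples above (shared links, retweets marked "RT", the tax "return" and "was returned to" false hits); a real project would need its own list, and spam and shill messages would still have to be reviewed by hand.

```python
import re

# Assumed patterns, for illustration only.
URL_RE = re.compile(r"https?://\S+")
RETWEET_RE = re.compile(r"^\s*RT\b|\bRT @\w+")
FALSE_HIT_RE = re.compile(r"\btax\s+return\b|\bwas returned to\b", re.IGNORECASE)


def is_conversational(text):
    """Return True if a tweet survives the basic automated filters."""
    if URL_RE.search(text):        # shared link or news story
        return False
    if RETWEET_RE.search(text):    # retweet of someone else's comment
        return False
    if FALSE_HIT_RE.search(text):  # keyword match with the wrong meaning
        return False
    return True


def clean(tweets):
    """Keep only the tweets that pass every filter."""
    return [t for t in tweets if is_conversational(t)]


# Hypothetical sample tweets.
sample = [
    "RT @someone: iPad returns are up",
    "Read this: http://example.com iPad return rates",
    "Spending my tax return on an iPad",
    "Thinking I will return my iPad, too heavy",
]
kept = clean(sample)
```

Only the last sample tweet passes all three filters; the others are dropped as a retweet, a shared link and a false hit respectively.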
Aggregating by User
Net all of this out, and you have your corpus that you can analyze - almost. For Twitter, you will find that you might have many tweets from a few users. You will want to sort the tweets by user, and--when you categorize answers--categorize them by user name so that you are not double- or triple-counting occurrences of themes; to go back to the survey coding analogy, it's as if your respondents took your survey multiple times. If you are using blogs as your primary source material, use the blog name. In both cases, individuals may have more than one blog or Twitter account.
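One way to avoid that double-counting is to tally distinct users per theme rather than tweets per theme. The sketch below assumes the tweets have already been coded into (username, theme) pairs; the usernames and themes are made up, and lower-casing the names assumes that usernames differing only in case belong to the same account.

```python
from collections import defaultdict


def count_themes_by_user(coded_tweets):
    """Count how many distinct users mentioned each theme.

    coded_tweets is an iterable of (username, theme) pairs.
    """
    users_by_theme = defaultdict(set)
    for username, theme in coded_tweets:
        # A set ignores repeat mentions by the same user.
        users_by_theme[theme].add(username.lower())
    return {theme: len(users) for theme, users in users_by_theme.items()}


# Hypothetical coded tweets: alice mentions the same theme twice.
coded = [
    ("alice", "too heavy"),
    ("alice", "too heavy"),
    ("Bob", "too heavy"),
    ("bob", "no camera"),
]
counts = count_themes_by_user(coded)
```

Here "too heavy" is counted for two users rather than three tweets. This does not catch one person posting from several accounts or blogs; those still have to be merged by hand.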
Interestingly, I learnt that a private user of Twitter - someone with "protected tweets" - was thinking of returning their iPad, based upon reading replies from a user whose tweets were public (it was a bit like hearing only one side of a telephone conversation).
Categorizing/Coding
The actual categorizing and tallying of the common themes is then similar to verbatim coding, but it took a lot of work to get here and there's more work to be done. When coding verbatim responses, you know the question that respondents were asked, and this narrows the context. With social media analysis, people are talking about whatever they want to talk about, to an audience that certainly wasn't conceived of with you in mind. As a result, the responses can be almost meaningless without that context. For the iPad returns, it became clear that some cases were about returning a "borrowed" iPad that people were temporarily playing with. Others were ambiguous - "dang it! Have to return the iPad. Rich wants outfits too lol" - this sounded too upbeat to be an actual product return, but perhaps the iPad refund money would be used to buy clothing.
At the end of the day, coding verbatim responses is just a small subset of the effort involved in analyzing social-media comments.