Diving into the Deep Web for Social Media Research
Posted by Jeffrey Henning on Thu, Oct 21, 2010
When Ray Poynter and I each started blogging about social media research ethics, we didn’t get much response. And when I presented to the CASRO Tech Conference the results of a study I had done on consumer attitudes towards social media research, quite a few people took me aside afterwards to say it was too early to be worrying about such issues. The CASRO leaders themselves did not share that view: they invited me to participate in the CASRO Social Media Task Force.
So the timing couldn’t have been better for me to kick off last week’s Social Media Town Hall at the CASRO Annual Conference. The day before the town hall, the Wall Street Journal reported that BuzzMetrics had scraped conversations from a private healthcare community, PatientsLikeMe.com, and integrated member posts into their research database. To do this, BuzzMetrics staff had created member accounts and then had their web crawler log into those accounts to scrape forum posts. Among other things, this violated the PatientsLikeMe.com terms of service.
The PatientsLikeMe forum pages are part of the Deep Web, content that is not visible to standard search engines. I just checked both Google and Bing, neither of which indexes PatientsLikeMe forum pages (though they index plenty of other pages from the site). Wikipedia lists seven types of content that make up the Deep Web:
- Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form, especially if open-domain input elements (such as text fields) are used; such fields are hard to navigate without domain knowledge
- Unlinked content: pages which are not linked to by other pages…
- Private Web: sites that require registration and login (password-protected resources)
- Contextual Web: pages with content varying for different access contexts (e.g., ranges of client IP addresses or previous navigation sequence)
- Limited access content: sites that limit access to their pages in a technical way (e.g., using the Robots Exclusion Standard or CAPTCHAs…)
- Scripted content: pages that are only accessible through links produced by JavaScript … or Flash…
- Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines
During the CASRO town hall, one participant said that his company has also mined private forums for clients, in cases where the terms of service permit such mining. So to my collection of questions that social media researchers disagree on, we can now add:
What types of Deep Web content, if any, is it acceptable to mine for social media research?
Perhaps you think that there’s enough information in the “Shallow Web” that you don’t have to go diving for pearls of wisdom in the Deep Web? Well, you simply haven’t yet had to research issues that are not being actively talked about on the public web. Try educator’s insurance, for instance. And I bet if you needed to research patients diagnosed with Progressive Supranuclear Palsy, you’d be tempted to log in to PatientsLikeMe as well.
And if you have firm opinions about what Deep Web resources are acceptable, you should make sure that the web crawling service you are using, or that your vendor is using, matches your guidelines. For instance, good luck trying to be compliant with each site’s Terms of Service (TOS). Arguably every crawling service ignores the terms of many sites, because there is no electronic standard TOS intended for parsing by web crawlers; humans would have to verify compliance, and no firm staffs to do that for any but the largest web sites that they scrape.
One electronic standard that does exist, robots.txt (the Robots Exclusion Standard), is typically used to restrict indexing of pages that are otherwise public. Perhaps community sites should add their forum pages to their robots.txt file as well, to make it clear that such pages are not to be indexed. PatientsLikeMe only restricted access to two pages, neither of which was a forum page.
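To make the robots.txt idea concrete, here is a minimal sketch of how a crawler could check a URL against a site's exclusion rules before fetching it, using Python's standard `urllib.robotparser` module. The robots.txt contents, the `community.example.com` URLs, the `ExampleResearchBot` user agent, and the helper names are all hypothetical, made up for illustration; they are not PatientsLikeMe's actual file or any vendor's actual crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary community site that
# asks all crawlers to stay out of its forum and member pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /forum/
Disallow: /members/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# Hypothetical crawler name and URLs, for illustration only.
forum_allowed = may_crawl(
    parser, "ExampleResearchBot",
    "https://community.example.com/forum/thread/123",
)
about_allowed = may_crawl(
    parser, "ExampleResearchBot",
    "https://community.example.com/about",
)

print("forum page allowed:", forum_allowed)
print("about page allowed:", about_allowed)
```

Under these assumed rules the forum URL is refused and the about page is allowed. Note that robots.txt is only a request that well-behaved crawlers choose to honor; it does not encode a site's Terms of Service, so a check like this complements a human review of the terms rather than replacing it.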
Happily, social media research ethics is no longer something that just a few of us are interested in. The CASRO conference was abuzz with discussion of “Scrapegate”, as Andrew Jerina termed it.
As a member of the CASRO Social Media Task Force, I am eager to discuss your concerns about how we best shape the future of social media research. Feel free to contact me privately (email jhenning, at vovici.com) or comment anonymously below.