Survey Software, Web Survey, Online Surveys, and Enterprise Feedback Management solutions from Vovici


Data Cleaning: Outliers & Erroneous Inliers

 

Last summer I wrote a Data Cleaning blog post summarizing the paper "Data Cleaning: Detecting, Diagnosing, and Editing Data Abnormalities", published in PLoS Medicine in 2005. For the article, Giovanni Maki created a superb illustration:

[Illustration: outliers & erroneous inliers]

This graphic nicely displays the types of data quality issues to be on the lookout for:

  • Impossible values - Outliers such as the household that claims to have 50 children or the respondent born in the eighteenth century! You can usually anticipate impossible values and set up validation checks on web surveys so that the respondent can fix their own mistakes.
  • Suspect values - Outliers such as the household with 19 children or the centenarian taking your survey. They're out there, but are they really answering your survey? Best to review such respondents' other answers to see if they are consistent with the suspect values.
  • Erroneous inliers - Values given by mistake but within the accepted range: in web surveys, typos; in telephone surveys, transcription errors; in any type of survey, mental mistakes. "My favorite color?" "Blue. No, yellow! Auuuuuuuugh." As with suspect values, you will need to review the respondent's other answers to see if they are consistent. To prevent some of these errors up front, set up contextual validation based on earlier answers; for instance, prevent the respondent from entering more children living in the household than the total number of people in the household. Once the data is collected, it is difficult to find and correct more than a few erroneous inliers.
  • Missing values - Answers omitted by the respondent because they didn't feel like answering or because skip patterns routed them past the question based on how they answered earlier questions. In the latter case, the missing answer is the correct result; in the former case, answers could be imputed, but you will want to retain the services of a skilled statistician to perform the imputation. Use required-answers validation sparingly to reduce missing values; overuse will lead to a higher abandonment rate.
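
The four categories above can be turned into a simple screening pass. What follows is only a minimal sketch in Python, not a feature of any particular survey package: the function name `screen_children`, the field names, and both thresholds are assumptions made up for illustration, and you would tune them to your own questionnaire.

```python
from typing import Optional

# Hypothetical thresholds; adjust them to your own survey population.
MAX_PLAUSIBLE_CHILDREN = 10   # above this, treat the answer as suspect
MAX_POSSIBLE_CHILDREN = 30    # above this, treat the answer as impossible


def screen_children(children: Optional[int],
                    household_size: Optional[int],
                    was_skipped: bool = False) -> str:
    """Classify one answer to a hypothetical 'children in household' question."""
    if children is None:
        # A skip pattern makes the blank correct; otherwise it is truly missing.
        return "skipped" if was_skipped else "missing"
    if children < 0 or children > MAX_POSSIBLE_CHILDREN:
        return "impossible"
    # Contextual check: children cannot outnumber the people in the household.
    if household_size is not None and children > household_size:
        return "inconsistent"
    if children > MAX_PLAUSIBLE_CHILDREN:
        return "suspect"
    return "ok"


# Made-up responses, one per category.
responses = [
    {"id": 1, "children": 2, "household_size": 4},
    {"id": 2, "children": 50, "household_size": 52},
    {"id": 3, "children": 19, "household_size": 21},
    {"id": 4, "children": 5, "household_size": 3},
    {"id": 5, "children": None, "household_size": 2},
]

flags = {r["id"]: screen_children(r["children"], r["household_size"])
         for r in responses}
print(flags)
```

A check like this only catches the erroneous inliers that contradict another answer. A typo that stays within range and agrees with the rest of the response will still come back as "ok", which is why reviewing a respondent's other answers remains necessary.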

Because survey software charts survey data in real time, it is tempting to skip the data cleaning phase and begin presenting the results right away. That would be a mistake. Before you begin charting the results, spend the time on data cleaning: screening, diagnosis and treatment.

Comments

Jeffrey, you are a master of restraint and professionalism.

My visceral reaction would have been to title this post as "In-Liars and Outright Liars"

(;>)
Posted @ Monday, January 18, 2010 10:25 AM by CT