Data Cleaning: Outliers & Erroneous Inliers
Posted by Jeffrey Henning on Sun, Jan 17, 2010
Last summer I wrote a Data Cleaning blog post summarizing the paper "Data Cleaning: Detecting, Diagnosing, and Editing Data Abnormalities", published in PLoS Medicine in 2005. For the article, Giovanni Maki created a superb illustration:

This graphic nicely displays the types of data quality issues to be on the lookout for:
- Impossible values - Outliers such as the household that claims to have 50 children or the respondent born in the eighteenth century! You can usually anticipate impossible values and set up validation checks on web surveys so that the respondent can fix their own mistakes.
- Suspect values - Outliers such as the household with 19 children or the centenarian taking your survey. They're out there, but are they really answering your survey? Best to review such respondents' other answers to see if they are consistent with the suspect values.
- Erroneous inliers - Values given by mistake but within the accepted range: in web surveys, typos; in telephone surveys, transcription errors; in any type of survey, mental mistakes. "My favorite color?" "Blue-no, yellow! Auuuuuuuugh." As with suspect values, you will need to review the respondent's other answers to see if they are consistent. Make sure to set up contextual validation, based on earlier answers; for instance, preventing the respondent from entering more children living in the household than the total number of people in the household. It is difficult to correct more than a few erroneous inliers.
- Missing values - Answers omitted by the respondent because they didn't feel like answering or because skip patterns routed them past the question based on how they answered earlier questions. In the latter case, the missing answer is the correct result; in the former case, answers could be imputed, but you will want to retain the services of a skilled statistician to perform the imputation. Use required-answers validation sparingly to reduce missing values; overuse will lead to a higher abandonment rate.
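As a rough illustration, the four checks above could be sketched as a small screening function in Python. This is only a sketch under assumptions: the field names (`children`, `household_size`, `birth_year`, `asked_children`) and all of the thresholds are hypothetical and would need to be adapted to your own questionnaire and population. Note that a range check cannot catch an erroneous inlier by itself; only the contextual check against an earlier answer has a chance of doing so.

```python
from typing import List

# Hypothetical thresholds, chosen only for illustration.
MAX_CHILDREN = 25       # above this, treat the value as impossible
SUSPECT_CHILDREN = 15   # at or above this, flag for manual review
MAX_AGE = 120           # above this, treat the value as impossible
SUSPECT_AGE = 100       # at or above this, flag for manual review


def screen_response(resp: dict, survey_year: int = 2010) -> List[str]:
    """Return a list of data-quality flags for one survey response."""
    flags = []
    children = resp.get("children")
    household = resp.get("household_size")
    birth_year = resp.get("birth_year")
    # Assumed field: False when a skip pattern routed past the question.
    asked_children = resp.get("asked_children", True)

    if children is None:
        # Missing is only a problem if the question was actually shown.
        if asked_children:
            flags.append("missing: children")
    else:
        if children > MAX_CHILDREN:
            flags.append("impossible: children")
        elif children >= SUSPECT_CHILDREN:
            flags.append("suspect: children")
        # Contextual check against an earlier answer.
        if household is not None and children > household:
            flags.append("inconsistent: children exceeds household_size")

    if birth_year is not None:
        age = survey_year - birth_year
        if age < 0 or age > MAX_AGE:
            flags.append("impossible: age")
        elif age >= SUSPECT_AGE:
            flags.append("suspect: age")

    return flags
```

For example, `screen_response({"children": 3, "household_size": 2, "birth_year": 1980})` returns a single flag for the inconsistency, while a response with `"children": None` and `"asked_children": False` returns no flags, because the missing answer is the correct result of the skip pattern. Flagged responses still need a human to review the respondent's other answers before anything is edited or removed.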
Because survey software instantly charts survey data in real time, you can easily skip the data cleaning phase and begin presenting the results. That would be a mistake. Before you begin charting the results, make sure to spend the time on data cleaning: screening, diagnosis and treatment.