Survey Software, Web Survey, Online Surveys, and Enterprise Feedback Management solutions from Vovici

 

Data Cleaning

 

Early in my career I learnt survey data cleaning firsthand from Jo Ann De Clercq, who also taught me how to code responses to open-ended questions. Back then, we had a body of practices that we used from study to study, but no formal documentation of those practices.

The paper "Data Cleaning: Detecting, Diagnosing, and Editing Data Abnormalities", published in PLoS Medicine in 2005, provided the first systematic review of data cleaning. The authors offer this bit of background:
 
Data cleaning is emblematic of the historical lower status of data quality issues and has long been viewed as a suspect activity, bordering on data manipulation. Armitage and Berry almost apologized for inserting a short chapter on data editing in their standard textbook on statistics in medical research. Nowadays, whenever discussing data cleaning, it is still felt to be appropriate to start by saying that data cleaning can never be a cure for poor study design or study conduct. Concerns about where to draw the line between data manipulation and responsible data editing are legitimate. Yet all studies, no matter how well designed and implemented, have to deal with errors from various sources and their effects on study results.

The authors outline a data-cleaning process with three steps: Screening Phase, systematically looking for problems with the data; Diagnostic Phase, identifying the condition of the suspect data; and Treatment Phase, deleting or editing the data or leaving it as is.

Data Cleaning Process 

Screening Phase

Examine data for five different kinds of possible errors:
  1. Lack of data – Do some questions have far fewer answers than surrounding questions?
  2. Excess of data – Are there duplicate responses?
  3. Outliers/inconsistencies – Are there values that are so far beyond the typical that they seem potentially erroneous?
  4. Strange patterns – Are there patterns that imply cheating rather than honest answers?  For instance, does a respondent alternate between ratings of 4 and 5 on every other topic in a matrix question?
  5. Suspect analysis results – Do the answers to some questions seem counterintuitive or extremely unlikely?
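
The first four screening checks lend themselves to simple automation. What follows is only a minimal sketch in Python, not the paper's method: it assumes each response is a dictionary, that `None` marks an unanswered question, and that field names such as `respondent_id` are hypothetical. The outlier rule (distance from the median measured in median absolute deviations, with cutoff `k`) is one common choice among many. The fifth check, suspect analysis results, still requires human judgment.

```python
from collections import Counter
from statistics import median


def answer_counts(responses, questions):
    """Lack of data: number of non-missing answers per question."""
    return {q: sum(1 for r in responses if r.get(q) is not None)
            for q in questions}


def duplicate_ids(responses, id_field="respondent_id"):
    """Excess of data: ids that occur in more than one record."""
    counts = Counter(r[id_field] for r in responses)
    return sorted(i for i, n in counts.items() if n > 1)


def outlier_ids(responses, question, k=5.0, id_field="respondent_id"):
    """Outliers: ids whose answer lies more than k MADs from the median."""
    answered = [(r[id_field], r[question]) for r in responses
                if r.get(question) is not None]
    if not answered:
        return []
    med = median(v for _, v in answered)
    mad = median(abs(v - med) for _, v in answered)
    if mad == 0:  # no spread to measure against, so flag nothing
        return []
    return [rid for rid, v in answered if abs(v - med) > k * mad]


def alternates(ratings):
    """Strange patterns: True if ratings strictly alternate between two values."""
    return (len(ratings) >= 4
            and ratings[0] != ratings[1]
            and all(ratings[i] == ratings[i - 2]
                    for i in range(2, len(ratings))))
```

Comparing `answer_counts` for neighboring questions shows where answers drop off sharply, and `alternates([4, 5, 4, 5, 4])` flags the matrix-question pattern described above. Anything these functions flag is only a candidate for the next phase, not a verdict.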

Diagnostic Phase

From the Screening Phase you have highlighted data that needs investigation. To clarify suspect data, you often must review all of a respondent’s answers to determine if the data makes sense taken in context. Sometimes you must review a cross-section of different respondents’ answers, to identify issues such as a skip pattern that was specified incorrectly. 

With this research complete, what is the true nature of the data that you’ve highlighted?  The authors give five possible diagnoses:
  1. Missing data – Answers omitted by the respondent or questions skipped over 
  2. Errors – Typos or answers that indicate the question was misunderstood
  3. True extreme – An answer that seems high but can be justified by other answers (e.g., the respondent working 100 hours a week because they work a full-time job and two part-time jobs)
  4. True normal – A valid answer 
  5. No diagnosis, still suspect – The jury is still out on this “idiopathic” data. When it comes time for the Treatment Phase, you may need to make a judgment call on how to treat it.
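
To make the diagnoses concrete, here is a small Python sketch built around the hours-worked example above. It is an illustration under stated assumptions rather than a general diagnostic tool: the field names `hours_per_week` and `hours_by_job` and the `typical_max` threshold are hypothetical, and in practice the corroborating evidence usually has to be read by a person.

```python
HOURS_IN_A_WEEK = 7 * 24  # 168: no honest answer can exceed this


def diagnose_hours(record, typical_max=60):
    """Classify a reported weekly-hours answer using the respondent's other answers."""
    total = record.get("hours_per_week")
    if total is None:
        return "missing data"
    if total > HOURS_IN_A_WEEK:
        return "error"  # impossible value, most likely a typo
    if total <= typical_max:
        return "true normal"
    # High but possible: look for corroboration elsewhere in the same record.
    job_hours = record.get("hours_by_job") or []
    if job_hours and sum(job_hours) == total:
        return "true extreme"
    return "no diagnosis, still suspect"
```

A respondent reporting 100 hours alongside per-job hours of 40, 30 and 30 comes back as a true extreme, while the same 100 hours with no supporting detail stays suspect.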

Treatment Phase

You’ve screened the data and tried to come to a verdict on whether suspect data is guilty or innocent. You have three choices for what to do with suspect data:
  1. Leave it unchanged – The most conservative course of action is to accept this data as a valid response and make no change to it. The larger your sample size, the less that one suspect response will affect the analysis; the smaller your sample size, the more difficult the decision.
  2. Correct the data – If the respondent’s original intent can be determined, then I am in favor of fixing their answer.  For instance, perhaps it is clear from the respondent’s explanation for their ratings that they reversed the scale in their minds; you can invert each of their answers to this question to correct the issue. Some statisticians will argue for imputation, replacing the answers with imputed values, such as the mean for that variable, but the techniques for imputation can become quite elaborate and are best left to professional statisticians.
  3. Delete the data – The data seems illogical and the value is so far from the norm that it will affect descriptive or inferential statistics. What to do? Delete just this response or delete the entire record? Whenever you begin to toss out data, it raises the possibility that you are “cherry picking” the data to get the answer you want.

However you choose to treat the data, make sure to document in your survey report what steps you took, how many responses were affected and for which questions.
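
As one possible way to correct a reversed scale while keeping the audit trail that the report needs, consider the Python sketch below. It assumes a 1-to-5 rating scale and hypothetical field names; the original record is left untouched, and every change is appended to a log that can be summarized by question.

```python
from collections import Counter


def reverse_scale(record, questions, scale_min=1, scale_max=5, log=None):
    """Return a corrected copy of record with the given ratings inverted."""
    corrected = dict(record)  # never overwrite the original answers
    for q in questions:
        old = record.get(q)
        if old is None:  # nothing to invert for a skipped question
            continue
        new = scale_min + scale_max - old
        corrected[q] = new
        if log is not None:
            log.append({
                "respondent_id": record.get("respondent_id"),
                "question": q,
                "old": old,
                "new": new,
                "action": "reversed scale",
            })
    return corrected


def changes_by_question(log):
    """Count logged changes per question, for the survey report."""
    return dict(Counter(entry["question"] for entry in log))
```

Keeping both the original and corrected records means a skeptical reader can rerun the analysis either way and see how much the treatment mattered.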

Conclusion

Data cleaning is time-consuming, troublesome and potentially contentious. Many issues, however, can be avoided by setting up data validation during survey design. For instance, for a recent age question, I looked up the age of the oldest person alive and used that as my boundary condition. I did double-check the results to make sure I didn’t have a surfeit of centenarians answering the survey, but this validation kept anyone from entering 1697 as their birth year (as a typo for 1967); had they done so, the survey would have immediately alerted them to the fact. Far better to let respondents catch and fix their own mistakes than to have to do it for them.
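
A boundary check of this kind might look like the following Python sketch. The function name, the messages and the parameters are all hypothetical, and `max_age` is deliberately left for the survey designer to supply rather than hard-coded.

```python
def validate_birth_year(text, survey_year, max_age):
    """Return (year, None) if accepted, else (None, message for the respondent)."""
    try:
        year = int(text.strip())
    except ValueError:
        return None, "Please enter your birth year as four digits, e.g. 1967."
    if year > survey_year:
        return None, "Birth year cannot be in the future."
    if survey_year - year > max_age:
        earliest = survey_year - max_age
        return None, f"Birth year must be {earliest} or later."
    return year, None
```

With a purely illustrative bound of `max_age=120` and `survey_year=2010`, the typo `"1697"` is rejected with a prompt to re-enter, while `"1967"` is accepted.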

When I started my career, if an answer seemed wrong, we could pull a paper questionnaire to see if a data entry mistake had been made. On the other hand, if it was for a telephone survey, we often had to interpret the response, where the interviewer had clearly misheard the name of a vendor or the acronym used to describe a technology. Web surveys give you the opportunity to prevent many types of errors from being recorded—many, but not all. Cleaning data is never pretty, but it’s an important step that should be taken for any strategic survey.

Comments

Very well explained. 
 
Basic data are the most important point of data analysis but often neglected. 
 
Proper methods should be used to get clean, accurate data with minimum missing values. 
 
Anil Arekar
Posted @ Friday, October 01, 2010 4:26 AM by Anil Arekar
Data cleaning is the most important part of statistical analysis which is usually not addressed. Data cleaning methods should be included in the basic statistical books. 
 
Author has described all aspects about data screening before statistical analysis. 
 
Anil Arekar
Posted @ Wednesday, June 15, 2011 7:57 AM by Anil Arekar
I am a research scholar in the subject of statistics. I found this elaboration very much useful. Thanks a lot
Posted @ Friday, June 17, 2011 11:02 PM by Ahmad Ali