By Ann Ray
On July 4, 2006
Updated December 16, 2013

Dealing with data cleaning

"What do you mean I have to clean the data? I used the Web so I wouldn't have to do any data entry."

While Web and scannable surveys will minimize data entry costs, some degree of data cleaning is required for all projects. Data cleaning has two elements: checking that the forms submitted are of good quality, and coding typed/written responses so they can be analyzed most effectively.

As with all aspects of a survey, you may need to make some trade-offs on your projects. I recently spoke to someone who was analyzing 20,000 responses per week on one survey, so optimizing all the comments she received wasn't feasible. However, if you have a few hundred responses to a one-time or quarterly study, it will probably make sense to invest a bit more on a per-form basis. Likewise with high stakes surveys, such as employee feedback, you'll want to make sure you get the most from the data.

Checking form quality

Partial forms
Do you want to include partially completed surveys in your analysis? In some cases the demographics (generally at the end of the survey) will be critical data so a partial form has minimal use. Or you may consider a partial form to be one discarded by the respondent, and therefore not one officially "submitted" for the survey. Or you may look at a partial form and, while incomplete, see valuable data in what was finished.

Garbage submissions
Sometimes people go through surveys leaving every question blank or marking straight down one column on a rating grid. These are easily spotted by a human scanning record-by-record through the database, but masked by any aggregate analysis.

Logical consistency of answers
This is the most expensive quality check, as it requires not a quick visual scan of a form but actually reading and processing their responses. For example, what if someone marks that they've never use a particular product, but later in the questionnaire they give it a poor rating? When dealing with Web or phone surveys, use skips to minimize these conflicts. Apart from reducing data cleaning costs, you want to prevent this type of problem because it's rarely possible to repair the inconsistency—instead you have to decide whether to keep it in the results, exclude the affected responses, or exclude the entire form.

Cleaning up individual questions

Dates, currency, phone numbers, e-mails, etc.
With keyed data entry, this type of clean-up is done automatically by the data entry clerk. For Web surveys, the ideal approach is to tell the respondent the required format and enforce it at the time of collection so they can fix any typos. Try to balance their ease with your preferred format. For example, if you're asking for their serial number, can you code the field to ignore hyphens, spaces and case when doing the validation check? Likewise, make sure that you're covering all possibilities, such as an extension on a phone number or really looong government e-mail address with lots of dots.

Short typed/written answers
An astonishing number of people mis-type their home city. Asking for a product name will generate even more flavors. For prevention, consider whether the question can use a fixed scale with Other blank instead of just a type-in blank. For cleanup, you'll want an application which will do sorts of answers both alphabetically and by frequency (so you can spot the problem cases), and then mass replaces to consolidate "Bellevue," "Belevue," "BELLEVUE," and "bellevue."

Comments
With comments (AKA verbatims), you really want to preserve the original respondent's phrasing because their tone is as much a part of the feedback as the situation they describe. However, 50 pages of comments are of limited use for decision-making. Imagine you have a team of eight people who are making the business decision based on the survey results. Even if all eight had the time to review 50 pages of comments, they would all come out with different interpretations. Humans tend to put excess emphasis on data we see first and last, and on information which is negative or reinforces our existing beliefs.

Therefore, the ideal approach with comments is to pair each one with a fixed scale or tabulated written response that encodes the comment. For example, a comment:

"The service rep Jane was great, but I wouldn't have to call so much if the product weren't garbage."

might be encoded with two values "Service compliment" and "Product quality complaint." Once you have the responses coded, you can use this to count up the number of quality complaints vs. quality compliments, which gives us something much less subjective than our individual interpretations of the comments. Then for the verbatim responses, you can use the coding to group answers by topics so someone can review the 10 pages of quality comments all together.

As you can guess, this postcoding (research jargon) is very time consuming. It requires either the survey analyst do the coding themselves, or that they create the code and carefully train database clerks on how to deal with square pegs. With high-volume surveys, human review of every answer may not be possible. In that case, regularly check the comments to see if there are recurring themes that can be captured in fixed scale questions instead, and consider keyword searches rather than manual coding.

Filed under:
Data Handling
Reporting

Tagged with:
Comments
Data cleaning
Data validation
Import & export
Open-ended questions
Postcoding
Verbatims

Need a Hand?

A little help can add a lot of polish—or just save hours and headaches:

(206) 399-2344
aray@querygroup.com

Ann has been a great resource for us. She responds timely, has good suggestions, and really knows her craft. Ann is our primary resource when it comes to major surveys and reports.

Barry L. Brown
President
Effective Resources

Dealing with data cleaning

Checking form quality

Cleaning up individual questions

Related Articles

Need a Hand?

Recent Articles