Dirty data (also known as rogue data) is inaccurate, incomplete or erroneous data (Chu, 2004). Dirty data degrades the overall quality of a data set collected via a survey and can have serious consequences for any decisions based on that data set. The quality of the data set can be improved by identifying and removing dirty data through data cleaning.
Data cleaning (also known as data scrubbing) is the process of detecting and removing errors and inconsistencies from the data set to improve the quality of the data collected (Rahm & Do, 2000). This process involves detailed data analysis using data analysis tools, software programs and a manual inspection of the data to identify missing data, duplicates and other invalid data (Rahm & Do, 2000). So how does one identify dirty data, and once it has been identified, what does one do with it?
There is a range of data cleaning programs available, and many survey tools have an inbuilt data cleaning feature that uses algorithms to identify speeders and to assess the quality of responses. Speeders are respondents who finish a survey unusually quickly given the survey design and the average response time of others, and as such are outliers. Speeders are less likely to read each question properly and reflect on their answers, which results in poor quality data. Algorithms are also used to flag potentially poor quality answers by looking for things like straight-lining or patterns in radio button and checkbox questions, or respondents who have selected every checkbox. For open text responses, data cleaning tools often search for one-word answers or gibberish and flag this data. It is up to the researcher to decide what to do with the flagged data; the two options are explored below.
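The speeder and straight-lining checks described above can be sketched in a few lines. This is a minimal illustration, not any particular tool's algorithm; the record layout, the 40% of-median cutoff and the function names are assumptions chosen for the example.

```python
from statistics import median

# Hypothetical response records: (respondent_id, seconds_taken, grid_answers)
responses = [
    ("r1", 310, [3, 4, 2, 4, 3]),
    ("r2", 45,  [5, 5, 5, 5, 5]),   # fast AND straight-lined
    ("r3", 280, [2, 2, 3, 1, 4]),
    ("r4", 350, [4, 3, 4, 2, 5]),
]

def flag_speeders(records, threshold=0.4):
    """Flag respondents who finished in under `threshold` x the median time."""
    cutoff = threshold * median(sec for _, sec, _ in records)
    return {rid for rid, sec, _ in records if sec < cutoff}

def flag_straightliners(records):
    """Flag respondents who gave the identical answer to every grid item."""
    return {rid for rid, _, grid in records if len(set(grid)) == 1}

print(flag_speeders(responses))        # {'r2'}
print(flag_straightliners(responses))  # {'r2'}
```

Commercial tools use more sophisticated models, but the underlying idea is the same: compare each respondent against the distribution of all respondents and flag the outliers for the researcher to review.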
After running the data cleaning software or tool, one must undertake a manual analysis of the data for dirty data. As well as looking for straight-liners, pattern responses, gibberish and other dirty data, the researcher needs to keep an eye out for test responses, nonsense or missing open text data, and contradictions.
During the design phase of a survey, the researcher will test the survey a number of times to ensure that all the required questions and responses are included, that any logic applied to the survey functions correctly, and to check accessibility and survey fatigue. It is good practice for survey designers to get others to take the survey to provide a second pair of eyes; the survey will also be sent to the client who commissioned it and is often shared further through the project team. This sharing in the test phase is usually done via a test link, which allows testers to take the survey as it will appear in the field, and any responses are recorded by the survey software. It is important to identify and remove any data created during this test phase as part of the data cleaning process, as it is not genuine data and may influence the results of the survey.
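One simple way to strip test-phase responses, assuming the survey software records a submission timestamp, is to drop everything recorded before the go-live date. The field names and dates below are invented for illustration; some platforms also tag test-link responses explicitly, which is more reliable when available.

```python
from datetime import datetime

# Hypothetical submissions; anything recorded before the go-live date is
# assumed to be test data entered via the test link.
GO_LIVE = datetime(2024, 3, 1)

submissions = [
    {"id": "t1", "submitted": datetime(2024, 2, 20)},  # tester
    {"id": "p1", "submitted": datetime(2024, 3, 2)},
    {"id": "p2", "submitted": datetime(2024, 3, 5)},
]

genuine = [s for s in submissions if s["submitted"] >= GO_LIVE]
print([s["id"] for s in genuine])  # ['p1', 'p2']
```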
Nonsense and gibberish data occur within open text responses. Nonsense data consists of responses that do not fit the topic or context of the survey. Examples include a respondent asking a question within the survey, e.g. "why don't we have an Isite anymore?", or commenting on an unrelated topic, e.g. "why do we spend thousands on cycleways to please a few" in a survey on proposed changes to dog control bylaws. Gibberish data occurs when a respondent randomly mashes the keyboard just to fill in the textbox.
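Gibberish in open text can often be caught with crude heuristics: one-word answers, or runs of letters with almost no vowels (the typical signature of keyboard mashing). The function below is a rough sketch with an arbitrary vowel-ratio threshold, not a substitute for manual review, and nonsense data in particular still needs a human reader who knows the survey's context.

```python
import re

def looks_like_gibberish(text, min_vowel_ratio=0.2):
    """Crude heuristic: flag one-word answers, or answers whose letters
    contain almost no vowels (a common sign of keyboard mashing)."""
    words = text.split()
    if len(words) <= 1:
        return True
    letters = re.sub(r"[^a-z]", "", text.lower())
    if not letters:
        return True
    vowels = sum(ch in "aeiou" for ch in letters)
    return vowels / len(letters) < min_vowel_ratio

print(looks_like_gibberish("asdfghjkl"))                  # True (one word)
print(looks_like_gibberish("qwrt zxcv kjhg"))             # True (no vowels)
print(looks_like_gibberish("The dog park hours work well"))  # False
```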
Contradictions are responses within a survey that do not match an earlier response. They often occur when the respondent has answered a radio button or checkbox question that is then followed up with an open text question; however, contradictions can also occur between radio button and checkbox questions themselves.
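Contradictions between closed questions can be checked programmatically by writing consistency rules across question pairs. The example below is a made-up rule for a dog-related survey (the field names and data are assumptions); contradictions involving open text usually still require a manual read.

```python
# Hypothetical cross-question consistency check: a respondent who says
# they own no dogs should not report a dog count greater than zero.
responses = [
    {"id": "r1", "owns_dog": "No",  "dog_count": 2},   # contradiction
    {"id": "r2", "owns_dog": "Yes", "dog_count": 1},
    {"id": "r3", "owns_dog": "No",  "dog_count": 0},
]

contradictions = [
    r["id"] for r in responses
    if r["owns_dog"] == "No" and r["dog_count"] > 0
]
print(contradictions)  # ['r1']
```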
Once the dirty data has been identified the researcher can remove it from the data set which will improve the quality of the data set.
Partial or missing data occurs when a respondent starts the survey but does not complete it, or misses questions within the survey. This leaves the researcher with a conundrum: do they remove the respondent's answers from the data set, or do they keep the partial data and convert it to a complete response? If the researcher deletes the data, they lose the information that was supplied; if they convert the partial response to a completed response, it may skew the results of the survey. There is no easy answer on whether or not to delete partial data. The nature of the survey, the client's brief and needs, and the researcher's own experience will guide this decision.
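One common middle ground between deleting all partials and keeping everything is a completeness threshold: keep partials that answered most of the survey and drop the rest. The sketch below assumes an invented five-question survey and an arbitrary 60% cutoff; the right threshold, if any, depends on the survey and the client's needs as described above.

```python
# Hypothetical five-question survey.
QUESTIONS = ["q1", "q2", "q3", "q4", "q5"]

partials = [
    {"id": "r1", "answers": {"q1": "a", "q2": "b", "q3": "c", "q4": "d"}},
    {"id": "r2", "answers": {"q1": "a"}},  # barely started
]

def completeness(record):
    """Fraction of survey questions this respondent answered."""
    return len(record["answers"]) / len(QUESTIONS)

# One policy among several: keep partials at least 60% complete.
kept = [r["id"] for r in partials if completeness(r) >= 0.6]
print(kept)  # ['r1']
```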
Data cleaning can be an expensive process, not only in terms of software but also the time and labour of undertaking the manual data analysis. Some forethought and sound survey design can reduce the incidence of dirty data, improve the quality of the data, and allow you to get to the fun stuff, the analysis, sooner.