Data cleaning: is it time to stop sweeping it under the carpet? An example from the Dogslife project.

Presenter Charlotte Woolley
Authors Charlotte S.C. Woolley, Ian G. Handel, B. Mark Bronsvoort, Jeffrey J. Schoenebeck, Dylan N. Clements.
Affiliations The Roslin Institute and The Royal (Dick) School of Veterinary Studies, University of Edinburgh.
Presentation Type Poster

Abstract

Even with careful study design and extensive validation, large datasets are often heterogeneous and require cleaning prior to analysis to prevent losses in research validity, quality and statistical power. Many publications report that data were ‘cleaned’, but few document the process reproducibly, and values identified as ‘outliers’ are commonly deleted without reporting the possible causes of error. Our aim was to develop a novel, automated data cleaning algorithm for growth data (height and weight) that could be applied to large datasets.
Dogslife is an internet-based, longitudinal cohort study of Kennel Club registered Labrador Retrievers living in the UK, which was launched in 2010 and has over 7500 registered dogs to date. The main objective of Dogslife is to identify risk factors for canine health and disease by collecting information from owners via regular questionnaires. In addition to questionnaire data, the study has collected DNA and faecal samples from subsets of the cohort, producing genomic and microbiome data.
We developed our data cleaning pipeline in R and used rule-based approaches, non-linear mixed-effects mathematical models and text analysis to identify common errors such as duplicate entries and typing, decimal point, unit, menu/option, intentional, website-generated and measurement errors. Individuals were permitted to differ from the population through the use of repeated measurements and alternative data sources. The method avoids modifying unusual but biologically plausible values, prioritises data repair over removal and explicitly reports the decision-making process behind why a particular data entry is modified or deleted.
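To illustrate the rule-based component, a minimal sketch in R follows. It repairs two common error types in repeated weight measurements: decimal point shifts and pounds entered as kilograms. This is not the published Dogslife pipeline; the column names (dog_id, weight_kg), ratio thresholds and tolerance are hypothetical placeholders.

    library(dplyr)

    # Repair decimal point and unit errors in repeated weight measurements,
    # recording an explicit audit trail of every change.
    repair_weights <- function(df, tol = 0.15) {
      df %>%
        group_by(dog_id) %>%
        mutate(
          # robust per-dog reference built from the other repeated measurements
          ref = sapply(seq_along(weight_kg),
                       function(i) median(weight_kg[-i], na.rm = TRUE)),
          ratio = weight_kg / ref,
          repaired = case_when(
            abs(ratio - 10)      < 10 * tol      ~ weight_kg / 10,       # extra decimal shift
            abs(ratio - 0.1)     < 0.1 * tol     ~ weight_kg * 10,       # missing decimal shift
            abs(ratio - 2.20462) < 2.20462 * tol ~ weight_kg / 2.20462,  # pounds entered as kg
            TRUE                                 ~ weight_kg             # plausible: left unchanged
          ),
          # explicit record of what was changed and why
          action = case_when(
            repaired == weight_kg                ~ "kept",
            abs(ratio - 2.20462) < 2.20462 * tol ~ "unit error repaired",
            TRUE                                 ~ "decimal point error repaired"
          )
        ) %>%
        ungroup()
    }

Repaired values are written to a new column rather than overwriting the original, so every decision remains visible for review.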
We validated our cleaning algorithm for growth variables (weight and height) on three other independent data sources from studies with fundamentally different designs: veterinary consultation weight records for Labrador Retrievers from SAVSNET (the Small Animal Veterinary Surveillance Network), clinical Labrador Retriever weight records from a veterinary hospital network, and publicly available (via the UK Data Service) human weight and height data from CLOSER (Cohort & Longitudinal Studies Enhancement Resources), each with varying proportions of artificially simulated errors. We found that our algorithm could be reproducibly applied as an effective data cleaning method on all of the validation datasets. We also compared our method with uncleaned data and six other cleaning methods and found that our algorithm out-performed them, with greater accuracy and fewer unnecessary data deletions.
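The simulation-based validation can be sketched, under assumed error types and rates, as follows; the function names, error proportion and scoring rules are illustrative rather than those used in the study.

    # Inject decimal point and unit errors into a known proportion of values
    simulate_errors <- function(weights, prop = 0.05) {
      idx <- sample(length(weights), size = ceiling(prop * length(weights)))
      factors <- sample(c(10, 0.1, 2.20462), length(idx), replace = TRUE)
      corrupted <- weights
      corrupted[idx] <- corrupted[idx] * factors
      list(corrupted = corrupted, error_index = idx)
    }

    # Score a cleaned vector against the known truth: the proportion of injected
    # errors repaired and the proportion of originally correct values left intact
    # (values deleted by a cleaning method are represented as NA and count as losses)
    score_cleaning <- function(truth, cleaned, error_index, tol = 1e-6) {
      repaired <- !is.na(cleaned[error_index]) &
        abs(cleaned[error_index] - truth[error_index]) < tol
      kept <- !is.na(cleaned[-error_index]) &
        abs(cleaned[-error_index] - truth[-error_index]) < tol
      c(errors_repaired = mean(repaired), clean_values_kept = mean(kept))
    }

Any cleaning method can then be compared on the same corrupted data by running it between simulate_errors and score_cleaning.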
There is an increasing demand for data cleaning methodologies to be thoroughly reported so that they can be reproduced, tested and adapted by the wider research community. In the future, it is vital that data cleaning is treated as an integral part of study design and considered as early as possible in order to ensure that data quality is conserved. Our methods have broad applicability to longitudinal and cross-sectional growth data and we propose that they could be adapted for use in other breeds, species and fields.