Data Quality – a Multidimensional Approach
August 2020, Nigel Turner, Principal Consultant, EMEA
In my blogs and articles over the lockdown period I’ve avoided talking about the impact of the Covid-19 pandemic and the heavy reliance on good quality data to support the models needed to combat and mitigate its effects. I’ve decided to break my silence in this blog because a major data story recently hit the headlines in my part of the world, Wales in the United Kingdom. This story was so close to home that I felt compelled to highlight and comment on it, and to use it to stress why the need for good data quality is more important than ever.
As the lockdown was imposed, the Welsh Government compiled a list of 75,000 people living in Wales who were classified as ‘vulnerable’. These were mainly older people, or those with existing health conditions that would put them particularly at risk should they contract the virus. Letters were duly sent out advising them to stay at home for 12 weeks. It soon became evident, however, that not all was well. It was reported that 13,000 letters were sent to the wrong addresses (over 17% of all letters sent), with the result that 13,000 vulnerable people were not advised to shield, while others not in the vulnerable category were told to stay at home. I don’t need to spell out the implications: high-risk people may well have become severely ill, or worse, as a result. It also damaged trust in the Government at a time when combating Covid relied heavily on the population complying with its instructions.
The Welsh Government blamed these problems on a ‘processing error’ (a standard non-explanation when these things happen). They duly re-sent the 13,000 letters, but even today vulnerable people are still reporting that they have received no advice, and healthy people are still being told to shield when they don’t need to. The problems rumble on.
A more plausible explanation is rooted in underlying data quality issues. The Government assembled its list from several health and social services data sources across Wales, and it emerged that this data had a plethora of problems. These included incomplete and missing data, duplicate records, out-of-date addresses and contact numbers, and an inability to merge the sources effectively into a definitive list because of a lack of data standards and the resulting inconsistencies of format and content.
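To make the merging problem concrete, here is a minimal sketch in Python with pandas. The source names, fields and records are invented for illustration (they are not the actual Welsh data), but they show how an absence of shared data standards defeats a naive merge, and how basic standardisation recovers the link:

```python
import pandas as pd

# Two hypothetical sources holding the same person, with no shared
# data standards between them.
health = pd.DataFrame({
    "nhs_no": ["485 777 3456"],
    "name": ["JONES, Gwyn"],
    "postcode": ["CF10 1EP"],
})
social = pd.DataFrame({
    "nhs_no": ["4857773456"],
    "name": ["Gwyn Jones"],
    "postcode": ["cf10 1ep"],
})

# A naive exact-key merge finds no match at all.
naive = health.merge(social, on="nhs_no")
print(len(naive))  # 0 -- the same person is silently dropped

# Standardising the key fields first lets the records link correctly.
for df in (health, social):
    df["nhs_no"] = df["nhs_no"].str.replace(" ", "", regex=False)
    df["postcode"] = df["postcode"].str.upper().str.replace(" ", "", regex=False)

linked = health.merge(social, on="nhs_no", suffixes=("_health", "_social"))
print(len(linked))  # 1 -- the match is recovered after standardisation
```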
This tale illustrates all too well why data quality matters. It matters because all other data disciplines rely so heavily on it. Predictive modelling, analytics, business intelligence and the rest can only produce reliable results and models if the underlying data that feeds them can itself be relied on. Clearly in this case, it couldn’t.
So how can you ensure that data quality is fit for purpose, whether it’s combating Covid, estimating product sales or marketing to potential customers? Creating and maintaining good quality data depends on five basic activities:
- Understand what data is stored and processed and how it is used within an organisation
- Baseline the current quality of the data and assess how well it is meeting business needs and uses
- Where the data is not fit for purpose, set quality improvement targets to guide remedial activities
- Undertake improvement initiatives (encompassing people, process, technology and the data itself) and measure the impact against targets
- Regularly measure the data and continue to improve and maintain it so that it meets current and future business needs
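To illustrate the baseline, target and re-measure steps above, here is a minimal sketch in Python; the data set, the completeness rule and the target figures are invented for the example, not a prescription:

```python
import pandas as pd

# A toy data set with obvious gaps.
records = pd.DataFrame({
    "name": ["A. Evans", "B. Hughes", None, "D. Price"],
    "address": ["1 High St", None, "3 Castle Rd", "1 High St"],
})

def completeness(series: pd.Series) -> float:
    """Share of non-blank values in a column (0.0 to 1.0)."""
    return float(series.notna().mean())

# Step 2: baseline the current quality of the data.
baseline = {col: completeness(records[col]) for col in records.columns}

# Step 3: set improvement targets where the data falls short.
targets = {"name": 0.99, "address": 0.95}

# Step 5: measure regularly and report the gap against each target.
for col, target in targets.items():
    print(f"{col}: {baseline[col]:.0%} complete (target {target:.0%})")
```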
In this process, a critical activity is measuring the data and setting improvement targets. To do this it’s important to recognise that one data quality measure is never enough; data quality consists of several dimensions, and any data set needs to be evaluated against them. The bad news is that there is no industry-wide agreement on what these dimensions are, and there are many variants. I’ve personally always favoured the following seven dimensions, split into two categories: Content (focused on the data itself) and Context (focused on the use of the data). The five Content dimensions are:
- Completeness – is all the expected data present and in the correct fields? Are there unintentionally blank fields, e.g. no date of birth in a health record?
- Accuracy – does the data reflect the real world? Is the address recorded the correct current one for the individual concerned?
- Uniqueness – in a data source, is the entry unique or is there unintended and uncontrolled duplication of records?
- Validity – does the data conform to a specified or expected format or business rule, e.g. numeric, alphanumeric, a maximum of 14 characters etc.? Can customers have an age greater than 120 years old, or schoolchildren an age greater than 18?
- Consistency – where data is deliberately held more than once in different sources, are the sources consistent?
The Context dimensions are:
- Accessibility – do the users who need the data have access to it?
- Timeliness – is the data available to users when they need it and is it sufficiently timely to meet their needs?
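To show how several of these dimensions translate into concrete checks, here is an illustrative sketch on a toy patient list. The field names, business rules and records are my own invented examples; real checks would be driven by the business rules agreed for each data set:

```python
import pandas as pd

patients = pd.DataFrame({
    "patient_id": [101, 102, 102, 104],
    "date_of_birth": ["1950-03-12", None, "1944-07-01", "1890-01-01"],
    "age": [74, 58, 80, 130],
})

# Completeness: are the expected fields populated?
dob_complete = patients["date_of_birth"].notna().mean()

# Uniqueness: is there uncontrolled duplication of the key?
duplicate_ids = patients["patient_id"].duplicated().sum()

# Validity: does the data obey a stated business rule (age <= 120)?
invalid_ages = (patients["age"] > 120).sum()

print(f"Completeness (date_of_birth): {dob_complete:.0%}")
print(f"Duplicate patient_ids: {duplicate_ids}")
print(f"Ages breaking the <= 120 rule: {invalid_ages}")
```

Accuracy and consistency are harder to automate in this way, as they need a trusted reference source or a second copy of the data to compare against.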
Data can then be baselined against these dimensions. Targets can be set and reported on regularly in support of a programme of continuous data quality improvement. I end with a few final tips to make this approach work in practice:
- It is vital to combine any data quality improvement initiative with Data Governance, where named individuals are made responsible for particular data sets or fields. Nominated data owners and/or data stewards should be the primary drivers of improvement and should specify the data quality targets.
- Combining data quality with the disciplines of master and reference data management (MDM / RDM) will help to eliminate many of the problems with data completeness, validity, consistency and uniqueness by creating and maintaining single versions of the truth (see the sketch after these tips).
- When tackling data quality problem areas it’s vital to focus on the data that has the biggest impact on the business or organisation. If you look hard enough (and often you don’t have to look very hard) you can find data quality problems in a multitude of places across an organisation. Not all are equally important; prioritise those that can be directly related to key business drivers and imperatives.
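As a brief illustration of the MDM point in the second tip, here is a hypothetical sketch of consolidating duplicate records into a single ‘golden’ record; the survivorship rule used (prefer the most recently updated value) is just one common choice among many:

```python
import pandas as pd

# Two conflicting records for the same customer.
customers = pd.DataFrame({
    "customer_id": [7, 7],
    "address": ["12 Bridge St", "4 Quay Lane"],
    "updated": pd.to_datetime(["2019-05-01", "2020-03-15"]),
})

# Keep the most recently updated row per customer as the master record.
golden = (customers.sort_values("updated")
                   .groupby("customer_id", as_index=False)
                   .last())
print(golden)  # one row: customer 7 at 4 Quay Lane
```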
Taking this multi-dimensional approach to tracking and improving data quality is essential. If these disciplines had been adopted on health and social care data in Wales before the onset of the pandemic, many of the problems which emerged would have been avoided or at least reduced. Good data quality not only enhances trust and reputation, it can even save lives.