Managing Data Quality. Tim King
Part I
The challenge of enterprise data
This first part of the book will help you to understand better the nature of the data asset and why it can be difficult to manage, particularly in an enterprise or organisational context. Generic behaviours of people relating to data will be explored to help understand how people can affect data quality. Finally, some real-life examples and case studies of data quality problems will be used to help you understand some of the impacts of data that have poor quality.
1 The data asset
This chapter describes the differences between data and information, and how these relate to most business activities. We then consider the nature of the data asset and the generic life cycles of data and explain what is meant by the term ‘data quality’. Finally, we introduce the objectives of data quality management.
What are data?
Before going much further, there are some key terms and concepts that need to be defined and clarified to help ensure consistent understanding as you read this book.
The title of this book is Managing Data Quality. Three related terms appear together so often in discussions of the impact of computer technology on organisations that they need to be clarified: data, information and knowledge.
When you have more than one data professional in a room, it is likely that there will be fierce debate about these terms. Even the ISO Online Browsing Platform [1] (a place where all ISO definitions are gathered together) has numerous different definitions for these terms.
As the subject of this book is data, we can establish a solid foundation for our understanding by referring to the definition for data in ISO 8000-2:
Data: ‘reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing’.
In the case of definitions for information and knowledge, making a choice is more controversial, not least because potential definitions often use the other two terms and any single collection of definitions becomes recursive. However, we believe the following key observations provide sufficient understanding to read the remainder of this book (while we leave more detailed discussion to others):
Use of the term ‘information’ suggests richness of meaning, and typically takes an end-user view of the value of data to organisations in enabling decision making.
Use of the term ‘knowledge’ suggests an understanding acquired through experience or education, putting knowledge outside the scope of this book; for example, it doesn’t matter how many books you read about cycling, it is only when you have ridden a bike that you have knowledge of how to cycle!
[1] https://www.iso.org/obp/ui/
Another complication is use of the terms ‘structured data’ and ‘unstructured data’. These terms have been a handy tool for marketing teams who are promoting particular software functionality (typically to extract meaning from unstructured data), but the two terms hide the reality that no data set in digital form is either fully structured or fully unstructured.
Structured data contain explicit, discrete elements (e.g. the tables, columns and keys within a relational database or the tags within an XML file) to represent meaning. These elements enable automation to generate insight and foresight from the meaning (e.g. being able to identify all the children in a hospital database by filtering the rows where age is less than 18).
Unstructured data are fundamentally text and images, which provide meaning in a way that requires either human expertise or artificial intelligence methods to process the meaning (e.g. a doctor reviews the medical scan that is the content of an image file).
In these examples, though, the database will typically also include unstructured elements (e.g. a free-text field to capture observational notes) and the digital file of the MRI scan will also include structured data in the form of metadata (e.g. the creation date) to support management of all the images.
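The hospital example can be made concrete with a minimal sketch. The record layout, field names and sample values below are hypothetical and are not taken from any real system; the point is only that the discrete age field supports an automated filter, while the free-text notes in the same record do not.

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    # Structured elements: discrete, typed fields (hypothetical names).
    patient_id: int
    age: int
    # Unstructured element: free-text observational notes.
    notes: str


# Invented sample data, for illustration only.
records = [
    PatientRecord(1, 7, "Complains of mild headache after a fall."),
    PatientRecord(2, 42, "Routine check-up, no concerns raised."),
    PatientRecord(3, 17, "Follow-up on sprained ankle."),
]


def find_children(rows):
    """Return the records where age is less than 18."""
    return [r for r in rows if r.age < 18]


children = find_children(records)
print([r.patient_id for r in children])  # prints [1, 3]
```

No comparably simple filter exists for the notes field: deciding which notes describe an injury, for instance, would need a human reader or a text-analysis method.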
Furthermore, a spreadsheet is essentially semi-structured, sitting somewhere between a database and an image file, because the rows and columns provide some structure but without the full richness of a relational database or an XML file.
In summary, no data set is ever entirely structured or unstructured. Structure is definitely important to data quality, though, because it captures a more precise, controllable set of requirements for the data. Requirements for unstructured data are less easy to enforce by definitive, repeatable computer-based algorithms.
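As a sketch of what a definitive, repeatable requirement might look like, the following checks apply a rule to a structured age field and a much weaker rule to a free-text field. The function names and the permitted range of 0 to 130 are assumptions made for illustration, not requirements stated in this book.

```python
def check_age(value):
    """Return a list of rule violations for a structured 'age' field."""
    problems = []
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(value, int) or isinstance(value, bool):
        problems.append("age must be an integer")
    elif not 0 <= value <= 130:  # assumed plausible range
        problems.append("age must be between 0 and 130")
    return problems


def check_notes(value):
    """Return rule violations for a free-text 'notes' field.

    Only the presence of text can be checked here; whether the text is
    accurate or meaningful is beyond a simple rule.
    """
    problems = []
    if not isinstance(value, str) or not value.strip():
        problems.append("notes must be non-empty text")
    return problems


print(check_age(17))    # prints []
print(check_age(-3))    # prints ['age must be between 0 and 130']
print(check_notes("Patient is 150 years old."))  # prints []
```

The last call shows the limitation: the note contains an implausible age, but the rule for free text cannot detect it, whereas the same value in the structured field would be rejected.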
Data as part of business activities
Any business activity should support the strategy of the organisation (and may have some part to play in developing this strategy). There should be governance in place to ensure that there is suitable senior or executive control and monitoring of this activity. Business activity in this context is not just applicable to commercial organisations, but refers to the activity by which any organisation delivers its core mission. Figure 1.1 illustrates this relationship.
The four core components of a typical business activity are:
The process, which defines the individual steps to be undertaken and, importantly, should ensure that the end-to-end process is effective in delivering the desired outcomes.
Data, which include the inputs to and outputs from the process, and which flow through it.
Software and hardware systems, which automate the process by storing and manipulating the data, although not every process will be automated by software.
People, who are the ‘actors’ in the process, undertaking key process steps and ensuring suitable organisational outcomes.
Despite data being a key enabler for any process, in many organisations there is a greater management focus on the technology elements, particularly when undertaking business change projects involving software. The software product is likely to be expensive, have a recognised name and be a core part of the project, and it therefore attracts much attention.
In typical situations, however, the data that will be used to enable the technology to deliver the required outcomes are the data in one or more existing software systems. These data will need to be migrated to the new software tool, but the data migration process is typically a high-risk part of the overall project and, if not undertaken correctly, will actually degrade the quality of the data.
If the quality of existing data is perceived to be poor then, no matter how good a new software tool is and how well it has been implemented, the outcomes of the system will be limited by the quality of the data. Poor-quality data can also make data migration far more challenging and expensive, and may mean that migration is not feasible at all.
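One simple way to notice whether a migration has degraded the data is to compare the source and target data sets after the move. The sketch below is an illustrative assumption rather than a method described in this book: the function names, the field names and the sample rows are all hypothetical, and it checks only row counts and field completeness.

```python
def completeness(rows, field):
    """Fraction of rows where the field is present and non-empty."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def reconcile(source, target, fields):
    """Report lost rows and fields whose completeness dropped."""
    report = {"missing_rows": len(source) - len(target), "degraded_fields": []}
    for f in fields:
        if completeness(target, f) < completeness(source, f):
            report["degraded_fields"].append(f)
    return report


# Invented sample data: the postcode of the second record was lost in migration.
source = [{"id": 1, "postcode": "AB1 2CD"}, {"id": 2, "postcode": "EF3 4GH"}]
target = [{"id": 1, "postcode": "AB1 2CD"}, {"id": 2, "postcode": ""}]

print(reconcile(source, target, ["id", "postcode"]))
# prints {'missing_rows': 0, 'degraded_fields': ['postcode']}
```

A check of this kind catches only some forms of degradation; values that are present but wrong after migration would need additional, field-specific rules.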
Figure