Managing Data Quality. Tim King
6 × 1 | 27/4/14 | £7.12
045 | Wood | 60.0 | 29.5 | 28.6 | Yellow | 15/7/15 | £4.21
Accuracy: Whether the data reflect the real object they represent. For example, looking at the records in Table 1.1, by inspecting the real object (the bricks) we can confirm that brick 045 is a yellow wooden block with the dimensions L 60 × W 29.5 × H 28.6. If the real object turns out to be a green brick or to have different dimensions from those in the table, then the data are inaccurate.
Completeness: Whether all relevant items are recorded and all their attributes are populated. For example, the attributes for brick 010 are not complete. Similarly, if the toy box contains a brick 017, the list of bricks is not complete.
Consistency: Whether an entity recorded in more than one data store is comparable across data stores. For example, brick 012 has a purchase date of 01-09-2001, but in the purchasing system the transaction date is 04-12-2001. Because the two dates differ, the data are inconsistent.
Validity: Whether data conform to the specified format. For example, the Purchase Date field contains many different date formats; which is the valid format?
Timeliness: Whether data are up to date and are available to users in a timely manner. For example, the entry for brick 045 could have been added two months after the purchase date, which is slower than the required update frequency. Additionally, if bricks are being purchased daily, then an absence of new data could indicate that the data update process has failed.
Uniqueness: Whether a single representation exists for each physical entity. For example, no ID appears twice in the table, so it is likely that all entries for these bricks are unique.
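Several of these dimensions lend themselves to simple automated checks. The following is a minimal sketch of how completeness, validity, uniqueness and consistency might be tested. The records are illustrative: only the values for brick 045 and the two dates for brick 012 come from the text, while the other values, the chosen date format and all function names are assumptions. Accuracy and timeliness are left out because they need information from outside the table (the real object, or the time at which the entry was made).

```python
from collections import Counter
from datetime import datetime

# Illustrative records loosely modelled on Table 1.1. Only brick 045's values
# and brick 012's purchase date come from the text; the rest are made up.
bricks = [
    {"id": "010", "material": None, "colour": None, "purchase_date": None},
    {"id": "012", "material": "Wood", "colour": "Red", "purchase_date": "01-09-2001"},
    {"id": "045", "material": "Wood", "colour": "Yellow", "purchase_date": "15/7/15"},
]

# Hypothetical second data store (the purchasing system), keyed by brick ID.
purchasing_system = {"012": "04-12-2001", "045": "15/7/15"}

# Assumption: the data specification settles on day/month/two-digit-year.
ASSUMED_DATE_FORMAT = "%d/%m/%y"


def incomplete_ids(records):
    """Completeness: IDs of records with at least one unpopulated attribute."""
    return [r["id"] for r in records if any(v is None for v in r.values())]


def invalid_date_ids(records, fmt=ASSUMED_DATE_FORMAT):
    """Validity: IDs whose populated purchase date does not match the format."""
    bad = []
    for r in records:
        value = r["purchase_date"]
        if value is None:
            continue  # a missing value is a completeness issue, not a validity one
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append(r["id"])
    return bad


def duplicate_ids(records):
    """Uniqueness: IDs that appear more than once."""
    counts = Counter(r["id"] for r in records)
    return [brick_id for brick_id, n in counts.items() if n > 1]


def inconsistent_ids(records, other_store):
    """Consistency: IDs whose purchase date differs from the other data store."""
    return [
        r["id"]
        for r in records
        if r["id"] in other_store and r["purchase_date"] != other_store[r["id"]]
    ]
```

Each function returns the IDs that fail its check, so an empty list means that no problem was found for that dimension in this data set.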
This example analysis is the starting point for data quality, but further work would need to be done to provide a complete technical approach to ensure data are fit for purpose. This involves generating an explicit data specification to capture all the identified requirements and a set of tests to ensure the data meet these requirements. These tests vary from simple (e.g. comparing the content of a data set to the formal definition in the data specification of the required syntax) to complex (e.g. identifying if, for all current customers, contact details exist and are correct in the customer relationship management database).
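A simple test of the syntactic kind could look like the sketch below. It assumes, purely for illustration, that the data specification is expressed as one regular expression per field; the field names, patterns, sample records and the function `check_against_spec` are hypothetical and are not taken from any standard.

```python
import re

# Hypothetical data specification: one syntax rule (a regular expression) per field.
SPEC = {
    "id": r"\d{3}",                            # exactly three digits
    "colour": r"[A-Z][a-z]+",                  # capitalised word
    "purchase_date": r"\d{1,2}/\d{1,2}/\d{2}", # d/m/yy
}


def check_against_spec(records, spec=SPEC):
    """Return (record index, field, value) for every value that breaks the spec."""
    failures = []
    for index, record in enumerate(records):
        for field, pattern in spec.items():
            value = record.get(field)
            # A missing value or a value with the wrong syntax both count as failures.
            if value is None or not re.fullmatch(pattern, str(value)):
                failures.append((index, field, value))
    return failures


# Made-up sample data: the second record breaks two of the rules.
records = [
    {"id": "045", "colour": "Yellow", "purchase_date": "15/7/15"},
    {"id": "12", "colour": "Red", "purchase_date": "01-09-2001"},
]
failures = check_against_spec(records)
```

Keeping the rules in a separate structure from the checking code means the specification, rather than the test logic, remains the place where requirements are recorded. The more complex tests mentioned above, such as confirming that contact details are correct, cannot be reduced to pattern matching in this way.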
In summary, data quality dimensions prompt the analysis of data requirements. These dimensions are, however, ultimately superseded by the content of the resulting data specification, which becomes the formal basis on which to test the quality of each relevant data set.
Given these technical complexities that underpin data quality, organisations face a challenge to ensure a consistent, effective and efficient approach to data management across all relevant stakeholders. Facing this challenge is the role of data quality management.
What is data quality management?
The subject of this book is data quality management, so it is important that the meaning of this term is clear. ISO 8000-2 defines data quality management as:
coordinated activities to direct and control an organization with regard to data quality.
Whilst definitions in ISO standards can sometimes require a little effort to understand, this definition is relatively clear. In essence, it describes an overall approach consisting of different activities to monitor, manage and control data quality with suitable oversight to direct and control these activities.
Data quality management is more than just managing data quality; it involves consideration of why data are incorrect in the first place. For example, if you undertake a data cleansing exercise without also addressing the underlying root cause of the data errors, then the cleansing will very likely have to be repeated on a regular basis.
Data quality management is also not about trying to achieve an idealistic, ‘perfect’ data set. As mentioned earlier, perfection would probably be impossible to achieve, and the cost, time and effort of pursuing it will not be attractive to any organisation. Data quality management is, therefore, about balancing the current data quality against the required quality and the benefits that improvement can deliver.
Summary
Data are a key element of any enterprise.
By treating data as an asset, the enterprise focuses on delivering value from data.
Data quality is conformance to requirements rather than abstract perfection.
The next chapter explores the challenge of managing the requirements to establish the foundation for conformance.
Managing data quality is not an easy or simple task, and there are various factors that determine the purpose and scope of data quality management in an enterprise context. This chapter explores the challenges of those factors and provides a summary checklist to help you identify those challenges that apply in your own organisation.
The complex data landscape
Within all but the smallest of organisations and enterprises, there will typically be numerous enterprise software tools, specialist decision support tools and databases or spreadsheets created by end users. There could also be a legacy of paper records and documents to consider. Cloud data stores and web-based software services, which can be established quickly, make the data landscape more complex still, and that complexity is growing rapidly. Physical locations of data stores for an organisation are no longer solely in premises owned by that organisation.
Each of these data stores is likely to have a complex data structure to suit the requirements of the software. Developing the data models for these data stores will be a large task for an experienced data modeller. Taking a ‘step up’, enterprise architects should have an overview of the conceptual and logical data models for each of the corporate data stores. They should also understand the different areas where the same or similar data are stored.
This leads into the challenge of master data management (MDM); in other words, for all the entities that exist in more than one data store, there is awareness not only of all these entities, but also of the ‘master’ data source that is the ‘single source of truth’. Good examples of entities that are likely to appear in multiple data stores include: customers; products; employees; assets; and materials.
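The awareness described here can be pictured as a small inventory. The sketch below is a hypothetical illustration only: the data store names, the assignment of entities to stores and both function names are assumptions. It lists the entity types held in more than one data store and flags any that lack a designated master source.

```python
# Hypothetical inventory: which entity types each data store holds.
stores = {
    "crm": {"customers", "products"},
    "erp": {"customers", "products", "employees", "assets", "materials"},
    "hr_system": {"employees"},
}

# Hypothetical designation of the master data source per entity type.
# "employees" is deliberately left out to show how a gap is reported.
masters = {"customers": "crm", "products": "erp"}


def shared_entities(stores):
    """Map each entity type held in more than one store to the stores holding it."""
    holders = {}
    for store, entities in stores.items():
        for entity in entities:
            holders.setdefault(entity, []).append(store)
    return {e: sorted(s) for e, s in holders.items() if len(s) > 1}


def missing_masters(stores, masters):
    """Shared entity types with no master, or whose master does not hold them."""
    shared = shared_entities(stores)
    return sorted(e for e, held in shared.items() if masters.get(e) not in held)
```

With the inventory above, `missing_masters(stores, masters)` reports `employees`, because that entity type appears in two stores but has no designated master.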
As data updates are required, MDM is primarily a business approach to ensure they are first applied to the master data source and then replicated to all the dependent data sources. This process can be supported by specific MDM software tools. It needs to be stressed, however, that these can be expensive to install, complex