Data growth remains one of the biggest challenges within the IT industry. Consistently ranked by IT professionals as one of the top three issues that loom above their heads, the sheer volume of data that accumulates within a single company offers unprecedented challenges.
Data is no longer a digital by-product of interest only to scientists; it has evolved into a precious resource, hoarded and gathered and sold as much as any physical object of value. These days, tech-savvy organizations are not only collecting and storing information, they’re applying heavy-duty analytics to get more value out of their data and using it to make more informed decisions. Every company that deals with high volumes of data realizes that a lack of attention to data quality and accuracy can have serious business ramifications.
New technologies are rapidly being built to hold and process data with increasingly mind-blowing efficiency, but surprisingly little attention is being paid to making sure that the data used as a key component in business decisions is trustworthy. Even in this day of technologically informed CEOs and computer-savvy employees, a shocking number of organizations believe that they can survive with poor data quality and consider data cleansing to be a temporary fix needed only when a visible problem occurs.
So what is data quality all about?
Data quality is all about having identification and remediation processes at data entry points and addressing issues at the source. It is not a one-time activity; rather it is a maintenance function, and once the processes are in place, it should be monitored frequently. Organizations need to be proactive, establishing a dedicated group of Data Stewards to improve and monitor the accuracy of data over time.
Where does the problem begin?
The standard data quality problems begin to accumulate from the initial moment of data creation, and data steadily loses veracity and reliability through data decay, data movement, and data use. The transfer of data in particular poses a lot of problems. Data extracted from legacy systems should be cleansed, restructured and reconciled before it is loaded into a data warehouse, and these steps are often overlooked or waved away as unimportant. Even when companies ensure these steps are taken, inconsistencies within the cleansed data still need to be identified and corrected before the data is transformed.
While progress in this area is significant, data decay is an inevitable process. No data remains perfect or valid forever.
How do we resolve the issue?
For an organization to achieve long-term success and avoid a steady decline in data quality, it must identify the root cause of its data quality issues rather than applying fixes only when a problem occurs.
The first step is Assessment. A solid foundation of definitions, rules and project scope is key to a successful data quality program. Before making any decisions or implementing any solutions, it is absolutely essential to definitively answer the following questions:
- What kind of data do you collect?
- Where do you store the data?
- How does the data move?
- How will it be monitored?
To answer these in more than just a superficial way, let’s look at them in more detail.
1. What kind of data do you collect?
This addresses the very basic question of the type of data you’re working with and the context within which the data resides. From one organization to the next, the context and the specific definitions of data can vary dramatically, but the below are the core ‘types’ of data you’re likely to be collecting:
- Customer Data
- Geographic Data
- Asset Data
- Organization Data
The cleansing functions should be built around the business rules that you use to define each of these types of data. For example, the first name of a customer may occasionally include an abbreviated title. The cleansing function should know the complete domain set of values for these titles and should map the values according to the rules defined by the type and context of the data. In this example, the business rules may state that full titles should be the standard in place of abbreviations. The function should be able to correct Jr. to Junior, Sr. to Senior, and Dr. to Doctor, and a record should be flagged if its value does not fall within the given list of values. Starting with clear definitions of the rules associated with data type and data context makes it far easier when you begin implementing a solution.
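As a rough illustration of the title rule above, here is a minimal Python sketch. The `TITLE_MAP` domain set and the `cleanse_title` function are hypothetical names invented for this example; a real implementation would load the domain values from wherever your business rules are maintained.

```python
# Hypothetical domain set: abbreviated title -> full title required by the business rule.
TITLE_MAP = {"jr": "Junior", "sr": "Senior", "dr": "Doctor"}
FULL_TITLES = set(TITLE_MAP.values())


def cleanse_title(raw):
    """Return (clean_value, flagged); flagged is True when raw is outside the domain set."""
    token = raw.strip()
    key = token.rstrip(".").lower()
    if key in TITLE_MAP:
        # Known abbreviation: map it to the full title.
        return TITLE_MAP[key], False
    if token.title() in FULL_TITLES:
        # Already a full title: only normalize the casing.
        return token.title(), False
    # Unknown value: leave it untouched and flag the record for review.
    return token, True


for value in ["Jr.", "Sr", "doctor", "Esq."]:
    print(value, "->", cleanse_title(value))
```

Values outside the list are returned unchanged and flagged rather than guessed at, which keeps the correction decision with a data steward.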
2. Where do you store the data?
Ensure that you have a full understanding of where all your data physically resides and how it is divided up or clustered within individual fields. The data may be homogeneous or heterogeneous and a proper channel will be required to connect data that comes from different sources. Depending on the complexity of your existing systems, this integration may necessitate a few changes to the existing data model or schema design.
For example, the name field might contain a prefix (Mr.), a title (Dr.) and a degree (Ph.D) for each customer. Rather than keeping all of this information in a single field, the model can be adjusted to separate these elements into their own fields, which gives better customer identification and more consistent name formatting.
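A minimal sketch of that kind of split, in Python, might look like the following. The value lists (`PREFIXES`, `TITLES`, `DEGREES`) and the `split_name` function are assumptions made for illustration, not a complete name parser; in particular, a real name that happens to match one of the lists would be misclassified.

```python
import re

# Hypothetical value lists; a real system would load these from its business rules.
PREFIXES = {"mr", "mrs", "ms"}
TITLES = {"dr", "prof"}
DEGREES = {"phd", "md", "mba"}


def _key(token):
    # Compare tokens without punctuation or case: "Ph.D" -> "phd".
    return re.sub(r"[^a-z]", "", token.lower())


def split_name(raw):
    """Split a single name field into prefix, title, degree and the remaining name."""
    fields = {"prefix": None, "title": None, "degree": None}
    name_tokens = []
    for token in raw.replace(",", " ").split():
        k = _key(token)
        if k in PREFIXES and fields["prefix"] is None:
            fields["prefix"] = token
        elif k in TITLES and fields["title"] is None:
            fields["title"] = token
        elif k in DEGREES and fields["degree"] is None:
            fields["degree"] = token
        else:
            name_tokens.append(token)
    fields["name"] = " ".join(name_tokens)
    return fields


print(split_name("Mr. Dr. John Smith, Ph.D"))
```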
3. How does the data move?
Once you know the type of data and where it is stored, the next step in the initial assessment is to understand the typical movement of your data. The data can flow in any number of ways: from source to staging, from staging to operational data store, from data mart to data warehouse. At this stage, it is essential to document every movement pathway. For the data to be moved in a way that minimizes quality loss, mappings need to be set properly. Through mappings, the business can understand where the data is staged and where it gets manipulated.
For example, the customer may have different addresses listed for different usages: billing, mailing, work and home. The current system might be designed to hold only one address usage type per record, which can easily result in duplication. To set it right, the future system can be designed to have a single record holding all the usage types after consolidation.
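The consolidation described above can be sketched in a few lines of Python. The row layout (`customer_id`, `usage`, `address`) and the `consolidate_addresses` function are hypothetical; they stand in for whatever your current system actually stores.

```python
from collections import defaultdict


def consolidate_addresses(rows):
    """Fold one-row-per-usage address records into one record per customer.

    rows: iterable of dicts with customer_id, usage and address keys
    (a hypothetical layout used only for this example).
    """
    merged = defaultdict(dict)
    for row in rows:
        # Later rows overwrite earlier ones for the same customer and usage type.
        merged[row["customer_id"]][row["usage"]] = row["address"]
    return dict(merged)


rows = [
    {"customer_id": 1, "usage": "billing", "address": "1 Main St"},
    {"customer_id": 1, "usage": "home", "address": "9 Oak Ave"},
    {"customer_id": 2, "usage": "billing", "address": "5 Elm Rd"},
]
print(consolidate_addresses(rows))
```

The "last row wins" behavior is a simplification; a production mapping would need an explicit rule for conflicting addresses of the same usage type.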
Additionally, data will change over time, so your solution should be prepared to handle changes both gradual and sudden. A company that exists today can merge with another company tomorrow, so the cleansing process must be reliable both for small, periodic upkeep as data moves back and forth and for large influxes of entirely new data.
4. How will it be monitored?
I can’t stress enough that data quality improvement cannot be done as a one-time activity; data should be periodically monitored and assessed to avoid deterioration. This can be achieved by implementing measures and metrics around the data types, data storage, and data movement. Rather than let a vendor tell you how you want to monitor your data quality, decide up front what you expect to be able to do and see. Your business needs are unique to your company, and how you need to govern your data will reflect that.
For example, the business rules in the validation process may state that a customer should have a valid email address and phone number. The count of invalid email addresses and the count of invalid phone numbers can be measured, and metrics can be derived from those measures. Based on the metric values, the organization can decide how frequently the data should be monitored. Another company, however, may not need phone numbers for a customer record to be valid. This is one of many governance rules that must be decided within your organization.
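To make the measure-versus-metric distinction concrete, here is a small Python sketch. The regular expressions are deliberately loose placeholders, and the `quality_metrics` function, its `require_phone` switch and the record layout are all assumptions for illustration; what counts as "valid" is exactly the governance decision described above.

```python
import re

# Deliberately loose patterns; real validation rules are a governance decision.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_RE = re.compile(r"^\+?\d{10,15}$")


def quality_metrics(customers, require_phone=True):
    """Count invalid emails and phones (measures) and derive percentages (metrics)."""
    total = len(customers)
    bad_email = sum(1 for c in customers if not EMAIL_RE.match(c.get("email") or ""))
    bad_phone = 0
    if require_phone:
        for c in customers:
            # Strip common separators before checking the digits.
            digits = re.sub(r"[\s().-]", "", c.get("phone") or "")
            if not PHONE_RE.match(digits):
                bad_phone += 1
    return {
        "invalid_email_count": bad_email,
        "invalid_phone_count": bad_phone,
        "invalid_email_pct": 100.0 * bad_email / total if total else 0.0,
        "invalid_phone_pct": 100.0 * bad_phone / total if total else 0.0,
    }


customers = [
    {"email": "a@example.com", "phone": "(555) 123-4567"},
    {"email": "not-an-email", "phone": "123"},
]
print(quality_metrics(customers))
```

Passing `require_phone=False` models the second company in the example, for which a missing phone number does not count against a record.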
So how do we put a data quality process into operation?
Profiling is the first step in assessing the accuracy of data. It is used to discover the presence of inaccurate data within any data store, and the more profiling you do, the more inaccuracies you dig out. There are many profiling techniques available, including Column Property Analysis, Structure Analysis and Data Rule Analysis.
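As a rough idea of what a column property analysis can report, the Python sketch below summarizes one column's null count, distinct count, length range and most common value shapes. The `profile_column` function and the particular properties it reports are assumptions chosen for this example, not a standard definition of the technique.

```python
from collections import Counter


def profile_column(values):
    """Summarize basic column properties for a list of raw values."""
    non_null = [v for v in values if v is not None and str(v).strip() != ""]
    lengths = [len(str(v)) for v in non_null]

    def shape(v):
        # Reduce each value to a shape: digits -> 9, letters -> A, other characters kept.
        return "".join("9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in str(v))

    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "top_patterns": Counter(shape(v) for v in non_null).most_common(3),
    }


# A hypothetical postal code column with a short value, a blank and a non-numeric value.
print(profile_column(["12345", "1234", None, "ABCDE", "12345", ""]))
```

Unexpected shapes or lengths in the output are candidates for inaccurate data that the resolution step then has to deal with.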
The results of your profiling then require resolution.
The most common task in data management is the prevention and resolution of duplicate records, particularly where it applies to ensuring a unique, consolidated customer identity. This can be achieved through a key-based methodology.
For example, a customer can be identified through the combination of LastName + AddressLine1 + City or LastName + EmailAddress or LastName + PhoneNumber.
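One way to sketch this key-based matching in Python is shown below. The three match keys come from the example above; the field names, the `_norm` normalization and the `find_duplicates` function are hypothetical, and exact matching on normalized text is a simplification compared with the fuzzy matching many tools apply.

```python
from collections import defaultdict

# Match keys taken from the example above; the field names are hypothetical.
MATCH_KEYS = [
    ("last_name", "address_line1", "city"),
    ("last_name", "email"),
    ("last_name", "phone"),
]


def _norm(value):
    # Lowercase and collapse whitespace so trivial differences do not block a match.
    return " ".join(str(value).lower().split())


def find_duplicates(records):
    """Return a set of (id, id) pairs of records that share at least one complete match key."""
    pairs = set()
    for key_fields in MATCH_KEYS:
        buckets = defaultdict(list)
        for rec in records:
            parts = [rec.get(f) for f in key_fields]
            if all(parts):  # skip keys with missing components
                buckets[tuple(_norm(p) for p in parts)].append(rec["id"])
        for ids in buckets.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


records = [
    {"id": 1, "last_name": "Smith", "email": "j@x.com", "phone": None},
    {"id": 2, "last_name": "SMITH", "email": "J@X.com", "phone": "555"},
    {"id": 3, "last_name": "Jones", "email": "j@x.com", "phone": "555"},
]
print(find_duplicates(records))
```

Records 1 and 2 match on LastName + EmailAddress; record 3 shares an email and a phone number with the others but not a last name, so it is not paired with either.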
The next step in the data quality process is Cleansing. Inconsistent or missing information in an organization’s key data may turn into a serious business issue: in the end, the data does not support the organization in meeting its goals, and that costs money. These issues also come into the spotlight when organizations merge and integrate their IT architectures through numerous data migrations.
Data is deemed unclean for many different reasons, and various techniques have been developed to tackle the problem of data cleansing. Data cleansing is largely an interactive approach, as different sets of data have different rules determining the validity of data.
The data cleansing process usually consists of the following steps:
- Data Standardization
- Data Enrichment
- Data Validation
- Data Correction
Data Standardization ensures that data can be shared across the enterprise. Standardization is generally applied to address data, name data and business data.
Address standardization is critical for an organization. It is the process of reformatting the address line entered by a user so that it conforms to the addressing standards of the U.S. or Canadian postal service.
Name standardization involves the ability to recognize the many different name components in many different formats and patterns and then the ability to extract the corresponding strings and rearrange them into a format suitable for subsequent phases.
Sometimes a person has the same name as his father, and the two are distinguished by an addition to the name, such as “Jr.” or “Sr.” When several living generations share the same name, numerals are used for distinction: “II,” “III,” “IV,” etc. Consolidating name data requires identity resolution, a collection of algorithms that perform standardization, parsing, and pattern matching to identify unique entities.
Data Enrichment is the process by which data is standardized and corrected in order to maximize its integrity and efficiency. Any data-intensive operation should undergo data enrichment as necessary in order to achieve optimal business results. The activities performed as part of data enrichment are: removal of obsolete data, consolidation of multiple data sources, identification and removal of incorrect data, data aggregation, and identification and collation of similar data.
Data Validation requires the creation of rules for different classes of data or for specific datasets. These rules have to cover the requirements for data quality in terms of accuracy, completeness, and consistency. Data validation may be applied to the source system or to the extracted data set to improve the data quality.
For example, the email address validation for the customer can be done at the source system level.
The ultimate goal of measuring the quality of the data is to identify what needs to be fixed. Data Correction comes down to two main activities: cleanup of existing bad data, and correction of the offending system, process, or business practice causing the data problem.
Hopefully this post has given you some basic understanding of data quality management and the first steps needed to implement it successfully. In my next post, we’ll dig in a little deeper to take a look at what metrics organizations should be measuring and managing, and how they impact the business process.