Best Practices in Data Validation

Overview

Data quality is a buzzword in the digital age.

What is data quality and why is it so important?

“Data quality” is a term that often stays in the background but plays an important role in many fields. Data plays a vital role in capturing a market, especially in enterprise data management.

Data Quality Examples

The following examples illustrate the need for data quality.

  • A customer shouldn’t be allowed to enter an age in a field meant for marital status.
  • When a customer fills in a form at a store, there is a high chance that some details are left out or, in a hurry, entered incorrectly, such as a wrong phone number.
  • Billing staff may also wrongly enter the default store address in place of the customer address, which adds bad-quality data that gets persisted in the system.

This data may be crucial, because the customer might not be just a guest customer, and bad data obscures the customer’s real interest in the store.

This blog post covers data quality, its significance, its business impact, best practices to follow, and Mastech InfoTrellis’ specialization in data validation.

Business Impact of Data Quality

Research from Gartner indicates that poor data quality is a primary reason for about 40% of failing business initiatives.

Low-quality data costs American businesses alone around $600 billion, which in turn causes advanced data and technology initiatives to fail.

Significance of Data Quality

Successful large businesses clearly understand the importance of quality data.

The quality of data directly affects:

  • The cost of marketing campaigns and how well the right audience is identified
  • The understanding of customers’ interests
  • The conversion of prospects into sales
  • The turnaround time for converting a prospect into a sale
  • The precision of the business decisions that are made

Quality assurance consultants play an integral part in improving the data and ensuring that the data consumed by upstream and downstream systems is credible.

Data Quality Techniques

Before a system consumes any data, the data needs to be cleansed so that the customer’s data model can be understood. After cleansing, the data needs to be profiled for a deeper understanding of the data model and the patterns in which the data accumulates.

Figure 1: Data Quality Techniques
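
The cleanse-then-profile sequence described above could look roughly like the following Python sketch. It is only an illustration: the sample records, the field names (name, phone, country) and the cleansing rules are assumptions made for this example, not the rules used in any actual engagement.

```python
from collections import Counter

# Hypothetical sample records; the field names are assumptions for illustration.
RECORDS = [
    {"name": "  John Smith ", "phone": "416-555-0100", "country": "ca"},
    {"name": "JANE DOE", "phone": "", "country": "US"},
    {"name": "john smith", "phone": "416-555-0100", "country": "CA"},
    {"name": "Unknown", "phone": None, "country": None},
]


def cleanse(record):
    """Trim whitespace, turn empty strings into None, and normalize case."""
    cleaned = {}
    for field, value in record.items():
        if isinstance(value, str):
            value = " ".join(value.split())  # trim and collapse inner whitespace
            value = value or None            # an empty string counts as missing
        cleaned[field] = value
    if cleaned.get("name"):
        cleaned["name"] = cleaned["name"].title()
    if cleaned.get("country"):
        cleaned["country"] = cleaned["country"].upper()
    return cleaned


def profile(records):
    """Return, per field, the null count, distinct count, and top values."""
    fields = sorted({field for record in records for field in record})
    report = {}
    for field in fields:
        values = [record.get(field) for record in records]
        non_null = [value for value in values if value is not None]
        report[field] = {
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
            "top": Counter(non_null).most_common(2),
        }
    return report


cleaned_records = [cleanse(record) for record in RECORDS]
report = profile(cleaned_records)
for field, stats in report.items():
    print(field, stats)
```

Profiling after cleansing matters here: once case and whitespace are normalized, the two spellings of the same name collapse into one value, so the profile reflects the real distribution rather than formatting noise.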

Business Case

One of our clients had issues providing quality data to the subscribing source systems. The existing implementation did not achieve this goal, so the customer’s business or IT team had to apply fixes in production.

There were several issues with the current implementation that hindered the business from achieving its goal of providing good quality data to subscribing source systems. A large proportion of these issues had to do with adding and updating customer information.

Mastech InfoTrellis Solution Expertise

As data consultants, we followed the data validation cycle, analyzing and identifying the data patterns in the “address” data. Since the customer had reported bad data quality, analyzing the data patterns was the first step.

Sampling Example of address data

  • Invalid address – records containing duplicate addresses
  • Store address provided as customer address
  • Address line one, two, or three null or blank
  • Unknown/TBD values provided in the attributes
  • Country value is null
  • Zip/postal code with an invalid value

These patterns were analyzed and presented to the client for further evaluation. The database was queried and samples were provided to the client. Once the client had evaluated the patterns, the solution was designed.
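
As a rough illustration of how the address patterns listed above might be flagged during profiling, here is a minimal Python sketch. The record layout (id, line1, city, country), the store address, and the placeholder tokens are all hypothetical values chosen for this example. Postal code checks are left out of this sketch.

```python
from collections import Counter

# Hypothetical store address and placeholder tokens; both are assumptions.
STORE_ADDRESS = ("100 main st", "toronto")
PLACEHOLDERS = {"unknown", "tbd", "n/a"}


def norm(value):
    """Lower-case and collapse whitespace; None becomes an empty string."""
    return " ".join(str(value or "").lower().split())


def address_key(record):
    """Normalized (line1, city) pair used to compare addresses."""
    return (norm(record.get("line1")), norm(record.get("city")))


def flag_address(record, key_counts):
    """Return the list of data-quality issues found in one address record."""
    issues = []
    key = address_key(record)
    if not norm(record.get("line1")):
        issues.append("blank_address_line")
    if any(norm(value) in PLACEHOLDERS for value in record.values()):
        issues.append("placeholder_value")
    if not norm(record.get("country")):
        issues.append("null_country")
    if key == STORE_ADDRESS:
        issues.append("store_address_as_customer_address")
    if key[0] and key_counts[key] > 1:
        issues.append("duplicate_address")
    return issues


# Hypothetical sample rows, as they might come back from a database query.
records = [
    {"id": 1, "line1": "100 Main St", "city": "Toronto", "country": "CA"},
    {"id": 2, "line1": "", "city": "TBD", "country": None},
    {"id": 3, "line1": "7 Elm Rd", "city": "Boston", "country": "US"},
    {"id": 4, "line1": "7  elm rd", "city": "boston", "country": "US"},
]

key_counts = Counter(address_key(record) for record in records)
flags = {record["id"]: flag_address(record, key_counts) for record in records}
for record_id, issues in flags.items():
    print(record_id, issues)
```

Counting the flags per issue type gives the kind of summary that can be handed to a client together with the sampled records.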

Sample Business Rule Validation

Postal Code Validation

Disallow the entry of an invalid postal code, or of a US postal code for a Canadian address.
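
A rule like this could be sketched in Python as follows. The function name validate_postal_code and the two regular expressions are assumptions made for this example. The patterns check only the general shape of a code (five digits with an optional four-digit extension for the US, letter-digit-letter digit-letter-digit for Canada). They do not confirm that the code actually exists, and the Canadian pattern is simplified.

```python
import re

# Simplified format checks (assumed for illustration): shape only, not existence.
POSTAL_PATTERNS = {
    "US": re.compile(r"\d{5}(-\d{4})?"),
    "CA": re.compile(r"[A-Z]\d[A-Z] ?\d[A-Z]\d"),
}


def validate_postal_code(country, postal_code):
    """Return (is_valid, reason) for a postal code under a country's format."""
    if not country or country.upper() not in POSTAL_PATTERNS:
        return False, "unsupported or missing country"
    country = country.upper()
    code = (postal_code or "").strip().upper()
    if not code:
        return False, "missing postal code"
    if POSTAL_PATTERNS[country].fullmatch(code):
        return True, "ok"
    # Detect the mix-up case: the code is well formed, but for another country.
    for other, pattern in POSTAL_PATTERNS.items():
        if other != country and pattern.fullmatch(code):
            return False, f"postal code matches {other} format, not {country}"
    return False, "invalid format"


print(validate_postal_code("CA", "M5V 3L9"))
print(validate_postal_code("CA", "90210"))
print(validate_postal_code("US", "9021"))
```

Returning a reason along with the result lets the entry screen tell the user whether the code is malformed or simply belongs to the wrong country.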

Best Practices

Figure 2: Best Practices

Our Specialization in Data Validation

As data management consultants, we need to understand the verticals, or domains, in which we specialize. These domains include:

  • Retail
  • Manufacturing
  • Health care
  • Banking
  • Automobile

The data can be customer, contract, or product data. As data scientists, we have handled data from all these domains and from all geographical regions.

For example, the name John may be common in the USA but not in South Africa. Analyzing data well therefore comes with experience. We guide customers and provide insight into their data patterns.

We have profiled and cleansed data and identified duplicates for various clients whose operations span different geographies.

Conclusion

Solutions are best designed by analyzing the customer’s problems. Data plays a vital role in capturing the market, which the major players have already started eyeing. Data consumers, together with the product owners, should be aware of the patterns in which the data should be segregated and displayed, and the methodologies above give a bird’s-eye view of some of the validation techniques.

About the Author

Narayan is an Associate Architect at Mastech InfoTrellis with around 6.5 years of experience in certifying IBM Master Data Management (Advanced, Collaborative, and Standard editions), the Probabilistic Matching Engine, and ETL solutions.

