Posted by infotrellis on Thursday, Jun 8, 2017 @ 5:41 AM

The digital era has made it easy to transfer information between systems, but flaws in that communication still lead to information loss. Consider product information shared between manufacturers and retailers. Manufacturers often communicate new products, changes to existing products, and price changes to retailers manually and in an ad hoc manner, which leads to data quality and integrity issues in key retail systems. These problems result in revenue loss and dissatisfied consumers.

With these challenges in mind, GDSN (Global Data Synchronization Network) evolved as a key data synchronization mechanism in the product information domain. IBM InfoSphere MDM Collaborative Edition, IBM's offering for Product Information Management, leverages GDSN and provides out-of-the-box capabilities that enable trading partners to share trusted product data globally and automatically. This blog covers GDS and its inner workings, the associated concepts, and the GDS component in IBM MDM Collaborative Edition.

Business Scenario

A global provider of water quality solutions for residential and industrial settings had data governance issues because its product information was scattered across multiple regions. InfoTrellis provided a Global Data Synchronization solution that keeps product information synchronized across regions and retailers. This helped the business recover lost revenue and, in turn, improved customer satisfaction.

GDS Overview

Global data synchronization (GDS) is an ongoing business process that enables continuous exchange of data between trading partners and ensures that they share synchronized information at any point in time. Each organization, whether a supplier or a retailer, needs to join a data pool certified and tested by GS1.

[Figure: data pool]

Associated Concepts:

Trading Partners – A party that is a manufacturer of products, a retailer of products, or both is considered a trading partner.

Subscriptions – A subscription is a message that establishes a request for trade item information for a trading partner who is receiving the data on a continuous basis.

GS1 messages – GS1 is the global organization responsible for the design and implementation of global standards and solutions to improve efficiency and visibility in the supply and demand chains across sectors. The GS1 system of standards is the most widely used supply-chain standards system in the world.

Global Location Number (GLN) – A global location number (GLN) is a unique 13-digit number that is used to identify a trade location. The first 7 digits represent the company prefix. The next 5 digits represent the trade location, and the last digit is the check digit.

Global Trade Item Number (GTIN) – A global trade item number (GTIN) is a unique 14-digit number that is used to identify trade products. The first 13 digits represent the product reference number and the last digit is the check digit.
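The check digit mentioned for both GLN and GTIN follows the standard GS1 mod-10 scheme: working from the rightmost digit before the check digit, digits are weighted 3, 1, 3, 1 and so on, and the check digit brings the weighted sum up to a multiple of 10. The following is a minimal sketch of that calculation; the class and method names are illustrative and are not part of any IBM or GS1 library.

```java
// Sketch of the GS1 mod-10 check-digit calculation used by GLN and GTIN.
// Class and method names are hypothetical, chosen only for this example.
final class Gs1CheckDigit {

    /** Computes the check digit for the digits that precede it (12 for a GLN, 13 for a GTIN-14). */
    static int compute(String payload) {
        int sum = 0;
        // Walk from the rightmost payload digit; weights alternate 3, 1, 3, 1, ...
        for (int i = 0; i < payload.length(); i++) {
            int digit = Character.digit(payload.charAt(payload.length() - 1 - i), 10);
            if (digit < 0) {
                throw new IllegalArgumentException("Non-digit character in: " + payload);
            }
            sum += digit * (i % 2 == 0 ? 3 : 1);
        }
        return (10 - sum % 10) % 10;
    }

    /** Returns true when the last digit of the full number matches the computed check digit. */
    static boolean isValid(String number) {
        if (number == null || number.length() < 2) {
            return false;
        }
        String payload = number.substring(0, number.length() - 1);
        int expected = Character.digit(number.charAt(number.length() - 1), 10);
        return compute(payload) == expected;
    }

    public static void main(String[] args) {
        System.out.println(isValid("04006381333931")); // true
        System.out.println(isValid("04006381333932")); // false
    }
}
```

Because the weighting is anchored at the right-hand end, the same routine works for a 13-digit GLN and a 14-digit GTIN without any length-specific logic.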

GDS Flow – GDS works based on a publish/subscribe model. The supplier is required to publish the product information to a data pool, and the data pool then matches the published data to known subscribers of the data.

[Figure: Global Registry]

  1. Suppliers submit product information to source data pool
  2. The source data pool registers product in the global registry, which helps the GDSN community to locate data sources and manage ongoing synchronization relationships between trading partners
  3. After the product is registered at global registry, the supplier publishes the product in the source data pool and the retailer subscribes to the product by sending the subscription message to the recipient data pool
  4. The recipient data pool uses the subscription details to request product information from the source data pool through the global registry. Based on the retailer’s subscription information, the source data pool synchronizes with the recipient data pool to share the product information
  5. After getting the information from the source data pool, the recipient data pool forwards the product information to the retailer. The retailer then tells the recipient data pool whether the product is approved or rejected
  6. The recipient data pool sends the product confirmation to the source data pool and the source data pool forwards the product confirmation to the supplier
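The matching at the heart of this flow can be pictured with a toy model. The sketch below assumes subscriptions are keyed by GTIN only and that a subscription already exists when the supplier publishes. It collapses the source data pool, recipient data pool and global registry into one in-memory object. All names are hypothetical, and no real GDSN message format or data pool API is represented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the matching step only. All names are hypothetical; no real
// GDSN message format or data pool API is represented here.
final class DataPoolSketch {
    // GTIN -> GLN of the supplier that published the item (stands in for the registry)
    private final Map<String, String> registry = new HashMap<>();
    // GTIN -> GLNs of the retailers that subscribed to the item
    private final Map<String, List<String>> subscriptions = new HashMap<>();

    /** Retailer side: record a subscription for one trade item. */
    void subscribe(String retailerGln, String gtin) {
        subscriptions.computeIfAbsent(gtin, key -> new ArrayList<>()).add(retailerGln);
    }

    /** Supplier side: register and publish an item, returning the retailers to notify. */
    List<String> publish(String supplierGln, String gtin) {
        registry.put(gtin, supplierGln);
        return new ArrayList<>(subscriptions.getOrDefault(gtin, Collections.emptyList()));
    }

    /** Looks up which supplier is the source of an item, or null if it is not registered. */
    String sourceOf(String gtin) {
        return registry.get(gtin);
    }

    public static void main(String[] args) {
        DataPoolSketch pool = new DataPoolSketch();
        pool.subscribe("retailer-A", "04006381333931");
        pool.subscribe("retailer-B", "04006381333931");
        System.out.println(pool.publish("supplier-X", "04006381333931")); // [retailer-A, retailer-B]
    }
}
```

A real deployment also carries the confirmation messages of steps 5 and 6 back to the supplier; those are left out here to keep the matching logic visible.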

IBM MDM CE – GDS Architecture:

IBM MDM CE offers a Global Data Synchronization feature that caters to the needs of both the supply side and the demand side. Organizations that create products and own product data are on the supply side, and consumers of the data are on the demand side. The two sides synchronize with the help of a GDS data pool.

Data Model Support – InfoSphere MDM CE provides a domain-specific data model for Global Data Synchronization that spans various industries. The Supply Side data model adheres to 1WorldSync v7.1 XML and the Demand Side data model adheres to GDSN BMS v3.1 XML. This default data model is provided in English only, because attribute specifications and their valid values are predefined by GDSN, and extended by the data pool, in English.

[Figure: Data model]

[Figure: IBM MDM CE GDSN architecture]

Supply Side Global Data Synchronization – The Supply Side Global Data Synchronization is a process whereby suppliers register product data to a source data pool, which retailers can then subscribe to. Suppliers review and act upon requests from retailers about the provided product data.

Demand Side Global Data Synchronization – Demand Side Global Data Synchronization is a process whereby retailers subscribe to product data as published by suppliers to a source data pool, and synchronize that product data with their product data.

End-to-End communication between IBM MDM CE and GDS (Supply Side)

  1. A product is saved in the IBM MDM CE system once all validation errors have been cleared as part of the business process
  2. A report can be created that persists the product in the GDS catalogs (Global Catalog, Global Local Catalog and Trade Catalog)
  3. The GDS status attribute allows the product's GDS status to be monitored throughout the communication between the PIM and 1WorldSync
  4. Once the product is created in the GDS catalogs, it goes through the compliance process, which validates its attributes against the GDS data model. The Mass Compliance check report is executed to perform this validation
  5. If the product is compliant, a Product Add request is sent to the GDS registry via the Mass Item Add report
  6. The Product Add request is sent in XML format along the path IBM MDM CE -> GDS registry -> IBM Sterling -> IBM WebSphere MQ -> 1WorldSync
  7. If the request complies with the 1WorldSync specifications (1WorldSync v7.1 XML), the product is added and a success message is sent back from 1WorldSync to IBM MDM CE
  8. If the request is non-compliant, the product is not added and an error message is sent back along the path 1WorldSync -> IBM WebSphere MQ -> IBM Sterling -> GDS registry -> IBM MDM CE
  9. Once the product is added, and if it has packaging links attached, the Mass Item Add Link report should be executed to create links between the various packaging types of the product
  10. Once product registration is complete, the product can be published to the registry via the GDS UI. If the product is modified later, the changes can be synchronized to 1WorldSync with the Mass Synchronization of Items report after a successful compliance check, and then re-published to the registry

InfoTrellis Advantage

InfoTrellis has helped many clients around the globe, across verticals, implement Product Information Management solutions along with GDS capabilities. Reach out to us to find out how you can alleviate your data management and governance woes.


About the author:

Visalakshi is a Technical Consultant at InfoTrellis and has good knowledge on IBM MDM CE framework. She has over 3 years of experience across various technologies such as IBM Master Data Management Advanced and Collaborative Editions.

Topics: GDS Global Data Synchronization IBM MDM CE Master Data Management Product Management


Posted by infotrellis on Monday, Aug 8, 2016 @ 7:04 AM

Banking Regulations

Banking Regulations – Overview

Managing regulatory issues and risk has never been so complex. Regulatory expectations continue to rise, with increased emphasis on the institution’s ability to respond to the next potential crisis. Financial institutions continue to face challenges implementing a comprehensive enterprise-wide governance program that meets all current and future regulatory expectations. There has been a phenomenal rise in expectations related to data quality, risk analytics and regulatory reporting.

The following are some of the US regulations for which MDM and customer 360 reports can support compliance:

FATCA (Foreign Account Tax Compliance Act)

FATCA was enacted to target non-compliance by U.S. taxpayers using foreign accounts. The objective of FATCA is the reporting of foreign financial assets. The ability to align all key stakeholders, including operations, technology, risk, legal, and tax, is critical to successfully comply with FATCA.

OFAC (Office of Foreign Assets Control)

The Office of Foreign Assets Control (OFAC) administers a series of laws that impose economic sanctions against hostile targets to further U.S. foreign policy and national security objectives. The bank regulatory agencies should cooperate in ensuring financial institutions comply with the Regulations.

FACTA (Fair and Accurate Credit Transactions Act)

Its primary purpose is to reduce the risk of identity theft by regulating how consumer account information (such as Social Security numbers) is handled.

HMDA (Home Mortgage Disclosure Act)

This Act requires financial institutions to provide mortgage data to the public. HMDA data is used to identify probable housing discrimination in various ways.

Dodd Frank Regulations

The primary goal of the Dodd-Frank Wall Street Reform and Consumer Protection Act was to increase financial stability. This law imposes major regulations on the financial industry.

Basel III

A wide sweeping international set of regulations that many US banks must adhere to is Basel III. Basel III is a comprehensive set of reform measures, developed by the Basel Committee on Banking Supervision, to strengthen the regulation, supervision and risk management of the banking sector.

What do banks need to meet regulatory requirements?

To meet the regulatory requirements described in the previous section, Banks need an integrated systems environment that addresses requirements such as Enterprise-wide data access, single source of truth for customer details, customer identification programs, data auditability & traceability, customer data synchronization across multiple heterogeneous operational systems, ongoing data governance, risk and compliance reports.

How can MDM help?


Enterprise view of customer data

MDM solutions provide an enterprise view of all customer data to ensure that a customer is in compliance with government-imposed regulations (e.g. FATCA, Basel II/III, Dodd Frank, HMDA, OFAC, AML etc.) and facilitate data linking for easy access.

Compliance Users

Users who satisfy the compliance criteria will be able to retrieve customer information such as name, address, contact method and demographics from the MDM solution. They will be able to ensure customer compliance while creating reports, performing reviews and monitoring the customer against watch lists.

Compliance Applications

FATCA supporting applications, Dodd Frank reporting applications, HMDA compliance reporting applications, Basel II & III compliance applications receive a data extract from the MDM solution containing detailed customer information such as name, addresses, contact methods, identifiers, demographics and customer to account relationships that enhance compliance reporting and customer analytics.

Compliance users can ensure compliance with all FATCA laws, create reports, link customer information to create HMDA reports, and provide a complete financial profile of all commercial customers to ensure compliance with Basel II & III regulations.

Regulatory Risk Users

Regulatory risk users will be able to use customer data from MDM solution, create reports on an ad hoc basis, and perform annual reviews to ensure customer is compliant with risk regulations. These users will also be able to check if customers are on existing watch lists through pre-configured alerts and update the MDM solution as required during annual reviews.

Regulatory Risk Applications

The MDM solution supplies detailed customer information such as name, addresses, identifiers, demographics, and customer-to-account relationships to applications supporting AML, OFAC data, KYC and fraud analysis, so that they can determine compliance with regulations such as AML and OFAC standards, determine whether the proper KYC data has been captured for all customers, and monitor fraudulent activity by any customer.

The MDM solution receives a close account transaction from the AML applications if the regulatory risk user determines that the customer relationship must be exited for AML non-compliance. OFAC applications update the customer’s watch list status within the MDM solution and send add/update/delete customer alert transactions to monitor customers on OFAC watch lists.


MDM solutions, when implemented properly, can provide critical information to banks that have to comply with a number of regulations across many countries. At InfoTrellis, we have helped many organizations achieve these goals through IBM MDM implementations. You can contact us with further queries.

 About the Author

Greg Pierce

Greg is a Senior MDM Business Architect at InfoTrellis. He has helped many clients across banking, insurance and retail realize value from their MDM investments.

Topics: Banking RegulationsX data lake for banking Data Management Consulting Services IBM MDM Master Data Management Master data management Banking Regulations master data management tools


Posted by Purnima Borate on Tuesday, May 24, 2016 @ 2:40 PM

Data Governance is the process of understanding, managing and making the critical data available with the goal to maximize its value and to ensure compliance.

InfoTrellis’ Data Governance Methodology follows a multi-phased iterative approach with 4 stages – Initiate, Define, Deploy and Optimize. This article is the second part of the Data Governance Methodology series by InfoTrellis. The first part of this series – Initiate your Data Governance – listed the essential foundations of a successful Data Governance program.

The ‘Define’ stage primarily deals with defining effective Policies to address Data Governance issues. This article lists the important considerations of this stage.

Understand your Data Governance problem


A detailed investigation into the root cause of a problem is essential to identifying and solving Data Governance issues. For instance, a revenue discrepancy in a financial report may look like a calculation error at first glance. Deeper analysis may reveal that different users interpret the same business term, revenue, differently, and therefore apply different logic to arrive at the monthly figure.

Once we know the root cause of the problem, it is important to categorize it. In our experience, categorizing a business problem by Data Domain, Business Process and Data Governance area provides a high-level guide to the nature and scope of the Data Governance problem. For example, the revenue discrepancy problem mentioned above falls into the Finance data domain, the Accounting business process and the Metadata Management governance area. This helps to focus on the problem from the correct perspective.

Assemble the team to define Policies


Data Governance is a wide domain and requires varied skillsets. For instance, Metadata Management skills are different from Data Retention skills. Categorizing the business problem as described above also helps in identifying the skillset required to resolve the issue. In our experience, a dynamic team composition based on the nature of the business problem works best. Typical members of this team are the Data Owners and Architects of the pertinent IT/Business systems, and the Business Data Stewards and Technical Data Stewards who understand the business domain and the mapped Data Governance area.

Define the Policies, Standards and Processes

Policy is generally a high level statement that describes how you would tackle issues or plan actions for the Data Governance area. For the revenue discrepancy problem, you can frame a policy that states – We must define all Business terms in Metadata Repository that can be accessed by all users of Business terms. Metadata Repository must map technical metadata, business rules and data lineage.

A Policy can be broken down into one or more Standards. For the policy mentioned above, you can have following Standards –
1. Business Glossary must be developed to maintain definition of all business terms.
2. Sensitive and Private data must be marked or categorized appropriately in Glossary.
3. Technical Metadata of data attributes in databases must be mapped to Business terms in Glossary.

A Standard could be broken down into one or more Processes. Typically Processes are implemented using an IT tool or program by IT implementation team. For the Standard 1 mentioned above, you can have the following processes –

1. For existing Business terms, import from Excel files into Glossary; for duplicate terms, resolve conflict and retain one instance of each unique term
2. For new Business terms, create the term and its definition in Glossary
3. Create Collections to group associated Business terms.

Select the tool – Some Policies need an IT tool for implementation. The Enterprise Architect assigned to the Data Governance program can suggest the tool to be used based on enterprise IT standards, existing tools in the enterprise, future usage of the tool, the Data Governance maturity of the tool and the skillset of the team. It is a best practice to have the Data Governance framework and tools selected and approved by the IT office at the enterprise level, and to make each one the standard tool for addressing a specific domain of Data Governance solutions. This ensures that uniform tools are used across the enterprise to address a common set of problems.

In conclusion, there are many variations to how teams would be setup and Policies would be defined. Keeping the above points in mind would help the enterprise to formulate the team with skillsets required to define effective Policies.

Stay tuned for Part 3 of this 4 part series on Data Governance from InfoTrellis. In the meanwhile, please send us a note with your queries and feedback.

Topics: data governance Master Data Management


Posted by sathishbaskaran on Tuesday, May 12, 2015 @ 9:43 AM

MDM BatchProcessor is a multi-threaded J2SE client application used in most of the MDM implementations to load large volumes of enterprise data into MDM during initial and delta loads. Oftentimes, processing large volumes of data might cause performance issues during the Batch Processing stage thus bringing down the TPS (Transactions per Second).

Poor performance of the batch processor often disrupts the data load process and impacts the go-live plans. Unfortunately, there is no panacea available for this common problem. Let us help you by highlighting some of the potential root causes that influence the BatchProcessor performance. We will be suggesting remedies for each of these bottlenecks in the later part of this blog.

Infrastructure Concerns

Any complex, business-critical Enterprise application needs careful planning, well ahead of time, to achieve optimal performance and MDM is no exception. During development phase it is perfectly fine to host MDM, DB Server and BatchProcessor all in one physical server. But the world doesn’t stop at development. The sheer volume of data MDM will handle in production needs execution of a carefully thought-out infrastructure plan. Besides, when these applications are running in shared environments Profiling, Benchmarking and Debugging become a tedious affair.

CPU Consumption

BatchProcessor can consume a lot of precious CPU cycles on the most trivial of operations when it is not configured properly. Keeping an eye out for persistently high CPU consumption and sporadic surges is vital to ensure that BatchProcessor uses the CPU optimally.


Deadlock

Deadlock is one of the issues frequently encountered during Batch Processing in multi-threaded mode. Increasing the submitter thread count beyond the recommended value might lead to deadlocks.

Stale Threads

As discussed earlier, a poorly configured BatchProcessor might open up Pandora’s Box. Stale threads can be a side-effect of thread count configuration in BatchProcessor. Increasing the submitter threads, reader and writer threads beyond the recommended numbers may cause some of the threads to wait indefinitely thus wasting precious system resources.

100% CPU Utilization

“Cancel Thread” is one of the BatchProcessor daemon threads, designed to gracefully shutdown BatchProcessor when the user intends to. Being a daemon thread, this thread is alive during the natural lifecycle of the BatchProcessor. But the catch here is it hogs up to nearly 90% of CPU cycles for a trivial operation thus bringing down the performance.

Let us have a quick look at the UserCancel thread in the BatchProcessor client. The thread waits indefinitely for user interruption, checking for it once every 2 seconds while holding on to the CPU all the time.

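Thread thread = new Thread(r, "Cancel");
// ... (the statements that configure and start the thread are missing from this listing)

while (!controller.isShuttingDown()) {
    try {
        // Assumed: the right-hand side is missing from this listing; a read of standard input fits the IOException handler below
        int i = System.in.read();
        if (i == -1) {
            try {
                // Assumed from the description above: pause 2 seconds between checks
                Thread.sleep(2000L);
            }
            catch (InterruptedException e) {}
        }
        else {
            char ch = (char)i;
            if ((ch == 'q') || (ch == 'Q')) {
                // ... (the shutdown request is missing from this listing)
            }
        }
    }
    catch (IOException iox) {}
}
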
BatchProcessor Performance Optimization Tips

So far we have discussed the potential bottlenecks that keep BatchProcessor from running at optimal levels. The best-laid plans often go awry, but what is worse is not having a plan at all. A well-thought-out plan needs to be in place before going ahead with the data load. Now, let us discuss some useful tips that can help improve performance during the data load process.

Infrastructure topology

For better performance, run the MDM application, DB Server and BatchProcessor client on different physical servers. This will help us to leverage the system resources better.

Follow the best thread count principle

If N physical CPUs are available to the IBM InfoSphere MDM Server that serves BatchProcessor, then the number of submitter threads in BatchProcessor should be configured between 2N and 3N.

For example, if the MDM server has 8 CPUs, start profiling BatchProcessor by varying its submitter thread count between 16 and 24. Do the number crunching, keep an eye on resource consumption (CPU, memory and disk I/O) and settle on the thread count that yields optimal TPS in MDM.
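The 2N to 3N rule is simple arithmetic, but writing it down makes the profiling range explicit. The helper below is only an illustration; the class and method names are made up for this example, and the CPU count must be that of the MDM server, not of the machine running the helper.

```java
// Illustrative helper for the 2N-3N submitter thread rule; names are hypothetical.
final class SubmitterThreadRange {

    /** Returns {lowest, highest} submitter thread counts to profile for a server with the given CPU count. */
    static int[] recommended(int mdmServerCpus) {
        if (mdmServerCpus < 1) {
            throw new IllegalArgumentException("CPU count must be positive: " + mdmServerCpus);
        }
        return new int[] {2 * mdmServerCpus, 3 * mdmServerCpus};
    }

    public static void main(String[] args) {
        int[] range = recommended(8);
        System.out.println(range[0] + " to " + range[1]); // 16 to 24
    }
}
```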


You can change the Submitter thread count by modifying the Submitter.number property.

For example:

Submitter.number = 4

Running Multiple BatchProcessor application instances

If MDM server is beefed up with enough resources to handle huge number of parallel transactions, we should consider parallelizing the load process by dividing the data into multiple chunks. This involves running two or more BatchProcessor client instances in parallel, either in same or different physical servers depending on the resources available in that server. Each BatchProcessor application instance here must work with a separate batch input and output; however they can share the same server-side application instance or operate against a dedicated instance(each BatchProcessor instance pointing to a different Application Server in the MDM cluster). This exercise will increase the TPS and lower the time spent in data load.
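Dividing the data into chunks, one per BatchProcessor instance, can be sketched as follows. This is an assumption-laden illustration: it treats the input as an in-memory list of records and uses hypothetical names, and it says nothing about how the actual batch input files are formatted or written.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits records into contiguous chunks, one per BatchProcessor instance.
final class InputSplitter {

    /** Splits records into the given number of chunks whose sizes differ by at most one. */
    static <T> List<List<T>> split(List<T> records, int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be positive: " + parts);
        }
        List<List<T>> chunks = new ArrayList<>();
        int base = records.size() / parts;   // minimum size of every chunk
        int extra = records.size() % parts;  // the first 'extra' chunks get one more record
        int start = 0;
        for (int i = 0; i < parts; i++) {
            int size = base + (i < extra ? 1 : 0);
            chunks.add(new ArrayList<>(records.subList(start, start + size)));
            start += size;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 1; i <= 7; i++) {
            records.add("record-" + i);
        }
        for (List<String> chunk : split(records, 3)) {
            System.out.println(chunk.size() + " records: " + chunk);
        }
    }
}
```

Each chunk would then be written to its own batch input file, so that every instance works with a separate input and output as required above.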

Customizing the Batch Controller

Well, this one is a bit tricky. We are looking at modifying the OOTB behavior here. Let us go ahead and do it as it really helps.

  • Comment out the snippet in the runBatch() method of the BatchController class that creates and starts the “Cancel” thread described earlier
  • Recompile the modified BatchController class and package it back into the jar
  • Replace the existing DWLBatchFramework.jar, present under <BatchProcessor Home>/lib, with the new one that contains the modified BatchController class
  • Bounce the BatchProcessor instance and check the CPU consumption

Manage Heap memory

Memory consumption is usually not a serious threat when dealing with BatchProcessor, but on servers that host multiple applications alongside it, the memory that can effectively be allocated to BatchProcessor may be very low. If high memory consumption is observed during the data load process, allocating more memory to BatchProcessor helps to ensure a smooth run. The BatchProcessor invoking script (named runbatch.bat in Windows environments, with an equivalent script in UNIX environments) has a couple of properties that control the memory allocated to the BatchProcessor client.

set minMemory=256M

set maxMemory=512M

It is recommended to keep the minMemory and maxMemory at 256M & 512M respectively. If the infrastructure is of high-end, then minMemory and maxMemory can be increased accordingly. Again, remember to profile the data load process and settle for optimal numbers.

Reader and Writer Thread Count

IBM recommends keeping the Reader and Writer thread counts at 1. Since these threads perform lightweight tasks, this configuration should suit most needs.

Shuffle the data in the Input File

By shuffling the data in the input file,  the percentage of similar records (records with high probability of getting collapsed/merged in MDM) being processed at the same time can be brought down thus avoiding long waits and deadlocks.
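A minimal sketch of such a shuffle is shown below. It assumes one record per line and that record order carries no meaning for the load; if a record spans several lines, the lines would need to be grouped before shuffling. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: returns a shuffled copy of the input records (one record per line assumed).
final class InputShuffler {

    /** Shuffles a copy of the records; a fixed seed makes the order reproducible between runs. */
    static List<String> shuffled(List<String> records, long seed) {
        List<String> copy = new ArrayList<>(records);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("smith-1", "smith-2", "smith-3", "jones-1", "patel-1");
        System.out.println(shuffled(records, 42L));
    }
}
```

A random shuffle only lowers the chance that similar records are processed at the same moment; it does not guarantee that they are kept apart.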

Scale on the Server side

Well, well, well. We have really strived hard to make BatchProcessor client to perform at optimal levels. Still, poor performance is observed resulting in very low TPS? It is time to look into the MDM application. Though optimizing MDM is beyond the scope of this blog let us provide a high-level action plan to work on.

You can do one or more of the following:

  1. Increase the physical resources (more CPUs, more RAM) for the given server instance
  2. Host MDM in a clustered environment
  3. Allocate more application server instances to the existing cluster that hosts MDM
  4. Use a dedicated cluster with enough resources for MDM rather than sharing the cluster with other applications
  5. Log only critical and fatal errors in MDM
  6. Enable SAM and performance logs in MDM and tweak the application based on the findings

Hope you find this blog useful. Try out these tips when you are working on a BatchProcessor data load process next time and share how useful you find them. I bet you’ll have something to say!

If you are looking for specific recommendations on BatchProcessor, feel free to contact us. We are always happy to assist you.

Topics: InfoTrellis Master Data Management MasterDataManagement mdm mdm hub MDM Implementation
Posted by manasa1991 on Monday, May 11, 2015 @ 5:36 PM

Calvin: “You can’t just turn on creativity like a faucet. You have to be in the right mood.”
Hobbes: “What mood is that?”
Calvin: “Last-minute panic.”

Okay, apologies for an unscheduled delay on the follow up post. Let’s get back to discussing how we manage our MDM Projects.

In my previous post, we talked about the first two stages of “InfoTrellis SMART MDM Methodology”, namely “Discovery and Assessment” and “Scope and Approach”. In these two stages, we spoke about activities around understanding business expectations, helping clients formulate their MDM strategy, help them identify scope of an MDM implementation along with defining right use cases and the optimal solution approach. I also mentioned that we generally follow a “non-iterative” approach to these stages as this helps us build a solid foundation before we can go on to the actual implementation.


Once scope of an MDM project is defined and client agrees to the solution approach, we enter the iterative phases of the project. We group them into two stages in our methodology:

  1. Analysis and Design
  2. Development and QA

Through these stages, we perform detailed requirements analysis, technical design, development and functional testing across several iterations.

Requirements Analysis:

At this stage of the project, high level business requirements are already available and we must start analyzing and prioritizing which requirements need to go into which phase. For Iteration I, we typically take up all foundation aspects of MDM such as the data model changes, initial Maintain services, ETL initial load and related activities. An MDM product consultant will interpret the business requirements, and work with the technical implementation leads to come up with:

  1. Custom Data Model with additions and extensions, as per project requirements
  2. Detailed data mapping document that captures source-to-MDM mapping for services as well as the initial load (one-time migration). Data mapping is tricky: data is brought into MDM through different channels, and each channel has to be identified and mapped separately. Doing this right helps us avoid surprises at a later stage
  3. Functional Requirements for each of the features – Services, Duplicate processing and so on

Apart from the requirements analysis, work on the “Requirements Traceability Matrix” should start at this stage. This is one document that captures system traceability of requirements to test cases and will come in handy throughout the implementation.


Design:

Functional requirements are translated into detailed technical design for both MDM and ETL. Significant design decisions are listed, the object model and business objects are designed, and detailed design sequence diagrams are created. Similar sets of design artifacts are created for ETL components as well. The key items that are worked on during the design phase are:

  • Significant use cases – From a technical perspective, functional use cases are interpreted so the developer has a better grip on use cases and how they are connected together to form the overall solution
  • Detailed design elements – Elaboration on each technical component so development team has to just interpret what is designed as MDM code or ETL components
  • Unit Test cases – The technical lead plans unit test cases so 360 degree coverage is ensured during unit testing, and most of the simple unit level bugs are identified

Within the sphere of tools that we use, if unit test automation is possible we do that as well.


Development:

MDM and ETL development happen in most of our projects. Apart from IBM’s MDM suite, we also work on a spectrum of ETL tools such as IBM DataStage, Informatica Power Center, SAP PI, IBM CastIron, Talend ETL, and Microsoft SSIS. Some aspects that we emphasize across all our projects are:

  • Coding standards – MDM and ETL teams have respective coding standards which are periodically reviewed as per changes in different product releases, and technological changes in general. The developers are trained to follow these standards when they write code
  • Continuous Integration – Most of our clients have svn repositories and our development teams actively use these repositories so the code remains an integral unit. We also have local repositories that can be used when the client does not have a repository of their own and explicitly allows us to host their code in our network
  • Peer code review – Every module is reviewed by a peer who acts as another pair of eyes to bring in a different perspective
  • Lead code review – Apart from peer review, code is also reviewed by the tech lead to ensure development is consistent and error free
  • Unit Testing – Thorough unit testing is driven off the test cases written by development leads during the design phase. Wherever possible, we also automate unit test cases to reduce redundant effort and improve efficiency

With these checks and balances the developed code moves into testing phase.


The QA lead comes up with a comprehensive test strategy covering functional, system, performance and user acceptance testing. The types of testing that we participate in differ from project to project, based on client requirements. We typically take up functional testing within the iterative implementation phase. The rest are done once all functional components are developed and tested thoroughly.

Functional testing is driven off functional requirements. Our QA lead reviews the design as well to understand significant design decisions that help in creating optimal test scenarios. Once requirements and design documents are reviewed, detailed test scenarios and test cases are created and are reviewed by the Business Analyst to ensure sufficient coverage. A mix of manual and automated testing is performed based on allowed scope in the project. The functional testing process involves the following:

  • Test Definition – Scenarios / cases created, test environments identified, defect management and tracking methodology established, test data prepared or planned for
  • Test execution – Every build is subject to a build acceptance test, and upon being successful, the build is tested in detail for functionality
  • Regression runs – Once we enter defect fixing mode, multiple rounds of (mostly automated) regression tests are run to ensure that test acceptance criteria are met
  • Test Acceptance – Our commitment is to provide a thoroughly tested product at the end of each iteration. For every release, we ensure all severity 1 and severity 2 defects are fixed, and low severity defects if deferred are documented and accounted for in subsequent releases.


In the deployment stage, we group the following activities together:

  1. System, UAT, Performance testing – All aspects of testing that see the implementation as a single functional unit are performed
  2. MDM code deployment – MDM code will be deployed in the production environment, and delta transactions (real time, or near real time) will be started
  3. One time migration or Initial Load to MDM – From various source systems, data will be extracted, transformed and loaded into MDM as a one-time exercise.

Deployment is very critical as it is a culmination of all work done until that point in the project. This is also the point at which the MDM system will get exposed to all other external systems in the client organization. If MDM is part of a broader revamp or a bigger program, there will be many other projects that will need to go live or get deployed at the same time. To ensure deployment is successful, the following key points are to be considered:

  • Identify all interconnecting points and come up with a system plan that covers MDM and all integrating systems
  • If applicable, participate actively at program level activities as well to ensure the entire program accounts for all the items that have been built as part of the MDM project
  • Initial load happens across many days, mostly in 24-hour cycles. Come up with a clear plan, team, roles and responsibilities and, if possible, perform a trial / mock run of the initial load

There is typically a post deployment support period and in this period we monitor the MDM hub to ensure master data is created as planned. If needed, optimizations and adjustments are made to ensure that the MDM hub performs as desired.

Once deployment is successfully completed, don’t forget to celebrate with the project team!!!

Topics: Master Data Management MasterDataManagement MDM Implementation


Posted by Jan D. Svensson on Monday, Nov 10, 2014 @ 12:47 PM

I often become involved in an organization’s MDM program when they’ve reached out to InfoTrellis for help with cleaning up after a failed project or initiating attempt number X at achieving what, to some, is a real struggle. There can be a lot of reasons for a Master Data Management implementation failing, and none of them are due to the litany of blame game reasons that can be used in these scenarios.  Most failures arise from common problems that people just were not prepared for.

Let’s examine some of the top reasons MDM implementations fail. In the end they probably won’t surprise you, but if you haven’t experienced it yet you will be better prepared to face them if they happen.

Underestimating the work

I am starting with this one because it leads to many of the others, and is a complex topic. It seems like a simple thing to estimate the work but there are a lot of aspects to an MDM project that aren’t obvious that can severely impact timelines and your success.

“It’s just a project like any other”

Let me start by saying MDM is not a project, it’s a journey, or at the very least a program.

Most organizations thinking about implementing MDM are large to global companies. Even medium sized companies that started small and experienced growth over time have the same problems as their global sized peers. While the size of the chaos in a global company may seem much larger, they also have far more resources to throw at the problem than their smaller brethren.

If we stick to the MDM party domain as a point of reference (most organizations start here with MDM), the number of sources or points of contact with party information can be staggering. You may have systems that:

  • Manage the selling of products or services to customers
  • Manage vendors you deal with or contract to
  • Extract data to a data warehouse for customer analytics and vendor performance
  • Manage employees (HR) who may also be customers
  • Provide self-service customer portals
  • Manage marketing campaigns
  • Send customer notifications
  • Many others

A lot of large organizations will have all of these systems, each having multiple applications, and often multiple systems responsible for the same business function. So by now you are probably saying, yes I know this, and…?  Well your MDM “project” will need to sit in the middle of all of this, and in many cases since many of these systems will be legacy mainframe based systems, you will need to be transparent as these systems won’t be allowed to be changed.

MDM can be on the scale of many of the transformation programmes your organization may be undertaking to replace aging legacy systems and move to modern distributed solutions based on Service Oriented Architecture.

Big Bang Never Works

Now that we have seen the potential size of your MDM problem, let me just remind you that you can’t do it all at once. Sure you can plan your massive transformation programme and execute it – but if you have ever really done one of these, you know it’s a lot harder than it seems and that the outcome is usually not as satisfying as you expected it to be.  You end up cutting corners, blowing the budget, missing the timelines, and de-scoping the work just trying to deliver.

What is one of the typical reasons this happens on your MDM transformation project?

You Don’t Know What You Don’t Know

You have all these systems you are going to integrate with and in many cases you are going to need to be transparent in that those systems may not know they are going to be interacting with your new MDM solution. You are going to need to know things like:

  • What data do they use?
    • How often?
    • How much?
    • When?
  • Do they update the data?
    • How often?
    • How?
    • What?
  • Do they need to know about changes made by others?
    • How often is the change notice required?
    • Do they need to know it’s changed, or what the change was?

This type of information seems pretty straightforward. I haven’t told you anything you probably didn’t know, but when you go to ask these questions, the answer you will most likely get is:

“I don’t know.”

Ok, so the documentation isn’t quite up to date, (I am being kind), but you are just going to go out and find the answer. Which leads to the next problem.

Not Enough Resources

So this is an easy problem to solve. I’ll hire some more business analysts, get some more developers to look at the code, get some more project managers to keep them on track.  Seems like a plan, and on the surface it looks like the obvious answer, (ignoring how hard it is to locate available quality IT people these days), but these aren’t the resources that are the problem.

You don’t have enough SMEs.

The BAs, developers and others are all going to need time from your subject matter experts. The subject matter experts are already busy because they are subject matter experts. There typically aren’t enough of them to go around, and if you have a lot of systems to deal with, you are facing a lot of IT and business SMEs.

What your SMEs bring to the table is intellectual property. Intellectual property is critical to the success of your implementation. You will need the knowledge your SMEs bring on your various systems, but there is another kind of intellectual property that you are going to need, and obtaining it can be tied to a very lengthy process.

Data Management through Governance

In order to be able to master your information, you will need to amalgamate data from multiple sources and both the meaning and the use of that information will need to be clearly defined. What may appear to be the same information from one source may have a different meaning.  Data governance is a key requirement to be able to establish the enterprise data definitions that are crucial for your master data.  Even in mature environments this can be a challenging task and can consume significant time and resources.

Data governance may seem like a problematic and time consuming exercise but it is an effective tool to use against one of the other major hurdles you will face in trying to establish a common set of master data.

That’s My Data

Many organizations are organized into silos. The silos are designed to look after their own interests, are funded to maintain their business goals, and compete for resources and funding. While the end goal of any organization is the success of the organization, the silo measures its success in terms of itself.

An MDM implementation is by nature at odds with the silo based organization as master data is data that is of value to a cross section of the business and thus spans silos. The danger in many organizations is that a particular silo has significantly more influence than another, with that influence often lying with the revenue generating lines of business. This imbalance of power can easily lead to undue influence on your master data implementation, making it just another project for division X, instead of an enterprise resource to be shared by all.

Data governance is one of the key factors to help keep this situation in check. Your data governance board will be composed of representatives from all stakeholders, giving equal representation to all. The cross organizational nature of data governance is also the reason that decisions can be a difficult and lengthy process as it requires consensus across all the silos.

Aside from enterprise data definitions, another important aspect of master data management is the establishment of business rules.

Too Many Rules

The business and data governance will need to be involved to establish business rules for:

  • ETL processes that load data into your MDM application
  • Updates to information from multiple sources
  • Matching rules
  • Survivorship rules

The establishment of rules is designed to address one of the big problems MDM is meant to solve: data quality. Organizations will want to manage both data quality on load and ongoing data quality.  One of the big mistakes often made is to try and introduce too many rules right away.

The use of too many rules early on can have a significant impact on the initial data loads into your MDM solution. You are ready for production and most likely getting your first crack at live data, only to find out that vast numbers of records are being rejected due to your business rules. Your data loads have now failed and you need to go back and rethink your rules, revise your ETL process and try again.
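A dry run of the rules against a sample extract can show the rejection rate per rule before the real load. The sketch below is a minimal illustration; the rule names, field names, and sample records are all hypothetical and not taken from any particular MDM product.

```python
from collections import Counter

# Hypothetical business rules: each returns True when the record passes.
RULES = {
    "last_name_required": lambda r: bool(r.get("last_name", "").strip()),
    "birth_date_required": lambda r: bool(r.get("birth_date")),
    "phone_has_10_digits": lambda r: sum(c.isdigit() for c in r.get("phone", "")) == 10,
}

def dry_run(records, rules):
    """Count how many records each rule would reject, plus overall rejects."""
    failures = Counter()
    rejected = 0
    for record in records:
        failed = [name for name, check in rules.items() if not check(record)]
        failures.update(failed)
        if failed:
            rejected += 1
    return failures, rejected

# Hypothetical sample extract.
sample = [
    {"last_name": "Smith", "birth_date": "1980-01-31", "phone": "416-555-0100"},
    {"last_name": "Jones", "birth_date": None, "phone": "5550100"},
    {"last_name": "", "birth_date": None, "phone": "(905) 555-0199"},
]

failures, rejected = dry_run(sample, RULES)
print(f"{rejected} of {len(sample)} records would be rejected")
for rule, count in failures.most_common():
    print(f"  {rule}: {count}")
```

A rule that rejects a large share of the sample is a candidate for relaxing, deferring, or handling through cleansing in the ETL process rather than rejection.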

You finally get your data loaded and your consumers have arrived to start to use the data and your legacy transactions are failing. Why are they failing? Because the application isn’t validating the input according to your business rules, or collecting enough information to satisfy the rules.

Of course there is one way you could reduce this risk, but it often isn’t done well enough and sometimes isn’t done at all.

What Profiling?

Data profiling is the one task that is critical to understanding what your data looks like and what you need to plan for. There are often many barriers to profiling because your party master data will likely contain personally identifiable information (PII) and access will be restricted for security reasons.  You have to overcome these barriers because data profiling is the only way to foresee the gotchas that are going to put you far off track down the road.

Data profiling can be a significant task as each source system needs to be profiled. As you learn more about your data you will have more questions that need to get answered.  All this profiling takes time and most likely needs the time of specific resources as they are the only ones that have access to the information you require.  (There’s that resource problem again.)

Project Management is my Problem?

So far you haven’t heard any magical reasons as to why your MDM implementation should fail. In fact, many of the problems seem to be tied to the typical reasons any IT project can fail:

  • Underestimating the work
  • Not enough resources
  • Trying to do too much at once (including scope creep)
  • Time required for discovery

An aspect of an MDM implementation that may be a little atypical is the need for data governance. Data governance not only gives you the enterprise view of the information you are trying to master, but can also be an effective way of dealing with competing agendas between silos.

Data governance is also one of the key factors for the ongoing success of your implementation. Since MDM is a journey not a project, longevity is a characteristic of a successful implementation. Once you have delivered your foundation, the succeeding phases will build upon the base and provide more coverage of your master data. To ensure the ongoing success of your implementation you will need the support of data governance, to ensure that new systems and upgrades to existing systems use the master data and don’t just create islands of their own.

In the past we tried to achieve what master data management promises today, but with a lack of controls and governance, we ended up with the data sprawl we are trying to correct with MDM. Once the project is over, the role of master data management does not end.  It is important to recognize that you must establish the processes and rules to not only create the master data store, but also to maintain it and integrate it into your systems.  Master data management is not about the installation and configuration of a shiny new software product.  The product is an enabler making the job easier.  The establishment of rules, governance processes and enforcement are what will bring you success.

One final thing that every master data management implementation requires, and you are pretty much doomed to failure without, is strong executive sponsorship. Your MDM implementation is going to take years.  You will require consistent funding and support to be able to take the journey and only an executive can bring that level of support.  Organizations that are organized into silos often don’t play well together, and while data governance can help in this situation, the time may come when a little intervention is required to ensure things keep moving in the proper direction on the expected timelines.

Your executive is a key resource in and out of the board room. In the board room you will need a champion who has the vision of what your MDM implementation is going to bring to the organization, and who keeps the journey progressing over time. Out of the board room you will be faced with competing agendas, data hoarding, shifting priorities, and silos trying to work together. The executive influence here can be used to make sure that everyone continues to work towards the common goal, and to provide the resources required to achieve the goals in a reasonable timeline.

Topics: master data governance Master Data Management mdm mdm hub


Posted by ochughtai on Friday, Mar 28, 2014 @ 4:57 PM

Product Information Management (PIM) is a vast subject area. Not only does each industry vertical have its own way of defining a product, within a particular vertical each company may choose to represent their product information differently to satisfy their particular business needs. Having said that, there are indeed standard practices and processes that make PIM expertise portable across various projects.

The first step to understanding PIM is to get an appreciation of the diversity around the definition of product. The example below, from the retail industry, provides some insight into the complex world of PIM. In my subsequent blogs, I will address various nuances of product modeling.

What is a Product?
We all love shopping. But have you ever wondered what it takes for your favorite product to reach a store shelf, or what that product means for the retailer? This blog is going to take you through the journey of a diaper bag from supplier to store shelf – let’s see if we can figure out what a product is.

So let’s start backwards.

You see the diaper bag on the store shelf – how did it get there? Well, each store has an inventory area at its back and you must have seen store workers periodically replenishing the shelves with products. That’s one way the shelves are filled. The other way is that vendors themselves come over and stock the shelves. This is referred to as direct store delivery, or DSD; one example might be 2-liter Coca Cola bottles. DSD is, however, a discussion for another day. By the way, the product that you see on the store shelf is also referred to as the sellable unit.

So how does that diaper get to the store’s inventory area? Each store has an inventory management and an ordering system; any time the inventory falls below a certain limit, an order is placed with the relevant distribution center for supply replenishment. The ordering lead times are already known, so the ordering takes place in such a way that the store seldom runs out of diapers. Ordering may be manual or automated based on the inventory levels.
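The replenishment trigger can be sketched as a simple reorder-point check. The formula below (expected demand during the lead time plus a safety stock) is a common textbook simplification, not necessarily what any given store system uses, and all the numbers are made up.

```python
def reorder_point(daily_demand, lead_time_days, safety_stock):
    """Inventory level at or below which a replenishment order is placed."""
    return daily_demand * lead_time_days + safety_stock

def should_reorder(on_hand, on_order, daily_demand, lead_time_days, safety_stock):
    """Compare the inventory position (on hand plus on order) with the reorder point."""
    position = on_hand + on_order
    return position <= reorder_point(daily_demand, lead_time_days, safety_stock)

# Hypothetical figures for one store and one product.
print(reorder_point(daily_demand=12, lead_time_days=3, safety_stock=10))  # 46
print(should_reorder(on_hand=40, on_order=0, daily_demand=12,
                     lead_time_days=3, safety_stock=10))
```

Counting stock that is already on order keeps the store from placing a second order while the first one is still in transit.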

So how does the diaper get to the distribution center/warehouse? This is a complex piece and is also referred to as “Network Alignment”, where the supply chain managers determine which suppliers are going to supply which distribution centers, and which distribution centers, in turn, are going to service which stores. The network alignment is product specific and is usually based off of geographical locations. The distribution centers have to keep track of their inventory levels as well as their lead times both from the supplier ordering (inbound) and store ordering (outbound) perspective to maintain the service level agreements (SLAs) with the stores. Also, distribution centers have their own stocking unit for each product, which they use to manage inventory.

It takes months of planning before the product actually hits the store shelves. A lot of systems (e.g. Item Master, Planogram, Labelling and Shelf Tags, Warehouse Management, ERP, Forecasting, Pricing and Promotions, Data Warehouse and Analytics, etc.) have to be set up with item information as part of the new product introduction (NPI) process.

So then, what is a product? Well, simply put, product means different things to different stakeholders: for a warehouse management system, product is the stocking unit; for a store, product is the sellable unit; for analytics, product is diapers irrespective of whether it is being sold as a bag or travel pack. It depends on your perspective.

Topics: Master Data Management mdm PIM Product Product Information Management


Posted by infotrellislauren on Monday, Mar 10, 2014 @ 12:03 PM

There are lots of excellent lists out there for who to follow on the subjects of customer experience, Big Data, and social media technology – but what about Master Data Management? Finding a bit of a dearth of resources for MDM Twitter influencers, I decided to put together a little list of my own on the people I personally find are great for MDM content in the Twittersphere.


1. Prashanta Chandramohan aka Prash Chan (@MDMgeek)

The author of an excellent blog with high-quality and thoughtful articles about MDM, Chandramohan produces a steady stream of great content and frequently participates in Twittersphere discussions around MDM. He’s an IBMer so occasionally his Tweets take an IBM slant, but he’s a techie first and foremost and you’ve little to fear if your goal is to avoid marketing messages in your feed.


2. Henrik L. Sørensen (@hlsdk)

Sørensen runs a blog on data quality and MDM that he regularly updates with his commentary and insights, generally in the form of short, easily-digestible posts that concisely bring up an interesting point or a new perspective. He’s also excellent at reposting other MDM related blog posts that he reads, and has a great eye for pointing out the ones worth paying attention to.


3. Aaron Zornes (@azornes)

Zornes is an institution in his own right in the MDM world; the odds are that you already know his name if you’re at all involved in the industry. As the driving force behind The MDM Institute and the Data Governance Summit events worldwide, he’s a familiar face and a sharp mind when it comes to all things MDM. Follow him to tap into his insights from events, surveys, research and other resources.


4. Sunil Soares (@SunilSoares1)

With four highly respected books about information management to his name, Soares engages in the MDM conversation with a level of authority that is emphasized by his willingness to get to the point with technical language in lieu of business-speak. His contribution is always practical and his company, Information Asset, frequently puts out useful research on data governance.


5. Jim Harris (@ocdqblog)

As his Twitter handle suggests, Harris runs OCDQ Blog, or “Obsessive-Compulsive Data Quality”. His articles tend to be light-hearted, often drawing imaginative parallels between data quality and pop culture references to make a point that is both memorable and meaningful. Like his posts, he is a friendly personality and highly responsive to people who mention him or engage with him on his favorite topic.


6. Axel Troike (@AxelTroike)

If you missed a noteworthy MDM article, Troike will surely point it out for you at some point. His Tweets often serve the purpose of circulating great content, and it’s a fairly even split between articles and research produced by other MDM experts and ones put out by his company, Grandite. He’s consistent with crediting the sources, too, so he makes it easy to expand your follow list.


7. Ravi Shankar (@Ravi_Shankar_)

Although, as part of the marketing team for Informatica, Shankar links a predictably large amount of content produced by the company, it’s usually interesting content and worth reading. If you’re looking for MDM and data governance stuff on Twitter, he’s got it in spades, and much of it can be easily absorbed by folks who are more business-focused than pure IT.


8. Gary Alleman (@Gary_Alleman)

A frequent Tweeter on the topics of data quality and governance and an MDM evangelist, Alleman shares a wide variety of interesting and practical links on top of running his own blog. He’s another great resource for MDM news and discussions and definitely worth following.

Topics: influencers Master Data Management mdm social media Twitter


Posted by marianitorralba on Friday, Sep 6, 2013 @ 2:27 PM

Deterministic Matching versus Probabilistic Matching

Which is better, Deterministic Matching or Probabilistic Matching?

I am not promising to give you an answer. But through this article, I would like to share some of my hands-on experiences that may give some insights to help you make an informed decision with regard to your MDM implementation.

Before I got into the MDM space three years ago, I worked on systems development encompassing various industries that deal with Customer data.  It was a known fact that duplicate Customers existed in those systems.  But it was a problem that was too complicated to address and was not in the priority list as it wasn’t exactly revenue-generating.  Therefore, the reality of the situation was simply accepted and systems were built to handle and work around the issue of duplicate Customers.

Corporations, particularly the large ones, are now recognizing the importance of having a better knowledge of their Customer base. In order to achieve their target market share, they need ways to retain and cross-sell to their existing Customers while at the same time acquiring new business through potential Customers. To do this, it is essential for them to truly know their Customers as individual entities, to have a complete picture of each Customer’s buying patterns, and to understand what makes each Customer tick. Hence, solving the problem of duplicate Customers has now become not just a means to achieve cost reduction, higher productivity, and improved efficiencies, but also higher revenues.

But how can you be absolutely sure that two customer records in fact represent one and the same individual? Conversely, how can you say with absolute certainty that two customer records truly represent two different individuals? The confidence level depends on a number of factors as well as on the methodology used for matching. Let us look into the two methodologies that are most widely used in the MDM space.

Deterministic Matching

Deterministic Matching mainly looks for an exact match between two pieces of data.  As such, one would think that it is straightforward and accurate.  This may very well be true if the quality of your data is at a 100% level and your data is cleansed and standardized in the same way 100% of the time.  We all know though that this is just wishful thinking.  The reality is, data is collected in the various source systems across the enterprise in many different ways.  The use of data cleansing and standardization tools that are available in the market may provide significant improvements, but experience has shown that there is still some level of customization required to even come close to the desired matching confidence level.

Deterministic Matching is ideal if your source systems are consistently collecting unique identifiers like Social Security Number, Driver’s License Number, or Passport Number. But in a lot of industries and businesses, the collection of such information is not required, and even if you try to collect it, most customers will refuse to give you such sensitive information. Thus, in the majority of implementations, several data elements like Name, Address, Phone Number, Email Address, Date of Birth, and Gender are deterministically matched separately and the results are tallied to come up with an overall match score.

The implementation of Deterministic Matching requires sets of business rules to be carefully analyzed and programmed.  These rules dictate the matching and scoring logic.  As the number of data elements to match increases, the matching rules become more complex, and the number of permutations of matching data elements to consider substantially multiplies, potentially up to a point where it may become unmanageable and detrimental to the system’s performance.
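To make the tallying concrete, here is a minimal sketch of deterministic matching across several data elements. The standardization steps, field names, and point values are assumptions chosen for illustration; a real implementation would derive them from the business rules.

```python
import re

def standardize(record):
    """Normalize fields so that exact comparison is meaningful."""
    return {
        "name": " ".join(record.get("name", "").upper().split()),
        "email": record.get("email", "").strip().lower(),
        "phone": re.sub(r"\D", "", record.get("phone", "")),
        "birth_date": record.get("birth_date", "").strip(),
    }

# Hypothetical points awarded when an element matches exactly.
POINTS = {"name": 30, "email": 30, "phone": 20, "birth_date": 20}

def deterministic_score(a, b):
    """Compare each element exactly and tally the points of those that agree."""
    a, b = standardize(a), standardize(b)
    return sum(
        points
        for field, points in POINTS.items()
        if a[field] and a[field] == b[field]  # blanks never count as a match
    )

rec1 = {"name": "Mary  Ann Lee", "email": "MLee@example.com",
        "phone": "(416) 555-0100", "birth_date": "1975-04-02"}
rec2 = {"name": "mary ann lee", "email": "mlee@example.com",
        "phone": "416-555-0100", "birth_date": ""}
print(deterministic_score(rec1, rec2))  # 80
```

Even in this small sketch, every added element needs its own standardization and scoring decision, which hints at how quickly the rule set grows.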

Probabilistic Matching

Probabilistic Matching uses a statistical approach in measuring the probability that two customer records represent the same individual.  It is designed to work using a wider set of data elements to be used for matching.  It uses weights to calculate the match scores, and it uses thresholds to determine a match, non-match, or possible match.  Sounds complicated?  There’s more.

I recently worked on a project using the IBM InfoSphere MDM Standard Edition, formerly Initiate, which uses Probabilistic Matching. Although there were other experts in the team who actually worked on this part of the project, below are my high-level observations. Note that other products available in the market using the Probabilistic Matching methodology may generally work around similar concepts.

  • It is fundamental to properly analyze the data elements, as well as the combinations of such data elements, that are needed for searching and matching.  This information goes into the process of designing an algorithm where the searching and matching rules are defined.
  • Access to the data up-front is crucial, or at least a good sample of the data that is representative of the entire population.
  • Probabilistic Matching takes into account the frequency of the occurrence of a particular data value against all the values in that data element for the entire population.  For example, the First Name ‘JOHN’ matching with another ‘JOHN’ is given a low score or weight because ‘JOHN’ is a very common name.  This concept is used to generate the weights.
  • Search buckets are derived based on the combinations of data elements in the algorithm.  These buckets contain the hashed values of the actual data.  The searching is performed on these hashed values for optimum performance.  Your search criteria are basically restricted to these buckets, and this is the reason why it is very important to define your search requirements early on, particularly the combinations of data elements forming the basis of your search criteria.
  • Thresholds (i.e. numeric cutoffs applied to the overall match score between two records) are set to determine when two records should: (1) be automatically linked since the score indicates high confidence that the two records are the same; (2) be manually reviewed as the two records may be the same but there is doubt; or (3) not be linked because the score indicates high confidence that the two records are not the same.
  • It is essential to go through the exercise of manually reviewing the matching results.  In this exercise, sample pairs of real data that have gone through the matching process are presented to users for manual inspection.  These users are preferably a handful of Data Stewards who know the data extremely well.  The goal is for the users to categorize each pair as a match, non-match, or maybe.
  • The categorizations done by the users in the sample pairs analysis are then compared with the calculated match scores, determining whether or not the thresholds that have been set are in line with the users’ categorizations.
  • The entire process may then go through several iterations.  Per iteration, the algorithm, weights, and thresholds may require some level of adjustment.
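The frequency-based weighting and the threshold bands described above can be sketched as follows. This is a simplified illustration only: the log-based weight formula, the flat disagreement penalty, and the threshold values are assumptions for the example and are not the algorithm used by IBM InfoSphere MDM Standard Edition or any other product.

```python
import math
from collections import Counter

def build_weights(values):
    """Weight for agreeing on a value: rarer values earn a higher weight."""
    counts = Counter(values)
    total = len(values)
    return {value: math.log2(total / count) for value, count in counts.items()}

def match_score(a, b, weights_by_field, disagreement_penalty=-2.0):
    """Sum agreement weights per field; disagreements add a flat penalty."""
    score = 0.0
    for field, weights in weights_by_field.items():
        if a[field] == b[field]:
            score += weights.get(a[field], 0.0)
        else:
            score += disagreement_penalty
    return score

def classify(score, lower=2.0, upper=4.0):
    """Map a score onto the three outcomes using two hypothetical thresholds."""
    if score >= upper:
        return "link"
    if score >= lower:
        return "review"
    return "no link"

# Hypothetical population of first and last names.
first_names = ["JOHN"] * 8 + ["ZELDA"] * 1 + ["MARY"] * 7
last_names = ["SMITH"] * 8 + ["QUIGLEY"] * 1 + ["LEE"] * 7
weights = {"first": build_weights(first_names), "last": build_weights(last_names)}

john = {"first": "JOHN", "last": "SMITH"}
zelda = {"first": "ZELDA", "last": "QUIGLEY"}
common = match_score(john, dict(john), weights)
rare = match_score(zelda, dict(zelda), weights)
print(common, classify(common))
print(rare, classify(rare))
```

With these made-up numbers, two records agreeing on a common name land in the review band, while two agreeing on a rare name score high enough to link. The sample pairs analysis is what tells you whether such thresholds are set sensibly for your own data.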

As you can see, the work involved in Probabilistic Matching appears very complicated.  But think about the larger pool of statistically relevant match results that you may get, of which a good portion might be missed if you were to use the relatively simpler Deterministic Matching.

Factors Influencing the Confidence Level

Before you make a decision on which methodology to use, here are some data-specific factors for you to consider.  Neither the Deterministic nor the Probabilistic methodology is immune to these factors.

Knowledge of the Data and the Source Systems

First and foremost, you need to identify the Source Systems of your data.  For each Source System that you are considering, do the proper analysis and ask the hard questions.  Why are you bringing in data from this Source System?  What value will the data from this Source System bring into your overall MDM implementation?  Will the data from this Source System be useful to the enterprise?

For each Source System, you need to identify which data elements will be brought into your MDM hub.  Which data elements will be useful across the enterprise?  For each data element, you need to understand how it is captured (added, updated, deleted) and used in the Source System, the level of validation and cleansing done by the Source System when capturing it, and what use cases in the Source System affect it.  Does it have a consistent meaning and usage across the various Source Systems supplying the same information?

Doing proper analysis of the Source Systems and their data will go a long way in making the right decisions on which data elements to use or not to use for matching.

Data Quality

A very critical task that is often overlooked is Data Profiling.  I cannot emphasize enough how important it is to profile your data early on.  Data Profiling will reveal the quality of the data that you are getting from each Source System.  It is particularly vital to profile the data elements that you intend to use for matching.

The results of Data Profiling will be especially useful in identifying the anonymous and equivalence values to be considered when searching and matching.
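
As a rough illustration of what a first profiling pass over a single data element might look like, here is a small Python sketch.  The function name, the report fields, and the sample phone numbers are assumptions of mine, not the output of any profiling tool.  The point is simply that a fill rate and a list of the most frequent values will quickly surface placeholder entries that should not be trusted for matching.

```python
from collections import Counter


def profile(values, top_n=3):
    """Summarize one data element: fill rate and most frequent values."""
    populated = [v.strip().upper() for v in values if v and v.strip()]
    counts = Counter(populated)
    return {
        "fill_rate": len(populated) / len(values) if values else 0.0,
        "distinct": len(counts),
        "top_values": counts.most_common(top_n),
    }


# Hypothetical phone numbers pulled from one Source System.
phones = ["416-555-0101", "000-000-0000", "", "000-000-0000", None, "000-000-0000"]
report = profile(phones)
```

In this made-up sample, one placeholder value accounts for most of the populated entries, which is exactly the kind of finding you want to have before you design your matching rules.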

Here are some examples of Anonymous values:

Here are some examples of Equivalence values:

  • First Name ‘WILLIAM’ has the following equivalencies (nicknames): WILLIAM, BILL, BILLY, WILL, WILLY, LIAM
  • First Name ‘ROBERT’ has the following equivalencies (nicknames): ROBERT, ROB, ROBBY, BOB, BOBBY
  • In Organization Name, ‘LIMITED’ has the following equivalencies: LIMITED, LTD, LTD.
  • In Organization Name, ‘CORPORATION’ has the following equivalencies: CORPORATION, CORP, CORP.
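
One simple way to apply equivalence values like these is a lookup that maps every variant to a canonical form before comparing.  The Python sketch below assumes the equivalence lists are maintained by hand as shown; the function names and the organization name 'Acme' are hypothetical.

```python
# Equivalence lists taken from the examples above.
NAME_EQUIVALENCES = {
    "WILLIAM": {"WILLIAM", "BILL", "BILLY", "WILL", "WILLY", "LIAM"},
    "ROBERT": {"ROBERT", "ROB", "ROBBY", "BOB", "BOBBY"},
}
ORG_TOKEN_EQUIVALENCES = {"LTD": "LIMITED", "CORP": "CORPORATION"}

# Invert the lists into a variant -> canonical lookup.
CANONICAL_NAME = {
    variant: canonical
    for canonical, variants in NAME_EQUIVALENCES.items()
    for variant in variants
}


def canonical_first_name(name):
    """Return the canonical form of a first name, or the name itself."""
    key = name.strip().upper()
    return CANONICAL_NAME.get(key, key)


def canonical_org_name(name):
    """Replace abbreviated tokens such as LTD or CORP. with their full form."""
    tokens = [token.rstrip(".") for token in name.upper().split()]
    return " ".join(ORG_TOKEN_EQUIVALENCES.get(token, token) for token in tokens)


def first_names_equivalent(a, b):
    """Two first names are equivalent if they share a canonical form."""
    return canonical_first_name(a) == canonical_first_name(b)
```

With this in place, 'Bill' and 'Liam' compare as equivalent, while 'Bill' and 'Bob' do not.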

If the Data Profiling results reveal poor data quality, you may need to consider applying data cleansing and/or standardization routines.  The last thing you want is to pollute your MDM hub with bad data.  Clean and standardized data will significantly improve your match rate.  If you decide to use cleansing and standardization tools available in the market, make sure that you clearly understand their cleansing and standardization rules.  Experience has shown that some level of customization may be required.

Here are important points to keep in mind with regard to Address standardization and validation:

  • Some tools do not necessarily correct the Address to produce exactly the same standardized Address every time.  This is especially true when the tool is simply validating that the Address entry is mailable.  If it finds the Address entry as mailable, it considers it as successfully standardized without any correction/modification.
  • There is also the matter of smaller cities being amalgamated into one big city over time.  Say one Address has the old city name (e.g. Etobicoke), and another physically the same Address has the new city name (e.g. Toronto).  Both Addresses are valid and mailable addresses, and thus both are considered as successfully standardized without any correction/modification.

You have to consider how these will affect your match rate.

Take the time and effort to ensure that each data element you intend to use for matching has good quality data.  Your investment will pay off.

Data Completeness

Ideally, each data element you intend to use for matching should always have a value in it, i.e. it should be a mandatory data element in all the Source Systems.  However, this is not always the case.  This goes back to the rules imposed by each Source System in capturing the data.

If it is important for you to use a particular data element for matching even if it is not populated 100% of the time, you have to analyze how it will affect your searching and matching rules.  When that data element is not populated in both records being compared, would you consider that a match?  When that data element is populated in one record but not the other, would you consider that a non-match, and if so, would your confidence in that being a non-match be the same as when both are populated with different values?

Applying a separate set of matching rules to handle null values adds another dimension to the complexity of your matching.
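
To illustrate, here is a Python sketch of a comparison that keeps the null cases separate instead of folding them into a plain match or non-match.  The outcome labels and the scores are hypothetical; the right values depend entirely on your own answers to the questions above.

```python
def compare_element(a, b):
    """Compare one data element and keep the null cases distinct."""
    a_missing = a is None or not a.strip()
    b_missing = b is None or not b.strip()
    if a_missing and b_missing:
        return "both_missing"
    if a_missing or b_missing:
        return "one_missing"
    return "agree" if a.strip().upper() == b.strip().upper() else "disagree"


# Hypothetical scores: a missing value is weaker evidence than a real conflict.
ELEMENT_SCORES = {
    "agree": 5.0,
    "disagree": -4.0,
    "one_missing": -1.0,
    "both_missing": 0.0,
}


def score_element(a, b):
    """Score one data element using the outcome-specific values above."""
    return ELEMENT_SCORES[compare_element(a, b)]
```

Here a value missing on one side lowers the score less than two populated but different values do, which reflects a lower confidence in that non-match.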

Timeliness of the Data

How old or how current is the data coming from your various Source Systems?  Bringing outdated and irrelevant data into the hub may unnecessarily degrade your match rate, not to mention the negative impact the additional volume may have on performance.  In most cases, old data is also incomplete, and collected with fewer validation rules imposed on it.  As a result, you may end up applying more cleansing, standardization, and validation rules to accommodate such data in your hub.  Is it really worth it?  Will the data, which might be as much as 10 years old in some cases, truly be of value across the enterprise?

Volume of the Data

Early on in the MDM implementation, you should have an idea of the volume of data that you will be bringing in to the hub from the various Source Systems.  It will also be worthwhile to have some knowledge of the level of Customer duplication that currently exists in each Source System.

A fundamental decision that will have to be made is the style of your MDM implementation.  (I will reserve the discussion on the various implementation styles for another time.)  For example, you may require a Customer hub that just persists the cross reference to the data while the data is still owned by and maintained in the Source Systems, or you may need a Customer hub that actually maintains the Customer’s golden record and is its owner and trusted source.

Your knowledge of the volume of data from the Source Systems, combined with the implementation style that you need, will give you an indication of the volume of data that will in fact reside in your Customer hub.  This will then help you make a more informed decision on which matching methodology will be able to handle that volume better.

Other Factors to Consider

In addition to the data-specific factors above, here are other factors that you should give a great deal of thought.

Goal of the Customer Hub

What are your short-term and long-term goals for your Customer hub?  What will you use it for?  Will it be used for marketing and analytics only, or to support your transactional operations only, or both?  Will it require real-time or near-real-time interfaces with other systems in the enterprise?  Will the interfaces be one-way or two-way?

Just like any software development project, it is essential to have a clear vision of what you need to achieve with your Customer hub.  It is particularly important because the Customer hub will touch most, if not all, facets of your enterprise.  Proper requirements definition early on is key, as well as the high-level depiction of your vision, illustrating the Customer hub and its part in the overall enterprise architecture.   You have a much better chance of making the right implementation decisions, particularly as to which matching methodology to use, if you have done the vital analysis, groundwork, and planning ahead of time.

Tolerance for False Positives and False Negatives

False Positives are matching cases where two records are linked because they were found to match, when they in fact represent two different entities.  False Negatives are matching cases where two records are not linked because they were found to not match, when they in fact represent the same entity.
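
If your Data Stewards have reviewed a sample of pairs, counting the two kinds of false matches is straightforward.  The Python sketch below assumes each reviewed pair is recorded as two booleans: whether the matching process linked the records, and whether the reviewer judged them to be the same entity.  The function name and the sample data are made up for illustration.

```python
def count_false_matches(pairs):
    """Count false positives and false negatives in reviewed pairs.

    Each pair is (linked_by_matching, same_entity_per_reviewer).
    """
    pairs = list(pairs)  # allow any iterable, including generators
    false_positives = sum(1 for linked, same in pairs if linked and not same)
    false_negatives = sum(1 for linked, same in pairs if not linked and same)
    return false_positives, false_negatives


# Hypothetical review results for five sample pairs.
reviewed = [
    (True, True),    # correctly linked
    (True, False),   # False Positive: linked, but different entities
    (False, True),   # False Negative: not linked, but the same entity
    (False, False),  # correctly left unlinked
    (False, True),   # False Negative
]
false_positives, false_negatives = count_false_matches(reviewed)
```

Tracking these two counts separately over time gives you something concrete to weigh against your tolerance for each.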

Based on the very nature of the two methodologies, Deterministic Matching tends to have more False Negatives than False Positives, while Probabilistic Matching tends to have more False Positives than False Negatives.  But these tendencies may change depending on the specific searching and matching rules that you impose in your implementation.

The question is: what is your tolerance for these false matches?  What are the implications to your business and your relationship with the Customer(s) when such false matches occur?  Do you have a corrective measure in place?

Your tolerance may depend on the kind of business that you are in.  For example, if your business deals with financial or medical data, you may have high tolerance for False Negatives and possibly zero tolerance for False Positives.

Your tolerance may also depend on what you are using the Customer hub data for.  For example, if you are using the Customer hub data for marketing and analytics alone, you may have a higher tolerance for False Positives than False Negatives.

Performance and Service Level Requirements

The performance and service level requirements, together with the volume of data, need careful consideration in choosing between the two methodologies.   The following, to name a few, may also impact performance and hence need to be factored in: complexity of the business rules, transactions that will retrieve and manipulate the data, the volume of these transactions, and the capacity and processing power of the machines and network in the system infrastructure.

In the Deterministic methodology, the number of data elements being used for matching and the complexity of the matching and scoring rules can seriously impact performance.

The Probabilistic methodology uses hashed values of the data to optimize searching and matching; however, there is also the extra overhead of deriving and persisting the hashed values when adding or updating data.  A poor bucketing strategy can degrade performance.

On-going Match Tuning

Once your Customer hub is in production, your work is not done yet.  There’s still the on-going task of monitoring how your Customer hub’s match rate is working for you.  As data is added from new Source Systems, new locales, new lines of business, or even just as updates to existing data are made, you have to observe how the match rate is being affected.   In the Probabilistic methodology, tuning may include adjustments to the algorithm, weights, and thresholds.  For Deterministic methodology, tuning may include adjustments to the matching and scoring rules.

Regular tuning is key, more so with Probabilistic than Deterministic methodology.  This is due to the nature of Probabilistic, where it takes into account the frequency of the occurrence of a particular data value against all the values in that data element for the entire population.  Even if there is no new Source System, locale, or line of business, the Probabilistic methodology requires tuning on a regular basis.

It is therefore prudent to also consider the time and effort required for the on-going match tuning when making a decision on which methodology to use.


So, which is better, Deterministic Matching or Probabilistic Matching?  The question should actually be: ‘Which is better for you, for your specific needs?’  Your specific needs may even call for a combination of the two methodologies instead of going purely with one.

The bottom line is, allocate enough time, effort, and knowledgeable resources to figuring out your needs.  Consider the factors that I have discussed here, which are by no means an exhaustive list.   There could be a lot more factors to take into account.  Only then will you have a better chance of making the right decision for your particular MDM implementation.



Posted by infotrellislauren on Tuesday, Jul 16, 2013 @ 11:40 AM

Everybody, it seems, is getting onto the social media bandwagon. You can’t get far into any discussion about information management or marketing without it coming up, and it’s fascinating to see the emerging best practices and strategies behind social media products and consulting groups.

Here are five lessons from over a decade of working with Master Data Management, a much older piece of data-wrangling technology, that will serve any marketing or IT professional well as they navigate the social media technology landscape.


1. Huge Investments are a Tough Sell

I’m going to assume if you’re reading this that you see value in social media marketing, or else you see the potential for value. If you’re looking to leverage social media for your organization at a scale and level of sophistication higher than a summer intern firing off tweets now and then under the corporate handle, you’re going to have to actually spend money – and in an organization, that can be easier said than done.

Master Data Management teaches a very simple lesson on the subject of talking to your executives about a wonderful, intangible solution that will surely provide ROI if they can find it in themselves to approve the needed budget. The lesson is this: the bigger the price tag, the harder time you’ll have convincing a major decision maker it’s a necessary or worthwhile investment.

Often with MDM, the more it costs to implement, the greater the impact it will have on the data within the business. With social media, that’s a little harder to prove. It doesn’t help that there are more “social media marketing solutions” out there than you can shake a stick (or a corporate credit card) at.

If your executive doesn’t have time for your technobabble pitch for a million dollar overhaul, try wiggling your foot into the door by starting small without a lot of commitments. For MDM, that’s a proof-of-concept, and there’s no reason that can’t be applied to social media marketing. Consider starting off with something that is subscription based (my more IT-minded colleagues would refer to this as “software as a service” or “SaaS”) to give your management the confidence that if they aren’t seeing returns, they can just turn off the subscription and stop spending money on it.


A high level dashboard application is an ideal place to start.


This is your social media marketing proof-of-concept – if your initial test run gets you great results, that’s a good sign that your organization is part of an industry that stands to really benefit from a bigger, more expensive social media based project. Maybe even something that involves the term “big data”, but let’s not run before we walk.


2. Consolidated Records Mean More Accurate Information

This is the core premise of Master Data Management as an information management principle: you want there to be one copy of an important record that consolidates information from all its sources in the organization, containing only the most up to date and accurate data. It’s a simple but powerful idea, the philosophy of combining multiple copies of the same thing so that you only have one trustworthy copy, and then actively preventing new duplicates from cropping up.

The same thing applies to social media, especially when we’re talking about the users as actual human beings and not as individual accounts across multiple channels. Face it, we’re not interested in social media as an abstract concept – we’re there for the people using it.

(Which is why I love to cite this actual exchange between an older gentleman of a CEO and his marketing manager that goes something like: “I don’t get Twitter. I don’t use it, I don’t want to use it, I don’t personally know anybody that does use it, and I think it’s stupid.” “I agree. I honestly think it’s stupid too – but that doesn’t change the fact that 90% of our customer base uses it, and that’s why we need to pay attention to it.”)

So we’re there for the people – why on earth would we approach gathering and visualizing metrics and data on user accounts instead of people? Should we treat the Facebook, Pinterest, Twitter, LinkedIn and Tumblr account of one individual as having the weight of five individual voices?

What you really want to be looking for is a solution that matches and combines users across multiple channels. This isn’t quite the same process that it would be as part of MDM – this is new ground here that needs to be broken, and if you want to figure out that a Facebook user is the same person as a Twitter user, you need to be a little more creative than just checking to see if they have the same name.

With less access to traditional data (like a phone number or an address), it takes a bit of new technology combined with new approaches to match social media accounts accurately. I won’t bother getting into the details here, but suffice it to say it’s something that today’s technology has the ability to do and a couple of companies are actually offering it. It seems perfectly logical to me that if you’re going to seriously use social media, especially in any sort of decision making process, you need to have a consolidated view of each user instead of a mishmash of unattributed accounts, which would, without a doubt, skew your numbers one way or another.

I’m going to briefly mention that if you want to take it a step above and beyond for even more insight into your customers, you can further consolidate that data by matching it to your internal records – Joe B in your client database is Joe B on Facebook and JoeTweet on Twitter, for example – but this is a much more ambitious project.


3. Data Quality is Not Just An IT Concern

Master Data Management is intended to bring greater value to an organization’s data by making it more accurate and trustworthy. Whether or not that actually happens very strongly depends on the quality of the data to begin with. As they say, “garbage in, garbage out,” and that’s even more true of social media marketing solutions. If you thought the quality of data in your organization was sorry to behold, I have a startling fact for you: the internet is full of garbage data. Absolutely overflowing with it. Not just things that are incorrect, but also things that are irrelevant.

If you’re going to get facts from social media, you’d better start taking data quality seriously – and make sure whatever solution you use is built by someone who takes it even more seriously. Let me give you an example.

Suppose you’re a retailer who sells Gucci products. You have a simple social media solution, a nice little application that gives you sentiment analysis and aggregate scores. You investigate how your different brands are doing and, to your shock, find that Gucci has a horrible sentiment rating. People are talking about the brand and boy are they unhappy.

You do some quick mental math and determine that it must be related to the promotion you just did around a new Gucci product. The customers must hate the product, or the promotion itself. You hurriedly show your CEO and she tells you to pull the ads.

What you didn’t know, and what your keyword based social media monitoring application didn’t know, is that there is a rap artist who goes by Gucci Mane whose fans tweet quite prolifically with reference to his name and an astonishing bouquet of language that the sentiment analysis algorithms determined to be highly negative.

Your customers are, in fact, pretty happy with Gucci and the most recent promotion, but the relevant data was drowned out and wildly skewed by a simple factor like a recording artist with a name in common. This wasn’t a question of “the data was wrong” – the data was accurate, it was just irrelevant, and the ability to distinguish between the two requires technology built on a foundation of data quality governance.
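
For the technically curious, here is a tiny Python sketch of the difference between a naive keyword match and one that at least knows about a phrase to ignore. The function name and the sample posts are invented for illustration, and a hand-kept exclusion list is nowhere near a full answer to relevance; it just shows the kind of rule a keyword-only tool is missing.

```python
import re


def mentions_brand(text, brand, exclude_phrases=()):
    """True if the brand appears outside of any excluded phrase."""
    lowered = text.lower()
    # Blank out known irrelevant phrases before looking for the brand.
    for phrase in exclude_phrases:
        lowered = lowered.replace(phrase.lower(), " ")
    pattern = r"\b" + re.escape(brand.lower()) + r"\b"
    return re.search(pattern, lowered) is not None


# Hypothetical posts picked up by a keyword search for "Gucci".
posts = [
    "Love my new Gucci bag",
    "Gucci Mane dropped a new track",
    "gucci mane concert tonight, then shopping for Gucci shoes",
]
naive = [p for p in posts if mentions_brand(p, "Gucci")]
relevant = [p for p in posts if mentions_brand(p, "Gucci", ["Gucci Mane"])]
```

The naive filter keeps all three posts; the second keeps only the two that actually mention the brand on its own.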

If you’re going to use social media data, especially when you’re using it as a measure for the success of a marketing campaign and subsequently the allocation of marketing budget, make sure you’re paying attention to data quality. Don’t veer away in alarm or boredom from terms like data governance just because they aren’t as sexy as SEO or content marketing or 360 view of the customer – train yourself to actively seek the references to data quality as part of the decision making process around a social media strategy.


4. Don’t Let Someone Else Define Your Business Rules

One of the most time consuming aspects of preparing for a Master Data Management implementation is sitting down to define your business rules. There is no one definition of the customer and no one definition of a product. These are complex issues that depend heavily on the unique needs and goals of an organization, and don’t let anybody try to tell you otherwise.

To that end, social media marketing demands the same level of complexity. If you’re building a social media strategy, you absolutely need to be thinking about those business rules and definitions. How do you define a suspect? A prospect? A customer? What makes someone important and worth targeting to you? Is it more important to you to have fifty potential leads or five leads that are defined by very specific requirements for qualification?

Every organization will be different, and a good social media solution takes that into account. Be wary of a piece of software or a consulting company that has a set of pre-established business rules that aren’t easily customizable or – even worse – are completely set in stone. If an outside company tries to tell you what your company’s priorities are and applies that same strategy to every single one of their clients, thank them for their time and look elsewhere.

Also steer clear of a solution that oversimplifies things. If you’re looking to social media opinion leaders as high value targets, you want to know how they’re defining that person as an opinion leader. Are they using one metric, like Klout score or number of followers? Are they using five? Would they be willing to give more emphasis to one over the other if your company places more value on, say, number of retweets than on number of likes?
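
To picture what "giving more emphasis to one metric over another" means in practice, here is a small Python sketch of a weighted score. The metric names, the 0-to-1 normalization, and the weights are all assumptions for the sake of the example, not how any particular product ranks opinion leaders.

```python
def influence_score(metrics, weights):
    """Weighted average of normalized metrics (each between 0 and 1)."""
    total_weight = sum(weights.values())
    weighted_sum = sum(
        weight * metrics.get(name, 0.0) for name, weight in weights.items()
    )
    return weighted_sum / total_weight


# Hypothetical user whose metrics are already normalized to the 0..1 range.
user = {"followers": 0.4, "retweets": 0.9, "likes": 0.2}

# Two hypothetical configurations of the same business rule.
equal_weights = {"followers": 1.0, "retweets": 1.0, "likes": 1.0}
retweet_heavy = {"followers": 1.0, "retweets": 3.0, "likes": 1.0}

default_score = influence_score(user, equal_weights)
tuned_score = influence_score(user, retweet_heavy)
```

The same user scores higher once retweets are weighted more heavily, which is exactly the sort of adjustment you should expect a solution to let you make yourself.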

Good solutions come preconfigured at a logical setting that is based on best practices and past client success – but are also flexible and able to match themselves to your unique business definitions and strategy as much as possible.


5. Data Silos Are Lost Opportunities

Finally, I want to talk about data silos. I’m going to expand on this term for those of you reading this who are marketing people like me and not necessarily information management junkies (although I confess the people who are both combined in one are always a delight to talk to). A data silo generally refers to situations in which the different lines of business hoard their databases and don’t like to share their information throughout the entire organization. This can be a huge problem for Master Data Management adoption, because of course the point is to make it so that everyone is using the same data, but it’s also a problem for social media marketing.

Social media data, first of all, is not just marketing data. Your sales teams will undoubtedly have uses for it in terms of account handling, and your product development teams, if you have them, will be interested in learning more about what customers actively crave from the market, and heck, your customer service division almost certainly can make use of an application that instantaneously warns them when people are dissatisfied.

The fact is, if you want to prove that gathering this data is useful, don’t hoard it all to yourself. Share that data around and let people play with it. Creativity – and creative ways to use data – happens when people think about things in ways they don’t normally think about them. Traditionally social media has been relegated to marketing, but it doesn’t have to be.

An ideal social media solution, even one of those affordable subscription-based ones I’ve been talking about, presents the data in an accessible, easily shared format. The good ones come with both a high level dashboard in business terms that even a CEO who thinks Twitter is stupid can log into and gain insight from and also the ability to drill down and export raw data so that the people who want to do complex and unique number crunching have that ability without the restraints of the program itself.


Shown above: Social Cue™, the InfoTrellis social media solution


It’s important to have a good balance of goal-oriented strategy – never go into social media without a plan or a purpose – and openness to innovation. It’s even more important to be working with an application that accommodates both.


InfoTrellis is a premier consulting company in the MDM and Big Data space that is actively involved in the information management community and constantly striving to improve the value of CRM and Big Data to their customers. To learn more about Social Cue™, our social media SaaS offering, contact the InfoTrellis team directly to schedule a product demonstration.

