Posted by Kevin Wright on Friday, Dec 11, 2015 @ 12:00 AM

With the introduction of IBM Master Data Management v11, IBM has created a new implementation style combining the strengths of both MDM Physical and Virtual editions. While MDM Physical is more suited to the “centralized” MDM style (system of record), and MDM Virtual is aligned with the “registry” MDM style (system of reference), MDM Hybrid uses a “coexistence” style to provide a mixed system of reference & record. This article will give an overview of the MDM Hybrid implementation style and a couple of interesting lessons learned during a recent InfoTrellis engagement.

MDM Hybrid was first introduced in MDM v11.0 in June 2013 to leverage the capabilities of both MDM Virtual and MDM Physical, which have themselves grown considerably in recent years. However, MDM Hybrid is not yet mainstream, for a handful of reasons. One, it represents a relative increase in complexity and requires practitioners competent in both MDM Virtual and MDM Physical. Two, migrating from an existing MDM Physical or MDM Virtual implementation can be difficult (although the transition from virtual to hybrid is the easier of the two). Hopefully this article can help alleviate some of those concerns! We at InfoTrellis believe that MDM Hybrid is a strong offering that gives us the capability to have both Virtual MDM and Physical MDM in the same box: the best of both worlds. Additionally, MDM Hybrid is excellent for new MDM implementations, and can be implemented relatively quickly in a basic manner. IBM has also provided a detailed implementation path in its Knowledge Center (see link below).

When describing MDM Hybrid to clients, I have been couching it in terms of a “Virtual Side” and a “Physical Side”, as the product is still mostly segregated. Between the two “sides” is a fence traversed by a physical MDM service. This MDM service, persistEntity, is one of the workhorses of any MDM Hybrid implementation and will be the focus of much of the customization.

Member records (source data) are contained in the MDM Virtual side, processed through the powerful probabilistic matching process that MDM Virtual provides, and assembled into a “golden record” composite view that is then mapped into the MDM Physical schema and “thrown over the fence” to the MDM Physical side using the persistEntity service. The “golden record” is persisted on the physical side. Physical MDM services such as addParty and updateParty are disabled, and modifications to attributes mastered by the MDM Virtual side are not permitted. Other attributes, however, can be modified. For example, name types not in the golden record, privacy preferences, and product or contract data can be modified using standard Physical MDM services.
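The split in attribute ownership described above can be pictured with a small sketch. This is not IBM MDM code: the class name, the method, and the attribute names are all hypothetical, chosen only to illustrate that attributes mastered on the Virtual side are treated as read-only on the Physical side while everything else stays editable.

```java
import java.util.Set;

// Hypothetical illustration only: none of these names come from the IBM MDM API.
class AttributeOwnership {

    // Attributes assumed, for this sketch, to be mastered by the Virtual side (part of the golden record).
    private static final Set<String> VIRTUAL_MASTERED =
            Set.of("legalName", "birthDate", "primaryAddress");

    // Returns true when a Physical-side service would be allowed to modify the attribute.
    static boolean isEditableOnPhysicalSide(String attribute) {
        return !VIRTUAL_MASTERED.contains(attribute);
    }

    public static void main(String[] args) {
        // A golden-record attribute is read-only on the Physical side.
        System.out.println("legalName editable: " + isEditableOnPhysicalSide("legalName"));
        // Privacy preferences are not in the golden record, so they stay editable.
        System.out.println("privacyPreference editable: " + isEditableOnPhysicalSide("privacyPreference"));
    }
}
```

In a real implementation the list of mastered attributes would come from the composite view definition rather than a hard-coded set.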

Special care needs to be taken when implementing the other MDM domains such as contract or product. The persistEntity service could initiate a call to deleteParty if the golden record no longer exists in the system, which could cause issues if there are Contract Roles or Party Product Roles. And how would one establish these in the first place? InfoTrellis recently implemented MDM Hybrid with both the party and contract domains at a client, and we came away with some interesting lessons in how to accomplish this.

At our client, a large insurance corporation, we were charged with implementing MDM Hybrid using version 11.3 and using the Contract domain as well as the Party domain with both Persons and Orgs. While we implemented many pieces of the contract domain, this discussion will be simplified to contain the entities and attributes below.


Figure 1. Simplified Logical Data Model.

This new MDM solution would be fed by both web services and batch loads (initial and regular delta loads). We decided early on to use the MDM Physical type of services as the entry point for MDM. This was for a couple of reasons. First, I think the MDM Physical schema presents a much simpler and more intuitive structure to consumers (primarily web services). Second, and perhaps more importantly, the Physical MDM schema already had all the contract objects defined.

For those reasons and others, we created a composite service to handle both party and contract data from an external perspective. This maintainContract service handled adding or updating a contract & contract component, and sending the party to the MDM Virtual Side (“Tossing it over the fence”) via memput. When the golden record came back via persistEntity, the contract role was handled via another custom service – maintainContractRole. This required the customization of the PartyMaintenanceActionRule (#211). Since the golden record processing is an asynchronous operation, the service only handled the contract, contract component, and call (or calls) to memput before returning.


Figure 2. Process Flow for customized PartyMaintenanceActionRule (#211).

In order to maintain contract role data, we had to create a “backpack” (and I’m sorry for the number of metaphors here – it helped us to explain this process to the client and has stuck in my mind as a method of explanation). This backpack would contain all the data needed to establish a contract role in Physical MDM and would accompany a party as it was processed by Virtual MDM and then get picked up by the persistEntity call on the round trip back into Physical MDM. On the virtual side, this data would not be used for searching or matching. On the Physical side, we had to create a transient data object (TDO) that would be mapped using the graphical data mapper (GDM) included in the workbench. This TDO is the backpack in the metaphor. Also, it needed to be added as an extension object under the TCRMOrganizationBObj & TCRMPersonBObj.
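To make the backpack idea a little more concrete, here is a minimal sketch. It is not the TDO we built and it does not use any IBM MDM classes: every type and field name is hypothetical, and exact equality stands in for probabilistic matching. It only shows the key property, namely that the backpack rides along with the party but is never consulted when deciding whether two records match.

```java
import java.util.Map;

// Hypothetical sketch: these types are illustrative and are not part of the IBM MDM API.
class BackpackSketch {

    // The "backpack": data needed to establish a contract role on the Physical side.
    static final class ContractRoleBackpack {
        final String contractId;
        final String roleType;

        ContractRoleBackpack(String contractId, String roleType) {
            this.contractId = contractId;
            this.roleType = roleType;
        }
    }

    // A party as sent to the Virtual side: attributes used for matching, plus the backpack.
    static final class MemberRecord {
        final Map<String, String> matchingAttributes;
        final ContractRoleBackpack backpack;

        MemberRecord(Map<String, String> matchingAttributes, ContractRoleBackpack backpack) {
            this.matchingAttributes = matchingAttributes;
            this.backpack = backpack;
        }
    }

    // Stand-in for matching: compares only the matching attributes and never reads the backpack.
    // Real probabilistic matching is far richer; exact equality keeps the sketch short.
    static boolean sameParty(MemberRecord a, MemberRecord b) {
        return a.matchingAttributes.equals(b.matchingAttributes);
    }

    public static void main(String[] args) {
        MemberRecord fromPolicyA = new MemberRecord(
                Map.of("name", "Jane Doe"), new ContractRoleBackpack("C-100", "OWNER"));
        MemberRecord fromPolicyB = new MemberRecord(
                Map.of("name", "Jane Doe"), new ContractRoleBackpack("C-200", "BENEFICIARY"));

        // Different backpacks do not affect whether the two records match.
        System.out.println("same party: " + sameParty(fromPolicyA, fromPolicyB));
        // The backpack is still available when the record comes back to the Physical side.
        System.out.println("contract to link: " + fromPolicyA.backpack.contractId);
    }
}
```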

I hope this overview of the MDM Hybrid system has been informative. Unfortunately, as I wrote it, I noticed a number of components that I left out in the interest of giving a (hopefully) better overview of the case study without droning on for 20 pages. These include – handling role locations, customizing deleteParty, modifying the Virtual algorithm, constructing the composite view, and the framework we constructed to interface between Physical and Virtual MDM. We can dive deeper into those topics in a future article.

MDM Hybrid provides a number of exciting new functionalities, and with the flexibility inherent in the IBM MDM product, there remain many unexplored avenues and even multiple ways of doing the same thing. Between MDM Virtual, MDM Physical, and now MDM Hybrid, there’s no excuse to avoid creating a Master Data Management solution in your organization. If you’re considering an MDM Hybrid implementation (or any other IBM MDM solution), give us a call!


Topics: Hybrid IBM MDM mdm MDM 11 Use Cases


Posted by vidyasagarpanati on Monday, Sep 28, 2015 @ 5:26 AM

“Effort is important, but knowing where to make an effort makes all the difference!”

A few days ago, at the end of a very intense release, one of our long-term clients asked what the secret is behind our team’s high-quality testing effort, despite the very aggressive timelines and vast scope of work that she sets up for us. She was very interested in understanding what we do differently from the many large SIs she has used in the past, who according to her were always struggling to survive in a highly time-conscious and fast-changing environment. We went back with a presentation to the client’s delivery team, which was highly appreciated by one and all. This blog provides a gist of the practices that we follow to optimize our testing effort.

The fundamental principles that help us in managing an optimum balance between Scope, Time and Costs while ensuring high quality delivery are Build for Reuse, Automation and Big Picture Thinking.



To understand these principles better, let us consider the real project that we just concluded for this specific client. This project had three major work streams: MDM, ETL and BPM. The project ran for 8 months and was executed using the InfoTrellis Smart MDM™ methodology. In total, 3 resources were dedicated to testing activities: 1 QA Lead and 2 QA Analysts. Of the allocated 8 months (36 weeks), we spent 6 weeks on discovery & assessment, 6 weeks on scope & approach and 4 weeks on the final deployment. The remaining 20 weeks, which were spent on Analysis, Design, Development and QA, were split into 3 iterations with durations of 7, 7 and 6 weeks respectively. The QA activities in this project were spread over these 3 iterations.

Build for Reuse:

While every project, and each iteration within a project, will have its own unique set of requirements, team members and activities, there will always be a few tasks that are repetitive and remain the same across iterations and across projects. Test design techniques, templates for test strategy, test cases, test reporting and test execution processes are some of the assets that can be heavily reused.

Being the experts in this field, we’ve built a rich repository of assets that can be reused across different projects. During the 1st iteration, the team utilized the whole 4 weeks which included some time for tweaking the test assets to suit the specific project needs. Due to the effort put in the 1st iteration to set up reusable assets, the team was able to complete the next two iterations in 2 weeks each.


On the whole, we were able to save 2 weeks’ [6 man-weeks] worth of effort in the next two iterations with the help of reusable assets.


The task of testing encompasses the following four steps.

  1. Creation of test data
  2. Converting data to appropriate input formats
  3. Execution & validation of test cases
  4. Preparation of reports based on the test results

With 500 test cases in the bucket, the manual method would have taken us around 675 hours, or approximately 17 weeks, to complete the testing. However, by using the various automation tools that we have built in-house, such as ITLS Service tester, ITLS XML Generator, ITLS Auto UI, ITLS XML Comparator and many others, we were able to complete our testing within 235 hours. The split of the effort is as follows:


The automation set up & test script preparation took us 135 hours approximately. But by investing time in this effort, we saved around 440 hours or 11 weeks even with executing 3 rounds of exhaustive regression tests. This was a net saving of 33 man weeks for the QA team.
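The arithmetic behind these figures can be checked with a few lines of code. The 40-hour working week is an assumption on our part, inferred from the "675 hours or 17 weeks approximately" figure; the class and method names are just for illustration.

```java
// Worked arithmetic for the automation figures quoted above.
class AutomationSavings {

    // Assumption: a 40-hour working week, inferred from "675 hours or 17 weeks approximately".
    static final int HOURS_PER_WEEK = 40;

    static int hoursSaved(int manualHours, int automatedHours) {
        return manualHours - automatedHours;
    }

    static int weeksSaved(int hoursSaved) {
        return hoursSaved / HOURS_PER_WEEK;
    }

    static int manWeeksSaved(int weeksSaved, int teamSize) {
        return weeksSaved * teamSize;
    }

    public static void main(String[] args) {
        int hours = hoursSaved(675, 235);       // manual estimate minus automated actual
        int weeks = weeksSaved(hours);          // calendar weeks at 40 hours per week
        int manWeeks = manWeeksSaved(weeks, 3); // 1 QA Lead + 2 QA Analysts
        System.out.println(hours + " hours, " + weeks + " weeks, " + manWeeks + " man weeks");
    }
}
```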

Big Picture Thinking:

One day a traveler, walking along a lane, came across 3 stonecutters working in a quarry. Each was busy cutting a block of stone. Interested to find out what they were working on, he asked the first stonecutter what he was doing, and the stonecutter said “I am cutting a stone!” Still no wiser, the traveler turned to the second stonecutter and asked him what he was doing. He said “I am cutting this block of stone to make sure that it’s square, and its dimensions are uniform, so that it will fit exactly in its place in a wall.” A bit closer to finding out what the stonecutters were working on but still unclear, the traveler turned to the third stonecutter. He seemed to be the happiest of the three and when asked what he was doing replied: “I am building a cathedral.”

The system under test had multiple work streams like MDM, ETL and BPM that were interacting with each other and the QA team was split to work on the individual work streams. Like the 3rd stonecutter, the team not only knew about how their work streams were expected to function but also about how each of them would fit into the entire system.

Thus we were able to avoid writing unnecessary test cases that could have resulted due to duplication of validations across multiple work streams or due to scenarios that may not have been realistic when considering the system as a whole. This is captured in the table below.


Our ability to identify the big picture thus saved us 128 hours or 3.2 weeks. To avoid such effort going down the drain, we get our QA leads to participate in the scope & approach phase so that they are able to grasp the “Big Picture” and educate their team members.


Using our testing approach, we saved more than 16 weeks [48 man weeks] of QA effort and thus were able to complete the project in 8 months. Without our approach, this project could easily have run for over 12 months. This also meant that we did not require the services of a team of 6 InfoTrellis resources [1 Project Manager, 0.5 Architect, 0.5 Dev Lead, 1 Developer, 1 QA Lead and 2 QA Analysts] for 4 additional months, i.e. 24 man months, and the client avoided tying up the many resources of its own who would otherwise have been on this project.

What we have described in this blog is only common sense which is well known to everyone in our industry. However common sense is very uncommon. At InfoTrellis, we have made full use of this common sense and are able to deliver projects faster and with better quality. This has helped our clients realize value from their investments much sooner than anticipated and at a much lower total cost of ownership.




Mohana Raman (QA Practice Lead)

Topics: Automation Big Picture mdm QA Reuse Testing


Posted by sathishbaskaran on Tuesday, May 12, 2015 @ 9:43 AM

MDM BatchProcessor is a multi-threaded J2SE client application used in most of the MDM implementations to load large volumes of enterprise data into MDM during initial and delta loads. Oftentimes, processing large volumes of data might cause performance issues during the Batch Processing stage thus bringing down the TPS (Transactions per Second).

Poor performance of the batch processor often disrupts the data load process and impacts the go-live plans. Unfortunately, there is no panacea available for this common problem. Let us help you by highlighting some of the potential root causes that influence the BatchProcessor performance. We will be suggesting remedies for each of these bottlenecks in the later part of this blog.

Infrastructure Concerns

Any complex, business-critical Enterprise application needs careful planning, well ahead of time, to achieve optimal performance, and MDM is no exception. During the development phase it is perfectly fine to host MDM, the DB Server and BatchProcessor all on one physical server. But the world doesn’t stop at development. The sheer volume of data MDM will handle in production calls for a carefully thought-out infrastructure plan. Besides, when these applications are running in shared environments, profiling, benchmarking and debugging become a tedious affair.

CPU Consumption

BatchProcessor can consume a lot of precious CPU cycles in the most trivial of operations when it is not configured properly. Keeping an eye out for persistently high CPU consumption and sporadic surges is vital to ensure the CPU is optimally used by BatchProcessor.


Deadlock is one of the frequent issues encountered during batch processing in multi-threaded mode. Increasing the submitter thread count beyond the recommended value might lead to deadlocks.

Stale Threads

As discussed earlier, a poorly configured BatchProcessor might open up Pandora’s Box. Stale threads can be a side-effect of thread count configuration in BatchProcessor. Increasing the submitter threads, reader and writer threads beyond the recommended numbers may cause some of the threads to wait indefinitely thus wasting precious system resources.

100% CPU Utilization

“Cancel Thread” is one of the BatchProcessor daemon threads, designed to gracefully shut down BatchProcessor when the user intends to. Being a daemon thread, it stays alive for the natural lifecycle of the BatchProcessor. But the catch here is that it hogs up to nearly 90% of CPU cycles for a trivial operation, thus bringing down the performance.

Let us have a quick look at the UserCancel thread in the BatchProcessor client. The thread waits indefinitely for user interruption, checking for it once every 2 seconds while holding on to the CPU all the time.

Thread thread = new Thread(r, "Cancel");
...
while (!controller.isShuttingDown()) {
    ...
    int i = ...;
    if (i == -1)
        ...
        catch (InterruptedException e) {}
    ...
    char ch = (char) i;
    if ((ch == 'q') || (ch == 'Q')) {
        ...
    catch (IOException iox) {}
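The loop above is only partially preserved, so the exact cause of the CPU usage cannot be read off it. As a generic illustration, and not the BatchProcessor source, here is a minimal sketch of a console listener that relies on a blocking read and stops at end of stream instead of retrying. The class and method names are hypothetical, and an AtomicBoolean stands in for the controller's shutdown flag.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch only: this is not the BatchProcessor source.
class CancelListenerSketch {

    // Reads characters until 'q' or 'Q' arrives, the stream ends, or shutdown is already in progress.
    // Returns true when the user asked to quit.
    static boolean waitForQuit(InputStream in, AtomicBoolean shuttingDown) {
        try {
            while (!shuttingDown.get()) {
                int i = in.read(); // blocks until input is available instead of polling
                if (i == -1) {
                    return false;  // end of stream: stop rather than retry in a loop
                }
                char ch = (char) i;
                if (ch == 'q' || ch == 'Q') {
                    shuttingDown.set(true);
                    return true;
                }
            }
        } catch (IOException iox) {
            // Treat a read failure as "no quit request" in this sketch.
        }
        return false;
    }

    public static void main(String[] args) {
        AtomicBoolean shuttingDown = new AtomicBoolean(false);
        // A fixed input stands in for the console so the example terminates on its own.
        boolean quit = waitForQuit(new ByteArrayInputStream("abQ".getBytes()), shuttingDown);
        System.out.println("quit requested: " + quit);
    }
}
```

In a real client the same method could be run on a daemon thread against System.in; whether that matches the product's design is an assumption.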


BatchProcessor Performance Optimization Tips

We have so far discussed the potential bottlenecks that keep BatchProcessor from running at optimal levels. Best laid plans often go awry. What is worse is not having a plan. A well thought-out plan needs to be in place before going ahead with the data load. Now, let us discuss some useful tips that could help to improve performance during the data load process.

Infrastructure topology

For better performance, run the MDM application, DB Server and BatchProcessor client on different physical servers. This will help us to leverage the system resources better.

Follow the best thread count principle

If N physical CPUs are available to the IBM InfoSphere MDM Server that serves the BatchProcessor, the number of submitter threads in BatchProcessor should be configured between 2N and 3N.

For example, assume the MDM server has 8 CPUs. Start profiling the BatchProcessor by varying its submitter thread count between 16 and 24. Do the number crunching, keep an eye on resource consumption (CPU, memory and disk I/O) and settle on a thread count that yields optimal TPS in MDM.
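The 2N to 3N guideline is simple enough to express in code. The sketch below uses hypothetical class and method names and prints a few candidate Submitter.number values to profile for an 8-CPU server; the step of 4 between candidates is an arbitrary choice for the example, not a recommendation.

```java
// Sketch of the 2N to 3N guideline; the class and method names are illustrative.
class SubmitterThreadRange {

    // Lower bound of the recommended submitter thread count for N physical CPUs.
    static int minThreads(int physicalCpus) {
        return 2 * physicalCpus;
    }

    // Upper bound of the recommended submitter thread count for N physical CPUs.
    static int maxThreads(int physicalCpus) {
        return 3 * physicalCpus;
    }

    public static void main(String[] args) {
        int cpus = 8;
        // Print a few candidate settings to profile, stepping through the range.
        for (int n = minThreads(cpus); n <= maxThreads(cpus); n += 4) {
            System.out.println("Submitter.number = " + n);
        }
    }
}
```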


You can modify the Submitter.number property in the BatchProcessor properties file to change the submitter thread count.

For example:

Submitter.number = 4

Running Multiple BatchProcessor application instances

If the MDM server is beefed up with enough resources to handle a huge number of parallel transactions, we should consider parallelizing the load process by dividing the data into multiple chunks. This involves running two or more BatchProcessor client instances in parallel, either on the same physical server or on different ones, depending on the resources available. Each BatchProcessor application instance must work with a separate batch input and output; however, the instances can either share the same server-side application instance or each operate against a dedicated one (each BatchProcessor instance pointing to a different Application Server in the MDM cluster). This exercise will increase the TPS and lower the time spent on the data load.

Customizing the Batch Controller

Well, this one is a bit tricky. We are looking at modifying the OOTB behavior here. Let us go ahead and do it as it really helps.

  • Comment out the following snippet in the runBatch() method of the BatchController class


  • Recompile the modified class and package it in the jar
  • Replace the existing DWLBatchFramework.jar, present under <BatchProcessor Home>/lib, with the new one that contains the modified BatchController class
  • Bounce the BatchProcessor instance and check the CPU consumption

Manage Heap memory

Memory consumption may not be a serious threat when dealing with BatchProcessor, but on servers that host multiple applications along with BatchProcessor, the effective memory that can be allocated to it could be very low. If high memory consumption is observed during the data load process, allocating more memory to BatchProcessor helps to ensure a smooth run. In the script that invokes BatchProcessor (named runbatch.bat in Windows environments, with an equivalent script in UNIX environments), a couple of properties control the memory allocated to the BatchProcessor client.

set minMemory=256M

set maxMemory=512M

It is recommended to keep minMemory and maxMemory at 256M and 512M respectively. If the infrastructure is high-end, minMemory and maxMemory can be increased accordingly. Again, remember to profile the data load process and settle on optimal numbers.

Reader and Writer Thread Count

IBM recommends keeping the Reader and Writer thread counts at 1. Since they are involved in lightweight tasks, this BatchProcessor configuration should suit most needs.

Shuffle the data in the Input File

By shuffling the data in the input file,  the percentage of similar records (records with high probability of getting collapsed/merged in MDM) being processed at the same time can be brought down thus avoiding long waits and deadlocks.
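Shuffling can be as simple as the sketch below, which assumes one record per line and an input small enough to hold in memory. The class name and sample records are made up for the example; a fixed seed is used only so that a run can be reproduced.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Minimal sketch: shuffles records so that similar ones are less likely to be adjacent.
class InputShuffler {

    // Returns a shuffled copy; the original list is left untouched.
    static List<String> shuffled(List<String> records, long seed) {
        List<String> copy = new ArrayList<>(records);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }

    public static void main(String[] args) {
        // Made-up sample records; the first three would likely be merge candidates.
        List<String> records = List.of(
                "Smith,John", "Smith,Jon", "Smith,J.", "Lee,Ann", "Patel,Raj");
        for (String line : shuffled(records, 42L)) {
            System.out.println(line);
        }
    }
}
```

For a real load file, the lines could be read with Files.readAllLines and written back with Files.write. Very large files would need an external shuffle instead, since this approach keeps everything in memory.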

Scale on the Server side

Well, well, well. We have strived hard to make the BatchProcessor client perform at optimal levels. Is poor performance still observed, resulting in very low TPS? Then it is time to look into the MDM application. Though optimizing MDM is beyond the scope of this blog, let us provide a high-level action plan to work on.

You can:

  1. Increase the physical resources (more CPUs, more RAM) for the given server instance
  2. Host MDM in a clustered environment
  3. Allocate more application server instances to the existing cluster which hosts MDM
  4. Use a dedicated cluster with enough resources for MDM rather than sharing the cluster with other applications
  5. Log only critical and fatal errors in MDM
  6. Enable SAM and Performance logs in MDM and tweak the application based on the findings
  6. Enabling SAM and Performance logs in MDM and tweaking the application based on findings

Hope you find this blog useful. Try out these tips when you are working on a BatchProcessor data load process next time and share how useful you find them. I bet you’ll have something to say!

If you are looking for any specific recommendations on BatchProcessor, feel free to contact us. We are always happy to assist you.

Topics: InfoTrellis Master Data Management MasterDataManagement mdm mdm hub MDM Implementation
Posted by Jan D. Svensson on Monday, Nov 10, 2014 @ 12:47 PM

I often become involved in an organization’s MDM program when they’ve reached out to InfoTrellis for help with cleaning up after a failed project or initiating attempt number X at achieving what, to some, is a real struggle. There can be a lot of reasons for a Master Data Management implementation failing, and none of them are due to the litany of blame game reasons that can be used in these scenarios.  Most failures arise from common problems that people just were not prepared for.

Let’s examine some of the top reasons MDM implementations fail. In the end they probably won’t surprise you, but if you haven’t experienced it yet you will be better prepared to face them if they happen.

Underestimating the work

I am starting with this one because it leads to many of the others, and is a complex topic. It seems like a simple thing to estimate the work but there are a lot of aspects to an MDM project that aren’t obvious that can severely impact timelines and your success.

“It’s just a project like any other”

Let me start by saying MDM is not a project, it’s a journey, or at the very least a program.

Most organizations thinking about implementing MDM are large to global companies. Even medium-sized companies that started small and experienced growth over time have the same problems as their global-sized peers. While the size of the chaos in a global company may seem much larger, they also have far more resources to throw at the problem than their smaller brethren.

If we stick to the MDM party domain as a point of reference (most organizations start here with MDM), the number of sources or points of contact with party information can be staggering. You may have systems that:

  • Manage the selling of products or services to customers
  • Manage vendors you deal with or contract to
  • Extract data to data warehouse for customer analytics and vendor performance
  • HR systems to manage employees who may also be customers
  • Self-service customer portals
  • Marketing campaign management systems
  • Customer notification systems
  • Many others

A lot of large organizations will have all of these systems, each having multiple applications, and often multiple systems responsible for the same business function. So by now you are probably saying, yes I know this, and…?  Well your MDM “project” will need to sit in the middle of all of this, and in many cases since many of these systems will be legacy mainframe based systems, you will need to be transparent as these systems won’t be allowed to be changed.

MDM can be on the scale of many of the transformation programmes your organization may be undertaking to replace aging legacy systems and moving to modern distributed Service Oriented Architecture based solutions.

Big Bang Never Works

Now that we have seen the potential size of your MDM problem, let me just remind you that you can’t do it all at once. Sure you can plan your massive transformation programme and execute it – but if you have ever really done one of these, you know it’s a lot harder than it seems and that the outcome is usually not as satisfying as you expected it to be.  You end up cutting corners, blowing the budget, missing the timelines, and de-scoping the work just trying to deliver.

What is one of the typical reasons this happens on your MDM transformation project?

You Don’t Know What You Don’t Know

You have all these systems you are going to integrate with and in many cases you are going to need to be transparent in that those systems may not know they are going to be interacting with your new MDM solution. You are going to need to know things like:

  • What data do they use?
    • How often?
    • How much?
    • When?
  • Do they update the data?
    • How often?
    • How?
    • What?
  • Do they need to know about changes made by others?
    • How often is the change notice required?
    • Do they need to know it’s changed, or what the change was?

This type of information seems pretty straightforward. I haven’t told you anything you probably didn’t know, but when you go to ask these questions, the answer you will most likely get is:

“I don’t know.”

Ok, so the documentation isn’t quite up to date, (I am being kind), but you are just going to go out and find the answer. Which leads to the next problem.

Not Enough Resources

So this is an easy problem to solve. I’ll hire some more business analysts, get some more developers to look at the code, get some more project managers to keep them on track.  Seems like a plan, and on the surface it looks like the obvious answer, (ignoring how hard it is to locate available quality IT people these days), but these aren’t the resources that are the problem.

You don’t have enough SMEs.

The BA’s, developers and others are all going to need time from your subject matter experts.  The subject matter experts are already busy because they are subject matter experts.  There typically aren’t enough of them to go around, and if you have a lot of systems to deal with, you are facing a lot of IT and business SME’s.

What your SMEs bring to the table is intellectual property. Intellectual property is critical to the success of your implementation.  You will need the knowledge your SMEs bring on your various systems, but there is another kind of intellectual property that you are going to need and can be tied to a very lengthy process.

Data Management through Governance

In order to be able to master your information, you will need to amalgamate data from multiple sources and both the meaning and the use of that information will need to be clearly defined. What may appear to be the same information from one source may have a different meaning.  Data governance is a key requirement to be able to establish the enterprise data definitions that are crucial for your master data.  Even in mature environments this can be a challenging task and can consume significant time and resources.

Data governance may seem like a problematic and time consuming exercise but it is an effective tool to use against one of the other major hurdles you will face in trying to establish a common set of master data.

That’s My Data

Many organizations are organized into silos. The silos are designed to look after their own interests, funded to maintain their business goals and competitive for resources and funding.  While the end goal of any organization is the success of the organization, the silo measures its success in terms of itself.

An MDM implementation is by nature at odds with the silo-based organization, as master data is data that is of value to a cross section of the business and thus spans silos. The danger in many organizations is that a particular silo has significantly more influence than another, with that influence often lying with the revenue-generating lines of business. This imbalance of power can easily lead to undue influence on your master data implementation, making it just another project for division X, instead of an enterprise resource to be shared by all.

Data governance is one of the key factors to help keep this situation in check. Your data governance board will comprise representatives from all stakeholders, giving equal representation to all. The cross-organizational nature of data governance is also the reason that decisions can be a difficult and lengthy process, as they require consensus across all the silos.

Aside from enterprise data definitions, another important aspect of master data management is the establishment of business rules.

Too Many Rules

The business and data governance will need to be involved to establish business rules for:

  • ETL processes for loading data into your MDM application
  • Updates to information from multiple sources
  • Matching rules
  • Survivorship rules

The establishment of rules is designed to address one of the big problems MDM is meant to solve: data quality. Organizations will want to manage both data quality on load and ongoing data quality.  One of the big mistakes often made is to try and introduce too many rules right away.

The use of too many rules early on can have a significant impact on the initial data loads into your MDM solution. You are ready for production and most likely getting your first crack at live data, only to find out that vast numbers of records are being rejected due to your business rules. Your data loads have now failed and you need to go back and rethink your rules, revise your ETL process and try again.

You finally get your data loaded and your consumers have arrived to start to use the data and your legacy transactions are failing. Why are they failing? Because the application isn’t validating the input according to your business rules, or collecting enough information to satisfy the rules.

Of course there is one way you could reduce this risk, but it often isn’t done well enough and sometimes isn’t done at all.

What Profiling?

Data profiling is the one task that is critical to understanding what your data looks like and what you need to plan for. There are often many barriers to profiling because your party master data will likely contain personally identifiable information (PII) and access will be restricted for security reasons.  You have to overcome these barriers because data profiling is the only way to foresee the gotchas that are going to put you far off track down the road.

Data profiling can be a significant task as each source system needs to be profiled. As you learn more about your data you will have more questions that need to get answered.  All this profiling takes time and most likely needs the time of specific resources as they are the only ones that have access to the information you require.  (There’s that resource problem again.)
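As a small illustration of the kind of question profiling answers, the sketch below counts how many records are missing a value for each field. It is a deliberately tiny example with hypothetical names; real profiling tools look at far more (formats, value distributions, duplicates), and the records here are simple in-memory maps rather than data from a source system.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal profiling sketch: counts missing values per field across a set of records.
class ProfilingSketch {

    // A value counts as missing when the field is absent, null, or blank.
    static Map<String, Integer> missingCounts(List<Map<String, String>> records, List<String> fields) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String field : fields) {
            counts.put(field, 0);
        }
        for (Map<String, String> record : records) {
            for (String field : fields) {
                String value = record.get(field);
                if (value == null || value.trim().isEmpty()) {
                    counts.merge(field, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample records for illustration.
        List<Map<String, String>> records = List.of(
                Map.of("name", "Ann Lee", "phone", ""),
                Map.of("name", "Raj Patel"));
        System.out.println(missingCounts(records, List.of("name", "phone")));
    }
}
```

A field that turns out to be missing in a large share of records is a strong hint that a mandatory-field rule on it will reject much of the initial load.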

Project Management is my Problem?

So far you haven’t heard any magical reasons as to why your MDM implementation should fail. In fact  many of the problems seem to be tied to the typical reasons any IT project can fail:

  • Underestimating the work
  • Not enough resources
  • Trying to do too much at once (including scope creep)
  • Time required for discovery

One aspect of an MDM implementation that may be a little atypical is the need for data governance. Data governance not only gives you the enterprise view of the information you are trying to master, but can also be an effective way of dealing with competing agendas between silos.

Data governance is also one of the key success factors for the ongoing success of your implementation. Since MDM is a journey not a project, longevity is a characteristic of a successful implementation.  Once you have delivered your foundation, the succeeding phases will build upon the base and provide more coverage of your master data.  To ensure the ongoing success of your implementation you will need the support of data governance, to ensure that new systems and upgrades to existing systems use the master data and don’t just create islands of their own.

In the past we tried to achieve what master data management promises today, but with a lack of controls and governance, we ended up with the data sprawl we are trying to correct with MDM. Once the project is over, the role of master data management does not end.  It is important to recognize that you must establish the processes and rules to not only create the master data store, but also to maintain it and integrate it into your systems.  Master data management is not about the installation and configuration of a shiny new software product.  The product is an enabler making the job easier.  The establishment of rules, governance processes and enforcement are what will bring you success.

One final thing that every master data management implementation requires, and you are pretty much doomed to failure without, is strong executive sponsorship. Your MDM implementation is going to take years.  You will require consistent funding and support to be able to take the journey and only an executive can bring that level of support.  Organizations that are organized into silos often don’t play well together, and while data governance can help in this situation, the time may come when a little intervention is required to ensure things keep moving in the proper direction on the expected timelines.

Your executive is a key resource in and out of the board room.  In the board room you will need a champion who has the vision of what your MDM implementation is going to bring to the organization, and who keeps the journey progressing over time.  Out of the board room you will be faced with competing agendas, data hoarding, shifting priorities, and silos trying to work together.  The executive influence here can be used to make sure that everyone continues to work towards the common goal, and to provide the resources required to achieve the goals in a reasonable timeline.

Topics: master data governance Master Data Management mdm mdm hub

Posted by Jan D. Svensson on Tuesday, Jun 10, 2014 @ 1:36 PM

Let me start by saying that this is not an article about big data.  While the source of big data is external to your organization, it is a topic of its own.  Many of the concepts and approaches discussed will definitely apply to your big data initiatives, but that won’t be the focus of this article.

External data is information that is sourced from outside of your organization.  This could be information you purchase from a marketing or service organization, a government agency, the post office, or a business partner.  There are many potential sources of external information.

External data can be used for various purposes in your MDM implementation.  You can use external data to:

  • Enrich your MDM data with new information you are unable to collect on your own
  • Validate information you have captured in your own systems
  • Update information you have captured to improve the quality of the data
  • Use it as reference data to provide additional information without fully integrating it into your MDM system
  • Use it as a source of test data for environments where data privacy prevents usage without masking but valid data is required

There are many approaches that can be used to integrate external data with your MDM implementation.  Your external data can be used to update your MDM data, as is the case with enrichment and data quality initiatives.  Alternatively, your external data can be stored outside of your MDM implementation, as is done for reference data and some data quality initiatives.  How you decide to integrate your external data can also have licensing implications, either with your data provider or for your MDM licensing.  Depending on the source, you may also be able to choose between real time integration services and taking the full base of information in-house.

When integrating information from external sources, you may face the issue of which source to trust over another.  Typically, purchased sources from outside provide a high quality of information, while data provided by a business partner may have issues of its own.  If you haven’t yet addressed trusted sources for your MDM single version of truth, external data sources may highlight this need.


Using External Data for Enrichment

Data enrichment is the process of augmenting your MDM data with external information.  Sources used for data enrichment are usually purchased and the provider has made significant investment ensuring the quality of that information.

When purchasing external data to be used for enrichment it is important to assess the information provider, the data they provide, and even their information delivery methods.  Things to consider could include:

  • Does the provider use appropriate methods to collect and validate the information?  Data to be used for enrichment should be of high quality and be able to be trusted.
  • Is the information provided complete and consistent?  You do not want to have to scrub external data you have paid for.
  • Is the information well formatted so it can be easily integrated with your own information?  Addresses can be particularly troublesome to integrate if the core information is incomplete such as missing a country code, or inconsistently formatted, making parsing and standardization a challenge.
  • Does the licensing of the data allow you to keep the data you have stored in your MDM platform if you decide to discontinue licensing the data from the provider?  Understand your rights using the information even after you decide you no longer want to purchase it.

When integrating external data, make sure you understand the attributes you are collecting and that they are fit for the purpose intended.  Status attributes and categorizations are usually tied to some business rules.  Ensure that you and your consumers understand what the attributes mean and what they should be used for.

When purchasing external data, there’s often a rich set of attributes available for you to use in your MDM platform.  Care should be taken to observe your MDM rules for what is and is not master data.  Just because the attributes are available does not mean that you should automatically collect them.  Select your external data attributes with the same rigor you apply to your internal data.  Don’t collect the attributes because you can!


Using External Data for Validation and Quality Improvement

External data sources can be used to validate information you have collected from other sources.  The most common of these validation processes is address validation.  Address validation contributes to data quality and can often also provide correction capability so that addresses that were supplied incorrectly can be fixed to represent valid addresses.

Address correction can be a tricky proposition and may have implications for your consumers.  If the address being captured in MDM is tied to a legal document such as a sale or insurance policy, correction may be unwanted as it changes the legal document – which cannot be arbitrarily changed.  Address correction is also very susceptible to error if the address information was incomplete and missing key attributes such as country codes or postal/zip codes.  The quality of the source data and the validation data can be an issue during this process.

Using external data for validation and quality improvement can have performance implications.  The data source for the validation is usually outside of your MDM platform and so a call must be made to invoke the service.  This adds validation time to each update.

Some providers may offer real time services which can be integrated with your MDM service to perform the validation and correction.  The performance implications of calling an external service may also affect whether you use an in-house or external service.  You may also want to consider applying the data validation and quality improvements after capturing the source data to keep MDM transaction performance high, and deal with the corrections afterwards.

External data sources can be a valuable tool to keep your MDM data clean and fresh.  Validation routines can vastly improve the quality of your data by ensuring that organizations exist on government registries, while post office change of address files keep addresses current.


External Data as Reference Data

External data does not always have to be integrated to be useful for an MDM implementation.  You can decide to use the external data as reference data.  When implementing external data as reference data your MDM platform contains an identifier, or key, from your external data on your MDM object.  The identifier is used to identify the link between your MDM object (such as a party) to the additional information stored outside of MDM.  This is the same concept you would use when tracking the source system key to your party record.

If the reference data approach is used then the implication is that queries for party data that incorporate attributes from the external data must be merged.  The usual place for such integration of the two sources is the service bus.  If you do not use such an integration platform, then it may be possible to extend your MDM platform to do the lookup in the external data source for you.  If possible you may want to have a way to indicate when these additional attributes are actually required so you can avoid the additional overhead when the data is not required.  You may use a separate service to get the two sets of information, or some kind of indicator to show that the two sources need to be queried.
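The reference data approach can be sketched as follows. This is only an illustration under stated assumptions: the two dictionaries stand in for the MDM store and the external data source, and the names get_party, external_key and include_external are hypothetical rather than part of any MDM product API.

```python
# Hypothetical stand-in for the MDM store: each party carries only the
# external identifier, not the external attributes themselves.
MDM_PARTIES = {
    "P1": {"party_id": "P1", "name": "Acme Ltd", "external_key": "EXT-77"},
    "P2": {"party_id": "P2", "name": "Birch Inc", "external_key": None},
}

# Hypothetical stand-in for the external reference data, keyed by the
# provider's identifier.
EXTERNAL_DATA = {
    "EXT-77": {"industry_code": "5411", "employee_count": 120},
}


def get_party(party_id, include_external=False):
    """Return the MDM party, merging external attributes only on request."""
    party = dict(MDM_PARTIES[party_id])  # copy so the store is not mutated
    key = party.get("external_key")
    if include_external and key is not None:
        # The second lookup happens only when the caller asks for it.
        party["external"] = EXTERNAL_DATA.get(key, {})
    return party


print(get_party("P1"))
print(get_party("P1", include_external=True))
```

The include_external flag plays the role of the indicator mentioned above, so consumers that only need core party data avoid the overhead of the extra lookup.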

Often when external data is integrated into MDM as reference data, there is a desire to provide services targeted only at the reference data, as it may have significantly more attributes available than is required in your MDM platform.  Design these services so that your MDM platform requirements are also met by the same set of services to avoid duplicating effort.  Both DB2 and Oracle provide tools to automate the generation of web services based on queries to databases and provide a simple capability to expose your reference data as a service.


Using External Data as Test Data

Every implementation of MDM requires test data for testing of database performance, application features, and integration processes.  Some environments have very strict rules on the use of data and often for MDM, valid data is required for testing, as standardization and validation routines depend on real data to perform properly.  External data sources may be a useful tool for generating test data, because:

  • It provides the volume of information required for performance testing and sizing estimates
  • It may be publicly available information so there are no security issues with it being seen by staff and consultants
  • It is well formatted so it is easy to transform into the formats required for loading your MDM platform

You should check your licensing of the data before using it for long term testing purposes as you want to be able to retain test data even if you decide to stop your subscription for the external data.


Trusted Sources

One of the new issues you may face when introducing external data is the concept of trusted sources.  Trusted sources are data sources that are trusted to have higher quality data than other data sources that supply information.  Since the external data is of higher quality you may not want values set from the external data to be replaced by values from an internal system.

A source trust framework may be required to prevent a less trusted source from overwriting information that was set by a more trusted source.  These update rules can be complicated and are often implemented with sophisticated tools such as rules engines.  Source trust is often implemented in the integration platform, as it can be an enterprise problem.  Source trust can also be implemented in your MDM platform for a localized solution.
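A very small source trust rule might look like the sketch below. The source names, the numeric trust ranks and the apply_update function are assumptions made for illustration; a real implementation would typically live in a rules engine or the integration platform and handle many more cases.

```python
# Hypothetical trust ranking: a higher number means a more trusted source.
TRUST_RANK = {"external_provider": 3, "crm": 2, "web_form": 1}


def apply_update(record, field, value, source):
    """Apply the update unless a more trusted source already set the field.

    Each stored attribute remembers which source supplied it.
    Returns True if the update was applied, False if it was rejected.
    """
    current = record.get(field)
    if current is not None and TRUST_RANK[source] < TRUST_RANK[current["source"]]:
        return False
    record[field] = {"value": value, "source": source}
    return True


party = {}
apply_update(party, "address", "1 Main St", "external_provider")
rejected = apply_update(party, "address", "1 Main", "web_form")
print(party["address"], "web_form update applied:", rejected)
```

Sources of equal rank are allowed to overwrite each other here; whether that is acceptable is a business decision, not a technical one.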

An alternate approach that can also be used to control this type of data protection is to store the data from the trusted source in attributes dedicated to data from that source.  A set of custom services with additional capability are used to process these attributes to keep them away from the normal consumers.


Integrating External Data

There are many choices you will face when trying to integrate your external data.

Sometimes licensing can be one of the drivers affecting your decisions.  Your MDM platform may be licensed according to the number of parties to be stored in the system.  Your external data source may provide data for every organization in the country but you don’t do business with every organization, so loading them all into your MDM platform will just inflate the number of parties and thus needlessly affect your licensing.

Some data providers only supply complete extracts of the data and others provide a full extract and delta changes.  When you are planning your use of the external data you may need to add your own change detection process requiring you to match the previous file with the current file to detect updates, adds and deletes.
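When the provider only supplies full extracts, change detection can be sketched as a comparison of two snapshots. The example assumes each extract has been loaded into a dictionary keyed by the provider's record identifier; detect_changes and the sample rows are hypothetical names and data used only for illustration.

```python
def detect_changes(previous, current):
    """Compare two full extracts, each a dict of record id -> row dict.

    Returns three dicts: added rows, updated rows (new version) and
    deleted rows (last known version).
    """
    adds = {k: current[k] for k in current.keys() - previous.keys()}
    deletes = {k: previous[k] for k in previous.keys() - current.keys()}
    updates = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return adds, updates, deletes


# Hypothetical snapshots from two consecutive deliveries.
previous_file = {
    "1001": {"name": "Acme Ltd", "city": "Toronto"},
    "1002": {"name": "Birch Inc", "city": "Ottawa"},
    "1003": {"name": "Cedar Co", "city": "Halifax"},
}
current_file = {
    "1001": {"name": "Acme Ltd", "city": "Toronto"},
    "1002": {"name": "Birch Inc", "city": "Kingston"},
    "1004": {"name": "Dune LLC", "city": "Calgary"},
}
adds, updates, deletes = detect_changes(previous_file, current_file)
print(sorted(adds), sorted(updates), sorted(deletes))
```

For very large extracts the same comparison is usually done with sorted files or hashed rows rather than in memory, but the three outcomes are the same.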

Some data providers offer file based data as well as online, real time services for you to integrate with.  You must consider:

  • The amount of information you require versus the base of data that would need to be managed
  • The performance and cost implications of calling outside for the service versus an in-house service call
  • The cost of the full file-based data set compared to the cost of only the data you really need, accessed in real time

When using external data to enrich your MDM data, you need to be able to update your MDM platform based on changes reflected in the external data.  Depending on the size of both your own MDM data and the external data, sophisticated processes may be required to apply the updates to your MDM platform.  If large amounts of data need to be applied daily, then you may need to maintain pointers to your MDM data in the external data so you can efficiently process and avoid needless MDM lookups to see if anything is impacted.  In this scenario you may want to consider having the external data be a consumer of MDM change notifications to keep its pointers up-to-date so you can recognize when a change in the external data affects your MDM data.


In Summary

While we have talked mostly about purchased external data, data sourced from a business partner is subject to the same problems and considerations discussed here.  External data can be a very useful tool especially when it comes to validation and enrichment.  Be forewarned that there is work involved to integrate the external data, but the rewards associated with higher quality and breadth and depth can be of significant business value.

Topics: correction Data Quality enrichment external data Integration mdm Reference Data validation

Posted by ochughtai on Friday, Mar 28, 2014 @ 4:57 PM

Product Information Management (PIM) is a vast subject area. Not only does each industry vertical have its own way of defining a product, within a particular vertical each company may choose to represent their product information differently to satisfy their particular business needs. Having said that, there are indeed standard practices and processes that make PIM expertise portable across various projects.

The first step to understanding PIM is to get an appreciation of the diversity around the definition of product. The example below, from the retail industry, provides some insight into the complex world of PIM. In my subsequent blogs, I will address various nuances of product modeling.

What is a Product?
We all love shopping. But have you ever wondered what it takes for your favorite product to reach a store shelf, or what that product means for the retailer? This blog is going to take you through the journey of a diaper bag from supplier to store shelf – let’s see if we can figure out what a product is.

So let’s start backwards.

You see the diaper bag on the store shelf – how did it get there? Well, each store has an inventory area at its back and you must have seen store workers periodically replenishing the shelves with products. That’s one way the shelves are filled. The other way is that vendors themselves come over and stock the shelves. This is referred to as direct store delivery, or DSD; one example might be 2-liter Coca Cola bottles. DSD is, however, a discussion for another day. By the way, the product that you see on the store shelf is also referred to as the sellable unit.

So how does that diaper get to the store’s inventory area? Each store has an inventory management and an ordering system; any time the inventory falls below a certain limit, an order is placed with the relevant distribution center for supply replenishment. The ordering lead times are already known, so the ordering takes place in such a way that the store seldom runs out of diapers. Ordering may be manual or automated based on the inventory levels.

So how does the diaper get to the distribution center/warehouse? This is a complex piece and is also referred to as “Network Alignment”, where the supply chain managers determine which suppliers are going to supply which distribution centers, and which distribution centers, in turn, are going to service which stores. The network alignment is product specific and is usually based off of geographical locations. The distribution centers have to keep track of their inventory levels as well as their lead times both from the supplier ordering (inbound) and store ordering (outbound) perspective to maintain the service level agreements (SLAs) with the stores. Also, distribution centers have their own stocking unit for each product, which they use to manage inventory.

It takes months of planning before the product actually hits the store shelves. A lot of systems (e.g. Item Master, Planogram, Labelling and Shelf Tags, Warehouse Management, ERP, Forecasting, Pricing and Promotions, Data Warehouse and Analytics, etc.) have to be set up with item information as part of the new product introduction (NPI) process.

So then, what is a product? Well, simply put, product means different things to different stakeholders: for a warehouse management system, product is the stocking unit; for a store, product is the sellable unit; for analytics, product is diapers, irrespective of whether it is being sold as a bag or travel pack. It depends on your perspective.

Topics: Master Data Management mdm PIM Product Product Information Management

Posted by infotrellislauren on Monday, Mar 10, 2014 @ 12:03 PM

There are lots of excellent lists out there for who to follow on the subjects of customer experience, Big Data, and social media technology – but what about Master Data Management? Finding a bit of a dearth of resources for MDM Twitter influencers, I decided to put together a little list of my own on the people I personally find are great for MDM content in the Twittersphere.


1. Prashanta Chandramohan aka Prash Chan (@MDMgeek)

The author of an excellent blog with high-quality and thoughtful articles about MDM, Chandramohan produces a steady stream of great content and frequently participates in Twittersphere discussions around MDM. He’s an IBMer so occasionally his Tweets take an IBM slant, but he’s a techie first and foremost and you’ve little to fear if your goal is to avoid marketing messages in your feed.


2. Henrik L. Sørensen (@hlsdk)

Sørensen runs a blog on data quality and MDM that he regularly updates with his commentary and insights, generally in the form of short, easily-digestible posts that concisely bring up an interesting point or a new perspective. He’s also excellent at reposting other MDM related blog posts that he reads, and has a great eye for pointing out the ones worth paying attention to.


3. Aaron Zornes (@azornes)

Zornes is an institution in his own right in the MDM world; the odds are that you already know his name if you’re at all involved in the industry. As the driving force behind The MDM Institute and the Data Governance Summit events worldwide, he’s a familiar face and a sharp mind when it comes to all things MDM. Follow him to tap into his insights from events, surveys, research and other resources.


4. Sunil Soares (@SunilSoares1)

With four highly respected books about information management to his name, Soares engages in the MDM conversation with a level of authority that is emphasized by his willingness to get to the point with technical language in lieu of business-speak. His contribution is always practical and his company, Information Asset, frequently puts out useful research on data governance.


5. Jim Harris (@ocdqblog)

As his Twitter handle suggests, Harris runs OCDQ Blog, or “Obsessive-Compulsive Data Quality”. His articles tend to be light-hearted, often drawing imaginative parallels between data quality and pop culture references to make a point that is both memorable and meaningful. Like his posts, he is a friendly personality and highly responsive to people who mention him or engage with him on his favorite topic.


6. Axel Troike (@AxelTroike)

If you missed a noteworthy MDM article, Troike will surely point it out for you at some point. His Tweets often serve the purpose of circulating great content, and it’s a fairly even split between articles and research produced by other MDM experts and ones put out by his company, Grandite. He’s consistent with crediting the sources, too, so he makes it easy to expand your follow list.


7. Ravi Shankar (@Ravi_Shankar_)

Although, as part of the marketing team for Informatica, Shankar links a predictably large amount of content produced by the company, it’s usually interesting content and worth reading. If you’re looking for MDM and data governance stuff on Twitter, he’s got it in spades, and many of his links are resources that are more easily absorbed by business-focused folks than by pure IT folks.


8. Gary Alleman (@Gary_Alleman)

A frequent Tweeter on the topics of data quality and governance and an MDM evangelist, Alleman shares a wide variety of interesting and practical links on top of running his own blog. He’s another great resource for MDM news and discussions and definitely worth following.

Topics: influencers Master Data Management mdm social media Twitter

Posted by kevinwrightinfotrellis on Wednesday, Mar 5, 2014 @ 1:25 PM

What big changes does this upgrade bring?

IBM brought together Initiate Master Data Service (MDS), InfoSphere MDM Server (MDM) and InfoSphere MDM Server for PIM into a single market offering as InfoSphere MDM v10.  The market offering contained four editions: standard, advanced, collaboration and enterprise.

In InfoSphere MDM v11, IBM further unified the products from a technology perspective.  Specifically, the legacy Initiate MDS and MDM Server products were combined together into a single technology platform.

This is a significant achievement that positions IBM to address the “MDM Journey” that is much talked about.  It allows clients to start with a Registry Style (or “Virtual Hub”), which is easier to start with, and then transition to a Hybrid or Centralized Style (or “Physical Hub”).  The key differentiator is the true implementation of the Hybrid Style.

The whole product has been re-architected under the covers to use the OSGi framework, which is different from the old EAR-based process, and comes with a host of new technological features and promises.

Other changes & new features include:

  • Enhanced MDM & DataStage integration
  • Expanded Patient Hub feature for medical applications
  • Through IBM PureSystems, it should be easier than ever to get up and running with MDM 11
  • InfoSphere MDM v10 introduced the Probabilistic Match Engine (PME) in Advanced Edition.  This was the embedding of Initiate MDS’s matching engine into MDM Server.  This capability has now been surfaced up into a “Probabilistic Search Service”, an alternative to the deterministic search traditionally offered with MDM Server
  • For WebLogic Server clients, unfortunately WebLogic is no longer supported and a migration to WAS is required (due to the OSGi support)

What problems does this upgrade solve?

Version 11 promises to deliver improved efficiency by integrating the standard and advanced editions – basically combining the traditional MDM and the Initiate Master Data Service – which means a number of duplicated functions are removed. There have also been some batch processor improvements.

Security is now on by default, which of course helps to minimize potential future issues and ensure that only the people who need to see the data can see the data.

In general, though, this upgrade is less about solving “problems” than it is about moving forward and enhancing existing efficiencies and strengths.  This upgrade is an evolution more than a revolution.

What’s the real value of this upgrade from a technology perspective?

To an implementer, the OSGi framework is such a different way of looking at the MDM product as opposed to the old EAR-based system that it’s worth it to start working with this upgrade just for the advantage of getting an early start on familiarizing yourself with this new technology.  While still maturing in the IBM MDM product, it promises faster and more dependable deployments, dependency management, and a modular code structure.  It comes with the ability to start and stop individual modules, or upgrade them without shutting down the whole application.  This can lead to much improved uptime for the MDM instance(s).

It’s also worth noting that for a company on the IBM stack, the improved integration with products like DataStage can really increase the value of this product to the enterprise.

How much effort is it going to take to implement?

IBM has held strong to their “Backwards Compatibility” statements, which is key in upgrade projects.  However, given the technology change with OSGi, effort-wise this upgrade will take a little more than if going up to, say, 10.1.  We’ve seen a number of PMRs, as is to be expected from a new release, particularly one on new technology.  Fortunately InfoTrellis has been involved in a good number of installation and product-related PMRs and has experience working with both IBM and clients to resolve them quickly.

What if I’m running a much older version?

MDM 8.5 goes out of service on 30 April 2014 and 9.0.2 goes out of service on 30 April 2015.  As for any prior versions, there is value in moving to more current versions of DB & WAS, not just MDM.  OSGi looks well positioned to be used across the board in the near future considering all of the advantages it provides; so again, it’s good to get your hands on it and start learning to work with it sooner rather than later.

What about Standard Edition (Initiate) users?

Organizations currently using Standard Edition (Initiate) will be majorly impacted by MDM version 11, because this upgrade means they will have an entirely new technology platform to migrate to, which includes the WebSphere Application Server.

The biggest advantage this release provides to existing Standard Edition users is the ability to implement true hybrid scenarios.  One scenario, for example, is being able to persist a composite view of a “virtual entity” to a “physical entity”.  This can realize performance advantages if the virtual entities are made up of many member records.  Also, there is then the ability to decorate the physical entity with additional attributes that come in the Advanced Edition platform such as Privacy Preferences, Campaign History and Hierarchies to name a few.  This scenario allows an organization to progress along their MDM journey if they have requirements to do so.  This article doesn’t address any licensing impacts to leverage Advanced Edition features.

The Advanced Edition (or “physical MDM”) capabilities are very feature rich and couple very well with Standard Edition (or “virtual MDM”).  That said, it is very important for clients that want to transition from Standard to Advanced Edition to leverage partners that have expertise in both of those platforms.

If I implemented MDM very recently, should I upgrade?

If you’re currently using MDM 10.x, it might not be worth the effort to upgrade immediately if your implementation has only just taken place.  It is worth reiterating, though, that v11 and its OSGi framework are likely the way of the future from an implementation standpoint.

How does this impact a business-end user?

Working with a more modern MDM means less need to upgrade in future, and future upgrades using OSGi are easier to implement. Version 11 comes with an increased feature set – Big Data, virtual/physical MDM, etc – that will allow much better creation of business value from the data that you already have. Increased or improved integration with other products, like InfoSphere Data Explorer or InfoSphere BigInsights, is another big plus for those already invested in IBM products.

How does this impact an IT user?

A number of things stand out:

  • Improved performance from MDM 11, WAS 8.5, newer versions of DB2/Oracle
  • OSGi
  • Improved MDM 11 workbench
  • Much smaller code base to track – just customized projects – the end result being a much smaller deployable artifact
  • Enforced security
  • Streamlined installation – basically same for workbench and server which helps to improve the experience for the developer who also performs installation
  • Batch processor improvements
  • Initiate users gain the benefits of
    • Event notification
    • Address standardization via QS

What unique insight into this upgrade does InfoTrellis have that other vendors don’t have?

Put quite simply, experience: we are already ahead of the game by being one of the first implementers in North America, if not the world, to be participating in upgrade and implementation efforts for MDM 11.  We’re also able to leverage our volumes of experience with prior versions.  InfoTrellis is involved in a number of MDM 11 projects already – both upgrades and new implementations – on a variety of operating systems (Linux, Solaris, AIX) and databases (Oracle, DB2).


If you’re looking into upgrading your MDM, give us a shout. Reach out to my colleague Shobhit at to talk to the foremost MDM experts about how we can help you with your implementation.

Topics: IBMMDM MasterDataManagement mdm MDM11 OSGi WebSphere

Posted by marianitorralba on Friday, Sep 6, 2013 @ 2:27 PM

Deterministic Matching versus Probabilistic Matching

Which is better, Deterministic Matching or Probabilistic Matching?

I am not promising to give you an answer.  But through this article, I would like to share some of my hands-on experiences that may give some insights to help you make an informed decision in regards to your MDM implementation.

Before I got into the MDM space three years ago, I worked on systems development encompassing various industries that deal with Customer data.  It was a known fact that duplicate Customers existed in those systems.  But it was a problem that was too complicated to address and was not in the priority list as it wasn’t exactly revenue-generating.  Therefore, the reality of the situation was simply accepted and systems were built to handle and work around the issue of duplicate Customers.

Corporations, particularly the large ones, are now recognizing the importance of having a better knowledge of their Customer base.  In order to achieve their target market share, they need ways to retain and cross-sell to their existing Customers while at the same time, acquire new business through potential Customers.  To do this, it is essential for them to truly know their Customers as individual entities, to have a complete picture of each Customer’s buying patterns, and to understand what makes each Customer tick.   Hence, solving the problem of duplicate Customers has now become not just a means to achieve cost reduction, higher productivity, and improved efficiencies, but also higher revenues.

But how can you be absolutely sure that two customer records in fact represent one and the same individual?  Conversely, how can you say with absolute certainty that two customer records truly represent two different individuals?  The confidence level depends on a number of factors as well as on the methodology used for matching.  Let us look into the two methodologies that are most-widely used in the MDM space.

Deterministic Matching

Deterministic Matching mainly looks for an exact match between two pieces of data.  As such, one would think that it is straightforward and accurate.  This may very well be true if the quality of your data is at a 100% level and your data is cleansed and standardized in the same way 100% of the time.  We all know though that this is just wishful thinking.  The reality is, data is collected in the various source systems across the enterprise in many different ways.  The use of data cleansing and standardization tools that are available in the market may provide significant improvements, but experience has shown that there is still some level of customization required to even come close to the desired matching confidence level.

Deterministic Matching is ideal if your source systems are consistently collecting unique identifiers like Social Security Number, Driver’s License Number, or Passport Number.  But in a lot of industries and businesses, the collection of such information is not required, and even if you try to collect it, most customers will refuse to give you such sensitive information.  Thus, in the majority of implementations, several data elements like Name, Address, Phone Number, Email Address, Date of Birth, and Gender are deterministically matched separately and the results are tallied to come up with an overall match score.

The implementation of Deterministic Matching requires sets of business rules to be carefully analyzed and programmed.  These rules dictate the matching and scoring logic.  As the number of data elements to match increases, the matching rules become more complex, and the number of permutations of matching data elements to consider substantially multiplies, potentially up to a point where it may become unmanageable and detrimental to the system’s performance.
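To make the tallying idea concrete, here is a minimal Python sketch.  The field names, the point values, and the `deterministic_score` function are all assumptions made up for illustration; they do not come from any particular MDM product, and a real rule set would be far more elaborate.

```python
# Hypothetical points awarded when a data element matches exactly.
FIELD_POINTS = {
    "name": 30,
    "address": 25,
    "phone": 15,
    "email": 15,
    "date_of_birth": 10,
    "gender": 5,
}


def normalize(value):
    """Trim and upper-case a value so trivial differences do not block a match."""
    return value.strip().upper() if value else ""


def deterministic_score(record_a, record_b):
    """Compare each element exactly and tally the points of the ones that match."""
    score = 0
    for field, points in FIELD_POINTS.items():
        a = normalize(record_a.get(field))
        b = normalize(record_b.get(field))
        if a and a == b:  # an element missing on either side earns nothing
            score += points
    return score


rec1 = {"name": "John Smith", "phone": "416-555-0100", "email": "js@example.com"}
rec2 = {"name": "JOHN SMITH ", "phone": "416-555-0100", "email": "john@example.com"}
print(deterministic_score(rec1, rec2))  # name and phone match: 30 + 15 = 45
```

Even in this toy version, every new data element adds another rule to maintain, which hints at how quickly the real rule sets grow.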

Probabilistic Matching

Probabilistic Matching uses a statistical approach in measuring the probability that two customer records represent the same individual.  It is designed to work using a wider set of data elements to be used for matching.  It uses weights to calculate the match scores, and it uses thresholds to determine a match, non-match, or possible match.  Sounds complicated?  There’s more.

I recently worked on a project using the IBM InfoSphere MDM Standard Edition, formerly Initiate, which uses Probabilistic Matching.  Although there were other experts in the team who actually worked on this part of the project, here are my high-level observations.  Note that other products available in the market using the Probabilistic Matching methodology may generally work around similar concepts.

  • It is fundamental to properly analyze the data elements, as well as the combinations of such data elements, that are needed for searching and matching.  This information goes into the process of designing an algorithm where the searching and matching rules are defined.
  • Access to the data up-front is crucial, or at least a good sample of the data that is representative of the entire population.
  • Probabilistic Matching takes into account the frequency of the occurrence of a particular data value against all the values in that data element for the entire population.  For example, the First Name ‘JOHN’ matching with another ‘JOHN’ is given a low score or weight because ‘JOHN’ is a very common name.  This concept is used to generate the weights.
  • Search buckets are derived based on the combinations of data elements in the algorithm.  These buckets contain the hashed values of the actual data.  The searching is performed on these hashed values for optimum performance.  Your search criteria are basically restricted to these buckets, and this is the reason why it is very important to define your search requirements early on, particularly the combinations of data elements forming the basis of your search criteria.
  • Thresholds (i.e. numeric values representing the overall match score between two records) are set to determine when two records should: (1) be automatically linked since there is absolute certainty that the two records are the same; (2) be manually reviewed as the two records may be the same but there is doubt; or (3) not be linked because there is absolute certainty that the two records are not the same.
  • It is essential to go through the exercise of manually reviewing the matching results.  In this exercise, sample pairs of real data that have gone through the matching process are presented to users for manual inspection.  These users are preferably a handful of Data Stewards who know the data extremely well.  The goal is for the users to categorize each pair as a match, non-match, or maybe.
  • The categorizations done by the users in the sample pairs analysis are then compared with the calculated match scores, determining whether or not the thresholds that have been set are in line with the users’ categorizations.
  • The entire process may then go through several iterations.  Per iteration, the algorithm, weights, and thresholds may require some level of adjustment.
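Two of the observations above, frequency-based weights and thresholds, can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the sample population, the log-based weight formula, and the threshold values are all hypothetical, and the actual product derives its weights in its own way.

```python
import math
from collections import Counter

# Hypothetical sample of First Name values for the population.
first_names = ["JOHN"] * 50 + ["MARY"] * 30 + ["ZEBEDIAH"] * 1 + ["OTHER"] * 19
counts = Counter(first_names)
total = len(first_names)


def agreement_weight(value):
    """Rarer values earn a higher weight: log2 of the inverse relative frequency."""
    return math.log2(total / counts[value])


def classify(score, lower=4.0, upper=9.0):
    """Map an overall match score to one of the three threshold outcomes."""
    if score >= upper:
        return "auto-link"
    if score >= lower:
        return "manual review"
    return "no link"


print(agreement_weight("JOHN"))      # 1.0, since JOHN is half of the population
print(agreement_weight("ZEBEDIAH"))  # about 6.64, a rare name
```

The sample pairs review then amounts to checking whether the pairs that the Data Stewards call a match, a maybe, or a non-match actually land on the expected side of `lower` and `upper`.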

As you can see, the work involved in Probabilistic Matching appears very complicated.  But think about the larger pool of statistically relevant match results that you may get, of which a good portion might be missed if you were to use the relatively simpler Deterministic Matching.

Factors Influencing the Confidence Level

Before you make a decision on which methodology to use, here are some data-specific factors for you to consider.  Neither the Deterministic nor the Probabilistic methodology is immune to these factors.

Knowledge of the Data and the Source Systems

First and foremost, you need to identify the Source Systems of your data.  For each Source System that you are considering, do the proper analysis, pose the questions.  Why are you bringing in data from this Source System?  What value will the data from this Source System bring into your overall MDM implementation?  Will the data from this Source System be useful to the enterprise?

For each Source System, you need to identify which data elements will be brought into your MDM hub.  Which data elements will be useful across the enterprise?  For each data element, you need to understand how it is captured (added, updated, deleted) and used in the Source System, the level of validation and cleansing done by the Source System when capturing it, and what use cases in the Source System affect it.  Does it have a consistent meaning and usage across the various Source Systems supplying the same information?

Doing proper analysis of the Source Systems and their data will go a long way toward making the right decisions on which data elements to use or not to use for matching.

Data Quality

A very critical task that is often overlooked is Data Profiling.  I cannot emphasize enough how important it is to profile your data early on.  Data Profiling will reveal the quality of the data that you are getting from each Source System.  It is particularly vital to profile the data elements that you intend to use for matching.

The results of Data Profiling will be especially useful in identifying the anonymous and equivalence values to be considered when searching and matching.

Here are some examples of Anonymous values:

Here are some examples of Equivalence values:

  • First Name ‘WILLIAM’ has the following equivalencies (nicknames): WILLIAM, BILL, BILLY, WILL, WILLY, LIAM
  • First Name ‘ROBERT’ has the following equivalencies (nicknames): ROBERT, ROB, ROBBY, BOB, BOBBY
  • In Organization Name, ‘LIMITED’ has the following equivalencies: LIMITED, LTD, LTD.
  • In Organization Name, ‘CORPORATION’ has the following equivalencies: CORPORATION, CORP, CORP.
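
One simple way to apply equivalence values like these is to map every variant to a single canonical value before comparing.  The sketch below assumes a hand-built lookup table and a hypothetical `canonical_first_name` helper; matching products typically ship their own, much larger, equivalence tables.

```python
# Hypothetical equivalence table: every nickname maps to one canonical value.
FIRST_NAME_EQUIVALENTS = {
    "WILLIAM": ["WILLIAM", "BILL", "BILLY", "WILL", "WILLY", "LIAM"],
    "ROBERT": ["ROBERT", "ROB", "ROBBY", "BOB", "BOBBY"],
}

# Invert the table so each variant can be looked up directly.
CANONICAL = {
    variant: canonical
    for canonical, variants in FIRST_NAME_EQUIVALENTS.items()
    for variant in variants
}


def canonical_first_name(name):
    """Return the canonical form of a first name, or the cleaned name if unknown."""
    cleaned = name.strip().upper()
    return CANONICAL.get(cleaned, cleaned)


print(canonical_first_name("Bill") == canonical_first_name("william"))  # True
print(canonical_first_name("Bob") == canonical_first_name("Bill"))      # False
```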

If the Data Profiling results reveal poor data quality, you may need to consider applying data cleansing and/or standardization routines.  The last thing you want is to pollute your MDM hub with bad data.  Clean and standardized data will significantly improve your match rate.  If you decide to use cleansing and standardization tools available in the market, make sure that you clearly understand their cleansing and standardization rules.  Experience has shown that some level of customization may be required.

Here are important points to keep in mind regarding Address standardization and validation:

  • Some tools do not necessarily correct the Address to produce exactly the same standardized Address every time.  This is especially true when the tool is simply validating that the Address entry is mailable.  If it finds the Address entry as mailable, it considers it as successfully standardized without any correction/modification.
  • There is also the matter of smaller cities being amalgamated into one big city over time.  Say one Address has the old city name (e.g. Etobicoke), and another physically the same Address has the new city name (e.g. Toronto).  Both Addresses are valid and mailable addresses, and thus both are considered as successfully standardized without any correction/modification.

You have to consider how these will affect your match rate.

Take the time and effort to ensure that each data element you intend to use for matching has good quality data.  Your investment will pay off.

Data Completeness

Ideally, each data element you intend to use for matching should always have a value in it, i.e. it should be a mandatory data element in all the Source Systems.  However, this is not always the case.  This goes back to the rules imposed by each Source System in capturing the data.

If it is important for you to use a particular data element for matching even if it is not populated 100% of the time, you have to analyze how it will affect your searching and matching rules.  When that data element is not populated in both records being compared, would you consider that a match?  When that data element is populated in one record but not the other, would you consider that a non-match, and if so, would your confidence in that being a non-match be the same as when both are populated with different values?

Applying a separate set of matching rules to handle null values adds another dimension to the complexity of your matching.
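
Here is a minimal sketch of what such null-handling rules might look like for a single data element.  The point values and the decision to treat "both missing" as neutral are assumptions for illustration only; the right answers depend on your own analysis of the data.

```python
def compare_element(value_a, value_b, agree=10, disagree=-10, one_missing=-2):
    """Score one data element, treating missing values separately from mismatches."""
    a = (value_a or "").strip().upper()
    b = (value_b or "").strip().upper()
    if not a and not b:
        return 0            # both missing: no evidence either way
    if not a or not b:
        return one_missing  # one missing: weak evidence against a match
    return agree if a == b else disagree


print(compare_element(None, None))                  # 0
print(compare_element("1980-01-01", None))          # -2
print(compare_element("1980-01-01", "1975-06-30"))  # -10
print(compare_element("1980-01-01", "1980-01-01"))  # 10
```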

Timeliness of the Data

How old or how current is the data coming from your various Source Systems?  Bringing outdated and irrelevant data into the hub may unnecessarily degrade your match rate, not to mention the negative impact the additional volume may have on performance.  In most cases, old data is also incomplete, and collected with fewer validation rules imposed on it.  As a result, you may end up applying more cleansing, standardization, and validation rules to accommodate such data in your hub.  Is it really worth it?  Will the data, which might be as much as 10 years old in some cases, truly be of value across the enterprise?

Volume of the Data

Early on in the MDM implementation, you should have an idea of the volume of data that you will be bringing into the hub from the various Source Systems.  It will also be worthwhile to have some knowledge of the level of Customer duplication that currently exists in each Source System.

A fundamental decision that will have to be made is the style of your MDM implementation.  (I will reserve the discussion on the various implementation styles for another time.)  For example, you may require a Customer hub that will just persist the cross reference to the data but the data is still owned by and maintained in the Source Systems, or you may need a Customer hub that will actually maintain, be the owner and trusted source of the Customer’s golden record.

Your knowledge of the volume of data from the Source Systems, combined with the implementation style that you need, will give you an indication of the volume of data that will in fact reside in your Customer hub.  This will then help you make a more informed decision on which matching methodology will be able to handle that volume better.

Other Factors to Consider

In addition to the data-specific factors above, here are other factors that you should give a great deal of thought.

Goal of the Customer Hub

What are your short-term and long-term goals for your Customer hub?  What will you use it for?  Will it be used for marketing and analytics only, or to support your transactional operations only, or both?  Will it require real-time or near-real-time interfaces with other systems in the enterprise?  Will the interfaces be one-way or two-way?

Just like any software development project, it is essential to have a clear vision of what you need to achieve with your Customer hub.  It is particularly important because the Customer hub will touch most, if not all, facets of your enterprise.  Proper requirements definition early on is key, as well as the high-level depiction of your vision, illustrating the Customer hub and its part in the overall enterprise architecture.   You have a much better chance of making the right implementation decisions, particularly as to which matching methodology to use, if you have done the vital analysis, groundwork, and planning ahead of time.

Tolerance for False Positives and False Negatives

False Positives are matching cases where two records are linked because they were found to match, when they in fact represent two different entities.  False Negatives are matching cases where two records are not linked because they were found to not match, when they in fact represent the same entity.

Based on the very nature of the two methodologies, Deterministic Matching tends to have more False Negatives than False Positives, while Probabilistic Matching tends to have more False Positives than False Negatives.  But these tendencies may change depending on the specific searching and matching rules that you impose in your implementation.
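
If you have a sample of pairs that has been manually reviewed, counting the two kinds of false matches is straightforward.  The sketch below assumes a hypothetical list of pairs, each recording whether the hub linked the two records and whether a reviewer judged them to be the same entity.

```python
def false_match_counts(pairs):
    """Count false positives and false negatives.

    Each pair is (linked_by_matching, same_entity_per_review).
    """
    false_positives = sum(1 for linked, same in pairs if linked and not same)
    false_negatives = sum(1 for linked, same in pairs if not linked and same)
    return false_positives, false_negatives


# Hypothetical reviewed sample: (linked by the hub, truly the same entity).
sample = [
    (True, True),
    (True, False),   # false positive
    (False, True),   # false negative
    (False, True),   # false negative
    (False, False),
]
print(false_match_counts(sample))  # (1, 2)
```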

The question is: what is your tolerance for these false matches?  What are the implications to your business and your relationship with the Customer(s) when such false matches occur?  Do you have a corrective measure in place?

Your tolerance may depend on the kind of business that you are in.  For example, if your business deals with financial or medical data, you may have high tolerance for False Negatives and possibly zero tolerance for False Positives.

Your tolerance may also depend on what you are using the Customer hub data for.  For example, if you are using the Customer hub data for marketing and analytics alone, you may have a higher tolerance for False Positives than False Negatives.

Performance and Service Level Requirements

The performance and service level requirements, together with the volume of data, need careful consideration in choosing between the two methodologies.   The following, to name a few, may also impact performance and hence need to be factored in: complexity of the business rules, transactions that will retrieve and manipulate the data, the volume of these transactions, and the capacity and processing power of the machines and network in the system infrastructure.

In the Deterministic methodology, the number of data elements being used for matching and the complexity of the matching and scoring rules can seriously impact performance.

The Probabilistic methodology uses hashed values of the data to optimize searching and matching; however, there is also the extra overhead of deriving and persisting the hashed values when adding or updating data.  A poor bucketing strategy can degrade performance.
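
The trade-off can be seen in a small sketch of the bucketing idea.  Everything here is an assumption for illustration: the two bucket definitions, the use of SHA-1, and the in-memory `index` stand in for whatever the product actually does internally.

```python
import hashlib
from collections import defaultdict


def bucket_keys(record):
    """Derive hashed bucket keys from hypothetical combinations of data elements."""
    combos = [
        ("name+zip", record.get("last_name"), record.get("postal_code")),
        ("phone", record.get("phone")),
    ]
    keys = []
    for label, *values in combos:
        if all(values):  # skip a bucket when any of its elements is missing
            raw = label + "|" + "|".join(v.strip().upper() for v in values)
            keys.append(hashlib.sha1(raw.encode("utf-8")).hexdigest())
    return keys


index = defaultdict(set)  # bucket hash -> ids of records in that bucket


def add_record(record_id, record):
    """The extra write-time work: derive and persist the hashed values."""
    for key in bucket_keys(record):
        index[key].add(record_id)


def candidates(record):
    """Only records sharing at least one bucket are compared in full."""
    found = set()
    for key in bucket_keys(record):
        found |= index.get(key, set())
    return found


add_record(1, {"last_name": "Smith", "postal_code": "M9C 1A1", "phone": "4165550100"})
add_record(2, {"last_name": "Jones", "postal_code": "M5V 2T6", "phone": "4165550199"})
print(candidates({"last_name": "SMITH", "postal_code": "m9c 1a1"}))  # {1}
```

Buckets that are too broad pull in too many candidates for full comparison, while buckets that are too narrow miss true matches, which is one way a poor bucketing strategy shows up.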

On-going Match Tuning

Once your Customer hub is in production, your work is not done yet.  There’s still the on-going task of monitoring how your Customer hub’s match rate is working for you.  As data is added from new Source Systems, new locales, new lines of business, or even just as updates to existing data are made, you have to observe how the match rate is being affected.   In the Probabilistic methodology, tuning may include adjustments to the algorithm, weights, and thresholds.  For Deterministic methodology, tuning may include adjustments to the matching and scoring rules.

Regular tuning is key, more so with Probabilistic than Deterministic methodology.  This is due to the nature of Probabilistic, where it takes into account the frequency of the occurrence of a particular data value against all the values in that data element for the entire population.  Even if there is no new Source System, locale, or line of business, the Probabilistic methodology requires tuning on a regular basis.

It is therefore prudent to also consider the time and effort required for the on-going match tuning when making a decision on which methodology to use.


So, which is better, Deterministic Matching or Probabilistic Matching?  The question should actually be: ‘Which is better for you, for your specific needs?’  Your specific needs may even call for a combination of the two methodologies instead of going purely with one.

The bottom line is, allocate enough time, effort, and knowledgeable resources to figuring out your needs.  Consider the factors that I have discussed here, which are by no means an exhaustive list.   There could be a lot more factors to take into account.  Only then will you have a better chance of making the right decision for your particular MDM implementation.

Topics: CDI Data Deterministic matching Integration Master Data Management Match Matching mdm MDM Implementation Probabilistic Probabilistic matching


Posted by infotrellislauren on Tuesday, Jul 16, 2013 @ 11:40 AM

Everybody, it seems, is getting onto the social media bandwagon. You can’t get far into any discussion about information management or marketing without it coming up, and it’s fascinating to see the emerging best practices and strategies behind social media products and consulting groups.

Here are five lessons from over a decade of working with Master Data Management, a much older piece of data-wrangling technology, that will serve any marketing or IT professional well as they navigate the social media technology landscape.


1. Huge Investments are a Tough Sell

I’m going to assume if you’re reading this that you see value in social media marketing, or else you see the potential for value. If you’re looking to leverage social media for your organization at a scale and level of sophistication higher than a summer intern firing off tweets now and then under the corporate handle, you’re going to have to actually spend money – and in an organization, that can be easier said than done.

Master Data Management teaches a very simple lesson on the subject of talking to your executives about a wonderful, intangible solution that will surely provide ROI if they can find it in themselves to approve the needed budget. The lesson is this: the bigger the price tag, the harder time you’ll have convincing a major decision maker it’s a necessary or worthwhile investment.

Often with MDM the more it’ll cost to implement, the more fantastic an impact it will have on the data within the business. With social media, that’s a little harder to prove. It doesn’t help that there are more “social media marketing solutions” out there than you can shake a stick (or a corporate credit card) at.

If your executive doesn’t have time for your technobabble pitch for a million dollar overhaul, try wiggling your foot into the door by starting small without a lot of commitments. For MDM, that’s a proof-of-concept, and there’s no reason that can’t be applied to social media marketing. Consider starting off with something that is subscription based (my more IT-minded colleagues would refer to this as “software as a service” or “SaaS”) to give your management the confidence that if they aren’t seeing returns, they can just turn off the subscription and stop spending money on it.


A high level dashboard application is an ideal place to start.


This is your social media marketing proof-of-concept – if your initial test run gets you great results, that’s a good sign that your organization is part of an industry that stands to really benefit from a bigger, more expensive social media based project. Maybe even something that involves the term “big data”, but let’s not run before we walk.


2. Consolidated Records Mean More Accurate Information

This is the core premise of Master Data Management as an information management principle: you want there to be one copy of an important record that consolidates information from all its sources in the organization, containing only the most up to date and accurate data. It’s a simple but powerful idea, the philosophy of combining multiple copies of the same thing so that you only have one trustworthy copy, and then actively preventing new duplicates from cropping up.

The same thing applies to social media, especially when we’re talking about the users as actual human beings and not as individual accounts across multiple channels. Face it, we’re not interested in social media as an abstract concept – we’re there for the people using it.

(Which is why I love to cite this actual exchange between an older gentleman of a CEO and his marketing manager that goes something like: “I don’t get Twitter. I don’t use it, I don’t want to use it, I don’t personally know anybody that does use it, and I think it’s stupid.” “I agree. I honestly think it’s stupid too – but that doesn’t change the fact that 90% of our customer base uses it, and that’s why we need to pay attention to it.”)

So we’re there for the people – why on earth would we approach gathering and visualizing metrics and data on user accounts instead of people? Should we treat the Facebook, Pinterest, Twitter, LinkedIn and Tumblr account of one individual as having the weight of five individual voices?

What you really want to be looking for is a solution that matches and combines users across multiple channels. This isn’t quite the same process that it would be as part of MDM – this is new ground here that needs to be broken, and if you want to figure out that a Facebook user is the same person as a Twitter user, you need to be a little more creative than just checking to see if they have the same name.

With less access to traditional data (like a phone number or an address), it takes a bit of new technology combined with new approaches to match social media accounts accurately. I won’t bother getting into the details here, but suffice to say it’s something that today’s technology has the ability to do and a couple of companies are actually offering it. It seems perfectly logical to me that if you’re going to seriously use social media, especially in any sort of decision making process, you need to have a consolidated view of each user instead of a mishmash of unattributed accounts, which would, without a doubt, skew your numbers one way or another.

I’m going to briefly mention that if you want to take it a step above and beyond for even more insight into your customers, you can further consolidate that data by matching it to your internal records – Joe B in your client database is Joe B on Facebook and JoeTweet on Twitter, for example – but this is a much more ambitious project.


3. Data Quality is Not Just An IT Concern

Master Data Management is intended to bring greater value to an organization’s data by making it more accurate and trustworthy. Whether or not that actually happens very strongly depends on the quality of the data to begin with. As they say, “garbage in, garbage out,” and that’s even more true of social media marketing solutions. If you thought the quality of data in your organization was sorry to behold, I have a startling fact for you: the internet is full of garbage data. Absolutely overflowing with it. Not just things that are incorrect, but also things that are irrelevant.

If you’re going to get facts from social media, you’d better start taking data quality seriously – and make sure whatever solution you use is built by someone who takes it even more seriously. Let me give you an example.

Suppose you’re a retailer who sells Gucci products. You have a simple social media solution, a nice little application that gives you sentiment analysis and aggregate scores. You investigate how your different brands are doing and, to your shock, find that Gucci has a horrible sentiment rating. People are talking about the brand and boy are they unhappy.

You do some quick mental math and determine that it must be related to the promotion you just did around a new Gucci product. The customers must hate the product, or the promotion itself. You hurriedly show your CEO and she tells you to pull the ads.

What you didn’t know, and what your keyword based social media monitoring application didn’t know, is that there is a rap artist who goes by Gucci Mane whose fans tweet quite prolifically with reference to his name and an astonishing bouquet of language that the sentiment analysis algorithms determined to be highly negative.

Your customers are, in fact, pretty happy with Gucci and the most recent promotion, but the relevant data was drowned out and wildly skewed by a simple factor like a recording artist with a name in common. This wasn’t a question of “the data was wrong” – the data was accurate, it was just irrelevant, and the ability to distinguish between the two requires technology built on a foundation of data quality governance.
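
For the technically curious, here is a deliberately naive Python sketch of the difference between a bare keyword match and one that knows about an unrelated phrase.  The `is_relevant` function and its phrase list are hypothetical, and a real solution would need far more than a hand-maintained list, but it shows where the irrelevant data sneaks in.

```python
def is_relevant(post, brand="gucci", unrelated_phrases=("gucci mane",)):
    """Keep a post only if it mentions the brand outside of known unrelated phrases."""
    text = post.lower()
    for phrase in unrelated_phrases:
        text = text.replace(phrase, "")
    return brand in text


posts = [
    "Loving my new Gucci bag",
    "Gucci Mane dropped a new track",
    "Gucci Mane wearing Gucci loafers",
]
print([is_relevant(p) for p in posts])  # [True, False, True]
```

A plain keyword search would have kept all three posts and fed them to the sentiment analysis.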

If you’re going to use social media data, especially when you’re using it as a measure for the success of a marketing campaign and subsequently the allocation of marketing budget, make sure you’re paying attention to data quality. Don’t veer away in alarm or boredom from terms like data governance just because they aren’t as sexy as SEO or content marketing or 360 view of the customer – train yourself to actively seek the references to data quality as part of the decision making process around a social media strategy.


4. Don’t Let Someone Else Define Your Business Rules

One of the most time consuming aspects of preparing for a Master Data Management implementation is sitting down to define your business rules. There is no one definition of the customer and no one definition of a product. These are complex issues that depend heavily on the unique needs and goals of an organization, and don’t let anybody try to tell you otherwise.

To that end, social media marketing demands the same level of complexity. If you’re building a social media strategy, you absolutely need to be thinking about those business rules and definitions. How do you define a suspect? A prospect? A customer? What makes someone important and worth targeting to you? Is it more important to you to have fifty potential leads or five leads that are defined by very specific requirements for qualification?

Every organization will be different, and a good social media solution takes that into account. Be wary of a piece of software or a consulting company that has a set of pre-established business rules that aren’t easily customizable or – even worse – are completely set in stone. If an outside company tries to tell you what your company’s priorities are and applies that same strategy to every single one of their clients, thank them for their time and look elsewhere.

Also steer clear of a solution that oversimplifies things. If you’re looking to social media opinion leaders as high value targets, you want to know how they’re defining that person as an opinion leader. Are they using one metric, like Klout score or number of followers? Are they using five? Would they be willing to give more emphasis to one over the other if your company places more value on, say, number of retweets than on number of likes?
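
As a rough sketch of what "more emphasis on one metric over another" could mean in practice, consider a weighted score where the weights are yours to set.  The metric names, the numbers, and the `influence_score` function are assumptions for illustration, and a real solution would also normalize the metrics before combining them.

```python
def influence_score(metrics, weights):
    """Weighted sum of metrics; the weights encode what the business values most."""
    return sum(weights.get(name, 0) * value for name, value in metrics.items())


user = {"followers": 1200, "retweets": 300, "likes": 900}

default_weights = {"followers": 1.0, "retweets": 1.0, "likes": 1.0}
retweet_heavy = {"followers": 0.5, "retweets": 3.0, "likes": 0.5}

print(influence_score(user, default_weights))  # 2400.0
print(influence_score(user, retweet_heavy))    # 1950.0
```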

Good solutions come preconfigured at a logical setting that is based on best practices and past client success – but are also flexible and able to match themselves to your unique business definitions and strategy as much as possible.


5. Data Silos Are Lost Opportunities

Finally, I want to talk about data silos. I’m going to expand on this term for those of you reading this who are marketing people like me and not necessarily information management junkies (although I confess the people who are both combined in one are always a delight to talk to). A data silo generally refers to situations in which the different lines of business hoard their databases and don’t like to share their information throughout the entire organization. This can be a huge problem for Master Data Management adoption, because of course the point is to make it so that everyone is using the same data, but it’s also a problem for social media marketing.

Social media data, first of all, is not just marketing data. Your sales teams will undoubtedly have uses for it in terms of account handling, and your product development teams, if you have them, will be interested in learning more about what customers actively crave from the market, and heck, your customer service division almost certainly can make use of an application that instantaneously warns them when people are dissatisfied.

The fact is, if you want to prove that gathering this data is useful, don’t hoard it all to yourself. Share that data around and let people play with it. Creativity – and creative ways to use data – happens when people think about things in ways they don’t normally think about them. Traditionally social media has been relegated to marketing, but it doesn’t have to be.

An ideal social media solution, even one of those affordable subscription-based ones I’ve been talking about, presents the data in an accessible, easily shared format. The good ones come with both a high level dashboard in business terms that even a CEO who thinks Twitter is stupid can log into and gain insight from and also the ability to drill down and export raw data so that the people who want to do complex and unique number crunching have that ability without the restraints of the program itself.


Shown above: Social Cue™, the InfoTrellis social media solution


It’s important to have a good balance of goal-oriented strategy – never go into social media without a plan or a purpose – and openness to innovation. It’s even more important to be working with an application that accommodates both.


InfoTrellis is a premier consulting company in the MDM and Big Data space that is actively involved in the information management community and constantly striving to improve the value of CRM and Big Data to their customers. To learn more about Social Cue™, our social media SaaS offering, contact the InfoTrellis team directly at to schedule a product demonstration.

Topics: allsight Big Data data governance Data Quality Marketing Master Data Management mdm Social Cue social media Social Media Marketing
