Posted by sathishbaskaran on Tuesday, May 12, 2015 @ 9:43 AM

MDM BatchProcessor is a multi-threaded J2SE client application used in most MDM implementations to load large volumes of enterprise data into MDM during initial and delta loads. Processing such large volumes often causes performance issues during the batch processing stage, bringing down the TPS (Transactions per Second).

Poor performance of the batch processor often disrupts the data load process and impacts go-live plans. Unfortunately, there is no panacea for this common problem. Let us help by highlighting some of the potential root causes that influence BatchProcessor performance. We suggest remedies for each of these bottlenecks later in this blog.

Infrastructure Concerns

Any complex, business-critical enterprise application needs careful planning, well ahead of time, to achieve optimal performance, and MDM is no exception. During the development phase it is perfectly fine to host MDM, the DB server and BatchProcessor all on one physical server. But the world doesn’t stop at development. The sheer volume of data MDM will handle in production calls for a carefully thought-out infrastructure plan. Besides, when these applications run in shared environments, profiling, benchmarking and debugging become a tedious affair.

CPU Consumption

BatchProcessor can consume a lot of precious CPU cycles on the most trivial of operations when it is not configured properly. Keeping an eye out for persistently high CPU consumption and sporadic surges is vital to ensuring the CPU is used optimally by BatchProcessor.

Deadlock

Deadlock is one of the most frequent issues encountered during batch processing in multi-threaded mode. Increasing the submitter thread count beyond the recommended value might lead to deadlocks.

Stale Threads

As discussed earlier, a poorly configured BatchProcessor might open up Pandora’s box. Stale threads can be a side effect of thread count configuration in BatchProcessor. Increasing the submitter, reader and writer threads beyond the recommended numbers may cause some of the threads to wait indefinitely, wasting precious system resources.

100% CPU Utilization

“Cancel Thread” is one of the BatchProcessor daemon threads, designed to gracefully shut down BatchProcessor when the user requests it. Being a daemon thread, it is alive for the entire lifecycle of the BatchProcessor. The catch is that it hogs up to nearly 90% of CPU cycles for a trivial operation, dragging down overall performance.

Let us have a quick look at the UserCancel thread in the BatchProcessor client. The thread waits indefinitely for user input, checking for it every 2 seconds while holding on to the CPU the whole time.

Thread thread = new Thread(r, "Cancel");
thread.setDaemon(true);
thread.start();

while (!controller.isShuttingDown()) {
    try {
        int i = System.in.read();
        if (i == -1) {
            try {
                Thread.sleep(2000L);
            } catch (InterruptedException e) {}
        } else {
            char ch = (char) i;
            if ((ch == 'q') || (ch == 'Q')) {
                controller.requestShutdown();
            }
        }
    } catch (IOException iox) {}
}

BatchProcessor Performance Optimization Tips

We have so far discussed potential bottlenecks in running BatchProcessor at optimal levels. Best laid plans often go awry; what is worse is not having a plan at all. A well thought out plan needs to be in place before going ahead with the data load. Now, let us discuss some useful tips that can help improve performance during the data load process.

Infrastructure topology

For better performance, run the MDM application, the DB server and the BatchProcessor client on different physical servers. This helps each component make better use of the available system resources.

Follow the best thread count principle

If there are N physical CPUs available to the IBM InfoSphere MDM Server instance that caters to BatchProcessor, then the number of submitter threads in BatchProcessor should be configured between 2N and 3N.

For example, if the MDM server has 8 CPUs, start profiling the BatchProcessor by varying its submitter thread count between 16 and 24. Do the number crunching, keep an eye on resource consumption (CPU, memory and disk I/O) and settle on a thread count that yields optimal TPS in MDM.

 

You can modify the Submitter.number property in Batch.properties to change the Submitter thread count.

For example:

Submitter.number = 4

Running Multiple BatchProcessor application instances

If the MDM server is beefed up with enough resources to handle a huge number of parallel transactions, consider parallelizing the load process by dividing the data into multiple chunks. This involves running two or more BatchProcessor client instances in parallel, either on the same or on different physical servers depending on the resources available. Each BatchProcessor instance must work with a separate batch input and output; however, they can share the same server-side application instance or operate against dedicated instances (each BatchProcessor instance pointing to a different application server in the MDM cluster). This exercise will increase the TPS and lower the time spent on the data load.

Customizing the Batch Controller

Well, this one is a bit tricky. We are looking at modifying the OOTB behavior here. Let us go ahead and do it as it really helps.

  • Comment out the following snippet in the runBatch() method of the BatchController class:

  //UserCancel.start();

  • Recompile the modified BatchController class and package it back into the jar
  • Replace the existing DWLBatchFramework.jar, present under <BatchProcessor Home>/lib, with the new jar containing the modified BatchController class
  • Bounce the BatchProcessor instance and check the CPU consumption

Manage Heap memory

Memory consumption may not be a serious threat while dealing with BatchProcessor, but on servers that host multiple applications alongside BatchProcessor, the effective memory that can be allocated to it could be very low. If high memory consumption is observed during the data load process, allocating more memory to BatchProcessor helps ensure a smooth run. In the BatchProcessor invocation script (runbatch.bat in Windows environments, runbatch.sh in UNIX environments), there are a couple of properties that control the memory allocated to the BatchProcessor client.

set minMemory=256M

set maxMemory=512M

It is recommended to keep minMemory and maxMemory at 256M and 512M respectively. If the infrastructure is high-end, minMemory and maxMemory can be increased accordingly. Again, remember to profile the data load process and settle on optimal numbers.

Reader and Writer Thread Count

IBM recommends keeping the Reader and Writer thread counts at 1. Since they perform lightweight tasks, this configuration should suit most needs (see the example below).
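For reference, a minimal Batch.properties snippet; the reader and writer property names are assumed to follow the same naming pattern as Submitter.number, so verify them against your own Batch.properties before changing anything.

Reader.number = 1
Writer.number = 1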

Shuffle the data in the Input File

By shuffling the data in the input file, the percentage of similar records (records with a high probability of being collapsed/merged in MDM) processed at the same time can be brought down, avoiding long waits and deadlocks. A small utility like the one sketched below can do the shuffling before the load.
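A minimal Java sketch for shuffling a batch input file, assuming one record per line; the file names are illustrative and not part of the product.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ShuffleBatchInput {
    public static void main(String[] args) throws IOException {
        // Assumes one batch record per line; adjust if your input format differs.
        List<String> records = new ArrayList<>(
                Files.readAllLines(Paths.get("batch_input.txt"), StandardCharsets.UTF_8));
        Collections.shuffle(records);
        Files.write(Paths.get("batch_input_shuffled.txt"), records, StandardCharsets.UTF_8);
        System.out.println("Shuffled " + records.size() + " records.");
    }
}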

Scale on the Server side

Well, well, well. We have strived hard to make the BatchProcessor client perform at optimal levels. Still seeing poor performance and very low TPS? It is time to look into the MDM application itself. Though optimizing MDM is beyond the scope of this blog, here is a high-level action plan to work on.

You can:

  1. Increase the physical resources (more CPUs, more RAM) for the given server instance
  2. Host MDM in a clustered environment
  3. Allocate more application server instances to the existing cluster that hosts MDM
  4. Use a dedicated cluster with enough resources for MDM rather than sharing the cluster with other applications
  5. Log only critical and fatal errors in MDM
  6. Enable SAM and Performance logs in MDM and tweak the application based on the findings

Hope you find this blog useful. Try out these tips the next time you are working on a BatchProcessor data load and share how useful you find them. I bet you’ll have something to say!

If you are looking for any specific recommendations on BatchProcessor, feel free to contact sathish.baskaran@infotrellis.com. Always happy to assist you.

Topics: InfoTrellis Master Data Management MasterDataManagement mdm mdm hub MDM Implementation
Posted by infotrellislauren on Monday, Sep 22, 2014 @ 10:59 AM


It’s that time of year again! InfoTrellis will be attending and exhibiting at two major tradeshows in 2014 as the weather starts to get a little chillier. If you’ve ever wanted the chance to chat with one of the brilliant brains behind Customer ConnectId™, ask the hard questions about Big Data integration from someone who can actually give you an answer, or just wanted to learn more about the only Master Data Management SI in the industry with a 100% success rate, here’s your chance to meet with us in the flesh.

GTEC 2014

October 27th – 30th, Ottawa, ON

Booth Number 611

 Booth Map GTEC 2014

GTEC (Government Technology Exhibition and Conference)  is the primary forum where government and private sector communities gather to exchange ideas and advance the business of ICT in government.

IBM Insight 2014

October 27th – 30th, Las Vegas, NV

Booth Number 207

Booth Map IBM Insight 2014

Previously known as IBM’s Information on Demand, IBM Insight is bustling with purveyors of cutting-edge technology, and InfoTrellis is delighted to yet again have the honor of being among their numbers. With the growing emphasis on Big Data analytics, we’ll have a lot to say and will certainly be running demos of our own solutions in the space all throughout the show.

We hope to see you all there, and if you plan to attend and would like to set up a meeting in advance with one of our executives, please send me an email at lauren@infotrellis.com to arrange a time for a discussion.

Topics: GTEC IBM IOD Information On Demand InfoTrellis ottawa Tradeshow Tradeshow Schedule


Posted by infotrellislauren on Tuesday, Oct 1, 2013 @ 10:35 AM


InfoTrellis will be attending and exhibiting at two major tradeshows in 2013 as the weather starts to get a little chillier. If you’ve ever wanted the chance to chat with one of the brilliant brains behind Social Cue™, ask the hard questions about Big Data integration from someone who can actually give you an answer, or just wanted to learn more about the only Master Data Management SI in the industry with a 100% success rate, here’s your chance to meet with us in the flesh.

The DMA Show 2013

October 12th – 17th, Chicago, IL

Booth Number 461

 InfoTrellis DMA 2013 Booth Location Map

The Direct Marketing Association, with its slogan of Data-Driven Marketing, is a hotbed of activity and ideas for how to start using data in ways that lead to higher revenue, lead generation, and customer satisfaction.

IBM’s Information on Demand 2013

November 3rd – 7th, Las Vegas, NV

Booth Number 325

InfoTrellis IOD 2013 Booth Location Map

IBM’s Information on Demand is bustling with purveyors of cutting-edge technology, and InfoTrellis is delighted to yet again have the honor of being among their numbers. With the growing emphasis on Big Data analytics, we’ll have a lot to say and will certainly be running demos of our own solutions in the space all throughout the show.

We hope to see you all there, and if you plan to attend and would like to set up a meeting in advance with one of our executives, please send me an email at lauren@infotrellis.com to arrange a time for a discussion.

Topics: DMA Show IBM IOD Information On Demand InfoTrellis Tradeshow Tradeshow Schedule


Posted by Kumaran Sasikanthan on Friday, Sep 27, 2013 @ 2:50 AM

InfoTrellis has begun its hiring at universities and colleges in Canada, the US and India for 2013-14. We hire primarily for technical roles from universities and colleges offering high-quality education in computer science.

 

At InfoTrellis, we work on some of the most challenging Master Data Management projects in the world, serving clients across multiple domains and industries. Ours is a David vs. Goliath story; we compete with companies that employ armies of professionals and massive resources – and in spite of their size advantages we win based on the strength of our reputation for superior client service and competency.

Members of our team of experts are often called upon to speak at industry events.

With our specialist skills, niche focus, practical lightweight processes, relaxed and empowering team structures and, most importantly, smart and highly motivated people, we have consistently won clients and delivered superior results. We are currently looking for entrepreneurially minded individuals to join us in taking our company to the next level.

Our team comes together to tackle the big challenges and to celebrate the big wins!

 

I wanted to write this blog article specifically to share some insights on how we hire, what we look for and what you can expect once you join us.

  1. We don’t necessarily look for what you have done, but what you can do. New skills can be attained quickly, but aptitude can’t be developed overnight. We look for strong analytical and logical reasoning aptitude, so all candidates go through a written test of those skills. Puzzles and logical questions are common in our interviews.
  2. We work primarily on Java technologies, so obviously programming skills in Java are highly desirable. It isn’t a total dealbreaker, though: if you’ve got experience with any other programming languages like C or C++, you should be fine during the interview process. Expect a lot of programming questions both in the written test and personal interviews.
  3. Our focus is on Master Data Management and related technologies, so knowledge of databases and SQL is a very strong asset. We don’t look for any specific vendor database skills. Expect quite a few data modelling and design questions that will test your understanding of the various database concepts.
  4. We believe that titles are for facilitating external interactions and not for flaunting within the organization. You will find our people multi-cultural, humble yet ambitious. Expect questions that seek to determine your fit into our environment of enthusiastic and success-motivated team players.
  5. Every InfoTrellis employee is part of working towards our overall mission: to build a company that is recognized worldwide as the premier consulting company in the field of information management. We have a lot to be proud of, looking at what we’ve already accomplished, but there is a lot more to be achieved, and all this will require hard work and deep commitment to our unified vision. Expect questions that probe your drive to be part of a group that is highly motivated and works in a very fast paced environment.
  6. Our work has a direct impact on our clients’ top and bottom lines. With that in mind, our consultants need to have the acumen and eagerness to understand our clients’ business and how technology can solve their business problems. Expect questions that seek to assess your business acumen.

Once you join us, you can expect:

  1. Training on core skills that will allow you to ramp up quickly in our focus areas. Live POCs and mock projects to help you prepare for the real thing will be part of the training program
  2. Regular interactions with senior members of the company, including the founding members and executive team
  3. Working as part of a global team, interacting with different geographies and cross functional teams
  4. Opportunity to work on high profile client engagements very early in your career
  5. If you are based in North America, the opportunity to travel to different client locations, build up your mileage points and live the exciting life of a high-end consultant.

InfoTrellis is the ideal company for sharp-minded young professionals who want to start building the foundations of a rewarding and challenging career – if you find yourself thinking you want more than just a job, this is the place for you.

I wish you all the best in your search for a career.

The author is the VP for Consulting at InfoTrellis and is directly responsible for hiring, training and retaining all of our people across the US, Canada and India. Please feel free to provide comments and send any queries you may have to careers@infotrellis.com.

Topics: computer science employment hiring HR infoformation technology information management InfoTrellis IT jobs MDM Implementation


Posted by hmrizin on Monday, Apr 1, 2013 @ 11:07 AM

Yes. Social Media is important for business. Thanks to the analysts, advocates, industry experts and the zillion articles & blog posts on the topic.

Now the question is where to start and how to go about consuming Social Media Data. What are the steps involved? Are there any best practices, frameworks or patterns around consuming Social Media Data? This blog tries to answer some of these questions; keep reading…

Within the general IT community we’ve learned many lessons over the years that lead us to the conclusion that data is an enterprise asset and we need proper controls and governance over it.  Only then can we trust the results of analytics, use it in real-time processes with confidence and gain operational efficiencies.

Social media data, or external data in general, is no exception to this.  We can and must apply techniques we commonly apply in master data management, data quality management and enterprise data warehousing disciplines so that we can draw the most value possible out of social media data.  On top of that there are additional techniques to apply given the unique nature of this type of data.

This article proposes the concept of a centralized and managed hub of social media data, or “social media master”.  It addresses the following topics:

  1. Justification
  2. Data Acquisition
  3. Enrichment
  4. Data Quality and “Data Quality Measures”
  5. Relevance (i.e., finding the signal in the noise)
  6. Consumption (i.e., use of the data)
  7. Integration with other data and “Social Media Resolution”
  8. Governance

Justification

This article does not go into much detail on why you should centralize and manage social media data, and I would assume information management practitioners would accept this as the right thing to do.  Instead the question is “when is the right time to do it?”  It is natural to take a project-based and simple approach to managing social media data when starting out with only one or two initiatives.  But you do not want to fall into the trap of having multiple initiatives on the go, each with their own silo of data that is inconsistent, incomplete and of unknown quality.  That is reminiscent of the proliferation of project- and department-based data warehouses in the 1990s that many organizations are still trying to address in their enterprise data warehousing strategies.

Data Acquisition

The first and foremost task is to collect the social media data of interest.  There are of course different ways this can be done such as:

  1. Subscribing to the social media site’s streaming API (i.e., data is pushed to you).
  2. Using the social media site’s API (i.e., you pull data from them).
  3. Purchasing data from a third party provider.

All of the popular social media sites like Twitter and Facebook have well documented APIs, schemas and terms and conditions.

The method you choose will depend on the criteria and volume of data you expect.  For example, if you simply want all Tweets that mention your company name (or some other keywords) then perhaps subscribing to the social media site’s streaming API may be sufficient.  If you expect a large volume of Tweets and you want to go back in time then you will likely have to get that data from a third party provider.
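As a rough illustration of the first option (data pushed to you), the sketch below reads newline-delimited JSON from a streaming HTTP endpoint and hands each record to a landing zone. The endpoint URL, the bearer token and the store() helper are placeholders rather than a real API contract; consult the social media site’s documentation for the actual streaming interface and authentication.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class StreamingAcquisition {
    public static void main(String[] args) throws Exception {
        // Hypothetical streaming endpoint; real APIs also require OAuth or similar authentication.
        URL url = new URL("https://stream.example.com/statuses/filter?track=yourcompany");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Authorization", "Bearer <token>");

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    store(line); // persist the raw JSON record for later enrichment
                }
            }
        }
    }

    private static void store(String rawJson) {
        // Placeholder: write to a landing zone (file, queue, or staging table).
        System.out.println(rawJson);
    }
}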

Enrichment

All popular social media sites have a well-defined schema that describes the content of the data.  And the content for many includes the same basic data such as a user id (or handle), a name, location information, timestamp data and of course the actual social content such as the 140 characters of Tweet text.

This is raw data and should not be considered ready for consumption as you first need to apply quality functions to the data, measure the quality and also create relevance measures so you can find “signal in the noise”.  There is also opportunity to enrich the data, which not only helps with quality and relevance measures but also provides additional data that can be very useful in analytics.  Let’s look at a few examples.

The first example is enriching a Twitter user’s profile with gender information.  By analyzing the user’s name and handle it is possible to derive gender along with a confidence level.  It is not possible in all cases but is possible in many.  Gender is, of course, a very important dimension in analyzing data for many organizations.
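A minimal sketch of that idea, assuming a small reference map of first names to gender; a real implementation would use much larger reference data, consider the handle as well, and attach a confidence level to each derivation.

import java.util.HashMap;
import java.util.Map;

public class GenderEnricher {
    // Tiny illustrative reference map; production use would load a curated reference file.
    private static final Map<String, String> FIRST_NAME_GENDER = new HashMap<>();
    static {
        FIRST_NAME_GENDER.put("john", "M");
        FIRST_NAME_GENDER.put("mary", "F");
    }

    /** Returns "M", "F" or null when the first name is not in the reference data. */
    public static String deriveGender(String fullName) {
        if (fullName == null || fullName.trim().isEmpty()) {
            return null;
        }
        String firstName = fullName.trim().split("\\s+")[0].toLowerCase();
        return FIRST_NAME_GENDER.get(firstName);
    }
}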

A second example is analyzing the text of the social content itself.  For example, are there any mentions of brand names, product names or competitors?  A simple yet effective way is to use reference files of keywords and simple string matching to pull out this information.  Another way is to use more advanced natural language processing (NLP) and machine learning techniques, which are better suited to enriching the raw data with things such as sentiment and categories.
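The reference-file-plus-string-matching approach can be as small as the sketch below; the keyword list is purely illustrative, and an NLP pipeline would replace the simple contains() check.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MentionTagger {
    // Illustrative keyword reference; in practice this is loaded from maintained reference files.
    private static final List<String> BRAND_KEYWORDS =
            Arrays.asList("acme widgets", "widgetco", "acme pro 3000");

    /** Returns the keywords found in the social content, matched case-insensitively. */
    public static List<String> findMentions(String content) {
        List<String> hits = new ArrayList<>();
        if (content == null) {
            return hits;
        }
        String text = content.toLowerCase();
        for (String keyword : BRAND_KEYWORDS) {
            if (text.contains(keyword)) {
                hits.add(keyword);
            }
        }
        return hits;
    }
}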

A final example is with Four Square check-ins (over Twitter), which broadcasts a user’s location such as “I’m at Lowe’s Home Improvement (Mississauga, ON)”.  Different check-in services have different formats but Four Square check-ins are usually in the format of “I’m at <Store-Name> (<Place>, <State/Province>)”.  This is packed full of good information even though it is brief.  You can pull out not only city and state/province level information but also store level information that can be matched to a reference file of stores and used as a dimension in analytics.
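A hedged sketch of parsing that check-in format with a regular expression; the pattern covers only the quoted "I'm at <Store-Name> (<Place>, <State/Province>)" shape and would need adjusting for other check-in services.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CheckinParser {
    // Matches e.g. "I'm at Lowe's Home Improvement (Mississauga, ON)"
    private static final Pattern CHECKIN =
            Pattern.compile("I'm at (.+?) \\((.+?), (.+?)\\)");

    public static void main(String[] args) {
        Matcher m = CHECKIN.matcher("I'm at Lowe's Home Improvement (Mississauga, ON)");
        if (m.find()) {
            String store = m.group(1);    // candidate for matching against a store reference file
            String place = m.group(2);    // city-level dimension
            String province = m.group(3); // state/province-level dimension
            System.out.println(store + " | " + place + " | " + province);
        }
    }
}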

You have the ability with enrichment techniques to augment the raw data with additional data that is very useful both in downstream use (analytics and real-time processes) but also in subsequent activities for mastering the data.

Data Quality and Data Quality Measures

Data quality is an activity that cannot be ignored in any data management/integration exercise and the same applies to social media data.

Different quality functions can be applied depending on what data is available to you.  One simple example is analyzing free-form location information that can come in formats such as “Toronto”, “Toronto, Ontario”, “Toronto, On”, “Toronto, ONT”, “T.O.”, “Toronto Ontario”.  This data can be put into a standard format so it is consistent and uniform across the data set.
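A minimal sketch of that kind of standardization using a lookup of known variants; the variant map is illustrative only, and real data quality tooling would add fuzzy matching and reference gazetteers.

import java.util.HashMap;
import java.util.Map;

public class LocationStandardizer {
    private static final Map<String, String> VARIANTS = new HashMap<>();
    static {
        // All keys are normalized to lower case with punctuation stripped.
        VARIANTS.put("toronto", "Toronto, ON");
        VARIANTS.put("toronto ontario", "Toronto, ON");
        VARIANTS.put("toronto on", "Toronto, ON");
        VARIANTS.put("toronto ont", "Toronto, ON");
        VARIANTS.put("to", "Toronto, ON");
    }

    public static String standardize(String freeFormLocation) {
        if (freeFormLocation == null) {
            return null;
        }
        String key = freeFormLocation.toLowerCase().replaceAll("[^a-z0-9 ]", "").trim();
        // Fall back to the raw value when no variant is recognized.
        return VARIANTS.getOrDefault(key, freeFormLocation);
    }
}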

Given this is external data that is not under your control, it is very important not just to apply quality functions but also to measure the quality.  For example, what is your level of confidence that the city data refers to an actual city?  Your confidence in whether or not the user lives in that city is a different matter altogether, however.

When you create and manage a hub of social media data you can expect that multiple consumers will have different uses for the data.  They will therefore pick data that is appropriate for them and that they believe is “good enough” for their purpose.  This is why measuring the quality of the data is important, if not critical.

Relevance

One important quality measure is “relevance”.  Ultimately, relevance is contextual, because a set of data that is relevant to one user may not be relevant to another.  However, it is important to create a relevance, or qualification, score as a baseline.

By definition, “Qualification = Quantifying the confidence you have in the quality of the data”.  In simple terms the question that needs to be answered is “how confident are you that this particular social media content, such as a Tweet, is relevant for you?”  As an example, a Tweet that you’ve acquired that has nothing to do with your company, competitors, products or brands may have a low (or zero) relevance and therefore be “filtered out” from being used in downstream processing.

This is finding the signal in the noise and ensuring consumers use data that matters, which provides better business outcomes.
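As a rough illustration of a baseline qualification score, the sketch below combines a few of the signals discussed above into a 0-to-1 relevance value and applies a cut-off; the signals, weights and threshold are assumptions to be tuned per consumer, not a prescribed formula.

public class RelevanceScorer {
    /**
     * Very small illustrative scoring model: keyword hits and enrichment completeness
     * push relevance up. Weights and the 0.3 cut-off are arbitrary starting points.
     */
    public static double score(int brandKeywordHits, boolean hasLocation, boolean hasResolvedGender) {
        double score = 0.0;
        score += Math.min(brandKeywordHits, 3) * 0.25; // up to 0.75 from content relevance
        score += hasLocation ? 0.15 : 0.0;             // enriched location adds confidence
        score += hasResolvedGender ? 0.10 : 0.0;       // enriched gender adds confidence
        return Math.min(score, 1.0);
    }

    public static boolean isRelevant(double score) {
        return score >= 0.3; // records below the cut-off are filtered out of downstream processing
    }
}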

Consumption

The “managed” social media data can be consumed from a centralized hub once the data has been acquired, enriched, enhanced with quality and measured.

[Figure: social media hub data flow]

Just like an MDM hub, the consumers can be analytical or operational (real-time) in nature.  And just like an MDM hub, best practices should be followed in terms of security, audit, setting SLAs and having the right infrastructure components in place.

One major difference between an MDM hub and a social media hub is the level of trust and confidence in the data.  This is not a topic that can be ignored with MDM hubs but we are in a different game since we are dealing with external data versus internal data.  That is why enriching, measuring quality and measuring relevance of the data is critical.  It provides the ability for consumers to work with the data that is appropriate for their needs and tolerance levels.

It is also important to have a well-defined schema in place.  Much of the actual social media content is unstructured; however, there is structured data around it, and the enrichments and quality measurements are structured.  Just because it is well-defined doesn’t mean it has to be normalized into a fully typed relational model or object model.  What is most important is that there are basic structures in place to aid in consuming the data.

[Figure: social media hub schema]

Integrating with other data and Social Media Resolution

Social media data, just like reference data, master data and other types of data, is not an island.  It is when you combine qualified and relevant social media data with your internal data that you have huge business potential.

In some cases an organization may have Twitter handles or other social account identifiers in their MDM hub that they can use to join to a social media hub to see relevant activity.  But for most this is not the case and instead they would need to look at “social media resolution” as a technique to match and merge data.

Social media resolution comes in two forms:

  1. Resolving identities/accounts across social media services (e.g., a Twitter user to a Facebook user).
  2. Resolving social media identities/accounts to internal data (e.g., a Twitter user to a customer in an MDM hub or enterprise data warehouse).

This is a very different problem than matching “customers to customers” and different data points, techniques and technologies are required to make it happen.

It is not in the scope of this article to describe how to do it, however if you find yourself interested in hearing more details then you can contact me at riyaz@infotrellis.com to chat about it. We’ve been innovating in this area at InfoTrellis with the development of our AllSight big data platform and I’d be happy to talk to you at greater length on the subject.

Governance

All enterprise assets need to be properly governed and to govern something you need to first measure it.  Therefore it is important to capture and analyze key measures of the social media hub such as:

  1. New data acquired (how the data is changed)
  2. Quality of the data and how it is trending (we can do this since we measure the quality)
  3. Success in enriching the data
  4. Who is using the data and how are they using it

Below is an example of a Twitter Dashboard that is used to get insights into the key measures of a social media hub containing Twitter accounts and Tweets.

[Figure: example Twitter Dashboard]

Conclusion

The past has taught us that we need to be proactive and properly manage data as an enterprise asset if we want to get the most out of it and have confidence in what we get.  Social media data is no different than this and hopefully this article has provided you some insight into what a social media hub can look like and what it must do.

If you agree or disagree or want to chat further on the topic then please leave a comment or contact me directly!

Topics: allsight bigdata data aquisition data governance Data Quality data silos enrichment facebook InfoTrellis master data analytics Master Data Management mdm qualification relevance social media social media master social media schema tweets Twitter


Posted by lavanyaramkumar on Monday, Jan 7, 2013 @ 10:45 AM

In recent years reference data management (RDM) has slowly crept into the forefront of business decision-makers’ consciousnesses, making its way steadily upwards in priority within corporate goals and initiatives. Organizations are suddenly seeing the benefits of investing in RDM, attention grabbed by potential paybacks like smoother interoperability among various functions of the organization and centralized ownership and accountability in creating trustworthy data.

Before we dive into talking about approaches for implementation, I want to look at the potential significance of RDM in an enterprise. In today’s market, concepts like Data Integration, MDM and Business Intelligence tend to hog the spotlight. For these corporate initiatives, the primary data focus is key business information like customers, products or suppliers. It is equally important, however, to appreciate the fact that reference data plays a major role in organizing and comprehending all these key pieces of business data.

Whenever there is a change in reference data, the definition of that business data changes as well. That’s why it’s so important to invest meaningful effort into the maintenance of reference data, especially in any globally distributed network or where enterprises have diverse systems, each with their own localized data. Half-hearted maintenance of reference data degrades the quality of business data and results in misleading reports in BI and CRM initiatives.

Organizations looking for an efficient, quick and low-risk approach are shifting focus towards reference data management solutions. RDM allows different versions of codes to be managed from a central point, simplifies the creation of mappings between different versions, and enables transcoding of values across data sets. Cross-enterprise reference data can then be reconciled for application integration, business process integration, statutory and non-statutory reporting, compliance and BI analysis.

RDM treats reference data as a special type of master data and applies proven master data management (MDM) techniques including data governance, security, and audit control. A good RDM solution enables efficient management of complex mappings between different reference data representations and coordinates the use of reference data standards within an organization. The user interface to the RDM hub provides a centralized authorization and approval process, publishes data changes to enterprise systems, and handles exceptional situations.


Key Considerations for Implementation

1. Data Identification

Identifying common definitions and classifications across the organization and then generalizing a golden set of definitions is the first step to RDM success. The same data may have historically been maintained by several groups, directly or indirectly wasting resources like effort, budget and time, and starting with clear definitions is the best way to eliminate that waste. Some examples of common reference data issues resolved by clearer definition are:

Transaction Codes: Manufacturing or sales units of an organization can each have different “transaction codes”, requiring consolidation across several systems to communicate status rather than a single dashboard providing universal status for all departments.

 

HR Codes: A global organization having several sub-units or frequent mergers and acquisitions can struggle to unify its data on employees, resulting in a failure to leverage employee expertise across the organization and ultimately underutilizing current staff and spending money to look for the same skills externally.

 

Segment codes for sales & marketing: Maintaining a single version of zonal or segment code improves an organization’s ability to concentrate on markets with the greatest potential growth and ensure that focus is globally distributed instead of fixated on a single market segment.

 

Fixed Asset codes: Organizations often have multiple assets of the same type or category such as machinery, equipment, furniture, and real estate, and face difficulties in segmenting them universally to identify an accurate global financial status of the company.

2. Define Two-Fold Rule

Defining data rules and business rules is incredibly important. The first category focuses on validation rules, which can be as simple as data validations imposed by industry standards. The second category focuses on compliance with business processes and data governance objectives.

For instance,

Data Rules | Business Rules
NAICS Code is a 2-6 digit number | Hierarchy management and constraints within Code Sets
State Code must have an associated Country Code | Life cycle management for a Code Set from creation to distribution
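To make the data rules column concrete, here is a minimal Java sketch of the two validations in the table; the method names and the state-to-country lookup are illustrative only.

import java.util.Map;

public class ReferenceDataRules {
    /** NAICS codes are 2-6 digit numbers. */
    public static boolean isValidNaicsCode(String code) {
        return code != null && code.matches("\\d{2,6}");
    }

    /** A state code must be accompanied by (and belong to) a country code. */
    public static boolean isValidStateCode(String stateCode, String countryCode,
                                           Map<String, String> stateToCountry) {
        if (stateCode == null || countryCode == null) {
            return false;
        }
        // stateToCountry is an illustrative reference map, e.g. "ON" -> "CA".
        return countryCode.equals(stateToCountry.get(stateCode));
    }
}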

3. Know Your Integration Points

Centralized management of reference data within an enterprise will enable organizations to improve efficiency and provide strong data governance, security, audit & process controls and change management.

Unlike Master Data Management scenarios, where the MDM Hub can function as a replacement to legacy systems for single-source information, the RDM Hub is positioned at the center of the enterprise architecture with anywhere between one and a dozen source and distribution points. Significant consideration must be given to ensuring the seamless integration of data into downstream systems for real-time services and consistent operations. Well-defined integration processes and mechanisms at this stage will provide long-term returns for your organization.

4. Ease of Data Governance

Strategic data governance is absolutely essential in an RDM implementation. Without any sort of RDM, consolidation of data for internal or regulatory reporting must be achieved through an inefficient, labor intensive manual process. The benefit of having an RDM solution is that it removes the burden of maintaining reference data from what is usually several individual IT teams, transferring ownership to one data governance team with more visibility and control over the business rules around reference data.

Having an RDM solution facilitates the establishment of a lean, efficient data governance team that manages multiple versions of code, builds complex mapping and hierarchy, authorizes changes, manages data for reporting purposes, publishes changes to downstream systems, and manages a variety of other valuable tasks.

Ultimately, a powerful RDM solution can save a corporation from wasting significant amounts of valuable resources, and a proper implementation is key to that solution.

 

I hope this article has given you some valuable insight into successful RDM implementation. For more details on our RDM solutions and client RDM successes, feel free to contact me at lavanya@infotrellis.com or visit www.infotrellis.com

Topics: data governance Data Quality InfoTrellis Integration RDM Reference Data


Posted by David Borean on Wednesday, Nov 7, 2012 @ 3:08 PM

Intertwined fates

There has been an interesting shift in the MDM space over the last few years.  It wasn’t long ago that the most common question used to be “What is MDM?” – these days that question is instead “What are the best practices in implementing and sustaining MDM?”

There are best practices that have become common knowledge, one example being the practice of approaching MDM as a “program” and not a “project”, employing phased implementations that provide incremental business value.

Other best practices have yet to enter the mainstream; among them the absolutely essential practice of establishing MDM not in isolation but as part of a broader Data Governance program – a practice whose impact on long-term success should not be underestimated.  This is an approach that takes time to see the effects and understand the value of, which goes a long way towards explaining why it so often gets overlooked, especially in light of the fact that MDM is still a relatively young idea for many companies.  You can get MDM off the ground without Data Governance, but over time you will certainly feel the effects of gravity much more without it.

We understand that successful Data Governance will lead to better and higher value business outcomes by managing data as a strategic asset.  It is also widely recognized that a critical success factor in effective Data Governance is having the right metrics and insights into the data.  Taking it one step further, if you concede that master data is the most strategic data for many organizations (most people would), having the right metrics and insights into that master data is a must.

MDM requires Data Governance to be successful beyond the first phases of implementation – and Data Governance requires metrics and insights into master data to be successful.  So what are these required metrics and insights and where do they come from?

Metrics and insight

The most important metrics and insights about your master data are as follows:

What is the composition of your master data?

When you bring data in from multiple sources and “mesh” it together you’ll want to understand what the resulting “360 view” of that data looks like, as it will provide interesting insights and opportunities.  For example, on average how many addresses does each customer have?  How many customers have no addresses?  More than five addresses?  How many customers have both US and Canadian addresses?

How is your master data changing and who is impacting the change?

In any operational system you want to know how many new records have been added and how many existing records have been updated for different time dimensions (e.g., daily, weekly, monthly) and time periods.  In an MDM hub, you need to take this a step further and understand entity resolution metrics – such as how many master data records have been collapsed together and split apart.  Entity resolution is the key capability of an MDM hub responsible for matching and linking/merging records from multiple sources, and you therefore need on-going metrics on it in order to optimize it.

Furthermore, it is also important to understand what sources are causing the changes, given that master data records are composed of records from multiple sources.  Is the flow of information what you expect?

How are quality issues trending and where are they originating?

It is obviously important to know the current state of quality and how many issues are outstanding for resolution, aiding in your ability to address these issues in priority order.  It is, however, also important to see the bigger picture and be aware of how quality issues are trending over given time dimensions and time periods.  Ultimately you want to fix any data quality issues at their source, and in order to do this you will need to understand which of your sources are providing poor quality data to the MDM hub.

Take address data, for example.  You may detect that a number of address records in the MDM hub have a value of “UNKNOWN” for the city element. With proper Data Governance you are able to trace these values back to a particular source system, and from there address the issue at source.  The result is being able to see and track this particular quality issue trending downwards over time.

Not only does this help in increasing the quality of the data but can also be used to justify the existence of the MDM program, especially if you can put a unit cost to a quality issue (possible for some quality issues like bad addresses).  It is extremely difficult to put a price on data – but comparatively easy to put a cost on bad data.

How is quality issue resolution trending?

Ultimately you want to see new quality issues trending downwards over time,  but oftentimes you still need to deal with resolving existing quality issues.  It is important to be able to see if the overall quality of the MDM hub is increasing or decreasing.  As above, having metrics and trends on the resolution of issues measured against a unit cost is a valuable and meaningful resource for data governance councils to have in hand to justify their efforts.

Sometimes your quality issues can be resolved through external events, such as a customer calling to update their address that may have a “returned mail” flag on it.  Other times quality issues are resolved by data stewards.  Quality issue resolution trends help to understand not just the outstanding data stewardship workload but also their productivity, which is useful in team planning.

Who is using the master data, how are they using it and are you meeting their SLAs?

It is common for consumers of MDM hubs to grow over time until eventually there are many consuming front-end channel systems and back-end systems.  I’ve seen MDM implementations grow from one or two consumers in initial phases to many consumers across the enterprise, invoking millions and tens of millions of transactions a day against the MDM hub.  Understanding what workload each consumer is putting on the MDM hub, error rates and SLA attainment is essential information for a data governance council to have.  To give the most obvious example, having access to this information allows for capacity planning to ensure the MDM hub will continue to handle future workloads.

The missing link – where do the metrics and insight come from?

The key metrics and insights listed above are required for successful MDM and Data Governance.  But where do you get them from?  They are not something provided by operational MDM hubs, as the hubs themselves are focused on operational real-time management and use of master data.  It is not their duty to capture facts and trending information to support the analysis of master data that produces the metrics and insights. That’s more of an “analytical process”, and it doesn’t fit well within an operational hub.  Instead, what we’re talking about is the job of “Master Data Analytics”.

I define Master Data Analytics as the discipline of gaining insights and uncovering issues with master data to support Data Governance, increase the effectiveness of the MDM program, and justify the investment in it.

This has been a missing capability in the overall MDM space for some time now.  Some clients have addressed it by custom-building it and, even worse, some clients have done nothing at all.  Seeing firsthand the need for a solution to this universal stumbling block, our team began work some time ago on providing that solution. There is now a best-of-breed Master Data Analytics product by InfoTrellis called ROM that incorporates our experience from over 12 years of implementing MDM and delivers the metrics and insights required for success.

InfoTrellis ROM provides a set of analytics and reports that are configurable and extendible to support Data Governance – and you can think of it as a technology component in your overall MDM program that is an extension to your existing MDM hub.

One very big advantage of ROM is it allows you to capture your master data policies (e.g., quality concerns) and test them against your source system data prior to implementing MDM, providing initial snapshots of quality issues to prioritize and manage in the implementation.

Rather than talk about the product in much detail here, if you’re interested in more information on ROM or in seeing a set of sample reports, just check out the product page at http://infotrellis.wpengine.com/master_data_analytics_ROM.php.

Topics: data governance Data Quality InfoTrellis master data analytics master data governance master data reporting mdm


Posted by Khurram on Tuesday, Jul 31, 2012 @ 12:03 PM

Why upgrade?
Clients ask me all the time, why? Why do we need to upgrade? We are happy with the way the software is working. If it ain’t broke, don’t fix it!

This is what I tell them…
There are a variety of reasons why one might choose to upgrade. Usually, our business needs are evolving and as such we can take advantage of the new features in the product. Sometimes, it is not the business but rather our technical needs that are evolving, which push us towards an upgrade. Then again there are times when we upgrade not so much because of change but because the current version of the product has reached the end of its life cycle and will no longer be supported (see below).

Version | End of Service Date
IBM Initiate Master Data Service v6.x | December 31, 2009
IBM Initiate Master Data Service v7.0, v7.2 | December 31, 2010
IBM Initiate Master Data Service v7.5 | June 30, 2011
IBM Initiate Master Data Service v8.x; IBM Initiate Address Verification Interface v8.x | September 30, 2013
IBM Initiate Master Data Service v9.0, v9.2; IBM Initiate Address Verification Interface v9.0, v9.2 | September 30, 2014

Change for the sake of change is not necessarily a value-driven MDM strategy. While it is true that the MDM strategy in many organizations paves the way for other initiatives, it is also true that in order to derive the most value out of any initiative, we should have a holistic view of all changes so that a synergistic approach can be taken to decide which changes should be implemented. Any change should add to the overall synergy of the solution, not take away from it.

Let’s Initiate®
Like most software, there are no groundbreaking changes from one version to the next, but when we look at the breadth of changes across multiple versions, a strong case can be made for upgrading to the latest. Let’s look at a few upgrade scenarios that outline some of the major changes between the different releases.

Version 10.0

  • MDM Name and Packaging: Initiate v10 is part of the IBM InfoSphere MDM suite of products. The combined packaging provides easier methods to move from one platform to the other.
  • BPM Express: Configure workflow-based solutions that support data stewardship and data governance applications.
  • Automated Weight Management: Workbench guides the process of determining if the weights are appropriate for the data set, based on guidelines and rules developed by IBM’s data scientists.
  • Workbench Simplification: Modifications to weight generation, algorithm configuration, and threshold calculation functionality have been made to reduce time and simplify hub configuration and deployment.
  • GNR Name Variants Integration (embedded in v10): Same as 9.7, but GNR is now embedded in v10.
  • MDM Application Toolkit: Formerly Initiate Composer, the MDM Application Toolkit is a library of MDM application building blocks (business components or widgets) that make MDM capabilities available to end users. It helps development teams, customers, and partners accelerate the development of MDM-powered applications.
  • Event Notification: By using event notification, you can expose the changes made in the MDS to external applications or workflow systems (such as BPM Express v7.5, available with MDM v10).
  • Linkage Search: The Inspector tool now allows users to search for entities using a variety of different criteria (similar to task searches). This new functionality enables data stewards to inspect entities that have been autolinked or manually linked to verify the quality of the linkages.
  • Algorithm functions for Chinese names: The 10.0 release introduces the CHINESE standardization, comparison, and bucket generation functions to support searching, matching, and linking by the Master Data Engine.
  • Relationship Linker Performance: The batch relationship linker (RelLinker) process has been modified to improve scalability and performance.

Version 9.7

  • IBM Initiate Provider Direct (not part of standard edition): A web-based application which enhances IBM Initiate Provider by offering organization-wide access to data about physicians, care organizations, nurses, and other care providers, and supports more collaborative interaction between these provider groups.
  • Flexible Search: An additional search capability built into the Master Data Engine which is independent of the heritage search capability. Multiple query operation types are supported, for example wild cards, Boolean queries, inexact queries, range queries, etc.
  • GNR – Global Name Recognition (separate license): Provides a list of global name variants. Name variants can be used by the Master Data Engine during candidate selection to provide better matches.

Version 9.5

  • Performance Log Manager: The ability to monitor system performance is vital to alerting operations staff of potential issues or clues to resolving existing issues. The Performance Log Manager can easily capture MDS information during a given interval and output the results in a web-based report.
  • Multi-threaded Entity Manager: The entity manager has become a multi-threaded process for increased overall efficiency of the entity management process.
  • International Data Accuracy: Enhancements to algorithm functions have been made to increase MDS accuracy for comparing and linking international names, addresses, and phone numbers; specifically, for name parsing, custom phonetics, bucketing on partial strings, and date ranges. These solutions can also provide better accuracy for local data.
  • Initiate Composer: A unified development environment used to quickly build lightweight but robust applications to display, access, and manipulate the data managed by IBM Initiate® Master Data Service.

Version 9.2

  • Interceptor tool: Enhancements to the Interceptor tool enable speedier upgrade and maintenance of the Master Data Engine by recording interactions executed on one Engine and replaying those interactions on other Engines.
  • Entity Management by Priority: Customers can set the priority of records that enter the entity management queue (e.g. data from real-time systems are higher priority than data from batch systems).
  • Initiate Composer: A unified development environment used to quickly build lightweight but robust applications to display, access, and manipulate the data managed by IBM Initiate® Master Data Service.
  • Compensating Transactions: Compensating transactions via MemUnput and MemUndo interactions enable the rollback of a member insert or update. These interactions are available for the Java and .NET SDKs.

Version 9.0

  • Advanced issue management: Allows customers to implement and manage custom data issues, also called implementation defined tasks (IDTs) or custom tasks.
  • Initiate Enterprise SOA Toolkit: The 9.0 release introduces the Initiate Enterprise SOA Toolkit, which provides a Java object API and WebServices interface to the Initiate Master Data Service.
  • AES encryption and IPv6 support: To increase security, the Initiate Master Data Service now supports Advanced Encryption Standard (AES) and Internet Protocol (IP) v6.

Upgrade Recommendations

  • We recommend that all clients on version 8.x or earlier upgrade to version 10, the most recent version. While a targeted upgrade to version 9.5 or 9.7 is possible, moving to version 10 will give a longer span before the next upgrade is necessary.
  • Since version 10 is part of the IBM InfoSphere MDM suite of products, it provides certain advantages when moving between platforms. Almost 50% of the effort can be reused when migrating from one platform to the other.
  • All components listed under the Initiate MDS platform must be upgraded during any major or minor upgrade process, as per IBM’s recommendation.
  • Special consideration needs to be given to custom processes or code, e.g.:
    • Engine Callouts: provide the ability to intercept and modify existing Engine behavior either pre- or post-interaction.
    • Custom Search Forms: customized search screens in data steward tools (Inspector and/or EnterpriseViewer) need to have their customizations moved to the upgraded solution.
    • API code (Java or WebServices): ensure that existing functionality is unaffected.
  • To minimize downtime, we recommend doing as much of the upgrade in parallel as possible; however, some downtime will still be inevitable.

Are we ready? (to upgrade?)
We may not realize it, but staying with the current solution and upgrading to the latest are two distinct decisions, not one. Either decision will have a lasting impact on the vision and the mission of the organization. Regardless of the decision, the recommendation is that the approach we take should not only be vetted by industry experts but ideally be created with the help of those experts. The right experts can help you validate and evaluate so that, even if the decision is to stay with the current solution, the solution will not adversely impact the organization. They can ensure that the current solution is in line with the organization’s vision and mission. The right experts would have the Initiative needed for the organization to relate, evaluate, locate, link and identify subjects.

Decision…
At the end of the day, it really comes down to what we need today and what we might need tomorrow. Regardless of the multitude of new features, the question remains: are they enough to warrant an upgrade? Are we making substantial strides towards our goals? Are the new features relevant to our current and/or future business needs? Should we upgrade even if there isn’t a lot of value today? The decision rests solely with you.

However, regardless of the decision, we should keep this principle in mind:

“Change when change isn’t absolutely necessary does give us the luxury to plan, procure & implement not only what we need today but also what we will need tomorrow. On the other hand, change when change is absolutely necessary forces us to put a band-aid on the issue and just fix the problem(s) at hand.”

Who am I, and why am I saying this?
I am a Sr. Initiate Consultant at InfoTrellis and have a long history with Initiate MDS. I started working on the Initiate MDS platform before it became IBM Initiate MDS. I have seen the product grow from early versions with a limited feature set to the very mature and robust product it is today. I have worked with a number of healthcare and other clients over the years. On almost all of my client projects where Initiate was the client’s first MDM product, the clients were hesitant when they first started working with the product. As time went by, we saw the (proverbial) light bulb go on and the clients started to “get” the potential of what could be. A lot of the time it was hard to quantify every single iota of value before the project was implemented. However, in my experience, there was seldom a client that did not derive more value from the implementation than was initially targeted in the project scope.

Today, I am not directly connected to IBM but I am still very much involved in the MDM industry. I am also a strong and vocal supporter of the Initiate platform and the related services that developed during my tenure with Initiate and then IBM. These days I am working with a dynamic and smart group of MDM specialists at InfoTrellis to help organizations realize their true destination as they travel on their MDM Journey. (more details, later…)


Topics: Algorithm CDI Cleanup Data Data Quality Deterministic DQR EMPI Entities Finance Government Healthcare IBM Industry Infosphere InfoTrellis Initiate Leader Link Linkages Master Match mdm MDS MPI Probabilistic Remediation Retail Score Service Steward upgrade Why


Posted by manasa1991 on Tuesday, Jul 31, 2012 @ 7:56 AM

Welcome to the InfoTrellis blog! This is another channel through which we interact with our clients, prospects, students, and/or anyone who is interested in the Information Management space. In this blog, we will share our experience around the various services and products that we offer, and about the latest trends in our industry. In the coming weeks, you will see posts on a variety of topics including Master Data Management, Enterprise Data Integration, Big data and many more.

For those who know little about us, InfoTrellis is a boutique consulting company operating in the Information Management space. Founded by a team of technical architects who built one of the pioneering products in the Master Data Management (MDM) space, we are into strategic and tactical consulting in Master Data, Data Integration, Big Data and related domains. Our product team is working on some very exciting products, whose details we will share periodically here. To know more about us, please visit www.infotrellis.com

In this blog, you will see posts written by some of the most experienced people in the information management space, including our directors, principals, solution architects, senior consultants and other employees. We will answer many questions that our consultants come across during their engagements.

Looking forward to interesting comments, feedback, and healthy discussions through this blog!

Topics: InfoTrellis Welcome
