Posted by sathishbaskaran on Tuesday, May 12, 2015 @ 9:43 AM

MDM BatchProcessor is a multi-threaded J2SE client application used in most MDM implementations to load large volumes of enterprise data into MDM during initial and delta loads. Oftentimes, processing such large volumes causes performance issues during the batch processing stage, bringing down the TPS (Transactions Per Second).

Poor performance of the batch processor often disrupts the data load process and impacts go-live plans. Unfortunately, there is no panacea for this common problem. Let us help by highlighting some of the potential root causes that influence BatchProcessor performance; we will suggest remedies for each of these bottlenecks later in this blog.

Infrastructure Concerns

Any complex, business-critical enterprise application needs careful planning, well ahead of time, to achieve optimal performance, and MDM is no exception. During the development phase it is perfectly fine to host MDM, the DB server and BatchProcessor all on one physical server. But the world doesn’t stop at development. The sheer volume of data MDM will handle in production demands a carefully thought-out infrastructure plan. Besides, when these applications run in shared environments, profiling, benchmarking and debugging become a tedious affair.

CPU Consumption

BatchProcessor can consume a lot of precious CPU cycles on the most trivial of operations when it is not configured properly. Keeping an eye out for persistently high CPU consumption and sporadic surges is vital to ensure the CPU is used optimally by BatchProcessor.

Deadlock

Deadlocks are one of the most frequent issues encountered during batch processing in multi-threaded mode. Increasing the submitter thread count beyond the recommended value might lead to deadlocks.

Stale Threads

As discussed earlier, a poorly configured BatchProcessor might open up Pandora’s box. Stale threads can be a side effect of the thread count configuration in BatchProcessor. Increasing the submitter, reader and writer thread counts beyond the recommended numbers may cause some of the threads to wait indefinitely, wasting precious system resources.

100% CPU Utilization

“Cancel Thread” is one of the BatchProcessor daemon threads, designed to gracefully shut down BatchProcessor when the user requests it. Being a daemon thread, it stays alive for the entire lifecycle of the BatchProcessor. The catch is that it can hog nearly 90% of CPU cycles for a trivial operation, dragging down overall performance.

Let us have a quick look at the UserCancel thread in the BatchProcessor client. The thread waits indefinitely for user input, checking for it every 2 seconds while holding on to the CPU the whole time.

// The "Cancel" listener is started as a daemon thread wrapping the Runnable r
Thread thread = new Thread(r, "Cancel");
thread.setDaemon(true);
thread.start();

// Inside the Runnable, the thread polls standard input until shutdown is requested
while (!controller.isShuttingDown()) {
    try {
        int i = System.in.read();
        if (i == -1) {
            // Nothing to read; back off for 2 seconds before polling again
            try {
                Thread.sleep(2000L);
            } catch (InterruptedException e) {}
        } else {
            char ch = (char) i;
            // 'q' or 'Q' triggers a graceful shutdown of the batch run
            if ((ch == 'q') || (ch == 'Q')) {
                controller.requestShutdown();
            }
        }
    } catch (IOException iox) {}
}

BatchProcessor Performance Optimization Tips

We have so far discussed potential bottlenecks in running BatchProcessor at optimal levels. Best laid plans often go awry; what is worse is not having a plan at all. A well-thought-out plan needs to be in place before going ahead with the data load. Now, let us discuss some useful tips that can help improve performance during the data load process.

Infrastructure topology

For better performance, run the MDM application, the DB server and the BatchProcessor client on separate physical servers. This helps each component leverage system resources better.

Follow the best thread count principle

If there are N physical CPUs available to the IBM InfoSphere MDM Server instance that caters to BatchProcessor, then the number of submitter threads in BatchProcessor should be configured between 2N and 3N.

For example, if the MDM server has 8 CPUs, start profiling the BatchProcessor by varying its submitter thread count between 16 and 24. Do the number crunching, keep an eye on resource consumption (CPU, memory and disk I/O) and settle on a thread count that yields optimal TPS in MDM.

 

You can modify the Submitter.number property in Batch.properties to change the Submitter thread count.

For example:

Submitter.number = 4

Running Multiple BatchProcessor application instances

If the MDM server is beefed up with enough resources to handle a huge number of parallel transactions, consider parallelizing the load process by dividing the data into multiple chunks. This involves running two or more BatchProcessor client instances in parallel, either on the same or on different physical servers depending on the resources available. Each BatchProcessor instance must work with a separate batch input and output; however, they can share the same server-side application instance or operate against dedicated instances (each BatchProcessor instance pointing to a different application server in the MDM cluster). This exercise increases the TPS and lowers the time spent in the data load.
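To illustrate the chunking itself, here is a minimal sketch that splits a batch input file into a fixed number of smaller files, one per BatchProcessor instance. It assumes one record per line; the file names and chunk count are illustrative and should be adapted to your actual batch input format.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: split a batch input file into N chunks, one per BatchProcessor instance
public class BatchInputSplitter {
    public static void main(String[] args) throws IOException {
        String inputFile = "batchinput.txt"; // illustrative name; one record per line assumed
        int chunks = 2;                      // one chunk per planned BatchProcessor instance

        List<PrintWriter> writers = new ArrayList<PrintWriter>();
        for (int i = 0; i < chunks; i++) {
            writers.add(new PrintWriter(new FileWriter("batchinput_part" + (i + 1) + ".txt")));
        }
        try (BufferedReader reader = new BufferedReader(new FileReader(inputFile))) {
            String line;
            long count = 0;
            while ((line = reader.readLine()) != null) {
                // Round-robin the records across the chunk files
                writers.get((int) (count++ % chunks)).println(line);
            }
        } finally {
            for (PrintWriter writer : writers) {
                writer.close();
            }
        }
    }
}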

Customizing the Batch Controller

Well, this one is a bit tricky: we are modifying the out-of-the-box (OOTB) behavior here. Let us go ahead and do it, as it really helps.

  • Comment out the following snippet in the runBatch() method of the BatchController class:

  //UserCancel.start();

  • Recompile the modified BatchController class and repackage it in the jar
  • Replace the existing DWLBatchFramework.jar under <BatchProcessor Home>/lib with the new jar containing the modified BatchController class
  • Bounce the BatchProcessor instance and verify the CPU consumption

Manage Heap memory

Memory consumption may not be a serious threat for BatchProcessor on its own, but on servers that host multiple applications alongside BatchProcessor, the effective memory that can be allocated to it could be very low. If high memory consumption is observed during the data load process, allocating more memory to BatchProcessor helps ensure a smooth run. In the BatchProcessor invocation script (runbatch.bat in Windows environments, runbatch.sh in UNIX environments), there are a couple of properties that control the memory allocated to the BatchProcessor client.

set minMemory=256M

set maxMemory=512M

It is recommended to keep minMemory and maxMemory at 256M and 512M respectively. On higher-end infrastructure, minMemory and maxMemory can be increased accordingly. Again, remember to profile the data load process and settle on optimal numbers.
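For instance, on a higher-end host the same entries in the invocation script might be raised along these lines (illustrative values only; profile the load before settling on them):

set minMemory=512M

set maxMemory=1024M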

Reader and Writer Thread Count

IBM recommends keeping the Reader and Writer thread counts at 1. Since these threads perform lightweight tasks, this BatchProcessor configuration should suit most needs.
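In Batch.properties this means leaving the reader and writer counts at 1. The property keys below are assumed to follow the same naming pattern as Submitter.number; verify the exact keys against the Batch.properties shipped with your version:

Reader.number = 1

Writer.number = 1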

Shuffle the data in the Input File

By shuffling the data in the input file, the percentage of similar records (records with a high probability of being collapsed/merged in MDM) processed at the same time can be brought down, avoiding long waits and deadlocks.
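A minimal sketch of such a shuffle is shown below. It assumes one record per line in the batch input file; the file names are illustrative:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

// Minimal sketch: randomize the record order in a batch input file
public class BatchInputShuffler {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("batchinput.txt");          // illustrative name; one record per line assumed
        Path output = Paths.get("batchinput_shuffled.txt");

        List<String> records = Files.readAllLines(input);  // reads the whole file into memory
        Collections.shuffle(records);                       // spread similar records apart
        Files.write(output, records);
    }
}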

Scale on the Server side

Well, well, well. We have strived hard to make the BatchProcessor client perform at optimal levels. Still seeing poor performance and very low TPS? It is time to look into the MDM application itself. Though optimizing MDM is beyond the scope of this blog, here is a high-level action plan to work on.

You can:

  1. Increase the physical resources (more CPUs, more RAM) for the given server instance
  2. Host MDM in a clustered environment
  3. Allocate more application server instances to the existing cluster that hosts MDM
  4. Dedicate a cluster with enough resources to MDM rather than sharing a cluster with other applications
  5. Log only critical and fatal errors in MDM
  6. Enable SAM and performance logs in MDM and tweak the application based on the findings

Hope you find this blog useful. Try out these tips the next time you are working on a BatchProcessor data load and share how they work for you. I bet you’ll have something to say!

If you are looking for specific recommendations on BatchProcessor, feel free to contact sathish.baskaran@infotrellis.com. Always happy to assist you.

Topics: InfoTrellis Master Data Management MasterDataManagement mdm mdm hub MDM Implementation
Posted by manasa1991 on Monday, May 11, 2015 @ 5:36 PM

Calvin: “You can’t just turn on creativity like a faucet. You have to be in the right mood.”
Hobbes: “What mood is that?”
Calvin: “Last-minute panic.”

Okay, apologies for the unscheduled delay on the follow-up post. Let’s get back to discussing how we manage our MDM projects.

In my previous post, we talked about the first two stages of the “InfoTrellis SMART MDM Methodology”, namely “Discovery and Assessment” and “Scope and Approach”. In these two stages, the activities revolve around understanding business expectations, helping clients formulate their MDM strategy, and helping them identify the scope of an MDM implementation along with defining the right use cases and the optimal solution approach. I also mentioned that we generally follow a “non-iterative” approach in these stages, as this helps us build a solid foundation before we go on to the actual implementation.

Implementation:

Once scope of an MDM project is defined and client agrees to the solution approach, we enter the iterative phases of the project. We group them into two stages in our methodology:

  1. Analysis and Design
  2. Development and QA

Through these stages, we perform detailed requirements analysis, technical design, development and functional testing across several iterations.

Requirements Analysis:

At this stage of the project, high-level business requirements are already available and we must start analyzing and prioritizing which requirements go into which phase. For Iteration 1, we typically take up all foundational aspects of MDM such as data model changes, initial Maintain services, the ETL initial load and related activities. An MDM product consultant will interpret the business requirements and work with the technical implementation leads to come up with:

  1. Custom Data Model with additions and extensions, as per project requirements
  2. Detailed data mapping document that captures source-to-MDM mapping for services as well as the initial load (one-time migration) – data mapping is tricky; there will be different channels through which data is brought into MDM. All of these channels need to be identified and specific mappings completed for each of them; doing this right helps avoid surprises at a later stage
  3. Functional Requirements for each of the features – Services, Duplicate processing and so on

Apart from the requirements analysis, work on the “Requirements Traceability Matrix” should start at this stage. This is the one document that captures traceability of requirements to test cases and will come in handy throughout the implementation.

Design:

Functional requirements are translated into detailed technical design for both MDM and ETL. Significant design decisions are listed out, the object model and business objects are designed, and detailed sequence diagrams are created. Similar sets of design artifacts are created for the ETL components as well. The key items worked on during the design phase are:

  • Significant use cases – From a technical perspective, functional use cases are interpreted so the developer has a better grip on the use cases and how they connect together to form the overall solution
  • Detailed design elements – Elaboration on each technical component so the development team only has to translate what is designed into MDM code or ETL components
  • Unit test cases – The technical lead plans unit test cases so that 360-degree coverage is ensured during unit testing and most of the simple unit-level bugs are caught early

Within the sphere of tools that we use, if unit test automation is possible we do that as well.

Development:

MDM and ETL development happen in most of our projects. Apart from IBM’s MDM suite, we also work on a spectrum of ETL tools such as IBM DataStage, Informatica Power Center, SAP PI, IBM CastIron, Talend ETL, and Microsoft SSIS. Some aspects that we emphasize across all our projects are:

  • Coding standards – MDM and ETL teams have their respective coding standards, which are periodically reviewed as per changes in product releases and technological changes in general. Developers are trained to follow these standards when they write code
  • Continuous Integration – Most of our clients have SVN repositories and our development teams actively use these repositories so the code remains an integral unit. We also have local repositories that can be used when the client does not have a repository of their own and explicitly allows us to host their code in our network
  • Peer code review – Every module is reviewed by a peer who acts as another pair of eyes to bring in a different perspective
  • Lead code review – Apart from peer review, code is also reviewed by the tech lead to ensure development is consistent and error-free
  • Unit testing – Thorough unit testing is driven off the test cases written by development leads during the design phase. Wherever possible, we also automate unit test cases for efficiency

With these checks and balances the developed code moves into testing phase.

Testing:

The QA lead comes up with a comprehensive test strategy covering functional, system, performance and user acceptance testing. The types of testing we participate in differ from project to project, based on client requirements. We typically take up functional testing within the iterative implementation phase; the rest are done once all functional components are developed and tested thoroughly.

Functional testing is driven off the functional requirements. Our QA lead reviews the design as well to understand the significant design decisions, which helps in creating optimal test scenarios. Once the requirements and design documents are reviewed, detailed test scenarios and test cases are created and reviewed by the Business Analyst to ensure sufficient coverage. A mix of manual and automated testing is performed based on the scope allowed in the project. The functional testing process involves the following:

  • Test Definition – Scenarios / cases created, test environments identified, defect management and tracking methodology established, test data prepared or planned for
  • Test execution – Every build is subject to a build acceptance test, and upon being successful, the build is tested in detail for functionality
  • Regression runs – Once we enter defect-fixing mode, multiple runs of (mostly automated) regression tests are performed to ensure that the test acceptance criteria are met
  • Test Acceptance – Our commitment is to provide a thoroughly tested product at the end of each iteration. For every release, we ensure all severity 1 and severity 2 defects are fixed, and low severity defects if deferred are documented and accounted for in subsequent releases.

Deployment:

In the deployment stage, we group the following activities together:

  1. System, UAT and performance testing – All aspects of testing that see the implementation as a single functional unit are performed
  2. MDM code deployment – MDM code is deployed in the production environment, and delta transactions (real time or near real time) are started
  3. One-time migration or initial load to MDM – Data is extracted from the various source systems, transformed and loaded into MDM as a one-time exercise

Deployment is critical as it is the culmination of all work done up to that point in the project. This is also the point at which the MDM system gets exposed to all other external systems in the client organization. If MDM is part of a larger revamp or a bigger program, there will be many other projects that need to go live or be deployed at the same time. To ensure the deployment is successful, the following key points should be considered:

  • Identify all interconnecting points and come up with a system plan that covers MDM and all integrating systems
  • If applicable, participate actively in program-level activities as well to ensure the entire program accounts for all the items built as part of the MDM project
  • The initial load happens across many days, mostly in 24-hour cycles. Come up with a clear plan, team, roles and responsibilities, and if possible perform a trial/mock run of the initial load

There is typically a post deployment support period and in this period we monitor the MDM hub to ensure master data is created as planned. If needed, optimizations and adjustments are made to ensure that the MDM hub performs as desired.

Once deployment is successfully completed, don’t forget to celebrate with the project team!!!

Topics: Master Data Management MasterDataManagement MDM Implementation


Posted by kevinwrightinfotrellis on Wednesday, Mar 5, 2014 @ 1:25 PM

What big changes does this upgrade bring?

IBM brought together Initiate Master Data Service (MDS), InfoSphere MDM Server (MDM) and InfoSphere MDM Server for PIM into a single market offering as InfoSphere MDM v10.  The market offering contained four editions: standard, advanced, collaboration and enterprise.

In InfoSphere MDM v11, IBM further unified the products from a technology perspective.  Specifically, the legacy Initiate MDS and MDM Server products were combined together into a single technology platform.

This is a significant achievement that positions IBM to address the much-talked-about “MDM Journey”. It allows clients to start with a Registry style (or “Virtual Hub”), which is easier to start with, and then transition to a Hybrid or Centralized style (or “Physical Hub”). The key differentiator is the true implementation of the Hybrid style.

The whole product has been re-architected under the covers to use the OSGi framework, which is different from the old EAR-based process, and comes with a host of new technological features and promises.

Other changes & new features include:

  • Enhanced MDM & DataStage integration
  • Expanded Patient Hub feature for medical applications
  • Through IBM PureSystems, it should be easier than ever to get up and running with MDM 11
  • InfoSphere MDM v10 introduced the Probabilistic Match Engine (PME) in Advanced Edition.  This was the embedding of Initiate MDS’s matching engine into MDM Server.  This capability has now been surfaced up into a “Probabilistic Search Service”, an alternative to the deterministic search traditionally offered with MDM Server
  • For Weblogic Server clients, unfortunately Weblogic is no longer supported and a migration to WAS is required (due to the OSGi support)

What problems does this upgrade solve?

Version 11 promises to deliver improved efficiency by integrating the standard and advanced editions – basically combining the traditional MDM and the Initiate Master Data Service – which means a number of duplicated functions are removed. There have also been some batch processor improvements.

Security is now on by default, which of course helps to minimize potential future issues and ensure that only the people who need to see the data can see the data.

In general, though, this upgrade is less about solving “problems” than it is about moving forward and enhancing existing efficiencies and strengths.  This upgrade is an evolution more than a revolution.

What’s the real value of this upgrade from a technology perspective?

To an implementer, the OSGi framework is such a different way of looking at the MDM product compared with the old EAR-based system that it is worth working with this upgrade just to get an early start on familiarizing yourself with the new technology. While still maturing in the IBM MDM product, it promises faster and more dependable deployments, dependency management, and a modular code structure. It comes with the ability to start and stop individual modules, or upgrade them without shutting down the whole application. This can lead to much improved uptime for the MDM instance(s).

It’s also worth noting that for a company on the IBM stack, the improved integration with products like DataStage can really increase the value of this product to the enterprise.

How much effort is it going to take to implement?

IBM has held strong to their “Backwards Compatibility” statements, which is key in upgrade projects. However, given the technology change with OSGi, this upgrade will take a little more effort than moving up to, say, 10.1. We’ve seen a number of PMRs, etc., as is to be expected from a new release, particularly one on new technology. Fortunately, InfoTrellis has been involved in a good number of installation and product-related PMRs and has experience working with both IBM and clients to resolve them quickly.

What if I’m running a much older version?

MDM 8.5 goes out of service on 30 April 2014 and 9.0.2 goes out of service on 30 April 2015. As for any prior versions, there is value in moving to more current versions of the database and WAS, not just MDM. OSGi looks well positioned to be used across the board in the near future considering all of the advantages it provides; so again, it’s good to get your hands on it and start learning to work with it sooner rather than later.

What about Standard Edition (Initiate) users?

Organizations currently using Standard Edition (Initiate) will be significantly impacted by MDM version 11, because this upgrade means they will have an entirely new technology platform to migrate to, which includes the WebSphere Application Server.

The biggest advantage this release provides to existing Standard Edition users is the ability to implement true hybrid scenarios.  One scenario, for example, is being able to persist a composite view of a “virtual entity” to a “physical entity”.  This can realize performance advantages if the virtual entities are made up of many member records.  Also, there is then the ability to decorate the physical entity with additional attributes that come in the Advanced Edition platform such as Privacy Preferences, Campaign History and Hierarchies to name a few.  This scenario allows an organization to progress along their MDM journey if they have requirements to do so.  This article doesn’t address any licensing impacts to leverage Advanced Edition features.

The Advanced Edition (or “physical MDM”) capabilities are very feature-rich and couple very well with Standard Edition (or “virtual MDM”). However, with that said, it is very important for clients that want to transition from Standard to Advanced Edition to leverage partners that have expertise in both of those platforms.

If I implemented MDM very recently, should I upgrade?

If you’re currently using MDM 10.x, it might not be worth the effort to upgrade immediately if the implementation just took place. It is worth reiterating, though, that v11 and the OSGi framework are the way of the future from an implementation standpoint.

How does this impact a business-end user?

Working with a more modern MDM means less need to upgrade in the future, and future upgrades using OSGi are easier to implement. Version 11 comes with an increased feature set – Big Data, virtual/physical MDM, and so on – that allows much better creation of business value from the data you already have. Increased or improved integration with other products, like InfoSphere Data Explorer or InfoSphere BigInsights, is another big plus for those already invested in IBM products.

How does this impact an IT user?

A number of things stand out:

  • Improved performance from MDM 11, WAS 8.5, newer versions of DB2/Oracle
  • OSGi
  • Improved MDM 11 workbench
  • Much smaller code base to track – just customized projects – the end result being a much smaller deployable artifact
  • Enforced security
  • Streamlined installation – basically same for workbench and server which helps to improve the experience for the developer who also performs installation
  • Batch processor improvements
  • Initiate users gain the benefits of
    • Event notification
    • Address standardization via QS

What unique insight into this upgrade does InfoTrellis have that other vendors don’t have?

Put quite simply, experience: we are already ahead of the game, being one of the first implementers in North America, if not the world, to participate in upgrade and implementation efforts for MDM 11. We’re also able to leverage our volumes of experience with prior versions. InfoTrellis is involved in a number of MDM 11 projects already – both upgrades and new implementations – on a variety of operating systems (Linux, Solaris, AIX) and databases (Oracle, DB2).

 

If you’re looking into upgrading your MDM, give us a shout. Reach out to my colleague Shobhit at shobhit@infotrellis.com to talk to the foremost MDM experts about how we can help you with your implementation.

Topics: IBMMDM MasterDataManagement mdm MDM11 OSGi WebSphere
