Multithreaded MRP and related issues

Today I will share some experience with setting up and fine-tuning MRP for best performance.

The most significant MRP-related improvement in DAX2009 over older versions is the ability to execute MRP in multi-threaded mode, where the planning process is spread over several independent threads executed in parallel. As usual, multi-threaded execution works only when MRP is run in batch mode, since helper threads are spawned as additional tasks inside the same batch job. This is the current way to support multi-threaded execution in Axapta.

Basic setup

The number of helper threads is specified on the second tab (Scheduling Helpers) of the MRP start menu parameters. How many helpers should be used? In my experience, MRP scales very well. On my current project we use 15 helpers (plus 1 main thread). We tried increasing the number of helpers to 31, but found no significant performance difference. Generally, it depends on the configuration of your batch server(s) and your database server. When you increase the number of helper threads, check resource utilization on both the batch server and the database server. If one of these servers becomes saturated after yet another increase, it is time to stop or, maybe, time to upgrade your hardware.

I also want to mention that I have had positive experience spreading MRP across several batch servers. My personal recommendation is as follows. First, check how many CPU cores each batch server participating in MRP has. Then configure every batch server to run 2 x Number_Of_Cores batch threads (in Administration->Setup->Server Configuration). Then set the number of helpers to the total number of batch threads on all batch servers serving your MRP batch group MINUS 1. Remember that the main batch thread also consumes one thread from your batch server group capacity. Also, if you are going to run other batch processes on the same batch server(s) during the MRP run, you may want to decrease the number even further and set the number of helper threads to the total thread capacity minus 2 or 3.
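The sizing rule above can be written down as a few lines of arithmetic. The following Python sketch only illustrates that rule of thumb; the function name, its parameters and the example core counts are assumptions made for this example and are not part of AX.

```python
def recommended_helpers(cores_per_server, reserved_for_other_jobs=0):
    """Helper count suggested by the rule of thumb above.

    cores_per_server: CPU core count of each batch server in the MRP batch group.
    reserved_for_other_jobs: batch threads kept free for other batch jobs
    running during MRP (hypothetical parameter for this sketch).
    """
    # Each batch server is configured to run 2 x cores batch threads.
    total_batch_threads = sum(2 * cores for cores in cores_per_server)
    # One thread of the total capacity is consumed by the main MRP thread.
    helpers = total_batch_threads - 1 - reserved_for_other_jobs
    return max(helpers, 0)


# Two hypothetical batch servers with 4 cores each: 16 batch threads in total.
print(recommended_helpers([4, 4]))     # 15
print(recommended_helpers([4, 4], 2))  # 13
```

The result is only a starting point; the saturation check on the batch and database servers described above still decides the final number.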

The next important parameter to discuss is the Item Distribution factor. What does it mean? In the very first stages of MRP (in the Update phase, to be precise), the system allocates all items (all items in inventTable, or all items in inventTable that match the query specified in the MRP startup dialog box) into chunks. Every chunk is a unit of work that is processed by one thread. During the phases of MRP that are executed in parallel (Data regeneration, Coverage planning, Futures planning and Action Messages), every thread grabs a chunk, processes it, then grabs another chunk, processes it and so on, until all chunks for the given stage and BOM level are marked as processed.

Size of the chunk is calculated as Item_Distribution_Factor*(Number_Of_Helpers+1).
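As a quick illustration of this formula, here is a small Python sketch. The function names and the sample numbers are assumptions for the example, and the chunk count ignores the fact that chunks are built per BOM level, so treat it as a rough estimate only.

```python
import math


def chunk_size(distribution_factor, helpers):
    # Formula as stated above: factor * (helpers + 1 main thread).
    return distribution_factor * (helpers + 1)


def chunk_count(total_items, distribution_factor, helpers):
    # Chunks needed to hold all items; the last chunk may be partially filled.
    return math.ceil(total_items / chunk_size(distribution_factor, helpers))


# 15 helpers and a distribution factor of 2 give chunks of 32 items.
print(chunk_size(2, 15))          # 32
# 15000 items then need 469 chunks.
print(chunk_count(15000, 2, 15))  # 469
```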

On the one hand, smaller chunks ensure smooth processing and an even allocation of items between threads. The smaller the chunk, the lower the chance that a thread grabs a chunk with a high number of computationally intensive items. Say one item takes from 1 to 10 seconds to be coverage planned. If the chunk size is only 1 item, then in the worst case the latest thread finishes BOM level coverage only about 10 seconds after the earliest thread. If the chunk size is 300 items, then in the worst case the difference in processing time is 300*10Seconds-300*1Second==2700Seconds==45Minutes. In this worst case there is a good chance that most helper threads will do nothing for 30-45 minutes, waiting for the last, unlucky thread to finish. This increases planning time and leads to non-optimal use of hardware, since most threads idle while waiting for the end of the BOM level.

On the other hand, allocating a chunk to a thread is a competitive process that leads to temporary database locks. Several threads often try to allocate the same chunk in parallel; only one of them succeeds, while the others repeat the allocation process until they grab a chunk of their own. So chunk allocation can itself become a bottleneck if the number of chunks is too high and the chunk size is too small.

In my experience, a reasonable chunk size is somewhere between 10 and 60 items. To find the optimum chunk size and distribution factor, you can simply run several tests with different distribution factors. You can also check the ‘Track Item Process Duration’ checkbox in the MRP dialog parameters and then look at typical item planning times in Master Planning->Inquiries->Unfinished Scheduling Processes->Inquiries->Item Process Duration. If item process duration varies a lot, you can benefit from a smaller chunk size; if it does not, a larger chunk size is probably more beneficial.
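The worst-case arithmetic above is easy to replay for other chunk sizes. This Python sketch assumes, as in the example, that every item in the slowest chunk takes the maximum time and every item in the fastest chunk takes the minimum time; the function name is made up for the illustration.

```python
def worst_case_spread_seconds(chunk_size, min_item_seconds, max_item_seconds):
    """Largest possible gap between the fastest and the slowest chunk."""
    slowest_chunk = chunk_size * max_item_seconds
    fastest_chunk = chunk_size * min_item_seconds
    return slowest_chunk - fastest_chunk


# Items take between 1 and 10 seconds to plan, as in the example above.
for size in (1, 30, 60, 300):
    spread = worst_case_spread_seconds(size, 1, 10)
    print(size, "items per chunk:", spread, "seconds")
```

For 300 items per chunk this reproduces the 2700 seconds (45 minutes) mentioned above. Real runs will normally be far from the worst case, so the tracked item process durations remain the better guide.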

Another potential way to improve performance (I have never tried it, though) is to put randomly ordered items into a chunk. Currently, when the system creates chunks, it simply iterates over inventTable ordered by itemId, so a chunk contains items with similar itemIds. Since items with the same planning complexity often have sequential itemIds, this frequently leads to an uneven distribution of items between chunks. Some chunks consist of regular purchased items, while other chunks consist of complex BOM items, which require complex resource planning on many work center groups. If you add a special MRPOrder field to inventTable, fill it with a random number during item creation and then sort by this field during chunk generation, you can get a more even distribution of items between chunks.
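To show why random ordering could help, here is a toy Python model. It is not AX code: the item counts, the per-item planning times and the fixed random seed are all invented for the illustration, and shuffling a list merely stands in for sorting by a random MRPOrder value.

```python
import random


def chunk_costs(item_costs, chunk_size):
    """Total planning seconds of each consecutive chunk of items."""
    return [sum(item_costs[i:i + chunk_size])
            for i in range(0, len(item_costs), chunk_size)]


# Hypothetical item master ordered by itemId: 80 purchased items (1 s each)
# followed by 20 complex BOM items (10 s each).
costs_by_item_id = [1] * 80 + [10] * 20

# Stand-in for ordering by a random MRPOrder value assigned at item creation.
costs_by_random_order = costs_by_item_id[:]
random.Random(42).shuffle(costs_by_random_order)

sequential = chunk_costs(costs_by_item_id, 10)
shuffled = chunk_costs(costs_by_random_order, 10)

print("by itemId:   heaviest", max(sequential), "lightest", min(sequential))
print("random order: heaviest", max(shuffled), "lightest", min(shuffled))
```

With items ordered by itemId, the two chunks holding only BOM items cost 100 seconds each while the others cost 10 seconds; after shuffling, the same total work is spread more evenly across chunks.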

Infant mortality issue

A frequent problem with parallel MRP runs is early termination of helper threads. Say we started our MRP as usual, but 15 minutes later we see that all our helper batch tasks have terminated with the strange message “Nothing to process”. (You can see this message if you click the Log button in the Batch Task form before the main thread of the MRP batch terminates.) After that, only the main thread remains, and it continues to run MRP in single-thread mode (very slowly).
Here is what is happening:

Main thread spawns helper threads and continues its work
Helper threads continuously check the process status, waiting for it to advance to “Deleting and inserting data”
Main thread deletes net requirements and planned orders for the plan being regenerated (this usually takes a lot of time)
Helper threads, after waiting 15 minutes for a status update, terminate with the “Nothing to process” message
Main thread runs MRP in single-thread mode

First of all, the standard MRP logic has an obvious bug in helper thread behavior. While waiting for a status update from the main thread, each helper keeps reading the process status table (reqProcessList) continuously, without any pause between SELECT statements. It is easy to figure out that, say, 16 threads, each issuing 70-100 statements per second, will send about 1500 statements per second to the DB server. This volume of queries can easily consume an average DB server’s capacity, leaving it no time for useful work (like deleting net requirements in the main thread). To prevent this behavior, you need to modify the run() method of the ReqProcessExternalThread class, which contains the ill-fated status waiting loop. Find the end of the while(true){} loop and insert a sleep(500); statement there. This adds a 0.5 second delay between re-reads of the process status, giving your database server time to breathe.
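The effect of the pause on database load is simple arithmetic, sketched here in Python. The function names are made up for the example, and the throttled figure is an upper bound that ignores the time each SELECT itself takes.

```python
def status_reads_per_second(threads, reads_per_thread_per_second):
    # Load when every waiting thread polls without any pause.
    return threads * reads_per_thread_per_second


def throttled_reads_per_second(threads, sleep_ms):
    # With a pause between reads, each thread issues at most 1000 / sleep_ms reads per second.
    return threads * 1000 / sleep_ms


# 16 threads at 70-100 reads per second each.
print(status_reads_per_second(16, 70), status_reads_per_second(16, 100))  # 1120 1600
# The same 16 threads with a 500 ms pause between reads.
print(throttled_reads_per_second(16, 500))  # 32.0
```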

Another issue to fix is the low timeout value for helper threads. If you have a realistically large working set of net requirements, deleting the plan will take much more time than the 15 minutes expected by the developers of this functionality. To increase the timeout, open the class declaration of the same class (ReqProcessExternalThread) and modify the definition of the WAITFORPROCESSSTATUSUPDATE macro. Try changing the value of the macro from 15 to at least 60, or maybe even 90.
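The two fixes together amount to a polling loop with a pause and a generous deadline. The Python sketch below models that pattern only; it is not the X++ code of ReqProcessExternalThread, and every name in it (wait_for_status, FakeClock, the "waiting" status) is an assumption made for the illustration. A fake clock is used so the example runs instantly.

```python
import time


def wait_for_status(read_status, expected, timeout_seconds, poll_seconds=0.5,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll read_status() until it returns expected or the timeout expires.

    Returns True when the status was reached and False on timeout, which is
    the situation where a helper gives up with "Nothing to process".
    """
    deadline = clock() + timeout_seconds
    while True:
        if read_status() == expected:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)  # pause between reads instead of polling non-stop


class FakeClock:
    """Simulated time, so the example does not really wait."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def simulate(timeout_minutes, delete_minutes=40):
    """Helper outcome when the main thread needs delete_minutes to delete the plan."""
    fake = FakeClock()

    def read_status():
        done = fake.time() >= delete_minutes * 60
        return "Deleting and inserting data" if done else "waiting"

    return wait_for_status(read_status, "Deleting and inserting data",
                           timeout_minutes * 60, clock=fake.time, sleep=fake.sleep)


print(simulate(15))  # False: the helper gives up before the main thread is ready
print(simulate(60))  # True: the helper is still waiting when the status advances
```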

Slow end-of-level processing.

Often you can see in the Unfinished Scheduling Process form that, during the coverage phase, the system works very slowly at the end of a BOM level. It looks as if it has hung for 30-40 minutes, but then it suddenly advances to the next level. If you start AOS tracing on the batch server, you will see that all helper threads are doing nothing (they are simply re-reading the process status, waiting for the advance to the next BOM level), while one unlucky thread is busy calculating coverage for a portion of the items. Also, in the Unfinished Scheduling Process form, you may notice that the number of items on the level decreases! Ten minutes ago you had more than 15000 items on the level, now it is slightly less than 15000, and in a few minutes it will be less than 14500.

What’s happening?

The main reason for the issue is incorrect BOM Level data in inventTable. When the system allocates items between chunks, every chunk has a BOM Level attached. But sometimes, during explosion of a BOM for, say, the 5th level, the system finds among the components of the exploded BOM an item whose BOM Level (according to inventTable) is 2. Coverage for this item has already been created. How does the system calculate coverage for this item again?

The line for the item in ReqProcessItemListLine (the list of lines in a chunk) is marked as ‘IsLevelChanged’
The system finds or creates a special ‘Spare’ chunk for items that were pushed to the next level because of a sudden level change. This chunk’s line in reqProcessItemList is marked as ‘Spare’.
The system reassigns the line for the item in ReqProcessItemListLine to the Spare chunk
During processing of the next BOM Level, the system checks whether the item was ‘pushed’ from level to level. If the level for the item was changed, the system deletes the planned orders for the item that were created earlier during planning on the incorrect BOM Level.

The main performance issue of this approach is that the system does not control the size of the ‘Spare’ chunk at all. I have witnessed cases where the size of this chunk exceeded 1000 items. If planning every item takes, say, 4 seconds, then the unlucky thread that grabs this chunk will spend more than 1 hour processing it. Since this chunk has the highest number within a BOM level, it is usually grabbed at the very end of BOM Level processing. Then one thread processes this chunk for an hour, while all other threads do nothing, simply waiting for our unlucky thread.

The very first idea that comes to mind is to calculate BOM levels before every full regeneration planning run. Unfortunately, it does not always help. Sometimes we have incorrect BOMLevel info in inventTable only because we changed our BOM structure without recalculating BOM levels. But often the reason for a level change is that the real-life nesting of items in production BOMs does not match the theoretical BOM structure described in the master-data BOM.

Say we have a shining brass-head bolt, which is used only to screw a label with the name of our company onto the final assembled good. Naturally, after calculation, it will get a BOM Level of 1. Then, during production of a deeply nested sub-BOM of our finished good, someone decides to use the same bolt to screw a much smaller label onto one of the sub-components. He drops the line with the standard bolt from the production order’s BOM and adds a new line with the shining brass-head one. (Maybe this is not very good practice, but you simply cannot update the standard BOM structure for every possible small change requested by a customer.) If you recalculate BOM Levels, this bolt will still have BOMLevel 1. But if you run MRP, you will find that its coverage is actually performed on level 5 or 6.

By the way, one of the very last stages of MRP is the BOMLevel update. During this stage, the system updates InventTable with the actual BOM Level data gathered during MRP processing, not with the theoretical BOM Level from the master-data BOM structures.

So the only way to resolve the issue is to change the behavior of the system so that it creates many smaller Spare chunks for a level instead of one large Spare chunk. To accomplish this, you need to modify the getSpareListNum() method of the ReqProcessItemList table:

static server ReqProcessListNum getSpareListNum(ReqProcessId        _processId,
                                                BOMLevel            _level,
                                                ReqProcessStatus    _status,
                                                Connection          _con
                                                )
{
    ReqProcessItemList          reqProcessItemList;

    ReqProcessItemListLine       reqProcessItemListLine;
    RandomGenerate              randomGenerate=new RandomGenerate();

    ;

    select firstonly ListNum, RecId from reqProcessItemList
    order by listNum desc
    where
        reqProcessItemList.ProcessId == _processId &&
        reqProcessItemList.Level     == _level     &&
        reqProcessItemList.Spare     == true;
    if (reqProcessItemList)
    {
        select count(recid)
        from reqProcessItemListLine
        where reqProcessItemListLine.ProcessId==_processId &&
              reqProcessItemListLine.ListNum==reqProcessItemList.ListNum;
    }
    // #MAXLISTSIZE is a macro holding the maximum size of a spare chunk; it should be around 30-60
    if (!reqProcessItemList || (reqProcessItemListLine.RecId>=#MAXLISTSIZE))
    {
        randomGenerate.parmSeed(timenow());
        try
        {

            reqProcessItemList=null;
            reqProcessItemList.setConnection(_con);

            select maxof(ListNum) from reqProcessItemList where reqProcessItemList.ProcessId == _processId;

            reqProcessItemList.ListNum=reqProcessItemList.ListNum+randomGenerate.randomInt(1,20);
            reqProcessItemList.ProcessId = _processId;
            reqProcessItemList.Level     = _level;
            reqProcessItemList.Status    = _status;
            reqProcessItemList.Spare     = true;

            reqProcessItemList.insert();
        }
        catch(Exception::DuplicateKeyException)
        {
            retry;
        }

    }

    return reqProcessItemList.ListNum;
}

The only strange thing in this piece of code is the use of a random number generator. I use it to reduce the number of potential conflicts when several threads try to allocate a new Spare chunk at the same time, which leads to a duplicate key exception.
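To get a feeling for what the size limit buys, here is a rough Python estimate. It is only a model: it assumes every item takes the same time to plan and that all spare chunks of the level are available to all threads at once, and the function name and parameters are invented for the example.

```python
import math


def spare_tail_seconds(spare_items, seconds_per_item, max_list_size, threads):
    """Upper bound on the time needed to plan the spare items of one BOM level."""
    chunks = math.ceil(spare_items / max_list_size)
    # Threads process chunks in parallel, so the chunks are worked off in rounds.
    rounds = math.ceil(chunks / threads)
    largest_chunk = min(spare_items, max_list_size)
    return rounds * largest_chunk * seconds_per_item


# One unbounded spare chunk of 1000 items at 4 seconds each, as in the example above.
print(spare_tail_seconds(1000, 4, 1000, 16))  # 4000
# The same items split into spare chunks of at most 50 items, shared by 16 threads.
print(spare_tail_seconds(1000, 4, 50, 16))    # 400
```

Under these assumptions the end-of-level tail shrinks from over an hour to a few minutes, because the spare items are no longer tied to a single thread.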

Join the Conversation

3 Comments

Allen says:

28.03.2012 at 21:07

Great write-up, but I have a question. If I want to review the MRP data on a per-thread basis, how do I do that? I’m looking in the ReqProcessItemTrace table and only see a handful of thread IDs, which is leading me to believe my MRP is not processing in parallel at all. I would assume each helper gets its own thread ID and would be logged as such?

1. denisfed says:
  
  29.03.2012 at 16:46
  
  Hi Allen,
  
  You can try looking at the ReqProcessItemList table. It holds the list of planning chunks. Every chunk that has been processed or is being processed has its threadId field filled in. Try to execute the query “Select threadId,count(*) from reqProcessItemList where reqProcessItemList.processId= group by threadId” (supply your own process ID after processId=). If you see that chunks are allocated between threads more or less evenly, then MRP is running OK. If the allocation is uneven, or you see just one thread, it probably means that some threads are crashing during execution.
  
  Also, check how many batch tasks you have for MRP batch job during MRP. If some of the tasks are ended or terminated with error, it means that you are facing some kind of problem. Also, try to check Log for every helper task in the batch screen…
  
  Regards
  Denis
  
Petr Sehnal says:

09.08.2012 at 11:07

Denis, Excellent post, thanks.

Allen, is the problem you mention that you can see only one thread in ReqProcessItemTrace even if multiple threads were used?

I experienced this problem and found out there was a bug in the standard application, AX2009 RU5 (5.0.1500.2985). Due to a bug in passing parameters to helpers, they never write into the ReqProcessItemTrace table.

Fixing this is easy.
In the ReqCalc class just add the parameter isItemTraceEnabled into the CurrentThreadList macro.

Best Regards
Petr Sehnal
