Two stories about inventory closing and SQL Locks

Story #1: update_recordset considered harmfull.

Recently, I was invited to a customer who had started to use Inventory Closing functionality about a month ago. They were using DAX2009 with recent rollup 7. Every time when they were running inventory closing, the system locked all (or at least – most of the) inventTrans records, safely preventing users from any regular work with logistics module. Before I came to the customer I already heard rumors that new functionality of non-financial transfer closing can cause long duration locks to inventory data. Now I had a chance to witness it with my own eyes. The sql statement which was a reason for show-stopper problem in IC was very easy to find. (I knew the place in the code before I went to a customer’s site). It was the following query from InventCostNonFinancialTransferHandler.updateInventTrans() method:

update_recordset inventTrans
setting NonFinancialTransferInventClosingRecId = _inventClosing.RecId,
DateClosed = inventTrans.DateFinancial,
ValueOpen = InventTransOpen::No
where inventTrans.DateFinancial != dateNull()
&& inventTrans.DateFinancial <= _inventClosing.TransDate
&& inventTrans.ValueOpen == InventTransOpen::Yes
&& inventTrans.Direction == _direction
// Select transactions that represent transfers:
// Transfer journal, WMS Location refill
&& (inventTrans.TransType == InventTransType::InventTransfer
// Transfer Order
|| inventTrans.TransType == InventTransType::TransferOrderReceive
|| inventTrans.TransType == InventTransType::TransferOrderShip
// WMS Pallet transport
|| inventTrans.TransType == InventTransType::WMSTransport)
join inventDim
where inventDim.InventDimId == inventTrans.InventDimId
exists join inventTransTransferMatch
where ((inventTrans.InventTransIdTransfer != ” && inventTransTransferMatch.InventTransId == inventTrans.InventTransIdTransfer)
|| (inventTrans.InventTransIdTransfer == ” && inventTransTransferMatch.InventTransId == inventTrans.InventTransId))
&& inventTransTransferMatch.ItemId == inventTrans.ItemId
&& inventTransTransferMatch.Direction == transferMatchDirection
// select transactions where the corresponding transaction has the same values of financial inventory dimensions
#InventDimJoin(inventTransTransferMatch.inventDimId, inventDimTransferMatch, inventDim, _inventDimParm)
join inventTable
where inventTable.ItemId == inventTrans.ItemId
&& inventTable.DimGroupId == _inventDimGroupId
// exclude std cost items
join inventModelGroup
where inventModelGroup.ModelGroupId == inventTable.ModelGroupId
&& inventModelGroup.InventModel != InventModel::StdCost
// there should be no settlements for these transactions
join inventTransNotSettled
where inventTransNotSettled.InventTransRecId == inventTrans.RecId
notexists join inventCostNonFinTransferMarking
where inventCostNonFinTransferMarking.InventTransRecId == inventTrans.RecId;

 

Looks impressive, hugh ? If you explode nested views, you find out that this query is joining 10 or 11 tables. (I lost count to be honest). After I started to look onto the query, I understood that from application logic point of view, the query is Okey. It must update all non-financial transfers for a given Dimension Group in one update. It should not lock users from current operations (e.g – reserving stock for sales orders or posting picking list), because it should update only financial transactions from period to be closed. I made an assumption, that maybe this query is just causing lock escalation from record level to page level and then to table level, thus causing total lockup. I tried to disable lock escalation for inventTrans with alter table set(lock_escalation=disable) statement. (Just for a test; I think it would be a bad idea in general). It did not help. Then I tried to analyze lock information by selecting data from sys.dm_trans_locks table. I found out that this statement placed U-lock over seemingly every page of the inventTrans (Or, at least, all pages with given dataareaid).

I never had a lot of experience with U-locks before. I saw them in a database from time to time, but I red somewhere in SQL BOL that it is a kind of short duration locks issued by update statement to maintain consistency and prevent deadlocks. When I tried to find more information on update locks in internet, I ran into perfect article  Read Committed and Updates by Craig Freedman. It turned out that update statement in SQL (and update_recordset statement in Axapta respectively) place U-lock ON EVERY row it read. What happens next ? If update statement has simple logic (like update table set nonindex_field=constant where another_field=constant), the system either update a row (if it fits a where condition) and replace U lock with X, either it immediately release U lock (if row does not fit to the condition). So for simple cases, U-lock is placed on a row for really short duration – from reading of a page into memory until end of the page’s in-memory scan. But in complex queries, especially in queries with self-join, U-lock is placed in the very beginning of SQL Statement execution (when the page is being read for the first time) and kept until query reached a home-run stage, when all records to be updated are defined and the query started update phase itself. Since our query in question has a join of 11 tables, even in the very best case scenario the system was taking a couple of minutes to evaluate all necessary info and proceed to update phase. In the worst case (and this worst case happens too often), the system was choosing incorrect execution plan. In this case, U-Locks persisted for 30-40 minutes. It was locking all the pages in the table, because the system was simply reading all pages with given dataAreaId (because inventTrans do not have more specific index suitable for query conditions) thus leading to a total lock-up for 30-40 minutes.

BTW, there is no much point in locking here, because these updates being called only from starting part of Inventory Closing, which is executed in single user mode. Chances that these inventTrans records would be updated but some other user activity (not IC) are pretty thin: There is no much you can do with already financially updated inventory transfer transactions. The only thing which is coming to my mind is an idea that some crazy user will start to change InventDimId of inventory transactions via Item Details->Functions->Edit dimensions function. But I think we can ignore this possibility 😉 I can understand SQL Server concerns about the data to be modified by different process, but from application logic point of view it is highly improbable.

When I identified the issue, I started to try different ways to fix it. First thing  I tried  to replace an update_recordset with equivalent set of statements w/o update_recordset: while select table {select forupdate table2 where table2.recId==table.recid; table2.update()} . Since outermost select does not use ‘pessimisticklock’ clause (and InventTrans has OCC enabled), it does not places any locks on table it read. Inner select use ‘forupdate’ clause, but because of OCC it is just a syntactic sugar, which does not generate any additional clauses or hints for the select statement sent to SQL Serever.  This approach brought some progress: Although the query (now select, not update) was still taking 30-40 minutes to execute, it did not place these nasty U-locks over all records; IC was still slow, but it stopped to lock users at least. As a next stage, I tried to split one select onto two nested selects. For everyone who is into SQL Statement tuning it sounds like very counter-natural, but since we are unable to directly influence the query plan in X++ with hints, we must somehow ensure SQL Server selecting more or less decent execution plan. I tried the following approach: On outer level I placed a query which was simply producing a list of itemIds, which have non-closed transfers (any transfers, maybe even financial ones), in nested while select I used the copy of original select, only with adding ‘itemId==itemIdFromOuterSelect’ condition to every occurrence of InventTrans and inventSettlement in the query. This approach brings good results: Processing of non-financial transfers in my case took only 1-1.5 minutes. Maybe in your case, if you have many items with transfers, it will take about 30-40 minutes for execution of the whole logical block, but it yet much better than 30-40 minutes for one item dimension group. Well, maybe this approach is not the most perfect one. Maybe I should try to combine outermost select (to select item list) and inner update_recordset. Maybe this U-Locks won’t be so nasty after all. If the statement takes only 5-10 seconds to execute, then maybe benefits of fast execution will outweigh potential lock conflicts. But for my case, it was acceptable solution and the customer kept to use it.

As far as I understand, the whole feature was first designed for DAX2012 (where related statement is always executed per-item) and then backported to DAX2009 without too much testing. Also, to my knowledge currently MS is working on a patch for this feature in DAX2009.

Story #2:  Wrong query execution plan can ruin the helper-based algorithm in Axapta

On another project, the customer often had performance issues with inventory closing cancellation.

In the beginning, it was a mere issue of total execution time. When they were running IC cancellation in interactive mode, it took around 3-4-5 hours to complete. I pointed them that to the fact, that if you execute IC cancellation in batch mode, it runs many times faster than in interactive. The trick is simple: When IC cancellation is started, it checks the mode of execution. If  it is  running in a batch mode, it checks which items has inventSettlement records in inventory closing being canceled, then it spawn a batch task  of class InventCostClosingCancel_workInvent for every such an item. Next it iterates over all indirect cost journals created by inventory closing being canceled, and for every cost journal, it spawns a batch job of class inventCostClosingCancel_WorkProd. Then all these jobs start to execute (up to a batch server maximum number of thread at a time), thus brings significant (like 5-10 times) increase in performance. (It is worth to mention that all batch jobs are spawned with empty batch group, so you have to configure at least one of your batch servers to serve batch jobs from this batch group).

So, after they started to run Inventory Closing cancellation in a batch mode, time of the operation dropped to 45-60 minutes. (Progress percent in batch screen to advance by 1-2 percent every minute.) But from time to time, a batch mode cancellation went exceptionally slow (like 0.1-0.15% in a minute). I decided that SQL server generates silly execution plans sometimes and prescribed to run update statistics on inventTrans/inventSettlement/prodTableJour before running IC cancelation. For a while  this well known home remedy against spontaneous slowness of queries helped to resolve the issue.  But eventually, since 8-10 months after GoLive, even after statistics update, inventory closing cancellation become very slow, so I had to investigate the issue a bit more.

When I checked the list of connection, I found out that there is a lot of blocked processes from batch server, so it was a locking issue, not the issue of wrong execution plan. (Or at least so I decided after first look). Next I ran SQL Profiler, I also found out that the system creates several deadlock in a second. Finally I started to check the list of running queries and resources they are waiting for. I found out, that all processes was waiting on newly inserted key into inventSettlement.DateVoucherIdx or prodTableJour.dateVoucherIdx. I drew a conclusion that they are waiting on new key, because when I tried to select data from the tables using %%lockres%% pseudo-column, I was unable to find record being locked). I made assumption that a transaction insert record, then everyone wait on it and then transaction is getting rolled-back because of deadlock detection. (This was the only reasonable explanation why all connections are waiting on in-existent record.) The waiting statements were one of the following update_recordset statements from inventCostClosingCancel_workInvent.duplicateSettlements() and inventCostProdCancel.prodTableJour() respectively:

update_recordset inventSettlement
setting Cancelled = NoYes::Yes,
QtySettled = -1 * inventSettlement.QtySettled,
CostAmountSettled = -1 * inventSettlement.CostAmountSettled,
CostAmountAdjustment = -1 * inventSettlement.CostAmountAdjustment
where inventSettlement.Voucher == newVoucher
&& inventSettlement.TransDate == newTransDate
&& inventSettlement.Cancelled == NoYes::No
&& inventSettlement.ItemId == itemId;

update_recordset prodTableJour
setting Canceled = NoYes::Yes,
AmountFinancial = -1 * prodTableJour.AmountFinancial,
AmountPhysical = -1 * prodTableJour.AmountPhysical,
Adjustment = -1 * prodTableJour.Adjustment
where prodTableJour.Voucher == newVoucher
&& prodTableJour.TransDate == newTransDate
&& prodTableJour.Canceled == NoYes::No
&& prodTableJour.ProdId == prodTable.ProdId;

After I found and identified the places in the X++ code, which was causing deadlocks, I was finally able to understand the issue. Meaning of IC cancellation is quite simple: It simply insert exact copy of closing records, but with reversed sign. Since insert_recordset statement in X++ does not support expressions in select list, the author of the code found the simple pattern to overcome this: First the system inserts copy of old data (inventSettlement/prodTableJour) ‘as is’ (only voucher id and date are changed), next, special update_recordset statement (which supports some basic expressions in setting clause), is updating newly inserted records, inverting a value of the fields. Since this insert and update are made in 1 transaction, no other user can read our half-cooked records with un-inverted sign.

From application logic point of view, since, every helper thread is processing its own itemId or JournalId, no lock conflicts or deadlock can occur. But since an update statement in SQL  (and update_recordset statement in X++) has a habit of locking EVERY record it read, locking logic suddenly becomes in-determenistic.If SQL Server use execution plan with index scan by ItemId/JournalId indexm several threads avoids locking conflicts, since every of them update its own personal set of keys. But from time, SQL Server having  a bad day and it is selecting voucher+date index for both queries. In this case,1st helper thread inserts records for item A and locks the related key entries in Voucher+Date index; Then the 1st thread start to iterate over WHOLE index entries for given voucher and date (which are common for all helpers, since all inventSettlement records of the cancellation use the same voucher and date). Sooner or later, this scan was trying to read uncommitted inserts of the second helper thread, which caused 1st thread to be locked until 2nd thread commit transaction. Second thread, in turn, sooner or later starts  to iterate all the index entries for the same voucher and date. Sooner or later it was hitting the locks on records inserted by 1st thread. SQL Server, after finding classical deadlock situation (2 processes are waiting on each other), terminates one of the two (or several) deadlocking connections causing all the threads by one to redo the same transaction.

In both cases, I fixed the issue by building a very specific index to prevent any potential conflicting index reads in this pieces of code. In the first case (for inventSettlement) it was an index on TransDate+Voucher+ItemId. In the second case (for prodTableJour) it was an index on TransDate+Voucher+ProdId. Since then, SQL Server always use these indexes for related update_recordset statements, thus eliminating this silly deadlock issue. (It is silly, because it was not caused by some real life conflict for the same resources. It was just a result of mis-optimization of a query in SQL Server).

If you do not believe me or just want to reproduce the 2nd issue on your own test installation, try to disable itemDateIdx index on inventSettlement and NumDateIdx index on prodTableJour table. Then try to run inventory closing cancellation in batch mode with SQL profiler running deadlock trace. Moreover, when I was preparing this article, I managed to run into the same issue on standard CONTOSO database on standard international version of  DAX2009ru7. It seems that the issue can be quite frequent and even average installation can benefit from building these two additional indexes, I mentioned on the previous paragraph.

Lessons learned:

  1. Locking in SQL Server can be in-deterministic. The set of records locked can be defined not by your application logic, but by some random fluctuation of SQL Server optimizer
  2. Be careful with update_recordset statement.It can place locks on records, which you never expects. Think about building appropriate indexes or rewriting the query if you are updating a table with highly competitive access scenarios.
  3. Sometimes a bad query plan can bring not only performance issues, but also concurrency issues. Do not forget to do all usual hoodoo to keep SQL optimizer happy: Update statistics regularly, rebuild index regularly and so on. If it does not help – try to rewrite a query…


 

Join the Conversation

4 Comments

  1. Very interesting!

    BTW. Could you, please, use syntax highlighter the for your next artcle (there are lots of them online & offline)

  2. Very good post.
    In my case the solution for the slow cancellation was to use a dedicated AOS with a maximum of 4 threads.
    More than that just caused everything to slow down due to locks/waits.
    But it still runs very slow compared to what you might consider ok.

Leave a comment

Leave a Reply to Anonymous Cancel reply

Your email address will not be published. Required fields are marked *