Monday, July 20, 2009

bts_UpdateMsgbox_BizTalkServerApplication causing MessageBox deadlock

I encountered the following errors when the QA team submitted a batch of about 70 test requests to BizTalk.

-----------------------------------------------------------------------------
Event Type: Error
Event Source: BizTalk Server 2006
Event Category: BizTalk Server 2006
Event ID: 6912
Date: 7/20/2009
Time: 2:23:27 PM
User: N/A
Computer: CTOR-QBIZTALK01
Description:
The following stored procedure call failed: " { call [dbo].[bts_UpdateMsgbox_BizTalkServerApplication]( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)}". SQL Server returned error string: "Transaction (Process ID 59) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.".

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
-----------------------------------------------------------------------------
Event Type: Error
Event Source: XLANG/s
Event Category: None
Event ID: 10008
Date: 7/21/2009
Time: 1:12:15 AM
User: N/A
Computer: CTOR-QBIZTALK01
Description:
Uncaught exception (see the 'inner exception' below) has suspended an instance of service 'ArielBPM.BizTalk.RequestManagement.QueueUpAndNotify.QUAN_Orchestration(3c1decfe-a30b-14b9-c942-044cec6befed)'.
The service instance will remain suspended until administratively resumed or terminated.
If resumed the instance will continue from its last persisted state and may re-throw the same unexpected exception.
InstanceId: b2141f90-2d02-4cbd-b04e-d9de8770275f
Shape name: Start SndDocument ODX 4 Init Docs
ShapeId: 87668c55-e194-41f2-802c-34f909d271dd
Exception thrown from: segment 15, progress 11
Inner exception: Exception occurred when persisting state to the database.
Exception type: PersistenceException
Source: Microsoft.XLANGs.BizTalk.Engine
Target Site: Void Commit()
The following is a stack trace that identifies the location where the exception occured

at Microsoft.BizTalk.XLANGs.BTXEngine.BTXXlangStore.Commit()
at Microsoft.XLANGs.Core.Service.Persist(Boolean dehydrate, Context ctx, Boolean idleRequired, Boolean finalPersist, Boolean bypassCommit, Boolean terminate)
at Microsoft.XLANGs.Core.ServiceContext.PendingCommit(Boolean ignore, XMessage msg)
at Microsoft.XLANGs.Core.ExceptionHandlingContext.PendingCommit(Boolean ignoreCommit, XMessage msg)
at Microsoft.XLANGs.Core.ExceptionHandlingContext.PendingCommit(Boolean ignoreCommit, XMessage msg)
at Microsoft.XLANGs.Core.ExceptionHandlingContext.PendingCommit(Boolean ignoreCommit, XMessage msg)
at Microsoft.BizTalk.XLANGs.BTXEngine.BTXService.ExecService(Context cxt, Type serviceType, Object[] parameters, ActivityFlags flags)
at Microsoft.XLANGs.Core.Service.ExecService(Context cxt, Type serviceType, Object[] parameters)
at ArielBPM.BizTalk.RequestManagement.QueueUpAndNotify.QUAN_Orchestration.segment15(StopConditions stopOn)
at Microsoft.XLANGs.Core.SegmentScheduler.RunASegment(Segment s, StopConditions stopCond, Exception& exp)
Additional error information:

A batch item failed persistence Item-ID 26ce289e-2325-4d2e-8cd2-121a216259af OperationType MAIO_CommitBatch Status -1061151942 ErrorInfo A database failure occurred due to a deadlock. .
Exception type: PersistenceItemException


For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
-----------------------------------------------------------------------------

Based on the volume, it didn't seem like a huge load. At first, I thought the deadlocks were most likely caused by the BizTalk MessageBox database running out of space: SQL Server was trying to grow the database but couldn't keep up with the rate of incoming messages. So I tried increasing the database autogrowth increment (for example, from the default of 1 MB to 100 MB). This didn't resolve the problem; I was still seeing the same errors after the change.
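
For reference, the autogrowth change itself is just a file setting. A minimal sketch, assuming the default logical name for the MessageBox data file (check sys.database_files if yours differs):

------------------------------------------------------------------
-- Grow the MessageBox data file in 100 MB increments instead of the 1 MB default.
-- 'BizTalkMsgBoxDb' is the default logical data file name; adjust as needed.
ALTER DATABASE BizTalkMsgBoxDb
MODIFY FILE (NAME = N'BizTalkMsgBoxDb', FILEGROWTH = 100MB);
------------------------------------------------------------------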


Since the first attempt wasn't successful, the search went on. A couple of forum posts suggested the issue could be related to memory, so I checked the specification of the server hosting BizTalk. Surprisingly, the server had only 1 GB of RAM, which barely meets the minimum requirement for a BizTalk Server 2006 R2 installation: the installation guide calls for a minimum of 1 GB of RAM, with 2 GB recommended. On top of that, the same server was also acting as the application database server and the BizTalk database server. I wondered if this could be the cause of the deadlocks, so I asked the SI team to increase the physical memory of the server hosting BizTalk 2006.

While waiting for the additional memory, I kept searching and found two Microsoft KB articles: one on how to maintain and troubleshoot BizTalk Server databases (Microsoft KB 952555), and one on blocking, deadlock conditions, or other SQL Server issues when you try to connect to the BizTalkMsgBoxDb database in BizTalk Server (Microsoft KB 917845). Based on the articles, I tried the following three steps:

Step 1: Disable the Auto Create Statistics option and the Auto Update Statistics option on the BizTalkMsgBoxDb database in SQL Server.
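
In T-SQL, that is roughly the following (assuming the default database name BizTalkMsgBoxDb):

------------------------------------------------------------------
-- Turn off automatic statistics creation and updates on the MessageBox,
-- per the KB article's recommendation.
ALTER DATABASE BizTalkMsgBoxDb SET AUTO_CREATE_STATISTICS OFF;
ALTER DATABASE BizTalkMsgBoxDb SET AUTO_UPDATE_STATISTICS OFF;
------------------------------------------------------------------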

Step 2: Set the max degree of parallelism option to 1 on the computer that is running SQL Server.
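
Again, roughly the equivalent T-SQL, run on the SQL Server instance that hosts the MessageBox:

------------------------------------------------------------------
-- 'max degree of parallelism' is an advanced option, so expose it first;
-- setting it to 1 disables parallel query plans for the instance.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 1;
RECONFIGURE WITH OVERRIDE;
------------------------------------------------------------------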

Step 3: I ran DBCC SHOWCONTIG against the BizTalkMsgBoxDb. According to the KB article, if the Scan Density value returned by DBCC SHOWCONTIG is less than 30 percent, the BizTalk Server indexes can be rebuilt during downtime. A few of the results caught my eye: the Scan Density of several tables was below 15%. These results indicated that the BizTalk MessageBox was highly fragmented, so its indexes needed to be rebuilt.
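
The scan itself is a single statement; with no table specified, DBCC SHOWCONTIG reports on every table in the current database, which is where the excerpts below come from:

------------------------------------------------------------------
USE BizTalkMsgBoxDb;
GO
-- Reports fragmentation for each table; the "Scan Density" line is the
-- value the KB article says to compare against the 30 percent threshold.
DBCC SHOWCONTIG;
------------------------------------------------------------------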

------------------------------------------------------------------
DBCC SHOWCONTIG scanning 'PredicateGroupZeroSum1' table...
Table: 'PredicateGroupZeroSum1' (50099219); index ID: 1, database ID: 10
TABLE level scan performed.
- Scan Density [Best Count:Actual Count].......: 13.64% [6:44]
------------------------------------------------------------------
DBCC SHOWCONTIG scanning 'Subscription' table...
Table: 'Subscription' (1205579333); index ID: 1, database ID: 10
TABLE level scan performed.
- Scan Density [Best Count:Actual Count].......: 14.29% [1:7]
------------------------------------------------------------------
DBCC SHOWCONTIG scanning 'EqualsPredicates' table...
Table: 'EqualsPredicates' (1381579960); index ID: 1, database ID: 10
TABLE level scan performed.
- Scan Density [Best Count:Actual Count].......: 13.47% [26:193]
------------------------------------------------------------------
DBCC SHOWCONTIG scanning 'EqualsPredicates2ndPass' table...
Table: 'EqualsPredicates2ndPass' (1397580017); index ID: 1, database ID: 10
TABLE level scan performed.
- Scan Density [Best Count:Actual Count].......: 12.79% [60:469]
------------------------------------------------------------------
DBCC SHOWCONTIG scanning 'PredicateGroup' table...
Table: 'PredicateGroup' (1733581214); index ID: 1, database ID: 10
TABLE level scan performed.
- Scan Density [Best Count:Actual Count].......: 13.11% [8:61]
------------------------------------------------------------------
DBCC SHOWCONTIG scanning 'Spool' table...
Table: 'Spool' (1781581385); index ID: 1, database ID: 10
TABLE level scan performed.
- Scan Density [Best Count:Actual Count].......: 12.52% [489:3905]
------------------------------------------------------------------
DBCC SHOWCONTIG scanning 'Parts' table...
Table: 'Parts' (1909581841); index ID: 1, database ID: 10
TABLE level scan performed.
- Scan Density [Best Count:Actual Count].......: 12.65% [76:601]
------------------------------------------------------------------
DBCC SHOWCONTIG scanning 'MessageParts' table...
Table: 'MessageParts' (1941581955); index ID: 1, database ID: 10
TABLE level scan performed.
- Scan Density [Best Count:Actual Count].......: 12.63% [49:388]
------------------------------------------------------------------
DBCC SHOWCONTIG scanning 'MessageRefCountLog1' table...
Table: 'MessageRefCountLog1' (1957582012); index ID: 1, database ID: 10
TABLE level scan performed.
- Scan Density [Best Count:Actual Count].......: 12.64% [68:538]
------------------------------------------------------------------
DBCC SHOWCONTIG scanning 'PartRefCountLog1' table...
Table: 'PartRefCountLog1' (2085582468); index ID: 1, database ID: 10
TABLE level scan performed.
- Scan Density [Best Count:Actual Count].......: 12.57% [46:366]
------------------------------------------------------------------

To rebuild the indexes, I partially stopped the BizTalk applications and application host instances, updated the bts_RebuildIndexes stored procedure according to Microsoft KB article 917845, and then ran bts_RebuildIndexes in the BizTalk MessageBox database (BizTalkMsgBoxDb). After rebuilding the indexes, I refreshed the BizTalk group and restarted the BizTalk applications and application host instances. Then I ran the test again, submitting about 1,000 requests at the same time (roughly 15 times the volume of the original failed test). And guess what: no more errors.
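
For reference, once the stored procedure has been updated per the KB article, the rebuild itself is just a call to that procedure in the MessageBox database. A minimal sketch (run only during downtime, with the host instances stopped):

------------------------------------------------------------------
USE BizTalkMsgBoxDb;
GO
-- Rebuild the MessageBox indexes using the BizTalk-supplied procedure
-- (modified per KB 917845). Run only while BizTalk host instances are stopped.
EXEC dbo.bts_RebuildIndexes;
------------------------------------------------------------------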

Although I had found a fix for the database deadlocks, I still needed to find the root cause of the problem. I recalled that a set of SQL Server Agent jobs is installed along with BizTalk. These jobs automate the administrative tasks for the MessageBox, the Management database, and the other BizTalk databases, and they are critical to the health of the BizTalk server; if they fail to run successfully, BizTalk performance will degrade severely over time. One of the jobs is responsible for removing physical messages from the MessageBox that have no remaining subscribers. If this job fails, the orphaned messages remain in the MessageBox forever, which slows down the BizTalk server and eventually leaves the system in an unmanageable state that will bring BizTalk to a stop.

With this in mind, I connected to the database server to look for any failed SQL Server Agent jobs for the BizTalk databases. Surprisingly, I found that all of the SQL Server Agent jobs were disabled and had never run since BizTalk was installed. To prevent the same problem from happening again, we definitely need to enable these SQL Server Agent jobs and monitor them to make sure they run successfully all the time.
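
To see which jobs are disabled and when they last ran, a query like the following against msdb works; the sp_update_job call is a sketch, and '<job name>' is a placeholder for whichever job you decide to enable (check the BizTalk documentation first, since a few of the jobs are intentionally left unscheduled and are invoked by other jobs instead):

------------------------------------------------------------------
USE msdb;
GO
-- List the BizTalk-related SQL Server Agent jobs, whether they are enabled,
-- and their last run outcome (0 = failed, 1 = succeeded, 5 = unknown).
SELECT j.name, j.enabled, s.last_run_outcome, s.last_run_date
FROM dbo.sysjobs AS j
JOIN dbo.sysjobservers AS s ON s.job_id = j.job_id
WHERE j.name LIKE '%BizTalk%';

-- Enable a specific job; replace the placeholder with the actual job name.
EXEC dbo.sp_update_job @job_name = N'<job name>', @enabled = 1;
------------------------------------------------------------------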