Tuesday, March 2, 2010

OLE DB provider "SQLNCLI" for linked server "HOSTNAME\SECONDARY" returned message "Query timeout expired". Warning: A Remote Query Timeout may have occurred

I got this error when "Backup BizTalk Server" SQL Server Agent Job was being executed.
Processed 630 pages for database 'BizTalkDTADb', file 'BizTalkDTADb_log' on file 1.
BACKUP LOG successfully processed 630 pages in 0.511 seconds (10.084 MB/sec).
Processed 6 pages for database 'BizTalkMgmtDb', file 'BizTalkMgmtDb_log' on file 1.

BACKUP LOG successfully processed 6 pages in 0.123 seconds (0.378 MB/sec).
Processed 1 pages for database 'BizTalkRuleEngineDb', file 'BizTalkRuleEngineDb_log' on file 1.
BACKUP LOG successfully processed 1 pages in 0.324 seconds (0.003 MB/sec).
Processed 12 pages for database 'SSODB1', file 'SSODB1_log' on file 1.


Server: Msg 50000, Level 16, State 1, Procedure sp_MarkAll, Line 118
Failed running the log backup on HOSTNAME\SECONDARY.BizTalkMsgBoxDb
BACKUP LOG successfully processed 12 pages in 0.053 seconds (1.835 MB/sec).
OLE DB provider "SQLNCLI" for linked server "HOSTNAME\SECONDARY" returned message "Query timeout expired".
Warning: A Remote Query Timeout may have occurred


Scheduler:
processed 1 pages in 0.041 seconds (0.024 MB/sec). [SQLSTATE 01000] (Message 3014) Processed 2 pages for database 'SSODB1', file 'SSODB1_log' on file 1. [SQLSTATE 01000] (Message 4035) Failed running the log backup on HOSTNAME\SECONDARY.BizTalkMsgBoxDb [SQLSTATE 42000] (Error 50000) BACKUP LOG successfully processed 2 pages in 0.027 seconds (0.322 MB/sec). [SQLSTATE 01000] (Error 3014) OLE DB provider... The step failed.

Based on the error message, it seems like the backup script time out before it's completed. Looking at the backup history with the following SQL script, I found that the last time the BizTalkMsgBoxDB is successfully back up is 3 months ago. Checking the size of  BizTalkMsgBoxDB, I found that BizTalkMsgBoxDB_log database file is already up to 78 GB.

SELECT * FROM [BizTalkMgmtDb].[dbo].[adm_BackupHistory]
WHERE [DatabaseName] LIKE '%MsgBox%' ORDER BY [DatabaseName], [BackupDateTime]

Before I go any further, I think it's worthwhile to mention the architecture of the SQL servers. As shown in the diagram below, the SQL Services are clustered in an Active/Active configuration to maximize the available hardware resources. The high availability is achieved by configuring BizTalk databases separately of one another. As shown, BizTalk MessageBox database is deployed in one server and the other dependant databases are deployed to the other server within the cluster.
SQL Server Node A SQL Server Node B
BizTalkMsgBoxDB(MessageBox) BizTalkMgmtDB(Management)
  BizTalkDTADB(Tracking)
  SSODB(Single Sign-On)
  BizTalkRuleEngineDB(Rule Engine)

As such, a linked server is added in each node. For example, in Node A, a linked server is setup to link to database server that hosts databases such as BizTalkMgmtDB, BizTalkDTADB and etc. In Node B, a linked server is setup to link to the database server that hosts the BizTalkMsgBoxDB database.

Because of this configuration, whenever the "Backup BizTalk Server" SQL Server Agent Job is being executed, the linked server is utilized. With further investigation, I found that “Remote query timeout” of HOSTNAME\SECONDARY (SQL cluster Node B - BizTalkMsgBoxDB ) is set to 600 seconds, which is exactly the same amount of time the "Backup BizTalk Server" SQL Server Agent Job ran before it failed with aforementioned error. With this finding, I decided to change the "Remote query timeout" value to a higher value (such as 0 - no timeout).

Unfortunately, this change didn't resolve the problem. The job still failed after 10 minutes with exactly the same message. What else could go wrong?

Since there's only 2 servers involved in running the "Backup BizTalk Server" SQL agent job, if the error is not caused by HOSTNAME\SECONDARY (SQL cluster Node B - BizTalkMsgBoxDB ), it must be caused by the other server (SQL cluster Node A - other BTS databases other than BizTalkMsgBoxDB ). With this in mind, I decided to change the "Remote query timeout" value for cluster Node A to "0" as well.

After the change is made to cluster Node A, I re-ran the backup job again and bingo it ran successfully without any error.

No comments:

Post a Comment