BizTalk 2013 or 2010 Hosts Crashing

I’ve seen some posts on some discussion boards and blogs and support cases opened on this BTS2013 issue so I just wanted to provide some general information about the problem, what your workaround options are, and details on the fix. 

Problem

Periodically, your BizTalk host process crashes with the following errors in the event logs.  Note the error code is 80131544.  If you see the same error with a different code, you’re likely running into a different issue.

Also, notice that none of the errors come from the BizTalk Server event source.  The ones in the Application event log come from .NET Runtime and Application Error, and the one in the System event log comes from Service Control Manager.

Log Name:      Application
Source:        .NET Runtime
Date:          9/20/2013 3:47:42 PM
Event ID:      1023
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      servername
Description:
Application: BTSNTSvc64.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an internal error in the .NET Runtime at IP 000007FDED170BC1 (000007FDECE00000) with exit code 80131544.

Log Name:      Application
Source:        Application Error
Date:          9/20/2013 3:47:42 PM
Event ID:      1000
Task Category: (100)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      servername
Description:
Faulting application name: BTSNTSvc64.exe, version: 3.10.229.0, time stamp: 0x50fe567a
Faulting module name: clr.dll, version: 4.0.30319.19106, time stamp: 0x51a512d4
Exception code: 0x80131544
Fault offset: 0x0000000000370bc1
Faulting process id: 0xca8
Faulting application start time: 0x01ceb6394f1dd32a
Faulting application path: C:\Program Files (x86)\Microsoft BizTalk Server 2013\BTSNTSvc64.exe
Faulting module path: C:\Windows\Microsoft.NET\Framework64\v4.0.30319\clr.dll
Report Id: 830374f6-222d-11e3-93f8-00155d4683a2
Faulting package full name:
Faulting package-relative application ID:  

Log Name:      System
Source:        Service Control Manager
Date:          9/20/2013 3:47:43 PM
Event ID:      7031
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      servername
Description:
The BizTalk Service BizTalk Group : BTSOrchHost service terminated unexpectedly.  It has done this 2 time(s).  The following corrective action will be taken in 60000 milliseconds: Restart the service.

Root Cause

There is a change in the .NET 4.5 CLR that results in the BizTalk process crashing during XLANG AppDomain shutdown.  XLANG AppDomain shutdown is when the .NET AppDomain that contains the orchestration engine tears itself down during periods of inactivity or idleness.

The orchestration engine’s AppDomain shuts down by default after 20 minutes of idleness (all orchestration instances are dehydratable) or 30 minutes of inactivity (no orchestration instances exist).

BTW, if you see this same issue in BizTalk 2010, it’s because you installed .NET 4.5 in your BizTalk 2010 environment – which is not tested or supported.  You need to be on .NET 4.0 if you’re running BTS2010.

Update:  BizTalk 2010 CU7 added support for .NET 4.5 so you should not see this issue once you are on CU7 – see Permanent Solution section for details.
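If you’re not sure whether a BizTalk 2010 box has picked up .NET 4.5, here’s a minimal sketch that checks the registry.  It relies on the documented Release value under the NDP v4 Full key (378389 is the 4.5 RTM threshold; the value is absent entirely on a plain .NET 4.0 machine):

import winreg

# .NET 4.5+ writes a "Release" DWORD under this key; a plain .NET 4.0
# install has no such value at all.
KEY_PATH = r"SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full"

try:
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
        release, _ = winreg.QueryValueEx(key, "Release")
    print(f".NET 4.5 or later is installed (Release = {release})")
except FileNotFoundError:
    print("No Release value found: .NET 4.0 (or earlier) only")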

Workarounds

The first and easiest workaround is to do nothing.  This crash happens during periods of idleness/inactivity, so your orchestrations aren’t doing anything anyway.  Also, Service Control Manager is configured to bring a BizTalk process back up one minute after a crash, so the process will restart and continue processing normally.  The crash won’t happen again until there is orchestration activity followed by another period of idleness/inactivity.
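If you go the do-nothing route, it’s worth confirming that the service recovery settings will actually restart the host after a crash.  Here’s a minimal sketch using sc.exe.  It assumes the service key name follows the usual BTSSvc$&lt;HostName&gt; pattern, and BTSOrchHost is simply the host from the System event log entry above – substitute your own:

import subprocess

# BizTalk host instance services are typically named "BTSSvc$<HostName>".
# BTSOrchHost is the host from the event log entry above -- substitute yours.
service = "BTSSvc$BTSOrchHost"

# Show the current failure/recovery configuration.
subprocess.run(["sc.exe", "qfailure", service], check=True)

# Ensure the service restarts 60 seconds after a crash, resetting the
# failure counter after a day. (sc.exe requires the space after each "=".)
subprocess.run(
    ["sc.exe", "failure", service,
     "reset=", "86400",
     "actions=", "restart/60000/restart/60000/restart/60000"],
    check=True,
)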

Note that if there are other things going on in the same host (like receive or send ports) then I wouldn’t be comfortable letting the host crash since non-orchestration work within the same host could be impacted.  In that case, you can separate out non-orchestration work into other hosts (that’s generally recommended anyway) or you can go with the below workaround. 

Another thing to consider is that, depending on the Windows Error Reporting (Dr. Watson) settings on the machine, these crashes can build up dump files on the drive.  The default location for these is C:\ProgramData\Microsoft\Windows\WER\ReportQueue.  Check for any subfolders starting with “AppCrash_BTSNTSvc” – you can delete them if you don’t need them.  So if you have limited disk space on the system drive, that might also be a reason to go with the below workaround.
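If those dump folders do pile up, here’s a minimal cleanup sketch (assuming the default ReportQueue location shown above – adjust if WER is configured differently on your machine):

import shutil
from pathlib import Path

# Default WER report queue location -- adjust if WER is configured differently.
report_queue = Path(r"C:\ProgramData\Microsoft\Windows\WER\ReportQueue")

# Remove only the crash reports generated by the BizTalk host processes.
for folder in report_queue.glob("AppCrash_BTSNTSvc*"):
    if folder.is_dir():
        print(f"Deleting {folder}")
        shutil.rmtree(folder)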

If allowing the host to crash during periods of orchestration idleness/inactivity is not an option, the workaround is to turn off XLANG AppDomain shutdown.  This is generally safe to do.  The only concern is if you have an orchestration that calls custom code that pins objects in the AppDomain’s memory (not good in general) – then never tearing it down could lead to excessive memory usage.  Still, a busy 24×7 environment will never hit AppDomain shutdown anyway, and I’ve seen a number of environments with very low latency requirements turn it off deliberately (since reloading the XLANG engine after a period of inactivity is a perf hit for that first request).

So, to prevent the crash from happening altogether, here’s what you do (a scripted version of these steps follows the list):

  1. Go to your BTS folder (default is C:\Program Files (x86)\Microsoft BizTalk Server 2013)
  2. First, save a copy of the BTSNTSvc64.exe.config file with a new name, since we need to modify the original.  (Use BTSNTSvc.exe.config if it’s a 32-bit host that is crashing – check the error message to see whether the crash is happening in BTSNTSvc.exe or BTSNTSvc64.exe.)
  3. Open the original file in notepad and directly below the <configuration> node, add the following:
     
        <configSections>
            <section name="xlangs" type="Microsoft.XLANGs.BizTalk.CrossProcess.XmlSerializationConfigurationSectionHandler, Microsoft.XLANGs.BizTalk.CrossProcess" />
        </configSections>
     
  4. Then, directly below the </runtime> node, add the following:
     
        <xlangs>
            <Configuration>
                <AppDomains AssembliesPerDomain="50">
                    <DefaultSpec SecondsIdleBeforeShutdown="-1" SecondsEmptyBeforeShutdown="-1"/>
                </AppDomains>
            </Configuration>
        </xlangs>
     
  5. Recycle the host
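
If you need to make this change on several servers, here’s a minimal sketch that automates steps 2 through 4.  It assumes the default install path, a 64-bit host, and that the config file contains a bare <configuration> root element and a </runtime> node – verify all of that against your environment before relying on it:

import shutil
from pathlib import Path

# Assumed defaults -- verify against your environment.
# (Use BTSNTSvc.exe.config instead for a 32-bit host.)
config = Path(r"C:\Program Files (x86)\Microsoft BizTalk Server 2013\BTSNTSvc64.exe.config")

CONFIG_SECTIONS = """    <configSections>
        <section name="xlangs" type="Microsoft.XLANGs.BizTalk.CrossProcess.XmlSerializationConfigurationSectionHandler, Microsoft.XLANGs.BizTalk.CrossProcess" />
    </configSections>
"""

XLANGS = """    <xlangs>
        <Configuration>
            <AppDomains AssembliesPerDomain="50">
                <DefaultSpec SecondsIdleBeforeShutdown="-1" SecondsEmptyBeforeShutdown="-1"/>
            </AppDomains>
        </Configuration>
    </xlangs>
"""

# Step 2: keep a backup copy of the original file.
shutil.copy2(config, config.with_name(config.name + ".bak"))

# Steps 3 and 4: naive text insertion directly below <configuration> and </runtime>.
# This assumes the root element is written exactly as "<configuration>".
text = config.read_text()
text = text.replace("<configuration>", "<configuration>\n" + CONFIG_SECTIONS, 1)
text = text.replace("</runtime>", "</runtime>\n" + XLANGS, 1)
config.write_text(text)

You still need to recycle the host (step 5) afterwards, since the process only reads its config file at startup.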

Permanent Solution

Microsoft has created fixes for this issue in BizTalk 2013 and 2010.

BizTalk 2013: a fix is available from Microsoft – make sure you’re on the latest cumulative update.

BizTalk 2010: the fix is included in Cumulative Update 7 (see the update note in the Root Cause section above).

Duplicate key row in object ‘dbo.bts_LogShippingJobs’

A customer was setting up Log Shipping disaster recovery for their BizTalk Server 2006 R2 environment (http://technet.microsoft.com/en-us/library/aa560961(v=bts.20).aspx).  They had gotten to the step where they run the stored procedure bts_ConfigureBizTalkLogShipping, and received the error “Msg 2601, Level 14, State 1, Procedure bts_ImportSQLAgentJobs, Line 56 Cannot insert duplicate key row in object ‘dbo.bts_LogShippingJobs’ with unique index ‘CIX_LogShippingJobs’.”

It turns out that this customer also had several of their own SQL Agent jobs running on the BizTalk server. As part of configuring the destination environment, we attempt to recover all of the jobs running on the BizTalk server with one exception: we don’t support importing of SQL Agent jobs where the steps use more than one database. The script’s logic iterates over each database to be recovered and logs the jobs associated with that database in bts_LogShippingJobs for later recovery. If a job has more than one database association, the script attempts to log it twice. But we never want to recover the same job more than once, so bts_LogShippingJobs doesn’t allow duplicate jobs. When the script attempts to log the job the second time, it fails.

In general, we discourage running any jobs on the SQL Server that supports BizTalk Server other than the jobs that ship with the product.

For those customers who choose to run their own jobs and encounter this issue, the solution is to temporarily remove any job(s) associated with multiple databases while setting up the destination recovery environment. You will also need to develop your own recovery plan for any such job(s). Once the destination environment is configured, you can restore the jobs. The following T-SQL will identify jobs associated with multiple databases:

-- Count how many distinct databases each SQL Agent job touches across its steps.
SELECT j.name, COUNT(DISTINCT js.database_name) AS dbcount
INTO #tmp FROM msdb.dbo.sysjobsteps js
JOIN msdb.dbo.sysjobs j WITH (NOLOCK)
ON j.job_id = js.job_id
GROUP BY j.name
-- Jobs associated with more than one database are the ones that
-- break bts_ConfigureBizTalkLogShipping.
SELECT * FROM #tmp WHERE dbcount > 1
DROP TABLE #tmp

BizTalk EDI Leap Year Fix FAQ

As you may have heard, there is an issue in the BizTalk 2006 R2 and 2009 EDI engines that results in a failed EDI transaction if the EDI message contains a date node with a leap year value – for example, 2/29/2012.  The issue and hotfix are documented at http://support.microsoft.com/kb/2435900.  With 2/29/2012 just days away, this is a big deal and you will want to address it in your environment ASAP.
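For context on why this particular date trips code up: 2012 is divisible by 4 and is not a century year, so February 29, 2012 is a real date, but date-handling logic that gets the leap year rules wrong will reject it.  Here’s a minimal sketch of the correct check (illustrative only – this is not the EDI engine’s actual code, and the CCYYMMDD format assumed here is just the common X12 date layout):

from datetime import date

def is_leap_year(year: int) -> bool:
    # Gregorian rule: divisible by 4, except century years not divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def is_valid_edi_date(ccyymmdd: str) -> bool:
    # A parser that mishandles leap years rejects "20120229" even though
    # it's a real date; datetime.date applies the full Gregorian rules.
    try:
        date(int(ccyymmdd[0:4]), int(ccyymmdd[4:6]), int(ccyymmdd[6:8]))
        return True
    except ValueError:
        return False

assert is_leap_year(2012)
assert is_valid_edi_date("20120229")      # valid leap day
assert not is_valid_edi_date("20130229")  # 2013 is not a leap year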

You can also get some great information about this issue at http://blogs.msdn.com/b/biztalkcrt/archive/2012/02/22/edi-leap-year-hotfix-biztalk-2009-and-biztalk-2006-r2.aspx.

I’ve also compiled a list of questions that have been asked about this issue over the last week and created a detailed FAQ document about this.  Check it out below. 

Also, to download the pipeline sample mentioned in the FAQ, click here.

(FAQ last updated on 02/29/2012 at 12:22PM Central Time)

BTSLeapYearFAQ.docx

Tuning Messaging Engine Threads per CPU Below the Default Can Help Performance

Say you have a large flat file in an application that does messaging only.  You drop a single file and you notice elevated CPU on the receive host for several minutes.  What’s happening in the host is that a single messaging thread is using the flat file disassembler (ffdasm) to disassemble the file, which is a very CPU-intensive activity.

Now you drop 100 large flat files.  Since messaging engine threads per CPU is set to 20 by default, if you have two processors there will be 40 messaging engine threads created, operating on 40 inbound flat files.  CPU still goes to 100%, but it takes longer for the first flat file to be processed because the other threads are competing for valuable CPU resources.

In testing on a two-processor system, I got the best end-to-end performance with 100 files by specifying only one messaging engine thread per CPU on the receive host.  Tuning messaging engine threads per CPU to a lower value results in faster processing of the first file through the system and faster overall processing.
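The intuition behind this: disassembly is CPU-bound, so running far more concurrent threads than cores doesn’t add throughput – it just time-slices the cores across more in-flight files, delaying every completion.  Here’s a toy model of the effect (a back-of-the-envelope sketch, not a BizTalk simulation; the per-file CPU cost is a made-up number):

# Toy model: `files` CPU-bound files, each needing `work` minutes of CPU time,
# processed by threads that share `cpus` cores via fair time-slicing.
def time_to_first_and_last(files=100, cpus=2, threads_per_cpu=1, work=3.0):
    threads = min(cpus * threads_per_cpu, files)
    # All active threads advance together, so the first completions arrive
    # only after a whole "batch" of `threads` files worth of CPU time.
    first = work * threads / cpus
    # Total CPU demand is fixed, so the last file lands at the same time
    # regardless of thread count (ignoring scheduling overhead).
    last = work * files / cpus
    return first, last

for tpc in (1, 2, 4, 20, 50):
    first, last = time_to_first_and_last(threads_per_cpu=tpc)
    print(f"{tpc:>2} threads/cpu: first file ~{first:.0f} min, all ~{last:.0f} min")

The measured results below show the same pattern: time to first file grows with the thread count while total time stays roughly flat – and in practice total time actually degrades a bit at high thread counts, overhead the toy model ignores.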

Here are the results.  Each listing shows three directory entries: the marker file noting when the run started, the first output file, and the last output file.

Threads per CPU    Time to first file    Time for all 100 files
 1                 ~4 min                1 hr 30 min
 2                 ~4 min                ~1 hr 42 min (extrapolated)
 4                 ~10 min               1 hr 39 min
20 (default)       ~18 min               1 hr 42 min
50                 ~23 min               1 hr 54 min

1 messaging engine thread per CPU on the receive host – about 4 minutes to process the first file, 1 hour 30 minutes for all files:

10/21/2011  02:46 PM                 set messaging engine threads to 1 for full test.txt
10/21/2011  02:50 PM         2,752,368 {FC0AD608-1F76-43A2-B7D5-44DDB66E66DF}.xml
10/21/2011  04:16 PM         2,752,368 {F7241E83-C7F1-428D-A5DD-2582CF8ABE20}.xml

2 messaging engine threads per CPU on the receive host – about 4 minutes to process the first file, approximately 1 hour 42 minutes to finish all (the 33rd file completed in 34 minutes; the test was stopped early and the total extrapolated):

10/19/2011  04:29 PM                 set messaging engine threads to 2 for full test.txt
10/19/2011  04:33 PM         2,752,368 {C29DB82A-98E8-4B41-B917-093DDC708D21}.xml
10/19/2011  05:02 PM         2,752,368 {E9AFFEC2-B60B-446C-9D63-1BBBB2B2FD12}.xml

4 messaging engine threads per CPU on the receive host – 10 minutes to first message, 1 hour 39 minutes total processing time:

10/19/2011  06:06 PM                 set messaging engine threads to 4 for full test.txt
10/19/2011  06:16 PM         2,752,368 {E96CD77C-8D24-4BB9-A111-871594B06F3F}.xml
10/19/2011  07:45 PM         2,752,368 {2F2EE1F9-C9F0-44A9-B2D1-1A02BC9F7EAD}.xml

20 messaging engine threads per CPU on the receive host (the default setting) – 18 minutes to first message, 1 hour 42 minutes for all files:

10/20/2011  09:35 AM                 set messaging engine threads to 20 for full test.txt
10/20/2011  09:53 AM         2,752,368 {9BA6A42F-A634-4FEF-A29E-F16A774D823F}.xml
10/20/2011  11:17 AM         2,752,368 {262F3035-CACF-423A-BDFA-5EA817D2639B}.xml

50 messaging engine threads per CPU on the receive host – 23 minutes before the first file is processed, 1 hour 54 minutes for all files:

10/20/2011  02:56 PM                 set messaging engine threads to 50 for full test.txt
10/20/2011  03:19 PM         2,752,368 {40BF75AF-3935-4DC1-99C4-048CD61421D1}.xml
10/20/2011  04:50 PM         2,752,368 {5B72CC56-FB61-4A33-9186-3A3F954AC38A}.xml

BizTalk Terminator Not Cleaning Up Caching Items?

I’ve been pinged a number of times on this so thought I should blog the workaround and an explanation. 

First, let’s say MBV flags orphaned cache instances in the Warning and Summary Report, or you just notice in MBV that there’s a bunch of cache messages in one of the queue tables.

Well, according to my Using BizTalk Terminator to Resolve Issues article, you simply run the Terminate Caching Instances task: 

Issue Identified by MBV:      Orphaned Cache Instances

Resolution Options:           MBV Integration or Manual Task Selection

Terminator Resolution Task:   Terminate Caching Instances (in the Delete task category)

Terminator View Task:         View Count of Cache Messages in All Host Queues; View Count of Cache Instances in All Hosts

Root Cause:                   This is due to a known bug and there is a hotfix available.  See KBs 944426 & 936536 for details.

That should do it.  If it doesn’t, make sure you have stopped all the BizTalk hosts (including the IIS app pool hosting the BizTalk isolated host, if the caching items are there) and try again.  (Hey, you shouldn’t be running Terminator without stopping all the BTS hosts anyway.)

Now, that will definitely do it.

Well… like 99.9% of the time.

Let’s say you do all of the above, then run one of those View tasks (or MBV) and notice that Terminator left behind some of those caching instances (and their associated caching messages).  This happens even though Terminator claims to have terminated all of them successfully.  Rerunning the task doesn’t help – Terminator will repeatedly say it successfully terminated those instances, but either of the View tasks will show that they’re still there.

Ok, so now you’re most likely running into a very rare scenario that I’ve come across a few times. 

First, I should point out that the Terminate Caching Instances and Terminate Instances tasks use BizTalk’s WMI API to interact with the messagebox – and that’s key.  I had the opportunity to analyze some data from a customer environment that was running into this issue and it turns out that there are certain times when msgbox logic prevents the stored procs called by WMI from terminating “internal” instances – with caching being considered one of those “internal” types.  As far as when exactly the msgbox logic goes down this “rare” path, I’ve never had access to a repro environment where I could fully debug this so I don’t have a good answer for that.

So I was going to write code in Terminator’s WMI class to catch this scenario and warn the user that they need to use the workaround to clean up the remaining items, but unfortunately the msgbox logic catches the failed call and doesn’t pass that info on to the WMI caller – so there’s no way for Terminator (or any WMI client) to know that some of the instances weren’t deleted.  As far as the WMI client is concerned, the stored proc call completed successfully, so it assumes the instances it asked to be terminated were actually terminated.
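For reference, here’s roughly what WMI-based termination looks like from any client – a sketch using Python’s wmi package against BizTalk’s root\MicrosoftBizTalkServer namespace.  The WQL filter shown is illustrative only (the status value is an assumption to verify against the MSBTS_ServiceInstance documentation):

import wmi  # pip install wmi (Windows only)

# BizTalk publishes service instances through this WMI namespace.
bts = wmi.WMI(namespace=r"root\MicrosoftBizTalkServer")

# Illustrative filter -- 4 should be "suspended (resumable)" per the
# MSBTS_ServiceInstance documentation, but verify before running this.
instances = bts.query("SELECT * FROM MSBTS_ServiceInstance WHERE ServiceStatus = 4")

for inst in instances:
    # Terminate() reports success even when msgbox-side logic silently skips
    # "internal" instances such as caching -- which is exactly why Terminator
    # (or any WMI client) can't detect the leftover instances described above.
    inst.Terminate()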

So what’s the workaround?  Well, don’t use WMI for this particular situation.  Instead, use Terminator’s Hard Termination tasks (Terminate Multiple Instances (Hard Termination) or Terminate Single Instance (Hard Termination)).

While I was working on Terminator’s WMI class and having BizTalk engineers use Terminator on an internal-only basis within Microsoft, we noticed that on very rare occasions the termination tasks would not terminate something.  This was not a limitation of Terminator – we could reproduce it with any WMI client (including the BTS Admin console).  I created the hard termination tasks specifically to handle those scenarios.  They bypass BizTalk’s normal termination API and use SQL calls to directly interact with the BizTalk msgbox.  They can terminate anything (well, so far) – even internal instances.

Originally, we just had the Terminate Single Instance (Hard Termination) task to help terminate a one-off instance that just wouldn’t terminate any other way.  That worked great for those one-off scenarios, but I soon realized that sometimes there would be a larger number of instances needing hard termination.  I left the single instance task in place to give users that functionality and wrote the Terminate Multiple Instances (Hard Termination) task.  It allows the user to choose the Host, Class, Status, and the max number of instances to terminate, and does hard terminates on all items that fit the filter criteria.

The one painful thing about the Terminate Multiple Instances (Hard Termination) task is that if you have instances across multiple hosts with various statuses, you will need to run the task for each permutation, since it can’t handle multiple hosts and statuses (or classes) in one execution the way the WMI-based Terminate Instances and Terminate Caching Instances tasks can.  Since Terminator v2 supports PowerShell, I’m hoping at some point to create a PowerShell-based hard terminate task that provides this functionality – I just haven’t had the time to do that yet.

So, in short:

Problem:

The Terminate Caching Instances task is not cleaning up all cache items.

Solution:

Use the Terminate Multiple Instances (Hard Termination) task, choosing Caching as the Class parameter.

You will need to re-run this task for each Host and Status that applies to the cache items you’re trying to terminate – MBV or the Terminator View Count of Cache tasks should give you some info in this regard. 

BTW, I’ve seen active, dehydrated, and suspended cache items, so you may need to cycle through all of those statuses.

In general, if you ever find that any of the termination tasks are not terminating what you want, the two Hard Termination tasks (Single and Multiple) are the workaround. 

Remember that the Hard Termination tasks bypass the normal BTS APIs, so they should only be used if the normal Terminate Instances or Terminate Caching Instances task is not working – and as always, be careful with Terminator, especially when doing any of the deletion tasks.