I recently worked on a puzzling problem where spikes in load to BizTalk in a large environment would cause some unpleasant SQL Server problems. This turned out to be directly related to the .NET thread-pool size in the BizTalk hosts, which is configurable via the 'MaxWorkerThreads' registry value and can be set as described in http://msdn2.microsoft.com/en-us/library/aa561380.aspx . I'll outline the issue here from the perspective of understanding and addressing the root cause, rather than as a blow-by-blow chronology of the troubleshooting.
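If you prefer to script the change rather than edit the registry by hand, a minimal sketch along these lines works. Note that the exact service key name ("BTSSvc$XlangHost" below is just an assumption) varies with the BizTalk version and host name, so check the article linked above for the location that applies to your environment, and remember the host instances must be restarted to pick up the change.

```csharp
// Minimal sketch only: sets MaxWorkerThreads for one host instance.
// "BTSSvc$XlangHost" is an assumed service key name - verify the real
// path for your BizTalk version and host before using this.
using Microsoft.Win32;

class SetMaxWorkerThreads
{
    static void Main()
    {
        const string keyPath =
            @"SYSTEM\CurrentControlSet\Services\BTSSvc$XlangHost\CLR Hosting";

        using (RegistryKey key = Registry.LocalMachine.CreateSubKey(keyPath))
        {
            // The value is per CPU; restart the host instance afterwards.
            key.SetValue("MaxWorkerThreads", 100, RegistryValueKind.DWord);
        }
    }
}
```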
The internals
The orchestration engine in BizTalk does most of its work on worker threads from the .NET thread pool. This is easily verified by doing some sample-based profiling of the running XLANG host using the excellent Visual Studio performance tools. Below you can see the bulk of the time spent in a call stack that goes from the thread pool's "_ThreadPoolWaitCallback" method into the XLANG scheduler's "RunSomeSegments" method. If you drill further into this call stack you'll see much of the time is spent in "BTSDBAccessor.dll". What this tells us is that XLANG does most of its work, including accessing the BizTalk databases to send and receive messages, on .NET thread-pool worker threads.
Due to this, the actual number of running orchestrations at any one time is directly limited by the size of the .NET thread pool. You can verify this with the performance counter "\XLANG/s Orchestrations(hostname)\Running orchestrations" and see whether it ever exceeds your "thread-pool size * number of CPUs".
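If you want to sanity-check that ceiling against a running host, a quick sketch like the following does the job. The counter category, counter, and instance names ("XLANG/s Orchestrations", "Running orchestrations", "XlangHost") are assumptions here – use whatever names Performance Monitor shows in your environment.

```csharp
// Rough sketch: compare the running-orchestration counter against the
// theoretical ceiling of thread-pool size * CPU count.
using System;
using System.Diagnostics;

class ThreadPoolCeiling
{
    static void Main()
    {
        // Whatever MaxWorkerThreads is set to for this host (a per-CPU value).
        int maxWorkerThreadsPerCpu = 100;
        int ceiling = maxWorkerThreadsPerCpu * Environment.ProcessorCount;

        // Counter names below are assumptions - check perfmon for the real ones.
        using (PerformanceCounter running = new PerformanceCounter(
            "XLANG/s Orchestrations", "Running orchestrations", "XlangHost"))
        {
            Console.WriteLine("Running orchestrations: {0}, theoretical ceiling: {1}",
                running.NextValue(), ceiling);
        }
    }
}
```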
Now it should be no surprise to anyone who has worked with BizTalk that one of your key performance limiters is the back-end SQL Server database – specifically the MessageBox. Typically your running orchestrations spend a lot of time reading from and writing to the MessageBox. So what happens if all of the existing database sessions are busy serving other instances and your current instance needs to interact with the MessageBox? Generally, it will open another database session. Thus, under load, your .NET thread-pool size directly determines your number of running orchestrations, which in turn determines how many database sessions the host can potentially open.
What about throttling based on DB sessions, you ask? The way this works is that if you are using more database sessions than the threshold at which throttling is configured to kick in, the engine gradually introduces delays before database requests. This gives the existing connections a chance to become free, but it does not block processing outright once you are above the throttling limit (this is a throttle, after all, not a hard cap). What this means is that (a) if database requests are taking a long time (which is common under load) you will still open more sessions, and (b) under a quick spike in load you may open a batch of sessions for your running instances before throttling takes effect.
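To see why a delay-based throttle doesn't cap the session count, here is a toy simulation. This is emphatically not BizTalk's actual throttling algorithm – just an illustration of point (a): when each request holds its session for a long time, slowing the arrival rate still leaves the session count well above the threshold.

```csharp
// Toy model only - not BizTalk's real throttling logic.
// Each tick, new work arrives and opens sessions; throttling (above the
// threshold) merely slows the arrival rate. Sessions are released slowly
// because the database requests themselves are slow.
using System;

class DelayThrottleSketch
{
    static void Main()
    {
        const int threshold = 40;    // session count at which throttling kicks in
        const int holdTicks = 100;   // average ticks a slow request holds its session
        int sessions = 0;

        for (int tick = 0; tick <= 200; tick++)
        {
            bool throttled = sessions > threshold;
            sessions += throttled ? 2 : 10;      // delay reduces, but never stops, new requests
            sessions -= sessions / holdTicks;    // a fraction of sessions complete each tick

            if (tick % 40 == 0)
                Console.WriteLine("tick {0,3}: ~{1} sessions ({2})",
                    tick, sessions, throttled ? "throttled" : "not throttled");
        }
    }
}
```

With a threshold of 40, this toy model still settles at roughly 200 open sessions: throttling only slows the rate at which new sessions are opened and does nothing about the ones already held open by slow requests.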
The testing
To demonstrate the effect that a large number of suddenly busy database connections to the MessageBox can have, let's do some testing. For the testing I created a scenario that uses MQSeries exclusively as the message transport (a common enterprise queuing platform). It includes some correlation with a delay (to ensure the subscriptions build up a little) and a couple of dynamic send ports. I also made the dynamic send ports run in parallel within an atomic scope, and constructed a new message before another, final send. This is a simple scenario without any compensation, business rules, etc., but it includes a few of the complexities that exist in real business processes and should be plenty to keep the MessageBox busy. The solution is downloadable from this blog if you need further details.
For generating the load I again used some of the great new features in Visual Studio – this time the Load Testing component included in the Team Suite and Test editions. I created a simple unit test to send a message with a unique OrderId to the MQSeries queue that is the receive location for my XLANG activation port. I then wrapped this inside a Load Test, which allowed me to specify the number of concurrent users and a warm-up time, as well as which performance counters I wanted to gather on which machines during the run. Note that a "user" in Visual Studio Load Testing effectively equals a thread that loops through your unit test. To produce spikes in load I had each thread "sleep" for a fair amount of time in the unit test while simulating low load, then for one minute out of every ten I reduced this sleep considerably. This was a crude – but effective – method for simulating periodic "bursts" of traffic.
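A rough sketch of that kind of unit test is below. The MqHelper class, queue name, and message shape are placeholders I've made up for illustration (the real helper library ships in the attached solution); the burst logic is simply the crude "one minute in ten" trick described above.

```csharp
// Sketch only - placeholder MQ helper, queue name, and message shape.
using System;
using System.Threading;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class BurstLoadTest
{
    [TestMethod]
    public void SendOrderMessage()
    {
        // Unique OrderId so each test iteration creates a distinct order.
        string orderId = Guid.NewGuid().ToString();
        string body = string.Format(
            "<Order xmlns=\"http://BtsLoadScenario\"><OrderId>{0}</OrderId></Order>",
            orderId);

        MqHelper.Send("BTS.RECEIVE.QUEUE", body);

        // One minute out of every ten is a burst: sleep far less, so each
        // virtual user loops (and therefore sends) much more frequently.
        bool inBurst = DateTime.Now.Minute % 10 == 0;
        Thread.Sleep(inBurst ? 250 : 5000);
    }
}

// Stub so the sketch compiles stand-alone; the attached solution contains
// the real MQSeries helper library.
static class MqHelper
{
    public static void Send(string queueName, string message)
    {
        /* write the message to the MQSeries queue here */
    }
}
```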
Attached to this blog is a link to my solution containing the BizTalk project, the Test project, and a helper class library for working with the MQSeries queues. (Note: You’ll need the MQSeries client bits installed on the Visual Studio box to run the load generation – or just change the projects to use a different transport for your test environment).
My environment was highly scaled out. My MessageBox was running on an 8-way SQL Server (with BAM/DTA/SSO on a separate SQL Server). I also had 10 dual-proc machines; 2 running the receive hosts, 2 running the send hosts, 4 running the XLANG hosts, one dedicated to running MQSeries, and one dedicated to Visual Studio (generating the load, monitoring the counters, etc).
The results
For the first test the idea was to push SQL beyond its limits with the volume of database sessions created by BizTalk. For this I set MaxWorkerThreads for all the hosts to 200. If you multiply this number of threads (per CPU) by the number of active host instances across the deployment – this has the potential for a lot of connections! A few key counters from the run can be seen below. (Note: If you are wondering what tool this graph is from, it is a screen shot of a section of the Visual Studio Load Test results UI.)
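To put a rough number on "a lot of connections": assuming one host instance on each of the 8 BizTalk machines described above, that is 8 host instances * 200 threads per CPU * 2 CPUs = 3,200 potential database sessions against a single MessageBox. The run never reaches that theoretical ceiling, but as the results show, it doesn't need to in order to cause trouble.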
200 ’MaxWorkerThreads’, default (0) ’DB Connections Per CPU’ throttling
It can be seen from the counters that I dropped ~59,000 test messages, averaged 31.4% SQL CPU on my MessageBox server, and peaked at 1,093 SQL connections! That's a lot of connections. Of all the test messages dropped, ~20,000 were received by the receive locations during the run, and only ~5,000 XLANG instances completed.
The big problem here though is SQL. Why does it keep dropping to 0% CPU? Looking in the SQL event log gives the reason:
Event Type: Information
Event Source: MSSQLSERVER
Event Category: (2)
Event ID: 17888
Date: 9/21/2007
Time: 11:12:08 AM
User: N/A
Computer: BPI8X32PERF01
Description:
All schedulers on Node 0 appear deadlocked due to a large number of worker threads waiting on LCK_M_S. Process Utilization 0%
(Note: For details on what the SQL Server Scheduler is and the various causes of the above event, see the article at http://www.microsoft.com/technet/prodtechnol/sql/2005/diagandcorrecterrs.mspx ).
The above message appears several times. So SQL Server is effectively locked up to the point where it can't do anything for extended periods. This causes various problems on the BizTalk hosts, where you may see events such as those below:
Event Type: Error
Event Source: BizTalk Server 2006
Event Category: BizTalk Server 2006
Event ID: 6912
Date: 9/26/2007
Time: 11:13:21 AM
User: N/A
Computer: BPI2X64PERF04
Description:
The following stored procedure call failed: "{ call [dbo].[bts_UpdateMsgbox_XlangHost]( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)}". SQL Server returned error string: "Transaction (Process ID 378) was deadlocked on lock | thread resources with another process and has been chosen as the deadlock victim. Rerun the transaction.".
.. or ..
Event Type: Error
Event Source: XLANG/s
Event Category: None
Event ID: 10034
Date: 9/26/2007
Time: 11:13:22 AM
User: N/A
Computer: BPI2X64PERF04
Description:
Uncaught exception (see the ‘inner exception’ below) has suspended an instance of service ‘BtsLoadScenario.SingleOrch(8d7a5913-ce72-0c75-7e41-a8729c4c50c7)’.
The service instance will remain suspended until administratively resumed or terminated.
If resumed the instance will continue from its last persisted state and may re-throw the same unexpected exception.
InstanceId: ca11bfcf-b5d7-4115-b12d-d21b082b3094
Shape name: ConstructMessage_1
ShapeId: dfbab00f-9e42-4905-bd22-cb046262a86c
Exception thrown from: segment 1, progress 22
Inner exception: Exception from HRESULT: 0xC0C01B13
Exception type: COMException
Source: Microsoft.BizTalk.Interop.Agent
Target Site: Microsoft.BizTalk.Agent.Interop.IBTMessage ReadReferencedMessageByID(System.Guid ByRef)
The following is a stack trace that identifies the location where the exception occured
at Microsoft.BizTalk.Agent.Interop.IBTInstanceState.ReadReferencedMessageByID(Guid& msgID)
at Microsoft.BizTalk.XLANGs.BTXEngine.BTXMessageState._retrieveBTMessage(IStateManager ctx)
at Microsoft.BizTalk.XLANGs.BTXEngine.BTXMessageState._retrieveBTMessage()
at Microsoft.BizTalk.XLANGs.BTXEngine.BTXMessageState.GetPart(Int32 partIndex)
at Microsoft.XLANGs.Core.XMessage.GetUnderlyingPart(Int32 partIndex)
at Microsoft.XLANGs.Core.Part._slowProtectedRegisterWithValueTable()
at Microsoft.XLANGs.Core.Part.ProtectedRegisterWithValueTable()
at Microsoft.XLANGs.Core.Part.CopyFrom(Part part)
at Microsoft.XLANGs.Core.XMessage.CopyFrom(XMessage srcMsg)
at BtsLoadScenario.SingleOrch.segment1(StopConditions stopOn)
at Microsoft.XLANGs.Core.SegmentScheduler.RunASegment(Segment s, StopConditions stopCond, Exception& exp)
.. or ..
Event Type: Error
Event Source: XLANG/s
Event Category: None
Event ID: 10041
Date: 9/26/2007
Time: 11:16:39 AM
User: N/A
Computer: BPI2X64PERF04
Description:
Suspending due to exception occurring during dehydration of instance d1c19376-7008-4b37-ad95-ce2a6af85964.
Exception type: DehydrationFailedException
Additional error information:
Exception occurred when persisting state to the database.
Exception type: PersistenceException
Source: Microsoft.XLANGs.BizTalk.Engine
Target Site: Void Commit()
The following is a stack trace that identifies the location where the exception occured
at Microsoft.BizTalk.XLANGs.BTXEngine.BTXXlangStore.Commit()
at Microsoft.BizTalk.XLANGs.BTXEngine.BTXXlangStore.Dehydrate(Boolean fCommit)
at Microsoft.XLANGs.Core.Service.Persist(Boolean dehydrate, Context ctx, Boolean idleRequired, Boolean finalPersist, Boolean bypassCommit, Boolean terminate)
at Microsoft.BizTalk.XLANGs.BTXEngine.BTXService.PerformDehydration(Boolean wait)
at Microsoft.XLANGs.Core.Service.Dehydrate()
.. and ultimately perhaps ..
Event Type: Error
Event Source: Service Control Manager
Event Category: None
Event ID: 7031
Date: 9/26/2007
Time: 11:17:33 AM
User: N/A
Computer: BPI2X64PERF04
Description:
The BizTalk Service BizTalk Group : XlangHost service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.
Obviously SQL Server is not in a good way with so much work for the MessageBox over so many connections.
200 ’MaxWorkerThreads’, 20 ’DB Connections Per CPU’ throttling
After the above test I reset the environment and tested the "DB Connections Per CPU" throttling setting. I set this to 20, and as it is a per-CPU setting, this equates to 40 connections per host instance on the dual-proc test machines. The results of this test are below:
Here 58,400 test messages were sent to the receive queue. SQL CPU averaged a little better at 40.1% (though it still dropped to 0% a few times), and SQL still peaked at 1,037 connections! A lot more of the messages were received – around 37,900 – but the system still only got through 10,200 XLANG instances (double the previous test, but not great). Still worrying is the fact that SQL often locked itself up to the point of 0% CPU utilization, and I still saw some of the same event log errors highlighted in the previous run.
The chart below, showing different counters for this same run, highlights a potential problem with this approach. If you throttle on "DB Connections", then any time you get more active XLANG instances than that threshold you may start to throttle (if they are all trying to access the database at the same time – such as during a burst of traffic). The counters below show that the session count is much higher than the threshold, so the "Message publishing throttling state" is pretty much stuck on '8' ("Throttling due to high session count"). Because of this the publishing delay rises and rises until it maxes out at 5 minutes (the default of 300,000 ms). It resets around the 26-minute mark when the host restarts. While marginally better than the first run, this is still far from an ideal result.
100 ’MaxWorkerThreads’, default (0) ’DB Connections Per CPU’ throttling
The next test drops the number of worker threads per CPU for each host to 100 (via the MaxWorkerThreads value in the registry for the host service). The results of this run are shown below:
Now this is more like it! ~57,700 test messages were dropped, and this time the CPU utilization on SQL was a respectable 76.6%. The maximum SQL connections value is down to 694. Most importantly, all dropped documents were received by BizTalk by the end of the run, and a whopping 39,400 XLANG instances completed! The throttling picture looks much healthier too, with the publishing delay never rising above a few seconds, and then only briefly, due to rate imbalance (and a little due to DB growth at the end). Delivery throttling for inbound/outbound rate imbalance and in-memory messages is respectable, and to be expected with such bursts of load.
Most importantly, in this last test there were no SQL lock-ups and no errors on any of the BizTalk hosts! This environment is pretty solid even under extreme load.
Default (25) ’MaxWorkerThreads’, default (0) ’DB Connections Per CPU’ throttling
As a final test, everything was run with just the default settings. Without the registry keys specified, the .NET thread pool has a maximum of 25 worker threads per CPU. The results for this run are below:
The numbers for this are very similar to the previous test. ~58,600 test messages were dropped. SQL CPU utilization is a little higher at an average of 81.7%. Due to the lower number of threads, the number of SQL connections peaked at a much lower 480. Again, all dropped documents were received, and 36,700 XLANG instances completed – a touch lower (~7.4%) than the prior test. The throttling graph for this last test was also very similar to the previous test, so I won't include it here.
Summary
The final conclusion from this testing is that you need to be mindful of the load you are putting on the back-end as you scale out your environment. There may be valid reasons why you need more threads in your thread pool (KB article http://support.microsoft.com/kb/900455 describes, for example, needing to increase this when using orchestrations that cannot dehydrate), but be aware of the equation "number of hosts * max thread-pool size * number of CPUs = potential DB sessions". Too many connections all trying to read and write to your MessageBox at the same time can overwhelm SQL Server, locking up the system and reducing your throughput dramatically.
Be aware of the type of work your host is doing and the settings that control the threads it uses. The MaxWorkerThreads setting outlined here controls the threads available to the XLANG engine and to the SOAP and HTTP send adapters. The messaging engine, however, uses its own thread pool, which can be configured as described in the documentation at http://msdn2.microsoft.com/en-us/library/aa577604.aspx . If you do need a large thread pool, then consider throttling on the number of DB sessions to reduce the volume of requests over those sessions under high load – but this only adds limited stability, and it may not be sufficient protection in environments with large, sudden bursts of work. Ideally, start with the defaults and only adjust your thread pool if testing shows it to be beneficial and stable. In a well-designed system, more threads often just means more context switching and locking, not more completed work!
We all know that testing is important for a successful implementation, but testing with a realistic environment is key to avoiding unpleasant surprises in production. This means testing with realistic hardware (thread-pool size – like many settings – is per CPU), an appropriately sized test environment (the number of running hosts, not just the work they do, is highly relevant – nothing scales linearly!), a realistic load pattern (throttling state, current sessions, etc. all behave much differently under bursts than under steady load), and realistic response times (do correlated or response messages really return instantly in production? This directly affects how long your orchestrations run and the number of active subscriptions at any one time – both key to performance).
I hope this has provided some insight and saved a few headaches.
– Bill Ticehurst