Memory Growth in BizTalk Messaging

By Kartik Paramasivam and Raied Malhas


In this document we list the various reasons that can lead BizTalk into an Out of Memory situation, and then suggest mitigations/solutions for each condition.


1.  Introduction to BizTalk Hosts:
Before discussing Out of Memory conditions, let us talk about hosts and how adapters are mapped to hosts.


Adapters (receive and send) and orchestrations run under the BizTalk NT Service. An instance of the BizTalk NT Service is called a host instance. Some receive adapters (HTTP/SOAP) can instead run in another process (e.g. IIS) rather than the BizTalk NT Service. Such adapters are called isolated adapters.
A host in BizTalk has one or more host instances (processes). The default BizTalk installation has 2 hosts:
1) Default host
2) Isolated host


Note: You can only have 1 host instance per host on a given server.


The following is the mapping between end points (receive or send port) and hosts:
Send Port -> Send Handler -> Send Host
Receive location -> Receive Handler-> Receive Host


Typically you should create separate receive hosts for receive adapters and separate send hosts for send adapters. Doing this gives each adapter its own process to live in, which helps ensure that one adapter will not adversely affect another. Also, if a BizTalk host/process goes Out of Memory, you will know which components were running in that process.


2. The Obvious Suspects for Out of Memory conditions:


When the process goes Out Of Memory (OOM), you first need to find out which components are running in that process. If possible, follow the recommendations listed above for separating receive adapters, send adapters, and orchestrations into different hosts. Once you have done that, check whether one of the following three conditions exists in the process that is going OOM.


1) You are executing transforms/maps on relatively large messages in a receive/send port or in XLANG. The point here is that XSL transforms load the whole message into memory in order to transform it.


Solution 1: Decrease the number of messages that your process operates on concurrently (the section on the Send Host below explains how).
Solution 2: Decrease the size of the XML message that you are trying to transform.


2) You are executing an XML receive/send pipeline on a document that contains one or more items such as the following:
– Large attribute values
– Large element values
– Large attribute or element tags.


If one of the above applies to your XML document, then that item is fully loaded into memory.


Solution 1: Try to limit the size of the above entities.
Solution 2: If you can’t limit the size, then make sure your process doesn’t operate on multiple such documents concurrently. (See the sections below on limiting concurrency.)


3) You have a custom pipeline component or adapter which loads the whole document into memory. Most of the components shipped with BizTalk (except transforms) support streaming, as opposed to loading the whole document into memory, and hence have a low memory footprint. However, we have observed that custom pipeline components written by customers may or may not support streaming.


3. Out of Memory in Send Host:


The Send Host of the BizTalk adapters can run Out of Memory when it is processing a large number of messages (i.e. when the system is under high stress loads). Under such loads the Send Host memory utilization rises rapidly, pushing the system into an Out Of Memory condition. Usually, the larger the messages being processed, the faster the memory utilization of the Send Host process will grow.


Calculating the Maximum number of messages that will be loaded in-memory by BizTalk’s Send Host process:
Under high stress loads, the BizTalk send host process will load as many message instances into memory as it can, until the number of messages exceeds the “HighWatermark”. By default in BizTalk Server 2004 SP1, this value is 200 for messaging. (Below you’ll learn how to configure this value; more details on watermark settings are available in the “BizTalk Server Performance Characteristics” document at http://www.gotdotnet.com/team/wsservers/.) So, if you have a dual-processor server that is processing a large number of messages, the BizTalk host will load a maximum of:
[HighWatermark Value] * NumberOfProcessors
= 200 * 2
= 400 messages in-memory



For a Hyper-threaded server:
If you have a hyper-threaded server, then the number of processors perceived by the BizTalk send host is double the actual number of physical processors. So for a dual-processor hyper-threaded BizTalk server with the default “HighWatermark” value of 200, the BizTalk process hosting the Send Host will load a maximum of:
[HighWatermark Value]*[NumberOfProcessors for a Hyper-threaded machine]
= 200 * [NumberOfProcessors * 2]
= 200 * [2 * 2]
= 200 * 4
= 800 messages in-memory


So, in a case where you have a dual-processor hyper-threaded BizTalk server with 2 GB of RAM that is under high stress, there is a possibility that the SendHost will get into an Out of Memory state. The process in this case will be holding 800 messages (as calculated above) in memory. The following graph shows an example of such a case, where the memory usage of the send host process grew to 1.5 GB in less than 15 minutes:


 
Graph1: An example of a test that hit an Out of Memory condition in a SendHost containing the FILE adapter


Note: The memory footprint of a BizTalk host can be monitored in Performance Monitor by viewing the “Private Bytes” counter for the relevant host instance (receive, send, etc.) under the “Process” object.
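For example, you can also sample this counter from the command line with typeperf (a standard Windows tool). The process name below assumes the default BizTalk host service executable, BTSNTSvc; with multiple in-process host instances on one box the instance may appear as BTSNTSvc#1, and so on:

typeperf "\Process(BTSNTSvc)\Private Bytes" -si 5

The -si 5 switch samples the counter every 5 seconds.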


Note: On a 32-bit box, the process can only grow to around 1.5 GB max (2 GB in some rare cases), even if you have more physical memory to spare (e.g. 8 GB of RAM).



How to avoid Out of Memory in the SendHost:
Fortunately, the Out of Memory condition can be avoided by configuring the setting that controls the maximum number of messages that can be loaded into memory concurrently.


To configure the maximum number of message instances, open the “adm_ServiceClass” table in BizTalk’s Management Database (default name: “BizTalkMgmtDB”). In the “Messaging InProcess” row, the default value for LowWatermark is 100 and for HighWatermark is 200. The HighWatermark value dictates the maximum number of message instances that can be loaded into the Send Host process’s memory. Change the Low and High Watermark values to lower numbers, for example (a sample UPDATE statement is sketched after these values):
LowWatermark = 25
HighWatermark = 50
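A minimal T-SQL sketch of this change follows, assuming the default database name and that the row is identified by a Name column as quoted above. Back up the database before editing BizTalk system tables directly, and note that a host instance restart may be required for the new values to take effect:

USE BizTalkMgmtDB
GO
-- Lower the in-memory message limits for the in-process messaging service class.
UPDATE adm_ServiceClass
SET LowWatermark = 25, HighWatermark = 50
WHERE Name = 'Messaging InProcess'   -- row name as quoted above
GO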


If the process still hits OOM conditions with those values (25/50), decrease them further until the Out of Memory condition no longer occurs.


The following graph shows the same test case as the graph above, but with the Low and HighWatermark values changed to 25 and 50, respectively. Note how the memory remained constant instead of growing into an Out of Memory condition:




Graph2: An example of a test that avoided hitting an Out of Memory condition in a SendHost containing the FILE adapter


Note: Decreasing the Watermark values can decrease performance, so only lower them if you actually run into an Out of Memory condition.



4. Out of Memory in Receive Hosts:


It is fairly rare to see a host containing a receive adapter go out of memory. If it happens, it is typically due to one of the cases listed in section 2.


If none of the recommendations in section 2 can be applied, then reducing the number of messages processed concurrently can help protect the receive host from getting into an out of memory condition.


Note: Remember however that by doing so, we decrease the concurrency and hence the throughput/performance will decrease.


1) Reduce the messaging engine thread pool size: A user can control the number of threads used by the messaging engine to publish messages into the message box. By reducing the number of threads used by the messaging engine, we reduce the rate at which the receive adapter publishes messages into the message box. Remember, this setting only needs to be applied to the host corresponding to the adapter’s receive handler.


By default the messaging engine creates 10 threads per CPU for publishing messages.


To specify the messaging engine thread pool size:
1. Click Start, and then click Run.
2. In the Run dialog box, in the Open box, type regedit, and then click OK.
3. In Registry Editor, expand HKEY_LOCAL_MACHINE, expand SYSTEM, expand CurrentControlSet, and expand Services. Determine which of the BTSSvc* keys corresponds to the host for the receive handler.
4. Right-click that key, point to New, click DWORD Value, type MessagingThreadPoolSize, and then press ENTER.
5. Double-click MessagingThreadPoolSize. In the Edit DWORD Value dialog box, in the Value data box, type a number between 1 and 30, and then click OK.
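Equivalently, once you have identified the right BTSSvc* key, the value can be created from a command prompt with reg.exe. The GUID and the value 5 below are placeholders (the documented range is 1 to 30), and restarting the host instance is typically required for the new value to be picked up:

reg add "HKLM\SYSTEM\CurrentControlSet\Services\BTSSvc{your-host-guid}" /v MessagingThreadPoolSize /t REG_DWORD /d 5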


5. Note for Custom Adapters:


If you have checked all of the above conditions and you think your custom receive/send adapter is causing the OOM condition, then you need to look at the source code of your adapter and verify the following:


The following only applies to adapters written in managed code:
The adapter gets an object of type IBTTransportBatch using the GetBatch() API on the TransportProxy object. Once the adapter is done using the IBTTransportBatch object, it needs to call Marshal.ReleaseComObject(batch) in a loop (until the returned reference count reaches zero) to release the object.


Basically, the .NET garbage collector does not kick in in time to release the unmanaged memory, so failure to call ReleaseComObject() will result in what looks like a memory leak.
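The following C# fragment is a minimal sketch of this pattern. Only GetBatch() and Marshal.ReleaseComObject() come from the description above; the class shape, the namespace, and the null callback arguments are illustrative assumptions for an adapter project:

using System.Runtime.InteropServices;
using Microsoft.BizTalk.TransportProxy.Interop; // assumed location of IBTTransportProxy/IBTTransportBatch

public class ReceiveEndpoint
{
    private IBTTransportProxy transportProxy; // handed to the adapter by BizTalk

    public void SubmitMessages()
    {
        // Callback arguments are illustrative; pass whatever your adapter uses.
        IBTTransportBatch batch = transportProxy.GetBatch(null, null);
        try
        {
            // ... add messages to the batch and call Done() on it ...
        }
        finally
        {
            // The .NET GC will not release the underlying COM object's
            // unmanaged memory promptly, so release it explicitly.
            // ReleaseComObject returns the remaining reference count,
            // hence the loop.
            while (Marshal.ReleaseComObject(batch) > 0) { }
        }
    }
}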


 

Low-latency scenarios that use two-way (solicit-response) HTTP send ports.

HttpOutCompleteSize


This setting controls the size of the batch of messages that is returned from the HTTP send adapter. The default value is 5. If the buffer is not full and there are outstanding responses, the adapter waits up to 1 second before committing the batch. For low-latency scenarios this should be set to 1, which allows the adapter to send response messages to the message box for processing immediately. This has the greatest effect during times of low-throughput activity with varied response times from backend systems.


In fact, this setting controls the number of messages being returned to BizTalk (the EPM) from the HTTP adapter regardless of whether the port is one-way or two-way. A message returned in the batch carries a request such as DeleteMessage, MoveToNextTransport, MoveToSuspendedQ, or, most interestingly, SubmitResponseMessage. Although this setting improves response times in low-latency scenarios, it increases the amount of chattiness between the adapter and the message box.


It is the send-side equivalent of HttpBatchSize on the HTTP receive adapter. In low-latency scenarios it is typically set to 1 to ensure that messages are processed as quickly as they are received.


To set this value you need to add HttpOutCompleteSize as a DWORD value to the registry under:


HKLM\SYSTEM\CurrentControlSet\Services\BTSSvc{guid}\


where {guid} is the ID of the host for the HTTP send handler.
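For example, using reg.exe (the GUID below is a placeholder for your send host’s service key):

reg add "HKLM\SYSTEM\CurrentControlSet\Services\BTSSvc{your-send-host-guid}" /v HttpOutCompleteSize /t REG_DWORD /d 1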


 


This information is specific to BizTalk Server 2004 and there is no guarantee that this setting will behave the same way in the next release of BizTalk Server.


 


Disclaimer: This is an undocumented setting and was not formally tested since it was not considered to be publicly available.  You must do your own testing to ensure that the behavior and performance of your system are as expected.


 


Nine years and counting…

My name is David Messner – Welcome to my WebLog.  This is the second time I’ve typed up my initial posting – the first time IE crashed and I lost all that I wrote.  Hmmff.  I guess one of my first rantings will have to be on software reliability, but I’ll save that for another time.  Microsoft is overall making some pretty good inroads in this area, I think (e.g. Windows error reporting), but clearly we’ve got room to improve.


I’m the development manager for Commerce Server.  Commerce Server was the first enterprise server product at Microsoft to embrace the .NET Framework.  That was in the CS2002 release, and I was one of the architects of the .NET integration for the product (I owned the configuration-driven HttpModule runtime framework, among other components).  Our goal was to make it much easier to build a commerce-enabled site, and I think we did that, but clearly there’s more work to do!  And that’s what my team is all about – continuing to improve the quality of the product as we add great new features for our 2006 release.


I’ve worked for Microsoft for almost nine years now (hard to believe!).  What’s it like to work here?  Well, the things that are constant are that you’re always learning and growing – the pace of change here can be truly frantic at times.  While the challenges are great, I’d say that the rewards are just as great.  If you’re up to working in a fast-paced environment where you will learn a lot and where the rewards are commensurate with your contributions, we are currently looking for a talented software engineer to join the team.  Apply through Microsoft.com/careers (don’t send me direct email – I will deposit it directly into the circular file).


So what can you expect to find in my ‘blog?  Well, hopefully useful information and how-to articles about Commerce Server, .NET, and software development in general.  And of course you may occasionally find rantings and ravings about whatever happens to be on my mind at the time.


-djm

Understanding BizTalk Server 2004 SP1 Throughput and Capacity


By Wayne Clark


What is Sustainable?


Of primary concern when planning, designing, and testing business solutions built on BizTalk Server 2004 SP1 is that the solutions must be able to handle the expected load and meet required service levels over an indefinite period of time.  Given the number of solution architectures, configurations, and topologies possible on BizTalk Server 2004 SP1, there are many things to consider when evaluating a proposed or existing deployment.  The purpose of this, our inaugural BizTalk Performance blog posting, is to provide guidance on:




  • Understanding BizTalk Server 2004 SP1 throughput and backlog capacity behavior, and how to observe the behavior of your system.

  • Critical success factors when planning for capacity.

Let me start off by defining some terms and concepts:


Maximum sustainable throughput is the highest load of message traffic that a system can handle indefinitely in production.  Typically this is measured and represented as messages processed per unit time.  Solution design choices such as the choice of adapters, pipeline components, orchestrations, and maps will all have a direct effect on system performance.  In addition, BizTalk offers scale-up and scale-out options that provide flexibility when sizing a system.  Often overlooked, however, are things like standard operations, monitoring, and maintenance activities that have an indirect effect on sustainable throughput.  For example, performing queries against the messagebox database (e.g., from HAT) will require cycles from SQL and affect overall throughput depending on the type and frequency of the query.  Backup, archiving, and purging activities on the database also have an indirect effect on throughput, and so on.


Engine capacity, also known as backlog capacity, is the number of messages that have been received into the messagebox but have not yet been processed and removed from it.  This is easily measured as the number of rows in the messagebox database table named spool.
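For example, assuming the default messagebox database name, the current backlog can be checked with a simple count.  The NOLOCK hint keeps the probe from contending with BizTalk’s own traffic, the same style used by the stored procedure shown later in this posting:

select count(*) from BizTalkMsgBoxDb..spool with (NOLOCK)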


BizTalk Server 2004 SP1 Backlog Behavior


BizTalk 2004 SP1 implements a variety of store-and-forward capabilities for messaging and for long- and short-running orchestrations.  The messagebox database, implemented on SQL Server, provides the storage for in-flight messages and orchestrations.  As messages are received by BizTalk 2004 SP1, they are en-queued, or published, into the messagebox database so they can be picked up by subscribers to be processed.  Subscribers include send ports and orchestrations.  Some of the arriving messages activate new subscriber instances.  Other messages arrive and are routed, via a correlation subscription, to a waiting instance of an already-running subscriber such as a correlated orchestration.


In order for correlated orchestrations to continue processing, arriving correlated messages must not be blocked.  To facilitate this, BizTalk does its best to make sure messages (both activating and correlated) continue to be received, even under high load, so that subscribers waiting for correlated messages can finish and make room for more processes to run.  This means it is possible to receive messages faster than they can be processed and removed from the messagebox, thus building up a backlog of in-process messages.  Being a store-and-forward technology, it is only natural for BizTalk to provide this type of buffering.


Every message that is received by, or created within, BizTalk 2004 SP1 is immutable.  That is, once it has been received or created, its content cannot be changed.  In addition, received messages may have multiple subscribers.  Each subscriber of a particular message references the same, single copy of that message.  While this approach minimizes storage, a ref count must be kept for each message, and garbage collection must be performed periodically to get rid of messages whose ref count has dropped to 0.  There is a set of SQL Agent jobs in BizTalk 2004 SP1 that performs this garbage collection:



  • MessageBox_Message_Cleanup_BizTalkMsgBoxDb – Removes all messages that are no longer being referenced by any subscribers.
  • MessageBox_Parts_Cleanup_BizTalkMsgBoxDb – Removes all message parts that are no longer being referenced by any messages.  All messages are made up of one or more message parts, which contain the actual message data.
  • PurgeSubscriptionsJob_BizTalkMsgBoxDb – Removes unused subscription predicates left over from things like correlation subscriptions.
  • MessageBox_DeadProcesses_Cleanup_BizTalkMsgBoxDb – Called when BizTalk detects that a BTS server has crashed; releases the work that server was processing so another machine can pick it up.
  • TrackedMessages_Copy_BizTalkMsgBoxDb – Copies tracked message bodies from the engine spool tables into the tracking spool tables in the messagebox database.
  • TrackingSpool_Cleanup_BizTalkMsgBoxDb – Flips which table the TrackedMessages_Copy_BizTalkMsgBoxDb job writes to.

The first two from the above list are the ones responsible for keeping the messagebox cleared of garbage messages on a regular basis.  To do their work, they sort through the messages and message parts, looking for messages with a ref count of 0 and for parts that are not referenced by any message, respectively, and remove them.


So, what does all this have to do with throughput and capacity?


When a system is at steady state, that is, processing and collecting garbage as fast as messages are received, it is clearly sustainable indefinitely.  However, if for some length of time the system receives faster than it can process and remove, messages start to build up in the messagebox.  As this backlog builds, the amount of work that the cleanup jobs have to do increases, and they typically start taking longer and longer to complete.  In addition, the cleanup jobs are configured with low deadlock priority.  As a result, when running under high load, the cleanup jobs may start to fail as deadlock victims.  This allows the messages being en-queued to take precedence and not be blocked.


As an example, let’s take a look at a system that we have driven at various throughput levels and investigate the observed behavior.  The system is configured as follows:




  • Two BizTalk Servers – These servers are HP DL380 G3, equipped with dual 3GHz processors with 2GB of RAM.  BizTalk 2004 SP1 is running on these two servers.  Local Disks.

  • One SQL Server Messagebox – This server is an HP DL580 G2, equipped with quadruple 1.6GHz processors with 4GB of RAM.  This server is connected to a fast SAN disk subsystem via fiber.  The server is dedicated to the messagebox database and the data and transaction log files for the messagebox database are on separate SAN LUNs.
  • One SQL Server All Other Databases – This server is an HP DL580 G2, equipped with quadruple 1.6Ghz processors with 2GB of RAM.  This server is also connected to the SAN.  This server houses all BizTalk databases other than the messagebox, including the management, SSO, DTA, and BAM databases.
  • Load Driver Server – This server is an HP DL380 G3, equipped with dual 3GHz processors with 2GB of RAM.  This server was used to generate the load for testing the system using an internally developed load generation tool.  The tool sends copies of a designated file to shares on the BizTalk servers to be picked up by the file adapter.

The Test Scenario


The test scenario is very simple.  The load generation tool distributes copies of the input file instance evenly across shares on both BizTalk servers.  Using the file adapter (we’ll explore other adapters in subsequent blog entries), files are picked up from the local share and en-queued into the messagebox.  A simple orchestration containing one receive and one send shape subscribes to each received message.  Messages sent back into the messagebox by the orchestration are picked up by a file send port and sent to a common share, defined on the SAN.  Files arriving on the output SAN share are immediately deleted in order to avoid file buildup on that share during long test runs.


There are 4 hosts defined for the scenario: one to host the receive location, one to host orchestrations, one to host the send port, and one to host tracking.  For the purposes of observing engine backlog behavior, tracking is completely turned off during the test runs.  Turning tracking off involves more than just stopping (or not creating) a tracking host instance.  To turn tracking completely off, use the WMI property MSBTS_GroupSetting.GlobalTrackingOption.  For more information on turning tracking on and off using this property, please see: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sdk/htm/ebiz_sdk_wmi_msbts_groupsetting_fqxz.asp.


Both BizTalk servers are identical in that they each have instances of the receive host, orchestration host, and send host.  No instances of the tracking host were created since tracking was turned off to isolate core messagebox behavior for these tests.


A simple schema was used and the instance files used for the test were all 8KB in size.  No mapping or pipeline components were used inbound or outbound in order to keep the test scenario simple to implement and keep the behavior observations focused on the messagebox.


As a first step, the system is driven at a level near, but below, its maximum sustainable throughput so that observations of a healthy system can be made.  The growth rate of the messagebox backlog is a key indicator of sustainability.  Clearly, the messagebox cannot continue to grow indefinitely without eventually running into problems.  So, the depth of the messagebox database backlog, monitored over time, is used to evaluate sustainability.  The messagebox table named spool contains a record for each message in the system (active or waiting to be garbage collected).  Monitoring the number of rows in this table, and the number of messages received per second, while increasing system load provides an easy way to find the maximum sustainable throughput.  Simply increase the input load until either (a) the spool table starts to grow indefinitely or (b) the number of messages received per second plateaus, whichever comes first; that is your maximum sustainable throughput.  Note that if you are not able to drive a high enough load to cause the spool table to grow indefinitely, it simply means that the slowest part of your system is on the receive side, rather than the processing/send side.  The following graph shows key indicators after using this approach to find the maximum sustainable throughput of our test system (described above).



The blue line shows the total messages received per second by the system (i.e., for both BizTalk servers), the pink line shows the running average of the messages received per second, and the yellow line shows the spool table depth (x 0.01) over the test duration given on the X axis.  What this graph shows is that, for the hour of the test, the spool was stable and not growing, making the sustainable throughput equal to the number of messages received per second: in this case, 150 msgs/sec.


Part of any analysis of a BizTalk deployment performance should include checking some key indicators to understand resource bottlenecks.  The key measures and their values used for this deployment running under maximum sustainable throughput (i.e., the test in the graph above) were as follows:


CPU Utilization:
            BTS Servers (each):     Avg CPU Utilization = 59%
            MsgBox DB Server:     Avg CPU Utilization = 54%
            Mngmt DB Server:       Avg CPU Utilization = 13%

Physical Disk Idle Time:
            MsgBox DB Server, Data File Disk:                             Avg Disk Idle Time = 69%
            MsgBox DB Server, Transaction Log File Disk:            Avg Disk Idle Time = 98%

SQL Locks:
            MsgBox DB Server:     Avg Total Lock Timeouts/Sec = 1072
            MsgBox DB Server:     Avg Total Lock Wait Time (ms) = 40

Cleanup Jobs:
            MessageBox_Message_Cleanup_BizTalkMsgBoxDb: Typical Runtime = 30 Sec
            MessageBox_Parts_Cleanup_BizTalkMsgBoxDb:        Typical Runtime = 15 Sec

Event Log:
            No errors in any of the server application event logs.


 


From these data, we can draw the following conclusion: there are no obvious resource bottlenecks in our system.  All of these indicators are well within healthy limits.  CPU and disk idle times show that there is plenty of headroom; neither is close to being pegged.  The SQL lock indicators look good: Lock Timeouts/sec doesn’t start to become an issue until around 5000 or so (depending on your SQL Server), and lock wait times under 0.5–1 second are also healthy.  Finally, the cleanup jobs are completing successfully every time and are taking 30 seconds or less to run.  If the cleanup jobs start failing often, or start taking over a minute, it is an indication that the system is being over-driven, and this will typically be accompanied by an increasing spool depth.
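One quick way to review the cleanup jobs’ recent outcomes and durations is SQL Agent’s job history, for example via the standard msdb procedure (shown here for the message cleanup job):

EXEC msdb.dbo.sp_help_jobhistory @job_name = N'MessageBox_Message_Cleanup_BizTalkMsgBoxDb'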


TIP:  You can expose the number of rows in the spool table by using the user-defined counter capability provided by SQL Server.  Create a stored procedure (in your own database) as follows:



CREATE PROCEDURE [dbo].[SetSpoolCounter] AS
SET NOCOUNT ON
SET TRANSACTION ISOLATION LEVEL READ COMMITTED
-- Run at low deadlock priority so this monitoring probe is the victim,
-- never BizTalk's own messagebox traffic.
SET DEADLOCK_PRIORITY LOW
declare @RowCnt int
-- NOLOCK keeps the count from taking shared locks on the spool table.
select @RowCnt = count(*) from BizTalkMsgBoxDB..spool with (NOLOCK)
-- Publish the row count as SQL Server's User Settable counter instance 1.
execute sp_user_counter1 @RowCnt
GO


By running this stored procedure periodically (e.g., once per minute) as a SQL Agent job, you can add the depth of the spool table to your counters in Performance Monitor.  For more information, search on sp_user_counter1 in the SQL books on line.
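A minimal sketch of such a job follows.  The job and schedule names are illustrative, and OpsDb stands in for whatever database you created SetSpoolCounter in:

USE msdb
GO
EXEC sp_add_job @job_name = N'Monitor Spool Depth'
EXEC sp_add_jobstep @job_name = N'Monitor Spool Depth',
    @step_name = N'Set spool counter',
    @subsystem = N'TSQL',
    @database_name = N'OpsDb',           -- assumed home of SetSpoolCounter
    @command = N'EXEC dbo.SetSpoolCounter'
EXEC sp_add_jobschedule @job_name = N'Monitor Spool Depth',
    @name = N'Every minute',
    @freq_type = 4,                      -- daily
    @freq_interval = 1,
    @freq_subday_type = 4,               -- repeat in units of minutes
    @freq_subday_interval = 1            -- every 1 minute
EXEC sp_add_jobserver @job_name = N'Monitor Spool Depth'  -- target the local server
GO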


For additional useful messagebox queries, check out Lee Graber’s paper on advanced messagebox queries: http://home.comcast.net/~sdwoodgate/BizTalkServer2004AdvancedMessageBoxQueries.doc


Now that we have shown how to find the maximum sustainable throughput and seen what the key indicators look like for a sustainable, healthy system, let’s explore some behavior associated with a system that is receiving faster than it is processing and collecting garbage.


To simulate a continuously overdriven system, we configured the load generation tool to send in about 175 msgs/sec, 25 msgs/sec more than our measured maximum sustainable throughput.  The test was designed not only to overdrive the system, but also to get an idea of how long it would take to recover from a spool backlog depth of around 2 million messages.  To accomplish this, we drove the system at the increased rate until the spool depth was around 2 million and then stopped submitting messages altogether.  The following graph shows the same indicators as the sustainable graph above.



As can be seen from the graph, the spool depth started building up immediately, peaking at just above 2 million records.  At this rate, it took just under 3 hours to get to the targeted 2 million record backlog.  After the load was stopped, it took around 4.5 hours for the cleanup jobs to recover from the backlog.


Note that, even though the receive rate started out at 175 msgs/sec, it didn’t take long for it to degrade to an average below our maximum sustainable throughput.  This is primarily due to the throttling that BizTalk provides and to increased lock contention on the messagebox.  During the overdrive period of the test, BizTalk throttled the receiving of messages (by blocking the thread the adapters submit messages on) based on the number of database sessions opened between a host instance and the messagebox database.  BizTalk writes messages to the application event log when it starts and stops throttling.


Taking a look at our other key indicators during this test, we see some interesting trends.  Consider the following graph showing the physical disk idle time for the messagebox data file disk, the average CPU utilization (%) for the messagebox server, and the average lock timeouts per second on the messagebox database (x 0.01).



Comparing this graph to the one above it, we can see that, while the system is being overdriven and the spool is building up, the disk gets more and more saturated (i.e., disk idle time is trending down).  Also notice that, once the cleanup jobs are given free rein after the load is stopped, disk idle time drops to near zero.  If it weren’t for the fact that the cleanup jobs are configured for low deadlock priority, they would take much more of the disk I/O bandwidth even earlier in the cycle.  Instead, what we see from the job histories is that, while the load is still underway, they fail nearly every time they are executed because of lock contention (as indicated by the avg lock timeouts/sec).  Once the lock contention is reduced (at the point the load is stopped), the cleanup jobs are able to succeed and begin removing messages from the spool.  It’s interesting to note that the message cleanup job ran only twice after the load was stopped, but ran for hours each time in order to collect all the unreferenced messages.


Not shown in the above graph, the lock wait times were also quite high, averaging 7 seconds during the load period, and then dropping to normal sub-100ms levels during the recovery period.


Floodgate Scenarios


QUESTION:  “But what if I only have one or two “floodgate” events per day?  Do I really have to build a system that will handle these peaks with no backlog, only for it to sit idle the rest of the time?”


ANSWER: Of course not.  As long as the system can recover from the backlog before the next floodgate event, you will be fine.


There are a number of scenarios where there are just a few large peaks (a.k.a., “floodgate events”) of messages that arrive at the system each day.  Between these peaks, the throughput can be quite low.  Examples of these types of scenarios include equity trading (e.g., market open and market close) and banking systems (e.g., end of day transaction reconciliation). 


Other types of events cause backlog behavior similar to floodgate events.  For example, if a partner send address goes offline so that messages destined for that address must be retried and/or suspended, a backlog can build up.  When the partner comes back online, there may be a large number of suspended messages that need to be resumed, resulting in another type of floodgate event.


To illustrate how this works, consider a third test of our system, as follows.  We drove the system at around half the maximum sustainable throughput.  This was, of course, very stable.  Then, to simulate a floodgate event, we dropped 50,000 additional messages all at once (as fast as we could generate them) and monitored the system.  The graph below provides our now-familiar indicators of messages received per second and spool depth.



Note from the graph that the spool indeed built up a backlog during the floodgate event.  However, because the event was relatively short lived and the subsequent receive rate after the event was below the maximum sustainable rate, the cleanup jobs were able to run and recover from the event without requiring a system receive “outage”.


Of course, every system is different, so “your mileage will vary”.  The best way to verify that you can recover is to test with a representative load before going into production.


Findings and Recommendations


Know your load behavior profile:  As our three examples above have shown, it is critical to know the profile of your load in terms of messages processed over time.  The better this is understood, the more accurately you can test and adjust your system capacity.  If all you know is your peak throughput requirement, then the most conservative approach would be to size your system so that your maximum sustainable throughput is the same or higher than your peak load.  However, if you know that you have predictable peaks and valleys in your load, you can better optimize your system to recover between peaks, resulting in a smaller, less expensive overall deployment.


Test performance early:  A common situation that we encounter with customers is that they have invested significant effort in designing and testing the functionality of their scenario, but have waited until the last minute to investigate its performance behavior on production hardware.  Run performance tests on your system, simulating your load profile, as early as you possibly can in your development cycle.  If you have to change anything in your design or architecture to achieve your goals, knowing this early will give you time to adjust and test again.


Emulate your expected load profile when testing performance:  There are two primary components to this: 1) emulate the load profile over time and 2) run the test long enough to evaluate if it is sustainable.  If, like many customers, your cycles are daily in nature, you should plan to run performance tests for at least one day to validate sustainability.  The longer the tests, the better.


Emulate the production configuration:  For example, the number and type of ports, the host and host instance configuration, database configuration, and adapter setup.  Don’t assume that differences between the test and production configurations will be insignificant from a performance standpoint.


Use real messages:  Message sizes and message complexity will have an effect on your performance, so be sure to test with the same message schemas and instances that you plan to see in production.


Emulate your normal operations during performance tests:  Though the examples above did not include them, standard operations activities such as periodic database queries, backups, and purging will affect your sustainable throughput, so make sure you are performing these tasks during your performance and capacity test runs.  This includes both DTA and BAM tracking, if you plan to use them in production.


The speed of the I/O subsystem for the messagebox is a key success factor:  Remember that, for this scenario, we are using a fast SAN for the messagebox database files that is dedicated to this build-out.  Even in this case, the cleanup jobs were able to drive the idle time to near zero for the SQL data file.  The I/O subsystem is the most common bottleneck we have seen in customer engagements.  A common strategy to optimize SQL I/O, for example, is to place the database data file(s) and log file(s) on separate physical drives, if possible.


Make sure the SQL Agent is running on all messagebox servers:  The cleanup jobs will never run if the SQL Agent is not running, so make sure it is.


Spool depth and cleanup job run time are key indicators:  Regardless of other indicators, these two measures will give you a quick and easy way to assess whether your system is being over-driven.


Acknowledgements


I would like to thank the following contributors to this blog entry:


Mitch Stein:  Thanks for helping set up the test environment and generate the test data!


Binbin Hu, Hans-Peter Mayr, Mallikarjuna rao Nimmagadda, Lee Graber, Kevin Lam, Jonathan Wanagel, and Scott Woodgate:  Thanks for reviewing and providing great feedback that improved the content!


Disclaimer:  This posting is provided “AS IS” with no warranties, and confers no rights.

Welcome to the BizTalk Product Group Performance Team Blog!

 

Have you ever had a question about the performance of BizTalk that you couldn’t find an answer to?  That is why we, the product team responsible for performance and stress testing BizTalk, are launching this blog!

 

As we work with customers on BizTalk solutions to meet their performance requirements, we gain more and more insight into BizTalk behavior and common customer questions and issues.  We want to share these findings with you on a regular basis.  Come to this blog for answers as we regularly provide information on selected topics that address the most commonly asked customer questions and issues.

 

We Want Your Experiences and Feedback!

 

We want to make sure the information we provide is useful and meets the needs of the community at large, so share with us your performance and capacity related experiences using BizTalk Server 2004 by providing feedback via the blog:

 

  • Suggest blog topics,
  • Tell us what your top issues and pain points are, and
  • Comment directly on the blog entries so we can improve and prioritize.

 

 
