It's a common scenario: there is a sudden surge in your business and the number of transactions increases drastically for a period of time. Modern business solutions are not standalone applications. In a SOA/BPM scenario there will be "n" layers (applications) working together to form a business solution. So, during a sudden surge like this, if a single layer (application) malfunctions, it's going to put the whole end-to-end process in trouble.
We were hit with a similar situation recently.
Quick overview of our overall solution:
Our top-level, oversimplified architecture looks like this
In the above picture, the BPM workflow solution and the Business Service (a composite service calling multiple underlying integration services) were built with Microsoft BizTalk Server. In our case the main problem areas are the database and concurrency. Business users basically perform 2 sets of operations.
1. Validate Applications, and
2. Commit Applications
They scan applications and feed them into the system; the system runs the full set of validation rules and gives the result back. If there are errors, the business users correct them and feed the applications back into the system, and they cycle through this loop until all the errors are resolved. Finally, after resolving all the errors, they perform the commit operation. Due to the complexity of the solution, if things go wrong during commit, it ends up as a manual (time-consuming!!) operation to complete the job. It's more efficient if we reduce the amount of manual processing.
During validate cycles the calls to the database are purely read-only, and even at very high volume levels there are no serious issues (also because a lot of information is cached in memory). But the commit calls persist information into the database, and some of the hot tables have grown to a few million records and can't deal with simultaneous requests. The biggest problem is that if the database gets into trouble, it affects everybody: as shown in the picture, there are external applications, like web sites, that rely on the availability of the database. They get affected as well.
Overview of the Business Services architecture
The picture below shows the architecture of our Business Services layer. The responsibility of the business services is to create a usable, abstract, composite business services layer for the end consumers, so end consumers don't need to understand the interaction logic with the underlying systems. It takes care of various things like the order (sequencing) in which the underlying integration services are called, error handling across the various calls, suppressing duplicate errors returned from the various calls, providing a single unified contract to interact with, etc.
Now, getting to the technical bits.
The sub-orchestrations are designed that way mostly for modularity reasons and are called using the "Call Orchestration" shape from the two main orchestrations, "Validate" and "Commit". All the orchestrations are hosted in a single BizTalk host, with one host instance per server (we have 6 BizTalk servers; they host other BizTalk applications as well).
The important thing to note here is that all the orchestrations need to live within a single host due to their tight coupling.
This is how it's been tightly coupled:
1. The called sub-orchestrations need to live in the same host as the calling orchestrations ("Validate" and "Commit"), since you can't specify a host for sub-orchestrations (an orchestration started using a Call Orchestration shape runs under the same host/host instance as its calling orchestration), and
2. Both the top-level orchestrations share the sub-orchestrations.
So the only option is to configure and run all the orchestrations in a single host. The other interesting factor here is that the send ports are also not aware of the "Validate"/"Commit" distinction. There is only one send port per endpoint, which handles both validate and commit calls. This again is a constraint created by the way the sub-orchestrations are designed: there is only one logical port, mapped to one physical port, for both the "Validate" and "Commit" operations.
This is not a problem in the majority of cases. But as I mentioned earlier in this post, in our case multiple concurrent "validate" calls are absolutely fine; you can thrash the system with such calls (thanks to improved in-memory cache stores). But when it comes to "commit", the underlying database seriously suffers under concurrent calls. This just shows the importance of understanding the limitations of the underlying systems, and of the non-functional SLA requirements, when designing a SOA/BPM system that relies on diverse applications across the organisation. There is nothing wrong with this application: it's designed for easy maintenance, not with operational constraints in mind.
I can think of two different approaches which would have given us the opportunity to separate the "Validate" and "Commit" operations and set different throttling conditions.
1. Not reusing the sub-orchestrations, in which case "Validate" and "Commit" could live in two different hosts with completely isolated throttling settings.
2. Not reusing the same send port for both "Validate" and "Commit", which would have given us the opportunity to control it at the send port level by configuring different hosts, with different throttling settings, for the send ports.
BizTalk Server is designed for high throughput, and it's going to push messages to the underlying systems (and databases) as quickly as it can. It does provide throttling: it detects when underlying systems are suffering and will try to slow down. But the throttling kicks in too late for us; by then the system has already gone into an unrecoverable state.
Production environment Constraints
First of all, there are some very restrictive conditions for us to work with:
1. The environment is frozen. The next few days are our core business days of the year and we are not allowed to make changes (serious financial implications!!).
2. Time is against us. We need a working solution ASAP.
3. We don't want to slow down the validate process, as this would keep the business users idle. In fact, a lot of temporary business users have been hired for this period.
Understanding the pattern and requirement:
First we identified the pattern under which the underlying database suffers. We roughly need to process 700 applications per hour. The database is very happy, and capable of processing this volume at only 60% resource utilisation, if we feed the applications in at a consistent rate (roughly 1 application every 5 seconds). But if we accidentally receive 100 applications concurrently, it brings the database to a standstill and blocks all further calls. Resource utilisation goes to nearly 100% and it takes a while for the database to recover. By that time, hundreds of applications have already failed and been routed to manual processing.
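The pacing above can be sketched as a fixed-rate release loop. This is a hypothetical Python illustration (not the actual tool); the `release_fn` and `sleep_fn` names are stand-ins so the pacing logic stays independent of the actual submission mechanism:

```python
# Pacing maths from the paragraph above: 700 applications per hour
# works out to roughly one release every 5 seconds.
APPS_PER_HOUR = 700
INTERVAL_SECONDS = 3600 / APPS_PER_HOUR  # ~5.14 seconds between releases


def release_at_fixed_rate(pending, release_fn, sleep_fn):
    """Release items one at a time, pausing between each, so the
    downstream database sees a steady trickle instead of a burst.

    `release_fn` submits one application; `sleep_fn` is e.g. time.sleep.
    """
    for app in pending:
        release_fn(app)
        sleep_fn(INTERVAL_SECONDS)
```

Feeding the same 100-application burst through a loop like this spreads it over about eight and a half minutes instead of hitting the database all at once.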
Secondly, we need the ability to stop and start the processing of commit requests in a very controlled way (as soon as we start seeing issues on the database). The issues might be triggered by other external applications using the same database; for example, some parts of the external web sites hit the database via the integration services.
So, it all comes down to:
1. Parking all the "Commit" (only Commit) applications on the Business Services side in the BizTalk MessageBox database (creating a queuing effect; the validate applications should flow through without any hold), and
2. Releasing the "Commit" applications in a controlled way to the downstream integration services (and into the database), with the ability to start/stop, increase/decrease volume, and submit manually, all at runtime and with minimal risk.
Finally, the solution:
This is the solution I came up with:
1. Stop the top-level "Commit" orchestration via the BizTalk Admin Console. This solves our first issue of parking the commit applications in the BizTalk Business Services layer. All the commit orchestration instances get suspended in a resumable state and reside in the BizTalk MessageBox.
2. Use BizTalk's ability to resume the suspended instances. The key thing to note here is that you don't need to start your orchestration to resume suspended (resumable) instances. You only need to make sure the orchestration is in the enlisted state.
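For reference, the suspended instances are exposed through the MSBTS_ServiceInstance WMI class in the root\MicrosoftBizTalkServer namespace. A sketch of the kind of WQL query involved, in Python for illustration; note the ServiceStatus = 4 filter for "Suspended (resumable)" is my assumption here and should be verified against the WMI class documentation:

```python
# Sketch of the WQL used to find suspended (resumable) instances of a
# single orchestration via the MSBTS_ServiceInstance WMI class.
# Assumption: ServiceStatus = 4 means "Suspended (resumable)".
SUSPENDED_RESUMABLE = 4


def build_suspended_query(orchestration_name):
    """Return the WQL for listing suspended-resumable instances of the
    given orchestration (the name is embedded as a quoted literal)."""
    return (
        "SELECT * FROM MSBTS_ServiceInstance "
        f"WHERE ServiceStatus = {SUSPENDED_RESUMABLE} "
        f"AND ServiceName = '{orchestration_name}'"
    )
```

On a live BizTalk server, a query like this would be executed against the root\MicrosoftBizTalkServer namespace and Resume() invoked on each returned instance.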
This solves the issue of submitting commit applications manually in a controlled way. But there are quite a few challenges here:
1. It's a daunting task to sit and do this job manually for hours.
2. There is a high possibility of human error; note in the above picture that you are only one click away from terminating the instances.
3. There is a possibility of one more human error: you might resume a completely wrong instance in the wrong tab.
Bear in mind, it's a production environment.
Tool to resume Suspended (resumable) instances using WMI
All I had to do now was automate the manual task of resuming suspended orchestration instances in a controlled way. I knocked this tool up in a couple of hours using BizTalk Server's WMI support for interacting with the BizTalk environment.
It's a very simple tool; this is how it works:
1. You select the required orchestration and press refresh. It shows how many instances are waiting in the suspended-resumable state (for that particular orchestration).
2. Then you configure how many instances to submit each time (e.g. Number of Instances to submit = 1).
3. Then you can decide whether to submit them manually (by selecting the "Manual" option and clicking "Resume Instances"), or
4. You can configure it to submit at periodic intervals, e.g. 1 every 3 seconds (by selecting the "Timed" option, configuring the time in milliseconds and clicking "Resume Instances").
5. Once you start a timed submission, instances are submitted at periodic intervals and you have the ability to stop them at any time by clicking "Stop Timer".
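The timed-submission behaviour in steps 4 and 5 boils down to a timer loop with a stop flag. A minimal hypothetical sketch in Python, using a `threading.Event` as the "Stop Timer" button and a `resume_one` stand-in for the WMI resume call:

```python
import threading


def timed_resume(resume_one, batch_size, interval_ms, stop_event):
    """Resume `batch_size` instances every `interval_ms` milliseconds
    until `stop_event` is set (the "Stop Timer" button)."""
    while not stop_event.is_set():
        for _ in range(batch_size):
            resume_one()  # one WMI Resume() call in the real tool
        # wait() doubles as the pause and the stop check: it returns
        # True immediately once the stop button has been pressed.
        if stop_event.wait(interval_ms / 1000.0):
            break
```

In practice the loop would run on a background thread, with the UI thread setting the event to stop it cleanly.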
Certainly I'm not recommending this as a long-term solution, but it helped us get out of the situation. This solution requires constant human monitoring, which is OK for our peak window (a few days), but is not sustainable throughout the year. But I believe something similar needs to be built into the product to support scenarios like this, which look very common for businesses.
Difficulties in understanding BizTalk Throttling mechanism:
Anyone who has worked with BizTalk long enough will agree: throttling the BizTalk environment is not for the faint-hearted. Simply because of the sheer number of tuning parameters BizTalk exposes to end users, and because not every setting is clear to everyone. Here are a few of the parameters you can tune:
1. Registry settings to alter the min/max IO and worker threads for host instances (you need to do this on each machine).
2. Host throttling settings for each host, which affect all the host instances of that host. You will see a few thread-related options here.
3. The adm_ServiceClass table in the BizTalk management database, where you configure the polling interval and threads for EPM, XLANG, etc., which affects the whole environment.
4. You also have individual adapter settings, which have an effect to a certain extent. Example: the polling interval in the SQL and MQSeries adapters.
After talking to many people, it's very clear to me: they simply try various combinations and finally settle on one that works for their requirement. Most of the time they don't really know the implications, which is bad.
This complexity also imposes a constraint: it's just not practical to tune these settings in production if your business requirements change every day. Example: in our case, we know the volume we are going to process each day at the beginning of the day (when all the applications are received in the post), so we need to set the system up for that daily volume. If it's 500 applications, we don't need to do anything; the system will automatically cope. If it's 3000 applications, then we need to modify certain settings. BizTalk Server is a self-tuning engine, and it caters automatically for varying situations the majority of the time. But the problem is that BizTalk can't be aware of the abilities of all its underlying systems!!
Throttling based on system resources or functional requirements?
BizTalk has a very good throttling mechanism (I'm only saying it's a bit difficult to understand 🙂), but the throttling is based purely on the system resources of the BizTalk environment (e.g. CPU utilisation, memory utilisation, thread allocation, DB connections, spool depth, etc.). We need to keep in mind that BizTalk is a middleware product, and it depends heavily on the performance and capacity of the external systems to work efficiently.
BizTalk of course provides message publishing throttling, so it slows down automatically when it sees the underlying system suffering and its host queue length building up. But that's a system decision based on parameters, and in some scenarios (like ours) it's too late by the time the throttling starts.
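What we really wanted was a gate driven by the health of the downstream system rather than BizTalk's own counters. A hypothetical sketch of that idea; `utilisation_fn` is a stand-in for however you sample database resource utilisation, and the 0.60 threshold is the comfortable level we measured for our database, not a general rule:

```python
# Release queued work only while the downstream database looks healthy.
# HEALTHY_UTILISATION reflects our measured 60% comfort level.
HEALTHY_UTILISATION = 0.60


def release_while_healthy(pending, release_fn, utilisation_fn):
    """Release queued items one by one, stopping as soon as downstream
    utilisation (0.0-1.0) crosses the threshold. Returns the items
    that were held back."""
    remaining = list(pending)
    while remaining and utilisation_fn() <= HEALTHY_UTILISATION:
        release_fn(remaining.pop(0))
    return remaining
```

The held-back items stay parked (in our case, as suspended-resumable instances in the MessageBox) until the database recovers.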
While researching a solution to this problem, I noticed a lot of people in the same boat. They were trying to slow the BizTalk environment down to match the underlying systems' abilities in various ways: building a controller orchestration, adjusting the MaxWorkerThreads setting to control the number of running orchestrations, enabling ordered delivery on the send port, etc.
Suggestion to BizTalk Server Product Team:
At the moment, with the current version of BizTalk (2009 R2 or 2010!!), there is no separate message processing throttling mechanism to distinguish orchestrations (business processes) from other subscribers (e.g. send ports). In some scenarios business processes may be driven by various external factors and need to be controlled in a much more fine-grained way. Also, as explained in this post, there may be a requirement to change the throttling behaviour in production, at runtime, quite frequently and for varying reasons. So here I'm going to put forward my fictitious idea 🙂
On the Orchestration Properties window, add a new tab called "Controlled", which might look like this:
All the instances waiting to be processed could have a status along the lines of "Ready to Run (Controlled)". Administrators would still have the ability to terminate the instances if not required.
The context menu could also be extended a bit, so administrators can easily start/stop processing in either a controlled or a normal way.
In this approach there would be no need to restart host instances; the configuration changes should take effect on the fly.