Every so often I see people asking in the newsgroups how to solve certain challenges
they encounter while working on their BizTalk applications. One common question revolves
around being able to go "back to the past" when an error happens during
processing of a message.

This isn’t a bad question at all, and usually revolves around how to simulate the
behavior of atomic transactions in an environment where transactions can be a lot
more complex and not always as natural.

The question usually goes like this: "I’m receiving a message in BizTalk, which
is triggering an orchestration instance. The orchestration does this and that, and
if any of those things fail, I want to put the message back where I got it from".


The question might seem simple, but it’s not necessarily so. In fact, sometimes
you have to stop a moment and ask yourself whether this really makes sense. There
are several aspects you need to consider:

  1. Handling the case where "this" causes an error is probably not
    a big deal. Handling the case where "this" succeeded but "that"
    failed, however, might not be that simple. Not all actions your orchestration might
    do can be undone.
  2. Most of the time you’ll find that both actions can’t be done as a single unit in a
    single atomic transaction. Fortunately, BizTalk provides very good support for long-running
    transactions and compensation which can help quite a bit.

    Unfortunately, long-running transactions and compensation models are often misunderstood
    (cue the inevitable "How long does a transaction have to last to be a
    long-running transaction?" jokes/questions).

    Here are a few articles that do a great job of describing the BizTalk Transaction
    features and how to use them effectively:

  3. The sentence "put the message back where I got it from" can be either a
    very good thing, or a very problematic thing. It basically relates to leaving stuff
    as you found it; in particular, putting the message back at its origin (thus relating
    to the transactional concept of "nothing happened here, move along") so that
    you can try processing it again later on, hopefully with success that time.
  4. The problem with number 3 is that it (a) isn’t always possible, and (b) isn’t always
    a good idea.

    It might not be possible to put the message back where you got it if someone was pushing
    the message to you instead of you pulling it from somewhere. If you had a SOAP/HTTP
    WebService exposed that received a message from someone else, then you probably can’t
    put the message back where you got it from!

    On the other hand, this is a very common model for queued messaging systems: If you
    run into an error processing the message, you put it back into the queue and try again
    later. This often works great and can simplify error handling a great deal.

    The point where this becomes a problem is when you rely on this as your only error
    handling mechanism. If you blindly send the message back to its origin to retry processing
    for any and all errors and a message comes in that always fails, you’ve got
    yourself a poison message!

    I’ve already talked about Poison Messages in the past, so I won’t comment much more
    on them. But there are other things you can keep in mind to improve the "back to the past"
    error handling technique, particularly if you don’t care about message processing order:

    1. If you can identify and classify the source/cause of the errors, you can make your
      orchestration smarter about how to handle them. For example:
      • Can you distinguish transient error conditions? For example, a timeout connecting
        to the database might be a temporary condition because of a network fluke or a server
        being restarted. Sometimes retrying the operation after a short while is enough to
        deal with this situation effectively.
      • Can you distinguish errors that might require manual intervention to fix? Example:
        Validating an operation fails because some configuration data is missing. This is
        a case where you want to be proactive and raise an appropriate alert so that someone
        can get in there and fix the issue. Extra points if you can tell apart conditions
        that require intervention from a business user and those that require it from a systems
        administrator.

        Notice, however, that in this case putting the message back at the start right after
        creating the notification is not the right thing to do. People don’t react
        that fast. You need to set the message aside until such time as the corrective measure
        has been taken and it is safe to try processing it again.

    2. Can you control when the retry might happen? Can you throttle it if necessary? If
      the answer is no, then you might want to be very careful about using this technique.
      You could easily increase the system load substantially if lots of messages fail in
      a short time and you try reprocessing them in a tight loop.
    3. Be mindful of adapters that provide no ordering semantics. For example, if your original
      location used the FILE adapter and you put the message back in the original folder,
      it will likely get picked up again very soon for processing, which can quickly get
      you back to point 2.

      At least with an adapter like MSMQ you can push the message to the end of the queue,
      which might buy you some time.

    4. Even if you take 1, 2 and 3 into account, you still need to provide a way to deal
      with poison messages. Keep in mind that what started as a transient error condition
      can suddenly escalate to a full-blown problem you can’t do anything about, like when
      that temporary network fluke turns into a days-long outage after some idiot digging
      a hole outside snaps your network fiber cable in two.

      In fact, sometimes you might need to go so far as to completely shut down processing.
      Sometimes being able to detect that some things that should be working keep failing
      after an extended period of time and alerting about it can help get things sorted
      out before they spiral out of control.

      These are just some ideas that might help make your system more reliable and more
      manageable. Some of them do cost money; that is, you have to invest time and development/testing
      efforts in getting them done, and that’s where you’re going to have to evaluate what
      makes sense and what not.
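
To make a few of these ideas a bit more concrete, here’s a minimal sketch in Python (not BizTalk code) of the decision an error handler could make when processing fails: tell transient errors apart from ones that need someone to step in, delay retries with a capped exponential backoff instead of looping tightly, and give up after a number of attempts so a poison message doesn’t circulate forever. Everything in it is an assumption made for illustration: the names (`ErrorKind`, `Action`, `Decision`, `classify`, `decide`), the rule that only timeouts and connection errors count as transient, and the default limits (5 attempts, a 30 second base delay, a 15 minute cap).

```python
from dataclasses import dataclass
from enum import Enum


class ErrorKind(Enum):
    TRANSIENT = "transient"    # e.g. a timeout; worth retrying after a while
    NEEDS_FIX = "needs_fix"    # someone has to correct data or configuration first


class Action(Enum):
    RETRY_LATER = "retry_later"        # put the message back, but only after a delay
    PARK_AND_ALERT = "park_and_alert"  # set the message aside and notify a person
    DEAD_LETTER = "dead_letter"        # treat it as a poison message; stop retrying


@dataclass(frozen=True)
class Decision:
    action: Action
    delay_seconds: float = 0.0


def classify(error: Exception) -> ErrorKind:
    """Hypothetical rule: only timeouts and connection errors are transient."""
    if isinstance(error, (TimeoutError, ConnectionError)):
        return ErrorKind.TRANSIENT
    return ErrorKind.NEEDS_FIX


def decide(error: Exception, attempts_so_far: int,
           max_attempts: int = 5, base_delay: float = 30.0,
           max_delay: float = 900.0) -> Decision:
    """Choose what to do with a message whose processing just failed.

    attempts_so_far counts every attempt made so far, including the failed one.
    """
    # Errors that need manual intervention are never retried automatically.
    if classify(error) is ErrorKind.NEEDS_FIX:
        return Decision(Action.PARK_AND_ALERT)
    # A transient error that keeps coming back is handled as a poison message.
    if attempts_so_far >= max_attempts:
        return Decision(Action.DEAD_LETTER)
    # Throttle retries: double the wait after each failure, up to max_delay.
    delay = min(base_delay * 2 ** (attempts_so_far - 1), max_delay)
    return Decision(Action.RETRY_LATER, delay)
```

With these defaults, `decide(TimeoutError(), 2)` asks for a retry after 60 seconds, while any error that isn’t classified as transient is parked right away until someone fixes the underlying problem. Where the delay and the attempt counter actually live (a message header, a database, a delayed queue) depends on the transport you’re using, and the sketch deliberately leaves that part out.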
