If you’re writing text files in an encoding that supports a Byte-Order Mark (BOM),
you should always try to include it, unless you have a protocol in place that
precludes you from doing so (such as a legacy application that doesn’t know how
to deal with one).

One of the reasons you should always remember the BOM is that many applications can
use it to guess which encoding to use when reading the text you’re feeding them.

Encoding detection based on the BOM is not foolproof, but it’s better than nothing,
particularly in cases where blindly assuming a predefined encoding such as UTF-8 or
UTF-16 isn’t an option.
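BOM-based detection boils down to checking the first few bytes of the data: UTF-8 uses EF BB BF, UTF-16 little-endian uses FF FE, and UTF-16 big-endian uses FE FF. Here’s a minimal sketch in C# of what that looks like (the method and class names are my own; .NET’s StreamReader does something similar internally when you enable BOM detection):

```csharp
using System.Text;

public static class BomSniffer
{
    // Returns the encoding implied by a leading BOM, or null if there isn't one.
    public static Encoding DetectFromBom(byte[] data)
    {
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return Encoding.UTF8;               // UTF-8 BOM: EF BB BF
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Encoding.Unicode;            // UTF-16, little-endian: FF FE
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Encoding.BigEndianUnicode;   // UTF-16, big-endian: FE FF
        // No BOM: the caller has to fall back on a default or other heuristics.
        return null;
    }
}
```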

One particular case where remembering the BOM is very important is with UTF-8. Let
me tell you a story to illustrate why:

We’ve spent the past few weeks testing and improving a BizTalk-based solution that
some other consulting company had created for a client. One particular piece had
been working fine until, suddenly, BizTalk started failing to parse an incoming
message with an error stating that "The character is not valid on the specified
encoding".

Looking at the message, it was supposed to be UTF-8 encoded, and for the most part
it looked OK. The character causing trouble was a 0xA0 character (a non-breaking
space) inside an element value. While that was not good, it wasn’t clear why it was
causing trouble.

Since it was an XML message, we opened it up in Internet Explorer: yep, it too failed
to parse it, getting stuck when it reached the problematic character.

Looking a bit further, we found that in this particular case the original developer
had written a piece of code that created a Stream object with the message contents
and then fed that to BizTalk. The code looked a bit like this:

public static Stream CreateStream(String msg)
{
   MemoryStream stream = new MemoryStream();
   byte[] bytes = Encoding.UTF8.GetBytes(msg);
   stream.Write(bytes, 0, bytes.Length);
   stream.Position = 0;
   return stream;
}

The message text itself was a piece of XML that included an <?xml?> declaration
with the encoding attribute specifying UTF-8. This seemed OK, even if the code above
was a rather awkward way of creating the stream.

However, this gave us a clue: UTF8Encoding.GetBytes() won’t give you
a BOM. Looking at the message in a binary editor, we confirmed that it indeed
had no BOM at all. So we replaced the code above with code that simply
used a StreamReader object (which uses UTF-8 with a BOM by default), and that fixed
the issue right away!
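If you do need to build the stream by hand, the missing BOM can also be written explicitly. Here’s a sketch of a corrected version of the code above (my own reconstruction, not necessarily the fix we shipped), using Encoding.GetPreamble(), which returns the BOM bytes for an encoding (EF BB BF for Encoding.UTF8):

```csharp
using System;
using System.IO;
using System.Text;

public static class MessageStreams
{
    public static Stream CreateStream(string msg)
    {
        MemoryStream stream = new MemoryStream();

        // GetPreamble() returns the BOM for the encoding (EF BB BF for UTF-8);
        // GetBytes() alone never includes it.
        byte[] preamble = Encoding.UTF8.GetPreamble();
        stream.Write(preamble, 0, preamble.Length);

        byte[] bytes = Encoding.UTF8.GetBytes(msg);
        stream.Write(bytes, 0, bytes.Length);

        stream.Position = 0;
        return stream;
    }
}
```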

This highlights why the BOM is so important for UTF-8: the basic characters in the
set share the same byte values as ASCII. This is generally an advantage, but it
can also mean that something that’s incorrectly encoded (such as our example above)
might seem to work fine for a while, until an unexpected character comes along and
everything crumbles. This is unlike other encodings, such as UTF-16, where things
usually blow up right away.
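You can see the ASCII compatibility, and where it ends, in a couple of lines of C#:

```csharp
using System.Text;

// "Hello" encodes to the exact same bytes in ASCII and UTF-8: 48 65 6C 6C 6F.
byte[] ascii = Encoding.ASCII.GetBytes("Hello");
byte[] utf8 = Encoding.UTF8.GetBytes("Hello");

// But U+00A0 (the non-breaking space from our story) takes two bytes in
// UTF-8: C2 A0. A parser guessing the wrong encoding sails through the
// ASCII range and only chokes when bytes like these show up.
byte[] nbsp = Encoding.UTF8.GetBytes("\u00A0");
```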

In this particular case, the culprit was really a combination of factors: the lack
of a BOM together with the presence of the encoding specification in the XML
declaration [1]. I’m not sure why the XML stacks get stuck on a BOM-less UTF-8
file with an encoding declaration, but there you have it. So don’t forget the BOM!

[1] I personally think the encoding specification in the XML declaration is
probably the single most stupid idea included in the XML spec. It’s just downright
evil.

technorati XML, .NET, BizTalk