A few days ago Sam Vanhoutte posted on the BizTalk newsgroups about an issue he was
having while trying to process Unicode encoded messages using the BizTalk Framework
Disassembler. Here’s the tale of what we discovered in the process.

Problem description
The error happened while trying to process a UTF-16LE encoded
XML message using the BizTalk Framework Disassembler component. The message in question
was received with no <?xml?> declaration, and hence no encoding attribute, and
contained no BOM. This caused the operation to fail with the “None of the
components at Disassemble stage can recognize the data” error, suggesting that
the disassembler couldn’t figure out the document encoding.

After looking around a bit using Reflector, I noticed that the BizTalk Framework
Disassembler uses the XML Disassembler (XmlDasm) underneath. Because of this, I suggested
that Sam try my FixEncoding pipeline component in the decode stage of his pipeline
to set the message’s body part Charset to the correct encoding (UTF-16 Little Endian,
code page 1200). It worked, almost. Now the document was being recognized by the disassembler,
but the disassembly itself failed with the following error:

System.Xml.XmlException : Name cannot begin with the ‘.’
character, hexadecimal value 0x00. Line 1, position 2.
 at Microsoft.BizTalk.Component.NamespaceTranslatorStream.Read(Byte[] buffer,
Int32 offset, Int32 count)
 at Microsoft.BizTalk.Streaming.MarkableForwardOnlyEventingReadStream.ReadInternal(Byte[]
buffer, Int32 offset, Int32 count)
 at Microsoft.BizTalk.Streaming.EventingReadStream.Read(Byte[] buffer, Int32
offset, Int32 count)
 at System.IO.StreamReader.ReadBuffer(Char[] userBuffer, Int32 userOffset, Int32
desiredChars, Boolean& readToUserBuffer)
 at System.IO.StreamReader.Read(Char[] buffer, Int32 index, Int32 count)
 at System.Xml.XmlTextReaderImpl.ReadData()
 at System.Xml.XmlTextReaderImpl.InitTextReaderInput(String baseUriStr, TextReader
input)
 at System.Xml.XmlTextReaderImpl..ctor(String url, TextReader input, XmlNameTable
nt)
 at System.Xml.XmlTextReader..ctor(TextReader input)
 at Microsoft.BizTalk.Streaming.Utils.GetDocType(MarkableForwardOnlyEventingReadStream
stm, Encoding encoding)
 at Microsoft.BizTalk.Component.XmlDasmReader.CreateReader(IPipelineContext pipelineContext,
IBaseMessageContext messageContext, MarkableForwardOnlyEventingReadStream data, Encoding
encoding, Boolean saveEnvelopes, Boolean allowUnrecognizedMessage, Boolean validateDocument,
SchemaList envelopeSpecNames, SchemaList documentSpecNames, IFFDocumentSpec docSpecType,
SuspendCurrentMessageFunction documentScanner)
 at Microsoft.BizTalk.Component.XmlDasmComp.Disassemble2(IPipelineContext pc,
IBaseMessage inMsg)
 at Microsoft.BizTalk.Component.XmlDasmComp.Disassemble(IPipelineContext pc,
IBaseMessage inMsg)
 at Microsoft.BizTalk.Component.BtfDasmComp.DoLoad(IPipelineContext pc, IBaseMessage
inMsg)
 at Microsoft.BizTalk.Component.BtfDasmStateLoad.LoadMessage(IBtfDasmAction act,
IPipelineContext pc, IBaseMessage inMsg)
 at Microsoft.BizTalk.Component.BtfDasmComp.Disassemble2(IPipelineContext pc,
IBaseMessage inMsg)
 at Microsoft.BizTalk.Component.BtfDasmComp.Disassemble(IPipelineContext pc,
IBaseMessage inMsg)
 at Microsoft.Test.BizTalk.PipelineObjects.Stage.Execute(IPipelineContext pipelineContext,
IBaseMessage inputMessage)


This was a clear sign that the disassembler was somehow trying to interpret the document
using the wrong encoding, even though we were clearly specifying the correct Charset.
At this point, I asked Sam to pass on the problematic file to see what I could find
out.
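
The “0x00 at position 2” part of the message is what you would expect when UTF-16LE
bytes are read as UTF-8: every ASCII character is followed by a zero byte, which a UTF-8
decoder turns into a U+0000 character. Here’s a minimal sketch of that effect; the sample
document and the class and method names are made up for illustration and have nothing to
do with the BizTalk code itself:

```csharp
using System;
using System.Text;

public static class WrongEncodingDemo
{
    // Encodes the text as UTF-16LE (no BOM) and then decodes those bytes as UTF-8,
    // mimicking a reader that assumes the wrong encoding.
    public static string DecodeUtf16BytesAsUtf8(string text)
    {
        byte[] bytes = Encoding.Unicode.GetBytes(text); // GetBytes does not emit a BOM
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        string misread = DecodeUtf16BytesAsUtf8("<Order/>");
        Console.WriteLine(misread[0]);      // prints "<"
        Console.WriteLine((int)misread[1]); // prints "0": the character a parser sees as the element name
    }
}
```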

The Real Problem

After a lot of digging, I think I’ve discovered what seems to be a bug in the way
the BtfDasmComp component works. It seems that it doesn’t correctly decode documents
encoded with anything other than UTF-8, unless the .NET Framework’s XmlTextReader
can figure out the document encoding on its own. None of the requirements for doing
so were met by the problematic document, so apparently the disassembler was
defaulting to interpreting the document as UTF-8, which caused the error.

The question was then why this was happening, when we were specifying the correct
Charset for the document, and it was pretty obvious that made a difference, since
probing was succeeding. Why was the correct encoding being used while probing but
not while disassembling?

A guess

After spending a couple more hours going through the BizTalk Framework disassembler,
I can venture an educated guess as to why the wrong encoding is being used.

The first thing I noticed was that the BtfDasmComp component, just like the XmlDasmComp
component, explicitly looks at the part’s Charset property (IBaseMessagePart.Charset)
both before probing and before disassembling the document. So up to here, everything
was just fine.

However, during disassembly, control eventually lands in the BtfDasmComp.DoLoad()
method, where the body part data stream of the message is replaced with an instance
of the BTFDasmTranslatorStream class:

stream1 = new BTFDasmTranslatorStream(pc, stream1, "http://schemas.xmlsoap.org/soap/envelope/",
    "http://schemas.biztalk.org/btf-2-0/envelope",
    encoding1);
inMsg.BodyPart.Data = stream1;

Up to here, encoding1 correctly has the encoding created from the value of the part’s
Charset property. While the code correctly passes an encoding to the new stream, I
spotted that the BTFDasmTranslatorStream class is derived from the NamespaceTranslatorStream
class.

In one of the constructors for the NamespaceTranslatorStream class, a new XmlTextReader
instance is created to process the document, but no encoding is specified for it, which
leaves the reader to figure out on its own what encoding the message stream uses.
This makes no sense, because by this point the disassembler knows exactly what encoding
to use. Here’s the relevant code:

public NamespaceTranslatorStream(IPipelineContext pipelineContext, Stream data, string
oldNamespace, string newNamespace, Encoding encoding) : base(new XmlTextReader(data),
encoding)

You can see that while the specified encoding is passed on to the base class (XmlBufferedStreamReader),
it is not used in the creation of the XmlTextReader itself. Of course, the encoding
cannot be provided directly to it because the XmlTextReader class doesn’t have
a constructor that takes an Encoding argument (which I think it should, really),
so you need to create a StreamReader object with the correct encoding and
construct the XmlTextReader on top of that instead.
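
To make that last point concrete, here’s a rough sketch of layering an XmlTextReader
over a StreamReader that has been given the encoding explicitly. This is not the BizTalk
code, just an illustration of the technique; the class name, the helper methods and the
sample document are all hypothetical:

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class ExplicitEncodingReaderDemo
{
    // Wraps the raw stream in a StreamReader that decodes with the given encoding,
    // so the XmlTextReader receives characters and never has to guess the encoding.
    public static XmlTextReader CreateReader(Stream data, Encoding encoding)
    {
        // false = do not look for a byte order mark; trust the supplied encoding.
        StreamReader reader = new StreamReader(data, encoding, false);
        return new XmlTextReader(reader);
    }

    // Returns the local name of the document's root element.
    public static string ReadRootName(Stream data, Encoding encoding)
    {
        XmlTextReader xml = CreateReader(data, encoding);
        try
        {
            xml.MoveToContent();
            return xml.LocalName;
        }
        finally
        {
            xml.Close();
        }
    }

    public static void Main()
    {
        // UTF-16LE bytes with no BOM and no <?xml?> declaration, like the problem message.
        byte[] bytes = new UnicodeEncoding(false, false).GetBytes("<order><id>42</id></order>");
        Console.WriteLine(ReadRootName(new MemoryStream(bytes), Encoding.Unicode)); // prints "order"
    }
}
```

Because the XmlTextReader is handed a TextReader here, decoding is entirely the
StreamReader’s job, which is exactly what you want when the encoding is already known.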


Workaround

With this, it became clear that getting the messages to process correctly was not going
to be possible by simply selecting the proper Charset. Instead, Sam was able to work
around the problem by creating a custom pipeline component that actually
transcodes the message from UTF-16 to UTF-8, and running it in the decode
stage, before the disassembler.
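
I don’t have Sam’s component to show here, but the core of such a transcoder might look
something like the sketch below. It only covers the stream conversion, not the pipeline
component plumbing; the class and method names are hypothetical, and it assumes the
message is small enough to buffer in memory (a real component would probably want to
transcode on the fly as the stream is read):

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf16ToUtf8Transcoder
{
    // Reads the source stream using sourceEncoding and returns a new in-memory
    // stream holding the same text encoded as UTF-8 (without a BOM).
    public static Stream Transcode(Stream source, Encoding sourceEncoding)
    {
        MemoryStream target = new MemoryStream();
        StreamReader reader = new StreamReader(source, sourceEncoding, false);
        StreamWriter writer = new StreamWriter(target, new UTF8Encoding(false));

        char[] buffer = new char[4096];
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            writer.Write(buffer, 0, read);
        }

        // Flush rather than dispose the writer, since disposing it would close the target.
        writer.Flush();
        target.Position = 0;
        return target;
    }

    public static void Main()
    {
        byte[] utf16 = Encoding.Unicode.GetBytes("<order><id>42</id></order>");
        Stream utf8 = Transcode(new MemoryStream(utf16), Encoding.Unicode);
        Console.WriteLine(utf8.Length); // prints "26": one byte per character instead of two
    }
}
```

In a pipeline component, the returned stream would replace the body part’s data, and
the part’s Charset would presumably need to be updated to match the new encoding.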