Encoding issue

    • #19837

      We have a flat file schema with the code page set to UTF-8, and the schema contains a string-type field. When an input file contains a British pound sign (£) in that text, the BizTalk schema fails to validate it. We thought UTF-8 should accept the character. Validation works fine the moment we remove the symbol. (Looking at the schema XML file, the processing instruction at the top says the encoding is UTF-16?) Could anyone please advise what is going wrong here? Regards,

    • #19838

      The processing instruction at the top of the schema file defines the encoding of the schema file itself. It has no effect on any instance document.

      These are the rules the FF Disassembler uses to set the encoding on documents:

      When disassembling a flat file instance message, the following algorithm is used to determine and preserve encoding information:

      1. If the “Charset” in the body part is set, use it.
      2. Otherwise, if the envelope (or document) schema specifies a code page, use it.
      3. Otherwise, if a byte order mark is present, use it.
      4. Otherwise, assume UTF-8.
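The precedence in the four rules above can be sketched as a small function. This is only an illustration of the ordering, not the disassembler's actual code: `resolve_encoding`, `detect_bom`, and the encoding names they return are hypothetical, and UTF-32 byte order marks are left out for brevity.

```python
from typing import Optional

# Byte order marks checked by this sketch (hypothetical subset).
_BOMS = [
    (b"\xef\xbb\xbf", "utf-8-sig"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]


def detect_bom(data: bytes) -> Optional[str]:
    """Return an encoding name if the data starts with a known BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def resolve_encoding(
    charset: Optional[str],
    schema_code_page: Optional[str],
    data: bytes,
) -> str:
    """Apply the four rules in order and return the first match."""
    if charset:                # 1. Charset on the body part
        return charset
    if schema_code_page:       # 2. Code page from the schema
        return schema_code_page
    bom = detect_bom(data)     # 3. Byte order mark
    if bom:
        return bom
    return "utf-8"             # 4. Default
```

Note that under this ordering a code page set in the schema takes precedence over anything in the file itself, so a schema set to UTF-8 is applied even when the file was written in a different encoding.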

      I assume the British pound sign is stored as a single byte in the file, not as the multi-byte sequence that UTF-8 requires.
      UTF-8 matches ASCII for the first 128 characters (0x00 to 0x7F), then uses multi-byte sequences for all other characters.

      I would suggest using the Western-European (1252) code page to maintain the British pound sign.
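A quick way to see the mismatch outside BizTalk is to decode the same bytes with both encodings. This is a minimal sketch; the sample bytes are made up for illustration and assume the file stores the pound sign as the single byte 0xA3.

```python
# Hypothetical sample line as it might appear in the input file.
raw = b"Price: \xa3100"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # A lone 0xA3 is a continuation byte with no lead byte before it.
    utf8_ok = False

# The same bytes decode cleanly as Windows-1252.
text = raw.decode("cp1252")

print(utf8_ok)  # False
print(text)     # Price: £100
```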

       

      • #19844

         Hello Greg,
         Just to understand this a bit more clearly:

         The source document is in ASCII and has the pound sign in it.
         This character does not fall within the standard 128-character ASCII range.
         The FF schema is set to UTF-8 in the schema properties.
         So when BizTalk processes the file, it tries to convert the single-byte character into the equivalent
         multi-byte UTF-8 sequence, which turns out to be an invalid string character?
         Kindly assist.
         Regards,

        • #19846

          BizTalk does not try to convert the single-byte character to its UTF-8 equivalent. It simply reads the characters as they exist. The British pound sign ASCII character is invalid according to the UTF-8 spec: a lone 0xA3 byte is not a valid UTF-8 sequence.

          The code page setting in the FF schema should match the actual encoding of the incoming document.

          • #19848

             Hello Greg,
             Thanks a lot for your input.
             However, I still have one point of confusion.

            When you say “The British pound sign ASCII character is invalid according to the UTF-8 spec”,

            would you please elaborate on this, since I am sure the pound symbol can’t be left out of UTF-8?
            I am a newbie, so I would appreciate an explanation.

            Thanks once again.

            • #19851

              In single-byte code pages such as Windows-1252 (often loosely called extended ASCII) the British pound sign is 0xA3. It is not part of 7-bit ASCII.

              In UTF-16 it is 0x00A3

              In UTF-8 it is 0xC2 0xA3

              UTF-8 is an encoding that supports the tens of thousands of Unicode characters. The first 128 characters (0x00 to 0x7F) are the same as ASCII; after that, all characters use 2-, 3- or 4-byte sequences.
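The byte values listed above can be checked with a few lines of Python. This is just a sketch for verification; `utf-16-be` is used so the output has no byte order mark and matches the 0x00A3 notation.

```python
pound = "\u00a3"  # the British pound sign

cp1252_bytes = pound.encode("cp1252")    # single byte
utf16_bytes = pound.encode("utf-16-be")  # two bytes, big-endian, no BOM
utf8_bytes = pound.encode("utf-8")       # two-byte sequence

try:
    pound.encode("ascii")
    in_ascii = True
except UnicodeEncodeError:
    # The pound sign lies outside the 7-bit ASCII range.
    in_ascii = False

print(cp1252_bytes.hex(), utf16_bytes.hex(), utf8_bytes.hex(), in_ascii)
# a3 00a3 c2a3 False
```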

              Check out http://www.unicode.org for more information.
