Encoding issue

This topic has 6 replies, 1 voice, and was last updated 9 years, 2 months ago by community-content.

- June 2, 2008 at 11:46 PM #19837
  community-content
  We have a flat file schema with the code page set to UTF-8, and the schema contains a field of type string. When an input file contains a British pound sign (£) in the text, BizTalk fails to validate it against the schema, even though we expected UTF-8 to accept the character. Validation succeeds as soon as we remove the symbol. (Looking at the schema XML file, the processing instruction at the top says the encoding is UTF-16.) Could anyone please advise what is going wrong here? Regards,
- June 3, 2008 at 1:29 AM #19838
  community-content
  The processing instruction at the top of the schema file defines the encoding of the schema file itself. It has no effect on any instance document.
  
  These are the rules the FF Disassembler uses to set the encoding on documents:
  
  When disassembling a flat file instance message, the following algorithm is used to determine and preserve encoding information:
  1. If the “Charset” in the body part is set, use it.
  2. Otherwise, if the envelope (or document) schema specifies a code page, use it.
  3. Otherwise, if a byte order mark is present, use it.
  4. Otherwise, assume UTF-8.
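The precedence above can be sketched roughly as follows. This is only an illustration of the ordering, not BizTalk's actual code: the function name `pick_encoding`, its parameters, and the small BOM table (UTF-8 and UTF-16 only) are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical BOM table for illustration; it only covers UTF-8 and UTF-16.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]


def pick_encoding(
    data: bytes,
    charset: Optional[str] = None,           # stands in for the body part "Charset"
    schema_code_page: Optional[str] = None,  # stands in for the schema code page
) -> str:
    """Return the encoding name chosen by the four-step precedence."""
    if charset:                      # 1. explicit Charset wins
        return charset
    if schema_code_page:             # 2. then the schema's code page
        return schema_code_page
    for bom, name in BOMS:           # 3. then a byte order mark, if any
        if data.startswith(bom):
            return name
    return "utf-8"                   # 4. otherwise assume UTF-8
```

Note that with this ordering a code page set in the schema takes priority over a byte order mark in the file, so a wrong schema setting is not corrected by the file's own BOM.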
  I assume the British pound sign is stored in the file as a single byte and not as the multi-byte sequence required by UTF-8.
  UTF-8 matches ASCII for the first 128 code points (0x00 to 0x7F), then uses multi-byte sequences of two to four bytes for all other characters.
  
  I would suggest using the Western-European (1252) code page to maintain the British pound sign.
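A quick way to see the mismatch outside BizTalk is to decode the same bytes both ways. This is a minimal Python sketch; the sample bytes are made up for illustration and assume the file stores the pound sign as the single byte 0xA3.

```python
# Made-up sample line: the pound sign is stored as the single byte 0xA3.
raw = b"Total: \xa3100"

# Reading the bytes as UTF-8 fails, because 0xA3 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reading the same bytes as Windows-1252 succeeds and yields the pound sign.
text = raw.decode("cp1252")

# Re-encoding the decoded text as UTF-8 produces the two-byte form 0xC2 0xA3.
utf8_bytes = text.encode("utf-8")

print(utf8_ok, text, utf8_bytes)
```

If the first decode fails and the second succeeds for a real input file, that is a strong hint the file is in a single-byte code page and not UTF-8.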
  - June 4, 2008 at 1:15 AM #19844
    community-content
    Hello Greg,
              just to understand this a bit more clearly.
    
              The source document is in ASCII and has the pound sign in it.
              This character does not fall within the standard 7-bit ASCII range (0 to 127).
              The code page is set to UTF-8 in the FF schema properties.
              So when BizTalk processes the file, does it try to convert the single-byte character into the equivalent
              multi-byte UTF-8 sequence, which then turns out to be an invalid string character?
              Kindly assist.
    Regards,
    - June 4, 2008 at 4:22 AM #19846
      community-content
      
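      BizTalk does not try to convert the single-byte character to its UTF-8 equivalent. It simply reads the bytes as they exist. The British pound sign ASCII character is invalid according to the UTF-8 spec. That is, the single byte 0xA3 on its own is not a valid UTF-8 sequence, so the text cannot be read as UTF-8.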
      
      The code page setting in the FF schema should match the actual encoding of the incoming document.
      - June 4, 2008 at 8:12 AM #19848
        
        community-content
        
        Hello Greg,
                     thanks a lot for your input.
                     However, I still have one point of confusion.
        
        When you say “The British pound sign ASCII character is invalid according to the UTF-8 spec”,
        
        would you please elaborate? I am sure the pound symbol can’t be left out of UTF-8.
        I am a newbie, so I would appreciate an explanation.
        
        thanks once again
        
        June 4, 2008 at 1:31 PM #19851
        
        community-content
        
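        In the single-byte Western European code pages (Windows-1252, ISO-8859-1), often loosely called extended ASCII, the British pound sign is 0xA3. Strict 7-bit ASCII stops at 0x7F and has no pound sign.
        
        In UTF-16 it is 0x00A3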
        
        In UTF-8 it is 0xC2 0xA3
        
        UTF-8 is an encoding that supports the full range of Unicode characters. The first 128 code points (0x00 to 0x7F) are the same as ASCII; after that, every character is represented by a 2-, 3- or 4-byte sequence.
        
        check out http://www.unicode.org for more information
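The byte values quoted above can be checked with a few lines of Python. This is just a sketch using Python's built-in codecs; the codec names (`cp1252`, `latin-1`, `utf-16-be`) are Python's names, not BizTalk settings. Note that strict 7-bit ASCII has no pound sign at all, so the 0xA3 value is the Windows-1252 / Latin-1 one.

```python
pound = "\u00a3"  # the British pound sign, U+00A3

# Encode the same character with several encodings and collect the raw bytes.
encodings = ["cp1252", "latin-1", "utf-16-be", "utf-8"]
encoded = {name: pound.encode(name) for name in encodings}

# Print each result as space-separated hex bytes.
for name, data in encoded.items():
    print(name, " ".join(f"{b:02x}" for b in data))

# Strict ASCII cannot represent the character at all.
try:
    pound.encode("ascii")
    in_ascii = True
except UnicodeEncodeError:
    in_ascii = False

print("encodable as ASCII:", in_ascii)
```

The single-byte code pages give one byte (a3), UTF-16 big-endian gives two (00 a3), and UTF-8 gives two (c2 a3), which is why a file containing a bare a3 byte does not read as UTF-8.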
        
        June 4, 2008 at 10:26 PM #19854
        
        community-content
        
        Thank You Very Much, Greg 🙂

The forum ‘BizTalk 2004 – BizTalk 2010’ is closed to new topics and replies.
