I’m currently working on an EDI grammar, and I came across some unexpected behavior when tokenizing the input.

*UNA:+.? ' UNB+UNOC:3+123456789:ZZ+987654321:ZZ+090804:0758+491944' UNH+464009+APERAK:D:07B:UN:2.0b' BGM+313+464009' DTM+137:200908040758:203' RFF+ACE:100048193285' DTM+171:200908040606:203' NAD+MS+123456789::ZZ' NAD+MR+987654321::ZZ' ERC+Z06' FTX+ABO+++9904383000003' RFF+ACE:100048193285' UNT+11+464009' UNZ+1+491944'*

The above sample is an APERAK message. I won’t go into any details about the structure other than that there are a number of *segments* such as UNH, BGM and DTM. Each segment is terminated by an apostrophe (“'”). Every segment has *elements* separated by “+”, which in turn can have a number of component data elements separated by “:”. Some of the elements are optional and some are mandatory.
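To make the segment / element / component nesting concrete, here is a minimal Python sketch that splits a message on the three separators described above. It is only an illustration, not part of the grammar: the function name `parse_edifact` is made up for this example, the release character (“?”) is ignored, and the UNA header is left out of the sample because its fixed layout would be mis-split by this naive approach.

```python
def parse_edifact(message, segment_sep="'", element_sep="+", component_sep=":"):
    """Split an EDIFACT-style message into (tag, elements) pairs.

    Naive sketch: does not honour the release character ("?") and
    does not treat the UNA service string specially.
    """
    segments = []
    for raw in message.split(segment_sep):
        raw = raw.strip()
        if not raw:
            continue  # skip the empty piece after the last terminator
        # The first "+"-separated piece is the segment tag (UNH, BGM, ...).
        tag, *elements = raw.split(element_sep)
        # Each element is a list of components; omitted ones show up as "".
        segments.append((tag, [e.split(component_sep) for e in elements]))
    return segments


sample = (
    "UNH+464009+APERAK:D:07B:UN:2.0b' BGM+313+464009' "
    "NAD+MS+123456789::ZZ' FTX+ABO+++9904383000003'"
)

for tag, elements in parse_edifact(sample):
    print(tag, elements)
```

Note how the optional parts show up as empty strings, for example the two empty elements in the FTX segment and the empty middle component in NAD. Those gaps are exactly what the grammar has to cope with.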

My problem occurs when elements are optional. Have a look at the sample grammar below:

syntax Main = a:A? del? b:B? del? c:C? => {A=>a, B=>b, C=>c};

token del = ",";

token A = ("A".."Z" | "a".."z" | "0".."9")+;

token B = ("A".."Z" | "a".."z" | "0".."9")+;

token C = ("A".."Z" | "a".."z" | "0".."9")+;

The syntax above states that there are three tokens (*A, B* and *C*), and they are all optional.

Given the input *a,b,c* the output will be:

{

A => "a",

B => "b",

C => "c"

}

However, given only the input *a,b* the output is:

{

A => null,

B => "a",

C => "b"

}

This was somewhat unexpected to me. I would have expected the tokens to be filled from the left, leaving the “C” element empty. To solve this, you need to spell out the possible combinations as explicit alternatives in the syntax:

syntax Main = a:A del b:B del c:C? => {A=>a, B=>b, C=>c}

| a:A del? b:B? => {A=>a, B=>b}

| a:A => {A=>a};

For the input *a,b*, this gives the following output:

{

A => "a",

B => "b"

}
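One way to look at the original result is that the all-optional rule is ambiguous: since A, B and C match exactly the same characters, more than one assignment fits the input *a,b*. The Python sketch below simply enumerates every way of using or skipping the five optional slots of `A? del? B? del? C?` and keeps those that match the input. It says nothing about how the actual parser picks among them; the helper names (`tokenize`, `derivations`, `leftmost`) are made up for this illustration.

```python
from itertools import product

# The original rule as an ordered list of (name, token kind) slots.
# Every slot is optional, mirroring "A? del? B? del? C?".
SLOTS = [("A", "id"), ("del", "del"), ("B", "id"), ("del", "del"), ("C", "id")]


def tokenize(text):
    """Turn "a,b" into [("id", "a"), ("del", ","), ("id", "b")]."""
    tokens = []
    for i, part in enumerate(text.strip().split(",")):
        if i > 0:
            tokens.append(("del", ","))
        if part.strip():
            tokens.append(("id", part.strip()))
    return tokens


def derivations(text):
    """Return every distinct {A, B, C} assignment the rule allows."""
    tokens = tokenize(text)
    results = []
    # Try each combination of "slot used" / "slot skipped".
    for used in product([True, False], repeat=len(SLOTS)):
        chosen = [slot for slot, u in zip(SLOTS, used) if u]
        # The kinds of the used slots must match the token kinds exactly.
        if [kind for _, kind in chosen] != [kind for kind, _ in tokens]:
            continue
        assignment = {"A": None, "B": None, "C": None}
        for (name, _), (_, value) in zip(chosen, tokens):
            if name != "del":
                assignment[name] = value
        if assignment not in results:
            results.append(assignment)
    return results


def leftmost(text):
    """Pick the derivation that fills slots from the left (A before B before C)."""
    return min(derivations(text), key=lambda d: [v is None for v in d.values()])


for d in derivations("a,b"):
    print(d)
print("left-filled:", leftmost("a,b"))
```

For *a,b,c* there is only one derivation, which is why that input behaves as expected. For *a,b* there are three, and the `B => "a", C => "b"` result is one of them. Spelling out the alternatives in the syntax removes that choice by making the left-filled reading the only one that matches.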