I’m currently working on an EDI grammar, and I came across some unexpected behavior when tokenizing the input.

UNA:+.? '
UNB+UNOC:3+123456789:ZZ+987654321:ZZ+090804:0758+491944'
UNH+464009+APERAK:D:07B:UN:2.0b'
BGM+313+464009'
DTM+137:200908040758:203'
RFF+ACE:100048193285'
DTM+171:200908040606:203'
NAD+MS+123456789::ZZ'
NAD+MR+987654321::ZZ'
ERC+Z06'
FTX+ABO+++9904383000003'
RFF+ACE:100048193285'
UNT+11+464009'
UNZ+1+491944'

The above sample is an APERAK message. I won’t go into any details about the structure other than that it consists of a number of segments (UNH, BGM, DTM etc.). Each segment is terminated by an apostrophe ('). Every segment has elements separated by “+”, which in turn can have a number of component data elements separated by “:”. Some of the elements are optional and some are mandatory.
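To make the segment/element/component layering concrete, here is a minimal sketch in plain Python (not MGrammar). It assumes the default service characters from the UNA line above, including “?” as the release (escape) character. The helper names split_escaped and parse_segments are made up for this illustration.

```python
def split_escaped(text, sep, release="?"):
    """Split text on sep, skipping separators escaped by the release character."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == release and i + 1 < len(text):
            # Keep the escape pair intact so that nested splits still see it.
            current.append(text[i:i + 2])
            i += 2
        elif ch == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def parse_segments(message):
    """Return a list of (tag, elements); each element is a list of components."""
    segments = []
    for raw in split_escaped(message, "'"):
        raw = raw.strip()
        # Skip blank remainders and the UNA service string advice.
        if not raw or raw.startswith("UNA"):
            continue
        tag, *elements = split_escaped(raw, "+")
        segments.append((tag, [split_escaped(e, ":") for e in elements]))
    return segments


sample = "UNA:+.? '\nNAD+MS+123456789::ZZ'\nFTX+ABO+++9904383000003'\n"
for tag, elements in parse_segments(sample):
    print(tag, elements)
```

Omitted (optional) elements and components show up as empty strings, which is exactly the case that makes the grammar below tricky. Escape pairs such as “?+” are left as they are in the output; a real parser would still have to unescape them.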

My problem arises when elements are optional. Have a look at the sample grammar below:

syntax Main =   a:A? del? b:B? del? c:C? =>{A=>a,B=>b,C=>c} ;
token del = ","; 
token A = ("A".."Z" | "a".."z" | "0".."9")+;
token B = ("A".."Z" | "a".."z" | "0".."9")+;
token C = ("A".."Z" | "a".."z" | "0".."9")+;

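The syntax above states that there are three tokens (A, B and C), and that they are all optional, as are the delimiters between them. Note that the three tokens share exactly the same pattern, so only their position in the input can tell them apart.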

Given the input: a,b,c the output will be:

{
  A => "a",
  B => "b",
  C => "c"
}

However, given only the input a,b the output is:

{
  A => null,
  B => "a",
  C => "b"
}

This was somewhat unexpected to me. I expected the tokens to be filled from the left, leaving the “C” element empty. To get that result, you need to spell out the valid combinations as explicit alternatives in the syntax:

syntax Main =   a:A del b:B del c:C? =>{A=>a,B=>b,C=>c}
                    | a:A del? b:B?=>{A=>a,B=>b}
                    | a:A=>{A=>a} ;

For the input a,b this gives the following output:

{
  A => "a",
  B => "b"
}
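
To see why the first grammar can legitimately produce the surprising result, it helps to enumerate every way the input matches the rule A? del? B? del? C?. The sketch below is plain Python, not MGrammar, and the function name parses is made up for this illustration. It only shows that several derivations exist. It does not model how the parser picks one of them.

```python
from itertools import product

SLOTS = ["A", "del", "B", "del", "C"]  # every slot is optional


def parses(tokens):
    """Return one value assignment per derivation of A? del? B? del? C?."""
    results = []
    for mask in product([True, False], repeat=len(SLOTS)):
        used = [slot for slot, keep in zip(SLOTS, mask) if keep]
        if len(used) != len(tokens):
            continue
        # A "del" slot must hold a comma; A, B and C accept any other token.
        if all((tok == ",") == (slot == "del") for slot, tok in zip(used, tokens)):
            values = {"A": None, "B": None, "C": None}
            for slot, tok in zip(used, tokens):
                if slot != "del":
                    values[slot] = tok
            results.append(values)
    return results


for assignment in parses(["a", ",", "b"]):
    print(assignment)
```

For a,b this finds four derivations covering three different assignments, and the one with A empty is among them. For a,b,c there is only a single derivation, which is why that input behaves as expected. Since A, B and C are identical patterns, nothing in the original rule forces the left-to-right reading.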