UMF - The United Message Format



Abstract

   A number of methods and tools are available for defining the format
   of messages used for signalling protocols.  However, many of these
   methods and tools have been designed for purposes other than message
   definition, and have been adopted on the basis that they are readily
   available rather than being ideally suited to the task.  This often
   means that the methods make it difficult to get definitions correct,
   or result in unnecessary verbosity both in the definition and on the
   wire.

   UMF - the United Message Format - has been custom designed for the
   purpose of message definition.  It is thus easy to specify messages
   in a compact, extensible format that is readily machine manipulated
   to produce a compact encoding on the wire.

1. Introduction

   This document defines the UMF message definition language, and the
   default text encoding method for messages defined in this way.

2. Requirements for Message Definition and Encoding

   A good message definition method will have the following properties.
   It is these properties that UMF has been designed to have.

   Precise Definitions

      It is important to accurately capture type information in a
      message definition.  Some message definition methods simply
      capture the name of a parameter without specifying the type of the
      parameter (e.g. integer, boolean etc).  Additionally types like
      integers need to be constrained to appropriate values.

      UMF provides this precision of definition.

   Compact Definitions

      The message definition should be as compact as possible, but no
      more compact.  While helpful to the inexperienced developer,
      excessive keywords and other formatting can actually be


Copyright Tech-Know-Ware Ltd, 2002.  All rights reserved.        Page 1
                  UMF - The United Message Format           August 2002


      detrimental to the understanding of the experienced developer.

      UMF adopts a compact C like definition that contains minimal
      clutter and thus allows the true message structure to be readily
      seen at a glance.

   Readily Extensible

      The message definition and the resultant on the wire encoding need
      to support extensibility.  As part of this, code should be able to
      pass over parameters that it does not understand without becoming
      confused.

      The UMF message definition and encoding allows this.

   Extensible by Third Parties

      It often occurs that a protocol is defined by one body and then
      adopted and modified by another body.  In other cases a base
      protocol may be defined that is then augmented by external
      profiles.  An effective method of allowing a third-party to
      accurately specify a message definition as deltas to an existing
      message definition is important in this respect.

      UMF allows third-parties to specify protocol additions that should
      not clash with additions made by other third parties.

   Machine Parsable

      It is desirable that the message definition be machine readable so
      that as much as possible of the slog involved in turning a message
      definition into running code is automated.  This improves time to
      market and significantly reduces the potential for introducing
      bugs into the code.

      An UMF definition is in many respects a generalised form of C data
      structure definition.  Therefore it is relatively simple to
      convert a machine independent UMF definition into a machine
      dependent C definition and provide all the code to convert from
      one data representation to another.  This process can remove a
      vast amount of slog.  Additionally, the various compilers involved
      in the process can do a large amount of validating to ensure that
      the implementation is correct.

   Simplicity

      While accurate message definition is important, it is perhaps even
      more important that the message definition method be intelligible
      to people that do not have a great deal of time to become gurus in
      yet another language.  Therefore the definition method should be
      quick and easy to learn.  This means that the message definition
      language must have minimal complexity.  As complexity of
      definition and expressiveness are often interrelated, in some
      cases it is necessary to restrict expressiveness in the interests
      of simplicity.  Additionally, consideration should also be given
      to the complexity of the required parser, which may favour
      simplicity of format over absolute message compactness.

      UMF is based on the 80-20 principle.  It is a small language that
      can accommodate the majority of situations extremely well.  There
      will be times where a UMF representation is sub-optimal in terms
      of on-the-wire compactness.  However, it is felt that on the
      whole, the gains in simplicity that this enables outweigh these
      sub-optimalities.

   Compact On-the-Wire Encoding

      As a general principle, it is desirable that encoded messages be
      as compact as possible.  This minimises transmission bandwidth,
      can make processing the messages more efficient, and prevents
      premature fragmentation of datagrams.  Compact messages are also
      important in the area of mobile devices that have limited memory
      and possibly transmission bandwidth.  This is particularly the
      case if the information is stored as persistent configuration data
      rather than being immediately discarded.  Also, in many cases,
      compact messages are easier for developers experienced in the
      protocol to read than some more verbose types, and it is these
      developers that should be the primary target for any measure aimed
      at easing debugging.

      Given that there are limits to how compactly the actual data in a
      message can be represented, the compactness of a message is
      determined largely by the tagging.  Existing protocols often use
      no tagging of data to minimise message size.  They also allow for
      comma separated lists of parameters that have the same meaning
      rather than requiring each parameter to be separately tagged.
      Additionally descriptive parameter names are essential to a clear
      message definition, but tags used in messages are often shorter
      than is descriptively useful (e.g. <p> instead of <paragraph>, <a>
      instead of <anchor>).  Therefore, it is desirable to be able to
      define a descriptive name that can be used in code and a tag name
      that can be used on the wire.  UMF accommodates all of these
      requirements.

   Flexible Implementation

      While turnkey solutions are desirable, they are potentially
      complex to develop, and thus may incur some cost to use, thus
      making them inaccessible to some.  Therefore a range of
      implementation routes are desirable, from minimal tools / maximum
      leg work, to maximal tools/minimum leg work.

      UMF has a number of implementation routes in addition to the
      compilation route.  An UMF definition can be converted into an
      ABNF definition and implemented via that route, or a DOM like tree
      based parsing method can be used.  (Downloadable software for
      these implementation routes is - or soon will be - available from
      [1].)

   Support Easy Application Debugging

      Ideally the messages on the wire should be in a form that aids
      the debugging process.

      By default UMF uses a text based line format, and is thus readily
      readable by human developers.  Additionally it is also easy to
      manually generate test messages.  With the aid of cb-like tools,
      it is possible to format messages so that they are more readable
      than the most compact line representation.  Additional tools make
      it possible to automatically generate test messages and use them
      as test vectors to test a parser, or validate that manually
      generated test messages actually conform to the message
      definition.

   Nesting of Protocols

      In some systems messages from one protocol are carried within
      messages from another protocol (TCP in IP is a simple example, as
      is HTML in HTTP).  The definition and line encoding should allow
      this.

      UMF allows this.

   Flexible On-the-Wire Encoding

      It is not always possible to anticipate the direction of
      development so flexibility in the actual wire representation of
      the messages is desirable.

      The principal UMF on-the-wire representation is text based.
      However, an UMF message definition can also be represented using
      alternate text formats such as XML, and can also be represented in
      binary.

2.1 That's UMF

   UMF has been specifically designed to meet all of the above
   requirements.  

3. UMF Message Definition

   This section describes how UMF specifies the content of messages.  As
   the syntax is C-like it is felt that many will immediately understand
   the message definition.  For this reason a short example of a message
   definition is presented before describing the format in detail.  The
   example is also used to give a rough indication of what the formal
   definition describes, and will thus hopefully help with the
   understanding of the latter.

3.1 Basic Principles of the Message Definition

   Before presenting an example, and a more formal definition, it may be
   helpful to describe the basic principles of the message definition
   format.

   Following the C language format, the basic format of a parameter
   definition is:

      type  name

   Type specifies things like integers, booleans, ASCII strings, Unicode
   strings and so on.    

   The name is obviously the name of the parameter.

   Thus a parameter definition might be:

      int   rfc-number ;

   In addition, a parameter definition can express constraints on the
   basic type, cardinality (how many instances of the type are valid in
   a message), and the tag to be used for the value on the wire.  For
   example, an integer may be limited to the values 0 to 255, and an
   ASCII string may be limited to a maximum size.  The fuller format of
   a parameter will have the form:

      type <constraint>  name [cardinality]  tagging

   For example:

      int <1..30000>  referenced-rfcs  [0..255]  as  refers ;

   This defines an integer that can have values between 1 and 30000.
   The name of the parameter is referenced-rfcs, but it is tagged
   on-the-wire by 'refers'.  The parameter can consist of between 0 and
   255 instances of the integer in a valid encoding.
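
   As a rough illustration of what such a definition constrains, the
   following Python sketch models the referenced-rfcs parameter and
   checks a list of decoded values against its range and cardinality.
   The IntParam class and its field names are hypothetical and not part
   of UMF, and the range bounds are assumed to be inclusive.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IntParam:
    """Hypothetical model of 'int <lo..hi> name [min..max] as tag ;'."""
    name: str
    lo: int
    hi: int
    min_occurs: int
    max_occurs: Optional[int]   # None stands for '*', i.e. unbounded
    tag: Optional[str] = None   # explicit tag, if one was given

    def wire_tag(self) -> str:
        # Without an explicit tag, the parameter name is used on the wire.
        return self.tag if self.tag is not None else self.name

    def is_valid(self, values: List[int]) -> bool:
        # Cardinality: how many instances appear in the message.
        if len(values) < self.min_occurs:
            return False
        if self.max_occurs is not None and len(values) > self.max_occurs:
            return False
        # Constraint: every instance must lie within lo..hi (assumed inclusive).
        return all(self.lo <= v <= self.hi for v in values)


# int <1..30000>  referenced-rfcs  [0..255]  as  refers ;
referenced_rfcs = IntParam("referenced-rfcs", 1, 30000, 0, 255, "refers")
```

   With this model, an empty list of values is accepted (the
   cardinality allows zero instances), while a list containing 0 or
   more than 255 entries is rejected.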

   Two types of compound parameter are also possible, these being
   'struct' and 'union'.  Having much the same meaning as they have in
   C, a struct specifies a group of parameters, all of which may be used
   in a particular instance of the struct.  A union similarly specifies
   a group of parameters, but in this case only one of the parameters
   can be used in any one instance of the union.

   An example of a struct is:

      struct  rfc-links
      {
            int             rfc-number;
            int <1..30000>  referenced-rfcs[0..255]  as  refers ;
      };
   

3.2 An Example Message Definition

   The following is an example message definition:

      module com.tech-know-ware.my-example

      struct  my-example
      {
            int <0..255>  participant-id  as  ?;
            Action        action  as  ?;
            struct        my-addition[0..1] as new.tech-know-ware.com plugin
            {
                  bool    tkw-app-capable  as  ?;
            };
      };

      union  Action
      {
            Join           join;
            Message        message  as  msg;
            void           leave;
      };

      struct  Join
      {
            ascii<0..63>   name;
      };

      struct  Message
      {
            int <0..255>   to-delegates[1..127]  as  to;
            ascii<0..255>  message  as  msg;
            [              // Version 2 additions
            int <0..5>     priority;
            bool           acknowledge as ack;
            ]
            [              // Version 5 additions
            ascii<0..16>   font-name[0..1] as font;
            void           bold[0..1];
            void           italic[0..1];
            void           underlined[0..1] as ul;
            ]
      };

   The above definition is intended to represent a very crude meeting
   controller.  The first construct (my-example) is the root of all
   messages for the protocol.  Each message identifies a participant
   using an integer in the range 0 to 255, called participant-id.  When
   encoded on the wire, this parameter will be untagged due to the 'as
   ?' specification.

   Each message then has an action, which is also untagged.  The type of
   the action parameter is not immediately specified, and instead
   references the 'Action' definition.  

   The Action definition is a union in which only one of the specified
   parameters may appear in an instance of the Action construct.  This
   effectively represents a fork in the semantics of any given message.
   The options within Action can indicate that somebody has joined the
   meeting, left the meeting, or is sending a message to other
   delegates.

   There is no explicit tag for the 'join' and 'leave' options, so these
   will be tagged on-the-wire by the parameters' names, 'join' and
   'leave' respectively.  Conversely, an explicit tag for the 'message'
   parameter is specified, and hence the message option will be tagged
   by 'msg' on-the-wire.

   The join parameter also has a referenced definition.  Conceptually,
   when a person joins a meeting, all the other delegates are informed
   of their name.  The name is an ASCII string that has a minimum length
   of 0 characters and a maximum length of 63 characters.

   The message option is also a referenced definition.  Conceptually, to
   send a message, the participant-id is used to identify the sender,
   and the to-delegates field contains the participant ids of all the
   people to whom the message is being sent.  On-the-wire, the
   to-delegates parameter will be tagged with 'to'.  Between one and 127
   instances of the to-delegates parameter may appear in a message.

   Also, the message itself is included.  The message will consist of
   ASCII characters and can be between 0 and 255 characters long.
   On-the-wire, the message field will have the tag 'msg'.  

   The priority and acknowledge fields within the message struct have
   been added in a later version of the protocol.  This is indicated by
   the square brackets in which the parameters are wrapped.  Similarly,
   font-name, and associated parameters have been added in version 5 of
   the protocol (according to the comment).  The reader should already
   understand enough of the definition language to understand the
   meaning of these fields.

   Returning to the 'my-example' root, a third-party has added an
   extension to the protocol in the form of the 'my-addition' parameter.
   It is identified as not being part of the base specification by the
   keyword 'plugin'.  On-the-wire, the additional parameter will be
   identified by the tag 'new.tech-know-ware.com' to differentiate it
   from additions that may be made by other third parties.

   On-the-wire encoded examples of this message definition are shown in
   section 4.2.

3.3 Formal Message Definition Syntax

   There are two types of parameter in UMF, simple types and compound
   types.  The ABNF definition of these is:

      UMF-parameter  =  simple-param  /  compound-param

   Simple types represent parameters such as integers, booleans etc.

   The ABNF definition of a simple param is:

      simple-param = simple-type WS name [ OWS cardinality ] 
                                         [ WS "as" WS explicit-tag ]
                                         [ WS plugin ]  ";"

   where WS represents white space, and OWS represents optional white
   space.

   The 'simple-type' represents the type of the parameter.  It can have
   the following forms:

      simple-type = "void" / "bool" / "ipv4addr" / "ipv6addr" /   
                    "date" / "time" / "oid" /
                    integer-type / string-type / bytes-type /
                    embedded-type / const-type / reference

   where:

      integer-type  =  "int"  [ OWS "<"  range-constraint  ">"  ]

      string-type  =  ( "ascii" / "unquoted-ascii" / "unicode" ) 
                      [ OWS "<" length-constraint ">" ]

      const-type = "const" OWS "<" first-safe-char *( safe-char ) ">"

      bytes-type = "bytes" [ OWS "<" length-constraint ">" ]

      embedded-type = "embedded" [ OWS "<" length-constraint ">" ]

      reference = [ module-name "::" ] name   ; Refers to a type defined 
                                              ; elsewhere

      

      range-constraint = constraint

      length-constraint = constraint

      constraint  =  [  min-constraint  ".."  ]  max-constraint

      min-constraint  =  ["-"] 1*DIGIT

      max-constraint  =  (  ["-"] 1*DIGIT  /  "*"  )

   In the case of integer-type, the optional constraint specifies the
   minimum and maximum permissible values that the integer can take.

   In the case of string-type, the optional constraint specifies the
   minimum and maximum number of characters that are allowed to appear
   in a valid encoding.

   In the case of bytes-type, the optional constraint specifies the
   minimum and maximum number of bytes that are allowed to appear in a
   valid encoding.

   In the constraint syntax, a maximum value '*' means infinite or
   unbounded.
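
   The constraint grammar above is small enough to be recognised with a
   single regular expression.  The following Python sketch is one
   possible reading of it; the function name parse_constraint is
   hypothetical.  Because the meaning of an omitted minimum is not
   stated here, the sketch simply reports it as None rather than
   guessing a default.

```python
import re
from typing import Optional, Tuple

# constraint      =  [ min-constraint ".." ] max-constraint
# min-constraint  =  ["-"] 1*DIGIT
# max-constraint  =  ( ["-"] 1*DIGIT / "*" )
_CONSTRAINT = re.compile(r"(?:(-?[0-9]+)\.\.)?(-?[0-9]+|\*)")


def parse_constraint(text: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (minimum, maximum).

    minimum is None when no min-constraint was given;
    maximum is None when max-constraint is '*' (unbounded).
    """
    match = _CONSTRAINT.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid constraint: {text!r}")
    lo_text, hi_text = match.groups()
    lo = int(lo_text) if lo_text is not None else None
    hi = None if hi_text == "*" else int(hi_text)
    return lo, hi
```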

   The various types have the following meaning:

      void

         A parameter that has no value.  This is most useful in unions,
         and can also be used to represent boolean events wherein the
         absence of the parameter indicates false, and the presence of
         the parameter indicates true.  It is more useful than you might
         at first think!

      bool

         Can be true or false

      int

         An integer value

      ipv4addr

         Represents an IPv4 address, but not the port.

      ipv6addr

         Represents an IPv6 address, but not the port.

      date

         Date according to the Gregorian calendar, with year, month and
         date.  Other calendar types may be constructed from primitive
         types if required.

      time

         Represents the time in hours, minutes and seconds.  By default
         the time is adjusted to UTC, unless the time can be guaranteed
         to have only local significance.

      oid

         This is an ASN.1 style Object Identifier.  This is primarily
         included to enable identification of security protocols.

      ascii

         A string made up of ASCII characters, limited at most to values
         0 to 127.

      unquoted-ascii

         An ascii string usually has quote marks around it.  This type
         does not have quotes around it.  Consequently it can not have
         any white space, or include any special characters (such as
         "=", "{", and "}") that would confuse the parser.

      unicode

         A string made up of Unicode characters.

      const

         This type allows a constant value to be inserted into the
         encoded message.  It will typically be untagged.  One thing it
         might be used for is identifying the protocol of the message
         definition.  For example:

            const <HTTP>   protocol as ?;

      bytes

         An array of bytes.  Also useful for carriage of opaque data.

      embedded

         The value is an embedded UMF message.  This allows layering of
         message definitions.

   The name is the name of the parameter.  If there is no explicitly
   defined tag, then this is also used as the parameter's tag
   on-the-wire.  It has the format:

      name  =  ALPHA  *(  ALPHA / DIGIT  /  "-"  /  "_"  )
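
   For tool writers, the name rule translates directly into a regular
   expression.  The Python sketch below shows one way to check a
   candidate name; the helper is_valid_name is a hypothetical name used
   only for illustration.

```python
import re

# name  =  ALPHA  *(  ALPHA / DIGIT  /  "-"  /  "_"  )
_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


def is_valid_name(text: str) -> bool:
    """True if text is a single letter followed by letters, digits, '-' or '_'."""
    return _NAME.fullmatch(text) is not None
```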

   The cardinality of a parameter specifies how many times a particular
   parameter can appear in a message.  The format mirrors a C-like array
   specification, but uses UML style ranges rather than singular values
   as are required in C.  If the cardinality field is absent, then one
   and only one instance of the parameter must occur in a valid message.
   The format of the cardinality specification is:

      cardinality = "[" [ min-occurrences ".." ] max-occurrences "]"

      min-occurrences  =  ["-"] 1*DIGIT

      max-occurrences  =  ( ["-"] 1*DIGIT / "*" )

   Once again, the '*' in max-occurrences represents infinite or
   unbounded.  Example cardinalities are as follows:

      [0..1]      ; Zero or one time

      [0..*]      ; Zero or more times

      [*]         ; Same as above, zero or more times

      [1..*]      ; One or more times

      [5]         ; Exactly five times
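
   The example cardinalities above can be turned into (minimum,
   maximum) pairs as sketched below in Python.  The function name
   parse_cardinality is hypothetical.  The sketch follows the examples:
   a lone number means "exactly that many", a lone '*' means "zero or
   more", and an absent cardinality means exactly one.  Although the
   ABNF admits a leading "-", negative occurrence counts are rejected
   here as an assumption of this sketch.

```python
import re
from typing import Optional, Tuple

_CARDINALITY = re.compile(r"\[(?:([0-9]+)\.\.)?([0-9]+|\*)\]")


def parse_cardinality(text: Optional[str]) -> Tuple[int, Optional[int]]:
    """Return (min_occurrences, max_occurrences); max None means unbounded."""
    if text is None:
        # No cardinality field: one and only one instance must occur.
        return 1, 1
    match = _CARDINALITY.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid cardinality: {text!r}")
    lo_text, hi_text = match.groups()
    if hi_text == "*":
        # "[*]" is the same as "[0..*]".
        return (int(lo_text) if lo_text is not None else 0), None
    hi = int(hi_text)
    # "[5]" means exactly five times.
    lo = int(lo_text) if lo_text is not None else hi
    return lo, hi
```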

   An explicit tag can be any sequence of characters that do not have
   special significance to the parser.  If the tag definition begins
   with a "?", the "?" is discarded.  Thus to specify that ? be used as
   the tag on-the-wire, specify explicit-tag to be ??.

      explicit-tag = tag      ; tag defined in common definitions
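
   The tagging rules can be summarised in a few lines of Python.  This
   is a sketch of one reading of the rules, not a normative algorithm:
   the function name wire_tag is hypothetical, and it assumes that a
   bare "?" (which leaves nothing once the "?" is discarded) is what
   makes a parameter untagged, as in the 'as ?' examples.

```python
from typing import Optional


def wire_tag(name: str, explicit_tag: Optional[str] = None) -> Optional[str]:
    """Return the tag used on the wire, or None if the parameter is untagged."""
    if explicit_tag is None:
        # No explicit tag: the parameter name doubles as the tag.
        return name
    if explicit_tag.startswith("?"):
        # The leading '?' is discarded; "??" therefore yields the tag "?".
        remainder = explicit_tag[1:]
        return remainder if remainder else None
    return explicit_tag
```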

   Marking an item as plugin indicates to the developer and the tools
   that this parameter is (probably) not part of the original message
   definition.  For example, it might be a proprietary extension.  It
   also indicates that the parameter may not be present in all received
   messages, and impacts on the way the binary encoding operates.

   The compound types are struct and union.  For a struct, subject to
   the various parameters' cardinality specifications, any, all or none
   of the parameters that a struct groups together may appear in a
   valid encoding of the construct.  In the case of a union, only one
   of the parameters may be encoded in a valid instance of the
   construct.
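
   The difference between the two compound types can be sketched as a
   pair of Python checks over the number of times each member appeared
   in a decoded instance.  The function names are hypothetical.  The
   sketch assumes that a union instance carries exactly one member, and
   it ignores members it does not know about, in keeping with the
   requirement that unknown parameters can be passed over.

```python
from typing import Dict, Iterable, Optional, Tuple

Cardinality = Tuple[int, Optional[int]]  # (min, max); max None means '*'


def struct_is_valid(members: Dict[str, Cardinality],
                    counts: Dict[str, int]) -> bool:
    """Every known member must occur within its own cardinality."""
    for name, (lo, hi) in members.items():
        seen = counts.get(name, 0)
        if seen < lo or (hi is not None and seen > hi):
            return False
    return True


def union_is_valid(member_names: Iterable[str],
                   counts: Dict[str, int]) -> bool:
    """Exactly one known member must be present, and only once."""
    present = [name for name in member_names if counts.get(name, 0) > 0]
    return len(present) == 1 and counts[present[0]] == 1
```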

   The format of the compound types is similar to the simple types.
   They have the form:

      compound-param  =  struct-param  /  union-param

      struct-param  =  "struct" WS name [ OWS cardinality ] 
                                        [ WS "as" WS explicit-tag ] 
                                        [ WS pluggable ]
                                        [ WS plugin ] 
                                OWS "{" struct-body "}" OWS ";"

      union-param = "union" WS name [ OWS cardinality ] 
                                        [ WS "as" WS explicit-tag ]
                                        [ WS pluggable ]
                                        [ WS plugin ]
                                OWS "{" union-body "}" OWS ";"

   In a struct and union the pluggable keyword indicates that the
   construct is a location that the message designers have formally
   declared as extendible using the 'plug' mechanism that is described
   further below.  UMF compilers are encouraged to emit warnings when
   extra material is plugged into locations that are not marked as
   pluggable, but should not consider it an error.

   The format of the struct body is:

      struct-body = *( untagged-UMF-parameter )
                    *( UMF-parameter ) 
                    *( struct-extension )

   The struct body starts with all the untagged parameters.  Untagged
   parameters may have a cardinality other than one.  Note that, if the
   cardinality of an untagged parameter allows it to be absent, then
   when encoded on the wire, all parameters, including tagged
   parameters, must also be absent.  Thus great care is recommended
   when defining a message syntax that allows for an untagged parameter
   to be absent.

   Following the untagged parameters, the tagged parameters are
   included.  When the message definition is subsequently extended,
   another instance of the extension parameters construct is added for
   each version in which the construct is extended.  (Note that all new
   parameters must always be added onto the end of an existing
   construct, and the order of parameters must never be rearranged from
   one version to the next.)

   All of these have a similar format to the types already defined,
   except that in some cases they may be untagged, or only allow a unary
   cardinality.  To make the ABNF definition accurate it is therefore
   necessary to repeat the above basic definitions with the appropriate
   tagging and cardinality specifications.

   As mentioned, the struct body may start with untagged-UMF-parameters.
   These are untagged and, as noted above, may carry a cardinality
   specification.  Their definition is:

      untagged-UMF-parameter  =  untagged-simple-param  / 
                                      untagged-compound-param

      untagged-simple-param = simple-type WS name [ OWS cardinality ] WS 
                                                     "as" WS "?" OWS ";"

      untagged-compound-param = untagged-struct-param / 
                                     untagged-union-param

      untagged-struct-param = 
                           "struct" WS name [ OWS cardinality ] 
                                        WS "as" WS "?"  
                                        [ WS pluggable ]
                                        OWS "{" struct-body "}" OWS ";"

      untagged-union-param = "union" WS name [ OWS cardinality ] 
                                        WS "as" WS "?"
                                        [ WS pluggable ]
                                        OWS "{" union-body  "}" OWS ";"

   Note that the plugin keyword is not applicable to untagged items.

   The second part of a struct definition consists of the tagged items.
   These can have any desired cardinality.  These have the basic
   parameter definition that was initially presented, i.e.
   UMF-parameter.

   The third and final part of a struct body consists of the extension
   fields.  These are parameters that are added in subsequent versions
   of the protocol specification.  They are marked out separately
   because a parser must always consider absence of these parameters to
   be a valid encoding, so that it can receive messages from entities
   that are working with an earlier version of the protocol.  Achieving
   this through cardinality alone would require every extension
   parameter to have a cardinality specification that included zero.
   This is tedious, potentially error prone, and loses some
   expressiveness.  Instead, extension parameters are wrapped inside
   square brackets to indicate that they are extensions.  It is then
   clear to any tools and developers that these parameters may be
   absent if a message is received from a host running an earlier
   version of the message definition.  The format of the struct
   extension is:

      struct-extension = "[" 1*( UMF-parameter ) "]"
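
   The effect of the square brackets on validation can be sketched in
   Python as follows.  The function names are hypothetical, and the
   sketch is deliberately simplified: it keeps the declared cardinality
   of a parameter but treats its minimum as zero whenever the parameter
   sits inside an extension block, since a sender using an earlier
   version of the definition will omit it.

```python
from typing import Optional, Tuple

Cardinality = Tuple[int, Optional[int]]  # (min, max); max None means '*'


def effective_cardinality(declared: Cardinality,
                          in_extension: bool) -> Cardinality:
    """Absence is always valid for parameters inside '[' ... ']'."""
    lo, hi = declared
    return (0, hi) if in_extension else (lo, hi)


def count_is_valid(count: int, declared: Cardinality,
                   in_extension: bool) -> bool:
    lo, hi = effective_cardinality(declared, in_extension)
    return count >= lo and (hi is None or count <= hi)
```

   For the 'priority' parameter of the earlier example, which has no
   cardinality field and so is declared as exactly one, both zero and
   one occurrences are accepted, while two are not.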

   The definition of a union-body is as follows:

      union-body = [  integer-type WS name WS "as" WS "?" OWS ";" ]
                   *( singular-UMF-parameter ) 
                   *( union-extension )
      

   A union-body may have a single untagged integer parameter.  All other
   parameters must be tagged and have a cardinality of one and only one.
   A union is extended in much the same way as a struct.

   The untagged integer parameter allows integers to be defined that
   have wild-carding options.  For example, a union might be defined as:

      union  select
      {
            int<0..65535>  numbered  as ?;
            void           any       as *;
      };
      

   Examples of the encoded form might be:

      select = 12

      select = *
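
   A receiver of the 'select' union has to decide which member a value
   represents.  The Python sketch below shows one way of doing this for
   the value text to the right of the "=" in the examples above.  The
   function name decode_select is hypothetical, and the range bounds
   are assumed to be inclusive.

```python
from typing import Optional, Tuple


def decode_select(value: str) -> Tuple[str, Optional[int]]:
    """Map a value of the 'select' union to (member name, integer or None)."""
    text = value.strip()
    if text == "*":
        # The void member 'any', tagged on the wire by '*'.
        return "any", None
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a valid select value: {value!r}")
    number = int(text)
    if not 0 <= number <= 65535:
        # Outside the int<0..65535> constraint of 'numbered'.
        raise ValueError(f"select value out of range: {number}")
    return "numbered", number
```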

   The parameters within a union are only allowed unary cardinality to
   avoid ambiguity in the line encoding.  If multiple instances of a
   parameter must be included as an option in a union, it is necessary
   to wrap the parameters within a struct, using something similar to:

      struct X { X      x[1..*] as ?; };

   As mentioned, most of the parameters within a union are tagged and
   have a cardinality of one.  Their definition is:

      singular-UMF-parameter  =  singular-simple-param  / 
                                 singular-compound-param

      singular-simple-param = simple-type WS name 
                                        [ WS "as" WS explicit-tag ] 
                                        [ WS  plugin  ] OWS ";"

      singular-compound-param = singular-struct-param / singular-union-param

      singular-struct-param = "struct" WS name [ WS "as" WS explicit-tag ]
                                               [ WS pluggable ]
                                               [ WS plugin ] 
                                OWS "{" struct-body "}" OWS  ";"

      singular-union-param = "union" WS name [ WS "as" WS explicit-tag ] 
                                             [ WS pluggable ]
                                             [ WS plugin ]
                                OWS "{" union-body "}" OWS ";"
      

   The union extension operates in a similar fashion to that of the
   struct, but references singular-UMF-parameters.  Its definition is:

      union-extension = "[" 1*( singular-UMF-parameter ) "]"

   It was mentioned previously that unions and structs could reference
   types that are defined elsewhere.  The format of a referenced type
   can now be defined.  Referenced types have a cardinality of one, and
   are untagged.  This is because the cardinality and tagging of the
   type are defined in the item that does the referencing, rather than
   where the referenced type is defined.  (If a referenced type needs a
   cardinality other than one, it is recommended that the trick for
   giving a parameter within a union a non-unary cardinality be used.)  

   The definition of the referenced types is:

      referenced-UMF-parameter  =  referenced-simple-param  / 
                                   referenced-compound-param

      referenced-simple-param = simple-type  WS   name  ";"

      referenced-compound-param = referenced-struct-param / 
                                 referenced-union-param

      referenced-struct-param = "struct" WS name [ WS pluggable ]
                                OWS "{" struct-body "}" OWS ";"

      referenced-union-param = "union" WS name [ WS pluggable ]
                                OWS "{" union-body "}" OWS ";"

      

   A protocol may be extended by a third party without modifying the
   original definition.  This may be due to a proprietary extension, or
   an externally defined profile of the base protocol.  The
   specification for this type of extension is:

      third-party-extension = "plug" WS
                               ( tp-struct-extension / 
                                 tp-union-extension )
                              "into" WS name *( "::" name )
                                    *( COMMA name *( "::" name ) ) OWS ";"

      tp-struct-extension = UMF-parameter
      tp-union-extension = singular-UMF-parameter
      

   This specifies a parameter that is to be plugged into an existing
   construct.  For example, if the following were defined:

      plug 
            ascii cookie as cookie.tkwumf.com 
      into my-example::my-addition;
      



   The resultant definition would be treated as if it were:

      struct  my-example
      {
            int <0..255>  participant-id  as  ?;
            Action        action  as  ?;
            struct        my-addition[0..1] as tech-know-ware.com plugin
            {
                  bool    tkw-app-capable  as  ?;
                  ascii   cookie as cookie.tkwumf.com plugin;
            };
      };
      

   The name field indicates the name of the construct that the item is
   to be plugged into.
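
   The effect of a plug can be sketched in code.  The following is a
   minimal illustration only: the nested-dict model of a definition,
   the function name plug_into and the "members" and "plugin" keys are
   all assumptions made for this example and are not part of UMF.

```python
import copy


def plug_into(root, target, parameter):
    """Return a copy of `root` with `parameter` appended to the construct
    named by `target`, e.g. "my-example::my-addition"."""
    names = target.split("::")
    result = copy.deepcopy(root)
    if result["name"] != names[0]:
        raise KeyError(names[0])
    node = result
    for name in names[1:]:
        # Descend one level at a time through the named members.
        matches = [m for m in node.get("members", []) if m["name"] == name]
        if not matches:
            raise KeyError(name)
        node = matches[0]
    # Mark the new member as a plugin, as in the expanded definition above.
    node.setdefault("members", []).append(dict(parameter, plugin=True))
    return result


# Hypothetical in-memory model of the my-example struct.
my_example = {
    "name": "my-example",
    "members": [
        {"name": "participant-id", "type": "int <0..255>", "tag": "?"},
        {"name": "action", "type": "Action", "tag": "?"},
        {"name": "my-addition", "type": "struct",
         "tag": "tech-know-ware.com",
         "members": [
             {"name": "tkw-app-capable", "type": "bool", "tag": "?"},
         ]},
    ],
}

cookie = {"name": "cookie", "type": "ascii", "tag": "cookie.tkwumf.com"}
extended = plug_into(my_example, "my-example::my-addition", cookie)
```

   The original definition is left unmodified, which mirrors the
   statement that a third party can extend a protocol without changing
   the original definition.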

   A single protocol may be defined in a number of message definition
   files.  This might be for the purpose of accessing predefined
   libraries, or specifying the definition that the current definition
   extends.  A message definition therefore begins with a set of
   optional directives expressing this information.  They have the form:

      UMF-directives = OWS
                       [ "module" WS module-name WS ]
                       [ "extends" WS module-name OWS ";" OWS ]
                       *( "imports" WS module-name OWS ";" OWS )

      module-name = name *( "." name )
      

   Module specifies the name of the module.

   Extends is used for a definition that contains a third party
   extension.  The module-name in the extends specification indicates
   the message definition that is being extended.

   The imports statement indicates a library message definition that
   contains referenced types that are referenced within the message
   definition.

   The module-name follows the hierarchical format used in Java.  It is
   based on a domain name that is created from the name of the protocol,
   combined with the domain name of the entity that defined it.  For
   example, if a protocol called the Simple Conference Protocol (SCP)
   were defined by Tech-Know-Ware Ltd with a domain name of
   tech-know-ware.com, the module name might be:

      com.tech-know-ware.scp
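
   As an illustration, such a name could be derived as sketched below.
   The function name module_name is hypothetical, and folding the
   result to lower case is an assumption suggested by the example above
   rather than a stated rule.

```python
def module_name(domain, protocol):
    """Build a Java-style module name from a domain and a protocol name."""
    # Reverse the domain labels so the top level domain comes first.
    labels = domain.lower().split(".")
    return ".".join(list(reversed(labels)) + [protocol.lower()])


scp_module = module_name("tech-know-ware.com", "SCP")
```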

   UMF defines a number of pseudo top level domains for its own
   purposes.  These are currently as follows:



   +ietf A pseudo top level domain for the Internet Engineering Task
         Force.

   +iso  A pseudo top level domain for the International Organization
         for Standardization (ISO).  The sub-domains of this domain
         follow the structure of ISO defined Object Identifiers.  (All
         spaces must be removed and numbers in brackets should be
         ignored when parsing this domain.  E.g. iso(1) member-body(2)
         us(840) rsadsi(113549) digestAlgorithm(2) 5 shall be
         represented as
         +iso(1).member-body(2).us(840).rsadsi(113549).digestAlgorithm(2).5
         and looked up as +iso.member-body.us.rsadsi.digestAlgorithm.5)

   +itu  A pseudo top level domain for the International
         Telecommunication Union.  The sub-domains of this domain
         follow the structure of ITU defined Object Identifiers.
         Processing of such identifiers follows that defined for
         processing ISO Object Identifiers.

   +umf  A pseudo top level domain for defining UMF extensions and
         libraries. 

   +uuid A pseudo top level domain that uses Universally Unique
         Identifiers for identification.  An example is: 

            +uuid.4d36e96c-e325-11ce-bfc1-08002be10318

   National standards bodies such as ANSI and BSI are defined under
   their national top-level domain.
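
   The processing described for the +iso domain can be sketched as
   follows.  The function names are hypothetical, and the input is
   assumed to be an Object Identifier written in the space separated
   form shown in the +iso entry above.

```python
import re


def iso_module_name(oid_text):
    """Represent a space separated OID as a +iso module name."""
    # Splitting on white space removes all spaces between components.
    return "+" + ".".join(oid_text.split())


def iso_lookup_name(oid_text):
    """Return the form used for look-up, with bracketed numbers ignored."""
    return re.sub(r"\(\d+\)", "", iso_module_name(oid_text))


md5 = "iso(1) member-body(2) us(840) rsadsi(113549) digestAlgorithm(2) 5"
represented = iso_module_name(md5)
looked_up = iso_lookup_name(md5)
```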

   Finally, we are in a position to describe a complete UMF message
   definition.  This is:

      UMF-definition  =  UMF-directives
                         1* ( referenced-UMF-parameter /
                              third-party-extension )
      

   The first parameter defined within the message definition is the root
   of the message definition tree, and is thus the outer-most construct
   of an encoded message.

3.4 Complete ABNF

   This section presents the complete ABNF of a message definition
   without narrative.  Some definitions are common with the on-the-wire
   ABNF and are presented in a separate section. 

      UMF-definition  =  UMF-directives
                         1* ( referenced-UMF-parameter /
                              third-party-extension )

      UMF-directives = OWS
                       [ "module" WS module-name WS ]
                       [ "extends" WS module-name OWS ";" OWS ]
                       *( "imports" WS module-name OWS ";" OWS )

      module-name = name *( "." name )

      referenced-UMF-parameter  =  referenced-simple-param  / 
                                   referenced-compound-param

      referenced-simple-param = simple-type  WS   name  ";"

      simple-type = "void" / "bool" / "ipv4addr" / "ipv6addr" / 
                    "date" / "time" / "oid" /
                    integer-type / string-type / bytes-type / 
                    embedded-type / const-type / reference

      integer-type = "int" [ OWS "<" range-constraint ">" ]

      string-type = ( "ascii" / "unquoted-ascii" / "unicode" ) 
                                [ OWS "<"  length-constraint ">" ]

      bytes-type = "bytes" [ OWS "<" length-constraint ">" ]

      const-type = "const" OWS "<" first-safe-char *( safe-char ) ">"

      embedded-type = "embedded" [ OWS "<" length-constraint ">" ]

      reference = [ module-name "::" ] name     ; Refers to a type 
                                                ; defined elsewhere

      range-constraint = constraint
      length-constraint = constraint
      constraint  =  [  min-constraint  ".."  ]  max-constraint
      min-constraint  =  ["-"] 1*DIGIT
      max-constraint  =  (  ["-"] 1*DIGIT  /  "*"  )

      name  =  ALPHA  *(  ALPHANUM  /  "-"  /  "_"  )

      referenced-compound-param = referenced-struct-param / 
                                 referenced-union-param

      referenced-struct-param = "struct" WS name [ WS pluggable ]
                                OWS "{" struct-body "}" OWS ";"

      struct-body = *( untagged-UMF-parameter )
                    *( UMF-parameter ) 
                    *( struct-extension )

      referenced-union-param = "union" WS name [ WS pluggable ]
                                OWS "{" union-body "}" OWS ";"

      union-body = [  integer-type WS name WS "as" WS "?" OWS ";" ]
                   *( singular-UMF-parameter ) 
                   *( union-extension )

      untagged-UMF-parameter  =  untagged-simple-param  / 
                                      untagged-compound-param

      untagged-simple-param = simple-type WS name [ OWS cardinality ]
                                                WS "as" WS  "?"  ";"

      untagged-compound-param = untagged-struct-param / 
                                     untagged-union-param

      untagged-struct-param = 
                           "struct" WS name [ OWS cardinality ]
                                    WS "as" WS "?"  
                                    [ WS pluggable ]
                                OWS "{" struct-body "}" OWS ";"

      untagged-union-param = 
                           "union" WS name [ OWS cardinality ]
                                    WS "as" WS "?"
                                    [ WS pluggable ]
                                OWS "{" union-body "}" OWS ";"

      UMF-parameter  =  simple-param  /  compound-param

      simple-param = simple-type  WS  name [ OWS cardinality ]  
                                    [ WS "as" WS  explicit-tag  ]  
                                    [  WS  plugin  ]  ";"

      cardinality = "[" [ min-occurrences ".." ] max-occurrences "]"
      min-occurrences  =  ["-"] 1*DIGIT
      max-occurrences  =  (  ["-"] 1*DIGIT  /  "*"  )

      explicit-tag = tag      ; tag defined in common definitions

      compound-param  =  struct-param  /  union-param
      struct-param = "struct" WS name [ OWS cardinality ] 
                                      [ WS "as" WS explicit-tag ] 
                                      [ WS pluggable ]
                                      [ WS plugin ] 
                                OWS "{" struct-body "}" OWS ";"
      union-param = "union" WS name [ OWS cardinality ] 
                                    [ WS "as" WS explicit-tag ]
                                    [ WS pluggable ]
                                    [ WS plugin ]
                                OWS "{" union-body "}" OWS ";"

      struct-extension = "[" 1*( UMF-parameter ) "]"

      singular-UMF-parameter  =  singular-simple-param  / 
                                 singular-compound-param



      singular-simple-param = simple-type WS name 
                                        [ WS "as" WS explicit-tag ] 
                                        [ WS  plugin  ] OWS ";"

      singular-compound-param = singular-struct-param / 
                                singular-union-param

      singular-struct-param = "struct" WS name [ WS "as" WS explicit-tag ]
                                               [ WS pluggable ]
                                               [ WS plugin ] 
                                OWS "{" struct-body "}" OWS  ";"

      singular-union-param = "union" WS name [ WS "as" WS explicit-tag ] 
                                             [ WS pluggable ]
                                             [ WS plugin ]
                                OWS "{" union-body "}" OWS ";"

      third-party-extension = "plug" WS
                               ( tp-struct-extension / 
                                 tp-union-extension )
                              "into" WS name *( "::" name )
                                    *( COMMA name *( "::" name ) ) OWS ";"

      tp-struct-extension = UMF-parameter
      tp-union-extension = singular-UMF-parameter

   

4. On-the-Wire Representation



4.1 Principles of On-the-Wire Encoding

   The basic format of the text based on-the-wire encoding is to use the
   format:

      tag  =  value

   If there are multiple instances of a parameter, then they may either
   be conveyed as multiple instances of the above construct, or as a
   comma separated list, as in:

      tag  =  value, value, value

   If a tag is explicitly specified in the message definition, then this
   is used on the wire.  If no tag is explicitly specified, then the
   name of the parameter is used as the tag.  

   It is also possible to explicitly specify that no tag should be used
   on the wire by setting the explicit tag field to '?'.  All untagged
   items must appear in a struct in the same order that they are defined
   in the message definition, and must appear before any tagged items
   within a struct definition.  Untagged parameters that have greater
   than one instance must be constructed as a comma separated list.  In
   these cases, the format on the wire becomes:

      value

   or:

      value, value, value

   If an untagged parameter has a cardinality that allows it to be
   absent from an encoded message, then all subsequent parameters in the
   enclosing struct, including tagged parameters, must also be absent.
   Consequently, great care should be taken when defining a message
   definition that allows untagged parameters to be absent.

   Thus, for the examples quoted earlier, that is:

      int                rfc-number ;
      int <1..30000>     referenced-rfcs [0..255] as refers;

   The format on the wire would be something like (depending on the
   actual values in question):

      rfc-number = 3024  refers = 822, 791, 2543
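
   The tagging rules above can be sketched as a small encoder.  This is
   an illustration only: the function names are hypothetical, only
   integer values are handled, and the two spaces placed between
   parameters are simply one permitted choice of white space.

```python
def encode_parameter(name, values, explicit_tag=None):
    """Encode one parameter as 'tag = value, value, ...'."""
    encoded = ", ".join(str(v) for v in values)
    if explicit_tag == "?":
        # Explicitly untagged: only the comma separated values appear.
        return encoded
    # Use the explicit tag if one is given, else the parameter name.
    tag = explicit_tag if explicit_tag is not None else name
    return f"{tag} = {encoded}"


def encode_struct_body(parameters):
    """Encode (name, values[, explicit_tag]) tuples in definition order."""
    return "  ".join(encode_parameter(*p) for p in parameters)


wire = encode_struct_body([
    ("rfc-number", [3024]),
    ("referenced-rfcs", [822, 791, 2543], "refers"),
])
```

   With these inputs, wire holds the same text as the example above.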

4.2 Example On-the-Wire Representation

   The following are example on-the-wire representations of the example
   message.

      1  
      join = { 'Alice' }  
      tech-know-ware.com  =  { True  }

      1  
      msg = { to = 2, 5, 8, 58  
            msg = 'Where are we going for dinner' }  

      1  
      leave  

   

4.3 Formal On-the-Wire Representation

   The principal representation of an UMF defined message on the wire is
   text based.  

   Parameters may be untagged as long as they appear before any other
   tagged parameters.  Untagged parameters that have non-singular
   cardinality must be comma separated. 



   The top-level construct of an UMF definition is a referenced type,
   which essentially has no tag associated with it.  (Indeed, the
   presence of such a tag would not convey any information.)  The
   top-level construct is therefore either a struct body, a union body,
   or a simple value, as in:

      UMF-text-message  = ( struct-body  /
                          union-body )

   A struct body can contain untagged and tagged parameters.  All
   untagged parameters must appear before any tagged parameters.  The
   definition of a struct-body is therefore:

      struct-body = OWS
                    *( value *( COMMA value ) WS )
                    *( ( tag WS ) /              ; For a void parameter
                       ( tag  EQUAL  value *(  COMMA  value ) WS ) )
                    ; WS not required if it's the last item

   All items of a union body must be tagged, except for a single integer
   parameter that may be untagged.  Also, parameters must only have a
   cardinality of one in the encoding to avoid ambiguities in the
   encoded message.  Therefore a union body has the form:

      union-body =  OWS (integer-value WS /
                    tag WS /               ; For a void parameter
                    ( tag EQUAL value WS ) )

   where:

      value = simple-value / compound-value

      simple-value = bool-value / integer-value / oid-value /
                     ipv4addr-value / ipv6addr-value  /   
                     ascii-value / unquoted-ascii-value / unicode-value /
                     const-value / embedded-value / bytes-value / 
                     date-value / time-value

      bool-value = "True" / "False" / "T" / "F"

      integer-value = [ "-" ] 1*DIGIT

      oid-value = 1*DIGIT *( "~" 1*DIGIT )    

                     ; Only the oid's numerical parts are represented

      ipv4addr-value = 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT

      ipv6addr-value = ( 1*4HEX *( ":"  1*4HEX ) 
                                    [ ":" *( ":"  1*4HEX ) ] )

   Date and time parameters have fixed width to aid parsing.  As such
   the various fields have leading zeros if required.  

   Dates are according to the Gregorian calendar.  Other calendar types
   may be constructed from primitive types if required.

   Typically the time should be converted to UTC prior to including in a
   message, unless the time can be guaranteed to have only local
   significance.

      date-value = date-year "-" date-month "-" date-day
      date-year = 4DIGIT                  ; e.g. 2002
      date-month = 2DIGIT                 ; With leading zeros, e.g. 02
      date-day = 2DIGIT                   ; With leading zeros, e.g. 02

      time-value = time-hours ":" time-minutes ":" time-seconds
      time-hours = 2DIGIT                 ; With leading zeros, e.g. 02
      time-minutes = 2DIGIT               ; With leading zeros, e.g. 02
      time-seconds = 2DIGIT               ; With leading zeros, e.g. 02
                                          ; Uses 24 hour clock notation
                                          ; All times presented in UTC
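
   A minimal sketch of producing these fixed width values is shown
   below.  The function name encode_date_time is hypothetical, and the
   input is assumed to be a time zone aware datetime so that it can be
   converted to UTC before encoding.

```python
from datetime import datetime, timedelta, timezone


def encode_date_time(moment):
    """Return (date-value, time-value) for an aware datetime, in UTC."""
    utc = moment.astimezone(timezone.utc)
    # Explicit widths give the leading zeros required by the grammar.
    date_value = f"{utc.year:04d}-{utc.month:02d}-{utc.day:02d}"
    time_value = f"{utc.hour:02d}:{utc.minute:02d}:{utc.second:02d}"
    return date_value, time_value


# 13:00:00 at one hour ahead of UTC is 12:00:00 UTC.
local = datetime(2002, 2, 28, 13, 0, 0, tzinfo=timezone(timedelta(hours=1)))
date_value, time_value = encode_date_time(local)
```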

      

      ascii-value = 
           "'" *( %x00-26 / %x28-5B / %x5D-7F / "\\" / "\'" ) "'"
      

      unquoted-ascii-value =  first-safe-char *( safe-char )

      unicode-value = DQUOTE
                 *( %x00-21 / %x23-5B / %x5D-FF / "\\" / "\" DQUOTE ) 
                  DQUOTE
                             ; DQUOTE defined in RFC 2234
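
   The escaping implied by the quoted string rules can be sketched as
   follows.  The function names are hypothetical, and it is assumed
   that the backslash and the enclosing quote character are the only
   characters that need escaping.

```python
def encode_ascii(text):
    """Encode text as an ascii-value enclosed in single quotes."""
    if any(ord(ch) > 0x7F for ch in text):
        raise ValueError("ascii values are limited to 7-bit characters")
    # Escape the backslash first so that added escapes are not doubled.
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def encode_unicode(text):
    """Encode text as a unicode-value enclosed in double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'
```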

      bytes-value = "^" BASE64
      BASE64 = *( 4BASE64-CHAR ) 
                  ( 
                  ( 4BASE64-CHAR ) /
                  ( 3BASE64-CHAR "=" ) /
                  ( 2BASE64-CHAR "=" "=" )
                  )
      BASE64-CHAR = ALPHA / DIGIT / "+" / "/"
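
   A sketch of encoding and decoding a bytes-value using a standard
   Base64 library is shown below.  The function names are hypothetical.
   Empty input is rejected because the BASE64 rule above requires at
   least one group of four characters.

```python
import base64


def encode_bytes(data):
    """Encode a byte string as a bytes-value: "^" followed by Base64."""
    if not data:
        raise ValueError("the BASE64 rule requires at least one group")
    return "^" + base64.b64encode(data).decode("ascii")


def decode_bytes(value):
    """Decode a bytes-value back into a byte string."""
    if not value.startswith("^"):
        raise ValueError("a bytes-value must start with '^'")
    return base64.b64decode(value[1:], validate=True)


encoded = encode_bytes(b"UMF")
```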

      const-value = first-safe-char *( safe-char )

      embedded-value = "(" *(%x00-28 / %x2A-5B / %x5D-FF / 
                          "\)" / "\\" ) ")"     ; "\" & ")" are escaped

   Illustrating the recursiveness of the message format, we have:

      compound-value = struct-value / union-value



      struct-value = "{" struct-body "}" 

      union-value = union-body

      EQUAL = OWS "=" OWS
      COMMA = OWS "," OWS
      

4.4 Marking Message Boundaries

   Before a message is parsed it is necessary to know the boundaries of
   the message.  There are many ways in which this can be done, and the
   method adopted should be specified in the protocol specification.
   However, in the absence of any other way, UMF parsers should take the
   presence of an unmatched closing brace to be the end of message
   marker.  Hence, the definition of a message delimited in this way
   becomes:

      delimited-UMF-text-message = UMF-text-message "}"
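
   Locating the end of a message delimited in this way can be sketched
   as a scan that tracks brace depth.  The function name
   find_message_end is hypothetical.  The sketch assumes that quote
   characters, braces and opening brackets do not appear inside
   unquoted values, and it skips over quoted strings and embedded
   values so that a brace inside them is not counted.

```python
def find_message_end(text):
    """Return the index of the first unmatched '}' in text, or -1."""
    depth = 0
    closer = None  # Character that ends the quoted or embedded value.
    i = 0
    while i < len(text):
        ch = text[i]
        if closer is not None:
            if ch == "\\":
                i += 1  # Skip the escaped character.
            elif ch == closer:
                closer = None
        elif ch in ("'", '"'):
            closer = ch
        elif ch == "(":
            closer = ")"  # Start of an embedded value.
        elif ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                return i  # No opening brace matches this one.
            depth -= 1
        i += 1
    return -1


first = "1 msg = { to = 2, 5 msg = 'bye }' } "
stream = first + "}" + "1 leave }"
end = find_message_end(stream)
```

   Here end is the position of the brace that terminates the first
   message; the text that follows it belongs to the next message.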

4.5 Illustration of Encoded Types

   This section illustrates how the types look once they have been
   encoded according to the syntax above.  The tag of each item has the
   format 'my-XXXX'.  Except in the case of the 'void' example, the XXXX
   part indicates the type that is encoded to the right of the equals
   sign.

      my-void                // Tag only for a void parameter

      my-bool = True

      my-int = 5643

      my-ipv4addr = 10.0.0.1

      my-ipv6addr = 201:123::0

      my-date = 2002-02-28

      my-time = 12:00:00

      my-oid = 1~2~840~113549~2~5

      my-ascii = 'UMF'

      my-unquoted-ascii = UMF

      my-unicode = "UMF"

      my-const = UMF



      my-bytes = ^01AF3C==

      my-embedded = ( my-other-int=5 single-closing-bracket-text='\)' )

      my-struct = { 5434 All time=98787654654 }

      my-union = 5434

      my-union1 = Switch

      my-union2 = Volume = 11

5. Common ABNF Definitions

   The following definitions are common to both the definition syntax
   and the on the wire representation.

      tag = [ "?" ] first-tag-safe-char *( safe-char )

      first-tag-safe-char = %x21 / 
                  ; Not "
                  %x23-26 / 
                  ; Not ' ( )
                  %x2A-2B /
                  ; Not , -
                  %x2E-2F /
                  ; Not 0 1 2 3 4 5 6 7 8 9
                  %x3A-3C / 
                  ; Not =
                  %x3E-5D /
                  ; Not ^
                  %x5F-7A /
                  ; Not {
                  %x7C /
                  ; Not }
                  %x7E-7F
                        ; Visible characters except = , " ' { } ( ) ^ -
                        ; and digits

      first-safe-char = first-tag-safe-char / DIGIT

      safe-char = first-safe-char / DQUOTE / "'" / "{" / "(" / "-" / "^"
                        ; Not = } ) ,

      OWS = [ WS ]                ; Optional white space
      WS = comment / " " / HTAB / CR / LF 
                                  ; HTAB, CR, LF defined in RFC-2234
                                  ; White space may appear between any
                                  ; token and is not limited to where
                                  ; it is explicitly specified




      comment = c-comment / cpp-comment
      c-comment = "/*" <any except */> "*/"
      cpp-comment = "//" *( HTAB / %x20-7F ) ( CR / LF )
                       ; A comment is treated as a single space for the 
                       ; purposes of parsing

6. Why UMF

   The name UMF is pronounced in the same way as 'oomph'.  The Collins
   Paperback English Dictionary (1986) defines oomph as:

      oomph - (umf) n. Inf. 1. enthusiasm, vigour, or energy.  2. sex
      appeal.

   So who wants their code to have UMF?



7. References

   [UMFHOME] http://www.tech-know-ware.com/umf

   [ABNF] D. Crocker and P. Overell, "Augmented BNF for Syntax
         Specifications: ABNF", Internet Engineering Task Force, RFC
         2234, November 1997.

   [XML] "Extensible Markup Language (XML) 1.0 (Second Edition)", W3C
         REC-xml, October 2000.

8. Author's Address

   Pete Cordell
   Tech-Know-Ware Ltd
   P.O. Box 30
   Ipswich, 
   IP5 2WY
   UK
   pete@tech-know-ware.com
   

9. ToDo: 



10. Changes: 

   - Byte array starts with ^.  May end with =

   - Struct may have many untagged parameters at the start that have
   higher than unary cardinality.

   - Added the pluggable keyword to formally mark locations that are
   intended to be externally extendible.

   For Version 3

   - Changed byte-array to bytearray

   For Version 4

   - Changed bytearray to bytes

   - Added support for OIDs