XML Lite
Pete Cordell, Codalogic Ltd
Version 1.3

Recent discussion on the XML-DEV mailing list has identified the rise of JSON as a competitor to XML. There has also been some discussion of what could be done to XML to make it more appealing to developers who might otherwise use JSON.

This page captures my thoughts. I've attempted to outline the principles of what a new version of XML might look like, and then describe my high, medium and low level wishes.

In putting this together I have used Elliotte Rusty Harold's XML 2.0 piece at http://cafe.elharo.com/xml/xml-2-0/, Tim Bray's Extensible Markup Language - SW (XML-SW), and Andrew Welch's Hackable XML.

Contents

1 - Objective
2 - High Priority
   2.1 - Remove the Internal DTD and external DTD References
   2.2 - Discard CDATA Sections
   2.3 - Limit Character Encodings to UTF-8 and UTF-16
3 - Medium Priority
   3.1 - Improve XML namespaces
   3.2 - Allow Truncated End Tags
   3.3 - Allow ]]> in Element Content CharData
   3.4 - Prioritise Namespace Prefix Mappings
   3.5 - Recognise that a File May Contain Multiple XML Documents
4 - Low Priority
   4.1 - & not part of an entity string is treated as &
   4.2 - Allow -- in Comments
   4.3 - Allow Nested Comments
   4.4 - Preserve White Space in Attributes
5 - History

1 - Objective

XML simplified SGML, and by so doing became more popular than SGML. The hope is that by further simplifying XML it can increase its popularity still further.

My goals are:

I've divided the topics into different priorities.

I've also attempted to analyse the impact on existing XML parsers, existing applications and developers. For the analysis I assume that an XML application sits on top of an XML parser that exposes something like DOM, SAX or some other format.

Commenting on this document is probably best done on the XML-DEV mailing list. I will attempt to add links to the XML-DEV archives where it seems applicable. I haven't made this a blog post because the comments on my blog seem to be a spam-fest! It's also a good idea to keep all comments relevant to XMLlite / XMLng / NextML etc. in one place. Some comments can be seen if you follow the Pete's blend for XMLlite thread.

2 - High Priority

2.1 - Remove the Internal DTD and external DTD References

The internal DTD and references to external DTDs, along with the additional entity specification, double or triple the complexity of writing an XML parser. While defining your own entities may be helpful for some users, the cost is too high for the majority of users. Therefore removing the internal DTD and external DTD references significantly simplifies XML.

2.1.1 - Impact on existing XML parsers

No issue. The parser would just never find a DTD. A parser could be upgraded to report an error if a DTD is found in XMLlite mode.
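
As a rough illustration of the "report an error" option, here is a minimal Python sketch built on the standard library's expat binding. The names XmlLiteError and parse_xmllite are hypothetical, invented for this example; a real XMLlite mode might well be structured differently.

```python
import xml.parsers.expat


class XmlLiteError(Exception):
    """Hypothetical error type for input that is XML but not XMLlite."""


def parse_xmllite(text):
    """Parse text with expat, rejecting any DOCTYPE declaration."""
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Called by expat as soon as it sees '<!DOCTYPE ...'.
        raise XmlLiteError("DTDs are not allowed in XMLlite mode")

    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(text, True)
```

A document without a DTD parses as before; one that starts with a DOCTYPE declaration raises XmlLiteError before any element is reported.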

2.1.2 - Benefit for new XML parsers

The DTD is a relatively complex part of an XML parser. It also implies entity handling and default attributes. Removing the internal DTD and external DTD references from XML could remove, say, 60% of a parser's complexity.

2.1.3 - Impact on applications

Assuming the XML parser only exposed to the application data that already had entities expanded, and default attributes inserted, then an application would be unaware of the change.

2.1.4 - Benefit for developers

XMLlite would be easier to learn and more readily understood.

2.1.5 - Notes

Following David Carlisle's post I've also suggested removing references to external DTDs.

2.2 - Discard CDATA Sections

CDATA sections give the illusion that you can enter data without escaping it. This is actually not true, as you just end up having to look out for a different sequence: the ]]> that ends the section. This lulls novice users into a false sense of security, which is bad. Discarding CDATA sections simplifies XML and removes surprises for the novice user.
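
To make the point concrete, here is a small Python sketch using the standard library's ElementTree. The helper names cdata_naive and cdata_safe are hypothetical. The first wraps text without a second thought; the second shows the sort of special-casing an author still has to do to get arbitrary text through a CDATA section.

```python
import xml.etree.ElementTree as ET


def cdata_naive(text):
    """Wrap text in a CDATA section, assuming nothing needs escaping."""
    return "<![CDATA[" + text + "]]>"


def cdata_safe(text):
    """Wrap text in CDATA, splitting any ']]>' across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


code = "if (a[b[0]]>c) {}"   # ordinary-looking text that contains ']]>'

# The safe form survives a round trip through the parser.
element = ET.fromstring("<v>" + cdata_safe(code) + "</v>")
round_tripped = element.text

# The naive form ends the section early and the document is rejected.
try:
    ET.fromstring("<v>" + cdata_naive(code) + "</v>")
    naive_ok = True
except ET.ParseError:
    naive_ok = False
```

So "unescaped" data still needs escaping; it just needs a less familiar kind.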

2.2.1 - Impact on existing XML parsers

No issue. The parser would just never find a CDATA section. A parser could be upgraded to report an error if a CDATA section is found in XMLlite mode.

2.2.2 - Benefit for new XML parsers

CDATA Sections add complexity for little gain. Removing the CDATA Sections will remove an element of complexity in implementing a parser.

2.2.3 - Impact on applications

Assuming the parser presents text data to an application in a way independent of whether the text is in a CDATA section or not, then this change should have no impact on an application.

2.2.4 - Benefit for developers

CDATA sections give the illusion that you can add unescaped data to your applications. However, this isn't actually true. By removing this illusion, you remove the potential for random bugs occurring in code, perhaps long after the code has gone into production.

2.3 - Limit Character Encodings to UTF-8 and UTF-16

In the 10 years since XML's birth, encodings such as UTF-8 and UTF-16 have risen in prominence. If someone needs to have string data presented to them in an alternative encoding, let them do the work rather than forcing the burden on everyone.

2.3.1 - Impact on existing XML parsers

No issue. Existing parsers should already support these encodings.

2.3.2 - Benefit for new XML parsers

Parsers will be able to have knowledge of UTF-x built in, requiring less dependence on external libraries and making the code more self-contained and thus more portable.
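
With only two families of encoding to worry about, a parser's encoding detection might shrink to something like the following Python sketch. The function name decode_xmllite is hypothetical, and the sketch assumes that UTF-16 input always starts with a byte order mark and that anything without one is UTF-8.

```python
import codecs


def decode_xmllite(data):
    """Decode bytes that are assumed to be UTF-8 or UTF-16 and nothing else."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")      # the codec consumes the BOM
    if data.startswith(codecs.BOM_UTF8):
        return data.decode("utf-8-sig")   # drops the UTF-8 BOM
    return data.decode("utf-8")           # no BOM: assume UTF-8
```

There is no encoding declaration to read and no table of legacy code pages to ship with the parser.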

2.3.3 - Impact on applications

No issue. The parsers 'normalize' the received data to a common character encoding already.

2.3.4 - Benefit for developers

Developers will not have to worry about encountering XML files encoded in unknown formats.

3 - Medium Priority

3.1 - Improve XML namespaces

This is medium priority mainly because it seems hard to do and the implications are quite large. If this were not the case it would likely be high priority.

Limit namespaces to the domain name form, such as com.mycompany.myschema. The following forms would be equivalent:

    <ns:foo xmlns:ns='com.mycompany.myschema'>
    </ns:foo>
Or:
    <com.mycompany.myschema:foo>
    </com.mycompany.myschema:foo>
If a name's prefix matches a declared namespace prefix, then the prefix is replaced by the expanded namespace name; otherwise the name's prefix is considered to be the name's namespace.

Further, if the name's prefix does not match a namespace prefix, the name's prefix also becomes the default namespace. Thus:

    <com.mycompany.myschema:foo>
        <bar/>
    </com.mycompany.myschema:foo>
is more accurately equivalent to:
    <foo xmlns='com.mycompany.myschema'>
        <bar/>
    </foo>

Libraries such as DOM and SAX should break names such as 'com.mycompany.myschema:foo' into namespace='com.mycompany.myschema', name='foo'.

XPath based tools can use current approaches for accessing items within namespaces, or use the full name such as ./com.mycompany.myschema:foo/@bar. As above, names whose prefix does not match a known namespace prefix also set the default namespace, so element names without prefixes are in the same namespace as element names used earlier in the XPath expression. To explicitly specify a name without a prefix, use ./:foo; i.e. there is a ':' with no leading namespace.
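
The name resolution rules above could be sketched in Python along the following lines. The function resolve_name and its return shape are hypothetical. The sketch also assumes that the ':foo' form means "no namespace", which is my reading of the rule rather than something pinned down above, and it ignores how the default namespace is scoped to an element's children.

```python
def resolve_name(qname, prefixes, default_ns):
    """Return (namespace, local_name, default_ns_for_children).

    prefixes maps declared prefixes to namespace names; default_ns is the
    default namespace in scope. None stands for "no namespace".
    """
    if ":" not in qname:
        return default_ns, qname, default_ns        # unprefixed: use default
    prefix, local = qname.split(":", 1)
    if prefix == "":
        return None, local, default_ns              # ':foo' form
    if prefix in prefixes:
        return prefixes[prefix], local, default_ns  # declared prefix
    return prefix, local, prefix                    # prefix is the namespace
```

For example, resolving 'com.mycompany.myschema:foo' with no declared prefixes gives the namespace 'com.mycompany.myschema', the name 'foo', and the same namespace as the default for names that follow.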

3.1.1 - Impact on existing XML parsers

Without modification, parsers seeing a ':' in a name will expect to be able to find a mapping for a namespace prefix. This will have to change. Depending on the structure of the parser code, it is hoped that such changes would be fairly localized and would be easy to make.

3.1.2 - Benefit for new XML parsers

Since the parser has to support a minor variation on existing parser behaviour it's not likely that this simplifies a new parser.

3.1.3 - Impact on applications

Assuming the parser presents names to the application as a 'namespace', 'name' pair, there should be no impact on an application.

3.1.4 - Benefit for developers

The main benefit of this is in simplifying things like XPath and out of context QNames. This change should greatly simplify this task for novice developers.

3.1.5 - Notes

This proposal follows various synaptic firings following Michael Kay's proposal.

3.2 - Allow Truncated End Tags

A common complaint about XML is its verbosity; particularly with regard to end tags. This is especially the case when you have small values such as integers in element bodies. Another issue is that it makes the XML harder to read because all the tags get in the way; leading some to suggest that, contrary to XML's goal, XML is not human readable.

Allowing an end tag to be shortened to </> would solve this. An XML document author could decide whether to use a normal end tag or a truncated end tag. When encountering a truncated end tag, libraries such as DOM and SAX will insert the omitted name.

This would allow XML such as:

<MyElement>
    <MyString>This is a string</>
    <MyInt>1234</>
</MyElement>

3.2.1 - Impact on existing XML parsers

Parsers would have to be modified to copy the start tag name to the end tag name when an end tag name is absent. Conceptually this might look something like:
	endTag = getToken();
	// Start of additional code
	if( endTag == "" )
		endTag = startTag;
	// End of additional code
	if( endTag != startTag )
		reportError();
	else
		sendEvent( END_TAG_EVENT, endTag );
	...
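
For something that can actually be run, the same idea can be sketched in Python as a pre-processing pass that rewrites </> into full end tags using a stack of open element names. The function expand_end_tags is hypothetical, and the regular expression is a deliberate simplification: it assumes no comments, processing instructions or '>' characters inside attribute values.

```python
import re

# Matches a start tag, end tag or empty-element tag (simplified).
TAG = re.compile(r"<(/?)([A-Za-z_][\w.:-]*)?([^<>]*)>")


def expand_end_tags(text):
    """Replace truncated end tags '</>' with the matching full end tag."""
    stack = []   # names of the elements currently open

    def fix(match):
        closing, name, rest = match.group(1), match.group(2) or "", match.group(3)
        if closing:
            if not stack:
                raise ValueError("end tag without a start tag")
            start = stack.pop()
            if name == "":
                name = start             # truncated end tag: copy the name
            if name != start:
                raise ValueError("mismatched end tag: " + name)
            return "</" + name + ">"
        if not rest.endswith("/"):       # '<bar/>' opens nothing
            stack.append(name)
        return match.group(0)

    return TAG.sub(fix, text)
```

Run over the example above, this turns <MyInt>1234</> back into <MyInt>1234</MyInt>, after which any existing parser can take over.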

3.2.2 - Benefit for new XML parsers

Minimal impact on new parsers. The implications would be much the same as for modifying existing parsers.

3.2.3 - Impact on applications

No issue. The parsers could fix-up the end tags before the applications are presented with the data.

3.2.4 - Benefit for developers

XML files are shorter, and data in short values is easier to see.

3.3 - Allow ]]> in Element Content CharData

Presumably because of CDATA Sections, XML does not allow the character sequence ]]> to appear in element contents. Even in XML 1.0 as it is today this is an unnecessary restriction as it does not render a file any more parsable. Therefore it is suggested to remove this restriction to avoid causing surprises for novice developers.

I have classified this as Medium Priority because it would break existing XML parsers. I'm also assuming that the character sequence ]]> is not that common, and so the restriction should not be an issue that often. It's also possible to work around the problem by suggesting that '>' should always be represented as '&gt;'.
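
Both the restriction and the workaround can be seen with Python's standard library, as in the sketch below. The helper is_well_formed is hypothetical; escape is the real function from xml.sax.saxutils, which writes '>' as '&gt;' (along with escaping '&' and '<').

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(document):
    """Return True if ElementTree's parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


raw = "a[b[0]]>c"
rejected = is_well_formed("<v>" + raw + "</v>")          # ']]>' in content
accepted = is_well_formed("<v>" + escape(raw) + "</v>")  # '>' written as '&gt;'
```

Here rejected comes out False and accepted comes out True, even though there is no CDATA section anywhere in sight.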

3.3.1 - Impact on existing XML parsers

This would break existing parsers, but it ought to be quite easy to fix that.

3.3.2 - Benefit for new XML parsers

Slightly simplifies a new parser.

3.3.3 - Impact on applications

No impact.

3.3.4 - Benefit for developers

A surprising feature of XML is removed, resulting in a cleaner, more intelligible design. Not fixing it provides ammunition to XML's detractors!

3.3.5 - Notes

Added following David Carlisle's post.

3.4 - Prioritise Namespace Prefix Mappings

Namespace prefix mappings are a special type of attribute. Currently they appear in no particular order in an element's start tag. This requires a parser to look through the entire set of attributes in a start tag before it can be sure it has learnt all namespace mappings.

This look-ahead complicates parsers and requires more temporary storage. Requiring that all namespace mappings occur before any other attributes would simplify the parser, and make it more efficient.

IDEs could be used to ensure that namespace mapping attributes appear in the right place.

3.4.1 - Impact on existing XML parsers

No issue. Existing parsers can already accept namespace declarations in any order and so having them all at the start of the start tag is no issue.

3.4.2 - Benefit for new XML parsers

Parsers will not have to look ahead for all the attributes in a start tag. This should make streaming parsers more efficient.
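
A sketch of what the single pass might look like in Python follows. The function resolve_attributes is hypothetical; it takes the attributes of one start tag in document order and resolves each name as it arrives, which is only safe because the declarations are required to come first. An undeclared prefix is left to fail with a KeyError to keep the sketch short.

```python
def resolve_attributes(attrs, prefixes):
    """Resolve attribute names in one pass over (name, value) pairs.

    Raises ValueError if a namespace declaration follows an ordinary
    attribute.
    """
    scope = dict(prefixes)   # copy so the caller's map is untouched
    resolved = []
    for name, value in attrs:
        if name == "xmlns" or name.startswith("xmlns:"):
            if resolved:
                raise ValueError("namespace declaration after attribute: " + name)
            scope[name[6:]] = value          # 'xmlns' itself maps the '' prefix
        else:
            prefix, sep, local = name.rpartition(":")
            namespace = scope[prefix] if sep else None
            resolved.append((namespace, local, value))
    return resolved
```

No attribute has to be held back waiting for a declaration that might turn up later in the tag.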

3.4.3 - Impact on applications

No issue.

3.4.4 - Benefit for developers

Developers will have to know to put the namespace declarations first. This is fairly intuitive for most developers, who are used to things being declared before they can be referenced. An IDE can also ensure that this is correct.

3.5 - Recognise that a File May Contain Multiple XML Documents

If XML is to be everywhere, then it's attractive for XML to be used to write log files which are continually appended to. If it were recognised that a file didn't necessarily imply a single XML document, but could be a concatenation of multiple XML documents, this scenario would be easier.

3.5.1 - Impact on existing XML parsers

Parsers would have to be modified so that they would not report an error after reading the closing tag of the first root element.
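
Until parsers do this themselves, a common workaround is to wrap the file's contents in a made-up root element before parsing. The Python sketch below does this with ElementTree; parse_concatenated and the wrapper element name are hypothetical, and the trick assumes that none of the concatenated documents carries an XML declaration or DOCTYPE.

```python
import xml.etree.ElementTree as ET


def parse_concatenated(text):
    """Return the root element of each document concatenated in text."""
    wrapper = ET.fromstring("<wrapper>" + text + "</wrapper>")
    return list(wrapper)


log = "<entry id='1'/>\n<entry id='2'>ok</entry>\n"
entries = parse_concatenated(log)
```

Parsing the same log text directly fails at the second entry, which is exactly the error this section proposes to drop.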

3.5.2 - Benefit for new XML parsers

More flexibility.

3.5.3 - Impact on applications

Only applications that require this functionality would be affected; there is no impact on applications that do not.

3.5.4 - Benefit for developers

More flexibility.

4 - Low Priority

4.1 - & not part of an entity string is treated as &

One of the attractions of CDATA sections is not having to escape & and < characters. But we've discarded CDATA sections!

Instead any '&' character that is not followed by 'amp;', 'apos;', 'quot;', 'lt;' or 'gt;' should be treated as a regular '&' character and not cause a well-formedness error.
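
One way to picture the rule is as a pre-processing step in front of an existing parser, as in this Python sketch. The function escape_bare_ampersands is hypothetical. It follows the list above literally, so numeric character references such as '&#65;', which the list does not mention, would also be treated as a bare '&'; the pattern would need extending if those are to be kept.

```python
import re
import xml.etree.ElementTree as ET

# '&' not followed by one of the five listed entity names and a ';'.
BARE_AMP = re.compile(r"&(?!(?:amp|apos|quot|lt|gt);)")


def escape_bare_ampersands(text):
    """Rewrite every bare '&' as '&amp;' so a strict parser accepts it."""
    return BARE_AMP.sub("&amp;", text)


source = "<v>AT&T &amp; R&D &lt; fun</v>"
fixed = escape_bare_ampersands(source)
value = ET.fromstring(fixed).text
```

The existing entity references are left alone, and the application sees the text 'AT&T & R&D < fun'.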

It would be nice to have some magic to handle the '<' character as well. However, there is a broader set of characters that can appear after an '<' character, and so it is not so easy to do. As outside of programming '<' is likely to appear less than '&', this is not a big issue. After all, it would be churlish for users of '<' to deny users of '&' a shortcut just because they themselves don't have a shortcut!

4.1.1 - Impact on existing XML parsers

Currently a parser will report an error when this occurs. However, it is thought to be a minor code change to insert the unmatched entity name into the output text when no match is found.

4.1.2 - Benefit for new XML parsers

For a new parser this boils down to inserting the unknown entity name into the output text versus inserting the unknown entity name into an error message. Both are likely to be easy to do. Not having an error condition for this scenario may simplify the parser as there's no need to worry about unwinding the code if an error is encountered.

4.1.3 - Impact on applications

No impact. The parser will have addressed this before the application sees the data.

4.1.4 - Benefit for developers

It will be much easier for developers to create XML files, allowing them to work more efficiently and naturally, and the XML will be much easier to read.

4.2 - Allow -- in Comments

In code, it's common to have block comments that include:
//--------------------------------------
But if a novice developer does the equivalent of this in XML they will get an error for no particularly good reason. This could lead someone new to XML to conclude that XML is harder than they thought and then move on to something else.

The best thing to do is remove this surprise and allow -- (two or more dashes) in an XML comment.

4.2.1 - Impact on existing XML parsers

Current parsers will need to be modified to remove some error detection code. The code for detecting the end of a comment may also be affected, but hopefully in a trivial way.

4.2.2 - Benefit for new XML parsers

Slightly simpler coding and error handling within a comment.

4.2.3 - Impact on applications

It's not expected that an application would be too bothered whether a comment contained concatenated dashes, so in 99% of cases it would be no issue.

4.2.4 - Benefit for developers

Removal of surprises and the ability to format the XML as they want without having to worry about bizarre rules!

4.3 - Allow Nested Comments

There are two uses for comments; documentation and commenting out code. Currently if a section of XML already contains comments, then it is a non-trivial process to comment-out that section of XML. By allowing XML comments to be nested this process would be made much easier, making working with XML files much easier to do.

4.3.1 - Impact on existing XML parsers

Current XML parsers would have to be modified to allow this. Hopefully the changes are small and self-contained.

4.3.2 - Benefit for new XML parsers

It might be slightly more complex for a parser to handle nested comments versus not allowing nested comments. Hopefully this amounts to little more than maintaining a nestingDepth variable and shouldn't be a major difficulty.
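
The nesting depth idea can be sketched in Python as below. The function skip_nested_comment is hypothetical and returns the index just past the comment's final '-->'; it is meant as an illustration of the counter, not as a complete comment parser.

```python
def skip_nested_comment(text, start):
    """Return the index just past the comment that begins at start."""
    if not text.startswith("<!--", start):
        raise ValueError("not at the start of a comment")
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("<!--", i):
            depth += 1               # entering a (possibly nested) comment
            i += 4
        elif text.startswith("-->", i):
            depth -= 1               # leaving one level of comment
            i += 3
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated comment")
```

Commenting out a block that already contains comments then needs nothing more than a '<!--' before it and a '-->' after it.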

4.3.3 - Impact on applications

It's not expected that an application would be too bothered about the contents of a comment, so in 99% of cases it would be no issue.

4.3.4 - Benefit for developers

It is much easier for a developer to comment out sections of XML, allowing them to work more efficiently.

4.4 - Preserve White Space in Attributes

The white space handling of attribute values is different to that of element values. This can add confusion. It should be left to the application to specify the type of white space handling it wants using methods such as getPreserve(), getReplace() and getCollapse(). The XML layer should not apply collapse-style white space handling to attributes as a matter of course.

Such commonality between element and attribute handling allows code to evolve by changing attributes for elements and vice versa without causing surprising secondary effects.
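
As a sketch of what the three accessors might do, here they are as plain Python functions. The snake_case names are hypothetical stand-ins for getPreserve(), getReplace() and getCollapse(), and the behaviour is modelled on the preserve, replace and collapse white space options of XML Schema, which is an assumption on my part about what the methods would mean.

```python
import re


def get_preserve(value):
    """Return the value exactly as it appeared in the document."""
    return value


def get_replace(value):
    """Replace each tab, line feed and carriage return with a space."""
    return re.sub(r"[\t\n\r]", " ", value)


def get_collapse(value):
    """Replace, then squeeze runs of spaces and trim both ends."""
    return re.sub(r" +", " ", get_replace(value)).strip(" ")
```

The same three functions could serve element text and attribute values alike, so moving a value from one to the other changes nothing about how its white space is treated.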

4.4.1 - Impact on existing XML parsers

Existing XML parsers would have to be updated.

4.4.2 - Benefit for new XML parsers

New parsers would have to allow an application to access attributes with 3 different forms of white space handling rather than one. Hopefully this can be implemented via reuse of code used to access element text with different white space handling.

4.4.3 - Impact on applications

Applications would have to be aware that white space may not be normalized in attribute values and so would have to normalize attribute values themselves if that's what they needed.

4.4.4 - Benefit for developers

Reduction of surprises and the need to remember 'silly' rules. Data that didn't initially require white space preservation in version 1 could be changed to being able to interpret preserved white space in a later version.

5 - History

1.1
Added automatic setting of default namespace for 'explicit' namespace form.
1.3
Added 'Allow ]]> in Element Content CharData' section.
---END---