README - Markup, a validating XML parser for O'Caml

Abstract

Markup is a validating parser for XML 1.0, written entirely from scratch in Objective Caml.

Download

The sources of the most recent version are available.

User's Manual

The manual is included in the distribution both as a PostScript document and as a set of HTML files. An online version can be found here.

Author, Copying

Markup was written by Gerd Stolpmann; it is currently maintained by some developers of Structio. You may copy it as you like, and you may use it even for commercial purposes as long as the license conditions are respected; see the file LICENSE that comes with the distribution. It is an MIT-style license.

Description

Markup is a validating XML parser for O'Caml. It fulfills almost all the requirements that the W3C places on an XML 1.0 parser. For a more complete implementation with an improved API (although bigger and with more dependencies), see PXP, written and maintained by the author of Markup.

Beginning with the positive properties of this package, the current release can parse all valid XML documents and DTDs as long as they use only ISO-8859-1 characters, and it rejects most invalid or non-well-formed documents. This has been tested with lots of test documents, including all of James Clark's test records[1].

Once the document is parsed, it can be accessed through a class interface. This interface is still under development and subject to future changes. The interface allows arbitrary access, including transformations. It has a so-called "extension", i.e. every element of the document has a main object and an extension. Although there is a default, the extension is intended as the changeable part of the element class, i.e. you can provide your own extension and add further properties to the elements.
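
The split between main object and extension can be pictured with the following minimal sketch. It is not Markup's actual interface: the class names (default_ext, counter_ext, element) and their methods are assumptions made up for illustration. It only shows the idea that the main object stays fixed while the extension is supplied by the user.

```ocaml
(* Hypothetical sketch: these class and method names are illustrative
   assumptions, not Markup's actual interface. *)

(* A default extension that adds nothing to the element. *)
class default_ext = object end

(* A user-supplied extension that adds a visit counter. *)
class counter_ext =
  object
    val mutable visits = 0
    method visit = visits <- visits + 1
    method visits = visits
  end

(* The main object: it holds what every element has (here only a name)
   and gives access to whatever extension it was built with. *)
class ['ext] element (name : string) (ext : 'ext) =
  object
    method name = name
    method extension = ext
  end

(* One element with the default extension, one with the custom one. *)
let plain = new element "para" (new default_ext)
let counted = new element "title" (new counter_ext)

let () =
  counted#extension#visit;
  Printf.printf "%s visited %d time(s)\n"
    counted#name counted#extension#visits
```

In this sketch the extra property (the counter) lives entirely in the extension, so the main class does not need to change when a new extension is written.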

Note that the class interface does not comply with the DOM standard. Gerd Stolpmann thinks that DOM is not applicable to O'Caml. Of course, the interface of Markup serves a similar purpose.

Another design feature is that you can configure the parser so that it uses a different class for every element type. As classes are not first-class values in O'Caml, this is done by an exemplar/instance scheme. A hashtable stores, for every element type, the exemplar, which is simply an object of the class to be used. When the parser reads an element, it looks up the exemplar by the type of the element and creates a clone of it. This way, the processing code for the elements can be grouped into classes, which seems very natural.
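
The exemplar/instance scheme described above can be sketched roughly as follows. This is an illustration under assumptions, not the parser's real code: element_node, paragraph_node, exemplars, default_exemplar and instantiate are hypothetical names, and the clone is made with Oo.copy from the standard library, which may differ from how Markup clones its exemplars.

```ocaml
(* Hypothetical sketch: none of these names are taken from Markup. *)

(* Generic node class used when no special class is registered. *)
class element_node =
  object
    val mutable name = ""
    method name = name
    method set_name n = name <- n
    method describe = "generic element <" ^ name ^ ">"
  end

(* A class holding the processing code for one element type. *)
class paragraph_node =
  object
    inherit element_node
    method! describe = "paragraph <" ^ name ^ ">"
  end

(* The table maps an element type to its exemplar object. *)
let exemplars : (string, element_node) Hashtbl.t = Hashtbl.create 16
let default_exemplar = new element_node
let () = Hashtbl.add exemplars "p" (new paragraph_node)

(* Look up the exemplar for an element type and clone it.
   Oo.copy returns a fresh object of the same class, so the
   exemplar itself is never modified. *)
let instantiate (elt_type : string) : element_node =
  let exemplar =
    match Hashtbl.find_opt exemplars elt_type with
    | Some e -> e
    | None -> default_exemplar
  in
  let node = Oo.copy exemplar in
  node#set_name elt_type;
  node

let () =
  List.iter
    (fun t -> print_endline (instantiate t)#describe)
    ["p"; "div"]
```

Because the table stores objects rather than classes, registering a new element class at run time is just one more Hashtbl.add with an instance of that class.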

Code examples

This distribution contains these examples:

Restrictions

The following list of restrictions may seem rather long, but it is short compared with the much longer list of XML features that have been implemented. I enumerate the restrictions here to make clear that Markup is currently under development; although it is stable enough to be usable, it should be used with some care.

As the parser is validating, it should be able to report all violations of the XML rules. This is currently not the case. Mainly the following restrictions apply:

Furthermore, the following restrictions apply that are not violations of the standard:

Recent Changes


[1]
The parser is able to read all positive test records with the exception of "valid/sa/051.xml" and "valid/sa/063.xml" which contain non-ISO-8859-1 characters within names. The parser rejects all negative test records with the exception of "not-wf/sa/141.xml" which uses non-ISO-8859-1 characters in a way the parser cannot deal with.
[2]
This particular document is an example of this DTD!
[3]
In the DOM standard, comments are accessible. I think this is a bad feature, as comments should not be regarded as text to be processed.