Markup is a validating parser for XML 1.0, written entirely from scratch in Objective Caml.
The sources of the most recent version are available.
The manual is included in the distribution both as a PostScript document and as a set of HTML files. An online version can be found here.
Markup was written by Gerd Stolpmann; it is currently maintained by some developers of Structio. You may copy it as you like, and you may use it even for commercial purposes as long as the license conditions are respected; see the file LICENSE that comes with the distribution. It is an MIT-style license.
Markup is a validating XML parser for O'Caml. It fulfills almost all the requirements that the W3C demands of an XML 1.0 parser. For a more complete implementation with an improved API (although bigger and with more dependencies), see PXP, which is written and maintained by the author of Markup.
Beginning with the positive properties of this package, the current release can parse all valid XML documents and DTDs as long as they use only ISO-8859-1 characters, and it rejects most invalid or non-well-formed documents. This has been tested with lots of test documents, including all of James Clark's test records[1].
Once the document is parsed, it can be accessed using a class interface. This interface is still under development and subject to future changes. The interface allows arbitrary access including transformations. It has a so-called "extension", i.e. every element of the document has a main object and the extension. Although there is a default, the extension is intended as the changeable part of the element class, i.e. you can provide your own extension and add further properties to the elements.
Note that the class interface does not comply with the DOM standard. Gerd Stolpmann thinks that DOM is not applicable to O'Caml. Of course, the interface of Markup serves a similar purpose.
Another design feature is that you can configure the parser so that it uses a different class for every element type. As classes are not first-class values in O'Caml, this is done by an exemplar/instance scheme. A hash table stores, for every element type, the exemplar, which is simply an object of the class to be used. When the parser reads an element, it looks up the exemplar by the type of the element and creates a clone of it. This way, the processing code for the elements can be grouped into classes, which seems very natural.
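The exemplar scheme described above can be illustrated with a minimal sketch. It is written for a current OCaml and uses only the standard library; the names element_node, title_node, register and create_element are hypothetical and are not taken from Markup's actual interface.

```ocaml
(* Hypothetical sketch of an exemplar/clone scheme; none of these names
   come from Markup's real interface. *)

(* A minimal element node: it knows its element type and holds some data. *)
class element_node (name : string) =
  object
    val mutable data = ""
    method name = name
    method set_data s = data <- s
    method data = data
    method describe = Printf.sprintf "<%s> %s" name data
  end

(* A specialised class for one element type overrides the processing code. *)
class title_node =
  object
    inherit element_node "title"
    method! describe = "TITLE: " ^ String.uppercase_ascii data
  end

(* The table maps an element type to its exemplar object. *)
let exemplars : (string, element_node) Hashtbl.t = Hashtbl.create 16

let register name exemplar = Hashtbl.replace exemplars name exemplar

(* Look up the exemplar for an element type and return a fresh clone;
   fall back to a generic node if no exemplar is registered. *)
let create_element name =
  match Hashtbl.find_opt exemplars name with
  | Some exemplar -> Oo.copy exemplar
  | None -> new element_node name

let () =
  register "title" (new title_node :> element_node);
  let t = create_element "title" in
  t#set_data "hello";
  let p = create_element "para" in
  p#set_data "hello";
  print_endline t#describe;
  print_endline p#describe
```

Because Oo.copy produces an independent object of the exemplar's class, each parsed element gets its own state while sharing the methods chosen for its element type.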
This distribution contains these examples:
validate: simply parses a document and prints all error messages
readme: Defines a DTD for simple "README"-like documents, and offers conversion to HTML and text files[2].
xmlforms (broken): This is already a sophisticated application that uses XML as style sheet language and data storage format. It shows how a Tk user interface can be configured by an XML style, and how data records can be stored using XML.
The following list of restrictions may seem rather long, but it is short compared with the much longer list of XML features that have been implemented. I enumerate the restrictions here to make clear that Markup is currently under development, and although it is stable enough to be usable, it should be used with some care.
As the parser is validating, it should be able to report all violations of the XML rules. This is currently not the case. Mainly the following restrictions apply:
The parser can only handle ISO-8859-1 characters. It can read files that are encoded as "UTF-8", "UTF-16", and "ISO-8859-1", but only the characters 0 to 255 are understood. This restriction will NOT be dropped in the near future, as Unicode support is a matter of the core O'Caml language.
The "standalone=yes" attribute is not checked.
The uniqueness of IDs and the requirement that IDREFs reference only existing IDs are checked only in a deferred way, i.e. once the application uses these features.
It is not checked whether the content models are deterministic.
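To make the deferred ID/IDREF check mentioned in the list above more concrete, here is a minimal sketch of such a check over a flat list of attribute values. The type attribute_use and the function check_ids are hypothetical and only illustrate the idea; Markup's own mechanism may work differently.

```ocaml
(* Hypothetical sketch; the types and names below are illustrative only. *)
type attribute_use =
  | Id of string     (* value of an attribute declared as ID *)
  | Idref of string  (* value of an attribute declared as IDREF *)

(* Return the problems found: duplicate IDs first, then dangling IDREFs.
   Two passes are used so that an IDREF may precede the ID it refers to. *)
let check_ids (uses : attribute_use list) : string list =
  let seen = Hashtbl.create 16 in
  let duplicates =
    List.filter_map
      (function
        | Id v when Hashtbl.mem seen v -> Some ("duplicate ID: " ^ v)
        | Id v -> Hashtbl.add seen v (); None
        | Idref _ -> None)
      uses
  in
  let dangling =
    List.filter_map
      (function
        | Idref v when not (Hashtbl.mem seen v) -> Some ("unknown IDREF: " ^ v)
        | _ -> None)
      uses
  in
  duplicates @ dangling

let () =
  let uses = [ Id "intro"; Idref "intro"; Idref "missing"; Id "intro" ] in
  (* Nothing is checked until the application forces the lazy value. *)
  let problems = lazy (check_ids uses) in
  List.iter print_endline (Lazy.force problems)
```

Wrapping the check in a lazy value mirrors the deferred behaviour: documents that never use the ID features do not pay for the check.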
Furthermore, the following restrictions apply that are not violations of the standard:
Error messages are sometimes confusing (but they are getting better from version to version).
The exact locations of processing instructions are not stored in the document object. It is only known which element is the direct parent of the processing instruction. (NEW since 0.2.9: It is possible to configure the parser such that imaginary nodes are automatically added for PIs; as these nodes can be located exactly this is also true for the PIs in them - see option "processing_instructions_inline".)
XML comments are always dropped[3].
The attributes "xml:space" and "xml:lang" receive no special treatment. (The application can do this.)
The built-in support for SYSTEM and PUBLIC identifiers is limited to local file access. There is no support for catalogs. The parser offers a hook to add missing features (by writing new or inheriting from existing resolver classes).
It is currently not possible to check for interoperability with SGML.
There is currently no "early parser hook", i.e. a possibility to throw away parts of the document that are not needed directly after they have been read. (Perhaps this is possible by deriving the document class.)
Changed in 0.2.12 (vtamara@informatik.uni-kl.de, changes released to the public domain):
Works with Ocaml 3.07 (also with previous versions)
Error messages now of the form 'file:line:column: message'
Fixed bug in cur_pos (introduced in 0.2.11) in the case of references to external entities.
Added option -D to validate. It allows a search path to be specified.
Added configuration script
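As an illustration of the 'file:line:column: message' layout introduced in 0.2.12, the following sketch formats a message in that shape. The helper format_error and the sample values are hypothetical; only the layout itself comes from the changelog entry.

```ocaml
(* Hypothetical helper; only the 'file:line:column: message' layout is
   taken from the changelog entry above. *)
let format_error ~file ~line ~column message =
  Printf.sprintf "%s:%d:%d: %s" file line column message

let () =
  prerr_endline
    (format_error ~file:"doc.xml" ~line:12 ~column:5 "undeclared element")
```

This is the same layout that many compilers use, so editors that understand compiler output can usually jump to the reported position.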
Changed in 0.2.11 (vtamara@informatik.uni-kl.de, changes in the public domain):
New method quick_set_data
Source information (file name and line) is attached to each element and each data node; it can be read with the method pos_file
The parser accepts a special comment that changes the recorded source of an element internally: <!--# LINE FILE -->
Search for files (e.g. DTDs) in several paths. The paths can be specified in the reference variable default_dirs (declared in markup_reader.mli).
New trivial functions in markup_misc.ml and open_gen in markup_yacc
More portability: Compilation doesn't depend on ocamlfind, makefiles are more portable (work with gnumake and bsdmake), shell scripts in rtests are more portable (work with bash and pdksh).
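The search over several paths introduced in 0.2.11 can be sketched as follows. The variable search_dirs is a hypothetical stand-in for default_dirs, whose exact type is not shown here, and find_in_dirs is an illustrative helper, not a function of Markup.

```ocaml
(* Hypothetical stand-in for a list of search directories; the real
   variable default_dirs lives in markup_reader.mli and may differ. *)
let search_dirs : string list ref = ref [ "." ]

(* Return the first existing candidate, trying each directory in order.
   Absolute names are checked as they are. The optional [exists]
   predicate defaults to a real file system check. *)
let find_in_dirs ?(exists = Sys.file_exists) name =
  if not (Filename.is_relative name) then
    if exists name then Some name else None
  else
    List.find_map
      (fun dir ->
        let candidate = Filename.concat dir name in
        if exists candidate then Some candidate else None)
      !search_dirs

let () =
  search_dirs := [ "."; "dtd"; "/usr/local/share/xml" ];
  match find_in_dirs "readme.dtd" with
  | Some path -> print_endline ("found " ^ path)
  | None -> print_endline "readme.dtd not found"
```

The order of the list determines which file wins when the same name exists in several directories.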
Changed in 0.2.10:
Bugfix: in the "allow_undeclared_attributes" feature.
Bugfix: in the methods write_compact_as_latin1.
Improvement: The code produced by the codewriter module now compiles faster and with less memory usage.
Changed in 0.2.9:
New: The module Markup_codewriter generates for a given XML tree O'Caml code that creates the same XML tree. This is useful for applications which use large, constant XML trees.
New: Documents and DTDs have a method write_compact_as_latin1 that writes an XML tree to a buffer or to a channel. (But it is not a pretty printer...)
Enhancement: If a DTD contains the processing instruction
<?xml:allow_undeclared_attributes x?>
where "x" is the name of an already declared element, instances of this element type may have attributes that have not been declared.
New function Markup_types.string_of_exn that converts an exception from Markup into a readable string.
Change: The module Markup_reader contains all resolvers. The resolver API is now stable.
New parser modes processing_instructions_inline and virtual_root that help locating processing instructions exactly (if needed).
Many bugs regarding CRLF handling have been fixed.
The distributed tarball now contains the regression test suite.
The manual has been extended (but it is still incomplete and still behind the code).
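The idea behind the code generation introduced in 0.2.9 (emit O'Caml code that rebuilds a given XML tree) can be shown on a toy tree type. The type tree and the function code_of_tree below are hypothetical and much simpler than Markup_codewriter; they only demonstrate the technique.

```ocaml
(* Hypothetical tree type and generator; this is not Markup_codewriter. *)
type tree =
  | Element of string * (string * string) list * tree list
  | Data of string

(* Emit an OCaml expression that rebuilds the given tree. The %S
   directive prints a string as a valid, escaped OCaml literal. *)
let rec code_of_tree = function
  | Data s -> Printf.sprintf "Data %S" s
  | Element (name, atts, children) ->
      let att (k, v) = Printf.sprintf "(%S, %S)" k v in
      Printf.sprintf "Element (%S, [%s], [%s])" name
        (String.concat "; " (List.map att atts))
        (String.concat "; " (List.map code_of_tree children))

let () =
  let doc = Element ("p", [ ("id", "x1") ], [ Data "say \"hi\"" ]) in
  print_endline ("let tree = " ^ code_of_tree doc)
```

Compiling the emitted expression into the application avoids parsing a large, constant document at run time.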
Changed in 0.2.8:
A bit more documentation (Markup_yacc).
Bugfix: In previous versions, the second attempt to refer to an entity caused a Bad_character_stream exception. The reason was improper re-initialization of the resolver object.
Changed in 0.2.7:
Added some methods in Markup_document.
Bugfix: in method orphaned_clone
Changed in 0.2.6:
Enhancement: The config parameter has a new component "errors_with_line_numbers". If "true", error exceptions come with line numbers (the default; and the only option in the previous versions); if "false" the line numbers are left out (only character positions). The parser is 10 to 20 percent faster if the lines are not tracked.
Enhancement: If a DTD contains the processing instruction
<?xml:allow_undeclared_elements_and_notations?>
elements and notations may be undeclared. However, the elements for which declarations exist are still validated. The main effect is that the keyword ANY in element declarations means that undeclared elements are also permitted at this location.
Bugfix in method "set_nodes" of class Markup_document.node_impl.
Changed in 0.2.5:
If the XML source is a string (i.e. Latin1 some_string is passed to the parser functions as source), resolving did not work properly in previous releases. This is now fixed.
Changed in 0.2.4:
A problem with DTDs that do not specify the name of the root element was fixed. As a result, the "xmlforms" application works again. Again thanks to Haruo.
According to the XML specification, parameter entities must not be referenced within the internal subset if the referenced text is not a complete declaration itself. This is checked, but the check was too strict; the rule was enforced even in external entities referenced from the internal subset. This has been corrected; in external entities it is now possible to use parameter entities in an unrestricted way.
Changed in 0.2.3:
A fix for a problem when installing Markup on Solaris. Haruo detected the problem.
Changed in 0.2.2:
A single bugfix: The parser did not reject documents where the root element was not the element declared as root element. Again thanks to Claudio.
Changed in 0.2.1:
A single bugfix which reduces the number of warnings. Thanks to Claudio for detecting the bug.
Changed in 0.2:
Many more constraints are checked in the 0.2 release than in 0.1. In particular, it is now guaranteed that entities are properly nested; parsed entities now always match the corresponding production of the grammar.
Many weak checks have been turned into strong checks. For example, it is now detected if the "version", "encoding", and "standalone" attributes of an XML declaration are ordered in the right way.
The error messages have been improved.