1 Description

When working with XML or HTML, you often need to handle multiple documents at the same time, and in most cases that is rather awkward. For instance, an XSLT transformation processes a single main document. Of course, you can read others by calling the doc() function or produce others using <xsl:result-document>. But especially when there are many relations between the documents, this requires careful and sometimes heavy programming. XML containers try to make this more manageable.

An XML container (as handled/used by this module) is an XML structure that holds other XML documents and references to binary files. Here is a short example:

<document-container xmlns="http://www.xtpxlib.nl/ns/container" 
  timestamp="2019-12-12T12:11:10"
  href-target-path="/my/website/location">

  <document href-target="index.html">
    <html> … </html>
  </document>

  <document href-target="page1.html">
    <html> … </html>
  </document>

  <external-document href-source="/my/resources/image.jpg" href-target="resources/image.jpg"/>
</document-container>

This example shows a container, probably generated by some pipeline or XSLT stylesheet, that holds the contents of a simple website. Both pages and an image are there. Running this container through xtlcon:container-to-disk will write it to the path indicated in /*/@href-target-path: /my/website/location. The documents index.html and page1.html come from the container; the binary image.jpg is copied from the indicated source location. Because every separate file is in (or referenced in) a single encompassing document, lots of things get easier: creating or checking internal references, making classes consistent, etc. An XSLT stylesheet that gets this container as its main document has access to all information.
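
To illustrate why a single encompassing document makes reference checking easier, here is a minimal Python sketch. It is not part of this module, which works with XProc and XSLT. It parses a container and reports links that point to files the container does not provide. The page contents and the broken_links helper are made up for illustration; only the container element and attribute names come from the example above. It assumes that all links are relative to the container root and skips absolute URLs and fragment-only links.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.xtpxlib.nl/ns/container}"

# Made-up page contents; only the container element and attribute names
# are taken from the example above.
CONTAINER = """\
<document-container xmlns="http://www.xtpxlib.nl/ns/container"
  href-target-path="/my/website/location">
  <document href-target="index.html">
    <html><body><a href="page1.html">One</a><a href="page2.html">Two</a></body></html>
  </document>
  <document href-target="page1.html">
    <html><body><img src="resources/image.jpg"/><a href="index.html">Home</a></body></html>
  </document>
  <external-document href-source="/my/resources/image.jpg" href-target="resources/image.jpg"/>
</document-container>
"""


def broken_links(container_xml):
    """Return (document, link) pairs whose link has no matching @href-target."""
    root = ET.fromstring(container_xml)
    # Every file the container provides, embedded or external.
    targets = {child.get("href-target") for child in root}
    broken = []
    for doc in root.findall(NS + "document"):
        for element in doc.iter():
            # Assumption: only href and src attributes carry links.
            for attribute in ("href", "src"):
                link = element.get(attribute)
                if link is None or "://" in link or link.startswith("#"):
                    continue
                if link not in targets:
                    broken.append((doc.get("href-target"), link))
    return broken


print(broken_links(CONTAINER))  # prints [('index.html', 'page2.html')]
```

Because both the embedded pages and the external image are listed in one document, a single pass over the container is enough; no files have to be opened on disk.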

1.1 Applications

As it turned out, the whole idea of working with multiple documents in an XML container had several applications:

- An important application of the zip format is its use as an overarching storage format for applications. For instance, most office suites do this: a Microsoft Word .docx or Excel .xlsx file is actually a zip file with many smaller files inside (most of them in XML format). There are many other examples.
  Trying to interpret such a zip file and get something meaningful out of it can be a nightmarish experience, especially if you want to follow the standard (and not rely on some file naming convention that some engineer cooked up and might change). You have to follow links through several files to reach the place where the interesting information is actually stored.
  But if you run such a file through xtlcon:zip-to-container, you get all files in a single encompassing one, making it much easier to follow internal links and find the right information. The xtpxlib-xoffice component does exactly this: it contains pipelines that turn the contents of Word and Excel files into an easier to interpret XML format.
  Going even further, this makes it much easier to change or even create such a horribly complex Word, Excel or other kind of office zip file:
  - Read a (template) office document using xtlcon:zip-to-container.
  - Change what you need to change (text, spreadsheet cell values, etc.). Leave the rest, with all its complex linking and other stuff you don't really need to understand, alone.
  - Write it to a resulting zip file using xtlcon:container-to-zip.
- A file structure that needs to end up on disk or in a zip file can be created easily using this XML container mechanism.
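
The read, change, write cycle for zip-based office files can be sketched outside XProc as well. The following Python example uses the standard zipfile module, not xtlcon:zip-to-container or xtlcon:container-to-zip, and a plain dictionary stands in for the container. The entry names and the replace_text helper are hypothetical; a real office file has many more parts.

```python
import io
import zipfile


def read_zip(data):
    """Return {entry name: bytes} for every file in the zip, in archive order."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {info.filename: zf.read(info)
                for info in zf.infolist() if not info.is_dir()}


def write_zip(entries):
    """Write {entry name: bytes} to a new zip and return it as bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, content in entries.items():
            zf.writestr(name, content)
    return buf.getvalue()


def replace_text(template_zip, entry, old, new):
    """Change one entry and leave every other entry untouched."""
    entries = read_zip(template_zip)
    entries[entry] = entries[entry].replace(old, new)
    return write_zip(entries)


# Hypothetical two-file template standing in for an office document.
template = write_zip({"content.xml": b"<p>Hello NAME</p>",
                      "links.xml": b"<links/>"})
result = read_zip(replace_text(template, "content.xml", b"NAME", b"World"))
```

The point of the sketch is the shape of the workflow: everything is loaded into one structure, a small part is edited, and the untouched parts are written back as they were.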

1.2 Working with containers

The container format is described here. Working with XML container documents is done using XProc 1.0 or XProc 3.0 pipelines.

WARNING: The container formats and processing features differ between the 1.0 and 3.0 versions! More about this in the container format description.

There are some notable missing features in the current container handling. These are not impossible to implement; the need for them just hasn't arisen yet.

- When writing a zip file, you cannot control the compression (which method, or on or off). This means that this mechanism currently can't produce e-books (which require an uncompressed first file).
- You can't work with binary content inside the container, for instance when it's base64 encoded.
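
For reference, this is the kind of per-entry compression control an e-book writer needs. The sketch uses Python's standard zipfile module and is not something the container module offers; the write_ebook_zip helper and the content entry name are made up. It relies on the EPUB rule that the mimetype entry comes first and is stored uncompressed.

```python
import io
import zipfile


def write_ebook_zip(entries):
    """Write a zip whose first entry is an uncompressed mimetype file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # First entry: stored as-is, without compression.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        # All other entries: compressed with deflate.
        for name, content in entries.items():
            zf.writestr(name, content, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


# Made-up content entry, only there to have something to compress.
data = write_ebook_zip({"OEBPS/content.opf": b"<package/>" * 50})
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    infos = zf.infolist()
```

A zip writer that applies one compression setting to the whole archive cannot express this mix, which is why the limitation above rules out e-books for now.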