Typical application scenarios for XProc

The following introduces three scenarios in which XProc could be used and explains the source code step by step. Because all three scenarios read in XML data folder by folder, this common procedure is explained first:

How to read in data folder by folder and pass it on to a loop

The <p:directory-list> step reads out the entire content of a folder. To ensure that only XML documents are listed, the step is given a filter as an option: “include-filter” takes a regular expression which is matched against the file names. In this example, the folder containing the documents is named “input” and is specified in the “path” option.

<p:directory-list include-filter=".*xml" path="input"/>

The output document of this step wraps all entries in a <c:directory> element, with one <c:file> child element for each file.

<c:directory name = string>
 (c:file |
  c:directory |
  c:other)*
</c:directory>
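
For the “input” folder used here, the listing might look roughly like the following sketch. The four file names are the ones used in this example. Real processors may add further attributes (for instance a base URI), which are left out here.

```xml
<!-- Illustrative output of p:directory-list for the "input" folder;
     attributes added by a particular processor are omitted -->
<c:directory name="input" xmlns:c="http://www.w3.org/ns/xproc-step">
  <c:file name="document.xml"/>
  <c:file name="document2.xml"/>
  <c:file name="document3.xml"/>
  <c:file name="document4.xml"/>
</c:directory>
```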

Only the <c:file> elements are of interest, because they contain the file names of the XML documents. They are extracted with <p:filter>.

<p:filter select="//c:file"/>

<p:filter> is given an XPath expression which selects the <c:file> elements. Each <c:file> element is then passed on to the next step as a separate document.

<c:file name="document.xml" xmlns:c="http://www.w3.org/ns/xproc-step"/>
<c:file name="document2.xml" xmlns:c="http://www.w3.org/ns/xproc-step"/>
<c:file name="document3.xml" xmlns:c="http://www.w3.org/ns/xproc-step"/>
<c:file name="document4.xml" xmlns:c="http://www.w3.org/ns/xproc-step"/>
 

The <p:for-each> step creates a loop which processes one <c:file> element per iteration. The loop therefore runs as many times as there are XML documents in the specified folder.

First, a variable named “filename” is declared within the loop; its value is the name of the current file. The name is held in the “name” attribute of <c:file> and is read out by an XPath expression.

<p:variable name="filename" select="c:file/@name"/>

This variable is needed later on. The “name” attribute contains only the file name, not the file path, so the path has to be added by the <p:make-absolute-uris> step. Otherwise the document cannot be loaded in the next step.

<p:make-absolute-uris match="c:file/@name">
  <p:with-option name="base-uri" select="'input/'"/>
</p:make-absolute-uris>

Setting the “base-uri” option to “input/” gives each document the correct path. Because the value is relative, it is resolved against the location of the XProc pipeline document in the file system. Appending “input/”, the folder of the XML documents, completes the path. Depending on the file system, the result could look as follows:

<c:file name="file:/Users/tg/Documents/Scenario/input/Document.xml" xmlns:c="http://www.w3.org/ns/xproc-step"/>

Without this correction, the next step, <p:load>, cannot find the document which it is supposed to load and make available to the following steps. The “href” option of this step references the file to be loaded. It is set by an XPath expression to the value of the “name” attribute of <c:file>, which now includes the path.

<p:load dtd-validate="false">
   <p:with-option name="href" select="c:file/@name"/>
</p:load>

The loaded document is then passed on to a <p:try> step, so that all following operations are executed in a protected context. If an error occurs, it is caught by <p:catch>, processing of the current file ends, and the next iteration of the loop, i.e. the next file, is started.

All scenarios have this basic functionality in common.
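
Put together, the fragments above could be arranged roughly as in the following sketch. It is not the author's complete pipeline: the enclosing <p:declare-step>, the step name “loop” and the <p:sink/> placeholder are assumptions made for illustration. The name “catch” and the <p:store> inside <p:catch> follow the error handling described at the end of this text.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                version="1.0">

  <!-- List the XML documents in the "input" folder -->
  <p:directory-list include-filter=".*xml" path="input"/>

  <!-- Emit every c:file element as a document of its own -->
  <p:filter select="//c:file"/>

  <!-- One iteration per c:file element ("loop" is a hypothetical name) -->
  <p:for-each name="loop">
    <!-- Plain file name, used later to name the result files -->
    <p:variable name="filename" select="c:file/@name"/>

    <!-- Turn the file name into an absolute URI below "input/" -->
    <p:make-absolute-uris match="c:file/@name">
      <p:with-option name="base-uri" select="'input/'"/>
    </p:make-absolute-uris>

    <p:load dtd-validate="false">
      <p:with-option name="href" select="c:file/@name"/>
    </p:load>

    <p:try>
      <p:group>
        <!-- Placeholder (assumption): the scenario-specific steps go here -->
        <p:sink/>
      </p:group>
      <p:catch name="catch">
        <!-- Write the error document to the "fail" folder -->
        <p:store>
          <p:with-option name="href"
                         select="concat('./output/fail/',$filename)"/>
          <p:input port="source">
            <p:pipe port="error" step="catch"/>
          </p:input>
        </p:store>
      </p:catch>
    </p:try>
  </p:for-each>
</p:declare-step>
```

In this arrangement neither the <p:group> nor the <p:catch> branch ends with a primary output, so the two branches of <p:try> stay consistent with each other.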

SCENARIO 1: Cross media publishing

A publishing company sells its publications in several formats, e.g. PDF, XHTML and EPUB.

Whereas XHTML and PDF are commonly used, EPUB is less well known. The format is therefore briefly presented here.

EPUB (electronic publication) is a standard for e-books published by the IDPF (International Digital Publishing Forum). It is an XML-based format composed of three open standards (Open Publication Structure, Open Packaging Format and OEBPS Container Format). EPUB files have the extension *.epub. An EPUB file is essentially a ZIP container with a defined folder hierarchy which holds the actual content of the document.

META-INF/
  container.xml
mimetype
opf/
ops/

The subfolders hold the files which describe the content of the EPUB document. For additional information on EPUB, please see the IDPF homepage.

The raw data for the desired cross media publishing procedure is provided as XML documents. These documents pass through several quality assurance stages: they are validated against an XML Schema and against a Schematron schema. Only then are they transformed into the desired formats by the corresponding XSLT transformations.

With XProc, these processes can be defined as a workflow. The XML documents are first read in folder by folder and then processed individually within the loop. Each document has to be validated against the given XML Schema document. If the validation succeeds, the next step follows.

If the validation fails, a file with the name of the XML document is stored in a folder designated for this purpose. This file contains the corresponding XProc error message, which allows conclusions to be drawn about the underlying problem. After that, the next document is processed. In the second stage, a Schematron validation is carried out, following the same procedure as in the first stage. If the result is error-free, processing continues. If not, a file with the name of the XML document and the error information is stored in the appropriate folder, and processing continues with the next document.

Once validation has been completed successfully, the XML document can be transformed into the desired formats by XSLT transformations. If a transformation causes an error (e.g. an error in the XSLT stylesheet), an error file of the same name is generated here as well.

The following graphic shows the flow of the XProc script.

Figure: Publishing scenario


Since reading in folder by folder has already been discussed, this explanation starts directly with the <p:try> step. The behaviour in the <p:catch> area is identical in all scenarios and is therefore discussed at the end of the text (see p:catch in the scenarios).

The first phase in the “try block” is the validation against an XML Schema document, carried out by the <p:validate-with-xml-schema> step. Before that, a <p:group> element is opened which encloses all steps in <p:try>; <p:try> requires this. Next comes the XML Schema validation.

<p:validate-with-xml-schema assert-valid="true" mode="strict">
   <p:input port="schema">
      <p:document href="xml_schema.xsd"/>
   </p:input>
</p:validate-with-xml-schema>

Here the XML Schema document to be used is specified at the “schema” input port. If the validation is successful, the XML document originally read in is passed on to the next step, which performs the Schematron validation.

<p:validate-with-schematron assert-valid="true" name="schematron">
  <p:input port="parameters">
    <p:empty/>
  </p:input>
  <p:input port="schema">
    <p:document href="schematron.sch"/>
  </p:input>
</p:validate-with-schematron>

Here the Schematron schema document is specified in the same way as in the previous step. If the validation is successful, the transformations into the respective formats follow in the next stages.

All transformations are performed by the <p:xslt> step. Since three different formats are to be generated from the same XML document, three <p:xslt> steps are required. The output port of the preceding Schematron validation step is connected to the “source” input port of each of the three steps.

After that, a step suited to the respective format is performed. For HTML, a <p:store> step saves the transformed document. The “filename” variable declared at the beginning is used here: it tells the <p:store> step what the generated document is to be named.

<p:xslt>
   <p:input port="source">
      <p:pipe port="result" step="schematron"/>
   </p:input>
   <p:input port="stylesheet">
      <p:document href="xslt_stylesheet.xsl"/>
   </p:input>
   <p:input port="parameters">
      <p:empty/>
   </p:input>
</p:xslt>
<p:store>
   <p:with-option name="href" select="concat('./output/success/html/',$filename,'.html')"/>
</p:store>

The output port of the <p:validate-with-schematron> step is connected to the “source” input port of <p:xslt>. The XSLT stylesheet to be used is specified at the “stylesheet” input port. After the step has run, the result on its output port is passed on to the following <p:store> step. This step saves the result in the “output/success/html” folder in a file with the same name as the initial XML document, with the extension “.html” appended. This is done with the “concat” function, which joins several strings into one: here, the path name, the content of the “filename” variable and the extension.

This procedure is repeated twice for the other formats, with different stylesheets for the XSLT transformations. Whereas the first stylesheet performs an XHTML transformation, the other two produce the PDF and EPUB output. The stylesheet responsible for EPUB generates the folder structure typical for EPUB, so that it can be processed further in the next phase. To generate a PDF, the stylesheet produces an FO document for an XSL-FO formatter.

To generate the EPUB version, the <p:exec> step is called in the next phase. This step allows external programmes to be used in XProc. Here it calls a Perl script for further processing.

<p:make-absolute-uris name="absolute" match="perl">
   <p:input port="source">
    <p:inline>
       <uris>
         <perl>epub.perl</perl>
       </uris>
    </p:inline>
   </p:input>
</p:make-absolute-uris>
<p:exec command="perl" result-is-xml="false" source-is-xml="false" name="Zip">
   <p:log port="errors" href="error.txt"/>
   <p:with-option name="args" select="substring-after(uris/perl,'file:')"/>
   <p:input port="source">
     <p:empty/>
   </p:input>
</p:exec>

Before the <p:exec> step can run, the reference to the Perl file has to be prepared. Since <p:exec> requires absolute path names, the file name is made absolute by the <p:make-absolute-uris> step. Otherwise the file cannot be found, because the step always searches in the directory of the programme to be executed. Then the Perl script is executed. The file is specified in the “args” option using the “substring-after” function, which adapts the return value of <p:make-absolute-uris> by omitting the leading “file:”. <p:exec> can then find and execute the Perl script.

This Perl script generates a ZIP archive from the EPUB folders created by the XSLT transformation and changes the file extension to *.epub. Finally, the newly generated EPUB document is validated with the free programme “EpubCheck”. This is intended to ensure that the file can be read on all current EPUB readers. If the document is valid, it is copied into the “output/success/epub” folder.

Instead of a Perl script, a batch file or a shell script could have been used. It is also conceivable to use <p:exec> to call a command line-based ZIP tool, or a tool with ZIP support (e.g. the “tar” programme), in order to generate the ZIP file. Two further <p:exec> steps would then have to follow, one to rename the file to “*.epub” (e.g. with the Linux command mv) and one to validate it with the “epubcheck” programme. In this example these processes are combined in a Perl script, since the generation of EPUB documents is not the main focus of the scenarios.

To generate a PDF, the XSL-FO document produced by the XSLT transformation is passed on to the <p:xsl-formatter> step. This step transforms it into a PDF and creates a file in the “output/success/pdf/” folder. Again, the original name of the XML document is used as file name (with the extension *.pdf appended).

<p:xslt name="XSL">
   <p:input port="source">
      <p:pipe port="result" step="schematron"/>
   </p:input>
   <p:input port="stylesheet">
      <p:document href="xslt_to_fo.xsl"/>
   </p:input>
   <p:input port="parameters">
      <p:empty/>
   </p:input>
</p:xslt>
<p:xsl-formatter name="XSL2" content-type="application/pdf">
   <p:with-option name="href" select="concat('./output/success/pdf/',$filename,'.pdf')"/>
   <p:input port="source">
      <p:pipe port="result" step="XSL"/>
   </p:input>
   <p:input port="parameters">
      <p:empty/>
   </p:input>
</p:xsl-formatter>

The name of the PDF file to be generated is built with <p:with-option> and a further “concat” call. The resulting string consists of the target folder followed by the file name and the extension *.pdf.

At the end of the complete pipeline run, the user has a folder structure which contains the results of all successful transformations. All failed transformations, including the corresponding diagnostic information, are provided in the “output/fail” folder.

SCENARIO 2: Data migration

Two companies intend to merge. Over the years, both have accumulated enormous amounts of XML data in their respective company structures. These data have to be combined.

This is done with several XSLT transformations. Based on a newly defined, common set of rules (e.g. an XML Schema), new XML-based data is created from the existing XML documents by the corresponding transformations.

Before the transformation, the initial XML documents are checked against an XML Schema and then against a Schematron schema. This is to ensure that the transformation can be performed properly. Because the data of the two companies differ too much in places, they first have to be prepared by a Perl script which applies regular expressions and find/replace operations. Only then can they be transformed.

After the XSLT transformation, the newly generated XML document is checked once more against a Schematron schema. If this validation is successful, the new document is saved in an appropriate folder.

This procedure, too, is defined in an XProc pipeline. If a document passes the complete run successfully, it is saved. If an error occurs, a document with the same name is generated which contains the corresponding XProc error message, so that conclusions can be drawn about the cause of the problem.

The following graphic shows the flow of the XProc script.

Figure: Data migration scenario


Since reading in folder by folder has already been discussed, this explanation starts directly with the <p:try> step. The behaviour in the <p:catch> area is identical in all scenarios and is therefore explained at the end of the text (see p:catch in the scenarios).

The first stage within the “try block” is the validation against an XML Schema document, carried out by the <p:validate-with-xml-schema> step. Before that, a <p:group> element is opened which encloses all steps in <p:try>; <p:try> requires this. The XML Schema validation follows.

<p:validate-with-xml-schema assert-valid="true" mode="strict">
   <p:input port="schema">
     <p:document href="xml_schema.xsd"/>
   </p:input>
</p:validate-with-xml-schema>

Here the XML Schema document to be used is specified at the “schema” input port. If the validation is successful, the XML document initially read in is passed on to the next step, which performs the Schematron validation.

<p:validate-with-schematron assert-valid="true" name="schematron">
  <p:input port="parameters">
    <p:empty/>
  </p:input>
  <p:input port="schema">
    <p:document href="schematron.sch"/>
  </p:input>
</p:validate-with-schematron>

The Schematron document is specified at the “schema” input port, and the XML document read in is validated against it. If this validation is successful, the XML document is passed on to the next step (<p:store>), which creates a temporary copy of the XML document called “temp.xml”. This copy is needed later for the execution of the Perl script.

<p:store href="temp.xml">
   <p:input port="source">
     <p:pipe port="result" step="schematron"/>
   </p:input>
</p:store>

Since the Perl script is executed by the <p:exec> step, further preparations have to be made.

<p:make-absolute-uris name="absolute" match="perl">
   <p:input port="source">
      <p:inline>
         <uris>
            <perl>perl_script.pl</perl>
            <perl>temp.xml</perl>
         </uris>
      </p:inline>
   </p:input>
</p:make-absolute-uris>

The Perl script has the file name “perl_script.pl”. It is to process “temp.xml”, the intermediate copy of the XML document currently being handled in the loop. The <p:exec> step is later told to execute the “perl” programme. On the command line, the corresponding command would be:

perl perl_script.pl temp.xml

In XProc, or rather in the current pipeline, this command looks as follows:

<p:exec command="perl" result-is-xml="true" source-is-xml="false" name="exec">
  <p:log port="errors" href="error.txt"/>
  <p:with-option name="args" select="concat(substring-after(uris/perl[1],'file:'),' ',substring-after(uris/perl[2],'file:'))"/>
  <p:input port="source">
    <p:empty/>
  </p:input>
</p:exec>

XProc assumes that the current working directory, in which the specified files are located, is the same as the directory of the programme to be executed. As this is seldom the case, the <p:make-absolute-uris> step is used to obtain absolute paths to the required documents. A string is then assigned to the “args” option of <p:exec>. It contains the complete path to “perl_script.pl”, a space character as separator, and the complete path to “temp.xml”. The string is assembled with the “concat” and “substring-after” functions. The latter returns the part of its argument which follows a given string, in this example “file:”. This is necessary because <p:make-absolute-uris> returns URIs with the “file:” scheme in front of every entry. Perl does not accept such a URI as a file path and would not find the files. With “substring-after”, the string is adjusted so that Perl can find them.
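
As a worked example of this string manipulation, the following sketch shows what <p:make-absolute-uris> might return and which value the XPath expression then yields for “args”. It assumes that the pipeline document is located in /Users/tg/Documents/Scenario/ (the path borrowed from the <c:file> example further above). The exact form of the URIs depends on the processor and the file system.

```xml
<!-- Assumed output of p:make-absolute-uris for a pipeline stored in
     /Users/tg/Documents/Scenario/ -->
<uris>
  <perl>file:/Users/tg/Documents/Scenario/perl_script.pl</perl>
  <perl>file:/Users/tg/Documents/Scenario/temp.xml</perl>
</uris>
<!-- concat(substring-after(uris/perl[1],'file:'),' ',
            substring-after(uris/perl[2],'file:'))
     then evaluates to the single string:
     /Users/tg/Documents/Scenario/perl_script.pl /Users/tg/Documents/Scenario/temp.xml -->
```

Note that <p:exec> splits the “args” value at the separator character, a space by default, so this approach assumes paths which do not themselves contain spaces.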

In this way, the step can be executed successfully, provided that the documents concerned are in the same directory as the XProc pipeline document (because <p:make-absolute-uris> resolves against it).

A further approach would be to specify the current working directory (in the “cwd” option). In this case, however, the directory cannot be determined dynamically by means of XProc, because all methods of generating paths (p:base-uri(), <p:make-absolute-uris>, p:resolve-uri()) refer to elements or include the file name of the XProc pipeline document as the final part of the path.
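
With a fixed directory, the “cwd” variant could look like the following sketch. The directory /Users/tg/Documents/Scenario is a hypothetical, hard-coded value (taken from the path example further above), which is exactly the limitation just described. The file names are then given relative to that directory.

```xml
<!-- Sketch only: the working directory is hard-coded and hypothetical -->
<p:exec xmlns:p="http://www.w3.org/ns/xproc"
        command="perl" args="perl_script.pl temp.xml"
        cwd="/Users/tg/Documents/Scenario"
        result-is-xml="true" source-is-xml="false" name="exec">
  <p:input port="source">
    <p:empty/>
  </p:input>
</p:exec>
```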

After the <p:exec> step has run, its result is enclosed in a <c:result> wrapper.

<c:result>string</c:result>

As this wrapper is not wanted, the content of <c:result> is extracted by <p:filter> and passed on to the next step.

<p:filter name="filter" select="c:result/*"/>

The XPath expression “c:result/*” in the “select” option specifies that the element content of “c:result” is to be output. It is passed on to the <p:xslt> step, which performs the actual transformation.

<p:xslt name="transformer">
  <p:input port="source">
    <p:pipe port="result" step="filter"/>
  </p:input>
  <p:input port="stylesheet">
    <p:document href="xslt_stylesheet.xsl"/>
  </p:input>
  <p:input port="parameters">
    <p:empty/>
  </p:input>
</p:xslt>

If the transformation is successful, the result is passed on to a further Schematron check, which is to ensure that the new data meet the required quality characteristics. Finally, the document is saved by <p:store>. The value of the “filename” variable declared above is used as file name.

<p:validate-with-schematron assert-valid="true" name="schematron2">
  <p:input port="parameters">
    <p:empty/>
  </p:input>
  <p:input port="schema">
    <p:document href="schematron_2.sch"/>
  </p:input>
</p:validate-with-schematron>
<p:store>
 <p:with-option name="href" select="concat('./output/success/',$filename)"/>
 <p:input port="source">
   <p:pipe port="result" step="schematron2"/>
 </p:input>
</p:store>

At the end of the complete run (i.e. after all loop iterations), the user has a folder structure which contains the results of all successful runs and the error files of all failed ones.

SCENARIO 3: Round-tripping

A publishing company produces documents in several formats (cross media publishing). Since the authors do not want to work with XML editors, they use Microsoft Word.

The documents produced (the *.docx format of Word is XML-based) are transformed by an XSLT transformation into DocBook documents (an XML-based format which has been optimised for print). On this basis, further transformations convert them into the desired formats (e.g. PDF).

It must also be possible to convert the generated DocBook documents back to their original form (Word). This requires a further XSLT transformation with an appropriate stylesheet. The procedure is called round-tripping. For example, an editor who has made subsequent changes to the DocBook document (e.g. changes to the typesetting) could convert the new version back to Word and send it to the author for re-use in the next edition.

In the following workflow, a Word document is first validated against an XML Schema and a Schematron schema within a pipeline. It is then transformed into DocBook. The newly created DocBook document is in turn transformed back into Word.

Finally, this newly created document is compared with the Word file read in at the beginning. If the two are identical, the data structure has been preserved. This last step verifies that the back transformation works properly and is mainly relevant during the development phase.

The following graphic represents the flow of the XProc script.

Figure: Round-tripping scenario


Since reading in folder by folder has already been discussed, this explanation starts directly with the <p:try> step. The behaviour in the <p:catch> area is identical in all scenarios and is therefore explained at the end of the text (see p:catch in the scenarios).

<p:validate-with-xml-schema assert-valid="true" mode="strict">
   <p:input port="schema">
     <p:document href="xml_schema.xsd"/>
   </p:input>
</p:validate-with-xml-schema>

First, the document read in is validated against an XML Schema file in the <p:validate-with-xml-schema> step. The schema document to be used is specified at the “schema” input port. If this validation is successful, a validation against a Schematron document follows in the next step.

<p:validate-with-schematron assert-valid="true" name="schematron">
  <p:input port="parameters">
    <p:empty/>
  </p:input>
  <p:input port="schema">
    <p:document href="schematron.sch"/>
  </p:input>
</p:validate-with-schematron>

In the <p:validate-with-schematron> step, the document output by the previous step is read in and validated against a Schematron schema. The Schematron file is specified at the “schema” input port. If the validation is successful, the document read in is passed on to the next step.

<p:xslt name="transformer">
 <p:input port="stylesheet">
   <p:document href="xslt_stylesheet_1.xsl"/>
 </p:input>
 <p:input port="parameters">
   <p:empty/>
 </p:input>
</p:xslt>

The transformation by <p:xslt> generates a DocBook document from the input data. The XSLT stylesheet is specified at the “stylesheet” input port. If the step succeeds, the newly created document is passed on to a further transformation step.

<p:xslt name="transformer_2">
  <p:input port="source">
    <p:pipe port="result" step="transformer"/>
  </p:input>
  <p:input port="stylesheet">
    <p:document href="xslt_stylesheet_2.xsl"/>
  </p:input>
  <p:input port="parameters">
    <p:empty/>
  </p:input>
</p:xslt>

In this second transformation, the previously generated DocBook document is converted back to Word. Then, the result is passed on.

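<p:compare fail-if-not-equal="true">
  <p:input port="source">
    <!-- the validated Word document read in at the beginning -->
    <p:pipe port="result" step="schematron"/>
  </p:input>
  <p:input port="alternate">
    <!-- the Word document converted back from DocBook -->
    <p:pipe port="result" step="transformer_2"/>
  </p:input>
</p:compare>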

In the <p:compare> step, the document converted back is compared with the document read in at the beginning. The first document is connected to the “source” port, the second to the “alternate” port. If the “fail-if-not-equal” option is set to “true”, the step terminates with an error when the documents are not identical.

Finally, the DocBook document is saved by the <p:store> step.

<p:store>
  <p:with-option name="href" select="concat('./output/success/',$filename)"/>
  <p:input port="source">
    <p:pipe port="result" step="transformer"/>
  </p:input>
</p:store>

The name of the created document is taken from the “filename” variable declared at the beginning.

At the end of the complete pipeline run, the user has a folder containing all converted DocBook documents. All failed runs, including the respective error messages, are stored in a separate folder.

<p:catch> in the scenarios

Since the behaviour of <p:catch> is identical in all scenarios, it is described here once.

If a dynamic error occurs in the <p:try> area, for example a validation error within a <p:validate-with-xml-schema> step, this area is left immediately. Normally, a dynamic error would abort the pipeline. But since the scenarios process files in a loop, errors have to be caught and logged so that processing can continue with the next file.

This is done in the <p:catch> area. The errors are caught and written to a file by a <p:store> step. Its input is the “error” port provided by <p:catch>.

<p:store>
  <p:with-option name="href" select="concat('./output/fail/',$filename)"/>
  <p:input port="source">
    <p:pipe port="error" step="catch"/>
  </p:input>
</p:store>

The file name is taken from the “filename” variable, which in each scenario is declared at the beginning and holds the file name of the current XML document. An error is thus stored in a separate directory (fail) under the same file name as the document being processed.

After the <p:catch> area has been left, the loop continues or, after the last iteration, ends.
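
The stored file could look roughly like the following sketch for a document which failed the XML Schema validation. The file name “document.xml” comes from the example listing, and the message text is a placeholder. Which attributes and which description actually appear depends on the XProc processor used.

```xml
<!-- Hypothetical content of output/fail/document.xml; the message text
     and any further attributes are processor-specific -->
<c:errors xmlns:c="http://www.w3.org/ns/xproc-step">
  <c:error code="err:XC0053"
           xmlns:err="http://www.w3.org/ns/xproc-error">
    (processor-specific description of the validation problem)
  </c:error>
</c:errors>
```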
