
XSLT and XPath function reference in alphabetical order

(Excerpt from “XSLT 2.0 & XPath 2.0” by Frank Bongers, chapter 5, translated from German)


unparsed-text

Category:

Association and localisation of nodes and resources at runtime

Origin:

XSLT 2.0

Return value:

An xs:string corresponding to the text content of the external resource identified by the URI string passed in.

Call/Arguments:

unparsed-text($URI-string?, $encoding-indication?)

$URI-string:

Required, but may be the empty sequence. An xs:string that must have the lexical form of an xs:anyURI and must not contain a fragment identifier. A relative URI is resolved against the base URI of the static context. The resolved URI must point to a resource that can be read as text. If the empty sequence is passed as the first argument, the function returns the empty sequence.

$encoding-indication:

Optional. The second argument names the encoding of the external resource so that its octets can be converted into a string. The encoding (charset in the sense of RFC 2278) describes the binary representation of the character data (examples: UTF-8, ISO-8859-1). Processors are only required to support UTF-8 and UTF-16. If the processor does not support the requested encoding, processing is aborted (ERR XTDE1190). UTF-8 is assumed only when no encoding is given and none can be determined in another way.

Purpose of use:

The unparsed-text() function reads an external resource identified by the URI string passed to it and returns the content of the resource as a string, without parsing it. Unlike the document() function, unparsed-text() does not require the external document to be well-formed XML.

The string passed as the first argument is interpreted as a URI and must therefore be a lexically valid absolute or relative URI. If the argument is the empty sequence, the function also returns the empty sequence.

In other cases (see below), however, the function may raise an error that aborts processing (non-recoverable dynamic error). To prevent this, the test function unparsed-text-available() can be called beforehand.
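The guard described above can be sketched as follows. This is a minimal illustration, not a definitive pattern: the parameter name $file and the fallback text are assumptions made for the example.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical parameter: URI of the text resource to insert -->
  <xsl:param name="file" select="'example.txt'"/>

  <xsl:template match="/">
    <result>
      <!-- unparsed-text() is only called if the same call is known to succeed -->
      <xsl:value-of select="if (unparsed-text-available($file))
                            then unparsed-text($file)
                            else '[resource not available]'"/>
    </result>
  </xsl:template>

</xsl:stylesheet>
```

Both functions should be called with identical arguments, because unparsed-text-available() reports whether exactly that call to unparsed-text() would return a string.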

A relative URI is resolved in the same way as for the fn:doc() function: it is resolved against the base URI from the static context of the function call, which is normally the base URI of the stylesheet.

If, however, the URI string originates from the source document, it is usually meant to be resolved against the base URI of the source document. Since the desired base URI cannot be passed to unparsed-text() directly, the XPath function fn:resolve-uri() can be applied to the URI string beforehand. It accepts a base URI as its second argument and returns the absolute URI resolved against it, in this case relative to the source document:

unparsed-text(fn:resolve-uri($URI-string, $base-URI))
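In a template, the base URI of the source document can be obtained with fn:base-uri(). The following sketch assumes a hypothetical source element such as <include href="notes/readme.txt"/>; the element and attribute names are illustrative only.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed source markup: <include href="notes/readme.txt"/> -->
  <xsl:template match="include">
    <!-- base-uri(.) is the base URI of the source node, so the relative
         href is resolved against the source document, not the stylesheet -->
    <xsl:value-of select="unparsed-text(resolve-uri(@href, base-uri(.)))"/>
  </xsl:template>

</xsl:stylesheet>
```

This only works if the source node actually has a base URI. For a source tree built in memory without one, base-uri(.) returns the empty sequence and the call to resolve-uri() fails.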

Abort due to URI error
Processing is aborted if the URI argument contains a fragment identifier or does not identify a resource that can be read as text (ERR XTDE1170).

Determination of the encoding of the resource:

To convert the content of the text resource into a string, the processor needs to know the encoding of the resource. Encoding information delivered with the resource is used first. Otherwise the encoding is derived from the media type where this is possible. If neither is available, the encoding named by the second argument $encoding-indication is used, if that argument is present. If no second argument was passed, the processor assumes UTF-8.

Abort due to encoding errors
If the second argument was omitted, the processor cannot infer the encoding from the resource, and the resource cannot be read as UTF-8, processing is aborted (ERR XTDE1200).

Restrictions with regard to usable characters:

The resource may contain only characters that are permitted as XML characters. It also must not contain octets that cannot be converted into valid XML characters using the indicated encoding.

Abort due to octet errors
If the external document contains octets that cannot be decoded using the indicated encoding (this includes the case that the processor does not support the stated encoding), processing is aborted with an error message (ERR XTDE1190).
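Where the encoding of a resource is not known in advance, one possible approach is to test candidate encodings with unparsed-text-available() before reading. The sketch below is an assumption-laden illustration: the parameter $file and the choice of candidate encodings are hypothetical, and a processor is not required to support ISO-8859-1.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical parameter: URI of the text resource -->
  <xsl:param name="file" select="'example.txt'"/>

  <xsl:template match="/">
    <result>
      <xsl:choose>
        <!-- First choice: the resource decodes cleanly as UTF-8 -->
        <xsl:when test="unparsed-text-available($file, 'UTF-8')">
          <xsl:value-of select="unparsed-text($file, 'UTF-8')"/>
        </xsl:when>
        <!-- Second choice: fall back to ISO-8859-1, if supported -->
        <xsl:when test="unparsed-text-available($file, 'ISO-8859-1')">
          <xsl:value-of select="unparsed-text($file, 'ISO-8859-1')"/>
        </xsl:when>
        <!-- Neither call would succeed: stop with an explicit message -->
        <xsl:otherwise>
          <xsl:message terminate="yes">Cannot decode the text resource.</xsl:message>
        </xsl:otherwise>
      </xsl:choose>
    </result>
  </xsl:template>

</xsl:stylesheet>
```

A successful test only shows that the octets can be decoded into permitted XML characters. It does not prove that the chosen encoding is the one the author of the file intended.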

Behaviour towards ends of lines in the external resource:

The unparsed-text() function does not normalise line ends, which means it does not behave like an XML parser. Depending on the platform, this may cause problems with the representation of line ends on input or output. One approach is to imitate the normalisation performed by an XML parser with the fn:replace() function:

fn:replace( unparsed-text('example.txt', 'ISO-8859-1'),
            '&#xD;&#xA;?',
            '&#xA;' )

In this example, the resource is first fetched, interpreted as ISO-8859-1 and passed to fn:replace() as its first argument. The function replaces every CR+LF sequence and every lone CR matched by the regular expression in the second argument with a single LF. Within a stylesheet, the character references &#xD; and &#xA; stand for CR and LF. This remains a problem on classic Mac OS (before Mac OS X), which expects a lone CR (&#xD;) as line end. Where necessary, the conversion has to be adapted to keep the stylesheet portable. (The problem is further complicated in mainframe environments, which also treat NEL and LS as line ends.)

A further possibility to unify line breaks can be implemented by means of XSLT instructions (based on a working example in the XSLT specification).

<xsl:for-each select="fn:tokenize(unparsed-text($in), '\r?\n')">
<!-- doing anything per line of the text resource -->
 ...
</xsl:for-each>

Here, the fn:tokenize() function splits the text resource into segments, each corresponding to one line of text. This is done with the regular expression '\r?\n', in which \r? matches "zero or one" CR character and \n matches "exactly one" LF character. The resulting segments can then be processed individually, and line breaks can be added back in a controlled way.
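As a worked variant of the loop above, the following sketch wraps each line of the resource in an element. The element names lines and line, the number attribute and the default value of $in are assumptions chosen for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical parameter: URI of the text resource -->
  <xsl:param name="in" select="'example.txt'"/>

  <xsl:template match="/">
    <lines>
      <!-- Split at LF or CR+LF; each token is one line without its line end -->
      <xsl:for-each select="tokenize(unparsed-text($in), '\r?\n')">
        <!-- position() is the 1-based index of the current line -->
        <line number="{position()}">
          <xsl:value-of select="."/>
        </line>
      </xsl:for-each>
    </lines>
  </xsl:template>

</xsl:stylesheet>
```

If the resource ends with a line break, the last token is an empty string, so the output contains one trailing empty line element. A lone CR is not treated as a line end by this pattern.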

Serialisation of the string:

If the external document contains the characters < or >, they are replaced during serialisation by the XML entities &lt; or &gt;. This can be prevented with the disable-output-escaping attribute of xsl:value-of. In this way, external documents containing markup (for example HTML or XML fragments) can be embedded into result documents without prior parsing.
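A minimal sketch of this technique, assuming a hypothetical file fragment.html that contains an HTML fragment and a hypothetical source element embed:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml"/>

  <!-- Assumed source markup: <embed/> marks the insertion point -->
  <xsl:template match="embed">
    <div>
      <!-- The markup characters in the file are written to the output as they are -->
      <xsl:value-of select="unparsed-text('fragment.html', 'UTF-8')"
                    disable-output-escaping="yes"/>
    </div>
  </xsl:template>

</xsl:stylesheet>
```

The attribute only takes effect when the result tree is serialised, and a processor is not obliged to honour it. The processor also does not check the inserted text, so the result is well-formed only if the fragment itself is.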

Example 1 – inserting a text with unparsed-text():

The external text 'example.txt':

This is an example text 
which shall be inserted into a document.

Stylesheet (excerpt):

...
<xsl:output method="xml" encoding="ISO-8859-1"/>
<xsl:template match="example">
  <result>
  <xsl:value-of select="unparsed-text('example.txt', 'ISO-8859-1')"/>
  </result>
</xsl:template>
...

Result document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<result>
 This is an example text 
 which shall be inserted into a document.
</result>

Function definition:

XSLT 1.0:

The function is not available.

XSLT 2.0:

unparsed-text($href as xs:string?) as xs:string?

unparsed-text($href as xs:string?, $encoding as xs:string) as xs:string?


Copyright © Galileo Press, Bonn 2008
Printing of the online version is permitted exclusively for private use. Otherwise this chapter from the book "XSLT 2.0 & XPath 2.0" is subject to the same provisions as those applicable for the hardcover edition: The work including all its components is protected by copyright. All rights reserved, including reproduction, translation, microfilming as well as storage and processing in electronic systems.

Galileo Press, Rheinwerkallee 4, 53227 Bonn, Germany