
XSLT and XPath function reference in alphabetical order

(Excerpt from “XSLT 2.0 & XPath 2.0” by Frank Bongers, chapter 5, translated from German)


fn:tokenize

Category:

String functions – pattern matching

Origin:

XPath 2.0

Return value:

A sequence of xs:string values: the segments that result from splitting the input string at the separators matched by the regular expression.

Call/Arguments:

fn:tokenize($inputString?, $reg-ex, $flags?)

$inputString:

Optional in the sense that the empty sequence may be passed. An xs:string that is split into a sequence of substrings at the separators described by the regular expression in the second argument; the substrings that serve as separators are omitted. If the empty sequence is passed as the first argument, the function returns the empty sequence.

$reg-ex:

Obligatory. A regular expression that describes the separator. Every match of this pattern in the input string is treated as a separator and removed when the string is split.

$flags:

Optional. The $flags argument modifies how the regular expression is applied. If no $flags argument is passed, the regular expression matches case-sensitively and treats the input string as a single string rather than as a sequence of lines.
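
The handling of the first argument can be illustrated with a short sketch. The input values are assumptions chosen for illustration and do not come from the book; each expression is meant to be evaluated on its own.

```xpath
(: Empty sequence as input: the result is the empty sequence :)
fn:count(fn:tokenize((), "\s+"))      (: 0 :)

(: The separator does not occur: the input is returned as a single item :)
fn:tokenize("abc", ",")               (: ("abc") :)
```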

Purpose of use:

The fn:tokenize() function splits the string passed to it into a sequence of substrings. The separator pattern is described by a regular expression. Wherever the pattern matches, the substring preceding the match (back to the previous separator match or to the start of the string) is added as a string to the result sequence. The text after the last match forms the final item. The matched separator itself does not appear in the result sequence.

If a pattern match occurs immediately at the beginning of the input string, the first item of the result sequence is consequently the empty string.

If two separators (that is, two pattern matches) follow one another immediately without overlapping, an empty string is likewise added as an item to the result sequence.

However, if two potential matches overlap (e.g. through alternative groups contained in the regular expression), only the first match is used. The order of the groups is therefore relevant:

fn:tokenize("abracadabra", "(ab)|(a)")

results in:

("", "r", "c", "d", "r", "")

Whereas,

fn:tokenize("abracadabra", "(a)|(ab)")

leads to the following result:

("", "br", "c", "d", "br", "")
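
The rules for empty strings in the result can be checked with a few small calls. This is a minimal sketch; the input strings are made up for illustration, and each expression is meant to be evaluated on its own.

```xpath
(: Separator at the very beginning: leading empty string :)
fn:tokenize(",a,b", ",")               (: ("", "a", "b") :)

(: Two adjacent separators: empty string between them :)
fn:tokenize("a,,b", ",")               (: ("a", "", "b") :)

(: Separator at the very end: trailing empty string :)
fn:tokenize("a,b,", ",")               (: ("a", "b", "") :)

(: A predicate removes the empty strings again :)
fn:tokenize(",a,,b,", ",")[. ne ""]    (: ("a", "b") :)
```

Filtering with a predicate is one possible way to discard the empty items when they are not wanted.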

Setting of flags with the $flags argument:

To modify the behaviour of the regular expression, a string of so-called »flags« may be passed as a third argument in addition to the input string and the regular expression. The permitted flags follow the conventions of Perl.

Attention: deviation from the flags of Perl syntax:
The Perl flag g (»global«) is not supported in XPath!

The argument may consist of the single letters m, i, s, x or of useful combinations of them in any order. The empty string is also permitted. Note that the function signature declares $flags as xs:string, so the empty sequence is not an accepted value. An invalid flag argument (one containing other characters) causes the processor to raise the error »Invalid regular expression. flags« (err:FORX0001).

Flag symbol	Description
i	»ignore case«: matching ignores upper and lower case in the examined string.
m	»multi-line«: treats the examined string as a series of lines, so that the patterns ^ and $ also match at the start and end of each line instead of only at the start and end of the whole string.
s	»dot-all« mode: changes the behaviour of the metacharacter ».« (dot). Without the s flag, the dot matches every character (including whitespace!) except the newline character #x0A (NL); with the s flag, it matches the newline character as well.
x	Ignores the whitespace characters #x9, #xA, #xD and #x20 within the regular expression. If the flag is not set, whitespace characters are part of the expression and take part in the match. Explanation: with the x flag set, very long regular expressions can be made more readable by means of line breaks and tabs.
''	The empty string, which is expressly permitted. It has the same effect as omitting the flag argument.

Table: Flag symbols permitted in XPath regular expressions

Note: Further explanations of regular expressions can be found under fn:matches() and fn:replace().
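
The effect of the individual flags can be sketched as follows. The input strings and patterns are assumptions chosen for illustration; each expression is meant to be evaluated on its own.

```xpath
(: i flag: the separator "x" is matched regardless of case :)
fn:tokenize("1x2X3", "x", "i")             (: ("1", "2", "3") :)
fn:tokenize("1x2X3", "x")                  (: ("1", "2X3") :)

(: x flag: the blank inside the pattern is ignored, so ", \s*" acts like ",\s*" :)
fn:tokenize("1, 15,24", ", \s*", "x")      (: ("1", "15", "24") :)
fn:tokenize("1, 15,24", ", \s*")           (: ("1", "15,24") :)

(: s flag: the dot also matches the newline character #x0A :)
fn:tokenize(fn:concat("a-", fn:codepoints-to-string(10), "-b"), "-.-", "s")
                                           (: ("a", "b") :)

(: g is not a permitted flag: the processor raises err:FORX0001 :)
fn:tokenize("abc", "b", "g")
```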

Example 1 – splitting a string into words:

fn:tokenize("This is an example", "\s+")

results in: ("This", "is", "an", "example").

The regular expression matches a run of one or more whitespace characters and defines it as the separator. In this example, the words contained in the string therefore become the items of the result sequence. All whitespace characters are removed.
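
If the input begins or ends with whitespace, the rules for leading and trailing separators apply and empty strings appear in the result. The following sketch, with an input string made up for illustration, shows one possible way to avoid this by normalising the whitespace with fn:normalize-space() first.

```xpath
(: Leading and trailing whitespace produce empty strings :)
fn:tokenize("  This is  an example ", "\s+")
(: ("", "This", "is", "an", "example", "") :)

(: After normalisation, single blanks remain as the only separators :)
fn:tokenize(fn:normalize-space("  This is  an example "), " ")
(: ("This", "is", "an", "example") :)
```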

Example 2 – splitting a comma-separated list available in string form:

fn:tokenize("1, 15, 24,50", ",\s*")

results in: ("1", "15", "24", "50").

In this example, comma-separated numeric values held in a string are converted into a »real« sequence of strings. The regular expression ",\s*" recognises a comma followed by any number of whitespace characters as the separator.
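
The items returned by fn:tokenize() are strings. If numeric values are needed, each item can be cast explicitly. The following is a minimal sketch that assumes every token is a valid integer literal.

```xpath
(: Cast each token to xs:integer :)
for $t in fn:tokenize("1, 15, 24,50", ",\s*")
return xs:integer($t)
(: (1, 15, 24, 50) :)

(: The numeric sequence can then be processed further, e.g. summed :)
fn:sum(for $t in fn:tokenize("1, 15, 24,50", ",\s*") return xs:integer($t))
(: 90 :)
```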

Example 3 – splitting a string into characters (version 1):

fn:tokenize("Test", ".??")

results in: ("", "T", "e", "s", "t").

The regular expression ".??" is intended to split the string into a sequence of individual characters by using the empty string as the separator. For this, it is important to make the quantifier ? non-greedy (reluctant) by appending a further '?'. Otherwise the separator would consist not only of the empty string (which precedes each character) but also of the following character, which would then be removed, and the result sequence would consist of empty strings only. Note that the final XPath 2.0 specification raises the error err:FORX0003 if the pattern matches the zero-length string, so a conforming processor is expected to reject this call instead of returning the result shown.

Inconveniently, the result sequence starts with an empty string, because the first match occurs at the very beginning of the string. (This follows from the concept that every string is preceded by an empty string, which is returned if the match occurs before the first character.)

Example 4 – splitting a string into characters (version 2):

fn:tokenize("Test", "\B.??")

results in: ("T", "e", "s", "t").

The previous example has been modified so that the separator must not occur at the beginning of the string; more precisely, it must be found at a non-word boundary \B. This eliminates the unwanted empty string as the first item of the result sequence. (In this example, the input string must not contain whitespace; otherwise the regular expression would have to be more complex.) Note that \B is a Perl construct that is not part of the XML Schema regular expression syntax on which XPath 2.0 is based, so processors may reject this pattern with err:FORX0002.
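
A string can also be split into characters without a regular expression. The following sketch is an alternative technique, not the one used in the book: it converts the string into its code points and turns each code point back into a one-character string.

```xpath
(: One item per character, without any empty strings :)
for $cp in fn:string-to-codepoints("Test")
return fn:codepoints-to-string($cp)
(: ("T", "e", "s", "t") :)
```

Because no separator pattern is involved, this variant also works for input strings that contain whitespace.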

Example 5 – splitting a string at HTML breaks:

fn:tokenize("This is <br> an <BR> example", "\s*<br>\s*", "i")

results in: ("This is", "an", "example").

In combination with the regular expression "\s*<br>\s*", the "i" flag ensures that the HTML tag <br> is found in upper case as well as in lower case, together with any surrounding whitespace, and is recognised as the separator.

Function definition:

XPath 1.0:

The function is not available.

XPath 2.0:

fn:tokenize($input as xs:string?, $pattern as xs:string) as xs:string*

fn:tokenize($input as xs:string?, $pattern as xs:string, $flags as xs:string) as xs:string*


Copyright © Galileo Press, Bonn 2008
Printing of the online version is permitted exclusively for private use. Otherwise this chapter from the book "XSLT 2.0 & XPath 2.0" is subject to the same provisions as those applicable for the hardcover edition: The work including all its components is protected by copyright. All rights reserved, including reproduction, translation, microfilming as well as storage and processing in electronic systems.

Galileo Press, Rheinwerkallee 4, 53227 Bonn, Germany