XML and Unicode

(Excerpt from "The MathML Handbook" by Pavi Sandhu)

An XML document can contain any Unicode text. Unicode is an international standard for representing multilingual text. It defines a very large character set that includes characters from most of the world's languages as well as many mathematical and technical symbols.

A character set defines a mapping between a set of characters and a set of numbers, which are called code points. For example, in Unicode, the Greek letter α is represented by the code point 945 (in decimal notation), or 3B1 (in hexadecimal notation, conventionally written U+03B1).
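The mapping between a character and its code point can be checked directly. The following is a minimal sketch in Python, a language chosen here only for illustration; the variable names are not from the book.

```python
# Greek small letter alpha, written as an escape so the source stays ASCII.
alpha = "\u03b1"

# ord() returns the Unicode code point of a one-character string.
code_point = ord(alpha)
print(code_point)              # 945
print(hex(code_point))         # 0x3b1
print(f"U+{code_point:04X}")   # U+03B1

# chr() goes the other way, from code point back to character.
print(chr(945) == alpha)       # True
```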

Unicode is a superset of the American Standard Code for Information Interchange (ASCII), a widely used character set that includes all the letters and common punctuation marks used in English. ASCII consists of 128 characters with code points from 0 to 127. The first 128 characters of Unicode are identical to ASCII. For example, the letter A has the code point 65 in both ASCII and Unicode. However, Unicode goes well beyond ASCII by including many more characters. At the time of writing, the current version of the standard, Unicode 3.2, defines code points for approximately 95,000 characters.
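The superset relationship can be illustrated with a short check. This is a sketch in Python, which the book does not use. Decoding each ASCII byte and comparing it with the Unicode character at the same code point is just one way to show the overlap.

```python
letter = "A"
print(ord(letter))  # 65

# Every ASCII code point (0 to 127) names the same character in Unicode.
same_in_both = all(bytes([n]).decode("ascii") == chr(n) for n in range(128))
print(same_in_both)  # True

# A character outside that range, such as Greek alpha, has no ASCII form.
try:
    "\u03b1".encode("ascii")
    alpha_fits_in_ascii = True
except UnicodeEncodeError:
    alpha_fits_in_ascii = False
print(alpha_fits_in_ascii)  # False
```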

In an XML document, the names of elements and attributes, as well as the character data contained in an element, can all be written in Unicode. The advantage of using Unicode is that it allows you to use a single character set for text containing multiple languages and many different types of symbols. This avoids the problems caused by conflicting character sets, in which a single code point might be assigned to more than one character or a single character might have more than one code point, depending on the type of computer being used. Many software applications and operating systems now support Unicode. Unicode thus provides a standard way of encoding multilingual text so it can be exchanged and interpreted reliably across a wide variety of computer systems.
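The conflict between older character sets can be made concrete with a single byte value. The sketch below, in Python, uses ISO 8859-1 (Latin-1) and ISO 8859-7 (Greek) as examples. These two character sets are picked here for illustration and are not named in the book.

```python
# One byte value, 0xE1 (225), interpreted under two legacy character sets.
raw = b"\xe1"

as_latin1 = raw.decode("latin-1")    # a with acute accent
as_greek = raw.decode("iso8859_7")   # Greek small letter alpha

# The same number stands for two different characters, and Unicode
# gives each of them its own code point.
print(ord(as_latin1), ord(as_greek))  # 225 945
print(as_latin1 == as_greek)          # False
```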

You can include any Unicode character in an XML document by using a numeric character reference, which gives the character's code point. For example, to include the Greek character α, you would type &#945; (decimal) or &#x3B1; (hexadecimal). If the document includes a document type declaration that points to a DTD with entity names defined for specific characters, you can also insert the character using a named entity reference. For example, suppose you include a reference to the MathML DTD, as shown here:

<!DOCTYPE math SYSTEM "http://www.w3.org/TR/MathML2/dtd/mathml2.dtd">

You can then insert the character α in this document by using the named entity reference &alpha; because the MathML DTD includes an entity declaration that associates the entity name alpha with the corresponding Unicode character code. We shall learn more about named characters in MathML under MathML characters.
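Both kinds of reference can be tried with an XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module, which is not part of the book. Because that parser does not fetch external DTDs, the example declares the alpha entity inline as a stand-in for the declaration the MathML DTD would supply. The inline declaration and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Numeric character references: decimal and hexadecimal forms of code point 945.
numeric_doc = "<math><mi>&#945;</mi><mi>&#x3B1;</mi></math>"
numeric_text = [mi.text for mi in ET.fromstring(numeric_doc)]
print([ord(t) for t in numeric_text])  # [945, 945]

# Named entity reference. Assumption: the entity is declared in the internal
# subset here, in place of the external MathML DTD.
named_doc = (
    '<!DOCTYPE math [<!ENTITY alpha "&#945;">]>'
    "<math><mi>&alpha;</mi></math>"
)
named_text = ET.fromstring(named_doc).find("mi").text
print(ord(named_text))  # 945
```

In both cases the parser hands the application the same character, so the choice between a numeric and a named reference affects only how the source document is written.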







Copyright © CHARLES RIVER MEDIA, INC., Massachusetts (USA) 2003
Printing of the online version is permitted exclusively for private use. Otherwise this chapter from the book "The MathML Handbook" is subject to the same provisions as those applicable for the hardcover edition: The work including all its components is protected by copyright. All rights reserved, including reproduction, translation, microfilming as well as storage and processing in electronic systems.

CHARLES RIVER MEDIA, INC., 20 Downer Avenue, Suite 3, Hingham, Massachusetts 02043, United States of America