XML and HTML
(Excerpt from "The MathML Handbook" by Pavi Sandhu)
It is helpful to understand XML by contrasting it with HTML, the markup language that is most familiar to people. An HTML document consists of text marked up by tags that provide information about the text they contain. Each tag can also include attributes that provide additional information about the enclosed text. A Web browser can then use the tags and attributes to display the content in the document. HTML consists of about a hundred different tags with names like h1, h2, and p that provide formatting and layout information. Here is what a typical HTML fragment looks like:
<h1>This is a heading.</h1>
<h2>This is a subheading.</h2>
<p>Here is a paragraph of text.</p>
In contrast to HTML, XML is a metamarkup language. This means that XML does not specify a fixed set of allowed tags or attributes. Instead, it allows users to define any tags they wish that make sense in the context of what they want to accomplish. For example, a stock transaction could be expressed in XML as:
<Stock>
  <Company>Acme Corporation</Company>
  <Ticker>ACME</Ticker>
  <Price>$43.57</Price>
  <Shares>10</Shares>
</Stock>
Here, the user specifically defined the Stock, Company, Ticker, Price, and Shares tag names for this document. The great strength of XML is that it provides a standard framework for constructing new markup languages. A company, organization, professional body, or any other group can use the syntax of XML to define tag names that are meaningful for a specific purpose. This makes XML an extremely flexible and open-ended language that can easily be tailored to fit a wide range of applications.
As you saw in the stock example, XML tag names are usually chosen to describe the meaning of the data they enclose. This allows processing applications to search and process the semantic content of the information. For example, you can design search engines to locate specific types of information instead of just looking for literal keywords or strings, greatly increasing the precision and relevance of the results they return. This is not possible with HTML since most HTML tags provide information about the formatting of their content, not its meaning.
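As a rough illustration of processing by meaning rather than by literal string, the following sketch reads the stock example with Python's standard xml.etree.ElementTree module and retrieves values by tag name. The helper name field is an assumption made for this illustration and is not part of the original text.

```python
import xml.etree.ElementTree as ET

# The stock transaction from the example above.
STOCK_XML = """
<Stock>
  <Company>Acme Corporation</Company>
  <Ticker>ACME</Ticker>
  <Price>$43.57</Price>
  <Shares>10</Shares>
</Stock>
""".strip()


def field(xml_text, tag):
    """Return the text of the first child element named `tag`, or None.

    `field` is a hypothetical helper written for this sketch.
    """
    root = ET.fromstring(xml_text)
    element = root.find(tag)  # looks among the direct children of <Stock>
    return element.text if element is not None else None


# Look up data by what it means (its tag), not by where it sits in the text.
ticker = field(STOCK_XML, "Ticker")
shares = int(field(STOCK_XML, "Shares"))
print(ticker, shares)
```

Because the lookup is keyed on the tag name, it keeps working if the elements are reordered or if new elements are added to the Stock record.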
Although XML is flexible in its vocabulary, its rules of syntax (which determine how tags and attributes can be combined) are very strictly enforced. An XML document that violates any of these syntax rules is automatically rejected by XML processing tools. In contrast, applications that process HTML are relatively tolerant of minor errors in syntax. For example, an HTML document can contain some overlapping elements as well as elements whose start tag is not followed by an end tag. A typical Web browser will try to guess the correct form for any content that has a nonstandard syntax. If the browser cannot guess the correct form, it will ignore the offending tags and continue processing the rest of the document. This tolerance, though convenient in some ways, is also problematic, since it can lead to different processors treating the same document differently.
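The strictness described above can be observed with any conforming XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the function name is_well_formed and the sample strings are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts `xml_text`, False if it rejects it.

    `is_well_formed` is a hypothetical helper written for this sketch.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # The parser stops at the first syntax violation instead of guessing.
        return False


good = "<p>Here is <b>bold</b> text.</p>"
overlapping = "<p>Here is <b>bold</p></b>"  # elements overlap
unclosed = "<p>Here is a paragraph"         # start tag with no end tag

for sample in (good, overlapping, unclosed):
    print(is_well_formed(sample), sample)
```

A typical Web browser would still render the last two fragments as HTML, which is exactly the tolerance that the XML rules exclude.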
The strictness of XML's syntax is a great advantage over HTML because it ensures that XML documents behave in a predictable and orderly fashion. This makes it easier to develop software tools for processing XML. XML's combination of flexibility in its vocabulary and inflexibility in its syntax makes it a much more powerful and versatile language than HTML.
XML and HTML are both derived from another metamarkup language called SGML, which became popular during the 1970s and 1980s. SGML is an extremely powerful language, well suited for creating and maintaining large collections of documents in a format independent of specific software and hardware systems. However, SGML is a very complex language, so it is difficult to understand and implement. As a result, the use of SGML was mainly confined to a few large corporations and government organizations that could afford the huge expense of developing and maintaining SGML-based systems. SGML's biggest success was HTML, which is an application of SGML; that is, a markup language defined according to the syntax rules specified by SGML.
In the early 1990s, Tim Berners-Lee and Anders Berglund at CERN (the European Organization for Nuclear Research) in Geneva developed HTML. Their goal was to create a compact and efficient way of encoding hypertext documents for exchanging scientific information. However, HTML's impact was felt far beyond the scientific community. Since HTML is a simple language to learn and write software for, it facilitated the development of freely available browser software for viewing HTML documents. Thousands of people all over the world learned HTML and used it to create their own Web pages. HTML thus served as a key catalyst in the emergence of the Web as a revolutionary new medium for mass communication.
However, HTML's greatest strength — its compactness and simplicity — was also its biggest weakness. Despite new extensions to the language in its subsequent versions, HTML was just not powerful enough to support all the complex functionality that browser vendors wanted to implement. The various browser vendors created their own additions and modifications to the language, and the resulting proliferation of incompatible standards became a serious threat to the growth of the Web.
In 1994, the W3C was formed, with participation from both corporate and academic organizations. Its mission was to oversee the development of new standards and technologies for the Web. The W3C's initial efforts focused on adding new features to HTML and creating the specifications for new versions of the language, the last of them being HTML 4.0, released in 1997. However, as the limitations of HTML became more apparent, the W3C decided on an alternative approach.
In 1996, the W3C began work on creating a simplified version of SGML specifically for use on the Web. This "lite" version of SGML was intended to retain most of SGML's strengths while eliminating aspects of the language that were unduly complex or had not proved very useful in practice. The resulting language was XML, and its first official specification was released by the W3C in February 1998.
XML was readily adopted by developers all over the world who had long needed a convenient format for creating structured documents but who had been unwilling to adopt SGML on account of its complexity. In the years following its release, XML has become an extremely popular standard for creating structured documents and data, for Web applications and otherwise. Hundreds of books are devoted to XML, and dozens of free and commercial tools for creating, editing, and manipulating XML documents are available. The popularity of XML has also spawned a whole family of related standards, such as XSL (for formatting and transforming XML documents), XLink (for linking XML documents), and XPath (for specifying parts of an XML document). Taken together, they promise to define the future of the Web platform.
XML has several convenient features:
- It is extensible.
- It is a non-proprietary (or open) standard.
- Being a text-based format, it is platform independent.
- It is based on Unicode, which makes it well suited for internationalization.
- It is widely supported.
Although users can create an XML document containing any tags they wish, in practice most XML documents are written in a particular XML application; that is, a specific markup language defined according to the rules of XML, with its own fixed set of tags and attributes. MathML is an application of XML in the same way that HTML is an application of SGML. A large number of XML applications have been defined for specialized purposes in different industries and organizations.
Some prominent examples of XML applications are:
- DocBook: for describing documents such as books, articles, and manuals.
- Chemical Markup Language (CML): for describing the structure of molecules.
- Wireless Markup Language (WML): for describing content displayed on wireless devices such as mobile phones. It is part of the Wireless Application Protocol (WAP) specification developed by a consortium of companies including Ericsson, Nokia, and Motorola.
- Scalable Vector Graphics (SVG): for describing two-dimensional graphics. It includes primitives for points, lines, curves, and so on.
- eXtensible Business Reporting Language (XBRL): for describing financial statements produced by companies.
- XSL: for formatting and transforming XML documents.
- XHTML: for aiding the transition from HTML to XML. XHTML is an XML version of HTML developed by the W3C. XHTML has the same tags and attributes as HTML, so that XHTML documents can be displayed in any Web browser.
Copyright © CHARLES RIVER MEDIA, INC., Massachusetts (USA) 2003
Printing of the online version is permitted exclusively for private use. Otherwise this chapter from the book "The MathML Handbook" is subject to the same provisions as those applicable for the hardcover edition: The work including all its components is protected by copyright. All rights reserved, including reproduction, translation, microfilming as well as storage and processing in electronic systems.
CHARLES RIVER MEDIA, INC., 20 Downer Avenue, Suite 3, Hingham, Massachusetts 02043, United States of America