

A Guide to XML





Introduction

HTML vs XML

The main problem with HTML is that, although it is derived from SGML, an extensible language, HTML is a non-extensible language. This means that all the different tag types are fixed and the author cannot introduce new ones to suit his document.

Although there is a Document Type Definition (DTD) for parsing HTML documents in SGML viewers, web browsers do not use one. Instead, the browser contains code to cope with bad HTML coding (e.g. missing tags or improperly nested tags). This has led to the perpetuation of bad habits in HTML coding.

Another problem has been confusion, among authors and the W3C alike, between document structure and document style. This was brought to a head when the W3C ratified the <FONT> tag together with its associated SIZE, COLOR and (worst of all) FACE attributes.

Extensible Markup Language (XML) allows the author to create and define tags which will be applicable to the context of his or her document. Tags are normally defined in a separate DTD file, although the DTD can be incorporated at the start of the XML document, if required.

Tags in HTML range from opening and closing pairs, through single tags, to hybrids. (The paragraph tag is an example of a hybrid, e.g. <P> </P> or <P> .) In XML, every opening tag must have a matching closing tag. An element with no content can be written as a single tag that stands for both, thus: <MYTAG/> . Therefore, the tag <BR/> could be used to represent the HTML tag <BR> .

The parsing engine within an XML editor, browser or viewer discourages bad coding, because it rejects any document that is not "well-formed". An XML parser will, however, accept an XML document without a DTD, as long as the document is well-formed.




XML Databases

XML tags can be used not only to define the structure of a document, but also to identify blocks of text as objects. For example:
   <PERSON>Guglielmo Marconi</PERSON> conducted many of his early experiments with 
   <INVENTION>Wireless</INVENTION> near <LOCATION>Chelmsford</LOCATION>.
Not only does this have applications for search engines, but XML can also be used to hold data. For example, a CD catalogue might look like this:

   <ALBUM ARTISTE="The Corrs" TITLE="Talk on Corners">
       <TRACK>What Can I Do</TRACK>
       <TRACK>So Young</TRACK>
       <TRACK>Only When I Sleep</TRACK>
       <TRACK>Dreams</TRACK>


       <TRACK>Little Wing</TRACK>
   </ALBUM>
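
Data held this way can be read back by any XML parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module and a shortened copy of the catalogue entry; the function name list_tracks is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the catalogue entry (three tracks only).
ALBUM_XML = """
<ALBUM ARTISTE="The Corrs" TITLE="Talk on Corners">
    <TRACK>What Can I Do</TRACK>
    <TRACK>So Young</TRACK>
    <TRACK>Little Wing</TRACK>
</ALBUM>
"""

def list_tracks(xml_text):
    """Return (artiste, title, [track names]) for one ALBUM element."""
    album = ET.fromstring(xml_text)
    # The attributes hold the album details; each TRACK element holds one title.
    track_names = [track.text for track in album.findall("TRACK")]
    return album.get("ARTISTE"), album.get("TITLE"), track_names

artiste, title, tracks = list_tracks(ALBUM_XML)
print(artiste, "-", title)
for number, name in enumerate(tracks, start=1):
    print(number, name)
```

The album details are read from attributes and the track titles from element content, mirroring the way the catalogue entry itself is laid out.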

Font attributes and other formatting can be applied to an XML document by linking it to an Extensible Style Language (XSL) style sheet. For backwards compatibility with HTML, a Cascading Style Sheet (CSS) can be used instead.




XML Applications

XML Applications include Channel Definition Format (CDF), Chemical Markup Language (CML) for rendering molecular structures on web pages, and NASA's Astronomical Instrument Markup Language (AIML) for remotely controlling instrumentation such as the SOFIA infrared telescope. The Resource Description Framework (RDF) provides metadata, i.e. information about the information a document contains. It is expected that Netscape Communicator 5 will use RDF to store bookmark and history information.

Vector Markup Language (VML) and Precision Graphics Markup Language (PGML) are both methods for describing vector graphics. Synchronised Multimedia Integration Language (SMIL) provides a way to integrate text, audio, graphics and animations.

Signed Document Markup Language (SDML) is used to sign, co-sign, endorse and witness on-line documents, while the Extensible Forms Description Language (XFDL) is designed for creating, viewing and completing business forms.

But the biggest applications for XML are considered to be Wireless Markup Language (WML), a standard for web access from mobile phones, and Commerce XML (cXML) for e-commerce.




Browsers and Parsers

XML support was added to Microsoft Internet Explorer 5 and is expected to be included in Netscape Navigator 5. Internet Explorer 4 recognises some XML, as it includes support for Microsoft's CDF.

XML editors with built-in parsers include XMetaL and XML Pro.




Resources

W3C
XML Industry Portal
XML.COM
IBM Developer's XML Zone
Microsoft Web Workshop

WAP Forum
Telecommunications Software and Multimedia Laboratory (Multimediaseminaari) (Finland)
Unwired Planet
Phone.Com




The XML Prologue

XML Version

The XML Prologue (or the americanised Prolog) is the first line that appears in an XML file, and specifies some basic information about the XML file. The prologue is used to let parsers know they are dealing with an XML file, and can contain a version and encoding attribute:

<?xml version="1.0" encoding='UTF-8'?>

The version attribute allows you to specify the version of XML that is being used to author the document, and the encoding attribute specifies the document's character encoding. If you use an XML editor, it should create a prologue automatically.

You can incorporate your DTD in your document (not the most desirable option), in which case the XML Prologue would look like this:

<?xml version='1.0' standalone='yes'?>

No text, not even white space, may appear before the Prologue, and the keywords xml, version, encoding and standalone must be lowercase.




Document Encoding

Document encoding determines which character set the XML document is to use. The (document) encoding attribute is placed in the XML Prologue, thus:

<?xml version="1.0" encoding='UTF-8'?>

UTF-8 encodes each Unicode character as a sequence of one or more 8-bit units. Characters in the ASCII set take a single byte each, so plain ASCII text is already valid UTF-8. This value may also be set to UTF-16, EUC-JP, etc.
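
The effect of the encoding attribute can be tried out with a short script. This is only a sketch, assuming Python's standard xml.etree.ElementTree module; the NAME element and the helper parse_encoded are invented for the illustration.

```python
import xml.etree.ElementTree as ET

def parse_encoded(text, encoding):
    """Write `text` into a tiny document, store it in `encoding`, and parse it back."""
    document = '<?xml version="1.0" encoding="%s"?><NAME>%s</NAME>' % (encoding, text)
    raw_bytes = document.encode(encoding)
    # The parser reads the encoding from the Prologue and decodes the bytes itself.
    return ET.fromstring(raw_bytes).text

utf8_text = parse_encoded("Café", "UTF-8")
utf16_text = parse_encoded("Café", "UTF-16")
print(utf8_text, utf16_text)

# In UTF-8 an ASCII letter occupies one byte; an accented letter needs more.
ascii_length = len("A".encode("UTF-8"))
accented_length = len("é".encode("UTF-8"))
print(ascii_length, accented_length)
```

The bytes on disk differ between the two encodings, but the parsed text is the same, provided the encoding named in the Prologue matches the encoding the file was actually saved in.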




Elements and Attributes

Elements

An element is the mechanism for tagging information in XML. It can define a heading level, an object or data. For example, when authoring a component catalogue, you might want to create an element for each component type, such as a resistor. In XML, that element would be represented as:

<RESISTOR></RESISTOR>

An empty element may also be represented as:

<RESISTOR/>

Note that an element describes an object or structure. It does not describe formatting. If you want to highlight some text in bold or italic, you have to mark that text with a tag of its own. You then use a style sheet (described later) to assign the bold or italic attribute to that tag.

Elements for Valid documents are defined in the Document Type Definition (DTD).




Root Element

The Root Element is the main element of an XML document, and it contains all the other elements in the document:

<root_element>
      <element_one> </element_one>
      <element_two> </element_two>
      <element_three> </element_three>
</root_element>

When creating a new document, you must specify the name for your root element. If you are using a DTD for validation, the name of the root element must also match the DOCTYPE Declaration.
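
As a rough illustration, the sketch below (assuming Python's standard xml.etree.ElementTree module) reads the root element of a small document and shows that a parser refuses a document with two top-level elements. The element names are placeholders.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """
<root_element>
    <element_one> </element_one>
    <element_two> </element_two>
    <element_three> </element_three>
</root_element>
"""

root = ET.fromstring(DOCUMENT)
child_names = [child.tag for child in root]
print(root.tag)
print(child_names)

# A second top-level element breaks the single-root rule, so parsing fails.
try:
    ET.fromstring("<first/><second/>")
    two_roots_accepted = True
except ET.ParseError:
    two_roots_accepted = False
print(two_roots_accepted)
```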




Element Attributes

An attribute adds descriptive information to an element. For example, if you had an element called <RESISTOR> and you wanted to specify its value, you would include a "VALUE" attribute for the element, thus:

<RESISTOR VALUE="47k">

Attributes for Valid documents are defined in the Document Type Definition (DTD).
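
Reading an attribute back is straightforward. The sketch below assumes Python's standard xml.etree.ElementTree module; the TOLERANCE attribute is hypothetical and is used only to show what happens when an attribute is absent.

```python
import xml.etree.ElementTree as ET

resistor = ET.fromstring('<RESISTOR VALUE="47k"/>')

value = resistor.get("VALUE")
# TOLERANCE is not present, so the fallback given here is returned instead.
tolerance = resistor.get("TOLERANCE", "unspecified")
all_attributes = dict(resistor.attrib)
print(value, tolerance, all_attributes)

# Attribute values must be quoted; an unquoted value is rejected by the parser.
try:
    ET.fromstring("<RESISTOR VALUE=47k/>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False
print(unquoted_accepted)
```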




PCDATA (Parsed Character Data)

Sections of text within XML documents are represented as PCDATA sections, or parsed character data. PCDATA sections are read by the XML parser, and therefore, any markup contained in the PCDATA will be treated as XML, not as text.

For example, if you wanted to include a reference to a tag in your XML document, such as:

The paragraph tag in HTML is <P>

You would need to use the &lt; and &gt; entities to display the less-than and greater-than symbols correctly, i.e. The paragraph tag in HTML is &lt;P&gt;

If you want an element to contain text, its declaration in the DTD must include #PCDATA.
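
Escaping by hand is error-prone, so it is usually left to a library. The sketch below assumes Python's standard xml.sax.saxutils and xml.etree.ElementTree modules; the NOTE element is invented for the example.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

sentence = "The paragraph tag in HTML is <P>"

# escape() replaces &, < and > with the corresponding entity references.
escaped = escape(sentence)
print(escaped)

# Once escaped, the sentence is safe to place inside an element as parsed text.
document = "<NOTE>" + escaped + "</NOTE>"
parsed_text = ET.fromstring(document).text
print(parsed_text)
```

The parser turns the entity references back into the original characters, so the text read from the element is identical to the sentence that was escaped.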




CDATA (Character Data)

CDATA Sections allow you to incorporate text data into your document that is not parsed. For example, if you wanted to include an HTML tag in your XML document as ordinary text, you would otherwise need to escape the < and > characters so that they would not be read as XML tags. Inside a CDATA section they can be written as they are.

CDATA sections begin with <![CDATA[ and end with ]]> .

The contents of a CDATA section may contain any characters other than "]]>".
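
A quick way to see the difference is to parse the same sentence with and without a CDATA section. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree module and an invented NOTE element.

```python
import xml.etree.ElementTree as ET

WITH_CDATA = "<NOTE><![CDATA[The paragraph tag in HTML is <P>]]></NOTE>"
WITHOUT_CDATA = "<NOTE>The paragraph tag in HTML is <P></NOTE>"

# Inside the CDATA section the <P> is passed through as plain text.
cdata_text = ET.fromstring(WITH_CDATA).text
print(cdata_text)

# Without it, <P> is read as an opening tag that is never closed.
try:
    ET.fromstring(WITHOUT_CDATA)
    plain_version_accepted = True
except ET.ParseError:
    plain_version_accepted = False
print(plain_version_accepted)
```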




Entities

An entity is a type of short-hand for representing data in an XML document. For example, if you wanted to include a copyright statement you would include the following in your DTD:

<!ENTITY COPY "Copyright 1999 Philip Jefferson">

Then insert &COPY; in your document where you want the copyright statement to appear. The parser replaces it with:

Copyright 1999 Philip Jefferson

Because the ampersand (&) introduces an Entity reference, you must use &amp; in your document to represent the symbol itself. The same is true for the less-than (<) and greater-than (>) symbols which denote tags in XML.

Entity references are written as an ampersand, followed by the name of the entity and a semicolon, such as:

&amp; &lt; &gt;

You can also use an Entity reference to insert the contents of another file into your document:

<!ENTITY MORETEXT SYSTEM "more_txt.xml">

In this case, inserting &MORETEXT; inserts the contents of more_txt.xml at that point in your document.
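
Entity expansion can be observed by declaring the entity in an internal DTD and parsing the document. The sketch below assumes Python's standard xml.etree.ElementTree module; the PAGE, FOOTER and TITLE elements are invented for the example, and the external-file form is left out because that parser is not expected to fetch external entities.

```python
import xml.etree.ElementTree as ET

# The entity is declared inside the DOCTYPE, so the document is self-contained.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE PAGE [
    <!ENTITY COPY "Copyright 1999 Philip Jefferson">
]>
<PAGE>
    <FOOTER>&COPY;</FOOTER>
    <TITLE>Fish &amp; Chips</TITLE>
</PAGE>
"""

page = ET.fromstring(DOCUMENT)
footer = page.find("FOOTER").text   # the declared entity, expanded
title = page.find("TITLE").text     # the built-in &amp; entity, expanded
print(footer)
print(title)
```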




Comments

Comments are represented in XML with the same syntax as HTML:

<!-- Comment text -->




Well-formed XML

The term "Well-Formed" is applied to XML documents that adhere to certain conventions, allowing them to be parsed, or utilized by other applications.

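In order to be considered "Well-Formed" a document must:

  1. Contain exactly one root element, which encloses all the other elements.

  2. Close every element, either with a matching end tag or with the empty-element form, e.g. <BR/>.

  3. Nest elements properly, so that no two elements overlap.

  4. Enclose every attribute value in quotes.

  5. Use the same case in the start and end tag of each element, as XML names are case-sensitive.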

Any document meeting those requirements will be considered "Well-Formed". Your XML editor/parser should open any well-formed document.
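
A rough way to test whether a document is well-formed is simply to hand it to a parser and see whether it complains. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample fragments are made up for the illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

samples = {
    "paired tags": "<P>text</P>",
    "empty element": "<BR/>",
    "missing end tag": "<P>text",
    "overlapping tags": "<B><I>text</B></I>",
    "unquoted attribute": "<RESISTOR VALUE=47k/>",
    "mismatched case": "<P>text</p>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, outcome in results.items():
    print(name, outcome)
```

Only the first two fragments are accepted. Note that this checks well-formedness only; it says nothing about whether the document obeys a DTD.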




Validation

The term validation refers to checking the contents of an XML file against the rules specified in a Document Type Definition associated with the document.

In order for an XML document to be valid, the document must be well-formed, and all of the document content must match the rules specified in the DTD.




Document Type Definition (DTD)

A Document Type Definition (DTD) is a set of rules defining the structure and grammar of an XML document. When you need to strictly enforce rules in a document, a DTD is the mechanism that you would use to do so.

For example, if you had a catalogue, and you wanted to state that every item must also have a price element, you could define this in a DTD.

A simple DTD might look something like this:

<!-- A Sample Product Catalogue DTD -->
<!ENTITY AUTHOR "John Doe">
<!ENTITY COMPANY "Kidz Clothes">
    <!ELEMENT CATALOGUE (PRODUCT+)>
    <!ELEMENT PRODUCT (DESCRIPTION+, PRICE+, NOTES?)>
       <!ATTLIST PRODUCT
                 PARTNUM CDATA #IMPLIED
                 SIZE (Small | Medium | Large | XtraLarge) "Large">
    <!ELEMENT DESCRIPTION (#PCDATA)>
    <!ELEMENT PRICE (#PCDATA)>
    <!ELEMENT NOTES (#PCDATA)>

The DTD shown above includes element declarations, attribute declarations, and entity declarations. You may also notice that the DTD uses its own syntax, which is a holdover from SGML.

The + shows that the element must appear one or more times, e.g. a CATALOGUE element must contain at least one PRODUCT element. (A * would allow any number of occurrences, including none.)

The ? shows that the element is optional, e.g. a NOTES element may or may not be included in a PRODUCT element.
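
To make the rules concrete, the sketch below checks a catalogue against them by hand. It is not a DTD validator: Python's standard library parsers do not validate, so the function check_catalogue (a made-up name) simply re-states the occurrence rules and the SIZE list from the sample DTD in ordinary code, and it ignores the order of the child elements. The sample catalogues are invented.

```python
import xml.etree.ElementTree as ET

# The enumerated SIZE values and the default ("Large") from the sample DTD.
ALLOWED_SIZES = {"Small", "Medium", "Large", "XtraLarge"}

def check_catalogue(xml_text):
    """Return a list of rule violations; an empty list means none were found."""
    catalogue = ET.fromstring(xml_text)
    if catalogue.tag != "CATALOGUE":
        return ["root element is not CATALOGUE"]
    problems = []
    products = catalogue.findall("PRODUCT")
    if not products:                                   # PRODUCT+
        problems.append("CATALOGUE contains no PRODUCT")
    for number, product in enumerate(products, start=1):
        for required in ("DESCRIPTION", "PRICE"):      # DESCRIPTION+, PRICE+
            if product.find(required) is None:
                problems.append("PRODUCT %d has no %s" % (number, required))
        if len(product.findall("NOTES")) > 1:          # NOTES?
            problems.append("PRODUCT %d has more than one NOTES" % number)
        size = product.get("SIZE", "Large")
        if size not in ALLOWED_SIZES:
            problems.append("PRODUCT %d has an unlisted SIZE: %s" % (number, size))
    return problems

GOOD = """<CATALOGUE>
    <PRODUCT PARTNUM="K100" SIZE="Small">
        <DESCRIPTION>T-shirt</DESCRIPTION>
        <PRICE>4.99</PRICE>
    </PRODUCT>
</CATALOGUE>"""

BAD = """<CATALOGUE>
    <PRODUCT SIZE="Huge">
        <DESCRIPTION>Socks</DESCRIPTION>
    </PRODUCT>
</CATALOGUE>"""

print(check_catalogue(GOOD))
print(check_catalogue(BAD))
```

A validating parser would derive all of these checks from the DTD itself, which is the point of writing the rules down in a DTD rather than in program code.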




Enumerated Attributes

An enumerated attribute is an attribute whose permitted values form a pre-defined list (e.g. "Small", "Medium" or "Large") declared in the document's Document Type Definition (DTD).




DOCTYPE Declaration

DOCTYPE

The Document Type Declaration, or DOCTYPE declaration, is used to identify the DTD which declares a set of rules for your XML document. The DOCTYPE declaration can also contain a set of rules within the XML document itself.

The DOCTYPE declaration looks something like this:

<!DOCTYPE ROOT_ELEMENT SYSTEM "mydtd.dtd">

The name specified in the DOCTYPE declaration must match that of your document's root element. Additionally, you must specify the SYSTEM identifier that points to your DTD, and optionally you can specify a PUBLIC identifier.
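
The match between the DOCTYPE name and the root element can be checked with a few lines of script. This sketch assumes Python's standard xml.dom.minidom module, which records the DOCTYPE declaration but is not expected to fetch or apply the DTD it points to.

```python
from xml.dom import minidom

DOCUMENT = '<!DOCTYPE ROOT_ELEMENT SYSTEM "mydtd.dtd"><ROOT_ELEMENT/>'

document = minidom.parseString(DOCUMENT)
doctype = document.doctype

doctype_name = doctype.name          # the name given after <!DOCTYPE
system_id = doctype.systemId         # the SYSTEM identifier
root_name = document.documentElement.tagName
names_match = doctype_name == root_name
print(doctype_name, system_id, root_name, names_match)
```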




SYSTEM Identifier

The SYSTEM Identifier is used to point to a Document Type Definition for resolving External Entities, and for defining the structure of your document.

The SYSTEM Identifier can contain one of the following:

  1. A File Name. If a file name is specified, the DTD must be in the same directory as the XML file.

  2. A "file://" URL. In this case, you must specify a full path to the location of the DTD file.

  3. A WWW URL. This can be specified in the form of "http://www.myserver.com/my.dtd".

To use a DTD with your document, you must specify a SYSTEM Identifier.




PUBLIC Identifier

The PUBLIC Identifier is used by XML Parsers to generate alternative URIs for resolving external entity references.

The PUBLIC Identifier is mostly used for backwards compatibility with SGML.




Embedded XML

XML 'data islands' can be embedded in an HTML page, where they can be accessed via scripts or bound to HTML elements.

If the XML data is small or changes infrequently, the XML source can be included within the HTML page:

    <XML id="Sortdata">
        <LIBRARY>
            <BOOK>
                <TITLE>Goodwood - The Sussex Motor Racing Circuit</TITLE>
                <AUTHOR>Peter Garnier</AUTHOR>
                <PUBLISHER>Beaulieu Books</PUBLISHER>
                <CATEGORY>Motorsport</CATEGORY>
            </BOOK>

            <BOOK>
            </BOOK>
        </LIBRARY>
     </XML>

Alternatively, the <XML> tag can refer to an external XML data source, using its SRC attribute:

     <XML ID="RemoteData" SRC="http://myData.xml">
     </XML>




Cascading Style Sheets

For backwards-compatibility with HTML 4, Cascading Style Sheets can be used. The style sheet is referenced with the <?xml-stylesheet?> processing instruction and its associated href, type and charset attributes.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="glossary1.css" type="text/css" charset="UTF-8"?>
<XML>
0.1 Glossary of Terms

The style sheet file is created with a .css extension. An example is shown below:

    A.link {
	color: blue;
	text-decoration: underline;
    }
    A.visited {
	color: purple;
	text-decoration: underline;
    }
    A.active {
	color: red;
	text-decoration: underline;
    }
    P.Body {
	display: block;
	text-align: left;
	text-indent: 0.000000pt;
	margin-top: 0.000000pt;
	margin-bottom: 0.000000pt;
	margin-right: 0.000000pt;
	margin-left: 0.000000pt;
	font-size: 12.000000pt;
	font-weight: normal;
	font-style: normal;
	color: #000000;
	text-decoration: none;
	vertical-align: baseline;
	text-transform: none;
	font-family: "Times";
    }




The Channel Definition Format (CDF)

The Channel Definition Format is an XML application but has a .cdf extension. It is used for producing Microsoft-style Active or "Push" Channels. These are channels which push 'active' web content to a user's browser. Active content is that which is updated on a regular basis, such as news items or stock exchange prices. With CDF, content can be pushed to the user's browser, desktop or screensaver. The user also has the option to have news of updates emailed to him or herself.

To subscribe to a channel, a user clicks on a link to the CDF file. This starts the "Add Channel" wizard, which adds the channel to the Channel Bar on the user's browser and runs the subscription process itself. Note that subscription data is held on the client browser, not on the server.

A typical CDF file is shown below. Note the key tags <CHANNEL>, <TITLE>, <ABSTRACT> and <ITEM>.

<?xml version="1.0"?>
<!DOCTYPE CHANNEL SYSTEM "http://www.w3c.org/Channel.dtd">

   <CHANNEL HREF="http://www.mychannel.co.uk/channel.html" BASE="http://www.mychannel.co.uk">
      <LOGIN DOMAIN="mychannel.co.uk" METHOD="BASIC" />
      <TITLE>Fred's Whacky Web Channel</TITLE>
      <ABSTRACT>This is Fred's Whacky Guide to the World Wide Web.</ABSTRACT>
      <ICON HREF="images/channel.gif" />
      <SCHEDULE>
         <INTERVALTIME DAY="1" />
      </SCHEDULE>
      <ITEM HREF="intro.html">
         <TITLE>Introduction</TITLE>
         <ABSTRACT>This is a summary of what you will find on this channel.</ABSTRACT>
         <USAGE VALUE="Channel" />
         <ICON HREF="images/intro.gif" />
      </ITEM>
         <CHANNEL HREF="more/stuff.html">
            <TITLE>More Stuff from Fred's Whacky Web Channel</TITLE>
            <ABSTRACT>Printed, Online and other resources about the World Wide Web.</ABSTRACT>
            <ICON HREF="images/more.gif" />
            <SCHEDULE>
               <INTERVALTIME HOUR="3" />
            </SCHEDULE>
            <ITEM HREF="books.html">
               <TITLE>Books</TITLE>
               <ABSTRACT>Books and journals all about the WWW.</ABSTRACT>
               <USAGE VALUE="Email" />
            </ITEM>
            <ITEM HREF="links.html">
               <TITLE>Links</TITLE>
               <ABSTRACT>Links to some great resources on the WWW.</ABSTRACT>
               <USAGE VALUE="ScreenSaver" />
            </ITEM>
         </CHANNEL>
   </CHANNEL>
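
Because a CDF file is ordinary XML, its contents can be listed with any XML parser. The sketch below assumes Python's standard xml.etree.ElementTree module and a cut-down copy of the channel; the function name list_items is invented for the example.

```python
import xml.etree.ElementTree as ET

# A cut-down copy of the channel, without the DOCTYPE, icons and schedules.
CDF = """<?xml version="1.0"?>
<CHANNEL HREF="http://www.mychannel.co.uk/channel.html" BASE="http://www.mychannel.co.uk">
   <TITLE>Fred's Whacky Web Channel</TITLE>
   <ITEM HREF="intro.html">
      <TITLE>Introduction</TITLE>
      <USAGE VALUE="Channel" />
   </ITEM>
   <CHANNEL HREF="more/stuff.html">
      <TITLE>More Stuff from Fred's Whacky Web Channel</TITLE>
      <ITEM HREF="books.html">
         <TITLE>Books</TITLE>
         <USAGE VALUE="Email" />
      </ITEM>
   </CHANNEL>
</CHANNEL>
"""

def list_items(channel):
    """Return (title, href, usage) for every ITEM, including those in sub-channels."""
    found = []
    for item in channel.iter("ITEM"):
        usage = item.find("USAGE")
        usage_value = usage.get("VALUE") if usage is not None else None
        found.append((item.findtext("TITLE"), item.get("HREF"), usage_value))
    return found

channel = ET.fromstring(CDF)
channel_title = channel.findtext("TITLE")
items = list_items(channel)
print(channel_title)
for entry in items:
    print(entry)
```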





Wireless Markup Language (WML)

The new generation of mobile phones incorporates 'mini-browsers' which allow the user to 'surf' web-like content via Wireless Application Protocol (WAP) servers on the network provider's system. WAP is a layered system of protocols using existing telecoms and internet infrastructure, including HTTP.

Navigation need not be restricted to the network provider's server. The network may also have a 'gateway' to the Internet, where WML content on other HTTP servers can be accessed.

The native language for WAP is the Wireless Markup Language (WML). Many tags in WML will be familiar to HTML authors, such as <BIG>, <SMALL>, <B> and <U>.

There are no <H1> or <H2> tags. With the limited display area of current mobile phones, there is little need for these tags, and coupled with the relatively slow data rates, WML page content tends to be brief.

A WML 'page' is usually split into a number of frames or 'cards', with only one card being visible on the phone's display at any given time. Links similar to HTML '#' anchors enable the user to navigate between cards. A WML page containing a number of cards is known as a 'deck'.

Traditional web content is generally too large for WAP. As most WAP users are on the move, WAP content tends to be 'active', consisting of news, sport, stock prices, weather and travel information. This may seem similar fare to the 'Information Services' already provided on some GSM and paging services, but WAP now allows the user to browse further afield for information.

Data rates will improve when the Universal Mobile Telecommunications System (UMTS) comes on stream, and with enhanced memory and displays, it will become viable to put mobile telephone user manuals on the manufacturer's or network provider's WAP site.


<?xml version="1.0"?>
<!DOCTYPE WML PUBLIC "-//WAPFORUM//DTD WML 1.0//EN" "http://www.wapforum.org/DTD/wml.xml">

<WML>
   <CARD TITLE="Weather" NAME="weather">
      <DO TYPE="PREV" LABEL="Back">
         <PREV/>
      </DO>

      <BR ALIGN="CENTER"/><BIG><B>Weather Forecast</B></BIG>
      <BR/>
         <BR/><B>AM:</B> <TAB/>Cloudy, sunny spells.
         <BR/><B>PM:</B> <TAB/>Possibility of showers.
         <BR/>
      <A>Tomorrow<GO URL="#tomorrow"/></A>
      <A>Outlook<GO URL="#outlook"/></A>
      <A>News<GO URL="http://www.news.com/news"/></A>
   </CARD>

   <CARD NAME="outlook">
      <BR ALIGN="CENTER"/><BIG><B>Weather Outlook</B></BIG>
      <!--  Code goes here  -->
   </CARD>
</WML>

HTTP servers must register WML files with the MIME type text/vnd.wap.wml, not text/xml. Compiled WML files should be registered as application/vnd.wap.wmlc. Although they cannot read files of MIME type text/html, some WML browsers can pick out the HREF links within them. The Unwired Planet and Nokia 7110 browsers cannot handle files larger than 1442 bytes, though other WML browsers might.
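
Two simple checks are worth running on a deck before it is published: that it fits within the size limit, and that every '#' link points at a card which exists. The following is a sketch only, assuming Python's standard xml.etree.ElementTree module and a cut-down, hypothetical deck; the names MAX_DECK_BYTES and broken_card_links are invented, and the limit is the 1442-byte figure quoted above.

```python
import xml.etree.ElementTree as ET

MAX_DECK_BYTES = 1442   # limit quoted for the Unwired Planet and Nokia 7110 browsers

# A cut-down deck: the 'tomorrow' card is deliberately missing.
DECK = """<WML>
   <CARD TITLE="Weather" NAME="weather">
      <A>Tomorrow<GO URL="#tomorrow"/></A>
      <A>Outlook<GO URL="#outlook"/></A>
      <A>News<GO URL="http://www.news.com/news"/></A>
   </CARD>
   <CARD NAME="outlook">
      <BIG><B>Weather Outlook</B></BIG>
   </CARD>
</WML>
"""

def broken_card_links(deck_text):
    """Return the '#' links in a deck that do not match the NAME of any card."""
    deck = ET.fromstring(deck_text)
    card_names = {card.get("NAME") for card in deck.findall("CARD")}
    broken = []
    for go in deck.iter("GO"):
        url = go.get("URL", "")
        # Links to other servers are ignored; only in-deck links are checked.
        if url.startswith("#") and url[1:] not in card_names:
            broken.append(url)
    return broken

deck_size = len(DECK.encode("utf-8"))
fits_limit = deck_size <= MAX_DECK_BYTES
broken = broken_card_links(DECK)
print(deck_size, fits_limit, broken)
```

Card names are compared exactly, including case, so a link to "#outlook" would not find a card named "Outlook".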





Handheld Device Markup Language (HDML)

The Handheld Device Markup Language (HDML) was created by Unwired Planet and has been submitted to the W3C as a standard for communicating with Personal Digital Assistants (PDAs), organisers, and other portable data devices. Since these devices are increasingly becoming integrated with mobile phones, it is losing ground to WML. In fact, Version 1.2 of Unwired Planet's micro-browser supports WML 1.1.

HDML looks like XML but is not: it breaks many of XML's rules. For instance, it doesn't declare a DTD, and some tags are neither paired nor end with a '/', e.g. <LINE>, <WRAP> and <TAB>.

HDML was designed for small displays, typically a 12 character by 3 line display. Like WML, pages consist of 'cards' in a 'deck'. However, HDML files have to be restricted to 1400 bytes, otherwise the Unwired Planet micro-browser is swamped.


Created:   1st September 1999

