This chapter is from the book

XML and SVG

We've talked a lot about how SVG is written in XML, based on XML, etc., and you're probably wondering whether you need to know XML to write or edit SVG code. In fact, SVG is a type of XML, and knowing a bit of XML document structure and syntax will help you to understand the logic, conventions, and limitations you might encounter in SVG. In this chapter, we give you a little taste of XML. If you really get interested, we go further into XML in Chapter 10.

So, what is XML? XML is an extensible and flexible markup language. What does that mean? Let's find out.

What is Markup?

HTML (Hypertext Markup Language) is an example of a markup language that you may be familiar with. In an HTML document, if you run across something that looks like this:

<p>

you know that it is a markup tag that indicates a new paragraph. HTML has built-in markup tags, which are mostly for formatting, such as <p> for paragraph and <b> for bold. HTML markup tells browsers, processors, and editors how to format and organize content. It consists of start and end tags, processing instructions, etc. Markup describes content.

Notice the angle brackets around the <p>. A good way to think of markup is that anything between angle brackets is markup <This is markup>. Anything described by that markup is content.

<p>This is content</p>

What is Extensible?

So what about XML? Well, like HTML, XML has two types of building blocks, markup and content. Content is the character or image data that resides inside the markup. What makes XML so different and so powerful is that, unlike HTML, which comes with built-in markup tags, XML is extensible, which means that you can define your own markup. That means the potential for creating markup is relatively unlimited. You can create markup that describes the content of your own document rather than being limited to the tags that HTML gives you. Or you can use preset definitions, called DTDs, or document type definitions. For now, let's take a look at some really basic XML code (Example 1–1).

Example 1–1 Script 1–2

1.  <?xml version="1.0"?>
2.  <document>
3.  <title>My first experiment</title>
4.  <sentence>This is fun</sentence>
5.  </document>

This is a very simple XML document. What do we see here?

  1. The processing statement, or XML declaration. We'll go into this a little later in the chapter.

  2. An opening tag named <document> that we created. That means we did not get it from any other source than ourselves.

  3. An opening tag named <title> that we created, content for that tag (My first experiment), and closing </title> tag.

  4. An opening tag named <sentence> that we created, content for that tag (This is fun), and closing </sentence> tag.

  5. Closing </document> tag.

Elements

So what are all these opening and closing tags about? Well, in XML, we build documents with elements. An element has an opening tag, <tag>, and a closing tag, </tag>. The element includes the tags and what is in between them. In the above code snippet,

<title>My first experiment</title> 

<title> and </title> are the opening and closing tags, and <title>My first experiment</title> is the entire title element. Note that a closing tag differs from an opening tag in that it has a forward slash (/) inserted into it before the tag name.

In the above code snippet, the element <document> is the root element. Every XML document needs one root element. The root element contains the other elements inside of it. Another way of saying this is that the other elements are nested within the root element <document>. Some basic markup symbols and their uses are listed in Table 1–1.

TABLE 1–1 Some Basic XML Markup Symbols

Symbol     Use
?          begins and ends a processing instruction
<          delimiter that starts the beginning of a tag
>          delimiter that ends a tag
</         starts the beginning of an end tag


Processing Statement

Now to get to that mysterious first line of code:

<?xml version="1.0"?>

This line of code is the XML declaration. It is part of the prolog for the XML document. It must contain the version number but can also contain other instructions, such as targeting a specific character set (see Chapter 12 on Unicode) or stating whether the XML document is standalone (which means it "stands alone" and does not reference an external document) or not standalone (it references an external document). This is particularly important to note because most SVG documents reference an external document, usually the SVG DTD published by the W3C standards body.

The above code statement looks like, and is often called, a processing instruction because it tells the computer to do something. (Strictly speaking, the XML specification treats the XML declaration as a construct of its own that shares the processing instruction syntax.) In this case, it tells the browser and related software that this is an XML file and what version of XML it is. At this writing, there is only one version of XML, but in the future that is sure to change! Processing instructions use angle brackets and question marks <?processing instruction?> to begin and end the line of processing instruction code.
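As a rough sketch of the optional parts described above, here is our simple document again with a character set and a standalone setting added to the declaration. The values UTF-8 and yes are our own choices for illustration; the three parts must appear in this order (version, then encoding, then standalone).

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<document>
<title>My first experiment</title>
<sentence>This is fun</sentence>
</document>
```

Because this document does not reference any external document, standalone="yes" is an accurate description of it.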

Well Formed or Valid?

An XML document can be well formed only or both well formed and valid. What does this mean? Let's look at well-formed documents first.

For an XML document to be well formed, it must follow some simple syntax rules:

  • Each opening tag must have a corresponding closing tag.

  • XML is case-sensitive; thus, <title>, <Title>, and <TITLE> are three different tags.

  • There must be at least one element in a well-formed XML document.

  • There can be only one root element.

  • Elements and their tags must nest correctly.

  • Element names must conform to the following naming rules:

    • They must start with a letter or an underscore.

    • They can contain letters, digits, periods (.), underscores (_), or hyphens (-).

    • Whitespace is not allowed.

    • They cannot begin with the sequence xml, in any combination of uppercase and lowercase, because names that start this way are reserved.

Unlike HTML, XML is, unfortunately, totally unforgiving of syntax and spacing errors. Triple-check your XML code. XML is case-sensitive, so pick one style of naming and stick with it. By convention, XML tags are usually written in lowercase. All XML parsers check to make sure that start and end tags exist and check delimiters and characters, as well.

Well formed means the document and syntax structure is correct, according to the above bulleted list.
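To make the rules above concrete, here are a few fragments that break them, followed by one that obeys them. These are separate snippets for illustration, not one document, and the broken ones are our own inventions.

```xml
<!-- Not well formed: the closing tag's case does not match the opening tag -->
<title>My first experiment</Title>

<!-- Not well formed: <sentence> is never closed -->
<document>
<sentence>This is fun
</document>

<!-- Not well formed: a name cannot contain whitespace or start with a digit -->
<first name>Jane</first name>
<1stName>Jane</1stName>

<!-- Well formed: one root element, matching case, every tag closed -->
<document>
<title>My first experiment</title>
<sentence>This is fun</sentence>
</document>
```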

Let's experiment with this now. Open a simple text editor, such as Notepad for Windows or BBEdit for Macs.

Type in the following code, exactly as you see it:

<?xml version="1.0"?>
<document>
<title>My first experiment</title>
<sentence>This is fun</sentence>
</document>

Now save the file to your hard drive as myfirst.xml, making sure you save it with the .xml extension. To do this in Windows, you will need to change the file type dropdown to All Files. It is important to save this as an .xml file; otherwise, the browser won't treat it as XML code.

Now open a browser window. You do not need to be connected to the Internet, so you can choose Work Offline. Either click File, Open, Browse and find myfirst.xml on your drive, or type the file path in the location bar.

If you open the file in IE, you will see the code colored to denote syntax (Figure 1–5, left). Congratulations! You have a well-formed XML document!

NOTE

If you open the file in Netscape, you will see the text content alone, without the markup (Figure 1–5, right). This is correct; you are still on track. Netscape doesn't show the code tree the way IE does.

FIGURE 1–5 The left image is myfirst.xml in IE 5; the right image is myfirst.xml in Netscape 6.


Now let's see what happens when we deliberately make a mistake in coding. Go to your Notepad file and delete the closing </document> tag. Save the file as myFirstError.xml, again making sure that you save it with the .xml extension. Now open the file in a browser and see what happens.

Note that you get an error telling you exactly what you did wrong in both Netscape and IE! Pretty darned cooperative, aren't they? Sometimes fixing errors is that easy, sometimes not. Often, you have to hunt around a bit for what you did wrong in the code. Usually in IE 5 or Netscape 6, you will get at least a line number where the error is supposed to have occurred.

Now, put back the </document> tag to end the code properly. Save and view it. All is well.

Now let's get adventuresome here. Let's add another element. Open your file in your text editor of choice and, after the code line:

<title>My first experiment</title>

Press Enter and type in:

<author>
        <firstName>Jane</firstName>
        <lastName>Jones</lastName>
</author>

Notice that we've added a new element, but that something is different here. This new element, author, has two elements nested within it: firstName and lastName. Each nested element has an opening and closing tag, and the entire author element ends with the </author> closing tag. In XML, we say that author is the parent element of firstName and lastName, and that firstName and lastName are child elements of the element author.

NOTE

With nested elements, each child element must begin and end completely within the parent element.
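The nesting rule in the note above is easiest to see side by side. In this sketch, the second version is a deliberately broken variation of our own: lastName is still open when author closes, so the two elements overlap and a parser will reject the document.

```xml
<!-- Correct: firstName and lastName open and close inside author -->
<author>
        <firstName>Jane</firstName>
        <lastName>Jones</lastName>
</author>

<!-- Incorrect: lastName is still open when author closes -->
<author>
        <firstName>Jane</firstName>
        <lastName>Jones
</author>
</lastName>
```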

We're going to add one more thing: a comment. A comment is a piece of code that does nothing; it is there only for you to describe or annotate the code. It is quite useful to comment your code, because often you will do something in the code and forget why you did it! It is also very helpful when viewing another person's code to be able to see his or her comments.

A comment looks like this:

<!-- This is a comment-->

Open up your file again, and, below the line of code:

<title>My first experiment</title>

Type in:

<!-- firstName and lastName are child elements nested within the parent author element -->

Save the file, and view it as usual. You will see that, in Internet Explorer, you can see the comment as part of the code, but in Netscape, you still see only the content of the code, not the markup or comment.

If you want, play around with this code some more until you get comfortable with the "well-formed" concept. You can edit a Notepad file while it is open. Just remember to save it by going to File, Save. Play around! Try adding an element or two or some other content. To view the newly edited version in IE or Netscape, just click the Refresh button on the toolbar after saving your file. This will reload the updated file.

Now that you're pretty comfortable with well-formed documents, we're going to up the ante a little bit. Most XML documents need to be valid in addition to being well formed. What defines a valid XML document? Quite simply, a valid XML document must include or reference a DTD and must follow the rules that the DTD declares.

DTD

An important aspect of XML structure is the XML DTD. To understand the DTD a bit, let's go back to our HTML example. HTML code contains a lot of markup that is already defined, such as our favorite, <p>. How does an HTML parser know that <p> is a legitimate tag and where it is allowed to appear? Simple, that information is defined in HTML's DTD. HTML's DTD declares the p element and what it may contain; the browser then supplies the formatting, starting a new paragraph every time it runs across <p>.

As we said before, XML is extensible. That means that we define our own elements and markup, and it also means that, if we want our XML documents to be valid, we must define (or borrow) a DTD for them. The DTD holds all of the allowable parameters for the XML file that references it. All valid XML files must have a DTD.

Let's say that you want to build a house. The blueprint of the "house" would be the DTD. The DTD defines what is and what is not allowed in the XML document that references it. This is like saying you are going to build a house and allow the following: doors, walls, a roof, and a floor. (Of course, this is way too skimpy for a real house, but you get the idea.) So the door element would be one of the elements in the house, defined by us.

Once a DTD is named and the elements are specified, you can define specific characteristics of those elements, which are called attributes. Attributes can describe color, height, width, etc. In our house example, the door element might have an attribute describing its material, with a value of wood, metal, glass, aluminum, or rubber. So think of the DTD as the blueprint, an element as a door, and wood as the value of one of the door's attributes.

How does the valid XML document read the DTD? In one of two ways. The DTD can either be internal, which means that it is written into the XML document, or it can be external, in which case, you must include a reference to the external DTD.

What does a DTD look like? The following is a partial example of how a DTD is set up. We are using our "house" example.

<!DOCTYPE house [
<!ELEMENT house (doors,walls,roof,floor)>
<!ELEMENT doors (#PCDATA)>
<!ELEMENT walls (#PCDATA)>
<!ELEMENT roof (#PCDATA)>
<!ELEMENT floor (#PCDATA)>
]>

We used the idea of house as the DTD. Obviously, a rabbit is not an allowable item in a blueprint of a house. A rabbit is an animal. So if our DTD is about a house, and we don't want to include a rabbit in it, then in our valid XML file, we wouldn't be able to include a rabbit. In other words, everything that appears in a valid XML document must be declared in a referenced or inline DTD.
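As a sketch of how the blueprint and the house fit together, here is a complete document that carries the house DTD internally and conforms to it. The text inside each element (two, four, and so on) is made up purely for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE house [
<!ELEMENT house (doors,walls,roof,floor)>
<!ELEMENT doors (#PCDATA)>
<!ELEMENT walls (#PCDATA)>
<!ELEMENT roof (#PCDATA)>
<!ELEMENT floor (#PCDATA)>
]>
<house>
<doors>two</doors>
<walls>four</walls>
<roof>shingle</roof>
<floor>oak</floor>
</house>
```

If you added <rabbit>Flopsy</rabbit> inside the house element, the document would still be well formed, but a validating parser should report it as invalid, because no rabbit element is declared in the DTD.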

Note that the above code includes no attributes.

DTD Code

Take a look at the DTD code above.

First, DOCTYPE is the document type declaration, not to be confused with the DTD, or document type definition. The DOCTYPE line names the root element and either contains the DTD, as it does here, or points to it.

Again, each element you are going to use for your XML document must be declared in the DTD that it references. The element declaration must start with an exclamation point (!), and the name of the element must start with a letter or underscore character. The ! defines the instruction to the browser or processor that this is an ELEMENT and not just a word. The name in the DOCTYPE must match the root element of a valid document, and it is a good idea, though not necessary, to make that element the subject of the first ELEMENT statement. ELEMENTS are written as !ELEMENT.

DTD Element Tags

<          Start delimiter
</         End delimiter
!ELEMENT   Element declaration (all caps necessary)
>          Close delimiter
/>         Empty Tag (mostly used as a placeholder)
!ATTLIST   Attribute declaration (all caps necessary)
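
The !ATTLIST row in the table above can be illustrated with a short sketch based on the house example. The attribute name material is hypothetical; we made it up for this illustration, along with the choice of wood as the default value.

```xml
<!-- "material" is a made-up attribute name used only for illustration -->
<!ELEMENT doors (#PCDATA)>
<!ATTLIST doors material (wood | metal | glass | aluminum | rubber) "wood">
```

With this declaration in the DTD, a document could contain <doors material="glass">two</doors>. A value outside the list, such as material="paper", should be reported by a validating parser.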

(#PCDATA)

Parsed character data, (#PCDATA), tells the processor that characters are allowed in an ELEMENT, as opposed to elements or other instructions to the computer. (#PCDATA) is most often used for text content. You must tell the processor what type of data is allowed in each ELEMENT statement. Is an image allowed? Is just text allowed? More than one type of data is allowed in an element, but it all must be declared.

In the example above, in !ELEMENT house (which is a parent), we say that the elements doors, walls, roof, and floor (which are child elements of house) are allowed; then we break it down further by saying what can be included in each child element. We say (and this is only a partial list of the full statement) that in the !ELEMENTs doors, walls, roof, and floor, we can use text. Looks simple, right? Remember, we set up the parameters for our own documents, and we get to decide what we will include, but everything we use in our valid XML document must be declared in our DTD.

NOTE

(#PCDATA)

(#PCDATA) is always enclosed in parentheses.
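
As a sketch of declaring more than one type of data in an element, the following declaration lets sentence hold text mixed with child elements. The emphasis element is hypothetical and appears nowhere else in this chapter. When text and elements are mixed, #PCDATA must come first, the choices are separated by vertical bars, and the group ends with an asterisk.

```xml
<!-- "emphasis" is a hypothetical element added only for this sketch -->
<!ELEMENT sentence (#PCDATA | emphasis)*>
<!ELEMENT emphasis (#PCDATA)>
```

A matching element would look like <sentence>This is <emphasis>fun</emphasis></sentence>.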

A DTD can be saved as a separate file from the XML document, then referenced in the XML document with a line of code, like so:

<!DOCTYPE house SYSTEM "j://myHouse/house.dtd"> 

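That line of code references house.dtd. SYSTEM introduces a system identifier: the location of the DTD file, given as a path or URL, whether that file sits on your own machine or on the Web. PUBLIC is used instead for widely shared DTDs that have a published public identifier; it is followed by that public identifier and then by a system identifier giving the location.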
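
For comparison, here is a sketch of a DOCTYPE that uses PUBLIC, taken from the kind of SVG file this book is about. A PUBLIC DOCTYPE carries two quoted strings: a public identifier, then a location for the DTD. The identifiers shown are, to the best of our knowledge, the ones the W3C published for SVG 1.0; the width and height values are arbitrary.

```xml
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.0//EN"
  "http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">
<svg width="100" height="100" xmlns="http://www.w3.org/2000/svg">
</svg>
```

Note standalone="no" in the declaration: this document depends on an external DTD.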

The DTD can also be included in the XML document as an "internal" DTD.

Adding a DTD to myFirst

Open up Notepad for Windows or BBEdit for Macs, and reopen your myfirst.xml file.

Type or copy in the following after the XML declaration (remember what that was?). It was the line of code that looks like this: <?xml version="1.0"?>

<!DOCTYPE document [
<!ELEMENT document (title,author,sentence)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (firstName,lastName)>
<!ELEMENT firstName (#PCDATA)>
<!ELEMENT lastName (#PCDATA)>
<!ELEMENT sentence (#PCDATA)>
]>

Save the document again as myfirst.xml. Again, when saving the document, make sure to use the .xml three-character extension and save it as type All Files, or it will be saved as a .txt document. In Windows 95/98, be careful to put quotes around it ("myfirst.xml"), or it will be saved as myfirst.xml.txt.
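
If you have followed every step so far, your myfirst.xml should now look something like the sketch below: the XML declaration, the internal DTD, and then the document itself with the comment and the author element added earlier. Your spacing may differ.

```xml
<?xml version="1.0"?>
<!DOCTYPE document [
<!ELEMENT document (title,author,sentence)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (firstName,lastName)>
<!ELEMENT firstName (#PCDATA)>
<!ELEMENT lastName (#PCDATA)>
<!ELEMENT sentence (#PCDATA)>
]>
<document>
<title>My first experiment</title>
<!-- firstName and lastName are child elements nested within the parent author element -->
<author>
        <firstName>Jane</firstName>
        <lastName>Jones</lastName>
</author>
<sentence>This is fun</sentence>
</document>
```

The order of the child elements matters: the declaration (title,author,sentence) requires them to appear in exactly that sequence.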

Now view the file in a browser. You should see something similar to Figure 1–6 or Figure 1–7.

FIGURE 1–6 myfirst.xml in Internet Explorer 5.

FIGURE 1–7 myfirst.xml in Netscape 6.

Once again, if the code is colored in IE or if just the text shows up in Netscape, the document is well formed, but we want to find out whether it is valid, as well. How do we do that? To validate the document, you have to have a validating XML parser. IE 5 has a parser but you may have to download parts of it, depending on your operating system. If you don't have a validating parser, there are some on the Web that let you paste in your code and tell you whether it is valid. One is located at www.stg.brown.edu/service/xmlvalid/. Simply paste in your code where you are given the "text" form, and press the Validate button. You will be returned to a page that tells you whether the document validates. If you have errors in the document, this program will also list them.

Parsers

XML parsers, including the ones contained in some but not all browsers, check the framework and content of an XML document for well-formedness; validating parsers also check the document against its DTD. Depending on the vendor, some XML tools also offer extensive planning and editing features. This is a developing field, and new products come out constantly.
