- What Is Well-Formedness?
- Change Name to Lowercase
- Quote Attribute Value
- Fill In Omitted Attribute Value
- Replace Empty Tag with Empty-Element Tag
- Add End-tag
- Remove Overlap
- Convert Text to UTF-8
- Escape Less-Than Sign
- Escape Ampersand
- Escape Quotation Marks in Attribute Values
- Introduce an XHTML DOCTYPE Declaration
- Terminate Each Entity Reference
- Replace Imaginary Entity References
- Introduce a Root Element
- Introduce the XHTML Namespace
Introduce an XHTML DOCTYPE Declaration
Insert an XHTML DOCTYPE declaration at the start of each document.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
The DOCTYPE declaration points to the DTD that is used to resolve entity references. Without it, the only entity references you can use are &amp;amp;, &amp;lt;, &amp;gt;, &amp;apos;, and &amp;quot;. Once you've added it, though, you can use the full set of HTML entity references: &amp;copy;, &amp;nbsp;, &amp;eacute;, and so forth.
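The effect of a missing declaration can be seen with any non-validating XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical, and the small internal DTD subset in the last sample is only a stand-in for the much larger set of entity declarations that the XHTML DTDs provide.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if a non-validating XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# The five predefined references need no DTD at all.
predefined = "<p>Fish &amp; chips &lt;today&gt;</p>"

# An HTML entity reference with no declaration in sight is undefined.
undeclared = "<p>&copy; 2008</p>"

# Once a DTD declares the entity, the same reference resolves.
# (Illustrative internal subset, not the real XHTML DTD.)
declared = '<!DOCTYPE p [<!ENTITY copy "&#169;">]><p>&copy; 2008</p>'

for sample in (predefined, undeclared, declared):
    print(is_well_formed(sample))
```

Note that this parser does not fetch external DTDs, so it illustrates the rule rather than the behavior of a browser or a DTD-loading XSLT processor.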
The DOCTYPE declaration will also be important in the next chapter when we begin to make documents valid, not merely well-formed.
Adding an XHTML DOCTYPE declaration has the side effect of turning off quirks mode in many browsers. This can affect how a browser renders a document. In general, this is a good thing, because nonquirks mode is much more interoperable. However, if you have old stylesheets that depend on quirks mode for proper appearance, adding a DOCTYPE may break them. You might have to update them to be standards conformant first. This is especially true for stylesheets that do very precise layout calculations.
You can use three possible DTDs for XHTML: frameset, transitional, and strict.
- The frameset DTD allows pages to contain frames.
- The transitional DTD retains elements that the strict DTD drops, such as u, font, center, iframe, and applet.
- The strict DTD removes all deprecated presentational elements and attributes that should be replaced with CSS. It also tightens up the content model of many elements. For instance, in strict XHTML, blockquotes and bodies cannot contain plain text, only other block-level elements.
These are indicated by one of the following three DOCTYPE declarations:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
In the short run, it doesn't matter which you pick. In the long run, you'll probably want to migrate your documents to the strict DTD, but for now you can use the frameset DTD on any pages that contain frames and the transitional DTD for other documents.
Browsers look at the public identifier to determine what flavor of HTML they're dealing with. However, they will not actually load the DTD from the specified URL. In essence, they already know what's there and don't need to load it every time.
Other, non-HTML-specific tools such as XSLT processors may indeed load the DTD. In this case, you may wish to replace the remote URLs with local copies. For example:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "dtd/xhtml1-strict.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "dtd/xhtml1-transitional.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "dtd/xhtml1-frameset.dtd">
As long as the public identifiers are the same, the browsers will still recognize these.
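To make the role of the public identifier concrete, here is a rough sketch of a lookup keyed on it alone. This is not how any particular browser is implemented; the function name xhtml_flavor and the lookup table are assumptions made for illustration. The point is only that the system identifier (remote URL or local path) plays no part in the decision.

```python
import re

# Public identifiers of the three XHTML 1.0 DTDs.
XHTML_FLAVORS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": "strict",
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "transitional",
    "-//W3C//DTD XHTML 1.0 Frameset//EN": "frameset",
}

# Capture the quoted public identifier; the system identifier is ignored.
DOCTYPE_PUBLIC = re.compile(r"<!DOCTYPE\s+html\s+PUBLIC\s+([\"'])(.*?)\1")


def xhtml_flavor(document: str):
    """Return 'strict', 'transitional', 'frameset', or None if unrecognized."""
    match = DOCTYPE_PUBLIC.search(document)
    if match is None:
        return None
    return XHTML_FLAVORS.get(match.group(2))


remote = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
          '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">')
local = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
         '"dtd/xhtml1-strict.dtd">')

# Both declarations are classified the same way.
print(xhtml_flavor(remote), xhtml_flavor(local))
```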
Some documents on a site may already have DOCTYPE declarations, either XHTML or otherwise. Many tools have added these by default over the years, even though browsers never paid much attention to them. Thus, the first step is to find out what you've already got. Do a multifile search for <!DOCTYPE. Unless you're writing HTML or XML tutorials, any hits you get are almost certain to be preexisting DOCTYPE declarations. In most cases, though, they will not be the right one. Usually, there are only a few variants, so you can do a constant string multifile search and replace to upgrade to the newer XHTML DOCTYPE. Any that don't fit the pattern can be fixed by hand.
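If your editor lacks a convenient multifile search, a short script can take the same inventory. The sketch below assumes the site lives in a local directory tree and that pages end in .html or .htm; the function name survey_doctypes and those suffixes are illustrative choices, not requirements.

```python
import re
from collections import Counter
from pathlib import Path

# Matches a DOCTYPE declaration that has no internal subset.
DOCTYPE = re.compile(r"<!DOCTYPE[^>]*>", re.IGNORECASE)


def survey_doctypes(root, suffixes=(".html", ".htm")):
    """Count each distinct DOCTYPE declaration found under root."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        match = DOCTYPE.search(text)
        # Collapse whitespace so line-wrapped variants compare equal.
        key = " ".join(match.group(0).split()) if match else "(none)"
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Most common variants first; "." stands in for the site root.
    for doctype, count in survey_doctypes(".").most_common():
        print(count, doctype)
```

The handful of variants at the top of the list are the candidates for constant-string search and replace; the rare ones at the bottom are the ones to fix by hand.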
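Documents that don't have a DOCTYPE are also easy to fix. The DOCTYPE always goes immediately before the <html> start-tag. Thus, all you have to do is search for the regular expression <html\b and replace it with the following. (The \b word boundary matches <html whether it is followed by whitespace or by >, without consuming that character.)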
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "dtd/xhtml1-strict.dtd">
<html
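The same replacement can be scripted. The following is a minimal sketch, assuming element names have already been converted to lowercase; the function name add_doctype is hypothetical. It matches html followed by a word boundary so that the character after the tag name is left alone, and it skips any document that already contains a DOCTYPE declaration.

```python
import re

STRICT_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
    '  "dtd/xhtml1-strict.dtd">\n'
)

# \b is zero-width, so the space or > after "html" is not consumed.
HTML_START = re.compile(r"<html\b")


def add_doctype(document: str) -> str:
    """Insert the strict DOCTYPE before the <html start-tag if none exists."""
    if re.search(r"<!DOCTYPE", document, re.IGNORECASE):
        return document  # leave existing declarations for a separate pass
    # A function replacement avoids backslash processing in the DOCTYPE text.
    return HTML_START.sub(lambda m: STRICT_DOCTYPE + m.group(0), document, count=1)


page = '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head></html>'
print(add_doctype(page))
```

Because documents that already have a declaration are returned unchanged, running the script twice over the same files is harmless.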
You should also take this opportunity to configure your authoring tools to specify the XHTML DOCTYPE by default. Often it's a simple checkbox in a preference pane somewhere.
TagSoup does not add DOCTYPE declarations. You'll need to insert these by hand. Tidy adds a transitional DOCTYPE by default. However, you can request strict instead with the --doctype strict option:
$ tidy -asxhtml --doctype strict file.html