Separating Style and Content: LaTeX and Typesetting
Periodically, I hear people discussing the relative merits of OpenOffice.org Writer, AbiWord, and Microsoft Word. I’ve even tried using them a few times and, while they seem to work, I still haven’t quite worked out the niche into which they’re meant to fit.
I write a lot. On a slow day, I’ll write a thousand words or so, but most days it’s more. Since I produce a lot of words, I would have expected to be in the target market for these tools, but I’m not. For the rest of this article, I’ll attempt to explain why.
Separation of Style and Content
The first mistake that most word processing programs make is that they don’t encourage the separation of style and content—some don’t even permit it. When I write, I structure my text in paragraphs. These are then assembled into sections, chapters, etc.
At the moment, I’m writing this article in vim. Some people have told me that I should use EMACS; I’ve tried, and while I appreciate the Lisp Machine heritage, and consider it a nice approach for an operating system, I wouldn’t want to use it until a decent text editor has been ported to it. Viper just doesn’t cut it. But I digress. vim is a text editor. A text editor is a program for editing text, which is what I’m doing.
While I’m writing this text, I occasionally want to create titles. I use a simple form of markup here: I underline top-level headings with equals signs (=) and second-level headings with hyphens (-), and, if I use them, third-level headings have an underscore before and after (_like this_).
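As an illustration of how little machinery this kind of markup needs, here is a minimal Python sketch that recovers the structure from such a file. The function names (parse_blocks, to_html) are hypothetical, and the sketch assumes that blank lines separate headings and paragraphs, which the conventions above do not actually state. The HTML output is only one possible target for the layout step.

```python
import re
from html import escape


def parse_blocks(text):
    """Split plain text into (kind, content) pairs.

    Assumed conventions: a line underlined with '=' is a top-level
    heading, one underlined with '-' is a second-level heading, and a
    line wrapped in underscores is a third-level heading. Blocks are
    assumed to be separated by blank lines.
    """
    lines = text.splitlines()
    blocks = []
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        nxt = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if not line:
            i += 1  # skip blank separator lines
        elif nxt and set(nxt) == {"="}:
            blocks.append(("h1", line))
            i += 2  # consume the heading and its underline
        elif nxt and set(nxt) == {"-"}:
            blocks.append(("h2", line))
            i += 2
        elif re.fullmatch(r"_.+_", line):
            blocks.append(("h3", line[1:-1]))
            i += 1
        else:
            # Ordinary paragraph: gather lines up to the next blank line.
            para = []
            while i < len(lines) and lines[i].strip():
                para.append(lines[i].strip())
                i += 1
            blocks.append(("p", " ".join(para)))
    return blocks


def to_html(blocks):
    """Render the structure as HTML; the content itself is untouched."""
    return "\n".join(
        f"<{kind}>{escape(content)}</{kind}>" for kind, content in blocks
    )
```

The point of the split is that parse_blocks knows nothing about presentation: the same list of blocks could just as easily be handed to a different renderer.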
When I’ve finished writing, my text will be sent to an editor, who will try to salvage some of my incoherent ramblings and turn them into coherent prose. This is a very useful feature, and one I would consider a killer application for any word processor. Microsoft Word has a grammar checker feature that’s intended to do something similar, but only succeeds in translating my writing into something which is almost, but not quite, exactly unlike English.
The final stage is the layout. By the time you read this article, someone will have attached a load of HTML markup to it, which will instruct your browser how to display it.
So there are three distinct processes involved in producing this article:
1. Writing the text
2. Editing the text
3. Laying out the text
Did you notice the snazzy HTML numbered list there? That was added by someone after I finished writing.
The obvious tool for the first step is a text editor. vim has some nice features that make this easier; the latest version checks spelling as I type, for example, and most recent versions can fold up paragraphs and sections to allow me to navigate the document easily.
The next phase is also often best done in a text editor. If I get an edited article back in text format, I can just run diff on the two versions and see very quickly what has been added, removed, or changed.
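The same comparison can be sketched in a few lines of Python using the standard difflib module. This is only an illustration of the idea, not the diff command itself; the function name show_changes, the file labels, and the sample sentences are made up for the example.

```python
import difflib


def show_changes(draft, edited):
    """Return a unified diff of two versions of a text, line by line."""
    diff = difflib.unified_diff(
        draft.splitlines(),
        edited.splitlines(),
        fromfile="draft.txt",    # labels only; no files are read
        tofile="edited.txt",
        lineterm="",             # inputs carry no trailing newlines
    )
    return list(diff)


# Hypothetical before and after versions of a short passage.
draft = "I write alot.\nMost days its more.\nBut I digress."
edited = "I write a lot.\nMost days it's more.\nBut I digress."

for line in show_changes(draft, edited):
    print(line)
```

Lines starting with a minus sign were removed by the editor, lines starting with a plus sign were added, and unmarked lines are unchanged context. This works precisely because both versions are plain text with one sentence or paragraph per line.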
The final phase is where it starts to get interesting. Once you have a correctly edited document, you need to turn it into something that’s easy for people to read. With a word processor, you typically have to—or, at least, are encouraged to—do this before you begin writing the text.
When someone comes to typeset this text, it’s convenient for them to have some idea of how the text is structured. Some parts of this article are headings, for example, and I’ve explained how I mark those up. For a human, it’s quite easy to determine the structure. For a document of this length (somewhere around 2,000–3,000 words) it’s not a real problem for a human to lay out all of the text manually. Longer documents are a bit more difficult.
Of course, this article is being published on a web site, so a human doesn’t actually do the layout. Every time you visit this page, your browser will generate a newly laid-out copy for you. Doesn’t that make you feel special?
HTML originally contained predominantly syntactic markup. This included commands for things like boldfacing and italics, and for line breaks. Modern versions of HTML focus more on semantic markup, however, which includes things like paragraph breaks, headings (which were in the first version), and such.
Someone writing a modern HTML document will insert tags describing the structure of the document, and then later describe the layout in CSS. They may write things like this:
<p class="Warning">Don’t try this at home, kiddies!</p>
The CSS for this paragraph might then specify that the entire thing should be displayed in large red letters.
SGML and DocBook
HTML itself is a derivative of SGML, the Standard Generalized Markup Language. This is a meta-language for defining markup languages. A subset of SGML, known as XML, has become quite popular in recent years.
Another SGML derivative, known as DocBook, is also widely used. Some publishers use it extensively, and it is common in the Free Software community.
TeX and LaTeX
TeX is Donald Knuth’s typesetting language. He created it because he was unhappy with the state of machine typesetting while writing his magnum opus, The Art of Computer Programming, and so he took some time off to fix it. Being a computer scientist, Knuth created a Turing-complete language for the job, albeit a domain-specific one designed for typesetting text. TeX contains commands for things like placing characters in positions on a page. Some other byproducts were METAFONT, which allows fully detailed typeface specification, and DVI, a device-independent output format.
DVI was popular for a while, particularly in academia, as a way of sharing typeset pages. A DVI document would look exactly the same on any viewer, or on paper. The format did have some drawbacks, however. It didn’t provide a good way of adding images, for example. The common way of including images in a DVI file was to leave a space and place a comment in the file containing the filename of an encapsulated PostScript image to be included there. Another program would then translate the DVI into PostScript (for viewing or printing) and insert the images in the correct places. These days, DVI is largely eclipsed by PDF, which fulfills a similar purpose and has the advantage that viewers are already installed on most computers.
Since TeX is a Turing-complete language, it’s possible to write complex programs in it (technically, any program). One common practice was adding macros to TeX. These would package some commonly reused functionality and allow it to be invoked easily. A fairly complex package of macros, known as LaTeX, provides semantic markup capabilities.