\iffalse Article for Baskerville. Jonathan Fine, 16 March 1994 Revised, 18 March 1994 \fi \title{\protect\TeX\ and SGML --- Friend or Foe?} \author[Jonathan Fine]{Jonathan Fine\\\texttt{J.Fine@uk.ac.cam.pmms}} \begin{Article} At the last \ukt\ committee meeting there was an interesting discussion about holding a meeting (London, November this year perhaps) on \TeX\ and SGML. It became clear that for such a meeting to be successful, particularly in developing and promoting \TeX\ as a typesetting system, the purpose, focus, agenda, speakers and audience would all require careful thought and further discussion. What follows are some personal observations and opinions on the subject, with which the rest of the committee may or may not agree. It is my intention to open communication and begin a debate that will continue through to the proposed meeting this winter and beyond. My primary sources are {\em The \TeX book\/} and {\em The SGML Handbook\/} (Charles Goldfarb, OUP~1990), which will be cited as [T] and [S] respectively.

First, some words about standards. An old joke has someone saying \lq\lq{}Yes, we believe in standards. That\rq{}s why we have so many of them.\rq\rq{} The joke, of course, is that standards should create or make manifest uniformity amongst similar objects. Eliphalet Remington gave an early demonstration of the effectiveness of standards to President Lincoln, early in the US Civil War. He dismantled several rifles, mixed the parts up in a heap, and then reassembled the rifles, thereby demonstrating the interchangeability of the parts. (This won him a large Union munitions contract.) This could be done because the parts had been manufactured to carefully specified tolerances. At the time it was surprising. Now, it is perhaps surprising that it was once surprising. We take it for granted. Another meaning for \lq{}standard\rq{} is a flag which leads an army into battle. Such standards are economic realities in the commercial world.
The word \lq{}document\rq{} is overworked. Instead, I will use the word \lq{}compuscript\rq{} (or script for short) to refer to a structured file containing text and tags or processing commands. It is convenient to think of a script as an ASCII file meeting (formal or informal) syntax conditions. Thus presented, many files are scripts. \TeX\ and \LaTeX\ files satisfy an informal syntax. The same is true of macro files. Other examples are the contents of a database, expressed in any one of a number of formats, program source files for any of the many programming languages, and document files for the various word processors and other typesetting systems. The SGML standard (ISO~8879) defines a document to be \lq\lq{}A collection of information that is processed as a unit. A document is classified as being of a particular document type.\rq\rq{}~[S,~p124,263] This may seem rather pedestrian and pedantic, but we are not yet able to repeat for scripts Remington\rq{}s rifle trick, which is of course based on boring and pedantically precise specifications for the parts. Incidentally, if you look up {\em Boring\/} in the Yellow Pages, it will say {\em See Civil Engineers}.

The same compuscript may be processed in several ways. It may be edited, typeset, formatted for online display, compiled (if a program source file), or have its spelling and grammar checked. Portions may be extracted to form a secondary compuscript, such as an abstracts journal or citation index. We can now see the complementary r\^oles of SGML and \TeX. The first is a standard for the specification of compuscripts.
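To make this concrete, here is a sketch of a minimal SGML compuscript. The document type \verb"letter" and its element names are invented for illustration only; the declaration style is that described in [S].
\begin{verbatim}
<!DOCTYPE letter [
<!ELEMENT letter   - - (greeting, body)>
<!ELEMENT greeting - - (#PCDATA)>
<!ELEMENT body     - - (#PCDATA)>
]>
<letter>
<greeting>Dear Reader,</greeting>
<body>The tags are markup; the rest is data.</body>
</letter>
\end{verbatim}
A parser can validate such a script against its document type declaration before any processing---typesetting or otherwise---takes place.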
It is [S,~p7--8] \lq\lq{}based on two novel postulates \begin{itemize} \item[a)] Markup should describe a document\rq{}s structure and other attributes rather than specify processing to be performed~[\ldots] \item[b)] Markup should be rigorous so that the techniques available for processing rigorously-defined objects like programs and databases can be used for processing documents as well.\rq\rq{} \end{itemize} while \TeX\ is \lq\lq{}a new typesetting system intended for the creation of beautiful books---and especially for books that contain a lot of mathematics\rq\rq{}~[T,~page~v]. Thus, SGML is a specification language for compuscripts, while \TeX\ is a typesetting system which will process suitable compuscripts. So far as I can tell, both \TeX\ and SGML are sound in their basic design. Given this---although some may disagree---one would expect them to work well together, like nuts and bolts. However, they do not, and it is worth understanding why and how. Here I must admit to having a trumpet to blow. It is my belief that a \TeX\ format can be written that will parse and typeset suitable SGML compuscripts, and that such a format is the way to go. The following remarks are focussed on the existing \TeX\ and \LaTeX\ formats.

\TeX\ has no inbuilt concept of markup or of parsing. This is probably as it should be, and I suggest that the reader reflect on why. My opinion is that such a facility would be---in terms of Knuth\rq{}s goal of creating beautiful books---a bell or whistle, a diversion. For similar reasons, I believe, Knuth saw no need to write a text file editor. He did, however, produce the WEB programming tools, and he supplied \TeX\ itself together with a couple of thousand lines of macros. Since then, \TeX\ macro packages have mixed parsing in with processing in a manner which prohibits rigorous markup---a hallmark of SGML. One symptom of this is the recurrent problem of verbatim text within a macro argument, such as a section title.
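A sketch of the failure (schematic, not tied to any particular package):
\begin{verbatim}
\section{The \verb"\halign" primitive}   % fails
\end{verbatim}
\verb"\verb" works by changing category codes, but by the time \verb"\section" has absorbed its argument, the category codes of that text are already frozen, so the trick comes too late. Reading the characters and typesetting them---parsing and processing---have become entangled.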
Because users can define new commands, the syntax of a \TeX\ compuscript is always subject to change. It may be harmless to write \begin{verbatim} \def\beq{\begin{equation}} \def\eeq{\end{equation}} \end{verbatim} in the preamble to a \LaTeX\ compuscript, but \begin{verbatim} \beq ax^2 + bxy + cy^2 \eeq \end{verbatim} will now trip up a spell checker programmed to skip mathematics. Moreover, setting up such a checker to find the (deliberate) spelling error in \begin{verbatim} \begin{equation} e = mc^2 \qquad\hbox{Eintsien} \end{equation} \end{verbatim} will not be easy.

A more substantial problem is the special and contingent typesetting instructions that are required to achieve quality typesetting. The simplest examples are the space adjustments \verb"\>" and so forth used within mathematics. The breaking and spacing of long equations and formulae, when setting to a narrow measure, present more difficulties, if one is to typeset from a compuscript satisfying a rigorous syntax. The same applies to tables. Typically, one might expect a skilled compositor (either human or robotic) to \lq{}annotate\rq{} the author\rq{}s compuscript for, say, a scholarly journal with commands to control or adjust page breaks, the size and placement of floating items---in a word, page make-up. SGML recognizes [S,~p139,277] that one sometimes needs \lq\lq{}processing instructions,\rq\rq{} which are \lq\lq{}markup consisting of system specific data that controls how a document is to be processed.\rq\rq{} Here, the system might be \TeX-based typesetting, or typesetting to a particular design, or some other application. \lq\lq{}As war is to diplomacy,\rq\rq{} writes Goldfarb [S,~p139], so this is \lq\lq{}the last resort of descriptive markup.\rq\rq{} The key to success for SGML is that it provides standards for compuscripts, or more exactly provides tools for the expression of such standards. This allows diverse programs to process the same compuscript in various ways, for different purposes.
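In SGML\rq{}s reference concrete syntax, a processing instruction is delimited by \verb"<?" and \verb">". A sketch (the \TeX\ code shown is an invented annotation, not a standard convention):
\begin{verbatim}
... end of one paragraph.
<?TeX \penalty-200 >
<p>The next paragraph ...
\end{verbatim}
A \TeX-based application might interpret this as a hint encouraging a page break at this point; any other application is free to ignore it, and the descriptive markup around it remains untouched.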
Yuri Rubinsky, in his preface [S,~page~x], wrote \begin{quote} Over the next five years, computer users will be invited to abandon their worst habits: They will no longer have to work at every computer task as if it had no need to share data with all their other computer tasks; they will not have to act as if the computer is simply a complicated, slightly-more-lively replacement for paper; [\ldots]; not have to appease software programs that seem to be at war with one another. \end{quote} but perhaps he is too optimistic---he was writing in October~1990.

There appear to be two main situations where \TeX\ can contribute to SGML-based document processing. The first is the high-quality typesetting of SGML compuscripts, such as the contents of a database. The second is more subtle. The tagging process adds information to the compuscript, and thereby makes it more valuable. For example, in this document the names of our two authors, Knuth and Goldfarb, are set in the main body font, and so require no additional markup. But for a hypertext retrieval engine, we will want these names linked to an index of persons. Mechanical processes may help, but because many people share the same family name, a certain amount of author assistance is required, particularly for the more common names, family names that are also place names, and so forth. This is only one example of how the author is uniquely qualified to provide data tagging, as we may call it. Employees can be told to tag data, but this strategy is unlikely to work for the authors of scholarly publications. Instead, they must be equipped with tools and incentives. In particular, a document processing system which returns benefits (such as copious indices and cross-references) to the author as a consequence of data tagging will provide an incentive perhaps stronger than coercion. \TeX\ is freely and widely available. It deserves to be part of such a system. \end{Article} \endinput