\title{Portable Documents: Why Use SGML?} \author[David Barron]{David Barron\\ Department of Electronics and Computer Science\\ University of Southampton} \begin{Article} \section{Introduction} In this article we present a few ideas as a framework for the discussion of portable documents. We address a number of questions: \begin{itemize} \item What are portable documents? \item Who needs them, and why? \item How to produce them, now and in the future \end{itemize} \section{Documents} Traditionally, a document was a file (or a deck of cards), and consisted solely of text. Today, documents are typically {\em compound}, a mixture of text and graphics (bit-map or line art) that can be rendered on paper or screen. Additionally, they may include hypertext links (in which case they can only be viewed on screen). A recent development is the ability to incorporate video and sound in a compound document, either embedded within the document or linked by a pointer: such a document is a {\em multimedia} document. Hypertext-style links may also be included to form a {\em hypermedia} document: evidently, multimedia and hypermedia documents can only be `read' on a suitably equipped computer system. World Wide Web (WWW) documents are a special case of compound hypermedia documents where the links are to other documents elsewhere on the Internet They can be regarded as virtual documents, in the sense that the whole document never exists as a single identifiable object. More generally, we can define a {\em virtual document} as a structured collection of information from which instances of documents and other resources can be derived. Examples include: \begin{itemize} \item The Oxford English Dictionary which exists as a database from which are derived various printed editions (Shorter, Concise, Pocket etc.), as well as the CD-ROM version \item Critical editions of a literary text, where a single source `document' contains all the variations, and can be printed out using different variants as the base text \end{itemize} \section{Portability} The definition of portability that we shall use in this discussion is the ability to transmit the document digitally (over a network, or on a disk or CD-ROM) and re-create a faithful rendering of the document after transmission, if need be on a different hardware and/or software platform from that on which the document was originally created. It is important to observe that there are three different forms in which the text and graphics in a document might be re-created: \begin{itemize} \item with absolute visual fidelity \item with approximate visual fidelity \item retaining content only \end{itemize} \section{Who needs portable documents, and why?} Three different needs for portable documents can be adduced \begin{enumerate} \item Publishers need them in order to distribute electronic books and journals \item Communities with common interests who need to share information need them. An example is a scientific research community whose members use diverse hardware and software \item Librarians responsible for digital archives need portable documents, since they cannot assume that a particular hardware/software platform will exist in perpetuity \end{enumerate} \section{Examples of successful portability} \begin{itemize} \item Computer science researchers and software manufacturers distribute documents as PostScript files. This works well if the fonts employed are restricted to the basic 35, and the use of Adobe Acrobat (pdf files) increases portability when other fonts are used. \item The Physics pre-print library at Los Alamos National Laboratory is used by many physicists world-wide: over 10,000 retrievals per day are reported. The archive holds pre-prints in \LaTeX\ and PostScript formats (figures in PostScript only). This is successful because the Physics community has for some years used \TeX\ as its preferred means of exchanging information. \item WWW documents are highly portable, since their rendering is (almost entirely) determined by the browser software, and the use of a common mark-up language (HTML) ensures portability \end{itemize} \section{Achieving portability} At first sight it appears that portability might be achieved by agreeing standards (e.g. \LaTeX, PostScript, ODA, HTML). At present there is too much choice, and no obvious winner, especially in hypermedia documents. This is a sign of an immature technology. Another important fact to take into account is that it is difficult to impose standards in some environments e.g. acadaemia, where personal preferences lead to the equivalent of religious wars. Particular problems in achieving portability arise from varying fonts and character codes e.g. in handling European languages. Unicode will go a long way towards solving the character codes problem. \section{Why use SGML?} SGML provides a formal and portable definition of document structure. SGML syntax can define a hierarchical structure of embedded document parts, and can associate a type with each component in the hierarchy. By associating a rendering definition with each type of component, it is possible to achieve a portable document. In particular, SGML provides a uniform archive format for a library of portable documents. \subsection{An example} Suppose it is required to maintain a library of technical documents in an environment where some authors use \LaTeX, whilst others use Microsoft Word. We can define an SGML DTD for the document structure, together with \LaTeX and Word styles to define the rendering. This opens up three possibilities: \begin{enumerate} \item Author in SGML and use a tool to produce a \LaTeX\ or Word version from which the printed version can be produced. \item Author in \LaTeX\ and use a tool to translate to SGML to produce the archive copy \item Author in Word and use a tool to translate the RTF form to SGML to produce the archive copy \end{enumerate} In addition to the SGML version of the documents, the archive must contain the Word and \LaTeX\ style files and the translation tools. Once this is done, anyone can collect a document, the required style files and tools and produce a copy of the document. This will of course only work for text documents. For any document with graphics content, and for hypermedia documents, more is required. This is possible in principle, but much remains to be done \section{The future} A combination of SGML and OpenDoc is probably the best way forward. OpenDoc provides an architecture for portable documents: it treats a document as a container for a collection of `parts', each of which can have other parts embedded within it. Each type of part has associated programs to edit and render it, so that documents can be re-created with varying degrees of fidelity depending on the availability of rendering software for the particular varieties of parts that it includes. OpenDoc is a dynamic architecture, and assumes that a new type of part may occur at any time. In principle SGML can be used to describe the static structure of an OpenDoc document, providing the final link in the portability chain. \end{Article} Sir -- Philip Taylor is to be complimented on a fine display of pedantry in the best academic tradition, the kind of tradition that gives academics a bad name amongst normal folk. In computing we use lots of everyday words with specialised meanings, and most of us find no difficulty in using the context of an utterance to achieve any necessary disambiguation. With regard to his criticism of my use of the term "multimedia document", I agree that I don't plug my computer into a multiways socket. But then, I don't attend a performance of an operum at Covent Garden, either. Yours sincerely