% $Id: faq-fmt-conv.tex,v 1.10 2014/05/29 11:24:34 rf10 Exp rf10 $

\section{Format conversions}

\Question[Q-toascii]{Conversion from \AllTeX{} to plain text}

The aim here is to emulate the Unix \ProgName{nroff}, which formats
text as best it can for the screen, from the same input as the Unix
typesetting program \ProgName{troff}.

Converting \acro{DVI} to plain text is the basis of many of these
techniques; sometimes the simple conversion provides a good enough
response.  Options are:
\begin{itemize}
\item \ProgName{dvi2tty} (one of the earliest),
\item \ProgName{crudetype} and
\item \ProgName{catdvi}, which is capable of generating Latin-1
  (ISO~8859-1) or \acro{UTF}-8 encoded output.  \ProgName{Catdvi} was
  conceived as a replacement for \ProgName{dvi2tty}, but development
  seems to have stopped before the authors were willing to declare the
  work complete.
\end{itemize}
A common problem is the hyphenation that \TeX{} inserts when
typesetting something: since the output is inevitably viewed using
fonts that don't match the original, the hyphenation usually looks
silly.

Ralph Droms provides a \Package{txt} bundle of things in support of
\acro{ASCII} generation, but it doesn't do a good job with tables and
mathematics.  Another possibility is to use the
\LaTeX{}-to-\acro{ASCII} conversion program, \ProgName{l2a}, although
this is really more of a de-\TeX{}ing program.

The canonical de-\TeX{}ing program is \ProgName{detex}, which removes
all comments and control sequences from its input before writing it to
its output.  Its original purpose was to prepare input for a dumb
spelling checker, and it's only usable for preparing useful
\acro{ASCII} versions of a document in highly restricted circumstances.
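
For the \acro{DVI}-based converters listed above, one way of
sidestepping the hyphenation problem is to stop \TeX{} hyphenating
before the \acro{DVI} file is made.  The following is a minimal sketch
rather than anything the converters themselves require: the
parameters are ordinary \TeX{} ones, but the decision to relax line
breaking with the last setting is an assumption that may not suit
every document.
\begin{quote}
\begin{verbatim}
\documentclass{article}
% Assumed settings for a text-only run; remove them for print.
\hyphenpenalty=10000   % no automatic hyphenation
\exhyphenpenalty=10000 % no line breaks after explicit hyphens
\sloppy                % relax spacing so unhyphenated lines still fit
\begin{document}
Text destined for plain-text conversion.
\end{document}
\end{verbatim}
\end{quote}
The resulting \acro{DVI} file may then be given to whichever converter
is being tried.
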
\ProgName{Tex2mail} is slightly more than a de-\TeX{}er~--- it's a
\ProgName{Perl} script that converts \TeX{} files into plain text
files, expanding various mathematical symbols (sums, products,
integrals, sub/superscripts, fractions, square roots, \dots{}\@) into
``\acro{ASCII} art'' that spreads over multiple lines if necessary.
The result is more readable to human beings than the flat-style \TeX{}
code.

Another significant possibility is to use one of the
\Qref*{\acro{HTML}-generation solutions}{Q-LaTeX2HTML},
and then to use a browser such as \ProgName{lynx} to dump the resulting
\acro{HTML} as plain text.
\begin{ctanrefs}
\item[catdvi]\CTANref{catdvi}
\item[crudetype]\CTANref{crudetype}
\item[detex]\CTANref{detex}
\item[dvi2tty]\CTANref{dvi2tty}
\item[l2a]\CTANref{l2a}
\item[tex2mail]\CTANref{tex2mail}
\item[txt]\CTANref{txtdist}
\end{ctanrefs}
\LastEdit{2011-07-21}

\Question[Q-SGML2TeX]{Conversion from \acro{SGML} or \acro{HTML} to \protect\TeX{}}

\acro{SGML} is a very important system for document storage and
interchange, but it has no formatting features; its companion
\acro{ISO} standard \acro{DSSSL}
(see \URL{http://www.jclark.com/dsssl/}) is designed for writing
transformations and formatting, but this has not yet been widely
implemented.  Some \acro{SGML} authoring systems (e.g., SoftQuad
\ProgName{Author/Editor}) have formatting abilities, and there are
high-end specialist \acro{SGML} typesetting systems (e.g., Miles33's
\ProgName{Genera}).  However, the majority of \acro{SGML} users
probably transform the source to an existing typesetting system when
they want to print.  \TeX{} is a good candidate for this.  There are
three approaches to writing a translator:
\begin{enumerate}
\item Write a free-standing translator in the traditional way, with
  tools like \ProgName{yacc} and \ProgName{lex}; this is hard, in
  practice, because of the complexity of \acro{SGML}.
\item Use a specialist language designed for \acro{SGML}
  transformations; the best known are probably \ProgName{Omnimark} and
  \ProgName{Balise}.  They are expensive, but powerful, incorporating
  \acro{SGML} query and transformation abilities as well as simple
  translation.
\item Build a translator on top of an existing \acro{SGML} parser.  By
  far the best-known (and free!) parser is James Clark's
  \ProgName{nsgmls}, and this produces a much simpler output format,
  called \acro{ESIS}, which can be parsed quite straightforwardly (one
  also has the benefit of an \acro{SGML} parse against the
  \acro{DTD}).  Two good public domain packages use this method:
  \begin{itemize}
  \item
\begin{narrowversion} % really non-hyper
    David Megginson's \ProgName{sgmlspm}, written in \ProgName{Perl} 5,
    which is available from
    \URL{http://www.perl.com/CPAN/modules/by-module/SGMLS}
\end{narrowversion}
\begin{wideversion}
    David Megginson's
    \href{http://www.perl.com/CPAN/modules/by-module/SGMLS}{\ProgName{sgmlspm}},
    written in \ProgName{Perl} 5.
\end{wideversion}
  \item
\begin{narrowversion} % really non-hyper
    Joachim Schrod and Christine Detig's \ProgName{STIL}
    (`\acro{SGML} Transformations in Lisp'), which is available from
    \URL{ftp://ftp.tu-darmstadt.de/pub/text/sgml/stil}
\end{narrowversion}
\begin{wideversion}
    Joachim Schrod and Christine Detig's
    \href{ftp://ftp.tu-darmstadt.de/pub/text/sgml/stil}{\ProgName{STIL}}
    (`\acro{SGML} Transformations in Lisp').
\end{wideversion}
  \end{itemize}
  Both of these allow the user to write `handlers' for every
  \acro{SGML} element, with plenty of access to attributes, entities,
  and information about the context within the document tree.

  If these packages don't meet your needs for an average \acro{SGML}
  typesetting job, you need the big commercial stuff.
\end{enumerate}

Since \acro{HTML} is simply an example of \acro{SGML}, we do not need a
specific system for \acro{HTML}.
However, Nathan Torkington developed
% (\Email{Nathan.Torkington@vuw.ac.nz})
\ProgName{html2latex} from the \acro{HTML} parser in \acro{NCSA}'s
Xmosaic package.  The program takes an \acro{HTML} file and generates a
\LaTeX{} file from it.  The conversion code is subject to \acro{NCSA}
restrictions, but the whole source is available on \acro{CTAN}.

Michel Goossens and Janne Saarela published a very useful summary of
\acro{SGML}, and of public domain tools for writing and manipulating
it, in \TUGboat{} 16(2).
\begin{ctanrefs}
\item[html2latex \nothtml{\rmfamily}source]\CTANref{html2latex}
\end{ctanrefs}

\Question[Q-LaTeX2HTML]{Conversion from \AllTeX{} to \acro{HTML}}

\TeX{} and \LaTeX{} are well suited to producing electronically
publishable documents.  However, it is important to realize the
difference between page layout and functional markup.  \TeX{} is
capable of extremely detailed page layout; \acro{HTML} is not, because
\acro{HTML} is a functional markup language, not a page layout
language.  \acro{HTML}'s exact rendering is not specified by the
document that is published but is, to some degree, left to the
discretion of the browser.  If you require your readers to see an
exact replication of what your document looks like to you, then you
cannot use \acro{HTML} and you must use some other publishing format
such as \acro{PDF}.  That is true for any \acro{HTML} authoring tool.

\TeX{}'s excellent mathematical capabilities remain a challenge in the
business of conversion to \acro{HTML}.  There are only two generally
reliable techniques for generating mathematics on the web: creating
bitmaps of bits of typesetting that can't be translated, and using
symbols and table constructs.  Neither technique is entirely
satisfactory.  Bitmaps lead to a profusion of tiny files, are slow to
load, and are inaccessible to those with visual disabilities.  The
symbol fonts offer poor coverage of mathematics, and their use
requires configuration of the browser.
The future of mathematical browsing may be brighter~--- see % beware line break
\Qref[question]{future Web technologies}{Q-mathml}.

For today, possible packages are:
\begin{description}
\item[\ProgName{LaTeX2HTML}]a \ProgName{Perl} script package that
  supports \LaTeX{} only, and generates mathematics (and other
  ``difficult'' things) using bitmaps.  The original version was
  written by Nikos Drakos for Unix systems, but the package now sports
  an illustrious list of co-authors and is also available for Windows
  systems.  Michel Goossens and Janne Saarela published a detailed
  discussion of \ProgName{LaTeX2HTML}, and how to tailor it, in
  \TUGboat{} 16(2).

  A mailing list for users may be found via
  \URL{http://tug.org/mailman/listinfo/latex2html}
\item[\ProgName{TtH}]a compiled program that supports either \LaTeX{}
  or \plaintex{}, and uses the font/table technique for representing
  mathematics.  It is written by Ian Hutchinson, using
  \ProgName{flex}.  The distribution consists of a single \acro{C}
  source (or a compiled executable), which is easy to install and very
  fast-running.
\item[\ProgName{TeX4ht}]a compiled program that supports either
  \LaTeX{} or \plaintex{}, by processing a \acro{DVI} file; it uses
  bitmaps for mathematics, but can also use other technologies where
  appropriate.  Written by Eitan Gurari, it parses the \acro{DVI}
  file generated when you run \AllTeX{} over your file with
  \ProgName{tex4ht}'s macros included.  As a result, it's pretty
  robust against the macros you include in your document, and it's
  also pretty fast.
\item[\ProgName{plasTeX}]a Python-based \LaTeX{} document processing
  framework.  It gives \acro{DOM}-like access to a \LaTeX{} document,
  as well as the ability to generate multiple output formats
  (e.g., \acro{HTML}, DocBook and tBook).
\item[\ProgName{TeXpider}]a commercial program from
  \Qref*{Micropress}{Q-commercial}, which is
  described on \URL{http://www.micropress-inc.com/webb/wbstart.htm};
  it uses bitmaps for equations.
\item[\ProgName{Hevea}] a compiled program that supports \LaTeX{}
  only, and uses the font/table technique for equations (indeed its
  entire approach is very similar to \ProgName{TtH}).  It is written
  in Objective \acro{CAML} by Luc Maranget.  \ProgName{Hevea} isn't
  archived on \acro{CTAN}; details (including download points) are
  available via \URL{http://pauillac.inria.fr/~maranget/hevea/}
\end{description}
An interesting set of samples, including conversion of the same text
by the four free programs listed above, is available at
\URL{http://www.mayer.dial.pipex.com/samples/example.htm}; a linked
page gives lists of pros and cons, by way of comparison.

The World Wide Web Consortium maintains a list of ``filters'' to
\acro{HTML}, with sections on \AllTeX{} and \BibTeX{}~--- see
\URL{http://www.w3.org/Tools/Word_proc_filters.html}
\begin{ctanrefs}
\item[latex2html]Browse \CTANref{latex2html}
\item[plasTeX]Browse \CTANref{plastex}
\item[tex4ht]\CTANref{tex4ht} (but see \url{http://tug.org/tex4ht/})
\item[tth]\CTANref{tth}
\end{ctanrefs}

\Question[Q-fmtconv]{Other conversions to and from \AllTeX{}}

\begin{description}
\item[troff]\ProgName{Tr2latex} assists in the translation of a
  \ProgName{troff} document into \LaTeXo{} format.  It recognises most
  |-ms| and |-man| macros, plus most \ProgName{eqn} and some
  \ProgName{tbl} preprocessor commands.  Anything fancier needs to be
  done by hand.  Two style files are provided.  There is also a man
  page (which converts very well to \LaTeX{}\dots{}).
  \ProgName{Tr2latex} is an enhanced version of the earlier
  \ProgName{troff-to-latex} (which is no longer available).
%  The \acro{DECUS} \TeX{} distribution (see
%  \Qref[question]{sources of software}{Q-archives})
%  also contains a program which converts \ProgName{troff} to \TeX{}.
%\item[Scribe] Mark James (\Email{jamesm@dialogic.com}) has a copy of
%  \ProgName{scribe2latex} he has been unable to test but which he will
%  let anyone interested have.
%  The program was written by Van Jacobson
%  of Lawrence Berkeley Laboratory.%
%  \checked{RF}{1994/11/18}
\item[WordPerfect] \ProgName{wp2latex} is actively maintained, and is
  available either for \MSDOS{} or for Unix systems.
\item[\acro{RTF}] \ProgName{Rtf2tex}, by Robert Lupton, is for
  converting Microsoft's Rich Text Format to \TeX{}.  There is also a
  converter to \LaTeX{} by Erwin Wechtl, called \ProgName{rtf2latex}.
  The latest converter, by Ujwal Sathyam and Scott Prahl, is
  \ProgName{rtf2latex2e} which seems rather good, though development
  of it seems to have stalled.

  Translation \emph{to} \acro{RTF} may be done (for a somewhat
  constrained set of \LaTeX{} documents) by \TeX{}2\acro{RTF}, which
  can produce ordinary \acro{RTF} and Windows Help \acro{RTF} (as well
  as \acro{HTML}, \Qref{conversion to HTML}{Q-LaTeX2HTML}).
  \TeX{}2\acro{RTF} is supported on various Unix platforms and under
  Windows~3.1.
\item[Microsoft Word] A rudimentary (free) program for converting
  \acro{MS-W}ord to \LaTeX{} is \ProgName{wd2latex}, which runs on
  \MSDOS{}; it probably processes the output of an archaic version of
  \acro{MS-W}ord (the program itself was archived in 1991).

  For conversion in the other direction, the current preferred
  free-software method is a two-stage process:
  \begin{itemize}
  \item Convert \latex{} to \ProgName{OpenOffice} format, using the
    \ProgName{tex4ht} command \ProgName{oolatex};
  \item open the result in \ProgName{OpenOffice} and `save as' an
    \acro{MS-W}ord document.
  \end{itemize}
  (Note that \ProgName{OpenOffice} itself is \emph{not} on
  \acro{CTAN}; see \url{http://www.openoffice.org/}, though most
  \ProgName{linux} systems offer it as a ready-to-install bundle.)

  \ProgName{tex4ht} can also generate OpenOffice \acro{ODT} format,
  which may be used as an intermediate to producing Word format files.
  \ProgName{Word2}\emph{\TeX{}} and \emph{\TeX{}}\ProgName{2Word} are
  shareware translators from % beware line break
  \href{http://www.chikrii.com/}{Chikrii Softlab}; positive users'
  reports have been noted (but not recently).

  If cost is a constraint, the best bet is probably to use an
  intermediate format such as \acro{RTF} or \acro{HTML}.
  \ProgName{Word} outputs and reads both, so in principle this route
  may be useful.

  You can also use \acro{PDF} as an intermediate format: Acrobat
  Reader for Windows (version 5.0 and later) will output rather feeble
  \acro{RTF} that \ProgName{Word} can read.
\item[Excel] \ProgName{Excel2Latex} converts an \ProgName{Excel} file
  into a \LaTeX{} \environment{tabular} environment; it comes as a
  \extension{xls} file which defines some \ProgName{Excel} macros to
  produce output in a new format.
\item[runoff] Peter Vanroose's \ProgName{rnototex} conversion program
  is written in \acro{VMS} Pascal.  The sources are distributed with a
  \acro{VAX} executable.
\item[refer/tib] There are a few programs for converting
  bibliographic data between \BibTeX{} and
  \ProgName{refer}/\ProgName{tib} formats.  The collection includes a
  shell script converter from \BibTeX{} to \ProgName{refer} format as
  well.  The collection is not maintained.
\item[\acro{PC}-Write]\ProgName{pcwritex.arc} is a print driver for
  \acro{PC}-Write that ``prints'' a \acro{PC}-Write V2.71 document to
  a \TeX{}-compatible disk file.  It was written by Peter Flynn at
  University College, Cork, Republic of Ireland.
\end{description}
% beware line breaks
\href{http://www.tug.org/utilities/texconv/index.html}{Wilfried Hennings' \acro{FAQ}},
which deals specifically with conversions between \TeX{}-based formats
and word processor formats, offers much detail as well as tables that
allow quick comparison of features.

A group at Ohio State University (\acro{USA}) is working on a common
document format based on \acro{SGML}, with the ambition that any
format could be translated to or from this one.
\ProgName{FrameMaker} provides ``import filters'' to aid translation
from alien formats (presumably including \TeX{}) to
\ProgName{FrameMaker}'s own.
\begin{ctanrefs}
\item[excel2latex]\CTANref{excel2latex}
\item[pcwritex.arc]\CTANref{pcwritex}
\item[refer and tib tools]\CTANref{refer-tools}
\item[rnototex]\CTANref{rnototex}
\item[rtf2latex]\CTANref{rtf2latex}
\item[rtf2latex2e]\CTANref{rtf2latex2e}
\item[rtf2tex]\CTANref{rtf2tex}
\item[tex2rtf]\CTANref{tex2rtf}
\item[tex4ht]\CTANref{tex4ht} (but see \url{http://tug.org/tex4ht/})
\item[tr2latex]\CTANref{tr2latex}
\item[wd2latex]\CTANref{wd2latex}
\item[wp2latex]\CTANref{wp2latex}
\item[\nothtml{\rmfamily}Word processor \acro{FAQ} (source)]%
  \CTANref{texcnvfaq}
\end{ctanrefs}

\Question[Q-readML]{Using \TeX{} to read \acro{SGML} or \acro{XML} directly}

\Qref*{\context{} (mark \acro{IV})}{Q-context} can process some
\acro{*ML}, to produce typeset output directly.  Details of what can
(and cannot) be done are discussed in % ! line break
\href{http://wiki.contextgarden.net/XML}{The \context{} \acro{WIKI}}.

\context{} is probably the system of choice for \alltex{} users who
also need to work in \acro{XML} (and friends).  (Note that
\context{} mark~\acro{IV} requires \Qref*{\luatex{}}{Q-luatex}, and
should therefore be regarded as experimental, though many people
\emph{do} use it successfully.)

Older systems also manage, using no more than \alltex{} macro
programming, to process \acro{XML} and the like.  David Carlisle's
\Package{xmltex} is the prime example; it offers a solution for
typesetting \acro{XML} files, and is still in active (though not very
widespread) use.

One use of a \TeX{} that can typeset \acro{XML} files is as a backend
processor for \acro{XSL} formatting objects, serialized as
\acro{XML}.  Sebastian Rahtz's Passive\TeX{} uses \Package{xmltex} to
achieve this end.

However, modern usage would proceed via \acro{XSL} or \acro{XSLT}2 to
produce a formattable version.
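
To give a flavour of the \context{} approach, the sketch below (for
mark~\acro{IV}) maps the elements of a tiny, made-up \acro{XML}
document to typesetting actions.  The element names, the buffer name
\texttt{demo} and the setup names beginning \texttt{xml:demo:} are all
invented for the example, and the treatment of each element is only
one possibility; the \context{} \acro{WIKI} remains the authority on
the commands themselves.
\begin{quote}
\begin{verbatim}
% Hypothetical input, held in a buffer so the file is self-contained
\startbuffer[demo]
<document>
  <section><title>First</title><p>Some text.</p></section>
</document>
\stopbuffer

% Associate each element with the setup named after it
\startxmlsetups xml:demo:base
  \xmlsetsetup{#1}{document|section|title|p}{xml:demo:*}
\stopxmlsetups
\xmlregistersetup{xml:demo:base}

% Containers simply pass their content on
\startxmlsetups xml:demo:document
  \xmlflush{#1}
\stopxmlsetups
\startxmlsetups xml:demo:section
  \xmlflush{#1}
\stopxmlsetups

% Titles in bold, paragraphs as paragraphs
\startxmlsetups xml:demo:title
  {\bf \xmlflush{#1}}\par
\stopxmlsetups
\startxmlsetups xml:demo:p
  \xmlflush{#1}\par
\stopxmlsetups

\starttext
\xmlprocessbuffer{main}{demo}{}
\stoptext
\end{verbatim}
\end{quote}
Each element gets a handler that says what to typeset and where its
content should go, much as with the handler-based \acro{SGML} tools.
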
\begin{ctanrefs}
\item[Context]\CTANref{context}
\item[xmltex]\CTANref{xmltex}
\item[passivetex]\CTANref{passivetex}
\end{ctanrefs}
\LastEdit{2013-04-11}

\Question[Q-recovertex]{Retrieving \AllTeX{} from \acro{DVI}, etc.}

The job just can't be done automatically: \acro{DVI}, \PS{} and
\acro{PDF} are ``final'' formats, supposedly not susceptible to
further editing~--- information about where things came from has been
discarded.  So if you've lost your \AllTeX{} source (or never had the
source of a document you need to work on) you've a serious job on your
hands.  In many circumstances, the best strategy is to retype the
whole document, but this strategy is to be tempered by consideration
of the size of the document and the potential typists' skills.

If automatic assistance is necessary, it's unlikely that any more than
text retrieval is going to be possible; the \AllTeX{} markup that
creates the typographic effects of the document needs to be recreated
by editing.

If the file you have is in \acro{DVI} format, many of the techniques
for \Qref*{converting \AllTeX{} to \acro{ASCII}}{Q-toascii} are
applicable.  Consider \ProgName{dvi2tty}, \ProgName{crudetype} and
\ProgName{catdvi}.  Remember that there are likely to be problems
finding included material (such as included \PS{} figures, that
don't appear in the \acro{DVI} file itself), and mathematics is
unlikely to convert easily.

To retrieve text from \PS{} files, the
\ProgName{ps2ascii} tool (part of the
\href{http://www.ghostscript.com/}{\ProgName{ghostscript}}
distribution) is available.  One could try applying this tool to
\PS{} derived from a \acro{PDF} file using \ProgName{pdf2ps} (also
from the
\href{http://www.ghostscript.com/}{\ProgName{ghostscript}}
distribution), or \ProgName{Acrobat} \ProgName{Reader} itself; an
alternative is \ProgName{pdftotext}, which is distributed with
\ProgName{xpdf}.
Another avenue available to those with a \acro{PDF} file they want to
process is offered by Adobe \ProgName{Acrobat} (version 5 or later):
you can tag the \acro{PDF} file into a structured document, output
thence to well-formed \acro{XHTML}, and import the results into
Microsoft \ProgName{Word} (2000 or later).  From there, one may
convert to \AllTeX{} using one of the techniques discussed in % beware line break
\wideonly{``}\Qref[question]{converting to and from \AllTeX{}}{Q-fmtconv}\wideonly{''}.

The result will typically (at best) be poorly marked-up.  Problems may
also arise from the oddity of typical \TeX{} font encodings (notably
those of the maths fonts), which \ProgName{Acrobat} doesn't know how
to map to its standard Unicode representation.
\begin{ctanrefs}
\item[catdvi]\CTANref{catdvi}
\item[crudetype]\CTANref{crudetype}
\item[dvi2tty]\CTANref{dvi2tty}
\item[xpdf]Browse \CTANref{xpdf}
\end{ctanrefs}
\LastEdit{2013-04-16}

\Question[Q-LaTeXtoPlain]{Translating \LaTeX{} to \plaintex{}}

Unfortunately, no ``general'', simple, automatic process is likely to
succeed at this task.  See % ! line break
``\Qref*{How does \LaTeX{} relate to \plaintex{}}{Q-LaTeXandPlain}''
for further details.

Obviously, trivial documents will translate in a trivial way.
Documents that use even relatively simple things, such as labels and
references, are likely to cause trouble (\plaintex{} doesn't support
labels).

Translating a document designed to work with \LaTeX{} into one that
will work with \plaintex{} is likely to amount to carefully including
(or otherwise re-implementing) all those parts of \LaTeX{}, beyond the
provisions of \plaintex{}, which the document uses.

Some of this work has (in a sense) been done, in the port of the
\LaTeX{} graphics package to \plaintex{}.  However, while
\Package{graphics} is available, other complicated packages (notably
\Package{hyperref}) are not.
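
As an illustration of the graphics case, the sketch below shows a
\plaintex{} file using the ported package.  It assumes that the port
is loaded from a file called \texttt{graphicx.tex} (check the bundle's
own documentation), and that an image called \texttt{figure}, which is
merely a placeholder name, exists in a format your driver can handle.
\begin{quote}
\begin{verbatim}
% Plain TeX, not LaTeX: no \documentclass, no \begin{document}
\input graphicx   % assumed name of the loader in the plain TeX port
A figure included from plain \TeX:
\medskip
\centerline{\includegraphics[width=5cm]{figure}}
\bye
\end{verbatim}
\end{quote}
The command for including the image is the same as in \LaTeX{}; what
changes is everything around it, which must be written in \plaintex{}
terms.
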
The aspiring translator may find the \Qref*{\Eplain{}}{Q-eplain}
system a useful source of code.  (In fact, a light-weight system such
as \Eplain{} might reasonably be adopted as an alternative target of
translation, though it undoubtedly gives the user more than the
``bare minimum'' that \plaintex{} is designed to offer.)
\begin{ctanrefs}
\item[\nothtml{\begingroup\rmfamily}The\nothtml{\endgroup} eplain \nothtml{\rmfamily}system]%
  \CTANref{eplain}
\item[\nothtml{\begingroup\rmfamily'}\plaintex{}\nothtml{'\endgroup} graphics]%
  \CTANref{graphics-plain}
\end{ctanrefs}
\LastEdit{2011-05-30}