.\"
.\" tbl % | xroff -ms | lpr
.\"
.\" revision date - change whenever this file is edited
.ds RD 5 April 1991
.nr PO 1.2i \" page offset 1.2 inches
.nr PD .7v \" inter-paragraph distance
.\"
.EH 'RTF Processing Tool'- % -'Distribution 1.06a1'
.OH 'Distribution 1.06a1'- % -'RTF Processing Tool'
.OF 'Revision date:\0\0\*(RD''Printed:\0\0\n(dy \*(MO 19\n(yr'
.EF 'Revision date:\0\0\*(RD''Printed:\0\0\n(dy \*(MO 19\n(yr'
.\"
.\" subscript strings
.ds < \s-2\v'.4m'
.ds > \v'-.4m'\s+2
.\"
.\" I - italic font (taken from -ms and changed)
.de I
.nr PQ \\n(.f
.if t \&\\$3\\f2\\$1\\fP\&\\$2
.if n .if \\n(.$=1 \&\\$1
.if n .if \\n(.$>1 \&\\$1\c
.if n .if \\n(.$>1 \&\\$2
..
.de IS \" interface routine description header start
.DS L
.ta .8i
.ft B
..
.de IE \" interface routine description header end
.DE
.ft R
..
.TL
A Tool for RTF Processing
.sp .5v
Version 1.06a1
.AU
Paul DuBois
dubois@primate.wisc.edu
.AI
Wisconsin Regional Primate Research Center
Revision date:\0\0\*(RD
.NH
Introduction
.LP
This document describes a general purpose tool for processing RTF files\*-an RTF reader which may be configured in a well-defined manner to allow it to be used with a variety of writers generating different output formats. This provides a method for generating RTF-to-\fIXXX\fR translators. In theory.
.LP
I assume that you have some familiarity with RTF syntax and semantics, and that you're willing to study the source code of the RTF distribution described here. If you don't have the RTF specification, you can get it from the FTP site listed under ``Distribution Availability.'' References to ``the specification'' refer to that document.
.LP
If you use this tool and find that you have an RTF file that won't pass through the sample translator
.I rtf2null ,
or for which
.I rtf2null
announces unknown symbols, please contact me so the tool can be improved. It is best if you can supply the RTF file for which this behavior is observed.
.NH
Theory of Operation
.NH 2
Translator Architecture
.LP
There are three components to an RTF translator (at least as conceived here): reader code, writer code, and setup code. These break down as follows.
.IP reader\0\0
Responsible for peeling tokens out of the input stream, classifying them, and causing the writer to process them.
.IP writer\0\0
Responsible for translating tokens from the input stream into the required output format.
.IP setup\0\0
Responsible for making sure the reader and writer are initialized, and for calling the reader, to cause translation to occur.
.LP
This architecture allows the reader to remain constant, so that different translators can be built by supplying different writer and setup code.
.LP
In practice, to build a new translator, you supply a
.I main()
function and the writer code, and link in the RTF reader.
.I main()
includes the setup code and is responsible for seeing that the following are done:
.IP \(bu
.nr PD 0v
Process command-line arguments
.IP \(bu
Configure the reader, which may involve:
.RS
.IP \(bu
Reset the input stream if necessary
.IP \(bu
Configure other reader behavior, such as whether to process the font and color tables internally.
.IP \(bu
Install writer callbacks into the reader so it knows what functions to call when various kinds of tokens occur
.RE
.IP \(bu
Initialize the writer
.IP \(bu
Call the reader to process the input stream
.IP \(bu
.nr PD .7v
Terminate the writer
.LP
The minimal translator looks something like this:
.LP
.DS
# include <stdio.h>
# include <stdlib.h>
# include "rtf.h"

int main ()
{
    RTFInit ();
    RTFRead ();
    exit (0);
}
.DE
.LP
This initializes the reader, and calls it to read
.I stdin .
The writer portion is null (i.e., there is no writer), so all that happens is that the reader tokenizes the input and discards it. That isn't very interesting; most of the sample translators are examples of more elaborate translators.
.NH 2
Reader Operation
.LP
Tokens are classified using up to three numbers: token class, and major and minor numbers. The class number can be:
.LP
.DS
.ta 1.5i
rtfUnknown	unrecognized token
rtfGroup	``{'' or ``}''
rtfText	plain text character
rtfControl	token beginning with ``\e''
rtfEOF	fake class number; indicates end of input stream
.DE
.LP
There are some exceptions. A few tokens beginning with ``\e'' actually belong to other classes, a tab character is treated like ``\etab'', and unrecognized tokens are put in class
.I rtfUnknown
no matter what they look like.
.LP
Within a class, tokens are assigned a major number, and perhaps a minor number. For the
.I rtfText
class, the major number is the value of the character (0..255), and there is no minor number. For the
.I rtfControl
class, most tokens have both a major and minor number. For instance, all paragraph attribute control symbols have major number
.I rtfParAttr
and a minor number indicating which property, such as
.I rtfLeftIndent
or
.I rtfSpaceBefore .
A few oddball control tokens have no minor number.
.LP
A ``plain text'' character can be a literal character, a character specified in hex notation (``\e'\fIxx\fR'') or one of the special escaped characters (``\e{'', ``\e}'', ``\e\e''). The sequence ``\e:'' is treated as a plain text colon. This is arguably wrong; the rationale is given later under the description of the
.I RTFGetToken()
function.
.LP
Ideally, there should never be any tokens in the
.I rtfUnknown
class, but as the RTF standard continues to develop, unknown tokens are inevitable.
.LP
To write a translator, you'll need to familiarize yourself with the token classification scheme by reading
.I rtf.h .
A skeleton translator
.I rtfskel.c
is included with the distribution and may be used as a basis for new translators.
.LP
Each time a token is read, several global variables are set.
.I rtfClass ,
.I rtfMajor ,
and
.I rtfMinor
indicate the token class, and major and minor numbers.
(The major and minor numbers may be meaningless depending on the kind of token.) Control symbols may have a parameter value, e.g., ``\emargr720'' specifies a right margin of 720 (in units of twentieths of a point). The parameter value is stored in
.I rtfParam .
The text of the token (including the parameter text) is placed in
.I rtfTextBuf
and its length in
.I rtfTextLen .
.LP
If no parameter value is given,
.I rtfParam
is 0, which is indistinguishable from an explicitly specified parameter of ``0''. If you need to tell the difference, examine
.I rtfTextBuf[rtfTextLen-1]
to see if it's a digit or not.
.LP
The reader assumes a 7-bit character set. The specification indicates that character values \(>= 128 may be encoded with the ``\e'\fIxx\fR'' sequence. If the reader sees a character with the high bit set, it prints a message and exits.
.LP
Generally, a translator will configure the RTF reader to call particular writer functions when certain kinds of tokens are encountered in the input stream. These functions are known as
.I "class callbacks" .
Writer callbacks can be registered with the reader using
.I RTFSetClassCallback()
for each token class.
.LP
The reader reads each token, classifies it, and sends it to a token routing function,
.I RTFRouteToken() ,
which tries to find a writer callback function to process the token. Tokens in a given class are ignored if no callback is registered for the class.
.LP
Class callbacks make it quite easy to receive notification when certain types of tokens occur in the input.
For instance, a crude RTF text extractor could be written by installing a callback function for the
.I rtfText
class.\**
.FS
The reasons this is a crude translator are that: (i) some text characters occur in contexts where the characters are not intended to be output, e.g., font tables, stylesheets; (ii) character values greater than 127 probably should be translated into the normal ASCII range; (iii) some control symbols like ``\etab'' represent output text characters. The sample translator
.I rtf2text
addresses these problems in a (slightly) more sophisticated manner.
.FE
Whenever the function is invoked,
.I rtfMajor
will contain a value in the range 0..255 representing the character value.
.LP
.DS
# include <stdio.h>
# include <stdlib.h>
# include "rtf.h"

void TextCallback ()
{
    putchar (rtfMajor);
}

int main ()
{
    RTFInit ();
    RTFSetClassCallback (rtfText, TextCallback);
    RTFRead ();
    exit (0);
}
.DE
.LP
Callbacks for the
.I rtfControl
and
.I rtfGroup
classes typically operate by selecting on the token major number to determine the action to take. A callback for the
.I rtfGroup
class usually will do something like this:
.LP
.DS
void BraceCallback ()
{
    switch (rtfMajor)
    {
    case rtfBeginGroup:
        \fI...push state...\fR
        break;
    case rtfEndGroup:
        \fI...pop state...\fR
        break;
    }
}
.DE
.NH 2
Destination Readers
.LP
Grouping in RTF documents occurs within braces ``{'' and ``}''. One kind of group is the
.I destination .
In a destination group, the token immediately following the opening brace is a destination control symbol. These indicate such things as headers, footers, footnotes, etc.
.LP
Three destinations which specify information for internal use (i.e., information which affects output but isn't itself written) are the font table, color table and stylesheet. Since these three destinations occur so commonly and have a special syntax, the RTF reader by default gobbles them up itself when it recognizes them.
The functions which do this are called
.I "destination readers"
and are probably the nearest thing in the reader to what might be called parsers. They are installed by default so that translators can be written without the burden of understanding the syntax or digesting the contents of these destinations. Each of them constructs a list of the entries specified in the destination and the reader includes functions providing access to these lists.
.LP
Translators can turn off or override these defaults with
.I RTFSetDestinationCallback()
if necessary. To override one, pass the address of a different destination reader function. To turn one off, pass NULL.
.LP
Destination callbacks may be installed for any destination, not just
.I rtfFontTbl ,
.I rtfColorTbl
and
.I rtfStyleSheet .
Destinations for which no callback is registered are not treated specially.
.LP
Other destinations for which there is a default reader are the information (``\einfo'') and picture (``\epict'') destinations; all they do is skip to the end of the group.
.NH 3
Using the Built-in Destination Readers
.LP
The font table, color table and stylesheet information is maintained internally, and the reader either acts on that information itself, or allows itself to be queried by the writer about it, as described below. These descriptions do not apply if the translator shuts off or overrides the default destination readers, of course.
.LP
\fBStylesheet\*-\fRThe reader acts on this itself. When the stylesheet destination is encountered, the style contents are remembered. Thereafter, whenever the writer receives notification that a style number control symbol (``\es\fInnn\fR'') has occurred, it can call
.I RTFExpandStyle(rtfParam)
to cause the style to be expanded. The reader consults the contents of the stylesheet and each token in the style definition is routed in turn back to the writer. This effects a sort of macro expansion.
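.LP
The macro-expansion idea can be illustrated with a small self-contained model. This is only a sketch and is not the code in reader.c: the Token and Style types, the token strings, and the names ExpandStyle, CountToken, DemoExpand, styleList, writerCallback and tokensSeen are all hypothetical, chosen for illustration. It shows a remembered style definition being replayed, token by token, through the same kind of callback the writer would use for ordinary input.

```c
#include <stddef.h>

/* Toy model only: the real reader tracks class, major, minor and param per token. */
typedef struct Token {
    const char   *text;     /* token text, e.g. "\\qc" */
    struct Token *next;     /* next token in the style definition */
} Token;

typedef struct Style {
    int           num;      /* style number, as in \sNNN */
    Token        *tokens;   /* tokens remembered from the stylesheet */
    struct Style *next;
} Style;

Style *styleList = NULL;                /* hypothetical remembered stylesheet */
void (*writerCallback)(const char *);   /* stand-in for a writer class callback */
int tokensSeen = 0;

/* Hypothetical writer callback: just counts the tokens routed to it. */
void CountToken(const char *text)
{
    (void) text;
    tokensSeen++;
}

/* Replay each token of style `num` through the writer; do nothing if no such style. */
void ExpandStyle(int num)
{
    Style *s;
    Token *t;

    for (s = styleList; s != NULL; s = s->next)
        if (s->num == num)
            break;
    if (s == NULL)
        return;
    for (t = s->tokens; t != NULL; t = t->next)
        if (writerCallback != NULL)
            writerCallback(t->text);
}

/* Build style 3 from two tokens, expand it, and report how many tokens the writer saw. */
int DemoExpand(void)
{
    static Token t2 = { "\\fi720", NULL };
    static Token t1 = { "\\qc", &t2 };
    static Style s3 = { 3, &t1, NULL };

    styleList = &s3;
    writerCallback = CountToken;
    tokensSeen = 0;
    ExpandStyle(3);     /* routes \qc and \fi720 to the writer */
    ExpandStyle(99);    /* unknown style number: nothing is routed */
    return tokensSeen;
}
```

In this model the writer cannot tell replayed tokens from tokens read from the input, which is the property the real reader relies on when it routes style tokens back to the writer.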
.LP
If the writer doesn't care about style expansion, it simply refrains from calling
.I RTFExpandStyle() .
.LP
If the writer wants information about a style, it can call
.I RTFGetStyle() .
.LP
\fBFont table\*-\fRFor each entry in the font table, the font number, type and name are maintained by the reader. The writer finds out that a font number has been specified in the input when its control class callback is invoked and
.I rtfMajor
\(eq
.I rtfCharAttr
and
.I rtfMinor
\(eq
.I rtfFontNum .
To obtain a pointer to the appropriate
.I RTFFont
structure, the reader function
.I RTFGetFont(rtfParam)
may be called.
.LP
\fBColor table\*-\fRFor each entry in the color table, the color number is maintained along with the red, green and blue values. The writer finds out that a color number has been specified in the input when its control class callback is invoked and
.I rtfMajor
\(eq
.I rtfCharAttr
and
.I rtfMinor
\(eq
.I rtfColorNum .
To obtain a pointer to the appropriate
.I RTFColor
structure, the reader function
.I RTFGetColor(rtfParam)
may be called.
.LP
One subtle point about the built-in destination readers: destinations cannot be recognized until
.I after
the occurrence of the ``{'' symbol that begins the destination. This means the writer, if it maintains a state stack, will already have pushed a state. In order to allow the writer to properly pop that state in response to the ``}'', these destination readers feed the ``}'' back into the token router after they pull it from the input stream. What the writer actually sees is a ``{'' followed immediately by a ``}''.
.LP
Applications that maintain a state stack may find it necessary to do something similar if they supply their own destination readers.
.NH
Programming Interface
.LP
Source files using the RTF reader should #include
.I rtf.h .
.I reader.c
should be compiled to produce
.I reader.o ,
which should be part of the final application link.
.LP
The best way to learn how these source files work is to study the sample translators, which vary in complexity from very simple (e.g.,
.I rtf2text ,
.I rtfwc ),
to wretchedly messy (e.g.,
.I rtf2troff ).
You should be aware that one implication of the way the translators are built (callbacks and switch statements) is that it's quite easy to build them incrementally. You can start with a very bare-bones model, and start plugging in callbacks as you progress. Within the callbacks, your switch statements can progressively handle more cases.
.LP
An alternative approach is to start with a copy of
.I rtfskel.c ,
which includes a full set of class callbacks and complete switch statements for all tokens. Each case is empty; you simply add code for those cases you want to handle. You can also rip out the code for the cases you know you'll never care about.
.NH 2
Global variables
.LP
The global RTF reader variables are:
.LP
.DS
.ta .6i 2i
int	rtfClass;	token class
int	rtfMajor;	token major number
int	rtfMinor;	token minor number
int	rtfParam;	parameter value for control symbols
char	rtfTextBuf[rtfBufSiz];	token text
int	rtfTextLen;	length of token text
.DE
.LP
These variables always apply to the token with which the writer should be concerned. This may be either the last token read or the current token within a style which is being reprocessed.
.NH 2
Functions
.IS
void	RTFInit ()
.IE
Initialize the RTF reader. This is the first RTF routine that should be called. It performs some initialization such as computing hash values for the token lookup table and installation of the built-in destination readers.
.LP
.I RTFInit()
may be called multiple times. Each invocation resets the reader's state completely, except that the input stream is not disturbed.
.IS
void	RTFRead ()
.IE
.I RTFRead()
calls
.I RTFGetToken()
to tokenize the input stream and
.I RTFRouteToken()
to process each token, until input is exhausted.
When
.I RTFRead()
returns, input has been completely read and the writer can perform any cleanup or termination needed.
.LP
If you want to read multiple files per invocation of your translator, you should do all your setup prior to each call to
.I RTFRead() .
That is, you should call
.I RTFInit() ,
install callbacks, etc., then call
.I RTFRead() .
.IS
void	RTFRouteToken ()
.IE
This routine decides what to do with the current token and routes it to the correct place for processing. Usually this is directly to the writer via a class callback. The token is
.I not
passed to the writer (i.e., the class callback is bypassed) when it is a destination token for which a reader callback is installed.
.LP
By default, built-in readers are installed for font table, color table, stylesheet and information and picture group destinations. The built-in readers can be disabled if the writer wants to see all tokens directly.
.IS
int	RTFGetToken ()
.IE
Reads one token from the input stream, classifies it, sets the global variables, and returns the class number. If the class is
.I rtfEOF ,
the end of the input stream has been reached. Newlines (``\en'') and carriage returns (``\er'') are silently discarded by
.I RTFGetToken() ,
as they have no meaning. Both are passed to the read hook if one is installed, however.
.LP
The sequence ``\e:'' is treated as a plain text character, with
.I rtfClass
set to
.I rtfText
and
.I rtfMajor
set to the colon ASCII code. Strictly speaking, ``\e:'' is the control word for an index subentry, but some versions of Microsoft Word write out plain text colons with a preceding backslash, while others don't. This unfortunate ambiguity results in an ugly dilemma.
It seems the lesser burden to require translators to recognize that plain text colons should ``really'' be treated as index subentry indicators while inside of an index entry destination, than to recognize that an index subentry control word should ``really'' be treated as a plain text colon everywhere else.
.LP
Writers probably should not need to use
.I RTFGetToken()
directly unless they install their own destination readers. One reason you might want to call it is to implement a ``peek at next token'' capability. Call
.I RTFGetToken()
and examine the global variables. Then call
.I RTFRouteToken()
to cause the symbol to be processed normally. This way you get to look at the token before it goes through the usual routing mechanism.
.IS
void	RTFSetToken (class, major, minor, param, text)
int	class, major, minor, param;
char	*text;
.IE
It is sometimes useful to construct a fake token and run it through the token router to cause the effects of the token to be applied.
.I RTFSetToken()
allows you to do this, by setting the reader's global variables to the values supplied. If
.I param
is non-negative, the token text
.I rtfTextBuf
is constructed from
.I text
and
.I param ,
otherwise
.I rtfTextBuf
is just copied from
.I text .
.IS
void	RTFSetReadHook (f)
void	(*f) ();
.IE
Install a function to be called by
.I RTFGetToken()
after each token is read from the input stream. The function takes no arguments and returns no value. Within the function, information about the current token can be obtained from the global variables. This function is for token examination purposes only, and should not modify those variables.
.IS
void	(*RTFGetReadHook ()) ()
.IE
Returns a pointer to the current read hook, or NULL if there isn't one.
.IS
void	RTFSkipGroup ()
.IE
This function can be called to skip to the end of the current group (including any subgroups).
It's useful for explicitly ignoring ``\e*\e\fIdest\fR'' groups, where
.I dest
is an unrecognized destination, or for causing groups that you don't want to deal with to effectively ``disappear'' from the input stream.
.LP
Calling this function in the middle of expanding a style may cause problems. However, it is typically called when you have just seen a destination symbol, which won't happen during a style expansion\*-I think.
.LP
Be careful with this function if your writer maintains a state stack, because you will already have pushed a state when the opening group brace was seen. After
.I RTFSkipGroup()
returns, the group closing brace has been read, and you'll need to pop a state. All global token variables will still be set to the closing brace, so you may only need to call
.I RTFRouteToken()
to cause the state to be unstacked.
.IS
void	RTFExpandStyle (num)
int	num;
.IE
Performs style expansion of the given style number, or does nothing if there is no such style. The writer should call this when it notices that the current token is a style number indicator.
.IS
void	RTFSetStream (stream)
FILE	*stream;
.IE
Redirects the RTF reader to the given stream. This should be called before any reading is done. The default input stream is
.I stdin .
An alternative to
.I RTFSetStream()
is to simply
.I freopen()
the input file on
.I stdin
(that's what all the sample translators do).
.LP
The input stream is
.I not
modified by
.I RTFInit() .
.IS
void	RTFSetClassCallback (class, callback)
int	class;
void	(*callback) ();
.IE
Installs a writer callback function for the given token class. The first argument is a class number, the second is the function to call when tokens from that class are encountered in the input stream. This will cause
.I RTFRouteToken()
to invoke the callback when it encounters a token in the class. If
.I callback
is NULL (which is the default for all classes), tokens in the class are ignored, i.e., discarded.
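.LP
The routing rule just described (invoke the class callback if one is installed, otherwise drop the token) can be modeled in a few lines. This is a hedged sketch and not the implementation in reader.c: every name beginning with ``toy'' or ``Toy'', the class numbering, and the CountText and DemoRoute helpers are hypothetical, and the model leaves out the destination callbacks that the real router also consults.

```c
#include <stddef.h>

/* Hypothetical class numbers; the real values are defined in rtf.h. */
enum { toyUnknown, toyGroup, toyText, toyControl, toyEOF, toyMaxClass };

typedef void (*ToyCallback)(void);

int toyClass;                               /* class of the current token */
static ToyCallback callbacks[toyMaxClass];  /* static storage: all NULL by default */

/* Install (or, with NULL, remove) the callback for a token class. */
void ToySetClassCallback(int cls, ToyCallback cb)
{
    if (cls >= 0 && cls < toyMaxClass)
        callbacks[cls] = cb;
}

/* Return the callback for a class, or NULL if none is installed. */
ToyCallback ToyGetClassCallback(int cls)
{
    return (cls >= 0 && cls < toyMaxClass) ? callbacks[cls] : NULL;
}

/* Route the current token: call the class callback, or discard the token. */
void ToyRouteToken(void)
{
    ToyCallback cb = ToyGetClassCallback(toyClass);

    if (cb != NULL)
        cb();
}

int textTokens = 0;

/* Hypothetical writer callback for the text class. */
void CountText(void)
{
    textTokens++;
}

/* Route two text tokens and one control token; only the text tokens are counted. */
int DemoRoute(void)
{
    textTokens = 0;
    ToySetClassCallback(toyText, CountText);
    toyClass = toyText;    ToyRouteToken();     /* callback installed: counted */
    toyClass = toyControl; ToyRouteToken();     /* no callback: discarded */
    toyClass = toyText;    ToyRouteToken();     /* counted */
    return textTokens;
}
```

Because the table starts out empty, a writer that installs no callbacks sees nothing at all, which matches the behavior of the minimal translator.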
.LP
The callback should take no arguments and return no value. Within the callback, information about the current token can be obtained from the global variables.
.LP
Installing a callback for the
.I rtfEOF
``class'' is silly and has no effect.
.IS
void	(*RTFGetClassCallback (class)) ()
int	class;
.IE
Returns a pointer to the callback function for the given token class, or NULL if there isn't one.
.IS
void	RTFSetDestinationCallback (dest, callback)
int	dest;
void	(*callback) ();
.IE
Installs a callback function for the given destination
.I dest "" (
is a token minor number). When
.I RTFRouteToken()
sees a token with class
.I rtfControl
and major number
.I rtfDestination ,
it checks whether there is a callback for the destination indicated by the minor number. If so, it invokes it. If
.I callback
is NULL, the given destination is not treated specially (the control class callback is invoked as usual). By default, destination callbacks are installed for the font table, color table, stylesheet, and information and picture group.
.LP
The callback should take no arguments and return no value. When the function is invoked, the current token will be the destination token following the destination's initial opening brace ``{''. (For optional destinations, the destination token follows the ``\e*'' symbol.)
.IS
void	(*RTFGetDestinationCallback (dest)) ()
int	dest;
.IE
Returns a pointer to the callback function for the given destination, or NULL if there isn't one.
.IS
RTFStyle	*RTFGetStyle (num)
int	num;
.IE
Returns a pointer to the
.I RTFStyle
structure for the given style number. The ``Normal'' style number is 0. Pass \-1 to get a pointer to the first style in the list. Styles are not stored in any particular order.
.LP
Be sure to check the result; it might be NULL.
.LP
This function is meaningless if the default stylesheet destination reader is overridden.
.IS
RTFFont	*RTFGetFont (num)
int	num;
.IE
Returns a pointer to the
.I RTFFont
structure for the given font number.
Pass \-1 to get a pointer to the first font in the list. Fonts are not stored in any particular order.
.LP
Be sure to check the result; it might be NULL. In particular, you might think that passing the number specified with the ``\edeff'' (default font) control symbol would always yield a valid font structure, but that's not true. The default font might not be listed in the font table.
.LP
This function is meaningless if the default font table destination reader is overridden.
.IS
RTFColor	*RTFGetColor (num)
int	num;
.IE
Returns a pointer to the
.I RTFColor
structure for the given color number. (I think black is 0.) Pass \-1 to get a pointer to the first color in the list. Colors are not stored in any particular order. If the color values in the entry are \-1, the default color should be used. The default color is writer-dependent.
.LP
Be sure to check the result; it might be NULL. I think this means you should use the default color.
.LP
This function is meaningless if the default color table destination reader is overridden.
.IS
int	RTFCheckCM (class, major)
int	class, major;
.IE
Returns non-zero if
.I rtfClass
and
.I rtfMajor
are equal to
.I class
and
.I major ,
respectively, zero otherwise.
.IS
int	RTFCheckCMM (class, major, minor)
int	class, major, minor;
.IE
Returns non-zero if
.I rtfClass ,
.I rtfMajor
and
.I rtfMinor
are equal to
.I class ,
.I major
and
.I minor ,
respectively, zero otherwise.
.IS
int	RTFCheckMM (major, minor)
int	major, minor;
.IE
Returns non-zero if
.I rtfMajor
and
.I rtfMinor
are equal to
.I major
and
.I minor ,
respectively, zero otherwise.
.IS
char	*RTFAlloc (size)
int	size;
.IE
Returns a pointer to a block of memory
.I size
bytes long, or NULL if insufficient memory was available.
.IS
char	*RTFStrSave (s)
char	*s;
.IE
Allocates a block of memory big enough for a copy of the given string (including terminating null byte), copies the string into it, and returns a pointer to the copy. Returns NULL if insufficient memory was available.
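.LP
For readers unfamiliar with the idiom, a string-saving helper of this kind can be sketched with the standard C library. This is an illustration of the behavior described above, not the distribution's RTFStrSave(): the name StrSaveSketch is hypothetical, and the NULL-argument check is an extra precaution that the description above does not promise.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of a string-saving helper; hypothetical, not the code in reader.c. */
char *StrSaveSketch(const char *s)
{
    char *p;

    if (s == NULL)                  /* extra guard, an assumption of this sketch */
        return NULL;
    p = malloc(strlen(s) + 1);      /* +1 for the terminating null byte */
    if (p == NULL)
        return NULL;                /* insufficient memory */
    return strcpy(p, s);            /* strcpy returns its destination, p */
}
```

A caller would check the result for NULL before using it and release the copy when done; in the sketch that means free(), whereas code using the real routines would pass the pointer to RTFFree().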
.IS
void	RTFFree (p)
char	*p;
.IE
Frees the block of memory pointed to by
.I p ,
which should have been allocated by
.I RTFAlloc()
or
.I RTFStrSave() .
It is safe to pass NULL to this routine.
.NH
Distribution Availability
.LP
This software may be redistributed without restriction and used for any purpose whatsoever.
.LP
The RTF distribution is available for anonymous
.I ftp
access in the
.I ~ftp/pub/RTF
directory on host
.I indri.primate.wisc.edu
(Internet address 128.104.230.11). Updates appear there as they become available.
.LP
A version of the RTF specification is available in this directory, as either a binhex'ed Word for Macintosh document, or in RTF format. It is known to have a few errors, as it's a scanned version of a paper copy. Some of these errors have been fixed, but others remain (see, for instance, the example table text on page 17). The document is not quite as up to date as the one sent out by Microsoft, but it is much more complete than the one beginning ``Specification for RTF'' that may be found on some other archive sites.
.LP
If you do not have Internet access, send a request to one of the following:
.DS
.ta 1.2i
Internet:	software-request@primate.wisc.edu
UUCP:	rhesus!software-request
.DE
Bug reports and questions should be sent to one of these addresses as well.
.LP
If you use this software as the basis for a translator not included in the current collection, please consider contributing it for inclusion in a future distribution. In particular, an RTF-to-LaTeX translator seems to be an item of interest. I don't use LaTeX myself and am unlikely to write one, but it would probably be fairly popular.