.\"
.\" tbl % | xroff -ms | lpr
.\"
.\" revision date - change whenever this file is edited
.ds RD 5 April 1991
.nr PO 1.2i	\" page offset 1.2 inches
.nr PD .7v	\" inter-paragraph distance
.\"
.EH 'RTF Processing Tool'- % -'Distribution 1.06a1'
.OH 'Distribution 1.06a1'- % -'RTF Processing Tool'
.OF 'Revision date:\0\0\*(RD''Printed:\0\0\n(dy \*(MO 19\n(yr'
.EF 'Revision date:\0\0\*(RD''Printed:\0\0\n(dy \*(MO 19\n(yr'
.\"
.\" subscript strings
.ds < \s-2\v'.4m'
.ds > \v'-.4m'\s+2
.\"
.\" I - italic font (taken from -ms and changed)
.de I
.nr PQ \\n(.f
.if t \&\\$3\\f2\\$1\\fP\&\\$2
.if n .if \\n(.$=1 \&\\$1
.if n .if \\n(.$>1 \&\\$1\c
.if n .if \\n(.$>1 \&\\$2
..
.de IS	\" interface routine description header start
.DS L
.ta .8i
.ft B
..
.de IE	\" interface routine description header end
.DE
.ft R
..
.TL
A Tool for RTF Processing
.sp .5v
Version 1.06a1
.AU
Paul DuBois
dubois@primate.wisc.edu
.AI
Wisconsin Regional Primate Research Center
Revision date:\0\0\*(RD
.NH
Introduction
.LP
This document describes a general purpose tool for processing RTF
files\*-an RTF reader which may be configured in a
well-defined manner to allow it to be used with a variety of writers
generating different output formats.
This provides a method for generating RTF-to-\fIXXX\fR translators.
In theory.
.LP
I assume that you have some familiarity with
RTF syntax and semantics, and that you're willing to study the source
code of the RTF distribution described here.
If you don't have the RTF specification, you can get it from the FTP site
listed under ``Distribution Availability.''
References to ``the specification'' refer to that document.
.LP
If you use this tool and find that you have an RTF file that won't pass
through the sample translator
.I rtf2null ,
or for which
.I rtf2null
announces unknown symbols, please contact me so the tool can
be improved.
It is best if you can supply the RTF file for which this behavior is observed.
.NH
Theory of Operation
.NH 2
Translator Architecture
.LP
There are three components to an RTF translator (at least as conceived
here): reader code, writer code, and setup code.
These break down as follows.
.IP reader\0\0
Responsible for peeling tokens out of the input stream,
classifying them, and causing the writer to process them.
.IP writer\0\0
Responsible for translating tokens from the input
stream into the required output format.
.IP setup\0\0
Responsible for making sure the reader and writer are initialized, and
for calling the reader, to cause translation to occur.
.LP
This architecture allows the reader to remain constant, so that different
translators can be built by supplying different writer and setup code.
.LP
In practice, to build a new translator, you supply a
.I main()
function and the writer code, and link in the RTF reader.
.I main()
includes the setup code and is responsible for seeing that the following are done:
.IP \(bu
.nr PD 0v
Process command-line arguments
.IP \(bu
Configure the reader, which may involve:
.RS
.IP \(bu
Reset the input stream if necessary
.IP \(bu
Configure other reader behavior, such as whether to
process the font and color tables internally
.IP \(bu
Install writer callbacks into the reader so it knows what functions to
call when various kinds of tokens occur
.RE
.IP \(bu
Initialize the writer
.IP \(bu
Call the reader to process the input stream
.IP \(bu
.nr PD .7v
Terminate the writer
.LP
The minimal translator looks something like this:
.LP
.DS
# include	<stdio.h>
# include	<stdlib.h>
# include	"rtf.h"

int main ()
{
	RTFInit ();
	RTFRead ();
	exit (0);
}
.DE
.LP
This initializes the reader, and calls it to read
.I stdin .
The writer portion is null (i.e., there is no writer), so all that happens
is that the reader tokenizes the input and discards it.
That isn't very interesting;
most of the sample translators are examples of more elaborate translators.
.NH 2
Reader Operation
.LP
Tokens are classified using up to three numbers: token class, and major and
minor numbers.
The class number can be:
.LP
.DS
.ta 1.5i
rtfUnknown	unrecognized token
rtfGroup	``{'' or ``}''
rtfText	plain text character
rtfControl	token beginning with ``\e''
rtfEOF	fake class number; indicates end of input stream
.DE
.LP
There are some exceptions.
A few tokens beginning with ``\e'' actually belong to other classes,
a tab character is treated like ``\etab'',
and unrecognized tokens are put in class
.I rtfUnknown
no matter what they look like.
.LP
Within a class, tokens are assigned a major number, and perhaps a
minor number.
For the
.I rtfText
class, the major number is the value of the character (0..255), and there
is no minor number.
For the
.I rtfControl
class, most tokens have both a major and minor number.
For instance, all paragraph attribute control symbols have major number
.I rtfParAttr
and a minor number indicating which property, such as
.I rtfLeftIndent
or
.I rtfSpaceBefore .
A few oddball control tokens have no minor number.
.LP
A ``plain text'' character can be a literal character, a character specified
in hex notation (``\e'\fIxx\fR''), or one of the special escaped characters
(``\e{'', ``\e}'', ``\e\e'').
The sequence ``\e:'' is treated as a plain text colon.
This is arguably wrong; the rationale is given later under the
description of the
.I RTFGetToken()
function.
.LP
Ideally, there should never be any tokens in the
.I rtfUnknown
class, but as the RTF standard continues to develop, unknown tokens are
inevitable.
.LP
To write a translator, you'll need to familiarize yourself with
the token classification scheme by reading
.I rtf.h .
A skeleton translator
.I rtfskel.c
is included with the distribution and may be used as a basis for new
translators.
.LP
Each time a token is read, several global variables are set.
.I rtfClass ,
.I rtfMajor ,
and
.I rtfMinor
indicate the token class, and major and minor numbers.
(The major and minor numbers may be meaningless depending on the kind
of token.)
Control symbols may have a parameter value, e.g., ``\emargr720''
specifies a right margin of 720 twentieths of a point (half an inch).
The parameter value is stored in
.I rtfParam .
The text of the token (including the parameter text) is placed in
.I rtfTextBuf
and its length in
.I rtfTextLen .
.LP
If no parameter value is given,
.I rtfParam
is 0, which is indistinguishable from an explicitly specified
parameter of ``0''.
If you need to tell the difference, examine
.I rtfTextBuf[rtfTextLen-1]
to see if it's a digit or not.
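.LP
The check just described can be wrapped in a small helper.
The following is only a sketch;
.I TokenHasParam()
is a hypothetical name, not part of the reader, and the token text and
length are passed as arguments so that the fragment stands alone.
In a translator you would pass
.I rtfTextBuf
and
.I rtfTextLen .
.LP
.DS
```c
#include <ctype.h>
#include <stddef.h>

/*
 * Return 1 if the token text ends in a digit, i.e., the control
 * symbol carried an explicit parameter, and 0 otherwise.
 */
int TokenHasParam (const char *buf, int len)
{
	if (buf == NULL || len <= 0)
		return (0);
	return (isdigit ((unsigned char) buf[len - 1]) != 0);
}
```
.DE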
.LP
The reader assumes a 7-bit character set.
The specification indicates that character values \(>= 128 may be encoded with
the ``\e'\fIxx\fR'' sequence.
If the reader sees a character with the high bit set, it prints a message
and exits.
.LP
Generally, a translator will configure the RTF reader
to call particular writer
functions when certain kinds of tokens are encountered in the input
stream.
These functions are known as
.I "class callbacks" .
Writer callbacks can be registered with the reader using
.I RTFSetClassCallback()
for each token class.
.LP
The reader reads each token, classifies it, and sends it to a token routing
function
.I RTFRouteToken() ,
which tries to find a writer callback function to process the token.
Tokens in a given
class are ignored if no callback is registered for the class.
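.LP
The routing rule just described amounts to a table lookup.
The fragment below is a self-contained sketch of that idea, not the
reader's actual code: the class constants, the
.I curClass
variable and all function names are stand-ins invented for illustration,
and the real
.I RTFRouteToken()
also consults destination callbacks, which are described later.
.LP
.DS
```c
#include <stddef.h>

/* Stand-in class numbers; the real ones are defined in rtf.h. */
enum { clsUnknown, clsGroup, clsText, clsControl, clsCount };

typedef void (*ClassCallback) (void);

static ClassCallback	callbacks[clsCount];	/* all NULL initially */
static int		curClass;		/* class of current token */
static int		textCount;		/* bumped by CountText() */

/* Example writer callback: count plain text tokens. */
void CountText (void)
{
	textCount++;
}

/* Register (or, with NULL, remove) the callback for a class. */
void SetCallback (int cls, ClassCallback cb)
{
	if (cls >= 0 && cls < clsCount)
		callbacks[cls] = cb;
}

/*
 * Invoke the callback for the current token's class, if any.
 * Returns 1 if a callback ran, 0 if the token was ignored.
 */
int RouteCurrentToken (void)
{
	if (curClass < 0 || curClass >= clsCount)
		return (0);
	if (callbacks[curClass] == NULL)
		return (0);
	(*callbacks[curClass]) ();
	return (1);
}
```
.DE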
.LP
Class callbacks make it quite easy to receive notification when
certain types of tokens occur in the input.
For instance, a crude RTF text extractor could be written by
installing a callback function for the
.I rtfText
class.\**
.FS
This is a crude translator because:
(i) some text characters occur in contexts where the characters
are not intended to be output, e.g., font tables, stylesheets; (ii)
character values greater than 127 probably should be translated into the normal
ASCII range; (iii) some control symbols like ``\etab'' represent
output text characters.
The sample translator
.I rtf2text
addresses these problems in a (slightly) more sophisticated manner.
.FE
Whenever the function is invoked,
.I rtfMajor
will contain a value in the range 0..255 representing the character value.
.LP
.DS
# include	<stdio.h>
# include	<stdlib.h>
# include	"rtf.h"

void TextCallback ()
{
	putchar (rtfMajor);
}

int main ()
{
	RTFInit ();
	RTFSetClassCallback (rtfText, TextCallback);
	RTFRead ();
	exit (0);
}
.DE
.LP
Callbacks for the
.I rtfControl
and
.I rtfGroup
classes
typically operate by selecting on the token major number to determine
the action to take.
A callback for the
.I rtfGroup
class usually will do something like this:
.LP
.DS
void BraceCallback ()
{
	switch (rtfMajor)
	{
	case rtfBeginGroup:
		\fI...push state...\fR
		break;
	case rtfEndGroup:
		\fI...pop state...\fR
		break;
	}
}
.DE
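.LP
The ``push state'' and ``pop state'' steps can be as simple as copying a
structure onto a fixed-size stack.
The following is a minimal sketch; the
.I WriterState
structure, its fields, the depth limit and the function names are all
hypothetical, and a real writer would track whatever properties its
output format needs.
.LP
.DS
```c
#define	MAX_DEPTH	64	/* hypothetical nesting limit */

typedef struct
{
	int	bold;		/* hypothetical character attribute */
	int	fontNum;	/* hypothetical current font number */
} WriterState;

static WriterState	stateStack[MAX_DEPTH];
static int		stateDepth = 0;
static WriterState	curState;	/* state in effect right now */

/* Save the current state; returns 0 if nesting is too deep. */
int PushState (void)
{
	if (stateDepth >= MAX_DEPTH)
		return (0);
	stateStack[stateDepth++] = curState;
	return (1);
}

/* Restore the saved state; returns 0 on an unmatched brace. */
int PopState (void)
{
	if (stateDepth <= 0)
		return (0);
	curState = stateStack[--stateDepth];
	return (1);
}
```
.DE
.LP
With these,
.I BraceCallback()
would call
.I PushState()
for
.I rtfBeginGroup
and
.I PopState()
for
.I rtfEndGroup .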
.NH 2
Destination Readers
.LP
Grouping in RTF documents occurs within braces ``{'' and ``}''.
One kind of group is the
.I destination .
In a destination group, the token immediately following the opening
brace is a destination control symbol.
Destination symbols introduce such things as headers, footers and footnotes.
.LP
Three destinations which specify information for internal use (i.e.,
information which affects
output but isn't itself written) are the font table, color table and
stylesheet.
Since these three destinations occur so commonly and have a special syntax,
the RTF reader by default gobbles them up itself when it recognizes them.
The functions which do this are called
.I "destination readers"
and are probably the nearest thing in the reader to what might be
called parsers.
They are installed by default so that translators can be written
without the burden of understanding the syntax or digesting the
contents of these destinations.
Each of them constructs a list of the entries specified in the
destination and the reader includes functions providing access to
these lists.
.LP
Translators can turn off or override these defaults with
.I RTFSetDestinationCallback()
if necessary.
To override one, pass the address of a different destination reader
function.
To turn one off, pass NULL.
.LP
Destination callbacks may be called for any destination, not just
.I rtfFontTbl ,
.I rtfColorTbl
and
.I rtfStyleSheet .
Destinations for which no callback is registered are not treated
specially.
.LP
Default readers are also installed for the information (``\einfo'') and
picture (``\epict'') destinations; all these readers do is skip to the
end of the group.
.NH 3
Using the Built-in Destination Readers
.LP
The font table, color table and stylesheet information is maintained
internally, and the reader either acts on that information itself, or
allows itself to be queried by the writer about it, as described
below.
These descriptions do not apply if the translator shuts off or
overrides the default destination readers, of course.
.LP
\fBStylesheet\*-\fRThe reader acts on this itself.
When the stylesheet destination is encountered, the style contents are
remembered.
Thereafter, whenever the writer receives notification that a style number
control symbol (``\es\fInnn\fR'') has occurred, it can call
.I RTFExpandStyle(rtfParam)
to cause the style to be expanded.
The reader consults the contents of the stylesheet and each
token in the style definition is routed in turn back to the writer.
This effects a sort of macro expansion.
.LP
If the writer doesn't care about style expansion, it simply refrains
from calling
.I RTFExpandStyle() .
.LP
If the writer wants information about a style, it can call
.I RTFGetStyle() .
.LP
\fBFont table\*-\fRFor each entry in the font table, the font number,
type and name are maintained by the reader.
The writer finds out that a font number has been specified in the input
when its control class callback is invoked and
.I rtfMajor
\(eq
.I rtfCharAttr
and
.I rtfMinor
\(eq
.I rtfFontNum .
To obtain a pointer to the appropriate
.I RTFFont
structure, the reader function
.I RTFGetFont(rtfParam)
may be called.
.LP
\fBColor table\*-\fRFor each entry in the color table, the color number
is maintained along with the red, green and blue values.
The writer finds out that a color number has been specified in the input
when its control class callback is invoked and
.I rtfMajor
\(eq
.I rtfCharAttr
and
.I rtfMinor
\(eq
.I rtfColorNum .
To obtain a pointer to the appropriate
.I RTFColor
structure, the reader function
.I RTFGetColor(rtfParam)
may be called.
.LP
One subtle point about the built-in destination readers:
destinations cannot be recognized until
.I after
the occurrence of the ``{'' symbol that begins the destination.
This means the writer, if it maintains a state stack, will already
have pushed a state.
In order to allow the writer to properly pop that state in response
to the ``}'', these
destination readers feed the ``}'' back into the token router after
they pull it from the input stream.
What the writer actually sees is a ``{'' followed immediately by a
``}''.
.LP
Applications that maintain a state stack may find it necessary to do
something similar if they supply their own destination readers.
.NH
Programming Interface
.LP
Source files using the RTF reader should #include
.I rtf.h .
.I reader.c
should be compiled to produce
.I reader.o ,
which should be part of the final application link.
.LP
The best way to learn how these source files work is to study the sample
translators, which vary in complexity from very simple (e.g.,
.I rtf2text ,
.I rtfwc ),
to wretchedly messy (e.g.,
.I rtf2troff ).
You should be aware that one implication of the way the translators
are built (callbacks and switch statements)
is that it's quite easy to build them incrementally.
You can start with a very bare-bones model, and start plugging in
callbacks as you progress.
Within the callbacks, your switch statements can progressively handle
more cases.
.LP
An alternative approach is to start with a copy of
.I rtfskel.c ,
which includes a full set of class callbacks and complete switch statements
for all tokens.
Each case is empty; you simply add code for those cases you want to handle.
You can also rip out the code for the cases you know you'll never care about.
.NH 2
Global variables
.LP
The global RTF reader variables are:
.LP
.DS
.ta .6i 2i
int	rtfClass;	token class
int	rtfMajor;	token major number
int	rtfMinor;	token minor number
int	rtfParam;	parameter value for control symbols
char	rtfTextBuf[rtfBufSiz];	token text
int	rtfTextLen;	length of token text
.DE
.LP
These variables always apply to the token with which the writer should
be concerned.
This may be either the last token read or the current token within a
style which is being reprocessed.
.NH 2
Functions
.IS
void RTFInit ()
.IE
Initialize the RTF reader.
This is the first RTF routine that should be called.
It performs some initialization, such as computing hash values for the
token lookup table and installing the built-in destination readers.
.LP
.I RTFInit()
may be called multiple times.
Each invocation resets the reader's state completely, except that the input
stream is not disturbed.
.IS
void RTFRead ()
.IE
.I RTFRead()
calls
.I RTFGetToken()
to tokenize the input stream
and
.I RTFRouteToken()
to process each token, until input is exhausted.
When
.I RTFRead()
returns, input has been completely read and the writer can perform any
cleanup or termination needed.
.LP
If you want to read multiple files per invocation of your translator,
you should do all your setup prior to each call to
.I RTFRead() .
That is, you should call
.I RTFInit() ,
install callbacks, etc., then call
.I RTFRead() .
.IS
void RTFRouteToken ()
.IE
This routine decides what to do with the current token and routes it
to the correct place for processing.
Usually this is directly to the writer via a class callback.
The token is
.I not
passed to the writer (i.e., the class callback is bypassed) when it
is a destination token for which a reader callback is installed.
.LP
By default, built-in readers are installed
for the font table, color table, stylesheet, and information and picture
group destinations.
The built-in readers can be disabled
if the writer wants to see all tokens directly.
.IS
int RTFGetToken ()
.IE
Reads one token from the input stream, classifies it, sets the global
variables, and returns the class number.
If the class is
.I rtfEOF
the end of the input stream has been reached.
Newlines (``\en'') and carriage returns (``\er'')
are silently discarded by
.I RTFGetToken() ,
as they have no meaning.
Both are passed to the read hook if one is installed, however.
.LP
The sequence ``\e:'' is treated as a plain text character, with
.I rtfClass
set to
.I rtfText
and
.I rtfMajor
set to the colon ASCII code.
Strictly speaking, ``\e:'' is the control word for an index
subentry, but some versions of Microsoft Word write out plain text
colons with a preceding backslash, while others don't.
This unfortunate ambiguity results in an ugly dilemma.
It seems the lesser burden to require translators to recognize
that plain text colons should ``really'' be treated as index subentry
indicators
while inside of an index entry destination, than to recognize that an
index subentry control word should ``really'' be treated as a plain
text colon everywhere else.
.LP
Writers probably should not need to use
.I RTFGetToken()
directly unless they install their own destination readers.
One reason you might want to call it is to implement a ``peek at next token''
capability.
Call
.I RTFGetToken()
and examine the global variables.
Then call
.I RTFRouteToken()
to cause the token to be processed normally.
This way you get to look at the token before it goes through the usual routing
mechanism.
.IS
void RTFSetToken (class, major, minor, param, text)
int	class, major, minor, param;
char	*text;
.IE
It is sometimes useful to construct a fake token and run it through the
token router to cause the effects of the token to be applied.
.I RTFSetToken()
allows you to do this, by setting the reader's global variables to the
values supplied.
If
.I param
is non-negative, the token text
.I rtfTextBuf
is constructed from
.I text
and
.I param ,
otherwise
.I rtfTextBuf
is just copied from
.I text .
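.LP
The rule for building the token text can be pictured as follows.
This is a guess at what ``constructed from'' means, namely the text
followed by the parameter in decimal, and is not taken from the reader
source;
.I BuildTokenText()
is a hypothetical name.
.LP
.DS
```c
#include <stdio.h>

/*
 * Write text into buf, followed by param in decimal when param is
 * non-negative.  Returns 1 on success, 0 if buf is too small.
 */
int BuildTokenText (char *buf, size_t size, const char *text, int param)
{
	int	n;

	if (param >= 0)
		n = snprintf (buf, size, "%s%d", text, param);
	else
		n = snprintf (buf, size, "%s", text);
	return (n >= 0 && (size_t) n < size);
}
```
.DE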
.IS
void RTFSetReadHook (f)
void	(*f) ();
.IE
Install a function to be called by
.I RTFGetToken()
after each token is read from the input stream.
The function takes no arguments and returns no value.
Within the function,
information about the current token can be obtained from the global
variables.
This function is for token examination purposes only, and should not
modify those variables.
.IS
void (*RTFGetReadHook ()) ()
.IE
Returns a pointer to the current read hook, or NULL if there isn't one.
.IS
void RTFSkipGroup ()
.IE
This function can be called to skip to the end of the current group (including
any subgroups).
It's useful for explicitly ignoring ``\e*\e\fIdest\fR'' groups, where
.I dest
is an unrecognized destination, or for causing groups that you don't want
to deal with to effectively ``disappear'' from the input stream.
.LP
Calling this function in the middle of expanding a style may cause problems.
However, it is typically called when you have just seen a destination symbol,
which won't happen during a style expansion\*-I think.
.LP
Be careful with this function if your writer maintains a state stack,
because you will already have pushed a state when the opening group
brace was seen.
After
.I RTFSkipGroup()
returns, the group closing brace has been read, and you'll need to pop
a state.
All global token variables will still be set to the closing brace, so
you may only need to call
.I RTFRouteToken()
to cause the state to be unstacked.
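.LP
Skipping a group, subgroups included, comes down to counting braces.
The sketch below shows the idea on an array of token kinds rather than on
a live input stream, so that it stands alone; the
.I tokBegin ,
.I tokEnd
and
.I tokOther
constants and the
.I SkipGroupTokens()
function are hypothetical and do not show how the reader itself is
implemented.
.LP
.DS
```c
/* Stand-in token kinds: opening brace, closing brace, anything else. */
enum { tokBegin, tokEnd, tokOther };

/*
 * pos indexes the first token after a group's opening brace.
 * Returns the index of the closing brace that ends that group,
 * or -1 if the input runs out first.
 */
int SkipGroupTokens (const int *tokens, int count, int pos)
{
	int	level = 1;	/* the opening brace was already seen */

	for (; pos < count; pos++)
	{
		if (tokens[pos] == tokBegin)
			level++;
		else if (tokens[pos] == tokEnd && --level == 0)
			return (pos);
	}
	return (-1);
}
```
.DE
.LP
The position returned is that of the closing brace itself, which mirrors
the point made above: when the skip is finished, the closing brace is
the current token and the writer still has a state to pop.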
.IS
void RTFExpandStyle (num)
int	num;
.IE
Performs style expansion of the given style number, or does nothing
if there is no such style.
The writer should call this when it notices that the current token
is a style number indicator.
.IS
void RTFSetStream (stream)
FILE	*stream;
.IE
Redirects the RTF reader to the given stream.
This should be called before any reading is done.
The default input stream is
.I stdin .
An alternative to
.I RTFSetStream()
is to simply
.I freopen()
the input file on
.I stdin
(that's what all the sample translators do).
.LP
The input stream is
.I not
modified by
.I RTFInit() .
.IS
void RTFSetClassCallback (class, callback)
int	class;
void	(*callback) ();
.IE
Installs a writer callback function for the given token class.
The first argument is a class number, the second is the function to
call when tokens from that class are encountered in the input stream.
This will cause
.I RTFRouteToken()
to invoke the callback when it encounters a token in the class.
If
.I callback
is NULL (which is the default for all classes),
tokens in the class are ignored, i.e., discarded.
.LP
The callback should take no arguments and return no value.
Within the callback,
information about the current token can be obtained from the global
variables.
.LP
Installing a callback for the
.I rtfEOF
``class'' is silly and has no effect.
.IS
void (*RTFGetClassCallback (class)) ()
int	class;
.IE
Returns a pointer to the callback function for the given token class,
or NULL if there isn't one.
.IS
void RTFSetDestinationCallback (dest, callback)
int	dest;
void	(*callback) ();
.IE
Installs a callback function for the given destination
.I dest "" (
is a token minor number).
When
.I RTFRouteToken()
sees a token with class
.I rtfControl
and major number
.I rtfDestination ,
it checks whether there is a callback for the destination indicated by
the minor number.
If so, it invokes it.
If
.I callback
is NULL, the given destination is not treated specially (the control
class callback is invoked as usual).
By default, destination callbacks are installed for the font table, color
table, stylesheet, and information and picture group.
.LP
The callback should take no arguments and return no value.
When the function is invoked, the current token will be the destination
token following the destination's initial opening brace ``{''.
(For optional destinations, the destination token follows the ``\e*''
symbol.)
.IS
void (*RTFGetDestinationCallback (dest)) ()
int	dest;
.IE
Returns a pointer to the callback function for the given destination,
or NULL if there isn't one.
.IS
RTFStyle *RTFGetStyle (num)
int	num;
.IE
Returns a pointer to the
.I RTFStyle
structure for the given style number.
The ``Normal'' style number is 0.
Pass \-1 to get a pointer to the first style in the list.
Styles are not stored in any particular order.
.LP
Be sure to check the result; it might be NULL.
.LP
This function is meaningless if the default stylesheet destination
reader is overridden.
.IS
RTFFont *RTFGetFont (num)
int	num;
.IE
Returns a pointer to the
.I RTFFont
structure for the given font number.
Pass \-1 to get a pointer to the first font in the list.
Fonts are not stored in any particular order.
.LP
Be sure to check the result; it might be NULL.
In particular, you might think that passing the number specified with
the ``\edeff'' (default font) control symbol would always yield a
valid font structure, but that's not true.
The default font might not be listed in the font table.
.LP
This function is meaningless if the default font table destination
reader is overridden.
.IS
RTFColor *RTFGetColor (num)
int	num;
.IE
Returns a pointer to the
.I RTFColor
structure for the given color number.
(I think black is 0.)
Pass \-1 to get a pointer to the first color in the list.
Colors are not stored in any particular order.
If the color values in the entry are \-1, the default color should be used.
The default color is writer-dependent.
.LP
Be sure to check the result; it might be NULL.
I think this means you should use the default color.
.LP
This function is meaningless if the default color table destination
reader is overridden.
.IS
int RTFCheckCM (class, major)
int	class, major;
.IE
Returns non-zero if
.I rtfClass
and
.I rtfMajor
are equal to
.I class
and
.I major ,
respectively, zero otherwise.
.IS
int RTFCheckCMM (class, major, minor)
int	class, major, minor;
.IE
Returns non-zero if
.I rtfClass ,
.I rtfMajor
and
.I rtfMinor
are equal to
.I class ,
.I major
and
.I minor ,
respectively, zero otherwise.
.IS
int RTFCheckMM (major, minor)
int	major, minor;
.IE
Returns non-zero if
.I rtfMajor
and
.I rtfMinor
are equal to
.I major
and
.I minor ,
respectively, zero otherwise.
.IS
char *RTFAlloc (size)
int	size;
.IE
Returns a pointer to a block of memory
.I size
bytes long, or NULL if insufficient memory was available.
.IS
char *RTFStrSave (s)
char	*s;
.IE
Allocates a block of memory big enough for a copy of the given string
(including terminating null byte), copies the string into it, and returns
a pointer to the copy.
Returns NULL if insufficient memory was available.
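.LP
In terms of the standard library, the behavior described amounts to the
following.
This is a sketch of the equivalent logic using
.I malloc() ,
not the distribution's own code, and
.I StrSaveSketch()
is a hypothetical name.
.LP
.DS
```c
#include <stdlib.h>
#include <string.h>

/*
 * Return a freshly allocated copy of s, or NULL if s is NULL or
 * memory is exhausted.  The caller frees the copy.
 */
char *StrSaveSketch (const char *s)
{
	char	*p;

	if (s == NULL)
		return (NULL);
	p = malloc (strlen (s) + 1);	/* +1 for the terminating null byte */
	if (p != NULL)
		strcpy (p, s);
	return (p);
}
```
.DE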
.IS
void RTFFree (p)
char	*p;
.IE
Frees the block of memory pointed to by
.I p ,
which should have been allocated by
.I RTFAlloc()
or
.I RTFStrSave() .
It is safe to pass NULL to this routine.
.NH
Distribution Availability
.LP
This software may be redistributed without restriction and used for
any purpose whatsoever.
.LP
The RTF distribution is available for anonymous 
.I ftp
access in the
.I ~ftp/pub/RTF
directory on host
.I indri.primate.wisc.edu
(Internet address 128.104.230.11).
Updates appear there as they become available.
.LP
A version of the RTF specification is available in this directory,
as either a binhex'ed Word for Macintosh document, or in RTF format.
It is known to have a few errors, as it's a scanned version of a paper
copy.
Some of these errors have been fixed, but others remain (see, for instance,
the example table text on page 17).
The document is not quite as up to date as the one sent out by Microsoft,
but it is much more complete
than the one beginning ``Specification for RTF'' that may be found on
some other archive sites.
.LP
If you do not have Internet access, send a request to one of the following:
.DS
.ta 1.2i
Internet:	software-request@primate.wisc.edu
UUCP:		rhesus!software-request
.DE
Bug reports and questions should be sent to one of these addresses as
well.
.LP
If you use this software as the basis for a translator not included in
the current collection, please consider contributing it for inclusion
in a future distribution.
In particular, an RTF-to-LaTeX translator seems to be an item of interest.
I don't use LaTeX myself and am unlikely to write one, but it would probably
be fairly popular.