\title{Hyphenating British English}
\author[Philip Taylor]{Philip Taylor\\RHBNC\\\texttt{P.Taylor@Vax.Rhbnc.Ac.Uk}}
\begin{Article}

\noindent
Many members of \ukt\ will already be aware that an enormous debt of
gratitude is owed to Dominik Wujastyk, who undertook the initial
generation of a set of hyphenation patterns for \TeX\ which were based
on a British (as opposed to American) dictionary.  That debt of
gratitude is also owed to Oxford University Press, who donated their
internal word-list of some 160$\,$000 entries with primary, secondary and
tertiary breakpoints shewn as well as a `frequency-of-use' index for
each word.

Dominik struggled against seemingly insuperable odds to process this
vast word-list; the standard \emph{Patgen} simply wasn't up to the task, and
despite the best efforts of Peter Breitenlohner, Wayne Sullivan and
many others, an attempt to build a suitably large DOS/Pascal version
was doomed to failure.  In the end, Dominik discovered the \emph{web2c}
implementation of Karl Berry, and this, together with D J Delorie's
DJGPP C compiler, eventually enabled him to build a version of \emph{Patgen}
which could cope with a 160$\,$000-entry word-list.

But although he did not know it, his troubles were but starting: once
he could read the word-list, he had to supply values for three of the
most cryptic and arcane variables in the known \TeX\ world: +good_wt+,
+bad_wt+ and +threshold+.  These three variables control the entire pattern
generation process, yet even their inventor, Frank Liang, was forced to
confess in his Ph.D.\ thesis (``Word Hy-phen-a-tion by Com-put-er'') that
he was unable to justify the values which he had used to generate the
American patterns other than by purely empirical means. And so Dominik,
too, used Frank's values, and produced patterns which, statistically
at least, were as valid as Frank Liang's.  Dominik recorded
his experiences in a talk given at the UK \TeX\ Users' Group
Easter meeting, held at RHBNC last year.
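
A rough sketch of how the three parameters interact, based on the
selection test in the published \emph{Patgen} program (the symbols $g$
and $b$, and the numbers below, are introduced here purely for
illustration): if a candidate pattern would correctly predict $g$
hyphens in the word-list and wrongly predict $b$, it is accepted when
\[
  \mbox{\texttt{good\_wt}} \times g \;-\; \mbox{\texttt{bad\_wt}} \times b
  \;\ge\; \mbox{\texttt{threshold}}.
\]
Taking, hypothetically, weights of 1 and 2 and a threshold of 20, a
candidate with $g=30$ and $b=4$ scores $1\times30 - 2\times4 = 22$ and
is accepted, whereas one with $g=30$ and $b=6$ scores $18$ and is not.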

However, the generation of patterns is not a once-and-forever task:
those patterns which Dominik had produced were larger than the
American equivalent, requiring, on some systems at least, either a
specially `large' \TeX\ or a \TeX\ tuned to accommodate a
larger pattern set.  Furthermore they correctly hyphenated only 90\% of
the words in the 160$\,$000-entry word-list, missing about 10\%
completely.  There were also a few words which it was known would be
hyphenated incorrectly using Dominik's patterns, and which were
subsequently documented in the distributed \texttt{ukhyphen.tex}.
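
Exceptions of this kind are declared to \TeX\ with the
\texttt{\char`\\hyphenation} primitive.  A minimal sketch follows; the
word is borrowed from the title of Liang's thesis purely for
illustration and is not taken from the distributed file, and a British
dictionary may well divide it differently:
\begin{verbatim}
% illustrative exception only: break-points as in Liang's title
\hyphenation{hy-phen-a-tion}
\end{verbatim}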

With a sabbatical year in India on the horizon, Dominik felt that it
was time to hand over the baton; he had created a viable set of
patterns, and if someone else wanted to improve on them, that was up
to them.  As Dominik knew that I had a considerable interest in
pattern generation, and that I had, in fact, offered to run
\emph{Patgen} on my VAX/VMS system if he had been unable to get a copy
working on any of the systems to which he had access, he asked if I
would like to become `custodian of the patterns', and I willingly
agreed.  After all, Dominik had done all the hard work --- acquired a
suitable machine-readable dictionary, created the initial pattern set,
ascertained suitable values for +good_wt+, +bad_wt+,
+threshold+{\ldots} So my task should be infinitely more
straightforward: just build on what Dominik had already done.

But of course, life is rarely that straightforward: as soon as I came
to build a large \emph{Patgen} for VAX/VMS, I discovered that the
Kellerman \& Smith changefile in my possession no longer worked.
Furthermore, K\&S were unwilling to allow it to pass into the public
domain, so any development work on it would have been futile.  My
saviour turned out to be Christian Spieler, who had already ported the
remainder of the standard \TeX\ distribution to Alpha/VMS; only
\emph{Patgen} remained, and once I had explained to him the importance
of that little-known utility, he willingly and promptly undertook an
Alpha/VMS port, including as standard the additional workspace which
it was known would be required.  Within 24 hours a test version was
ready, and it worked beyond my wildest dreams: no second version was
needed; the very first version went straight into production, and that
same day I was able to produce a set of patterns which, statistically
at least, were as good as those produced by Dominik.

But just as Dominik had had to battle with +good_wt+, +bad_wt+ and
+threshold+, I too had my own windmills at which to tilt: in my case
the problem came about because Christian had, very reasonably, based
\emph{his} implementation on \emph{Patgen2} (Peter Breitenlohner's
8-bit modifications to DEK's standard 7-bit \emph{Patgen}).  And \emph{Patgen2} has
four new variables with which to cope: +hyph_start+, +hyph_finish+,
+pat_start+ and +pat_finish+!

Fortunately for me, these are nowhere near as arcane as +good_wt+ and
its ilk: the two +hyph_+ parameters allow multiple passes through the
dictionary to be subsumed into a single run, whilst the two +pat_+
parameters allow the minimum and maximum length of pattern for each
pass to be separately specified.  I do not pretend for one instant
that I \emph{fully} understand these, and I certainly don't pretend to
have more than the vaguest comprehension of the full implications of
+good_wt+, etc., but at least I can now generate patterns to my
heart's content, and the Alpha is busy doing that at the very time
that I am writing this report\ldots

Between now and the time of publication of the next \BV, I hope to
have a much clearer understanding of the possible interactions between
the various parameters to \emph{Patgen}.  And I hope, too, to have
prepared a new set of patterns which the UK community will be able to
adopt as a standard, together with a minimal set of exceptions which I
am sure will still be necessary.  But work will not then stop: I have
already enlisted the help of a friend and sometime colleague, Chris
McManus, who I hope will be able to define some \emph{rules} for the
choice of values for the various parameters (Chris is a medic,
statistician and polymath \emph{extraordinaire}, and if anyone can
formulate rules for this problem, I am convinced that it is he); and
between us I hope that we will be able to publish some guidelines for
the use of \emph{Patgen2} --- guidelines which are sadly lacking at
the moment.

And finally I hope that you, too --- the UK \TeX\ Community --- will
contribute to this project: for someone has to identify the mistakes
which the patterns allow, and such a task is far beyond the ability of
any one individual to undertake.  Once a new definitive set of
patterns is announced, I will ask you all to look carefully at every
document that you typeset thereafter; to note whenever a hyphenation
looks strange; to check it against a definitive list of valid
hyphenation points (I am using ``The Oxford Minidictionary of Spelling
and Word-Division'', but pointers to other definitive sources will be
most welcome); and if you find a genuine instance of a
wrongly-hyphenated word, then \emph{please} report it to me.  I will
probably set up an e-mail list solely for this purpose, since I lose
paper mail almost by definition whilst e-mail remains accessible in
perpetuity.

So, to summarise: building on previous work by Don Knuth, Frank Liang,
Peter Breitenlohner, Oxford University Press, Dominik Wujastyk and
Christian Spieler (doubtless among many others), I am now in a
position to generate British English hyphenation patterns.  In
conjunction with Chris McManus, I hope to be able to
formalise much that has been heuristic, or at best stochastic, in the
past.  And with your help, I hope to be able to produce not only a
definitive set of British English patterns, but an equally definitive
(but, one hopes, very small!) set of exceptions.  I look forward to
this challenge very much indeed.

\end{Article}