\title{Hyphenating British English}
\author[Philip Taylor]{Philip Taylor\\RHBNC\\\texttt{P.Taylor@Vax.Rhbnc.Ac.Uk}}
\begin{Article}
\noindent Many members of \ukt\ will already be aware that an enormous debt of gratitude is owed to Dominik Wujastyk, who undertook the initial generation of a set of hyphenation patterns for \TeX\ which were based on a British (as opposed to American) dictionary. That debt of gratitude is also owed to Oxford University Press, who donated their internal word-list of some 160$\,$000 entries with primary, secondary and tertiary breakpoints shewn as well as a `frequency-of-use' index for each word.

Dominik struggled against seemingly insuperable odds to process this vast word-list; the standard \emph{Patgen} simply wasn't up to the task, and despite the best efforts of Peter Breitenlohner, Wayne Sullivan and many others, an attempt to build a suitably large DOS/Pascal version was doomed to failure. In the end, Dominik discovered the \emph{web2c} implementation of Karl Berry, and this, together with D J Delorie's DJGPP C compiler, eventually enabled him to build a version of \emph{Patgen} which could cope with a 160$\,$000-entry word-list.

But although he did not know it, his troubles were but starting: once he could read the word-list, he had to supply values for three of the most cryptic and arcane variables in the known \TeX\ world: +good_wt+, +bad_wt+ and +threshold+. These three variables control the entire pattern generation process, yet even their inventor, Frank Liang, was forced to confess in his Ph.D.\ thesis (``Word Hy-phen-a-tion by Com-put-er'') that he was unable to justify the values which he had used to generate the American patterns other than by purely empirical means. And so Dominik, too, used Frank's values, and produced patterns which, statistically at least, were as valid as Frank Liang's.
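
For readers who have not met these three parameters, their role can be sketched as follows. This is only a summary of the selection rule as it is described in the \emph{Patgen} documentation, not a recipe for choosing values; the symbols $g$ and $b$ are introduced here purely for illustration. Suppose that, at a hyphen-finding level, a candidate pattern would correctly identify $g$ hyphens in the word-list and would wrongly suggest $b$ others. The pattern is then accepted when
\[
  \mathit{good\_wt}\times g \;-\; \mathit{bad\_wt}\times b \;\ge\; \mathit{threshold}.
\]
As a worked case with made-up values (assumed only for the arithmetic, and not taken from any published parameter set), let +good_wt+ be 1, +bad_wt+ be 2 and +threshold+ be 20. A pattern with $g=30$ and $b=4$ scores $1\times 30 - 2\times 4 = 22$ and is accepted; a pattern with $g=30$ and $b=6$ scores $1\times 30 - 2\times 6 = 18$ and is rejected. Raising +bad_wt+ therefore penalises patterns that introduce errors, whilst raising +threshold+ demands more evidence before any pattern is admitted; how best to balance the two remains, as noted above, an empirical matter.
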
Dominik recorded his experiences in a talk which he gave to the UK \TeX\ Users' Group Easter meeting which was held at RHBNC last year.

However, the generation of patterns is not a once-and-forever task: those patterns which Dominik had produced were larger than the American equivalent, requiring, for some systems at least, either a specially `large' \TeX\ or a \TeX\ tuned to accommodate a larger pattern set. Furthermore, they correctly hyphenated only 90\% of the words in the 160$\,$000-entry word-list, missing about 10\% completely. There were also a few words which it was known would be hyphenated incorrectly using Dominik's patterns, and which were subsequently documented in the distributed \texttt{ukhyphen.tex}.

With a sabbatical year in India on the horizon, Dominik felt that it was time to hand over the baton; he had created a viable set of patterns, and if someone else wanted to improve on them, that was up to them. As Dominik knew that I had a considerable interest in pattern generation, and that I had, in fact, offered to run \emph{Patgen} on my VAX/VMS system if he had been unable to get a copy working on any of the systems to which he had access, he asked if I would like to become `custodian of the patterns', and I willingly agreed. After all, Dominik had done all the hard work --- acquired a suitable machine-readable dictionary, created the initial pattern set, ascertained suitable values for +good_wt+, +bad_wt+, +threshold+{\ldots} So my task should be infinitely more straightforward: just build on what Dominik had already done.

But of course, life is rarely that straightforward: as soon as I came to build a large \emph{Patgen} for VAX/VMS, I discovered that the Kellerman \& Smith changefile which I had no longer worked. Furthermore, K\&S were unwilling to allow it to pass into the public domain, so any development work on it would have been futile.
My saviour turned out to be Christian Spieler, who had already ported the remainder of the standard \TeX\ distribution to Alpha/VMS; only \emph{Patgen} remained, and once I had explained to him the importance of that little-known utility, he willingly and promptly undertook an Alpha/VMS port, including as standard the additional workspace which it was known would be required. Within 24 hours a test version was ready, and it worked beyond my wildest dreams: no second version was needed; the very first version went straight into production, and that same day I was able to produce a set of patterns which, statistically at least, were as good as those produced by Dominik.

But just as Dominik had had to battle with +good_wt+, +bad_wt+ and +threshold+, I too had my own windmills at which to tilt: in my case the problem came about because Christian had, very reasonably, based \emph{his} implementation on \emph{Patgen2} (Peter Breitenlohner's 8-bit modifications to DEK's standard 7-bit \emph{Patgen}). And \emph{Patgen2} has four new variables with which to cope: +hyph_start+, +hyph_finish+, +pat_start+ and +pat_finish+! Fortunately for me, these are nowhere near as arcane as +good_wt+ and its ilk: the two +hyph_+ parameters allow multiple passes through the dictionary to be subsumed into a single run, whilst the two +pat_+ parameters allow the minimum and maximum length of pattern for each pass to be separately specified. I do not pretend for one instant that I \emph{fully} understand these, and I certainly don't pretend to have more than the vaguest comprehension of the full implications of +good_wt+, etc., but at least I can now generate patterns to my heart's content, and the Alpha is busy doing that at the very time that I am writing this report\ldots

Between now and the time of publication of the next \BV, I hope to have a much clearer understanding of the possible interactions between the various parameters to \emph{Patgen}.
And I hope, too, to have prepared a new set of patterns which the UK community will be able to adopt as a standard, together with a minimal set of exceptions which I am sure will still be necessary.

But work will not then stop: I have already enlisted the help of a friend and sometime colleague, Chris McManus, who I hope will be able to define some \emph{rules} for the choice of values for the various parameters (Chris is a medic, statistician and polymath \emph{extraordinaire}, and if anyone can formulate rules for this problem, I am convinced that it is he); and between us I hope that we will be able to publish some guidelines for the use of \emph{Patgen2} --- guidelines which are sadly lacking at the moment.

And finally I hope that you, too --- the UK \TeX\ Community --- will contribute to this project: for someone has to identify the mistakes which the patterns allow, and such a task is far beyond the ability of any one individual to undertake. Once a new definitive set of patterns is announced, I will ask you all to look carefully at every document that you typeset thereafter; to note whenever a hyphenation looks strange; to check it against a definitive list of valid hyphenation points (I am using ``The Oxford Minidictionary of Spelling and Word-Division'', but pointers to other definitive sources will be most welcome); and, if you find a genuine instance of a wrongly-hyphenated word, then \emph{please} to report it to me. I will probably set up an e-mail list solely for this purpose, since I lose paper mail almost by definition whilst e-mail remains accessible in perpetuity.

So, to summarise: building on previous work by Don Knuth, Frank Liang, Peter Breitenlohner, The Oxford University Press, Dominik Wujastyk and Christian Spieler (doubtless among many others), I am now in a position to generate British English hyphenation patterns.
In conjunction with Chris McManus, I hope that we will be able to formalise much that has been heuristic, or at best stochastic, in the past. And with your help, I hope to be able to produce not only a definitive set of British English patterns, but an equally definitive (but, one hopes, very small!) set of exceptions. I look forward to this challenge very much indeed. \end{Article}