[Corpora-List] American and British English spelling converter
Ben Hutchinson
ben.hutch at gmail.com
Fri Nov 3 00:12:03 UTC 2006
The Stanford NLP Group's POS tagger does some pre-processing
that converts British spellings to US spellings based on variations in
the spellings of certain common words and word endings.
As an example of how it modifies word endings, it tags
"sour flour our dour parlour rigour glamour colour Harbour"
as
"sour/JJ flour/NN our/PRP$ dour/NN parlor/NN rigor/NN glamor/NN
color/NN Harbor/NNP".
It even Americanizes unknown words ending in "-our", so, for example,
it tags "nonsensour" as "nonsensor". Sometimes it is a bit overeager,
as in "devour" -> "devor/NN".
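A rough Python sketch of this kind of suffix rule, for anyone who wants to experiment. This is not the Stanford code: the function names and the exception list KEEP_OUR are made up for illustration, and the list is far from complete.

```python
import re

# Hypothetical, incomplete exception list: words ending in "-our" that are
# spelled the same way in British and American English.
KEEP_OUR = {
    "our", "sour", "flour", "dour", "hour", "four", "pour", "tour",
    "your", "scour", "devour", "contour", "detour",
}


def americanize_our(word):
    """Rewrite a trailing "-our" as "-or" unless the word is an exception."""
    lower = word.lower()
    if lower in KEEP_OUR or not lower.endswith("our"):
        return word
    # Keep the case of the suffix so "COLOUR" becomes "COLOR".
    suffix = "OR" if word[-3:].isupper() else "or"
    return word[:-3] + suffix


def americanize_text(text):
    """Apply the rule to every alphabetic token, leaving the rest untouched."""
    return re.sub(r"[A-Za-z]+", lambda m: americanize_our(m.group(0)), text)


print(americanize_text("sour flour our dour parlour rigour glamour colour Harbour"))
```

With "devour" in the exception list it is left alone, while an unknown word such as "nonsensour" still becomes "nonsensor", so the rule stays only as good as the list behind it.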
The tagger is under the GNU GPL, so I think it should be possible
to adapt the Java code to suit your requirements as long as any changes
you redistribute stay under the same license. I also think it should be fairly
straightforward to invert their algorithm, although it's a while since
I looked at the source. It is available from
http://nlp.stanford.edu/software/index.shtml
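For the US -> UK direction, a blanket "-or" -> "-our" rule would mangle words like "doctor", so one simple option is an explicit lookup table. A minimal Python sketch follows, assuming a hand-built word list; US_TO_UK and the function names are hypothetical and have nothing to do with the tagger's own code.

```python
import re

# Hypothetical, tiny mapping for illustration; a real list would be much longer.
US_TO_UK = {
    "color": "colour",
    "harbor": "harbour",
    "parlor": "parlour",
    "rigor": "rigour",
    "glamor": "glamour",
    "realize": "realise",
}


def match_case(source, target):
    """Give target the same capitalization pattern as source."""
    if source.isupper():
        return target.upper()
    if source[:1].isupper():
        return target.capitalize()
    return target


def britishize(text):
    """Replace each known US spelling with its UK form, keeping capitalization."""
    def swap(match):
        word = match.group(0)
        uk = US_TO_UK.get(word.lower())
        return match_case(word, uk) if uk else word

    return re.sub(r"[A-Za-z]+", swap, text)


print(britishize("The Harbor doctor chose a color"))
```

A table like this never touches words it does not know, which trades coverage for safety; inflected forms ("colors", "realized") would need their own entries or a stemming step.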
On 03/11/06, Martin Wynne <martin.wynne at oucs.ox.ac.uk> wrote:
> If you find such a program, let us know, and we can run it over the BNC
> and change the 5849 occurrences of 'realize' and inflected forms to
> 'realise' etc., and otherwise correct British English to your preferred
> spellings ;)
>
> Martin Krallinger wrote:
>
> > Dear all,
> >
> > I was looking for some simple tool (preferably in Python) which
> > is able to do automatic conversion of texts (or words) from
> > British English (UK) to American (US) English and vice versa.
> > (Example: realize <-> realise)
> >
> > This seems to be an easy task, but I could not find any ready-to-use
> > standalone tool capable of performing this task.
> >
> > I want to integrate this application into an information extraction
> > system which handles scientific literature.
> >
> > I am also interested in references where aspects related to US/UK English
> > spelling have been analyzed in the context of information extraction, text
> > mining and terminology extraction.
> >
> > Best regards,
> >
> >
> > Martin
> >
> >
>
>
>