[Corpora-List] neologism finder tools

Antoinette Renouf ant at rdues.liv.ac.uk
Thu Jun 12 15:28:46 UTC 2003


Dear Eric and Sylvana,
As Eric says, `new' has to be defined with respect to `old'. We define
this in three different ways. In each case, I use `corpus' to refer to
a chronological flow of electronic text (e.g. newspaper text) that is
marked for approximate authorial date.

One method is to bootstrap, by which I mean start with the first text
chunk (day, week, month or quarter one) and assume all its words are new,
store them, run them against time-chunk 2, find the differences and call
those `time-chunk 2 new words', add these to the stored list, run the
cumulative list against chunk 3, and so on. This method means that you
will have a graph of neologistic occurrence which registers 100% for
time-chunk 1, but which evens out after a few months, as the normal rate
of coinage is allowed to show through. At that point, you can disregard
the data for the first few months statistically; linguistically, you
treat those so-called new items with a pinch of salt and a lot of
intuition.
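
As a rough illustration of the bootstrap procedure, here is a minimal
Python sketch. It is not the AVIATOR/APRIL software; the function names,
the crude tokenizer and the toy chunks are assumptions made purely for
the example.

```python
import re

def tokenize(text):
    # Crude tokenizer (an assumption): lowercase alphabetic strings,
    # allowing internal hyphens and apostrophes.
    return re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())

def bootstrap_new_words(chunks):
    """Walk the time-chunks in chronological order.

    For each chunk, return the word types not seen in any earlier chunk,
    plus the percentage of that chunk's types which are new.
    """
    seen = set()      # cumulative list of every word type met so far
    report = []
    for text in chunks:
        types = set(tokenize(text))
        new_words = types - seen
        rate = 100.0 * len(new_words) / len(types) if types else 0.0
        report.append((new_words, rate))
        seen |= types  # the cumulative list grows before the next chunk
    return report

# Toy stand-ins for dated text chunks (hypothetical data).
chunks = [
    "the minister spoke to the press",
    "the minister spoke about the euro",
    "the press blogged about the euro",
]

for i, (new_words, rate) in enumerate(bootstrap_new_words(chunks), start=1):
    print(f"chunk {i}: {rate:.0f}% new {sorted(new_words)}")
```

Chunk 1 always registers 100%, since nothing has been seen before it;
only the later chunks say anything about the rate of coinage.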

The second method is related: you take the first 3 months or so of text
flow and make a wordlist of it, which becomes your master wordlist, the
one you deem for convenience to represent `the established lexicon' at
start of play. You run the 4th month of text against it, and so on.
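
A minimal Python sketch of this second method follows. The function
names and the toy monthly texts are hypothetical. Whether the master
wordlist should absorb each later month's new words is left open by
`and so on', so the `cumulative` flag is an assumption of the sketch.

```python
import re

def word_types(text):
    # Crude tokenizer (an assumption): lowercase alphabetic strings only.
    return set(re.findall(r"[a-z]+", text.lower()))

def new_against_master(monthly_texts, baseline_months=3, cumulative=True):
    """Build a master wordlist from the first `baseline_months` texts,
    then list, for each later month, the word types absent from it."""
    master = set()
    for text in monthly_texts[:baseline_months]:
        master |= word_types(text)

    results = []
    for text in monthly_texts[baseline_months:]:
        new_words = word_types(text) - master
        results.append(new_words)
        if cumulative:
            # Assumption: once reported, a word joins the master list.
            master |= new_words
    return results

# Toy stand-ins for five months of text flow (hypothetical data).
months = [
    "alpha beta",
    "beta gamma",
    "gamma delta",
    "delta epsilon",
    "epsilon zeta",
]

for offset, new_words in enumerate(new_against_master(months), start=4):
    print(f"month {offset}: {sorted(new_words)}")
```

With `cumulative=False`, every later month is compared against the
same fixed three-month lexicon, so a word can be reported as new in
more than one month.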

The third method is to take a large external corpus or wordlist,
authored before your data flow, as your established lexicon. The
similarity or difference between this earlier text and your own will
affect your results - for instance, if you run your newspaper corpus
against a masterlist drawn from novels, or from a different newspaper.
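
To show how the choice of external lexicon changes what comes out,
here is a small Python sketch. The two reference wordlists and the
sample sentence are invented for the example, and the helper names are
hypothetical.

```python
import re

def load_wordlist(lines):
    # One word per line; blank lines are ignored (an assumed file layout).
    return {line.strip().lower() for line in lines if line.strip()}

def candidates(text, lexicon):
    # Word types in the text that the reference lexicon does not contain.
    types = set(re.findall(r"[a-z]+", text.lower()))
    return types - lexicon

# Two toy reference lexicons (hypothetical): one from novels, one from
# another newspaper.
novels = load_wordlist(["the", "heart", "wept", "and"])
newspaper = load_wordlist(["the", "minister", "budget", "and"])

text = "The minister and the budget and the weblog"

print("against novels:   ", sorted(candidates(text, novels)))
print("against newspaper:", sorted(candidates(text, newspaper)))
```

The same text yields a longer candidate list against the novels
lexicon, simply because ordinary newspaper vocabulary is missing from
it; none of those extra items need be a neologism.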

All this raises many questions, as you can see - not least, whether the
words that newly appear are genuinely new or just first-time occurrences
in that arbitrary data sample (and so possibly not neologisms but
revivals, highly technical terms, etc.). But it is an approach that is
automatable.

With static corpora, such as the 1960s LOB/Brown and the 1990s
FLOB/Frown, you can compare the wordlists of the earlier and later
corpora to find differences. Many linguists study static corpora but
talk about change in them by using their own native-speaker intuition
as the implicit source of authority as to established use.
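
For the static case, a wordlist comparison might be sketched in Python
as below. The two short strings merely stand in for an earlier and a
later corpus; the function names and the `min_freq` threshold are
assumptions of the sketch.

```python
import re
from collections import Counter

def wordlist(text):
    # Frequency wordlist from a crude lowercase alphabetic tokenizer.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def compare_wordlists(earlier, later, min_freq=1):
    """Return (appeared, disappeared): words found only in the later
    wordlist and words found only in the earlier one, each with its
    frequency, keeping only words that occur at least `min_freq` times."""
    appeared = {w: c for w, c in later.items()
                if w not in earlier and c >= min_freq}
    disappeared = {w: c for w, c in earlier.items()
                   if w not in later and c >= min_freq}
    return appeared, disappeared

# Toy stand-ins for an earlier and a later corpus (hypothetical data).
earlier = wordlist("the wireless and the gramophone")
later = wordlist("the internet and the email and the internet")

appeared, disappeared = compare_wordlists(earlier, later)
print("only in later:  ", appeared)
print("only in earlier:", disappeared)
```

Raising `min_freq` filters out one-off occurrences, though it cannot
tell a neologism from a word that the earlier sample simply happened
to miss.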

We have made one month's worth of neologistic data, for April 1998,
available online; and our neologism detection software can under some
circumstances be licensed and applied to any corpus.
See http://www.rdues.liv.ac.uk/aprdemo/

Antoinette Renouf

--------------------------------------------------
Research and Development Unit for English Studies
University of Liverpool
19 Abercromby Square
Liverpool
L69 7ZG

tel sec unit: +44 (0)151 794 2289
tel:          +44 (0)151 794 2286
fax:          +44 (0)151 794 2298
email:        ajrenouf at liv.ac.uk
url:          http://www.rdues.liv.ac.uk


> Date: Thu, 12 Jun 2003 15:07:59 +0100 (BST)
> From: Eric Atwell <eric at comp.leeds.ac.uk>
> X-X-Sender: eric at cslin148.csunix.comp.leeds.ac.uk
> To: krausse <krausse at fh-nordhausen.de>
> cc: corpora at hd.uib.no
> Subject: Re: [Corpora-List] neologism finder tools

> Sylvana,
> A problem with "retrieving new words in a corpus" is: "new" with respect
> to what?  You can easily find all words in a corpus with only one (or
> two..) occurrences, which makes them "rare"; but "new" implies
> your corpus builds on a larger monitor corpus tracking the language over
> time. As I understand it, AVIATOR/APRIL is not just software for a
> static corpus but infrastructure for processing a (large) monitor corpus.
> Is this what you have?
>
> Eric Atwell
>
>
> On Thu, 12 Jun 2003, krausse wrote:
>
> > Dear colleagues,
> >
> > In Lynne Bowker's and Jennifer Pearson's book "Working with Specialized
> > Corpora"  neologism finder tools like the ones used in the AVIATOR/APRIL
> > project are mentioned.
> >
> > I wonder whether there are any free or commercial programs available or
> > how other people go about retrieving new words in a corpus.
> >
> > Many thanks in advance,
> >
> > Sylvana Krausse
> >
>
> --
> Eric Atwell, CVL: Computer Vision and Language research group
> Distributed Multimedia Systems MSc Tutor & SOCRATES/JYA Tutor
> School of Computing, University of Leeds, LEEDS LS2 9JT
> TEL: 0113-3435761  MOBILE: 0775-1039104 FAX: 0113-3435468
> WWW: http://www.comp.leeds.ac.uk/eric  EMAIL: eric at comp.leeds.ac.uk
> Visit http://www.computingLEEDS.ac.uk - our newsletter for industry
>
>


