[Corpora-List] lemma list wanted

Hunter, Duncan D.I.Hunter at warwick.ac.uk
Fri Feb 23 23:48:58 UTC 2007


Wow! This could be what I need. I'm not QUITE techie enough to immediately work out how to access the data working with this technology, but I'll play around and see...
 
Many thanks,
 
Duncan Hunter

________________________________

From: owner-corpora at lists.uib.no on behalf of D.W.Hardcastle
Sent: Fri 23/02/2007 23:23
To: CORPORA at UIB.NO
Subject: RE: [Corpora-List] lemma list wanted



Sorry - I have lost the original thread, but I recall that someone
wanted lemma and inflection tables.
I also need to lemmatise and reinflect dictionary words for my PhD
project, so I have a lemmatiser that is based on CUVPlus
(http://ota.ahds.ac.uk/texts/2469.html).



If you are interested:
I have put a zip file on my website (http://mcs.open.ac.uk/dh5368/). It
contains a list of inflection-lemma mappings, a list of lemma-inflection
mappings, and a file called singles.txt, which contains forms in the
lexicon that could not be reduced.

The data was extracted from the CUVPlus lexicon by running a lemmatising
algorithm to reduce every entry in the lexicon and checking the
resulting proposed lemmas against the lexicon.
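
As a rough illustration of that corroboration step, here is a minimal
Python sketch. It is not the algorithm used to build the files: the toy
lexicon, the suffix rules and the function names are all assumptions made
up for the example, and only the tag names are taken from the C7 tagset.

```python
# Toy lexicon (assumed for illustration): word form -> C7 tag.
LEXICON = {
    "cat": "NN1", "cats": "NN2", "scissors": "NN2",
    "walk": "VV0", "walked": "VVD",
    "quick": "JJ", "quickly": "RR", "often": "RR",
}

# Hypothetical suffix rules per reducible tag: (suffix, replacement).
# A real lemmatiser would need far more rules plus irregular forms.
RULES = {
    "NN2": [("ies", "y"), ("es", ""), ("s", "")],
    "VVD": [("ed", ""), ("ed", "e")],
    "RR": [("ly", "")],
}

def propose_lemmas(form, tag):
    """Return candidate lemmas for a form by applying the rules for its tag."""
    candidates = []
    for suffix, replacement in RULES.get(tag, []):
        if form.endswith(suffix) and len(form) > len(suffix):
            candidates.append(form[: -len(suffix)] + replacement)
    return candidates

def corroborate(lexicon):
    """Split a lexicon into corroborated form->lemma mappings and 'singles'."""
    lemmas, singles = {}, []
    for form, tag in lexicon.items():
        if tag not in RULES:
            # Tag is not reducible: treat the form as already in base form.
            lemmas[form] = form
            continue
        # Keep only proposals that are themselves entries in the lexicon.
        confirmed = [c for c in propose_lemmas(form, tag) if c in lexicon]
        if confirmed:
            lemmas[form] = confirmed[0]
        else:
            # Reducible by tag, but no proposed lemma was found.
            singles.append(form)
    return lemmas, singles

lemmas, singles = corroborate(LEXICON)
```

In this toy run "cats", "walked" and "quickly" are reduced because their
proposed lemmas are in the lexicon, while "scissors" and "often" end up in
the singles list.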

The file lemmas.txt contains inflection-lemma mappings that were
corroborated by the lexicon and inflect.txt contains the inverse
mappings. These files include words that are already in base form.
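
For anyone unsure how the two files relate, the following Python sketch
shows how an inflection-lemma table can be inverted into a lemma-inflection
table. It works on an in-memory dictionary only; the layout of lemmas.txt
and inflect.txt is not described here, so no file parsing is attempted, and
the function name and sample data are assumptions for the example.

```python
from collections import defaultdict

def invert_mapping(lemma_of):
    """Turn a form->lemma dict into a lemma->sorted list of forms dict."""
    forms_of = defaultdict(list)
    for form, lemma in lemma_of.items():
        forms_of[lemma].append(form)
    return {lemma: sorted(forms) for lemma, forms in forms_of.items()}

# Sample data (assumed); base forms map to themselves.
sample = {"cat": "cat", "cats": "cat", "walked": "walk", "walking": "walk"}
inverse = invert_mapping(sample)
```

Because base forms map to themselves, each lemma also appears in its own
list of forms when it is present in the input.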

The singles.txt file contains word forms that, judging by the tag, should
be reducible but for which no proposed lemma could be found in the
lexicon. Most are adverbs that have no adjective base form, and many are
non-count plural forms. There are also some (BNC) tagging errors,
misspellings and rare word forms. I have included the BNC frequency for
each entry from the lexicon, as most of the noise is of low frequency.

Please note that this means that words not covered by the CUVPlus
lexicon do not appear in the mappings.

All the entries in the files are tagged using the C7 tagset.

The data is work in progress, but I believe it is pretty clean.
If you decide to use the mapping tables please cite my PhD thesis - it
is at Birkbeck College, University of London and due for submission
later this year.


Thank you,

Dave


--
David Hardcastle
Research Programmer, Natural Language Generation Group
Faculty of Mathematics and Computing, room 121, North Spur
The Open University, Walton Hall, Milton Keynes, MK7 6AA
+44 (0) 1908 659947


