<HTML dir=ltr><HEAD><TITLE>RE: [Corpora-List] lemma list wanted</TITLE>
<META http-equiv=Content-Type content="text/html; charset=unicode">
<META content="MSHTML 6.00.2900.3020" name=GENERATOR></HEAD>
<BODY>
<DIV id=idOWAReplyText93775 dir=ltr>
<DIV dir=ltr><FONT face=Arial color=#000000 size=2>Wow! This could be what I need. I'm not QUITE techie enough to work out immediately how to access the data with this technology, but I'll play around and see...</FONT></DIV>
<DIV dir=ltr><FONT face=Arial size=2></FONT> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>Many thanks,</FONT></DIV>
<DIV dir=ltr><FONT face=Arial size=2></FONT> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>Duncan Hunter</FONT></DIV></DIV>
<DIV dir=ltr><BR>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> owner-corpora@lists.uib.no on behalf of D.W.Hardcastle<BR><B>Sent:</B> Fri 23/02/2007 23:23<BR><B>To:</B> CORPORA@UIB.NO<BR><B>Subject:</B> RE: [Corpora-List] lemma list wanted<BR></FONT><BR></DIV>
<DIV>
<P><FONT size=2>Sorry - I have lost the original thread, but I recall that someone<BR>wanted lemma and inflection tables.<BR>I also need to lemmatise and reinflect dictionary words for my PhD<BR>project, so I have a lemmatiser that is based on CUVPlus<BR>(<A href="http://ota.ahds.ac.uk/texts/2469.html">http://ota.ahds.ac.uk/texts/2469.html</A>).<BR><BR>If you are interested:<BR>I have put a zip file on my website (<A href="http://mcs.open.ac.uk/dh5368/">http://mcs.open.ac.uk/dh5368/</A>). It<BR>contains a list of inflection-lemma mappings, lemma-inflection mappings,<BR>and a file called singles.txt, which contains forms in the lexicon that<BR>could not be reduced.<BR><BR>The data was extracted from the CUVPlus lexicon by running a lemmatising<BR>algorithm to reduce every entry in the lexicon and checking the<BR>resulting proposed lemmas against the lexicon.<BR><BR>The file lemmas.txt contains inflection-lemma mappings that were<BR>corroborated by the lexicon, and inflect.txt contains the inverse<BR>mappings. These files include words that are already in base form.<BR><BR>The singles.txt file contains word forms that, judging by the tag, should<BR>be reducible but for which no proposed lemma could be found in the<BR>lexicon. Most are adverbs that have no adjective base form; many are<BR>non-count plural forms. There are also some (BNC) tagging errors,<BR>misspellings and rare word forms.
I have included the BNC frequency for<BR>each entry from the lexicon, as most of the noise is of low frequency.<BR><BR>Please note that, because the mappings were derived from the lexicon,<BR>words not covered by the CUVPlus lexicon do not appear in the mappings.<BR><BR>All the entries in the files are tagged using the C7 tagset.<BR><BR>The data is a work in progress, but I believe it is pretty clean.<BR>If you decide to use the mapping tables, please cite my PhD thesis - it<BR>is at Birkbeck College, University of London, and is due for submission<BR>later this year.<BR><BR><BR>Thank you,<BR><BR>Dave<BR><BR><BR>--<BR>David Hardcastle<BR>Research Programmer, Natural Language Generation Group<BR>Faculty of Mathematics and Computing, room 121, North Spur<BR>The Open University, Walton Hall, Milton Keynes, MK7 6AA<BR>+44 (0) 1908 659947<BR><BR></FONT></P></DIV></BODY></HTML>