<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML><HEAD>

<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">


</HEAD>

<BODY bgColor=#ffffff>

<DIV><FONT face=Arial size=2>Are there any lemmatized corpora of German that 

can be queried on-line or on Windows?  I'm trying 

to lemmatize some German text myself for lexical purposes, and I would like 

to see how others have handled the problems, and how well it works.</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>Of the German corpora I have found, Negra is 

POS-tagged but not lemmatized, while Tiger is both POS-tagged and 

lemmatized.  Negra does not mention any query facility; Tiger had one which is no longer supported and 

unfortunately doesn't work for me.  A problem for me with both these 

corpora is that the tagset they use (STTS) seems to be designed 

with syntax in mind.  Here are some examples where this may not suit my 

lexical purposes.</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>1. The various finite forms of a verb (e.g. 

aufsteigen) are lemmatized to the infinitive and tagged VVFIN, whereas the 

abstract noun (das Aufsteigen) is tagged NN.  I think I would like to be 

able to retrieve them all together, e.g. in response to 

"aufsteigen".</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>2. Present participles and past participles are 

tagged as adjectives (ADJA or ADJD). I think I would like to retrieve these too 

from the verbal infinitive.</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>3. Substantivised adjectives are tagged as nouns 

(e.g. etwas Ähnliches).  I think I would like these retrieved along with the 

forms of the adjective (ähnlich).</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>4. Separable verbs are tagged as two words 

when separated and as one word when not separated.  I think I would like to 

retrieve separated and nonseparated examples together, though I have not decided 

whether this is best done by tagging them all as one word or as 

two.</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>5. Compound forms are not decompounded.  I 

think I would like to decompound (most of) them.</FONT></DIV>
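
<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>For point 4, here is a minimal Python sketch of one way the rejoining might be done as post-processing.  It assumes each sentence is already a list of (word, tag, lemma) triples, that a separated prefix carries the STTS tag PTKVZ, and that the prefix belongs to the nearest preceding finite or imperative verb.  The function name is hypothetical, and the heuristic will probably misfire on some sentences with several clauses.</FONT></DIV>

```python
# Tags of verb forms that can leave a separated prefix behind (assumption:
# only finite and imperative full verbs are considered).
FINITE_TAGS = {"VVFIN", "VVIMP"}


def rejoin_separable_verbs(sentence):
    """Return one lexical key per token.

    A separated prefix (PTKVZ) is joined to the nearest preceding finite
    verb, and both tokens receive the joined lemma as their key.
    """
    keys = [lemma for _word, _tag, lemma in sentence]
    last_finite = None  # index of the most recent unattached finite verb
    for i, (_word, tag, lemma) in enumerate(sentence):
        if tag in FINITE_TAGS:
            last_finite = i
        elif tag == "PTKVZ" and last_finite is not None:
            joined = lemma + keys[last_finite]
            keys[last_finite] = joined
            keys[i] = joined
            last_finite = None  # a verb takes at most one prefix
    return keys


# Hand-written example sentence, not real tagger output.
sentence = [
    ("Er", "PPER", "er"),
    ("steigt", "VVFIN", "steigen"),
    ("schnell", "ADJD", "schnell"),
    ("auf", "PTKVZ", "auf"),
    (".", "$.", "."),
]
print(rejoin_separable_verbs(sentence))
# prints ['er', 'aufsteigen', 'schnell', 'aufsteigen', '.']
```

<DIV><FONT face=Arial size=2>Both the verb and its prefix receive the key "aufsteigen" here, so a search on that key finds the separated example; frequency counts would need to treat the pair as a single occurrence.</FONT></DIV>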

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>Although my interest is in lemmas, it is sometimes 

useful for me to have POS-tags also, e.g. to distinguish arm-ADJ from 

Arm-NN.</FONT></DIV>

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>I have run my text through TreeTagger, using the 

training data for STTS, and expect to have to make the 

above changes manually.  Before committing myself further, I'd like to 

try out anything which already exists, or to receive any advice.</FONT></DIV>
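
<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>To make point 1 concrete, here is a rough Python sketch of the kind of post-processing I have in mind.  It assumes TreeTagger's usual output of one token per line as word, tag and lemma separated by tabs.  The function names, the coarse POS labels and the sample lines are my own illustration, not real tagger output.  A noun is grouped with a verb only when its lower-cased lemma also occurs as a verb lemma in the same data, so Arm-NN stays apart from arm-ADJ.</FONT></DIV>

```python
from collections import defaultdict


def parse_treetagger(text):
    """Split tab-separated word/tag/lemma lines into triples."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:  # skip blank or malformed lines
            rows.append(tuple(parts))
    return rows


def coarse_pos(tag):
    """Collapse STTS tags into a few coarse classes (illustrative only)."""
    if tag.startswith("VV"):
        return "V"
    if tag.startswith("ADJ"):
        return "ADJ"
    return tag


def build_index(rows):
    """Map (lemma, coarse POS) keys to the word forms filed under them."""
    verb_lemmas = {lemma for _w, tag, lemma in rows if tag.startswith("VV")}
    index = defaultdict(list)
    for word, tag, lemma in rows:
        if tag == "NN" and lemma.lower() in verb_lemmas:
            # Nominalised infinitive: file it under the verb.
            key = (lemma.lower(), "V")
        else:
            key = (lemma, coarse_pos(tag))
        index[key].append(word)
    return dict(index)


# Hand-written sample in the assumed format.
SAMPLE = "\n".join([
    "aufsteigen\tVVINF\taufsteigen",
    "Aufsteigen\tNN\tAufsteigen",
    "arm\tADJD\tarm",
    "Arm\tNN\tArm",
])

index = build_index(parse_treetagger(SAMPLE))
print(index[("aufsteigen", "V")])  # prints ['aufsteigen', 'Aufsteigen']
```

<DIV><FONT face=Arial size=2>This only catches a nominalised infinitive when the verb itself is attested elsewhere in the text, which is one reason I expect some manual correction to remain.</FONT></DIV>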

<DIV><FONT face=Arial size=2></FONT> </DIV>

<DIV><FONT face=Arial size=2>Many thanks,<BR>Ciarán Ó 

Duibhín.</FONT></DIV></BODY></HTML>