<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.6001.18639" name=GENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=#ffffff>
<DIV><FONT face=Arial size=2>Are there any lemmatized corpora of German that
can be queried online or on Windows? I'm trying
to lemmatize some German text myself for lexical purposes, and I would like
to see how others have handled the problems, and how well it works.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>Of the German corpora I have found, Negra is
POS-tagged but not lemmatized, while Tiger is both POS-tagged and
lemmatized. Negra does not mention any query facility; Tiger had one which is
no longer supported and
unfortunately doesn't work for me. A problem for me with both these
corpora is that the tagset they use (STTS) seems to be designed
with syntax in mind. Here are some examples where this may not suit my
lexical purposes.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>1. The various finite forms of a verb (e.g.
aufsteigen) are lemmatized to the infinitive and tagged VVFIN, whereas the
abstract noun (das Aufsteigen) is tagged NN. I think I would like to be
able to retrieve them all together, e.g. in response to
"aufsteigen".</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>2. Present participles and past participles are
tagged as adjectives (ADJA or ADJD). I think I would like to retrieve these too
from the verbal infinitive.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>3. Substantivised adjectives are tagged as nouns
(e.g. etwas Ähnliches). I think I would like these retrieved along with the
forms of the adjective (ähnlich).</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>4. Separable verbs are tagged as two words
when separated and as one word when not separated. I think I would like to
retrieve separated and nonseparated examples together, though I have not decided
whether this is best done by tagging them all as one word or as
two.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>5. Compound forms are not decompounded. I
think I would like to decompound (most of) them.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>Although my interest is in lemmas, it is sometimes
useful for me to have POS tags as well, e.g. to distinguish arm-ADJ from
Arm-NN.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>I have run my text through TreeTagger, using the
training data for STTS, and expect to have to make the
above changes manually. Before committing myself further, I'd like to
try out anything which already exists, or to receive any advice.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
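<DIV><FONT face=Arial size=2>To make the kind of retrieval I have in mind more
concrete, here is a minimal sketch in Python covering only point 1 and the
arm/Arm case. It assumes the tagger output has been saved as three tab-separated
columns (token, POS tag, lemma); the function names parse_tagged and build_index
and the sample lines are made up for illustration and are not real tagger
output. It groups tokens under a lower-cased lemma and keeps the POS tag as a
second key, so the noun and verb readings come back together but can still be
told apart. Points 2 to 5 would need more than this.</FONT></DIV>
<PRE>
```python
from collections import defaultdict


def parse_tagged(text):
    """Yield (token, pos, lemma) from tab-separated three-column lines."""
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:  # skip blank or malformed lines
            yield tuple(parts)


def build_index(rows):
    """Map lower-cased lemma to a dict of POS tag to list of tokens."""
    index = defaultdict(lambda: defaultdict(list))
    for token, pos, lemma in rows:
        # lower() rather than casefold(), so that "ß" is not turned into "ss"
        index[lemma.lower()][pos].append(token)
    return index


# Illustrative sample only (hypothetical, hand-written lines).
sample = "\n".join([
    "aufsteigt\tVVFIN\taufsteigen",
    "Aufsteigen\tNN\tAufsteigen",
    "arm\tADJD\tarm",
    "Arm\tNN\tArm",
])

index = build_index(parse_tagged(sample))

# Looking up "aufsteigen" now returns both the finite verb and the noun,
# while "arm" keeps the adjective and the noun under separate POS keys.
for pos, tokens in sorted(index["aufsteigen"].items()):
    print(pos, tokens)
```
</PRE>
<DIV><FONT face=Arial size=2></FONT> </DIV>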
<DIV><FONT face=Arial size=2>Many thanks,<BR>Ciarán Ó
Duibhín.</FONT></DIV></BODY></HTML>