<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
<title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
Dear Ciarán,<br>
<br>
As Heike already said, SMOR might be of interest to you.<br>
It should be able to handle most of the problems you mentioned.
Here are some examples:<br>
<br>
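&gt; Aufsteigen<br>
auf&lt;VPART&gt;steigen&lt;V&gt;&lt;SUFF&gt;&lt;+NN&gt;&lt;Neut&gt;&lt;Nom&gt;&lt;Sg&gt;<br>
auf&lt;VPART&gt;steigen&lt;V&gt;&lt;SUFF&gt;&lt;+NN&gt;&lt;Neut&gt;&lt;Dat&gt;&lt;Sg&gt;<br>
auf&lt;VPART&gt;steigen&lt;V&gt;&lt;SUFF&gt;&lt;+NN&gt;&lt;Neut&gt;&lt;Acc&gt;&lt;Sg&gt;<br>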
// nominalisation of a particle verb<br>
<br>
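&gt; verkleinertes<br>
verkleinern&lt;V&gt;&lt;PPast&gt;&lt;SUFF&gt;&lt;+ADJ&gt;&lt;Pos&gt;&lt;Neut&gt;&lt;Nom&gt;&lt;Sg&gt;&lt;St&gt;<br>
verkleinern&lt;V&gt;&lt;PPast&gt;&lt;SUFF&gt;&lt;+ADJ&gt;&lt;Pos&gt;&lt;Neut&gt;&lt;Acc&gt;&lt;Sg&gt;&lt;St&gt;<br>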
// adjectivisation of a past participle<br>
<br>
&gt; Ähnliches<br>
ähnlich&lt;ADJ&gt;&lt;SUFF&gt;&lt;+NN&gt;&lt;Neut&gt;&lt;Nom&gt;&lt;Sg&gt;&lt;St&gt;<br>
ähnlich&lt;ADJ&gt;&lt;SUFF&gt;&lt;+NN&gt;&lt;Neut&gt;&lt;Acc&gt;&lt;Sg&gt;&lt;St&gt;<br>
// nominalisation of an adjective<br>
<br>
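&gt; Morphologiesysteme<br>
Morphologie&lt;NN&gt;System&lt;+NN&gt;&lt;Neut&gt;&lt;Dat&gt;&lt;Sg&gt;&lt;Old&gt;<br>
Morphologie&lt;NN&gt;System&lt;+NN&gt;&lt;Neut&gt;&lt;Nom&gt;&lt;Pl&gt;<br>
Morphologie&lt;NN&gt;System&lt;+NN&gt;&lt;Neut&gt;&lt;Gen&gt;&lt;Pl&gt;<br>
Morphologie&lt;NN&gt;System&lt;+NN&gt;&lt;Neut&gt;&lt;Acc&gt;&lt;Pl&gt;<br>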
// compound<br>
<br>
You could even approach the separable verb prefix problem by
attaching the separated prefix to the finite verb and analysing the
result. Take the sentence "Er schlägt das Buch auf". You extract
"schlägt" and "auf" and analyse the recombined word form:<br>
&gt; aufschlägt<br>
auf&lt;VPART&gt;schlagen&lt;+V&gt;&lt;3&gt;&lt;Sg&gt;&lt;Pres&gt;&lt;Ind&gt;<br>
<br>
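The recombination step could be sketched roughly as follows in Python.
This is only an illustration, not part of SMOR: it assumes the clause
has already been tagged with STTS (VVFIN for the finite verb, PTKVZ for
the separated particle), and the function name is hypothetical. The
resulting string would still have to be passed to the analyser.<br>

```python
from typing import List, Optional, Tuple

# (token, STTS tag) pairs, assumed to come from a tagger run beforehand.
Tagged = Tuple[str, str]


def recombine_particle_verb(clause: List[Tagged]) -> Optional[str]:
    """Return particle + finite verb as one word form, or None.

    Hypothetical helper. Assumes at most one finite full verb (VVFIN)
    per clause and a separated particle (PTKVZ) that follows it; both
    are simplifications of real German clause structure.
    """
    verb = None
    for token, tag in clause:
        if tag == "VVFIN" and verb is None:
            verb = token
        elif tag == "PTKVZ" and verb is not None:
            # Attach the particle in front of the finite verb.
            return token.lower() + verb.lower()
    return None


clause = [("Er", "PPER"), ("schlägt", "VVFIN"), ("das", "ART"),
          ("Buch", "NN"), ("auf", "PTKVZ")]
print(recombine_particle_verb(clause))  # aufschlägt
```

A clause without a separated particle yields None, so the finite verb
can then be analysed on its own.<br>
<br>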
SMOR is not freely available yet, but you can obtain a free research
license.<br>
<br>
Best regards,<br>
Helmut Schmid<br>
<br>
<br>
On 16.01.2012 22:07, Ciarán Ó Duibhín wrote:
<blockquote
cite="mid:7AF24178F3D847EEAE6BD0656C5099F9@InneallChiarin"
type="cite">
<div><font face="Arial" size="2">Are there any lemmatized corpora
        of German that can be queried online or used on Windows?
I'm trying to lemmatize some German text myself for lexical
purposes, and I would like to see how others have handled the
problems, and how well it works.</font></div>
<div> </div>
<div><font face="Arial" size="2">Of the German corpora I have
found, Negra is POS-tagged but not lemmatized, while Tiger is
        both POS-tagged and lemmatized. Negra does not mention
        any query facility; Tiger
had one which is no longer supported and unfortunately doesn't
work for me. A problem for me with both these corpora is
that the tagset they use (STTS) seems to be designed with
syntax in mind. Here are some examples where this may not
suit my lexical purposes.</font></div>
<div> </div>
<div><font face="Arial" size="2">1. The various finite forms of a
verb (eg. aufsteigen) are lemmatized to the infinitive and
tagged VVFIN, whereas the abstract noun (das Aufsteigen) is
tagged NN. I think I would like to be able to retrieve them
all together, eg. in response to "aufsteigen".</font></div>
<div> </div>
<div><font face="Arial" size="2">2. Present participles and past
participles are tagged as adjectives (ADJA or ADJD). I think I
would like to retrieve these too from the verbal infinitive.</font></div>
<div> </div>
<div><font face="Arial" size="2">3. Substantivised adjectives are
tagged as nouns (eg etwas Ähnliches). I think I would like
these retrieved along with the forms of the adjective
(ähnlich).</font></div>
<div> </div>
<div><font face="Arial" size="2">4. Separable verbs are tagged as
two words when separated and as one word when not separated.
I think I would like to retrieve separated and nonseparated
examples together, though I have not decided whether this is
best done by tagging them all as one word or as two.</font></div>
<div> </div>
<div><font face="Arial" size="2">5. Compound forms are not
decompounded. I think I would like to decompound (most of)
them.</font></div>
<div> </div>
<div><font face="Arial" size="2">Although my interest is in
lemmas, it is sometimes useful for me to have POS-tags also,
eg. to distinguish arm-ADJ from Arm-NN.</font></div>
<div> </div>
<div><font face="Arial" size="2">I have run my text through
TreeTagger, using the training data for STTS, and expect to
have to make the above changes manually. Before committing
myself further, I'd like to try out anything which already
exists, or to receive any advice.</font></div>
<div> </div>
<div><font face="Arial" size="2">Many thanks,<br>
Ciarán Ó Duibhín.</font></div>
</blockquote>
<br>
</body>
</html>