[Corpora-List] Lemmatizing German text for lexical purposes

Eckhard Bick eckhard.bick at mail.dk
Tue Jan 17 10:31:38 UTC 2012


Hello Ciarán,

There are a number of lemmatized and morphosyntactically annotated 
German corpora in the CorpusEye collection, with an online search 
interface. The German corpora, about 85 million words in total, can be 
searched at http://corp.hum.sdu.dk/cqp.de.html . The corresponding live 
analysis of your own text is at 
http://beta.visl.sdu.dk/visl/de/parsing/automatic/parse.php .

Best regards,
Eckhard


On 2012-01-16 22:07, Ciarán Ó Duibhín wrote:
> Are there any lemmatized corpora of German which can be queried 
> on-line or on Windows?  I'm trying to lemmatize some German text 
> myself for lexical purposes, and I would like to see how others have 
> handled the problems, and how well it works.
> Of the German corpora I have found, Negra is POS-tagged but not 
> lemmatized, while Tiger is both POS-tagged and lemmatized.  Negra does 
> not mention any query facility; Tiger had one which is no longer 
> supported and unfortunately doesn't work for me.  A problem for me 
> with both these corpora is that the tagset they use (STTS) seems to 
> be designed with syntax in mind.  Here are some examples where this 
> may not suit my lexical purposes.
> 1. The various finite forms of a verb (eg. aufsteigen) are lemmatized 
> to the infinitive and tagged VVFIN, whereas the abstract noun (das 
> Aufsteigen) is tagged NN.  I think I would like to be able to retrieve 
> them all together, eg. in response to "aufsteigen".
> 2. Present participles and past participles are tagged as adjectives 
> (ADJA or ADJD). I think I would like to retrieve these too from the 
> verbal infinitive.
> 3. Substantivised adjectives are tagged as nouns (eg. etwas 
> Ähnliches).  I think I would like these retrieved along with the forms 
> of the adjective (ähnlich).
> 4. Separable verbs are tagged as two words when separated and as one 
> word when not separated.  I think I would like to retrieve separated 
> and nonseparated examples together, though I have not decided whether 
> this is best done by tagging them all as one word or as two.
> 5. Compound forms are not decompounded.  I think I would like to 
> decompound (most of) them.
> Although my interest is in lemmas, it is sometimes useful for me to 
> have POS-tags also, eg. to distinguish arm-ADJ from Arm-NN.
> I have run my text through TreeTagger, using the training data for 
> STTS, and expect to have to make the above changes manually.  Before 
> committing myself further, I'd like to try out anything which already 
> exists, or to receive any advice.
> Many thanks,
> Ciarán Ó Duibhín.
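
The regrouping asked for in points 1 to 4 above could be sketched as a 
post-processing step over (word, STTS tag, lemma) triples such as a tagger 
produces. This is only a minimal illustration under stated assumptions: the 
lookup tables, the function names lexical_key and rejoin_separable, and the 
lemma strings are all hypothetical, not actual TreeTagger output, and the 
rule that attaches a separated particle (STTS tag PTKVZ) to the nearest 
preceding finite verb is a crude heuristic. Decompounding (point 5) is not 
covered.

```python
# Hypothetical hand-made tables, for illustration only. A real system would
# derive these from a morphological lexicon.
INFINITIVES = {"aufsteigen"}
PARTICIPLES = {"aufsteigend": "aufsteigen", "aufgestiegen": "aufsteigen"}
NOMINALISED_ADJECTIVES = {"Ähnliches": "ähnlich"}


def lexical_key(lemma, tag):
    """Map a (lemma, STTS tag) pair to the key used for lexical retrieval."""
    if tag == "NN":
        # Point 1: nominalised infinitive (das Aufsteigen) goes with the verb.
        if lemma.lower() in INFINITIVES:
            return lemma.lower()
        # Point 3: substantivised adjective goes with the adjective.
        if lemma in NOMINALISED_ADJECTIVES:
            return NOMINALISED_ADJECTIVES[lemma]
    # Point 2: participles tagged as adjectives go with the infinitive.
    if tag in ("ADJA", "ADJD") and lemma in PARTICIPLES:
        return PARTICIPLES[lemma]
    return lemma


def rejoin_separable(tokens):
    """Point 4: tokens is a list of (word, tag, lemma) for one clause.

    Returns (word, tag, key) triples in which a finite verb followed by a
    separated particle carries the key of the full separable verb.
    """
    result = [(word, tag, lexical_key(lemma, tag)) for word, tag, lemma in tokens]
    for i, (particle, tag, _) in enumerate(result):
        if tag != "PTKVZ":
            continue
        # Heuristic: attach to the nearest preceding finite full verb.
        for j in range(i - 1, -1, -1):
            word, verb_tag, key = result[j]
            if verb_tag == "VVFIN":
                result[j] = (word, verb_tag, particle.lower() + key)
                break
    return result


clause = [
    ("Er", "PPER", "er"),
    ("steigt", "VVFIN", "steigen"),
    ("schnell", "ADJD", "schnell"),
    ("auf", "PTKVZ", "auf"),
]
keyed = rejoin_separable(clause)
```

The original tag is kept next to the key, so arm (ADJD) and Arm (NN) remain 
distinguishable even though other forms are grouped under one key. The 
particle token itself keeps its own key here; whether to store a separable 
verb as one word or two is left open, as in point 4.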
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora


-- 
Eckhard Bick,
cand.med., dr.phil.
University of Southern Denmark
e-mail: eckhard.bick at mail.dk
web: http://beta.visl.sdu.dk

