[Corpora-List] German morphological analysis

Jan Strunk strunk at linguistics.ruhr-uni-bochum.de
Mon Nov 19 16:04:52 UTC 2007


Hello,

a while ago I asked for tools to determine
the number feature of nouns in a German corpus.

 > We are searching for a robust and fast morphological
 > analyzer for German for a project on the use of bare nouns without
 > a determiner that we are currently doing at the University of Bochum.
 >
 > We would like to use this system to tell whether a given
 > noun in a given context is plural or singular.
 >
 > The tool should ideally have the following characteristics:
 > - able to cope with a lot of text in a reasonable time
 >    (we want to analyze whole year-volumes of newspaper text)
 > - able to cope with a lot of unknown words (often compounds)
 > - deterministic in the sense that it indicates the most likely
 >    number and does not merely tell us that a noun could be either
 >    singular or plural.

I would like to thank all the people who responded for
their advice and their willingness to share their tools with us.
We have not yet tried out and compared the suggestions,
but I thought I should post a short preliminary summary.

Jason Eisner suggested using a part-of-speech tagger,
e.g. the TnT-Tagger.
Helmut Schmid also proposed using a tagger trained on
a corpus annotated with number information, e.g. the Tiger corpus.
He further suggested the Stuttgart IMSLex lexicon as a
morphological resource.
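
To make the "most likely number" requirement a bit more concrete, here is
a minimal Python sketch of the simplest possible baseline: count how often
each noun form occurs as singular or plural in annotated data and pick the
majority value. This is only an illustration under stated assumptions. The
toy training pairs, the labels "Sg"/"Pl", the function names and the
singular fallback for unknown forms are all hypothetical and are not taken
from Tiger, TnT or IMSLex. A real tagger would also use the context of the
noun, which this sketch ignores.

```python
from collections import Counter, defaultdict

# Toy training data: (noun form, number) pairs. These are hypothetical
# examples standing in for a corpus annotated with number information.
TRAIN = [
    ("Lehrer", "Sg"), ("Lehrer", "Pl"), ("Lehrer", "Pl"),
    ("Haus", "Sg"), ("Häuser", "Pl"),
    ("Zeitung", "Sg"), ("Zeitungen", "Pl"),
]


def train_number_model(pairs):
    """Count, for every noun form, how often each number value was seen."""
    counts = defaultdict(Counter)
    for form, number in pairs:
        counts[form][number] += 1
    return counts


def most_likely_number(model, form, default="Sg"):
    """Return the most frequent number value for a form.

    Unknown forms fall back to `default` (an assumption of this sketch;
    unknown compounds would need real morphological analysis).
    """
    if form in model:
        return model[form].most_common(1)[0][0]
    return default


model = train_number_model(TRAIN)
print(most_likely_number(model, "Lehrer"))
```

Such a baseline always returns exactly one value, so it is deterministic in
the sense asked for above, but it cannot tell apart the singular and plural
uses of an ambiguous form like "Lehrer" in a particular sentence.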

Philipp Koehn suggested the LoPar-Parser:
http://www.ims.uni-stuttgart.de/tcl/SOFTWARE/LoPar.html
However, it is not clear what grammar to use that includes
number information. (Thanks to Sabine Schulte im Walde
for information about available grammars!)

Garance Paris mentioned that she has used the Snowball stemmer
for similar purposes.

Franz Guenthner wrote that the CISLex system developed at the
Centrum für Informations- und Sprachverarbeitung in Munich
could possibly be used for this purpose.

I'll write a second summary once we have tested some possibilities.

Best regards,

Jan Strunk
strunk at linguistics.rub.de

Sprachwissenschaftliches Institut
Ruhr-Universität Bochum
Germany




_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


