[Corpora-List] POS Tagger for German / Java

Yannick Versley versley at sfs.uni-tuebingen.de
Wed Jan 10 08:58:11 UTC 2007


Hi,

> I am currently working on a system for toponym recognition in natural
> German (web-based) text documents, as my master's thesis.
> The system uses a POS tagger for extracting good NE candidates for a
> gazetteer.

Based on my experience (also with a system for toponym resolution, but not in
Java), I think the easiest approach is to use tnt (or any other existing
POS tagger) as an external program: write the input to a file, run tnt over
it, and read back tnt's output.
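
A minimal Java sketch of that file round trip, in case it helps. The class
and method names are made up for illustration. It assumes the tagger takes
the input file as its last argument, reads one token per line, and prints
"token <whitespace> tag" lines to stdout, with "%%" marking comment lines.
Please check the command line, file format and encoding against the manual
of the tagger you actually use.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class ExternalTagger {

    /** One token together with the tag the external tagger assigned to it. */
    static final class TaggedToken {
        final String token;
        final String tag;

        TaggedToken(String token, String tag) {
            this.token = token;
            this.tag = tag;
        }
    }

    /** Writes one token per line (assumed input layout; UTF-8 is an assumption too). */
    static void writeTokens(Path file, List<String> tokens) throws IOException {
        Files.write(file, tokens, StandardCharsets.UTF_8);
    }

    /** Parses "token <whitespace> tag" lines, skipping blank lines and "%%" comments. */
    static List<TaggedToken> parseTagged(List<String> lines) {
        List<TaggedToken> result = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("%%")) {
                continue;
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length >= 2) {
                result.add(new TaggedToken(cols[0], cols[1]));
            }
        }
        return result;
    }

    /** Runs the given command with the input file appended and parses its stdout. */
    static List<TaggedToken> tag(List<String> command, List<String> tokens)
            throws IOException, InterruptedException {
        Path in = Files.createTempFile("tagger-in", ".txt");
        Path out = Files.createTempFile("tagger-out", ".txt");
        try {
            writeTokens(in, tokens);
            List<String> full = new ArrayList<>(command);
            full.add(in.toString());
            Process process = new ProcessBuilder(full)
                    .redirectOutput(out.toFile())                    // tagger stdout goes to a file
                    .redirectError(ProcessBuilder.Redirect.INHERIT)  // tagger messages stay visible
                    .start();
            if (process.waitFor() != 0) {
                throw new IOException("tagger exited with a non-zero status");
            }
            return parseTagged(Files.readAllLines(out, StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(in);
            Files.deleteIfExists(out);
        }
    }
}
```

A call might look like ExternalTagger.tag(Arrays.asList("tnt", "negra"), tokens),
where "tnt negra" stands in for whatever command and model name your
installation needs.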
If you want to train your own tagger, either with qtag or with another toolkit
(e.g. the Stanford POS tagger, which is available at
http://nlp.stanford.edu/software/tagger.shtml ),
you will want to make sure that you
1. use a large corpus, e.g. Negra or TiGer (the qtag page says that it uses
25k tokens of training data; Negra has about 400k tokens and TiGer probably
has around 1M).
2. use a large lexicon. This is especially important for the NE/NN
distinction, which is hard to make from surface forms alone, since German
capitalizes common nouns as well as names.
If you can, take a large full-form lexicon (you could try the lexicon
data from the WCDG parser, freely available at
http://nats-www.informatik.uni-hamburg.de/view/CDG/DownloadPage ,
or any other that you can get your hands on).
You should also try to get most of the information in your gazetteer
into the tagger lexicon, but you need to be careful with ambiguous names
(e.g. Sonntag/NN and Sonntag/NE, Sommer/NN,NE or Bush/NN,NE in English).
A large lexicon also helps if you use a pre-trained tagger like tnt,
to which you can add more lexical entries.
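
One way to handle the ambiguous names when merging a gazetteer into a tagger
lexicon is sketched below. This is only an illustration: the in-memory map
from word form to tag set and the names LexiconMerger and addGazetteer are
invented here, and a real tagger lexicon (tnt's, for instance) has its own
file format, which this sketch does not try to reproduce. The idea is to add
NE to an existing entry instead of overwriting it, and to collect the names
that already carried another tag so they can be checked by hand.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class LexiconMerger {

    /**
     * Adds the tag NE to every gazetteer name in the lexicon.
     * Existing tags are kept, so an entry like Sonntag/NN becomes Sonntag/NN,NE.
     * Returns the names that already had a tag other than NE (candidates for review).
     */
    static List<String> addGazetteer(Map<String, Set<String>> lexicon, List<String> names) {
        Set<String> ambiguous = new LinkedHashSet<>();
        for (String name : names) {
            Set<String> tags = lexicon.computeIfAbsent(name, k -> new LinkedHashSet<>());
            boolean hadOtherTag = false;
            for (String tag : tags) {
                if (!tag.equals("NE")) {
                    hadOtherTag = true;
                }
            }
            tags.add("NE");
            if (hadOtherTag) {
                ambiguous.add(name);
            }
        }
        return new ArrayList<>(ambiguous);
    }
}
```

What to do with the returned names (keep both tags, drop the NE reading, or
decide per name) depends on the data.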

Cheers,
Yannick Versley
> Now, here is my question:
> 1. Do you know of any good POS tagger for German, ideally Java-based?
> (I need only the NE-tagged tokens.)
> 2. I used tnt, but it is based on Perl/C and is not easy to
> integrate into my Java framework.
> 3. I also used qtag, but it comes only with a database (lexicon and
> matrix) that is too small for my task.
>
> So, is there any POS tagger out there that is easy to use and up for the
> task?
>
> Cheers & thx for listening in, yours
> Mike Sonntag
-- 
Yannick Versley
Seminar für Sprachwissenschaft, Abt. Computerlinguistik
Wilhelmstr. 19, 72074 Tübingen
Tel.: (07071) 29 77352


