[Corpora-List] Available: German Named Entity Recognition resources

Sebastian Padó pado at ims.uni-stuttgart.de
Thu Jun 24 09:29:16 UTC 2010


Dear all,

We are glad to announce two new resources for German Named Entity
Recognition that are freely available for research purposes.

The first resource is a German classifier for the CRF-based Stanford
NER system, trained on the German CoNLL 2003 dataset. It
distinguishes four classes of NEs: person, location, organization,
and other. Its features include lexical clusters obtained from a
large corpus (175M tokens) of unlabelled German text; these cluster
features improve recall by up to 10%.
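
For readers who have not used Stanford NER before, here is a minimal
sketch of how a downloaded classifier might be applied to a plain-text
file from Python. It relies only on the standard Stanford NER command
line (the CRFClassifier class with -loadClassifier and -textFile). The
jar location and the model and input file names are hypothetical
placeholders, not the names of the files we distribute.

```python
import subprocess
from typing import List

# Main class of the Stanford NER distribution.
CRF_CLASS = "edu.stanford.nlp.ie.crf.CRFClassifier"


def build_ner_command(model_path: str, text_path: str,
                      jar_path: str = "stanford-ner.jar") -> List[str]:
    """Build the command line that tags `text_path` with the given model.

    All three paths are assumptions: point them at wherever the Stanford
    NER jar, the German classifier, and your input text actually live.
    """
    return [
        "java", "-cp", jar_path, CRF_CLASS,
        "-loadClassifier", model_path,  # serialized CRF model
        "-textFile", text_path,         # plain text to be tagged
    ]


def run_ner(model_path: str, text_path: str,
            jar_path: str = "stanford-ner.jar") -> str:
    """Run the tagger and return whatever it writes to standard output."""
    cmd = build_ner_command(model_path, text_path, jar_path)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(run_ner("german-classifier.ser.gz", "input.txt"))
```

Building the command separately from running it makes it easy to print
and inspect the exact call before Java is started.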

The second resource consists of two EUROPARL transcripts annotated
with Named Entities using the same scheme. The total size is about
110,000 tokens.

According to our evaluation, the classifier is currently among the
best NER systems for German.

Condition     | Test set         | Prec | Rec  | F1
------------------------------------------------------
In-domain     | CoNLL 2003 testb | 86.6 | 71.2 | 78.2
Out-of-domain | EUROPARL         | 78.0 | 56.7 | 65.6
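
As a quick sanity check on the table, the F1 column can be recomputed
from precision and recall. The sketch below assumes the standard
balanced F-measure, F1 = 2PR / (P + R); the function name is ours and
purely illustrative. Since it starts from the rounded precision and
recall printed above, the last digit can differ slightly from the
reported scores.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision, recall, and reported F1 as given in the table above.
rows = {
    "In-domain (CoNLL 2003 testb)": (86.6, 71.2, 78.2),
    "Out-of-domain (EUROPARL)": (78.0, 56.7, 65.6),
}

for name, (prec, rec, reported) in rows.items():
    recomputed = f1_score(prec, rec)
    print(f"{name}: recomputed {recomputed:.2f}, reported {reported}")
```

Both recomputed values fall within 0.1 of the reported F1, which is
consistent with rounding of the underlying precision and recall.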


For more information and downloads, please visit
http://nlpado.de/~sebastian/ner_german.html

Sincerely,

Manaal Faruqui & Sebastian Padó
IMS, University of Stuttgart

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora