[Corpora-List] Towards an open-source French tagger

Agata Savary agata.savary at univ-tours.fr
Wed Dec 5 14:47:02 UTC 2012


CONCRAFT (http://hackage.haskell.org/package/concraft) is an open-source tagger for Polish based on a novel idea, a Constrained Conditional Random 
Fields model (see [1] for details). It makes the complexity of CRFs manageable by restricting the set of candidate labels for each token to those 
proposed by a morphological analyzer. It outperforms existing taggers for Polish, notably with respect to unknown words.
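
To illustrate the constraint idea only, here is a minimal Python sketch. It is not CONCRAFT's actual implementation (see [1] for the real model): it swaps in plain Viterbi decoding over a linear chain, and the names constrained_viterbi, toy_emit, toy_trans and all scores are hypothetical. The point it shows is that each position considers only the labels proposed by the analyzer, rather than the full tagset.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def constrained_viterbi(
    tokens: Sequence[str],
    candidates: Sequence[Sequence[str]],
    emit: Callable[[str, str], float],
    trans: Callable[[str, str], float],
) -> List[str]:
    """Best label sequence, considering only candidates[i] at position i.

    candidates[i] stands for the analyzer output for tokens[i] (assumption).
    emit and trans are arbitrary additive scores, e.g. log-potentials.
    """
    if len(tokens) != len(candidates):
        raise ValueError("one candidate set per token is required")
    if any(len(c) == 0 for c in candidates):
        raise ValueError("every token needs at least one candidate label")
    if not tokens:
        return []

    # best[i][label] = (score of best path ending in label, previous label)
    best: List[Dict[str, Tuple[float, Optional[str]]]] = [
        {lab: (emit(tokens[0], lab), None) for lab in candidates[0]}
    ]
    for i in range(1, len(tokens)):
        column: Dict[str, Tuple[float, Optional[str]]] = {}
        for lab in candidates[i]:
            # Only labels allowed at position i-1 are examined here.
            prev, score = max(
                ((p, s + trans(p, lab)) for p, (s, _) in best[i - 1].items()),
                key=lambda pair: pair[1],
            )
            column[lab] = (score + emit(tokens[i], lab), prev)
        best.append(column)

    # Backtrack from the best final label.
    label = max(best[-1], key=lambda lab: best[-1][lab][0])
    path = [label]
    for i in range(len(tokens) - 1, 0, -1):
        label = best[i][label][1]
        path.append(label)
    path.reverse()
    return path


# Toy scores, purely illustrative.
def toy_emit(token: str, label: str) -> float:
    return 0.0


def toy_trans(prev: str, label: str) -> float:
    return {("DET", "NOUN"): 1.0, ("PRON", "VERB"): 0.5}.get((prev, label), 0.0)


tokens = ["la", "porte"]
# Hypothetical analyzer output: each token is ambiguous between two tags.
analyses = [["DET", "PRON"], ["NOUN", "VERB"]]
result = constrained_viterbi(tokens, analyses, toy_emit, toy_trans)
```

With k candidate labels per token instead of a tagset of size T, each decoding step in this sketch examines k*k transitions rather than T*T, which is where the saving comes from when T is large, as is typical for morphosyntactic tagsets.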

We are planning to explore CONCRAFT's adaptability to an inflected language of a different family. Thus, we are looking for:
- a morphologically annotated corpus of French (preferably with both parts of speech and morphological features such as gender, number, tense, etc.),
- a large-coverage morphological analyser whose tagset is equivalent to the corpus tagset,
- other freely available taggers for French, for a comparative evaluation.

A French version of CONCRAFT resulting from this experiment would be distributed under an open license (probably BSD).

[1] Jakub Waszczuk, "Harnessing the CRF complexity with domain-specific constraints. The case of morphosyntactic tagging of a highly inflected 
language", in Proceedings of COLING 2012, Mumbai, India.



-- 
Agata Savary
Maître de conférences
IUT de Blois
Université François Rabelais de Tours
3 place Jean-Jaurès
41000 Blois
agata.savary at univ-tours.fr
tél. ++33 (0) 2 54 55 21 47
fax  ++33 (0) 2 54 55 21 32
http://www.info.univ-tours.fr/~savary/
