[Corpora-List] French corpora

Tue Dec 16 13:31:28 UTC 2014

Hi Antoinette,

Here's a version of the Est Republicain corpus with state-of-the-art pre-processing (lemma, POS, MWE annotation).
This data set was used for the SPMRL 2014 Shared Task (see the notes below).

https://www.dropbox.com/s/xsa2uhiryb5j9sf/FRENCH_SPMRL_UNL.tar.gz 

If you're interested in manually validated data sets, you can have a look at these:

The Cr#pBank is here (1700 sentences, another 2.7k coming up):
http://pauillac.inria.fr/~seddah/FrenchSocialMediaBank-v0.9.1beta.tar.gz

It covers Twitter, Facebook, Doctissimo (health forum) and jeuxvideos.com (video games),
with both noisy and less noisy text [1].

The Sequoia treebank is here (3200 sentences, both constituency and dependency versions):
https://www.rocq.inria.fr/alpage-wiki/tiki-index.php?page=CorpusSequoia
It covers a local newspaper, Wikipedia (history part), biomedical text and Europarl.

It's described in French in [2] and in English in the first half of [3].

The Deep Sequoia (the same corpus, but with deep syntax information) is here [4]:
http://deep-sequoia.inria.fr

You can also query some annotated corpora collected circa 2004 by Susanne Salmon-Alt and colleagues: the Freebank base
(its newswire part is available via a password-protected link, but the raw text comes from the Ananas corpus [5]).

http://corp.hum.sdu.dk/tgrepeye_fr.html

Of course, if you need the French Treebank (free for research, Le Monde text), please contact Anne Abeillé and Clément Plancq (clement.plancq at linguist.univ-paris-diderot.fr), who is in charge of distributing the original XML sources, or Marie Candito (Marie.Candito at gmail.com) for the current phrase-based and dependency ready-to-parse versions.

If you really need a huge annotated data set from newswire text, such as parsed French AFP streams from the last 4 years, please contact Eric de la Clergerie <Eric.De_La_Clergerie at inria.fr>.

It's also my understanding that some of the texts from Le Monde Diplomatique are under a Creative Commons license, but I don't know whether anyone has taken the time to gather some of them into a corpus.
Many more resources exist (football corpus, transcribed broadcast news, literary sources and so on), so I'm sure you'll find what you want; ask again otherwise.

Best,
Djamé 

[1] The French Social Media Bank: a Treebank of Noisy User Generated Content, Djamé Seddah, Benoit Sagot, Marie Candito, Virginie Mouilleron and Vanessa Combet, 2012, Proceedings of COLING 2012, Mumbai, India
[2] Le corpus Sequoia : annotation syntaxique et exploitation pour l’adaptation d’analyseur par pont lexical, Candito M.-H. and Djamé Seddah, 2012, Proceedings of TALN'2012, Grenoble, France
[3] A Word Clustering Approach to Domain Adaptation: Robust parsing of source and target domains, Djamé Seddah, Marie Candito and Enrique Henestroza Anguiano (to appear in Grammars, parsers and recognisers special issue of the Journal of Logic and Computation 12-27)
[4] Deep Syntax Annotation of the Sequoia French Treebank, Marie Candito, Guy Perrier, Bruno Guillaume, Corentin Ribeyre, Karën Fort, Djamé Seddah and Eric de la Clergerie, 2014, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC), Reykjavik, Iceland
[5] Le projet ANANAS : Annotation Anaphorique pour l’Analyse Sémantique de Corpus, Susanne Salmon-Alt, 2002, Proceedings of TALN'2002, Nancy, France

---------------------------------------------
Notes on the Est Republicain (SPMRL version)

Unlabeled French Data Set:
This data set is derived from the release of the Est Republicain Corpus [1],
preprocessed following [2], with morphological predictions (lemma, pos, features)
generated by Morfette [3] trained on the SPMRL 2013 Shared Task French data set (train full/gold).
MWE annotations have been added via Lgtagger [4] trained on the same data.

Statistics:
# of sentences: > 8 million
# of tokens: > 159 million

Annotation scheme:
Note that the morphological annotation scheme follows exactly the one present in
the French "gold files". Besides the pred=y and mwehead=POS+ features (which
mark, respectively, a token that is part of a compound/MWE and the part-of-speech
of the whole compound, as taken from the constituent file; see the French data set
documentation in FRENCH_SPMRL/doc/readme.spmrl), we also included the
predicted dependencies for the internal structure of the compound in the
fields HEAD, DEPREL, PHEAD, PDEPREL. Adding them as features instead of
"pre-bracketed" dependencies is trivial and left to the participants if
they so wish.
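
For readers who want to pull the MWE information out of these files, here is a minimal Python sketch. It is only an illustration and rests on assumptions the notes above do not spell out: that each token line has the ten tab-separated CoNLL-X columns (ID, FORM, LEMMA, CPOS, FPOS, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and that features in FEATS are separated by "|". The function names and the sample lines in the usage note are invented for the example; check FRENCH_SPMRL/doc/readme.spmrl for the actual layout.

```python
# Illustrative sketch only. Assumptions (not confirmed by the notes above):
# ten tab-separated CoNLL-X columns per token line, '|'-separated FEATS.
CONLL_FIELDS = ["ID", "FORM", "LEMMA", "CPOS", "FPOS", "FEATS",
                "HEAD", "DEPREL", "PHEAD", "PDEPREL"]


def parse_feats(feats):
    """Turn 'a=b|c=d' into {'a': 'b', 'c': 'd'}; '_' means no features."""
    if feats == "_":
        return {}
    result = {}
    for item in feats.split("|"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else ""
    return result


def parse_token(line):
    """Split one token line into a dict keyed by column name."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(CONLL_FIELDS):
        raise ValueError(
            f"expected {len(CONLL_FIELDS)} columns, got {len(columns)}")
    token = dict(zip(CONLL_FIELDS, columns))
    token["FEATS"] = parse_feats(token["FEATS"])
    return token


def is_mwe_part(token):
    """True if the token carries pred=y (marked as part of a compound/MWE)."""
    return token["FEATS"].get("pred") == "y"


def mwe_pos(token):
    """Part-of-speech of the whole compound from mwehead=POS+, else None."""
    value = token["FEATS"].get("mwehead")
    return value.rstrip("+") if value else None
```

As a usage example with two made-up lines, parse_token("2\teffet\teffet\tN\tNC\tg=m|n=s|pred=y\t1\tdep_cpd\t1\tdep_cpd") yields a token for which is_mwe_part is true and whose HEAD field points at token 1, while mwe_pos applied to a token whose FEATS column is "mwehead=ADV+" gives "ADV".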

Quality of the annotations (on the dev set)

lemmas acc: 99.10
cpos acc: 97.98
fpos acc: 97.43
feat acc: 81.31
feat acc (no mwe features): 92.79

MWE recognition performance is 81.2% F-score on the dev set [5].

Full Mate (graph-based) dependency predictions will be made available soon
(at least for a significant subset of this data set).

Djamé Seddah, Marie Candito and Matthieu Constant

[1] Bertrand Gaiffe and Kamel Nehbi. 2009. Le corpus de l'Est Républicain.
Technical report, Atilf. http://www.cnrtl.fr/corpus/estrepublicain/
[2] Djamé Seddah, Marie Candito, Benoit Crabbé and Enrique Henestroza Anguiano. 2012.
Ubiquitous Usage of a Broad Coverage French Corpus: Processing the Est
Republicain corpus. In Proceedings of LREC 2012.
[3] Grzegorz Chrupała, Georgiana Dinu, and Josef van Genabith. 2008. Learning
morphology with Morfette. In Proceedings of LREC 2008, Marrakech, Morocco.
[4] Matthieu Constant, Anthony Sigogne, and Patrick Watrin. 2012.
Discriminative strategies to integrate multiword expression recognition and
parsing. In Proceedings of the 50th Annual Meeting of the Association for
Computational Linguistics: Long Papers - Volume 1, ACL ’12, pages 204–212,
Stroudsburg, PA, USA. Association for Computational Linguistics.
[5] Matthieu Constant, Marie Candito and Djamé Seddah. 2013. The LIGM-Alpage architecture
for the SPMRL 2013 Shared Task: Multiword Expression Analysis and Dependency Parsing.
In Proceedings of the Fourth SPMRL Workshop, Seattle, USA.
