Corpora: French corpora and software - Summary

NOELLE-VERONIQUE SERPOLLET n.serpollet at lancaster.ac.uk
Thu Jun 15 11:27:22 UTC 2000


Dear list members,

Now that I have thanked the people who helped me with my query regarding
"Parallel corpora and French software", here is a summary of the
results I obtained:

	* software that I could use to tag/analyse my French data

Michael Barlow is currently developing ParaConc. 
<The new version will be based on the code from MonoConc Pro and will be
<similar in functionality (but with more functions) to the one that you
<are using, [ParaConc, 1995], but the underlying code will be different.

http://jupiter.inalf.cnrs.fr/WinBrill/    
(Maria José Ribeiro <mj.ribeiro at NETC.PT>)

	* tagger/concordancer which would enable me to retrieve occurrences
	of the French subjunctive

Cordial 6 Universités has a tagger/lemmatizer for French which does this:
1       Il      il      PPER3S
2       faut    falloir VINDP3S
3       que     que     SUB
4       je      je      PPER1S
5       vienne  venir   VSUBP1S
6       .       .       PCTFORTE
(Jean Veronis, http://www.up.univ-mrs.fr/~veronis)
For more information, contact SYNAPSE Développement 
www.synapse-fr.com
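
To show how output of this kind could be searched for subjunctives, here
is a minimal Python sketch. It is only an illustration under stated
assumptions: that the tagger output is saved as plain text with four
whitespace-separated columns (index, token, lemma, tag) as in the sample
above, and that subjunctive verb tags begin with "VSUB", which I infer
from the single tag VSUBP1S shown here and not from the Cordial
documentation. The function names are my own and hypothetical.

```python
# Sample copied from the tagged sentence above: index, token, lemma, tag.
SAMPLE = """\
1       Il      il      PPER3S
2       faut    falloir VINDP3S
3       que     que     SUB
4       je      je      PPER1S
5       vienne  venir   VSUBP1S
6       .       .       PCTFORTE
"""


def parse_tagged(text):
    """Turn four-column tagger output into (index, token, lemma, tag) tuples.

    Lines that do not have exactly four fields, or whose first field is
    not a number, are skipped.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        index, token, lemma, tag = fields
        rows.append((int(index), token, lemma, tag))
    return rows


def find_subjunctives(rows, prefix="VSUB"):
    """Keep rows whose tag starts with the assumed subjunctive prefix."""
    return [row for row in rows if row[3].startswith(prefix)]


if __name__ == "__main__":
    for index, token, lemma, tag in find_subjunctives(parse_tagged(SAMPLE)):
        print(index, token, lemma, tag)  # prints: 5 vienne venir VSUBP1S
```

If the real tagset marks the subjunctive differently, only the prefix
argument should need changing.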


	* gather a French/English parallel corpus (with the texts being
	aligned if possible).

<ARCADE corpus of ca. 1.5M words of Fr/En texts aligned at sentence level:
<http://www.up.univ-mrs.fr/~veronis/arcade

<The corpus is distributed by ELRA:
<http://www.icp.grenet.fr/ELRA/home.html
(Jean Veronis, veronis at up.univ-mrs.fr)

Tim Johns' website: http://web.bham.ac.uk/johnstf/timconc.htm

<He's been working on parallel concordancing within the Lingua
<project on multilingual parallel concordancing. I'm not
<quite sure whether you'll find actual corpora there, but
<there may be something, plus probably useful links.
(Antoine Consigny, anconsig at liverpool.ac.uk, anconsig at yahoo.fr)

Two corpora, primarily political and legislative in content, are
available from the LDC:

<UN Parallel Text (English/Spanish/French)
<http://morph.ldc.upenn.edu/Catalog/LDC94T4A.html

<-- you can request just the English and French data, if you
<prefer; the full corpus is a 3-cdrom set, with one language per
<cdrom, one text document per data file, and alignment at the level
<of document/file only.

<Canadian Hansards (French/English)
<http://morph.ldc.upenn.edu/Catalog/LDC95T20.html

<-- a single cdrom containing
<two distinct sets of parallel text; one set is aligned at the
<sentence level, and the other (smaller) set is aligned at the
<paragraph level (with additional alignment data for individual
<word tokens within paragraphs).

Please write to ldc at ldc.upenn.edu if you would like further
information or are interested in purchasing either of these
collections.
(Shannon Sears, Linguistic Data Consortium, ssears at ldc.upenn.edu
www: http://www.ldc.upenn.edu)

I hope this will be of interest to a lot of members.
Noelle
---------------------
Noëlle SERPOLLET
Department of Linguistics and MEL
Lancaster University,
LANCASTER, LA1 4YT, UK
e-mail: n.serpollet at lancaster.ac.uk


