[Corpora-List] BILINGUAL PARALLEL CORPORA

Philipp Koehn pkoehn at inf.ed.ac.uk
Mon Nov 13 14:06:31 UTC 2006


Hi,

Large available corpora for the languages in question
are the Europarl http://www.statmt.org/europarl/ and
Acquis Communautaire http://langtech.jrc.it/JRC-Acquis.html
corpora.

I am not sure what you mean by your second question.
What is the purpose of such a tool? There are tools out
there that do word alignment, build statistical machine
translation models, etc.

Also, the size of the corpus very much depends on
what you want to do with it. For statistical machine
translation, 1 million words goes a long way, although
recent systems are typically trained on more data.
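
To see where a given corpus falls relative to that figure, a rough
count is easy to get. Below is a minimal sketch in Python, assuming the
corpus is stored as two line-aligned plain-text files (one sentence per
line, line N of one file translating line N of the other) and that
splitting on whitespace is a good enough notion of "word". The function
name corpus_size is made up for illustration.

```python
from typing import Iterable, Tuple


def corpus_size(source_lines: Iterable[str],
                target_lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (sentence pairs, source words, target words).

    Assumes the two inputs are line-aligned. Pairs in which either
    side is empty are skipped, since they carry no usable alignment.
    Words are counted by naive whitespace splitting.
    """
    pairs = src_words = tgt_words = 0
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # skip pairs with an empty side
        pairs += 1
        src_words += len(src.split())
        tgt_words += len(tgt.split())
    return pairs, src_words, tgt_words
```

With two hypothetical files corpus.es and corpus.en, this could be
called as corpus_size(open("corpus.es"), open("corpus.en")). Counts
from a proper tokenizer will differ somewhat, so treat the result as
an order-of-magnitude check only.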

Regards,
Philipp Koehn

On 11/12/06, JLDLME <jldlme at yahoo.com> wrote:
> Dear Corpora-List members,
>
> I have three questions...
>
> Does anyone know of a freely available, sentence-aligned bilingual
> corpus involving several languages, namely Scandinavian (Finnish,
> Norwegian, etc.) or Latin languages (Spanish, Italian, etc.), for
> bilingual studies?
>
> My second question is: What would be the requirements for creating an
> online/desktop software tool for the whole process of a parallel corpus?
>
> Finally, would you consider one million words (in both languages) a
> large or a small bilingual corpus?
>
> Any help will be appreciated.
>
>
> Regards,
>
>
> J. L. DeLucca (in some place of Spain)
>
>


