[Corpora-List] quantities of publicly available parallel text?

Alexandre Rafalovitch arafalov at gmail.com
Wed Feb 27 14:47:30 UTC 2008


Official documents of the United Nations are translated (by human
translators) into six languages (English, French, Spanish, Russian,
Chinese, Arabic). Unfortunately, they are not available as
research-ready bitexts, but the documents themselves are available from
http://documents.un.org . There is quite a lot of text there, if one
is ready to do some non-traditional parsing.

For the last 8-10 years, most of those documents have been available
in MSWord format. The rest are in PDFs (some with text and some with
scanned images).
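
A minimal sketch of a first step in that parsing, assuming the documents
have already been downloaded into a local folder (the folder and the
function names are hypothetical, not anything the UN site provides): sort
the files by their leading bytes rather than by extension. It does not
try to tell text PDFs from scanned ones, which needs a real PDF library
or OCR.

```python
from pathlib import Path

# Leading bytes ("magic numbers") of the two formats mentioned above.
# Note: the OLE2 signature is shared by other legacy Office files
# (.xls, .ppt), so "msword" here is an assumption about the folder contents.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
PDF_MAGIC = b"%PDF-"


def classify_format(data: bytes) -> str:
    """Return 'msword', 'pdf', or 'other' based on the first bytes of a file."""
    if data.startswith(OLE2_MAGIC):
        return "msword"
    if data.startswith(PDF_MAGIC):
        return "pdf"
    return "other"


def classify_directory(folder: str) -> dict:
    """Group the file names in `folder` (hypothetical local dump) by format."""
    groups = {"msword": [], "pdf": [], "other": []}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            with path.open("rb") as fh:
                # Eight bytes are enough to cover both signatures.
                groups[classify_format(fh.read(8))].append(path.name)
    return groups
```

Calling classify_directory("un_docs") on such a folder would return a
dictionary of file names per format, which gives a quick sense of how
much of a download can go through a Word converter and how much needs
PDF handling.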

LDC had a very old sample of UN documents; I think that was before
MSWord versions started to be published, so they had to scan and clean
their data.

I have more information available, if anybody is interested. I am
doing research on Named Entity Recognition in that domain, but there
are enough challenges in the corpora to go around.

Regards,
    Alex.

-- 
Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/

On Tue, Feb 26, 2008 at 9:50 PM, Chris Dyer <redpony at umd.edu> wrote:
> Dear colleagues,
>
>  Is anyone aware of attempts to estimate how much machine-readable
>  parallel text is publicly available?  I'm trying to get a general
>  sense of the scale of parallel data we currently have (and are likely
>  to have in the future, assuming current growth trends).  Does anyone
>  have any statistics on this sort of thing?

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


