[Corpora-List] quantities of publicly available parallel text?
Alexandre Rafalovitch
arafalov at gmail.com
Wed Feb 27 14:47:30 UTC 2008
Official documents of the United Nations are translated (by human
translators) into 6 languages (English, French, Spanish, Russian,
Chinese, Arabic). Unfortunately, they are not available as
research-ready bitexts, but the documents themselves are available from
http://documents.un.org . There is quite a lot of text there, if one
is ready to do some non-traditional parsing.
For the last 8-10 years, most of those documents have been available
in MSWord format. The rest are PDFs (some with extractable text and
some containing only scanned images).
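
As a rough illustration of the first step of that parsing, here is a
minimal Python sketch that sorts already-downloaded files into MSWord
and PDF buckets by their leading bytes. The function names and the
local directory are hypothetical. It assumes the MSWord files use the
legacy binary (OLE2) container, which other Office formats share, and
it does not try to tell text PDFs from scanned ones.

```python
from pathlib import Path

# Leading "magic" bytes of the two container formats.
# OLE2 is the container used by legacy binary .doc files (assumption:
# the MSWord documents are in this format, not a newer zipped one).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
PDF_MAGIC = b"%PDF-"


def sniff_format(header: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    if header.startswith(OLE2_MAGIC):
        return "msword"
    if header.startswith(PDF_MAGIC):
        return "pdf"
    return "unknown"


def triage(directory: str) -> dict[str, list[str]]:
    """Group the file names in a local directory by sniffed format."""
    buckets: dict[str, list[str]] = {"msword": [], "pdf": [], "unknown": []}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            with path.open("rb") as fh:
                # Eight bytes cover the longer of the two signatures.
                buckets[sniff_format(fh.read(8))].append(path.name)
    return buckets
```

Calling triage() on a folder of downloads gives a quick count of how
many files can go straight to text extraction and how many PDFs still
need checking for scanned pages.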
LDC had a very old sample of UN documents; I think that was before
MSWord versions started to be published, so they had to scan and clean
their data.
I have more information available, if somebody takes an interest. I am
doing research in Named Entity Recognition in that domain, but there
are enough challenges in the corpora to go around.
Regards,
Alex.
--
Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/
On Tue, Feb 26, 2008 at 9:50 PM, Chris Dyer <redpony at umd.edu> wrote:
> Dear colleagues,
>
> Is anyone aware of attempts to estimate how much machine-readable
> parallel text is publicly available? I'm trying to get a general
> sense of the scale of parallel data we currently have (and are likely
> to have in the future, assuming current growth trends). Does anyone
> have any statistics on this sort of thing?