[Corpora-List] quantities of publicly available parallel text?

Adam Kilgarriff adam at lexmasterclass.com
Wed Feb 27 15:46:39 UTC 2008


But aren't all these official, centralised corpora both of rather peculiar
genres, and rather small?  More interesting, to my mind, is Tiedemann and
Nygaard's work, based on the neat observations that

1) films are often dubbed so exist in parallel languages
2) in the age of Web 2.0, people write transcripts and upload them
3) time stamps support alignment

see http://urd.let.rug.nl/tiedeman/OPUS/
- there's lots of (quasi-spoken) data for lots of language-pairs
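
To illustrate observation 3, here is a minimal sketch of how time stamps
could pair up lines from two transcripts of the same film. It is not the
method used in OPUS; it is a naive overlap heuristic, and the names Cue,
overlap, align_by_time and the min_ratio threshold are hypothetical,
chosen only for this example. It assumes both transcripts are already
parsed into cues with start and end times in seconds.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One transcript line with its time span (hypothetical structure)."""
    start: float  # seconds from the beginning of the film
    end: float
    text: str


def overlap(a: Cue, b: Cue) -> float:
    """Length in seconds of the time span shared by two cues (0 if none)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def align_by_time(src, tgt, min_ratio=0.5):
    """Pair each source cue with the target cue it overlaps most in time.

    A pair is kept only if the shared span covers at least min_ratio of
    the shorter of the two cues. min_ratio is an assumed threshold.
    """
    pairs = []
    for s in src:
        best = max(tgt, key=lambda t: overlap(s, t), default=None)
        if best is None:
            continue  # no target cues at all
        shorter = min(s.end - s.start, best.end - best.start)
        if shorter > 0 and overlap(s, best) / shorter >= min_ratio:
            pairs.append((s.text, best.text))
    return pairs


if __name__ == "__main__":
    english = [Cue(0.0, 2.0, "Hello."), Cue(2.5, 5.0, "How are you?")]
    dutch = [Cue(0.1, 2.1, "Hallo."), Cue(2.6, 4.9, "Hoe gaat het?"),
             Cue(10.0, 12.0, "Tot ziens.")]
    for en, nl in align_by_time(english, dutch):
        print(en, "|", nl)
```

Real transcripts drift in timing and split sentences differently across
languages, so a usable aligner would need to handle offsets and
one-to-many matches; the sketch only shows why the time stamps help.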

adam

2008/2/27 Alexandre Rafalovitch <arafalov at gmail.com>:

> Official documents of the United Nations are translated (by human
> translators) into 6 languages (English, French, Spanish, Russian,
> Chinese, Arabic). Unfortunately, they are not available as
> research-ready bitexts, but the documents themselves are available from
> http://documents.un.org . There is quite a lot of text there, if one
> is ready to do some non-traditional parsing.
>
> For the last 8-10 years, most of those documents have been available
> in MSWord format. The rest are in PDFs (some with text and some with
> scanned images).
>
> LDC had a very old sample of UN documents; I think that was before
> MSWord versions started to be published, so they had to scan and clean
> their data.
>
> I have more information available, if somebody takes an interest. I am
> doing research in Named Entity Recognition in that domain, but there
> are enough challenges in the corpora to go around.
>
> Regards,
>    Alex.
>
> --
> Personal blog: http://blog.outerthoughts.com/
> Research group: http://www.clt.mq.edu.au/Research/
>
> On Tue, Feb 26, 2008 at 9:50 PM, Chris Dyer <redpony at umd.edu> wrote:
> > Dear colleagues,
> >
> >  Is anyone aware of attempts to estimate how much machine-readable
> >  parallel text is publicly available?  I'm trying to get a general
> >  sense of the scale of parallel data we currently have (and are likely
> >  to have in the future, assuming current growth trends).  Does anyone
> >  have any statistics on this sort of thing?
>
>  _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>



-- 
================================================
Adam Kilgarriff
http://www.kilgarriff.co.uk
Lexical Computing Ltd                   http://www.sketchengine.co.uk
Lexicography MasterClass Ltd      http://www.lexmasterclass.com
Universities of Leeds and Sussex       adam at lexmasterclass.com
================================================