[Corpora-List] quantities of publicly available parallel text?
Adam Kilgarriff
adam at lexmasterclass.com
Wed Feb 27 15:46:39 UTC 2008
But aren't all these official, centralised corpora both of rather peculiar
genres, and rather small? More interesting, to my mind, is Tiedemann and
Nygard's work, based on the neat observations that
1) films are often dubbed so exist in parallel languages
2) in the age of Web 2.0, people write transcripts and upload them
3) time stamps support alignment
see http://urd.let.rug.nl/tiedeman/OPUS/
- there's lots of (quasi-spoken) data for lots of language-pairs
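
To make point 3 concrete, here is a minimal sketch of pairing subtitle cues across two languages by time overlap. It is only an illustration, not the method used for the OPUS data: the Cue class, the align_by_time function and the min_ratio threshold are hypothetical names and values chosen for this example, and real subtitle files would also need parsing, sentence splitting and handling of timing drift.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One subtitle cue: start/end time in seconds plus its text."""
    start: float
    end: float
    text: str


def overlap(a: Cue, b: Cue) -> float:
    """Length in seconds of the time interval shared by two cues."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def align_by_time(src, tgt, min_ratio=0.5):
    """Pair each source cue with the target cue it overlaps most.

    A pair is kept only if the overlap covers at least min_ratio of the
    shorter of the two cues (an assumed threshold, not a published one).
    """
    pairs = []
    for s in src:
        best = max(tgt, key=lambda t: overlap(s, t), default=None)
        if best is None:
            continue
        shorter = min(s.end - s.start, best.end - best.start)
        if shorter > 0 and overlap(s, best) / shorter >= min_ratio:
            pairs.append((s.text, best.text))
    return pairs


if __name__ == "__main__":
    english = [Cue(1.0, 3.0, "Hello."), Cue(4.0, 6.0, "How are you?")]
    dutch = [Cue(1.2, 3.1, "Hallo."), Cue(4.1, 6.2, "Hoe gaat het?")]
    for en, nl in align_by_time(english, dutch):
        print(en, "|", nl)
```

Cues with no sufficiently overlapping counterpart are simply dropped, which trades recall for cleaner pairs.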
adam
2008/2/27 Alexandre Rafalovitch <arafalov at gmail.com>:
> Official documents of the United Nations are translated (by human
> translators) into 6 languages (English, French, Spanish, Russian,
> Chinese, Arabic). Unfortunately, they are not available as
> research-ready bitexts, but the documents themselves are available from
> http://documents.un.org . There is quite a lot of text there, if one
> is ready to do some non-traditional parsing.
>
> For the last 8-10 years, most of those documents have been available
> in MSWord format. The rest are in PDFs (some with text and some with
> scanned images).
>
> LDC had a very old sample of UN documents; I think that was before
> MSWord versions started to be published, so they had to scan and clean
> their data.
>
> I have more information available, if somebody takes an interest. I am
> doing research in Named Entity Recognition in that domain, but there
> are enough challenges in the corpora to go around.
>
> Regards,
> Alex.
>
> --
> Personal blog: http://blog.outerthoughts.com/
> Research group: http://www.clt.mq.edu.au/Research/
>
> On Tue, Feb 26, 2008 at 9:50 PM, Chris Dyer <redpony at umd.edu> wrote:
> > Dear colleagues,
> >
> > Is anyone aware of attempts to estimate how much machine-readable
> > parallel text is publicly available? I'm trying to get a general
> > sense of the scale of parallel data we currently have (and are likely
> > to have in the future, assuming current growth trends). Does anyone
> > have any statistics on this sort of thing?
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
--
================================================
Adam Kilgarriff
http://www.kilgarriff.co.uk
Lexical Computing Ltd http://www.sketchengine.co.uk
Lexicography MasterClass Ltd http://www.lexmasterclass.com
Universities of Leeds and Sussex adam at lexmasterclass.com
================================================