But aren't all these official, centralised corpora both of rather peculiar genres, and rather small? More interesting, to my mind, is Tiedemann and Nygaard's work, based on the neat observations that:
1) films are often dubbed, so the same dialogue exists in parallel languages
2) in the age of Web 2.0, people write transcripts and upload them
3) time stamps support alignment (sketched below)

see http://urd.let.rug.nl/tiedeman/OPUS/
- there's lots of (quasi-spoken) data for lots of language-pairs
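
To make point 3 concrete, here is a rough sketch (mine, not the authors' actual pipeline) of how two subtitle tracks for the same film might be paired up by overlapping display times; the cue format, the 0.5-second overlap threshold and the example sentences are all just illustrative assumptions:

# Minimal sketch of timestamp-based subtitle alignment.
# Each track is assumed to be a list of (start_seconds, end_seconds, text) cues.

def overlap(a, b):
    """Length in seconds of the temporal overlap between two cues."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def align(cues_src, cues_tgt, min_overlap=0.5):
    """Pair source/target cues whose display times overlap enough.

    Both inputs are assumed sorted by start time; cues with no
    sufficiently overlapping counterpart are simply skipped.
    """
    pairs, j = [], 0
    for src in cues_src:
        # skip target cues that end before this source cue starts
        while j < len(cues_tgt) and cues_tgt[j][1] <= src[0]:
            j += 1
        k = j
        while k < len(cues_tgt) and cues_tgt[k][0] < src[1]:
            if overlap(src, cues_tgt[k]) >= min_overlap:
                pairs.append((src[2], cues_tgt[k][2]))
            k += 1
    return pairs

if __name__ == "__main__":
    en = [(1.0, 3.5, "Where are you going?"), (4.0, 6.0, "Home.")]
    de = [(1.1, 3.6, "Wohin gehst du?"), (4.1, 6.2, "Nach Hause.")]
    for s, t in align(en, de):
        print(s, "|||", t)

In practice one would also have to handle framerate and offset differences between tracks, but the basic idea really is just interval overlap.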

adam

2008/2/27 Alexandre Rafalovitch <arafalov@gmail.com>:
> Official documents of the United Nations are translated (by human
> translators) into 6 languages (English, French, Spanish, Russian,
> Chinese, Arabic). Unfortunately they are not available as research-ready
> bitexts, but the documents themselves are available from
> http://documents.un.org . There is quite a lot of text there, if one
> is ready to do some non-traditional parsing.
>
> For the last 8-10 years, most of those documents have been available
> in MSWord format. The rest are in PDFs (some with text and some with
> scanned images).
>
> LDC had a very old sample of UN documents; I think that was before
> the MSWord versions started to be published, so they had to scan and
> clean their data.
>
> I have more information available if somebody takes an interest. I am
> doing research in Named Entity Recognition in that domain, but there
> are enough challenges in the corpora to go around.
>
> Regards,
>  Alex.
>
> --
> Personal blog: http://blog.outerthoughts.com/
> Research group: http://www.clt.mq.edu.au/Research/
<div class="Ih2E3d"><br>On Tue, Feb 26, 2008 at 9:50 PM, Chris Dyer <<a href="mailto:redpony@umd.edu">redpony@umd.edu</a>> wrote:<br>> Dear colleagues,<br>><br>> Is anyone aware of attempts to estimate how much machine-readable<br>
> parallel text is publicly available? I'm trying to get a general<br>> sense of the scale of parallel data we currently have (and are likely<br>> to have in the future, assuming current growth trends). Does anyone<br>
> have any statistics on this sort of thing?<br><br></div>

--
================================================
Adam Kilgarriff                    http://www.kilgarriff.co.uk
Lexical Computing Ltd              http://www.sketchengine.co.uk
Lexicography MasterClass Ltd       http://www.lexmasterclass.com
Universities of Leeds and Sussex   adam@lexmasterclass.com
================================================