<br><div><span class="gmail_quote"></span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Another thought: is there any place that actually tracks these sorts of
<br>pages? I know Phil Resnik was collecting some of this in the past<br>(<a href="http://umiacs.umd.edu/~resnik/strand/">http://umiacs.umd.edu/~resnik/strand/</a>), but I don't believe he is<br>actively doing so now.
</blockquote><div><br></div></div><br>That's correct, Mike -- unfortunately I didn't have the resources to
create an ongoing Web bitext mining operation, either via my own
crawling or using stored crawls on the Internet Archive. Our approach to classifying Web page pairs as translations (vs. not) actually works pretty well, but it's the
infrastructure that gets you -- getting your hands on the pages and managing an ongoing operation. There are some folks trying to get Web-scale computing off the ground for the language research community (e.g. <a href="http://wacky.sslmit.unibo.it/doku.php">
http://wacky.sslmit.unibo.it/doku.php</a>) but I'm not aware of anything yet that would allow you to put together the sort of data you'd need in order to answer Yorick's question. I'd be happy to hear otherwise...
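<br><br>For anyone curious about the classification idea mentioned above: one core signal in STRAND-style mining is that translated page pairs tend to share HTML structure. Below is a toy sketch of that intuition only -- the function names, sample pages, and the use of difflib are my own illustration, not the actual system, which also aligns text chunks and uses tuned decision criteria.<br><br>

```python
# Toy sketch of the structural-comparison intuition behind STRAND-style
# bitext mining: reduce each page to its sequence of HTML tag names and
# measure how well the two sequences align. Parallel (translated) pages
# often have near-identical markup skeletons.
# NOTE: illustrative only -- not the actual STRAND implementation.
import re
from difflib import SequenceMatcher

def tag_sequence(html):
    """Linearize a page into its sequence of HTML tag names."""
    return re.findall(r"</?([a-zA-Z][a-zA-Z0-9]*)", html)

def structural_similarity(html_a, html_b):
    """Alignment ratio in [0, 1] between the two tag sequences."""
    return SequenceMatcher(None, tag_sequence(html_a),
                           tag_sequence(html_b)).ratio()

# Hypothetical example pages (assumptions, not real crawled data):
page_en = "<html><body><h1>Hello</h1><p>Welcome.</p></body></html>"
page_fr = "<html><body><h1>Bonjour</h1><p>Bienvenue.</p></body></html>"
page_other = "<html><body><ul><li>a</li><li>b</li></ul></body></html>"

print(structural_similarity(page_en, page_fr))     # identical markup: 1.0
print(structural_similarity(page_en, page_other))  # different markup: much lower
```

A classifier would threshold a score like this (together with other features, such as aligned text-length ratios) to decide whether a page pair is a candidate translation.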
<br><br>Best,<br><br> Philip<br><br>