[Corpora-List] Query about the (dual) language of web pages

P Resnik psresnik at gmail.com
Tue Oct 9 19:23:33 UTC 2007


> Another thought: is there any place that actually tracks these sorts of
> pages?  I know Phil Resnik was collecting some of this in the past
> (http://umiacs.umd.edu/~resnik/strand/), but I don't believe he is
> actively doing so now.



That's correct, Mike -- unfortunately I didn't have the resources to create
an ongoing Web bitext mining operation, either via my own crawling or using
stored crawls on the Internet Archive.  Our approach to classifying Web page
pairs as translations (vs. not) actually works pretty well, but it's the
infrastructure that gets you -- getting your hands on the pages and managing
an ongoing operation.  There are some folks trying to get Web-scale
computing off the ground for the language research community (e.g.
http://wacky.sslmit.unibo.it/doku.php) but I'm not aware of anything yet
that would allow you to put together the sort of data you'd need in order to
answer Yorick's question.  I'd be happy to hear otherwise...

Best,

  Philip