[Corpora-List] Parsed corpus file format

Linas Vepstas linasvepstas at gmail.com
Wed Jul 2 21:52:56 UTC 2008


2008/7/2 Serge Sharoff <S.Sharoff at leeds.ac.uk>:
> It takes a lot of time to set up a crawl

The current project is being done in conjunction with the
Apache Nutch/Lucene groups, and Wikia search, who already
have infrastructure in place, and have a vested interest in search.

From personal experience, CPU time is a major limiting factor:
it can take weeks or months of CPU time to parse even a modest
corpus.  For example, the $100 million acquisition of Powerset
by Microsoft is being driven in part by Powerset's need to get
access to more compute power.

The hope is that by making this data publicly available, new
applications can be enabled, independent of proprietary
efforts such as Powerset.

> for an example see the ukWac paper at the last Web as Corpus workshop:
> http://webascorpus.sf.net/WAC4

I'll take a look.

--linas

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora