[Corpora-List] language-specific harvesting of texts from the Web

Marco Baroni baroni at sslmit.unibo.it
Mon Aug 30 22:33:01 UTC 2004


Maybe you could extract seeds to be used in new queries from the pages 
you found, as suggested in:

R. Ghani, R. Jones, and D. Mladenic. 2001. Mining the web to create 
minority language corpora. CIKM 2001, 279–286.
http://citeseer.ist.psu.edu/ghani01mining.html
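
A rough Python sketch of that loop, in case it helps. It is not the 
algorithm from the paper: as a simple stand-in for their term selection 
it just ranks words by how many of the harvested pages they occur in. 
The function names (extract_seeds, build_queries) and all parameters are 
made up for illustration, and fetching the pages from a search engine is 
left out.

```python
import random
import re
from collections import Counter


def extract_seeds(pages, known_seeds, n_seeds=10, min_len=4):
    """Return new seed words ranked by how many pages contain them.

    Assumption: plain document frequency is a good enough proxy for
    "typical of the target language"; the paper uses finer statistics.
    """
    counts = Counter()
    for text in pages:
        # Use a set so each word counts once per page.
        words = set(re.findall(r"[^\W\d_]+", text.lower()))
        counts.update(
            w for w in words if len(w) >= min_len and w not in known_seeds
        )
    # Sort by descending count, then alphabetically, so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n_seeds]]


def build_queries(seeds, n_queries=5, tuple_size=3, rng=None):
    """Combine seeds into random multi-word query strings."""
    seeds = sorted(set(seeds))
    if not seeds:
        return []
    rng = rng or random.Random(0)  # fixed seed keeps runs repeatable
    size = min(tuple_size, len(seeds))
    return [" ".join(rng.sample(seeds, size)) for _ in range(n_queries)]


if __name__ == "__main__":
    # Toy "harvested pages"; real input would be downloaded page text.
    pages = [
        "kaixo mundua etxea",
        "kaixo etxea liburua",
        "kaixo zuhaitza",
    ]
    new_seeds = extract_seeds(pages, known_seeds={"kaixo"}, n_seeds=3)
    for query in build_queries(new_seeds, n_queries=2, tuple_size=2):
        print(query)
```

Each round you would run the new queries, add the pages that pass your 
language check to the pool, and extract seeds again.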

We have a set of simple tools to automate this kind of procedure 
somewhat (we use them mostly for terminology extraction, but they kind of 
work to create general-purpose monolingual corpora as well):

http://sslmit.unibo.it/~baroni/bootcat.html

Regards,

Marco



On Monday, Aug 30, 2004, at 22:51 Europe/Rome, Mark P. Line wrote:

> I've been playing with Google searches for extracting texts in a
> particular language from the Web without a lot of noise (i.e. few texts
> that aren't in the desired language). Any comments on the utility of
> this approach for more serious corpus research? Any improvements to the
> best
> search criteria I've been able to come up with below? Any good search
> criteria for languages not listed?
>
> (If there's any interest at all, I'd be happy to collect searches like
> these on a webpage somewhere.)
>
>
---
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni


