[Corpora-List] language-specific harvesting of texts from the Web
Marco Baroni
baroni at sslmit.unibo.it
Mon Aug 30 22:33:01 UTC 2004
Maybe you could extract seeds to be used in new queries from the pages
you found, as suggested in:
R. Ghani, R. Jones, and D. Mladenic. 2001. Mining the web to create
minority language corpora. CIKM 2001, 279–286.
http://citeseer.ist.psu.edu/ghani01mining.html
We have a set of simple tools that partly automate this kind of
procedure (we use them mostly for terminology extraction, but they also
work, more or less, for building general-purpose monolingual corpora):
http://sslmit.unibo.it/~baroni/bootcat.html
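To make the idea concrete, here is a minimal Python sketch of one bootstrapping round: take the text of the pages you already have, pick frequent words as new seeds, and combine them into new queries. This is only an illustration. It is not the code of our tools, and plain frequency ranking is a stand-in for the term selection methods discussed in the paper. The function names, the parameters and the toy input are all hypothetical, and actually sending the queries to a search engine is left out.

```python
import itertools
import random
import re
from collections import Counter


def extract_seeds(pages, stopwords=frozenset(), n_seeds=10, min_len=4):
    """Return the n_seeds most frequent tokens found in the page texts.

    Assumption: raw frequency is used as a crude ranking; tokens shorter
    than min_len or listed in stopwords are ignored.
    """
    counts = Counter()
    for text in pages:
        # [^\W\d_]+ matches runs of letters, including non-ASCII ones.
        for token in re.findall(r"[^\W\d_]+", text.lower()):
            if len(token) >= min_len and token not in stopwords:
                counts[token] += 1
    return [word for word, _ in counts.most_common(n_seeds)]


def build_queries(seeds, tuple_size=3, n_queries=5, rng=None):
    """Combine seeds into random tuples; each tuple is one query string."""
    rng = rng or random.Random(0)  # fixed default seed for repeatable runs
    combos = list(itertools.combinations(seeds, tuple_size))
    rng.shuffle(combos)
    return [" ".join(combo) for combo in combos[:n_queries]]


# Toy stand-in for the text of pages returned by a first round of queries.
pages = ["korpus korpus teksti kieli", "kieli korpus sanasto"]
seeds = extract_seeds(pages, n_seeds=4)
queries = build_queries(seeds, tuple_size=3, n_queries=2)
```

Each query would then be submitted, the new pages downloaded and filtered for the target language, and the loop repeated with seeds extracted from the enlarged page set.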
Regards,
Marco
On Monday, Aug 30, 2004, at 22:51 Europe/Rome, Mark P. Line wrote:
> I've been playing with Google searches for extracting texts in a
> particular language from the Web without a lot of noise (i.e. few texts
> that aren't in the desired language). Any comments on the utility of this
> approach for more serious corpus research? Any improvements to the best
> search criteria I've been able to come up with below? Any good search
> criteria for languages not listed?
>
> (If there's any interest at all, I'd be happy to collect searches like
> these on a webpage somewhere.)
>
>
---
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni