[Corpora-List] Corpus Mining

Tue Dec 7 18:57:24 UTC 2004

Hi there.

The CorpusBuilder tool was the main inspiration for BootCaT:

http://www-2.cs.cmu.edu/~TextLearning/corpusbuilder/

CorpusBuilder is intended for collecting texts in a specific language, 
rather than about a specific topic, but I suppose it could be tweaked 
to look for specialized texts.

CorpusBuilder was (is?) part of a larger project about acquiring 
knowledge from the web:

http://www-2.cs.cmu.edu/~webkb/

An Crúbadán is another tool for language-specific web-corpus mining 
that could perhaps be tweaked for sub-language mining:

http://borel.slu.edu/crubadan/

Also somewhat relevant is the notion of "focused crawling" in 
information retrieval; see e.g.

http://www8.org/w8-papers/5a-search-query/crawling/

Regards,

Marco

---
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni