[Corpora-List] Getting articles from newspapers to compile a corpus

Fri Nov 30 00:21:48 UTC 2012

On Thu, Nov 29, 2012 at 10:54:46PM +0100, Matías Guzmán wrote:
> 
> I already tried wget, and it seems to work quite well, but I wasn't able to
> clean the HTML files it creates using BeautifulSoup for Python. Maybe
> somebody knows of other software capable of doing this?

For cleaning HTML files, JusText (http://code.google.com/p/justext/)
might be what you are looking for. If you also want to remove
duplicate or near-duplicate documents, you will need another tool, or
you will have to write your own.
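
If you go the "write your own" route for near-duplicates, one common
approach is to compare documents by their overlapping word n-grams
("shingles") and drop any document whose Jaccard similarity to an
already kept one is too high. The sketch below is only an illustration
of that idea, not something JusText provides; the function names, the
3-word shingle size and the 0.8 threshold are all assumptions chosen
for the example, and the input is assumed to be plain text that has
already been cleaned of HTML.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (word n-grams) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # two empty documents count as identical
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.8, k=3):
    """Keep the first of each group of near-duplicate documents.

    threshold and k are illustrative defaults (assumptions), to be
    tuned on your own data.
    """
    kept = []  # list of (text, shingle set) pairs kept so far
    for text in docs:
        sh = shingles(text, k)
        # Keep the document only if it is not too similar to any kept one.
        if all(jaccard(sh, other) < threshold for _, other in kept):
            kept.append((text, sh))
    return [text for text, _ in kept]
```

This compares every document against every kept one, so it is
quadratic in the number of documents. That is probably fine for a few
thousand articles; for a larger corpus you would likely want something
like MinHash instead of exact set comparison.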

Cagri
