[Corpora-List] Getting articles from newspapers to compile a corpus

cagri coltekin c.coltekin at rug.nl
Fri Nov 30 00:21:48 UTC 2012


On Thu, Nov 29, 2012 at 10:54:46PM +0100, Matías Guzmán wrote:
> 
> I already tried wget, and it seems to work quite well, but I wasn't able to
> clean the HTML files it creates using BeautifulSoup for Python. Maybe
> somebody knows of other software capable of doing this?

For cleaning HTML files, JusText (http://code.google.com/p/justext/)
might be what you are looking for: it strips boilerplate such as
navigation and footers and keeps the main text. If you also want to
remove duplicate or near-duplicate documents, you will need another
tool, or you can write your own.
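
As a rough illustration of the "write your own" route for near-duplicates,
here is a minimal sketch using only the Python standard library. It is not
part of JusText or any other tool mentioned here. It assumes the documents
are already cleaned plain text, compares them by word 3-gram ("shingle")
overlap using Jaccard similarity, and keeps the first copy of each group.
The function names and the 0.8 threshold are hypothetical choices for
illustration, and the pairwise comparison is only practical for small
collections.

```python
import re


def shingles(text, n=3):
    """Return the set of n-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Very short texts: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.8, n=3):
    """Keep each document unless it is too similar to one already kept.

    threshold is an arbitrary illustrative value, not a recommendation.
    """
    kept = []
    kept_shingles = []
    for doc in docs:
        s = shingles(doc, n)
        if any(jaccard(s, t) >= threshold for t in kept_shingles):
            continue  # near-duplicate of an earlier document
        kept.append(doc)
        kept_shingles.append(s)
    return kept


if __name__ == "__main__":
    articles = [
        "The minister said the budget will be cut next year.",
        "The minister said the budget will be cut next year again.",
        "A completely different article about football results.",
    ]
    for text in drop_near_duplicates(articles):
        print(text)
```

For a corpus of more than a few thousand articles, comparing every pair
becomes too slow, and a hashing scheme such as MinHash would be the usual
next step.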

Cagri
