[Corpora-List] Getting articles from newspapers to compile a corpus
cagri coltekin
c.coltekin at rug.nl
Fri Nov 30 00:21:48 UTC 2012
On Thu, Nov 29, 2012 at 10:54:46PM +0100, Matías Guzmán wrote:
>
> I already tried wget, it seems to work quite well, but I wasn't able to
> clean the html files it creates using BeautifulSoup for python. Maybe
> somebody knows of other software capable of doing this?
For cleaning HTML files, jusText (http://code.google.com/p/justext/)
might be what you are looking for. If you also want to remove
duplicate or near-duplicate documents, you will need another tool,
or you will have to write your own.
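
If you do end up writing your own near-duplicate filter, here is a
rough sketch of one common approach: compare documents by the overlap
(Jaccard similarity) of their word n-grams ("shingles") and keep a
document only if it is not too similar to one already kept. The
function names, the shingle size k=3 and the 0.8 threshold are my own
assumptions for illustration, not anything jusText provides, and the
pairwise comparison is only practical for small collections; for a
large crawl something like MinHash would be a better fit.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.8, k=3):
    """Keep each document only if it is below the similarity threshold
    against every document kept so far (hypothetical helper)."""
    kept = []  # list of (text, shingle set) pairs
    for doc in docs:
        s = shingles(doc, k)
        if all(jaccard(s, t) < threshold for _, t in kept):
            kept.append((doc, s))
    return [doc for doc, _ in kept]
```

You would run this on the cleaned text (after boilerplate removal),
not on the raw HTML, since shared navigation and footers would
otherwise make unrelated pages look similar.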
Cagri