[Corpora-List] tool for extracting text from web forum and websites

Timothy Baldwin tb at ldwin.net
Thu Oct 15 03:35:05 UTC 2009


Hi Isabella,


> I need a tool for extracting all the text from pages and subpages of a Web
> Forum. I do not need a cleaning tool at the moment.
> 
> Can you suggest a tool to perform this operation?

We developed SiteScraper (http://sitescraper.googlecode.com) at Melbourne
University for exactly this purpose -- scraping threads from web user forums,
maintaining as much structure as possible (e.g. posts, titles, thread titles,
timestamps, post authors). You will need to provide a couple of training
instances (literally a handful), but otherwise, it should just work. Email me
off-list if you are after more details.


Tim

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


