[Corpora-List] tool for extracting text from web forum and websites
Timothy Baldwin
tb at ldwin.net
Thu Oct 15 03:35:05 UTC 2009
Hi Isabella,
> I need a tool for extracting all the text from pages and subpages of a Web
> Forum. I do not need a cleaning tool at the moment.
>
> Can you suggest a tool to perform this operation?
We developed SiteScraper (http://sitescraper.googlecode.com) at Melbourne
University for exactly this purpose -- scraping threads from web user forums,
maintaining as much structure as possible (e.g. posts, titles, thread titles,
timestamps, post authors). You will need to provide a handful of training
instances, but otherwise it should just work. Email me off-list if you would
like more details.
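To make "maintaining structure" concrete, here is a minimal Python sketch of
the kind of output meant: one record per post, with author, timestamp and body.
It is not SiteScraper and does not use its API or its training-based approach;
it uses only the standard library, and the markup, the class names ("post",
"author", "timestamp", "body") and the names ForumPostParser and extract_posts
are all assumptions made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical forum markup; real forums differ, and these class names are assumptions.
SAMPLE_PAGE = """
<div class="post">
  <span class="author">alice</span>
  <span class="timestamp">2009-10-14 09:12</span>
  <div class="body">Has anyone tried scraping this forum?</div>
</div>
<div class="post">
  <span class="author">bob</span>
  <span class="timestamp">2009-10-14 10:03</span>
  <div class="body">Yes, it works <b>fine</b> for me.</div>
</div>
"""

FIELDS = ("author", "timestamp", "body")
# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class ForumPostParser(HTMLParser):
    """Collects one dict per element with class="post" (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._field = None  # name of the field currently being read, if any
        self._depth = 0     # element nesting depth inside that field

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class") or ""
        if cls == "post":
            # Start a new record with empty fields.
            self.posts.append({name: "" for name in FIELDS})
        elif self._field is None and cls in FIELDS and self.posts:
            self._field = cls
            self._depth = 1
        elif self._field is not None:
            # Inline markup inside a field, e.g. <b> in a post body.
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._field is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # The field's own element has closed: tidy the text and stop reading.
            post = self.posts[-1]
            post[self._field] = post[self._field].strip()
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.posts[-1][self._field] += data


def extract_posts(html):
    """Return a list of {"author", "timestamp", "body"} dicts from one page."""
    parser = ForumPostParser()
    parser.feed(html)
    parser.close()
    return parser.posts


for post in extract_posts(SAMPLE_PAGE):
    print(post["timestamp"], post["author"], post["body"], sep=" | ")
```

Hand-written rules like these have to be redone for every forum layout, which
is the effort a tool that learns from a few training instances is meant to
save.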
Tim
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora