[Corpora-List] tool for extracting text from web forum and websites
Alexandre Rafalovitch
arafalov at gmail.com
Wed Oct 14 20:30:12 UTC 2009
Dear Isabella,
You may need to give a bit more information.
Do you already have the files as HTML files on a local file system, or
do you need to crawl the site? Are the pages HTML (well-formed or not)?
When you say 'all the text', do you mean plain text without tags,
descriptions, alt attributes, etc., or something more complex?
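
To make the last question concrete, here is a rough sketch of the two
readings of 'all the text', assuming the pages are already saved locally
and using only Python's standard html.parser module. The names
TextExtractor, extract_text and include_attrs are made up for this
illustration; this is not a recommendation of a specific tool, and a real
forum may need more handling (encodings, navigation menus, quoted posts).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text between tags; optionally also alt/title attribute values."""

    # Content of these elements is code or styling, not page text.
    SKIP = {"script", "style"}

    def __init__(self, include_attrs=False):
        super().__init__(convert_charrefs=True)
        self.include_attrs = include_attrs  # hypothetical switch for this sketch
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        if self.include_attrs:
            # Keep descriptive attributes such as <img alt="..."> if requested.
            for name, value in attrs:
                if name in ("alt", "title") and value:
                    self.parts.append(value.strip())

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only runs and anything inside script/style.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html, include_attrs=False):
    """Return the text of an HTML string, one text fragment per line."""
    parser = TextExtractor(include_attrs)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = (
    '<html><body><p>Hello <b>world</b></p>'
    '<img src="a.png" alt="A cat"><script>var x = 1;</script></body></html>'
)
plain = extract_text(sample)                          # text between tags only
with_alt = extract_text(sample, include_attrs=True)   # also alt/title values
```

The parser is tolerant of pages that are not well-formed, which is one
reason the well-formedness question matters less for this approach than
for an XML-based one.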
Regards,
Alex.
Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/
- I think age is a very high price to pay for maturity (Tom Stoppard)
On Wed, Oct 14, 2009 at 5:22 AM, Isabella Chiari
<isabella.chiari at uniroma1.it> wrote:
> Dear Linguists,
>
> I need a tool for extracting all the text from pages and subpages of a Web
> Forum. I do not need a cleaning tool at the moment.
>
> Can you suggest a tool to perform this operation?
>
> Thanks,
>
> Isabella Chiari
>
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>