[Corpora-List] tool for extracting text from web forum and websites

Alexandre Rafalovitch arafalov at gmail.com
Wed Oct 14 20:30:12 UTC 2009


Dear Isabella,

You may need to give a bit more information.

Do you already have the pages as HTML files on a local file system, or
do you need to crawl them? Are the pages HTML, and if so, is it
well-formed or not? When you say 'all the text', do you mean plain text
without tags, descriptions, alt text, etc., or something more complex?
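
To make the last question concrete, here is a rough Python sketch of
the difference between "plain text without tags" and text that also
keeps image alt attributes. It assumes the pages are already available
as HTML strings and uses only the standard library's html.parser; the
names TextExtractor, extract_text and keep_alt are made up for this
illustration and are not from any particular tool.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text between tags; optionally keep <img> alt text too."""

    # Content of these elements is never shown to a reader, so skip it.
    SKIP = {"script", "style"}

    def __init__(self, keep_alt=False):
        super().__init__(convert_charrefs=True)
        self.keep_alt = keep_alt  # hypothetical switch for this sketch
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "img" and self.keep_alt:
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside script/style.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html, keep_alt=False):
    """Return the text of an HTML string, joined with single spaces."""
    parser = TextExtractor(keep_alt=keep_alt)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = (
    "<html><head><style>p{}</style></head><body>"
    "<p>Hello <b>forum</b></p>"
    '<img src="a.png" alt="a cat">'
    "<script>var x = 1;</script>"
    "</body></html>"
)
plain = extract_text(sample)
with_alt = extract_text(sample, keep_alt=True)
```

html.parser tolerates markup that is not well-formed, which is one
reason the well-formedness question matters: a strict XML parser would
probably reject many forum pages.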

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/
- I think age is a very high price to pay for maturity (Tom Stoppard)


On Wed, Oct 14, 2009 at 5:22 AM, Isabella Chiari
<isabella.chiari at uniroma1.it> wrote:
> Dear Linguists,
>
> I need a tool for extracting all the text from pages and subpages of a Web
> Forum. I do not need a cleaning tool at the moment.
>
> Can you suggest a tool to perform this operation?
>
> Thanks,
>
> Isabella Chiari
>
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>



