[Corpora-List] tool for extracting text from web forum and websites

Trevor Jenkins trevor.jenkins at suneidesis.com
Fri Oct 16 17:44:21 UTC 2009


On Fri, 16 Oct 2009, Stefan Th. Gries <stgries at gmail.com> wrote:

> 2009/10/16 Bjørn Arild Mæland <bjorn.maeland at gmail.com>:
>
> > This regexp is a good start, but it's important to note that it isn't
> > enough for cleaning documents that use inline JavaScript and/or CSS.
>
> Of course not, but if I remember correctly, the query said something
> like 'cleaning is not an issue right now' so I focused on the download
> part ;-) There are of course *many* ugly issues to be dealt with.

My first idea would have been to kick off a subprocess and run lynx (the
text-only browser from ISC, the Internet Software Consortium), then have it
print the visible text to a temporary file that R can pick up when the
subprocess terminates. This blithely ignores JavaScript, but still doesn't
deal with any reorganisation that CSS might impose on the text. Again, quick
and dirty, and it means one doesn't have to filter out those two things.
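
A minimal sketch of that idea, written in Python rather than R purely for
illustration. It assumes lynx is installed and on the PATH; the function
names lynx_command and dump_visible_text are made up for this example, and
the choice of -nolist (which drops the trailing list of link targets) is
an assumption about what one would want in a corpus.

```python
import os
import subprocess
import sys
import tempfile


def lynx_command(url):
    """Build the argument list for a non-interactive lynx text dump."""
    # -dump   : write the rendered page text to stdout and exit
    # -nolist : omit the numbered list of link targets at the end of the dump
    return ["lynx", "-dump", "-nolist", url]


def dump_visible_text(url, timeout=60):
    """Run lynx as a subprocess, sending its output to a temporary file.

    Returns the path of that file, which the caller can read once the
    subprocess has terminated (and should delete afterwards).
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as out:
        # check=True raises CalledProcessError if lynx exits with an error
        subprocess.run(lynx_command(url), stdout=out, check=True, timeout=timeout)
    return path


if __name__ == "__main__":
    # Usage: python dump.py http://example.com/
    print(dump_visible_text(sys.argv[1]))
```

The dump contains only what lynx renders, so inline JavaScript and CSS
never reach the file; any reordering of the text that CSS would cause in
a graphical browser is not reproduced.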

Regards, Trevor

<>< Re: deemed!


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


