[Corpora-List] Extracting text from Wikipedia articles
Angus B. Grieve-Smith
grvsmth at panix.com
Sat Aug 28 03:36:05 UTC 2010
Roman Klinger wrote:
> And why are you answering that to my answer? I am not the original
> poster.
Because yours was the latest post in the thread at the time. My message
was not written in response to your email in particular.
> I do not agree. Extracting the main text from HTML is a hard task, as
> cleaning it from navigation bars, ads etc. is not trivial (see the
> CLEANEVAL competition).
Writing a script that will extract the main text from any HTML
document is a hard task. Writing a script that will extract the main
text from a specific HTML document (or one of a group that follow
certain well-defined conventions) is much easier.
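
As a rough illustration of the second, easier case, here is a minimal sketch in Python (standard library only). It assumes that every page in the group wraps its main text in a single <div> with a known id. The id "content" and the names MainTextExtractor and extract_main_text are hypothetical, chosen for the example, and are not taken from any particular site or tool.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect the text found inside the first <div> whose id is target_id."""

    SKIP_TAGS = {"script", "style"}  # their contents are never prose

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id  # assumed convention: one div holds the main text
        self.depth = 0              # <div> nesting depth inside the target; 0 = outside
        self.skip_depth = 0         # > 0 while inside <script> or <style>
        self.done = False           # True once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            # Still outside: only the target div itself is of interest.
            if tag == "div" and dict(attrs).get("id") == self.target_id:
                self.depth = 1
            return
        if tag == "div":
            self.depth += 1
        elif tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.done or self.depth == 0:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True
        elif tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and self.skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_main_text(html, target_id="content"):
    """Return the text inside the div with the given id, joined by single spaces."""
    parser = MainTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Navigation bars, ads and footers that sit outside that one div are never collected, which is what makes the site-specific case so much simpler than the general one. The sketch breaks as soon as a page departs from the assumed convention.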
> I agree :-). But in this concrete case, I would have answered: Just
> use this and that script, it's already in the world.
I've seen so many situations where it's a lot more work to find
off-the-shelf solutions and adapt them to an individual task than it
would be to just write a new script from scratch - if you already know
how to write Perl scripts. I can't say if this particular query is one
of them, but it's not out of the question.
--
-Angus B. Grieve-Smith
grvsmth at panix.com
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora