[Corpora-List] Extracting text from Wikipedia articles

Angus B. Grieve-Smith grvsmth at panix.com
Sat Aug 28 03:36:05 UTC 2010


Roman Klinger wrote:
> And why are you answering that to my answer? I am not the original 
> poster.
    Because yours was the most recent post at the time.  My message was 
not written in response to your email in particular.
> I do not agree. Extracting the main text from HTML is a hard task, as 
> cleaning it from navigation bars, ads etc. is not trivial (see the 
> CLEANEVAL competition).
    Writing a script that will extract the main text from any HTML 
document is a hard task.  Writing a script that will extract the main 
text from a specific HTML document (or one of a group that follow 
certain well-defined conventions) is much easier.
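
    To make the distinction concrete, here is a minimal sketch of the 
"specific document" case, in Python rather than Perl.  It assumes that 
every page in the group wraps its main text in a single element with a 
known id (the id "content" is a hypothetical placeholder, not a claim 
about any particular site) and that the markup closes its tags 
explicitly.  Only the standard library's html.parser module is used; 
the class and function names are made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MainTextExtractor(HTMLParser):
    """Collect the text of <p> elements inside one known container element."""

    def __init__(self, container_id="content"):  # hypothetical default id
        super().__init__(convert_charrefs=True)
        self.container_id = container_id
        self.depth = 0            # 0 means "outside the container"
        self.in_paragraph = False
        self.paragraphs = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            # Already inside the container: track nesting, watch for <p>.
            self.depth += 1
            if tag == "p":
                self.in_paragraph = True
                self._buf = []
        elif dict(attrs).get("id") == self.container_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        if tag == "p" and self.in_paragraph:
            # Normalise whitespace and keep non-empty paragraphs only.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False
        self.depth -= 1

    def handle_data(self, data):
        if self.in_paragraph:
            self._buf.append(data)


def extract_main_text(html, container_id="content"):
    """Return the container's paragraphs, separated by blank lines."""
    parser = MainTextExtractor(container_id)
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)
```

    Navigation bars, ads and footers that sit outside the container are 
skipped without any heuristics, which is exactly why the narrow case is 
easy.  The depth counter relies on the explicit-closing-tag assumption, 
so this would not carry over to arbitrary pages of the kind CLEANEVAL 
targets.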


> I agree :-). But in this concrete case, I would have answered: Just 
> use this and that script, it's already in the world.
    I've seen so many situations where it's a lot more work to find 
off-the-shelf solutions and adapt them to an individual task than it 
would be to just write a new script from scratch - if you already know 
how to write Perl scripts.  I can't say whether this particular query is 
one of them, but it's not out of the question.

-- 
				-Angus B. Grieve-Smith
				grvsmth at panix.com


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


