[Corpora-List] Extracting text from Wikipedia articles

Fri Aug 27 18:55:13 UTC 2010

Hi Angus,

On 08/27/2010 08:35 PM, Angus B. Grieve-Smith wrote:
>      I wish I had a quick solution to give you, but at this point I can
> only say that most corpus linguists would get a TON of benefit out of a
> two-week course in Perl.  If you're a self-starter you can work with
> Beginning Perl, available for free here:
>
> http://www.perl.org/books/beginning-perl/

Why are you sending this as a reply to my message? I am not the original poster.

>      After that, I'd suggest buying Programming Perl and the Perl Cookbook.
>
>      There have been a number of queries to the list recently about how
> to extract certain things (content, tokens, etc.) from HTML files.  If
> the people sending the queries knew Perl, they could probably have
> written a script and gotten their data in less time than it took to send
> the emails.

I do not agree. Extracting the main text from HTML is a hard task, because
stripping navigation bars, ads, etc. is not trivial (see the
CLEANEVAL competition).
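
To illustrate the point, here is a minimal sketch in Python (standard
library only). It is not a solution, and the sample page and the names
NaiveTextExtractor and naive_extract are made up for this example. A naive
tag stripper keeps the navigation and ad text along with the article text:

```python
from html.parser import HTMLParser


class NaiveTextExtractor(HTMLParser):
    """Collects every text node, skipping only <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def naive_extract(html):
    """Return all visible text chunks of the page, in document order."""
    parser = NaiveTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page, invented for illustration only.
SAMPLE_PAGE = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<div id="content"><p>Corpus linguistics studies language through text samples.</p></div>
<div class="ad">Buy cheap flights now!</div>
</body></html>
"""

if __name__ == "__main__":
    # Prints the navigation labels and the ad next to the real content.
    for chunk in naive_extract(SAMPLE_PAGE):
        print(chunk)
```

Nothing in the markup itself says which chunk is the main text. Here the
id and class names happen to give a hint, but real pages follow no common
convention, which is exactly what makes the task hard.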

>      I am NOT criticizing people for sending queries like this.  Everyone
> has their constraints and priorities.  But if you can find the time to
> learn Perl, it will help you to create these scripts yourself, and help
> others.

I do not agree with you. Googling for "extract plain text wikipedia"
gives tons of results, but it can be hard to find a working solution
among them, and it can be hard to foresee all the pitfalls of such a task.

That is why it makes sense to have a look at existing solutions.

>      I also welcome queries to the list along the lines of "I'm trying to
> write a Perl script to match all modals in English, but it's giving me
> this weird error.  What am I doing wrong?"  I'd be happy to try to
> answer questions like that. [...]

I agree :-). But in this concrete case, I would have answered: just use
such-and-such a script, it already exists.

Best,
  Roman

-- 
Roman Klinger
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Department of Bioinformatics
Schloss Birlinghoven
D-53754 Sankt Augustin
Tel.: +49-2241-14-2360
Fax.: +49-2241-14-4-2360
email: roman.klinger at scai.fhg.de
http://www.scai.fraunhofer.de/klinger.html

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora