[Corpora-List] Extracting text from Wikipedia articles
Roman Klinger
roman.klinger at scai.fraunhofer.de
Fri Aug 27 18:55:13 UTC 2010
Hi Angus,
On 08/27/2010 08:35 PM, Angus B. Grieve-Smith wrote:
> I wish I had a quick solution to give you, but at this point I can
> only say that most corpus linguists would get a TON of benefit out of a
> two-week course in Perl. If you're a self-starter you can work with
> Beginning Perl, available for free here:
>
> http://www.perl.org/books/beginning-perl/
Why are you sending this in reply to my answer? I am not the original
poster.
> After that, I'd suggest buying Programming Perl and the Perl Cookbook.
>
> There have been a number of queries to the list recently about how
> to extract certain things (content, tokens, etc.) from HTML files. If
> the people sending the queries knew Perl, they could probably have
> written a script and gotten their data in less time than it took to send
> the emails.
I do not agree. Extracting the main text from HTML is a hard task,
because stripping out navigation bars, ads, etc. is not trivial (see the
CLEANEVAL competition).
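To illustrate the point about navigation bars and ads, here is a minimal
Python sketch. It is not taken from any existing tool; the sample page
and all names (NaiveTextExtractor, naive_extract, SAMPLE_PAGE) are made
up for illustration. It strips the tags the obvious way, using only the
standard library, and shows that the boilerplate text survives alongside
the real content.

```python
from html.parser import HTMLParser


class NaiveTextExtractor(HTMLParser):
    """Collects every text node, skipping only <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def naive_extract(html):
    """Return all visible text fragments, with no attempt at cleaning."""
    parser = NaiveTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page: a navigation list, an ad, and one real paragraph.
SAMPLE_PAGE = """
<html><body>
<ul class="nav"><li>Main page</li><li>Random article</li></ul>
<div class="ad">Buy now!</div>
<p>Corpus linguistics studies language through text samples.</p>
</body></html>
"""

chunks = naive_extract(SAMPLE_PAGE)
for chunk in chunks:
    print(chunk)
```

The navigation entries and the ad come out mixed in with the one
sentence of real content, and nothing in the tag stripping itself says
which is which. Deciding that is the actual task.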
> I am NOT criticizing people for sending queries like this. Everyone
> has their constraints and priorities. But if you can find the time to
> learn Perl, it will help you to create these scripts yourself, and help
> others.
I do not agree with you. Googling for "extract plain text wikipedia"
gives tons of results, but it can be hard to find one that actually
works, and it can be hard to foresee all the pitfalls of such a task.
That is why it makes sense to look at existing solutions.
> I also welcome queries to the list along the lines of "I'm trying to
> write a Perl script to match all modals in English, but it's giving me
> this weird error. What am I doing wrong?" I'd be happy to try to
> answer questions like that. [...]
I agree :-). But in this particular case, I would have answered: just
use such-and-such a script; it already exists.
Best,
Roman
--
Roman Klinger
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Department of Bioinformatics
Schloss Birlinghoven
D-53754 Sankt Augustin
Tel.: +49-2241-14-2360
Fax.: +49-2241-14-4-2360
email: roman.klinger at scai.fhg.de
http://www.scai.fraunhofer.de/klinger.html
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora