[Corpora-List] Extracting text from Wikipedia articles

Angus B. Grieve-Smith grvsmth at panix.com
Fri Aug 27 18:35:53 UTC 2010


    I wish I had a quick solution to give you, but at this point I can 
only say that most corpus linguists would get a TON of benefit out of a 
two-week course in Perl.  If you're a self-starter you can work with 
Beginning Perl, available for free here:

http://www.perl.org/books/beginning-perl/

    After that, I'd suggest buying Programming Perl and the Perl Cookbook.

    There have been a number of queries to the list recently about how 
to extract certain things (content, tokens, etc.) from HTML files.  If 
the people sending the queries knew Perl, they could probably have 
written a script and gotten their data in less time than it took to send 
the emails.

    I am NOT criticizing people for sending queries like this.  Everyone 
has their constraints and priorities.  But if you can find the time to 
learn Perl, you will be able to write these scripts yourself, and to 
help others.

    I also welcome queries to the list along the lines of "I'm trying to 
write a Perl script to match all modals in English, but it's giving me 
this weird error.  What am I doing wrong?"  I'd be happy to try to 
answer questions like that.  Or is there another list for that?

-- 
				-Angus B. Grieve-Smith
				Saint John's University
				grvsmth at panix.com

