[Corpora-List] Extracting text from Wikipedia articles
Angus B. Grieve-Smith
grvsmth at panix.com
Fri Aug 27 18:35:53 UTC 2010
I wish I had a quick solution to give you, but at this point I can
only say that most corpus linguists would get a TON of benefit out of a
two-week course in Perl. If you're a self-starter, you can work through
Beginning Perl, available for free here:
http://www.perl.org/books/beginning-perl/
After that, I'd suggest buying Programming Perl and the Perl Cookbook.
There have been a number of queries to the list recently about how
to extract certain things (content, tokens, etc.) from HTML files. If
the people sending the queries knew Perl, they could probably have
written a script and gotten their data in less time than it took to send
the emails.
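To give a sense of scale, here is a minimal sketch of such a script. It assumes the HTML is already in a string and that crude regex-based tag stripping is good enough for the job; the function name html_to_tokens and the sample text are made up for illustration. Real-world pages are messier, and a proper HTML parser module would be the safer choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Crude illustration only: strip markup with regexes, then return
# lowercased word tokens. Not a substitute for a real HTML parser.
sub html_to_tokens {
    my ($html) = @_;
    $html =~ s/<!--.*?-->//gs;                      # drop comments
    $html =~ s/<(script|style)\b.*?<\/\1\s*>//gis;  # drop script/style blocks
    $html =~ s/<[^>]+>/ /g;                         # replace remaining tags with a space
    $html =~ s/&nbsp;/ /g;                          # decode a couple of common entities
    $html =~ s/&amp;/&/g;
    # Alphabetic tokens, allowing one internal apostrophe ("don't").
    return map { lc($_) } $html =~ /[A-Za-z]+(?:'[A-Za-z]+)?/g;
}

# Hypothetical sample input.
my $sample = '<html><body><h1>Corpora</h1>'
           . '<p>The cat sat; the cat slept.</p></body></html>';

# Build a frequency list, most frequent first, ties broken alphabetically.
my %freq;
$freq{$_}++ for html_to_tokens($sample);
for my $word (sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq) {
    printf "%-8s %d\n", $word, $freq{$word};
}
```

To run it over a saved page instead, read the file into a string first and pass that to html_to_tokens.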
I am NOT criticizing people for sending queries like this. Everyone
has their constraints and priorities. But if you can find the time to
learn Perl, you'll be able to write these scripts yourself, and to help
others with theirs.
I also welcome queries to the list along the lines of "I'm trying to
write a Perl script to match all modals in English, but it's giving me
this weird error. What am I doing wrong?" I'd be happy to try to
answer questions like that. Or is there another list for that?
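As a rough illustration of the kind of script meant here, the sketch below matches English modals with a regular expression. The modal list, the function name find_modals, and the test sentence are assumptions made for the example. It matches word forms only, so it cannot tell the modal "will" from the noun "will", and it does not handle irregular forms such as "won't" or "cannot".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Core English modals; this list is an assumption and easy to extend.
my @modals = qw(can could may might must shall should will would);
my $alt    = join '|', @modals;

# \b keeps "can" from matching inside "candle"; the optional n't
# picks up contracted negatives such as "couldn't".
my $modal_re = qr/\b(?:$alt)(?:n't)?\b/i;

# Return every modal found in the text, lowercased, in order of appearance.
sub find_modals {
    my ($text) = @_;
    return map { lc($_) } $text =~ /($modal_re)/g;
}

# Hypothetical test sentence.
print join(", ", find_modals("You should ask; she couldn't, but we Might.")), "\n";
```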
--
-Angus B. Grieve-Smith
Saint John's University
grvsmth at panix.com
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora