[Corpora-List] Extracting text from Wikipedia articles

Nitin Madnani nmadnani at gmail.com
Fri Aug 27 18:43:42 UTC 2010


I recently did this. I downloaded the Freebase Wikipedia extraction (google it) and used BeautifulSoup to extract just the text part. It was a couple of days' work at most.
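The core of that extraction step is only a few lines. Here is a minimal sketch using the modern bs4 package (the HTML snippet is invented for illustration; the real Freebase dump is much larger and would be read file by file):

```python
from bs4 import BeautifulSoup

# A stand-in for one article's HTML from the dump.
html = """<html><body>
<h1>Anarchism</h1>
<p>Anarchism is a political philosophy.</p>
<script>var x = 1;</script>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Drop script/style elements so their contents don't leak into the text.
for tag in soup(["script", "style"]):
    tag.decompose()

# get_text() flattens the remaining markup into plain text.
text = soup.get_text(separator="\n", strip=True)
print(text)
```

In practice you would also want to strip infoboxes, navigation templates, and reference markup, which is where most of the couple of days goes.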

- Nitin

On Aug 27, 2010, at 2:35 PM, "Angus B. Grieve-Smith" <grvsmth at panix.com> wrote:

>   I wish I had a quick solution to give you, but at this point I can only say that most corpus linguists would get a TON of benefit out of a two-week course in Perl.  If you're a self-starter you can work with Beginning Perl, available for free here:
> 
> http://www.perl.org/books/beginning-perl/
> 
>   After that, I'd suggest buying Programming Perl and the Perl Cookbook.
> 
>   There have been a number of queries to the list recently about how to extract certain things (content, tokens, etc.) from HTML files.  If the people sending the queries knew Perl, they could probably have written a script and gotten their data in less time than it took to send the emails.
> 
>   I am NOT criticizing people for sending queries like this.  Everyone has their constraints and priorities.  But if you can find the time to learn Perl, it will help you to create these scripts yourself, and help others.
> 
>   I also welcome queries to the list along the lines of "I'm trying to write a Perl script to match all modals in English, but it's giving me this weird error.  What am I doing wrong?"  I'd be happy to try to answer questions like that.  Or is there another list for that?
> 
> -- 
> 				-Angus B. Grieve-Smith
> 				Saint John's University
> 				grvsmth at panix.com
> 
> 
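As an aside, the modal-matching script Angus mentions really is only a few lines. A sketch in Python rather than Perl (the modal list and the example sentence are my own; a fuller list might also cover semi-modals like "ought to"):

```python
import re

# The nine core English modal verbs, matched as whole words.
MODAL_RE = re.compile(
    r"\b(can|could|may|might|must|shall|should|will|would)\b",
    re.IGNORECASE,
)

sentence = "She said she would come if she could, but she may not."
print(MODAL_RE.findall(sentence))  # ['would', 'could', 'may']
```

The equivalent Perl one-liner uses the same regular-expression syntax, which is part of why Perl caught on for this kind of corpus work.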

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


