[Corpora-List] Corpora for EAP: Architecture...?
    Marco Baroni 
    baroni at sslmit.unibo.it
       
    Mon Jan 16 13:18:40 UTC 2006
    
    
  
Hi Eric.
For smallish specialized corpora, I suppose the following Python-based 
solution would work, and it probably would not take more than one day to 
implement...
- Write a script that generates random combinations of potentially relevant 
terms from a list
- Use a Python module to retrieve web pages from Google via the API, e.g. 
http://pygoogle.sourceforge.net/, using each of the random combinations as 
a query string
- Use the Python BTE module (http://www.smi.ucd.ie/hyppia/) to clean the 
pages you retrieve (it's slower than our Perl implementation, but for small 
corpora that should not be a problem).
- Use NLTK or other Python/Java tools to process the corpus constructed 
in this way
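
The first step could look roughly like the sketch below. It uses only the 
standard library; the function name random_queries, its parameters and the 
architecture seed terms are made up for illustration, and the retrieval and 
cleaning steps are left out because they depend on whichever search API and 
BTE module you end up using.

```python
import math
import random


def random_queries(terms, n_queries=10, terms_per_query=3, seed=None):
    """Return n_queries distinct query strings.

    Each query joins terms_per_query seed terms drawn at random without
    repetition. All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    unique_terms = sorted(set(terms))

    if terms_per_query > len(unique_terms):
        raise ValueError("terms_per_query exceeds the number of distinct terms")
    if n_queries > math.comb(len(unique_terms), terms_per_query):
        raise ValueError("not enough distinct combinations for n_queries")

    seen = set()
    queries = []
    while len(queries) < n_queries:
        # Sort the sampled terms so the same set in a different order
        # is recognised as a duplicate combination.
        combo = tuple(sorted(rng.sample(unique_terms, terms_per_query)))
        if combo not in seen:
            seen.add(combo)
            queries.append(" ".join(combo))
    return queries


if __name__ == "__main__":
    # Hypothetical seed terms for an architecture corpus.
    seed_terms = ["facade", "cantilever", "masonry", "vault",
                  "buttress", "atrium", "cladding", "truss"]
    for query in random_queries(seed_terms, n_queries=5, seed=42):
        print(query)
```

Each string returned would then be sent as one query, and the pages that 
come back passed on to the cleaning step.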
Regards,
Marco
    
    