[Corpora-List] Speeding up the constitution of corpora from LexisNexis

Mike Scott mike at lexically.net
Sat Jul 30 08:00:20 UTC 2011


Mahe, hi

We have been working on building corpora from this source at Aston 
University for research into the language of climate change. There are 
lots of problems with the newspaper database but most of these can be 
resolved fairly well:
* duplicated articles (often exact duplicates with different dates or 
publications, but also slightly varied duplicates)
* imprecise/varied headers depending on the news-source
* many sources extremely well represented (e.g. US newspapers) but other 
coverage patchy (e.g. Brazilian)
* download restrictions (but these are generous so you can get lots of 
texts in one file)
* these large files need splitting up, not difficult to automate
Then you need to decide which publications or authors you do/don't wish 
to include in your corpus.
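
To give an idea of how the splitting step can be automated, here is a 
minimal Python sketch. It is not the software mentioned in this message. 
It assumes that each article in the download is preceded by a separator 
line of the form "12 of 500 DOCUMENTS"; that pattern and the name 
split_articles are assumptions for illustration, so check them against 
your own download before relying on it.

```python
import re

# Assumed separator line, e.g. "12 of 500 DOCUMENTS" (adjust to your download).
SEPARATOR = re.compile(r"^[ \t]*\d+ of \d+ DOCUMENTS[ \t]*$", re.MULTILINE)

def split_articles(download_text):
    """Split one multi-article download into a list of article strings."""
    chunks = SEPARATOR.split(download_text)
    # Drop empty chunks, such as any blank text before the first separator.
    return [chunk.strip() for chunk in chunks if chunk.strip()]

sample = """
1 of 2 DOCUMENTS

The Guardian
First article body.

2 of 2 DOCUMENTS

The Times
Second article body.
"""

articles = split_articles(sample)
for number, article in enumerate(articles, start=1):
    print(number, article.splitlines()[0])
```

Each returned string still contains its own header lines, which is where 
the publication and date would have to be picked out afterwards.
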
I am considering making the software I have prepared for this purpose 
available to the wider community; it would first need a better help 
system. It attempts to parse the multi-text download into 
separate articles, filters out duplicates, and then lets the user filter 
the set by publications & authors, exporting cleaned-up texts to 
single-article or monthly-based text files.
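
For anyone wanting to try duplicate filtering themselves in the meantime, 
the following Python sketch shows one possible approach. It is not how 
the software described above works. It assumes the articles are already 
held as a list of strings, and the function names and the 0.9 similarity 
threshold are arbitrary choices for illustration. Exact and slightly 
varied duplicates are both caught by comparing normalised texts with 
difflib.SequenceMatcher from the standard library.

```python
from difflib import SequenceMatcher

def normalise(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())

def drop_duplicates(articles, threshold=0.9):
    """Keep the first copy of each article, skipping later near-copies.

    The threshold is a hypothetical starting value and needs tuning.
    """
    kept = []
    kept_normalised = []
    for article in articles:
        candidate = normalise(article)
        is_duplicate = any(
            SequenceMatcher(None, candidate, seen).ratio() >= threshold
            for seen in kept_normalised
        )
        if not is_duplicate:
            kept.append(article)
            kept_normalised.append(candidate)
    return kept

texts = [
    "Climate talks open in Durban today.",
    "Climate  talks open in Durban today.",
    "Climate talks open in Durban today!",
    "Football results from the weekend.",
]
unique = drop_duplicates(texts)
print(len(unique))
```

Every article is compared with every article kept so far, so this gets 
slow on large downloads; comparing only articles with similar headlines 
or lengths would be one way to cut the work down.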

Cheers -- Mike

On 28/07/2011 14:55, Mahé BEN HAMED wrote:
> Dear all,
>
> Is there a way to speed up the building of corpora from the Lexis 
> Nexis newspaper database (given a set of search parameters) ? To which 
> extent can the whole process be automated?
>
> Thanks,
>
> Mahe BEN HAMED
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

-- 
Mike Scott

***
If you publish research which uses WordSmith, do let me know so I can include it at
http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm
***
University of Aston and Lexical Analysis Software Ltd.
mike.scott at aston.ac.uk
www.lexically.net
