<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Mahe, hi<br>
<br>
We have been working on building corpora from this source at Aston
University for research into the language of climate change. There
are lots of problems with the newspaper database but most of these
can be resolved fairly well:<br>
* duplicated articles (often exact duplicates under different dates or
publications, but also slightly varied duplicates)<br>
* imprecise/varied headers depending on the news-source<br>
* many sources extremely well represented (e.g. US newspapers) but
other coverage patchy (e.g. Brazilian)<br>
* download restrictions (but these are generous so you can get lots
of texts in one file)<br>
* these large files need splitting up, which is not difficult to automate<br>
Then you need to decide which publications or authors you do/don't
wish to include in your corpus.<br>
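<br>
The splitting and duplicate-removal steps could be sketched roughly as
follows in Python. This is only an illustration: the delimiter pattern
"N of M DOCUMENTS" is an assumption about the download format and should
be checked against a real file, and the function names and the 0.9
similarity threshold are made up for the example.<br>
<pre>
```python
import re
from difflib import SequenceMatcher

# Assumed delimiter line, e.g. "3 of 250 DOCUMENTS"; adjust to the real export.
DELIMITER = re.compile(r"^\s*\d+ of \d+ DOCUMENTS\s*$", re.MULTILINE)


def split_articles(download_text):
    """Split one multi-article download into a list of article strings."""
    parts = DELIMITER.split(download_text)
    return [part.strip() for part in parts if part.strip()]


def normalise(text):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def drop_duplicates(articles, threshold=0.9):
    """Keep the first copy of each article; drop exact and near duplicates.

    Every article is compared with every article already kept.
    """
    kept = []
    kept_norm = []
    for article in articles:
        norm = normalise(article)
        is_dup = any(
            norm == seen or SequenceMatcher(None, norm, seen).ratio() >= threshold
            for seen in kept_norm
        )
        if not is_dup:
            kept.append(article)
            kept_norm.append(norm)
    return kept
```
</pre>
The pairwise comparison is quadratic in the number of articles, so it
only suits modest batches; a large download would need a hashing or
shingling approach instead.<br>
<br>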
I am considering making the software I have prepared for this
purpose available to the wider community; it would first need some
enhancement, in particular a help system. It attempts to parse the
multi-text download into separate articles, filters out duplicates,
and then lets the user filter the set by publication and author,
exporting cleaned-up texts to single-article or monthly text
files. <br>
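<br>
The filtering and monthly export described above could be sketched as
follows in Python. This is an illustration under stated assumptions, not
the actual software: the Article record and the function names are
hypothetical, and it presumes the publication, author and date have
already been extracted from the headers, which the varied header formats
make the hard part.<br>
<pre>
```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from pathlib import Path


@dataclass
class Article:
    """Hypothetical record for one parsed article."""
    publication: str
    author: str
    published: date
    body: str


def select(articles, publications=None, excluded_authors=()):
    """Keep articles from the wanted publications, minus excluded authors."""
    excluded = {name.lower() for name in excluded_authors}
    wanted = None if publications is None else {p.lower() for p in publications}
    return [
        a for a in articles
        if (wanted is None or a.publication.lower() in wanted)
        and a.author.lower() not in excluded
    ]


def group_by_month(articles):
    """Map 'YYYY-MM' to the list of article bodies published in that month."""
    months = defaultdict(list)
    for a in articles:
        months[a.published.strftime("%Y-%m")].append(a.body)
    return dict(months)


def export_monthly(articles, out_dir):
    """Write one UTF-8 text file per month, e.g. out_dir/2011-07.txt."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for month, bodies in sorted(group_by_month(articles).items()):
        path = out / f"{month}.txt"
        path.write_text("\n\n".join(bodies) + "\n", encoding="utf-8")
        written.append(path)
    return written
```
</pre>
Single-article export would follow the same pattern, writing one file per
record instead of one per month.<br>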
<br>
Cheers -- Mike <br>
<br>
On 28/07/2011 14:55, Mahé BEN HAMED wrote:
<blockquote
cite="mid:CANm4eUXYu8Um9vwvOpKuLDoArJb7Wmae4=GdaL7eV9xEcKQ83Q@mail.gmail.com"
type="cite">Dear all,
<div><br>
<div>Is there a way to speed up the building of corpora from the
Lexis Nexis newspaper database (given a set of search
parameters)? To what extent can the whole process be
automated?</div>
<div><br>
</div>
<div>Thanks,</div>
<div><br>
</div>
<div>Mahe BEN HAMED</div>
</div>
<br>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Mike Scott
***
If you publish research which uses WordSmith, do let me know so I can include it at
<a class="moz-txt-link-freetext" href="http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm">http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm</a>
***
University of Aston and Lexical Analysis Software Ltd.
<a class="moz-txt-link-abbreviated" href="mailto:mike.scott@aston.ac.uk">mike.scott@aston.ac.uk</a>
<a class="moz-txt-link-abbreviated" href="http://www.lexically.net">www.lexically.net</a>
</pre>
</body>
</html>