<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    Mahe, hi<br>

    <br>

    We have been working on building corpora from this source at Aston

    University for research into the language of climate change. There

    are lots of problems with the newspaper database but most of these

    can be resolved fairly well:<br>

    * duplicated articles (often exact duplicates with different dates or

    publications, but also slightly varied duplicates)<br>

    * imprecise/varied headers depending on the news-source<br>

    * many sources extremely well represented (e.g. US newspapers) but

    other coverage patchy (e.g. Brazilian)<br>

    * download restrictions (but these are generous so you can get lots

    of texts in one file)<br>

    * these large files need splitting up, which is not difficult to automate<br>

    Then you need to decide which publications or authors you do/don't

    wish to include in your corpus.<br>
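    To illustrate the duplicate problem listed above, here is a minimal
    Python sketch. It is an illustration only, not the software mentioned
    in this message: it assumes each article is already a plain string,
    treats texts that match after lowercasing and whitespace clean-up as
    exact duplicates, and treats texts whose five-word shingles overlap
    heavily as slightly varied duplicates. The function names and the 0.8
    threshold are assumptions chosen for the example.<br>
<pre>
```python
import hashlib
import re


def normalise(text):
    """Lowercase and collapse whitespace so trivial differences vanish."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text, size=5):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = normalise(text).split()
    count = max(len(words) - size + 1, 1)
    return {" ".join(words[i:i + size]) for i in range(count)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 1.0 means identical."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 1.0


def drop_duplicates(articles, threshold=0.8):
    """Keep the first copy of each article; skip exact and near duplicates."""
    kept, seen_hashes, kept_shingles = [], set(), []
    for text in articles:
        digest = hashlib.sha1(normalise(text).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalisation
        candidate = shingles(text)
        if any(jaccard(candidate, other) >= threshold for other in kept_shingles):
            continue  # slightly varied duplicate (assumed threshold)
        seen_hashes.add(digest)
        kept.append(text)
        kept_shingles.append(candidate)
    return kept
```
</pre>
    Comparing every new text with every kept text is slow for very large
    downloads, so a real tool would probably need something smarter, but
    the idea is the same.<br>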

    I am considering making the software I have prepared for this

    purpose available to the wider community; it would first need some

    work on a help system. It attempts to parse the multi-text download

    into separate articles, filters out duplicates, and then lets the

    user filter the set by publication and author, exporting cleaned-up

    texts to single-article or monthly text files.<br>
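    The splitting, filtering and monthly grouping steps can be sketched
    roughly as follows in Python. This is not the software described
    above, only a minimal illustration under stated assumptions: articles
    are assumed to be separated by a line such as "3 of 250 DOCUMENTS",
    the first line of each article is assumed to be the publication name,
    and dates are assumed to look like "July 28, 2011". Real headers vary
    by news source, so the patterns (and all names here, which are
    hypothetical) would need adjusting.<br>
<pre>
```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed separator line, e.g. "3 of 250 DOCUMENTS"; adjust to the real download.
SEPARATOR = re.compile(r"^\s*\d+ of \d+ DOCUMENTS\s*$", re.MULTILINE)
# Assumed date style, e.g. "July 28, 2011".
DATE = re.compile(r"\b([A-Z][a-z]+ \d{1,2}, \d{4})\b")


def split_articles(download):
    """Split one multi-article download into a list of article chunks."""
    chunks = SEPARATOR.split(download)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def describe(chunk):
    """Guess publication (first line) and month ("YYYY-MM") of a chunk."""
    publication = chunk.splitlines()[0].strip()
    month = "undated"
    match = DATE.search(chunk)
    if match:
        try:
            parsed = datetime.strptime(match.group(1), "%B %d, %Y")
            month = parsed.strftime("%Y-%m")
        except ValueError:
            pass  # matched text was not a real date; leave as "undated"
    return publication, month


def group_by_month(download, excluded=()):
    """Map "YYYY-MM" to the articles kept after the publication filter."""
    months = defaultdict(list)
    for chunk in split_articles(download):
        publication, month = describe(chunk)
        if publication in excluded:
            continue  # user chose to leave this publication out
        months[month].append(chunk)
    return dict(months)
```
</pre>
    Each list in the result could then be written to one monthly text
    file, or each article to a file of its own.<br>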

    <br>

    Cheers -- Mike <br>

    <br>

    On 28/07/2011 14:55, Mahé BEN HAMED wrote:

    <blockquote

cite="mid:CANm4eUXYu8Um9vwvOpKuLDoArJb7Wmae4=GdaL7eV9xEcKQ83Q@mail.gmail.com"

      type="cite">Dear all,

      <div><br>

        <div>Is there a way to speed up the building of corpora from the

          Lexis Nexis newspaper database (given a set of search

          parameters)? To what extent can the whole process be

          automated?</div>

        <div><br>

        </div>

        <div>Thanks,</div>

        <div><br>

        </div>

        <div>Mahe BEN HAMED</div>

      </div>

      <br>


      <br>

      <pre wrap="">_______________________________________________

UNSUBSCRIBE from this page: <a class="moz-txt-link-freetext" href="http://mailman.uib.no/options/corpora">http://mailman.uib.no/options/corpora</a>

Corpora mailing list

<a class="moz-txt-link-abbreviated" href="mailto:Corpora@uib.no">Corpora@uib.no</a>

<a class="moz-txt-link-freetext" href="http://mailman.uib.no/listinfo/corpora">http://mailman.uib.no/listinfo/corpora</a>

</pre>

    </blockquote>

    <br>

    <pre class="moz-signature" cols="72">-- 

Mike Scott

***

If you publish research which uses WordSmith, do let me know so I can include it at

<a class="moz-txt-link-freetext" href="http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm">http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm</a>

***

University of Aston and Lexical Analysis Software Ltd.

<a class="moz-txt-link-abbreviated" href="mailto:mike.scott@aston.ac.uk">mike.scott@aston.ac.uk</a>

<a class="moz-txt-link-abbreviated" href="http://www.lexically.net">www.lexically.net</a>

</pre>

  </body>

</html>