[Corpora-List] Getting articles from newspapers to compile a corpus
Gisle Andersen
Gisle.Andersen at nhh.no
Fri Nov 30 11:29:02 UTC 2012
Dear Matías,
For Norwegian, a 1-billion-word newspaper corpus has been compiled using web crawler technology (wget and w3mir), followed by boilerplate and duplicate removal, text annotation, etc. It contains texts from 24 national, regional and local newspapers covering the period from 1998 to the present. For details, see this reference:
Andersen, Gisle and Hofland, Knut (2012), 'Building a large monitor corpus based on newspapers on the web', in Gisle Andersen (ed.), Exploring Newspaper Language - Using the web to create and investigate a large corpus of modern Norwegian (Amsterdam: John Benjamins), 1-30.
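To give a rough idea of what the duplicate-removal step can look like, here is a minimal Python sketch. It is not the pipeline described in the reference above: the function names `normalize` and `deduplicate` are hypothetical, and it only catches exact duplicates after whitespace and case normalization, whereas a real newspaper crawl would probably also need near-duplicate detection.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase, so trivial formatting
    differences do not hide otherwise identical articles."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(articles):
    """Return the articles with exact duplicates removed.

    Articles are compared by a hash of their normalized text
    (an assumption made for this sketch); the first occurrence
    of each article is kept, in its original form and order.
    """
    seen = set()
    unique = []
    for text in articles:
        digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Hashing the normalized text rather than storing it keeps memory use modest when the same article is served under many URLs, which is common on newspaper sites.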
Kind regards,
Gisle Andersen, NHH