[Corpora-List] Getting articles from newspapers to compile a corpus

Gisle Andersen Gisle.Andersen at nhh.no
Fri Nov 30 11:29:02 UTC 2012


Dear Matías, 

For Norwegian, there is a 1-billion-word newspaper corpus, compiled with web crawler technology (wget and w3mir), followed by boilerplate and duplicate removal, text annotation, etc. It contains texts from 24 national, regional and local newspapers covering the period from 1998 to the present. For details, see this reference:

Andersen, Gisle and Hofland, Knut (2012), 'Building a large monitor corpus based on newspapers on the web', in Gisle Andersen (ed.), Exploring Newspaper Language - Using the web to create and investigate a large corpus of modern Norwegian (Amsterdam: John Benjamins), 1-30.
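
To give a rough idea of what the boilerplate and duplicate removal steps can look like once the pages have been crawled and converted to plain text, here is a minimal Python sketch. It is only an illustration under simple assumptions, not the procedure described in the reference above: the function names, the 50% threshold and the whitespace/case normalization are all hypothetical choices. Lines that recur on more than half of the pages are treated as boilerplate, and articles that are identical after normalization are treated as duplicates.

```python
import hashlib
import re
from collections import Counter


def strip_boilerplate(pages, max_share=0.5):
    """Drop lines that occur on more than max_share of all pages.

    Assumption: navigation menus, footers etc. repeat verbatim across
    pages, while article text does not. The threshold is hypothetical.
    """
    line_counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        line_counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    limit = max_share * len(pages)
    cleaned = []
    for page in pages:
        kept = [ln for ln in page.splitlines()
                if ln.strip() and line_counts[ln.strip()] <= limit]
        cleaned.append("\n".join(kept))
    return cleaned


def normalize(text):
    """Collapse whitespace and lowercase, so trivial differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(articles):
    """Keep the first occurrence of each article; drop exact duplicates
    (after normalization). Near-duplicates are NOT detected here."""
    seen = set()
    unique = []
    for text in articles:
        digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

For real newspaper data you would most likely need something more robust, e.g. near-duplicate detection, since the same wire story often appears in several papers with small edits.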

Kind regards, 
Gisle Andersen, NHH




