[Corpora-List] Web corpora vs. Gigaword

Serge Sharoff S.Sharoff at leeds.ac.uk
Thu Jun 2 12:12:53 UTC 2005


> > But then again, why not go simply to UPenn and purchase some
> > license for English Gigaword plus some additional tens of millions
> > words corpora from LDC?
> 
> For example because I'm also interested in 1 billion words of Italian,
> German and Japanese?  Or because I think that the web can give us a more
> varied picture of a language than a newswire corpus? But more in general

Apart from the issue of their cost (LDC corpora are prohibitively expensive) and their availability for particular languages, the language of newswire corpora is quite different from the language of the BNC and of Internet corpora.  I compared frequency lists from several newswire corpora (Reuters and Gigaword, in particular) against corpora treated as representative (such as the BNC) and corpora compiled from the Internet.  Interestingly, Internet and BNC-like corpora share similar features, while newswire corpora stand apart: they report past events (frequently financial: 56% in Reuters) in more or less formal language, so they use fewer first- and second-person pronouns, question words, modals, etc.  (These findings are reported in a paper currently under review; contact me if you'd like to see the draft.)  At least for the purposes of lexicographic research, it is much better to use corpora compiled from the Internet, unless you are interested specifically in the language of newswires.
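
For anyone who wants to try this kind of comparison on their own data, here is a rough Python sketch. It is only an illustration, not the procedure used in the paper under review: the word lists, the tokenizer and the function names are all assumptions made for the example, and the log-likelihood statistic is simply one common way of comparing a count across two corpora of different sizes.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately small word classes chosen for illustration only.
WORD_CLASSES = {
    "pronouns_1st_2nd": {"i", "me", "my", "we", "us", "our", "you", "your"},
    "question_words": {"what", "why", "how", "who", "where", "when"},
    "modals": {"can", "could", "may", "might", "must",
               "shall", "should", "will", "would"},
}


def tokenize(text):
    """Very crude tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def class_counts(tokens):
    """Raw count of tokens falling into each word class."""
    counts = Counter(tokens)
    return {name: sum(counts[w] for w in words)
            for name, words in WORD_CLASSES.items()}


def class_rates(tokens, per=1_000_000):
    """Frequency of each word class, normalised per `per` tokens."""
    total = len(tokens)
    return {name: c * per / total
            for name, c in class_counts(tokens).items()}


def log_likelihood(a, b, n1, n2):
    """Log-likelihood (G2) for a count `a` in a corpus of n1 tokens
    versus a count `b` in a corpus of n2 tokens."""
    e1 = n1 * (a + b) / (n1 + n2)  # expected count in corpus 1
    e2 = n2 * (a + b) / (n1 + n2)  # expected count in corpus 2
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2
```

To use it, tokenize a newswire sample and a web sample, compare the output of class_rates() for each, and pass the raw counts from class_counts() together with the two corpus sizes to log_likelihood() to see which differences are large relative to corpus size.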

Serge

--
Dr. Serge Sharoff
Centre for Translation Studies
School of Modern Languages and Cultures
University of Leeds
Leeds, LS2 9JT

tel: +44(0)113 343 7287
fax: +44(0)113 343 3287


