[Corpora-List] Brown Corpus

Eric Atwell eric at comp.leeds.ac.uk
Wed Jun 15 08:41:49 UTC 2005


Steven,
I think the original design plan for Brown was to collect 500 text
samples, each of 2000 words (running on to the end of the sentence that
contains the 2000th word). For some text categories, e.g. newspapers
(categories A, B, C), the texts found were generally shorter than 2000
words, so several newspaper articles were combined into a single
2000-word "text". But most of the later 2000-word samples are from a
single source.
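
As a rough illustration of that cut-off rule (not the Brown compilers'
actual procedure), here is a minimal Python sketch. It assumes the text
is already split into sentences, each a list of word tokens; the name
take_sample and the target parameter are made up for this example.

```python
def take_sample(sentences, target=2000):
    """Return the leading sentences of a text, stopping after the
    sentence that contains the target-th word.

    `sentences` is assumed to be a list of sentences, each a list of
    word tokens. If the text has fewer than `target` words, all of it
    is returned.
    """
    sample = []
    word_count = 0
    for sentence in sentences:
        sample.append(sentence)
        word_count += len(sentence)
        if word_count >= target:
            # This sentence holds the target-th word, so stop here
            # rather than cutting the sentence in the middle.
            break
    return sample


def count_words(sample):
    """Total number of word tokens in a list of sentences."""
    return sum(len(sentence) for sentence in sample)
```

With this rule a sample is never shorter than the target (unless the
source runs out) and overshoots it by at most one sentence.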

Other corpora have followed this design principle of a standard
sample size of about 2000 words (LOB, FLOB, FROWN,
ICE: International Corpus of English, CCA: Corpus of Contemporary
Arabic, ...), though not all have (e.g. BNC, ANC). I don't suppose it
matters for most applications whether you combine small files into a big
file to simplify storage/processing, as long as there is a record
somewhere of the original sources (either in a Handbook, or in XML
header markup).
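
To make the "record of the original sources" point concrete, here is a
small Python sketch that merges per-sample files into one text per
category and keeps a manifest of which lines came from which file. It
assumes file names of the form ca01, cb26, etc., as in the quoted
message below; the function names and the manifest layout are
hypothetical, and this is not how NLTK-Lite does it.

```python
import re
from collections import defaultdict


def plan_merge(filenames):
    """Group names such as 'ca01' by their category letter ('a')."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        match = re.fullmatch(r"c([a-z])(\d{2})", name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        groups[match.group(1)].append(name)
    return dict(groups)


def merge_with_manifest(texts):
    """Merge samples per category and record where each one ended up.

    `texts` maps an original file name to its contents (a string).
    Returns (merged, manifest): `merged` maps a category letter to the
    combined text, and `manifest` lists tuples of
    (category, original_name, first_line, last_line), 1-indexed.
    An empty source file yields last_line == first_line - 1.
    """
    merged = {}
    manifest = []
    for category, names in plan_merge(texts).items():
        combined = []
        next_line = 1
        for name in names:
            lines = texts[name].splitlines()
            manifest.append(
                (category, name, next_line, next_line + len(lines) - 1)
            )
            combined.extend(lines)
            next_line += len(lines)
        merged[category] = "\n".join(combined)
    return merged, manifest
```

The manifest could equally well be written into a header of each merged
file; the point is only that the original sample boundaries stay
recoverable after merging.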

eric atwell, Leeds University


On Tue, 14 Jun 2005, Steven Bird wrote:

> Note that this version of the Brown Corpus contains 500 files, each
> consisting of around 200 lines of text on average.  Perhaps these were
> as big as they could handle back in 1961.  I think it would make matters
> simpler if the file structure was rationalized now, so that, e.g.:
>
> Brown Corpus file names
> Existing     -> Proposed
> ca01 .. ca44 -> a
> cb01 .. cb26 -> b
> etc
>
> (NB this is how things are being restructured in NLTK-Lite, a new,
> streamlined version of NLTK that will be released later this month.)
>
> -Steven Bird
>
>
> On Tue, 2005-06-14 at 17:27 +0100, Lou Burnard wrote:
>> By one of those uncanny coincidences, I am planning to include an
>> XMLified version of the Brown corpus on the next edition of the BNC Baby
>> corpus sampler. The version I have is derived from the GPLd version
>> distributed as part of the NLTK tool set (http://nltk.sourceforge.net)
>> and includes POS tagging; there is also a version which has been
>> enhanced to include Wordnet semantic tagging but I am not clear as to
>> the rights in that.
>>
>> Lou Burnard
>>
>>
>> Xiao, Zhonghua wrote:
>>> The plain text version of Brown is available here:
>>> http://dingo.sbs.arizona.edu/~hammond/ling696f-sp03/browncorpus.txt
>>>
>>> Richard
>>> ________________________________
>>>
>>> From: owner-corpora at lists.uib.no on behalf of Jörg Schuster
>>> Sent: Tue 14/06/2005 14:39
>>> To: CORPORA at hd.uib.no
>>> Subject: [Corpora-List] Brown Corpus
>>>
>>>
>>>
>>> Hello,
>>>
>>> where can the Brown Corpus be downloaded or purchased?
>>>
>>> Jörg Schuster

--
Eric Atwell, Senior Lecturer, Language research group, School of Computing,
Faculty of Engineering, University of Leeds, LEEDS LS2 9JT, England
TEL: +44-113-2335430  FAX: +44-113-2335468  http://www.comp.leeds.ac.uk/eric

