[Corpora-List] Summary: Corpus with restricted vocabulary

Klebanov Beata beata at cs.huji.ac.il
Tue Jan 27 09:19:26 UTC 2004


Dear Corpora members,

This is a summary of the replies I received to my query of 18 Jan:

>
> For my research on textual manifestations of common knowledge, I am
> looking for a corpus of short English texts based on restricted vocabulary
> (up to ~500 different NP, VP heads), to be used for training machine
> learning tools sensitive to vocabulary size.


I would like to thank Brett Reynolds, Eric Atwell, Joel Walters and Andrew
Harley for providing pointers.

Here is the summary of replies:

(1) Andrew Harley <aharley at cambridge.org>
from Cambridge University Press suggested using
learner's dictionaries, whose definitions are written in a restricted
vocabulary; for example, the Cambridge learner's dictionary, which can be
licensed. More info here:
http://dictionary.cambridge.org/researchers.htm

He also suggested using ELT readers at different levels that might meet
the restricted vocabulary requirement. The first level restricts the
vocabulary to 400 headwords; at this level, there are 6 books of about 30
pages including pictures. It is possible to view samples from the readers
here: http://publishing.cambridge.org/ge/elt/readers/26777/
The readers have not yet been licensed for use as a corpus, but
Andrew Harley thinks it might be possible if there is demand and if the
authors agree.

In a similar spirit, Brett Reynolds <brett at forsyths.ca> suggested the
Oxford Bookworms series of graded readers; more information can be found here:
http://www.oup.com/elt/global/catalogue/readers/
Some short samples are available from the site.

(2) Joel Walters <waltej at netvision.net.il> has a small corpus of native
English texts collected for an experimental procedure involving writing
syntheses/summaries of two source texts. The corpus totals about 20,000
words, and individual texts range from 50 to 600 words.

(3) Eric Atwell referred me to Dr Caroline Lyon of the University of
Hertfordshire <C.M.Lyon at herts.ac.uk>, who used a restricted English corpus
for her 1994 PhD thesis: http://homepages.feis.herts.ac.uk/~comrcml/Lyon-thesis.ps


Thanks to all who replied,

Beata.


