[Corpora-List] Request for advice on creating a learners' corpus
Eric Atwell
eric at comp.leeds.ac.uk
Wed Jan 26 16:50:24 UTC 2005
Victoria,
Latifa Al-Sulaiti was in a similar position about a year and a half ago,
when she planned to collect a million-word Corpus of Contemporary Arabic.
These were native-speaker texts rather than learner texts, but even so she
faced similar technical issues: her background was in linguistics and
language teaching rather than computing, and she didn't start with prior
knowledge of seeking permissions, corpus structure and management,
XML file format, markup information to add to file headers, etc.
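As a rough illustration of what "markup info in file headers" can look
like, here is a minimal Python sketch that wraps one essay in XML with a
small metadata header. It is not Latifa's actual scheme: the element names
(essay, header, courseLevel, nativeLanguage, text) and both function names
are assumptions made up for this example, and the two header fields are
simply the ones mentioned in the original question.

```python
import xml.etree.ElementTree as ET


def build_essay_xml(essay_id, level, native_language, text):
    """Wrap one learner essay in a small XML document with a metadata header.

    All element and attribute names here are hypothetical, chosen only
    for illustration; a real project would follow its own agreed scheme.
    """
    essay = ET.Element("essay", id=essay_id)
    header = ET.SubElement(essay, "header")
    ET.SubElement(header, "courseLevel").text = level
    ET.SubElement(header, "nativeLanguage").text = native_language
    ET.SubElement(essay, "text").text = text
    return essay


def header_fields(essay):
    """Read the header back as a plain dict, e.g. for filtering the corpus."""
    return {child.tag: child.text for child in essay.find("header")}


if __name__ == "__main__":
    doc = build_essay_xml("e0001", "intermediate", "Japanese",
                          "My hometown is ...")
    # Serialise to a string; special characters in the essay are escaped.
    print(ET.tostring(doc, encoding="unicode"))
    print(header_fields(doc))
```

Keeping the metadata in a header, separate from the essay text, means the
corpus can later be filtered by level or native language without touching
the essays themselves.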
Her initial version of the corpus is now complete and online;
see http://www.comp.leeds.ac.uk/latifa
Her methods and solutions to the problems along the way are documented
in her MSc Thesis, also online:
Latifa Al-Sulaiti (MSc): Designing and Developing a Corpus of
Contemporary Arabic
Abstract: http://www.comp.leeds.ac.uk/cgi-bin/sis/ext/rs_pub.cgi?cmd=displayabstract&sid=200081109
PDF: /research/pubs/theses/Latifa_MSc.pdf
We are also writing a paper for IJCL; we could let you have a draft if
you're interested...
I'm sure Latifa would be happy to discuss issues further - do get in
touch direct.
Good luck with your project!
Eric Atwell, School of Computing, Leeds University
On Thu, 27 Jan 2005, Victoria Muehleisen wrote:
> Hello Everyone,
>
> I teach English at a university in Japan, and we recently received some
> grant money to set up a learners' corpus, of students' essays written
> in English.
>
> Although we have some ideas of how we can begin doing research once we
> have the corpus, we don't know anything about actually setting it up.
> What are the best formats for storing the essays? For marking up the
> data? What kind of information will be most useful to add to the
> files? (For example, we know that we'll want to identify the level of
> the class the essay was written for--there are basic, intermediate, and
> advanced level writing courses--and we'll also want to code for the
> native language of the writer--not all the students are Japanese--but
> are there other kinds of variables we should keep track of?)
>
> We would appreciate references to books/articles/web sites on setting
> up a learners' corpus, especially ones that don't assume too much
> technical computer knowledge. We'll have people available to help us
> with the technical side, but we need to tell them what we want to do.
>
> In addition to references, if there is anyone who has created a
> learners' corpus and could warn us about any mistakes to avoid, that
> would also be very helpful. And at the next stage, we'll need to start
> thinking about issues of student privacy/permission, so any references
> on those issues (in particular, ways that other corpus-creators have
> done it) would be very useful.
>
> Thanking you in advance,
>
> *********************************
> Victoria Muehleisen
>
> School of International Liberal Studies Waseda University
> Nishi-Waseda 1-6-1
> Shinjuku-ku, Tokyo 169-8050
>
> E-mail: <vicky at waseda.jp>
> Home page: <www.f.waseda.jp/vicky>
--
Eric Atwell, Senior Lecturer, Computer Vision and Language research group,
School of Computing, University of Leeds, LEEDS LS2 9JT, England
TEL: +44-113-2335430 FAX: +44-113-2335468 http://www.comp.leeds.ac.uk/eric