[Corpora-List] Plea for help

geoffrey.williams geoffrey.williams at wanadoo.fr
Tue Nov 19 08:20:42 UTC 2002


Hi,

Although I have until now answered directly so as not to overfill inboxes, I
thought some general considerations might be useful. For me, small corpora
are around the half-million mark, but then I have texts that run to 2000
tokens. Letters being shorter, we have an entirely new ball game in which the
question of small corpora and what they can show has to be considered. The
Ghadessy et al. volume is a good intro to small corpora, but we still need to
discuss the phenomenon in terms of its role and limitations.

For myself, I wouldn't term "business writing" "academic discourse", unless
of course you are in the financial side of academia rather than on the
poorly paid side as most of us are. You will need to decide what type of
business you are interested in, as letters will vary enormously depending on
the type of business and the purpose of the correspondence.

Corpora are about breadth and size.

Breadth gives the variety needed and means that you cannot just take the
letters of one writer, as this would be studying author style and no
generalisation would be possible. Breadth, however, is constrained by the need
for homogeneity, which means that you will have to be clear as to what sorts
of letters go into the corpus.

Size allows generalisation by being able to make statistically substantiated
observations. "Small corpora" are fine for studies of precise events studied
stylistically, but limit what you can say lexically. With such a corpus it
will be difficult to make comments as to collocation as your base will be
small. Do not forget Zipf's law on diminishing returns. About half of your
word types will be hapax legomena, occurring only once, and the lion's share
of your running tokens will be high-frequency grammatical items. You will be
able to find some repeated sequences (try using the "clusters" in WordSmith)
and some "candidate" collocations from the speciality in which you are
working.
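
To get a feel for this before committing to a tool, here is a minimal Python
sketch that lists the words occurring only once and the repeated word
sequences in a text. It is not WordSmith's "clusters" function, just a rough
equivalent; the tokeniser, the function names and the sample letter text are
all my own assumptions for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes (a crude tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())


def hapax_legomena(tokens):
    """Return the sorted list of word types that occur exactly once."""
    counts = Counter(tokens)
    return sorted(word for word, freq in counts.items() if freq == 1)


def clusters(tokens, n=3, min_freq=2):
    """Return every n-word sequence that occurs at least min_freq times."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {" ".join(gram): freq for gram, freq in grams.items() if freq >= min_freq}


# Invented sample text, standing in for a real corpus file.
sample = (
    "Thank you for your letter of 3 May. "
    "Thank you for your order. "
    "We regret the delay."
)
tokens = tokenize(sample)
hapaxes = hapax_legomena(tokens)

print("tokens:", len(tokens), "types:", len(set(tokens)))
print("hapax legomena:", hapaxes)
print("repeated 3-word clusters:", clusters(tokens))
```

On a corpus of 20,000 to 30,000 words, most of the clusters that survive a
minimum frequency of 2 or 3 will be formulaic openings and closings, which is
exactly where a small letter corpus has something to say.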

For corpus building, you will need to negotiate access to mail. The easiest
is obviously email, but the genre is different from that of snail mail.
Snail mail will have to be scanned. You will also need to mark up your texts,
as the sections, greetings etc. are of importance. For this you should use
the TEI recommendations; TEI Lite is fine. Otherwise you just chuck the whole
lot in the concordancer and see what comes out. The big problem will be
finding a tame company that allows you access to its letters. They will
certainly demand that the texts are rendered anonymous, and that there is
only limited access to the corpus.
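
As a rough illustration of the markup step, the Python sketch below wraps the
parts of one anonymised letter in TEI-style elements (opener, dateline,
salute, p, closer, signed). The choice of elements, the function name and the
"[NAME]" and "[COMPANY]" placeholders are my assumptions for the example; check
the TEI Lite documentation for the structure you actually need, including the
header that a full TEI document requires.

```python
import xml.etree.ElementTree as ET


def letter_to_tei(date, greeting, paragraphs, closing, signature):
    """Wrap the parts of one letter in TEI-style elements and return an XML string."""
    div = ET.Element("div", {"type": "letter"})

    # Opening section: date line and greeting.
    opener = ET.SubElement(div, "opener")
    ET.SubElement(opener, "dateline").text = date
    ET.SubElement(opener, "salute").text = greeting

    # Body paragraphs.
    for para in paragraphs:
        ET.SubElement(div, "p").text = para

    # Closing section: complimentary close and signature.
    closer = ET.SubElement(div, "closer")
    ET.SubElement(closer, "salute").text = closing
    ET.SubElement(closer, "signed").text = signature

    return ET.tostring(div, encoding="unicode")


# Invented, already anonymised sample letter.
xml_text = letter_to_tei(
    date="3 May 2002",
    greeting="Dear [NAME],",
    paragraphs=[
        "Thank you for your letter regarding the late delivery.",
        "We regret the delay and have issued a credit note.",
    ],
    closing="Yours sincerely,",
    signature="[NAME], [COMPANY]",
)
print(xml_text)
```

With the sections marked like this you can later search greetings or closings
separately from the body, which a raw-text concordance cannot do.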

There are a number of obstacles to overcome in such small corpus work, but
difficult corpora yield interesting results; they just take time.

Good luck

Geoffrey

***********************************************************

Dr Geoffrey C. Williams,
Département Langues Etrangères Appliquées
U.F.R. Lettres et Sciences Humaines
4, rue Jean Zay
B.P. 92116
56321 LORIENT Cedex
FRANCE

tél : 33 (0) 2 97 87 29 68
fax : 33 (0) 2 97 87 29 70

email : Geoffrey.Williams at univ-ubs.fr

http://www.univ-ubs.fr/crellic

***************************************************
----- Original Message -----
From: "Isa Abdul kaader" <I.Abdul-kaader at postgrad.umist.ac.uk>
To: "geoffrey.williams" <geoffrey.williams at wanadoo.fr>
Cc: <CORPORA at HD.UIB.NO>
Sent: Sunday, November 17, 2002 1:54 AM
Subject: Re: [Corpora-List] Plea for help


> Hi Geoffrey,
>
>     Many thanks for your suggestions, Geoffrey, and to all the others who
> gave such wonderful advice with books etc. for me to get an understanding
> of Corpus Linguistics.
>
>    Having understood the essentials, I have decided to focus my attention on
> academic writing, especially
> 1. Business Writing (letters of complaint and adjustment) and/or
> 2. Technical Report writing (could be hardware/software development in
> higher technical institutions).
>
>
>     Apart from the critical studies of language teaching materials by
> Kennedy (1987a), Holmes (1988), Mindt (1992) and especially Conrad (1996b),
> who dealt with academic text and corpus-based techniques, are there any
> other studies that look at specialized registers like business writing and
> technical reports?
>
>      I have yet to get full access to the BNC but would like to know if such
> corpora (registers) could be obtained from the "Official document and
> Academic Prose" listing in the BNC index, to compare with the linguistic
> characteristics of the corpus that I wish to compile.
>
>      I want to compile a small corpus (20,000 to 30,000) with regard to the
> registers listed. I do NOT KNOW how to get started with this. I need all
> the help I can get with this.
>
>       In short, I am interested in compiling a corpus to study its
> characteristics and applications, and to exploit it to design the best
> possible materials and activities to help my students understand and produce
> the registers listed above appropriately (helping students with language
> that is actually used in these settings).
>
>      I am keen to know of research that states the appropriateness and
> potential of corpora (including collocations) in Computer Assisted Language
> Learning at higher institutions, especially to teach technical writing.
>
> Very many thanks in advance for all your help.
>
> Rafiq
> Temasek Polytechnic
> Singapore
>


