[Lexicog] Corpora Planning

Tue May 9 10:41:17 UTC 2006

Dear Ali (and other friends)

No one seems to have answered your question posted on lexicographylist about
corpora, so I will attempt a brief summary, as this lies in my area of
expertise.  I'll start with the obvious -- so forgive me if I tell you what
you know already -- and then I'll move on to some particular issues.

1. What are corpora?

The word "corpora" is the plural of "corpus" -- originally a Latin word
meaning "body". In modern usage, a corpus is simply a collection of texts in
electronic (machine-readable) form, for processing by computer. Corpora
provide evidence for how words are actually used. A famous (and freely
available) corpus for English is the British National Corpus (BNC) of 100
million words. It's ten years old now, but still a useful resource for
finding out how English words are used. See http://www.natcorp.ox.ac.uk/ .
Recently, Oxford University Press announced the "Oxford English Corpus" of 1
billion words (an order of magnitude bigger than BNC).  See
http://www.askoxford.com/oec/

2. Planning a Corpus

Many languages now have at least one general corpus -- but nevertheless new
corpora are still being planned and built in these languages. In English, for
example, special-subject corpora are now being planned and built -- so someone
interested in the language of medicine will build a corpus of medical texts.
Another example: a corpus of historical texts provides evidence for how
words were used in the past.

Other languages do not yet have a general corpus at all, so corpus planners
must start from scratch.  Typically, a general corpus will consist of lots
of different kinds of texts -- some journalism (electronic versions of
newspapers and journals are easy to obtain in many languages), some
textbooks, some academic writing, some fiction, some web pages, and some
transcripts of unscripted conversation (though this last kind is difficult
to get hold of, and can also be difficult to interpret).

Corpus planners generally avoid poetry and plays, as these are texts in
which language is often used in unusual ways.

A good introduction to corpus planning, though somewhat out of date now,
is an article by Sue Atkins, Jeremy Clear, and Nicholas Ostler, 1992:
"Corpus Design Criteria" in the journal Literary and Linguistic Computing.
See http://llc.oxfordjournals.org/cgi/content/abstract/7/1/1

Nowadays, the Internet has made corpus building much easier. Indeed, for
some purposes, the whole of the Internet is sometimes regarded as one vast
multilingual corpus. See a special issue of the journal "Computational
Linguistics" edited by Adam Kilgarriff and Gregory Grefenstette:
http://www.mitpressjournals.org/doi/abs/10.1162/089120103322711569  .

3. Getting Permissions

Someone building a corpus for general use must get permission from each
author (or copyright owner -- typically, the publisher)  before adding a
text to a corpus. This can raise difficult questions (e.g. "Who owns the
text?" and "What can I say that will persuade the text owner to give
permission?").

4. Building a Corpus

Once the texts have been obtained, along with permission to use them, some
basic computational work has to be carried out.  The texts must be
standardized, so that they are all in the same format for the computer. Then
they must be 1) tokenized (finding word boundaries and deciding what to do
about punctuation marks); 2) lemmatized (if one wants the computer to find
"take", "takes", "taking", "took", and "taken" in response to a user's
inquiry about "take"); and 3) word-class tagged (so that, for example, the
computer can separate "report", noun, from "report", verb). Finally, each word
in each text must be indexed (a highly technical procedure), so that the
computer can instantly retrieve the information that users ask for about a
word, phrase, or other linguistic item.
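
To make these steps concrete, here is a minimal sketch in Python of
tokenizing, lemmatizing, and indexing. It is only an illustration, not how
any real corpus system works: the tiny LEMMAS table, the function names, and
the two sample sentences are all invented for this example, and word-class
tagging is left out because it needs a trained tagger.

```python
import re
from collections import defaultdict

# Hypothetical toy lemma table (an assumption for this sketch); a real
# corpus would rely on a full morphological lexicon or analyser.
LEMMAS = {"takes": "take", "taking": "take", "took": "take", "taken": "take"}

# A word is a run of word characters; any other non-space character
# (a punctuation mark, for instance) becomes a token of its own.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split a text into word tokens and single punctuation tokens."""
    return TOKEN_RE.findall(text)


def lemmatize(token):
    """Return the lemma for a token, falling back to the lowercased form."""
    lower = token.lower()
    return LEMMAS.get(lower, lower)


def build_index(texts):
    """Map each lemma to a list of (text number, token position) pairs."""
    index = defaultdict(list)
    for text_id, text in enumerate(texts):
        for position, token in enumerate(tokenize(text)):
            if token[0].isalnum():  # skip punctuation tokens
                index[lemmatize(token)].append((text_id, position))
    return index


# Invented sample texts, standing in for a standardized corpus.
texts = ["She took the report.", "They are taking notes; he takes none."]
index = build_index(texts)
print(index["take"])
```

Looking up "take" in the index returns the positions of "took", "taking",
and "takes" without rereading the texts, which is the point of indexing. The
pattern used for tokenizing is a simplification, and many scripts and
writing conventions would need more careful handling.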

Fortunately, there are now some experts who specialize in building corpora
of any kind in any language.  The language does not matter, because the
procedures for processing words (letters, symbols) are
language-independent -- i.e. they are, in principle, the same for any
language. Among the best are Pavel Rychly and Adam Kilgarriff (see
http://www.sketchengine.co.uk/ ).

5. Why build a corpus anyway?

Linguists and lexicographers are divided between those who believe that one
can get all the evidence one needs by consulting the intuitions of a native
speaker (oneself, for example), and those who believe that some other
source of evidence is necessary.  I worked
in lexicography in the 1970s, before there were corpora, and I can attest
from personal experience that the evidence of a large corpus provides
important insights into words and meanings which cannot be obtained by
introspection (however hard one tries). So I firmly believe that a corpus
and tools for corpus analysis are necessary for modern lexicography.

6. Oxford Dictionaries

One clarification re your question: OED is not a corpus-based dictionary.
The original 14-volume Oxford English Dictionary (1878-1928) (OED) was a
great historical investigation into the origin and history of every English
word, based on a 19th-century collection of citations, each of which was
written out on a slip of paper by volunteers. (Phew!)  It was compiled long
before computers were invented,
but in the 1980s OED was loaded onto a computer. It is now being very
thoroughly revised by a large team of lexicographers in Oxford. An on-line
version is available. The OED editors have to take account of many facts
(philological and historical) in addition to corpus evidence.

Oxford University Press is a vast publishing organization with several
divisions, which are run independently as separate businesses. The Oxford
Advanced Learner's Dictionary of Current English (OALDCE) is published by the
English Language Teaching Division.  It was not originally -- but is now -- 
a corpus-based dictionary. It was completely revised and rewritten in the
1990s in the light of corpus evidence. It has nothing to do with the Oxford
English Dictionary (OED), other than the fact that it is published by the
same publisher.

The one-volume "Oxford Dictionary of English" is a corpus-based dictionary.
It is aimed at native speakers of English (but not at historical scholars).
So it lies somewhere between the OED and OALDCE. I was involved in creating
the first edition of the Oxford Dictionary of English, so naturally I
think it is the best dictionary ever!

I hope these remarks are helpful, and that you will take the initiative in
creating a corpus of Urdu. Let me know if I can help in any way.

Best wishes,

Patrick Hanks

----- Original Message ----- 
From: "ali72678" <ali72678 at yahoo.com>
To: <lexicographylist at yahoogroups.com>
Sent: Saturday, May 06, 2006 7:10 PM
Subject: [Lexicog] Corpora Planning

> Hi All
> I want to Know these things:
> 1--What is Corpora planning?
> 2-what is the corpora planning of OED and other learner dictionaries?
> Tell me and oblige.
> Ali