[Lexicog] Corpora Planning

Justice, Alexander ajustice at LMU.EDU
Fri May 19 17:54:27 UTC 2006


Dear Patrick,
 
Thank you for your detailed reply to Ali's question. I found it very
informative as well, as I'm just beginning to learn about corpus
linguistics.
 
Might I ask whether there is a list, compiled and published on the WWW,
of all (known) corpus projects, or at least the main ones for each
language? Does Great Britain have a central coordinating body for corpora
of Britain's main Germanic and Celtic home languages, or indeed of all
languages used by large communities in Britain?
 

Alexander Justice
Reference Librarian

Von der Ahe Library
Loyola Marymount University
One LMU Drive
Los Angeles, CA 90045

310.338.5947
ajustice at lmu.edu

http://www.lmu.edu/library 





 


________________________________

	From: lexicographylist at yahoogroups.com
[mailto:lexicographylist at yahoogroups.com] On Behalf Of Patrick Hanks
	Sent: Tuesday, May 09, 2006 3:41 AM
	To: lexicographylist at yahoogroups.com
	Cc: Driss El-Khattab; Pavel Rychly
	Subject: Re: [Lexicog] Corpora Planning
	
	

	Dear Ali (and other friends)
	
	No one seems to have answered your question posted on
lexicographylist about
	corpora, so I will attempt a brief summary, as this lies in my
area of
	expertise.  I'll start with the obvious -- so forgive me if I
tell you what
	you know already -- and then I'll move on to some particular
issues.
	
	1. What are corpora?
	
	The word "corpora" is the plural of "corpus" -- originally a
Latin word
	meaning "body". In modern usage, a corpus is simply a collection
of texts in
	electronic (machine-readable) form, for processing by computer.
Corpora
	provide evidence for how words are actually used. A famous (and
freely
	available) corpus for English is the British National Corpus
(BNC) of 100
	million words. It's ten years old now, but still a useful
resource for
	finding out how English words are used. See
http://www.natcorp.ox.ac.uk/ .
	Recently, Oxford University Press announced the "Oxford English
Corpus" of 1
	billion words (an order of magnitude bigger than BNC).  See
	http://www.askoxford.com/oec/
	
	2. Planning a Corpus
	
	Many languages now have at least one general corpus -- but nevertheless
	new corpora are still being planned and built in these languages. In
	English, for example, special-subject corpora are now being planned and
	built -- so someone
	interested in the language of medicine will build a corpus of
medical texts.
	Another example: a corpus of historical texts provides evidence
for how
	words were used in the past.
	
	Other languages do not yet have a general corpus at all, so corpus planners
	must start from scratch.  Typically, a general corpus will consist of many
	different kinds of texts: some journalism (electronic versions of
	newspapers and journals are easy to obtain in many languages), some
	textbooks, some academic writing, some fiction, some web pages, and some
	transcripts of unscripted conversation (though the last of these is
	difficult to get hold of, and can also be difficult to interpret).
	
	Corpus planners generally avoid poetry and plays, as these are
texts in
	which language is often used in unusual ways.
	
	A good introduction to corpus planning, though somewhat out
of date now,
	is an article by Sue Atkins, Jeremy Clear, and Nicholas Ostler,
1992:
	"Corpus Design Criteria" in the journal Literary and Linguistic
Computing.
	See http://llc.oxfordjournals.org/cgi/content/abstract/7/1/1
	
	Nowadays, the Internet has made corpus building much easier.
Indeed, for
	some purposes, the whole of the Internet is sometimes regarded
as one vast
	multilingual corpus. See a special issue of the journal
"Computational
	Linguistics" edited by Adam Kilgarriff and Gregory Grefenstette:
	
http://www.mitpressjournals.org/doi/abs/10.1162/089120103322711569  .
	
	3. Getting Permissions
	
	Someone building a corpus for general use must get permission
from each
	author (or copyright owner -- typically, the publisher)  before
adding a
	text to a corpus. This can raise difficult questions (e.g. "Who
owns the
	text?" and "What can I say that will persuade the text owner to
give
	permission?")
	
	4. Building a Corpus
	
	Once the texts have been obtained, along with permission to use them, some
	basic computational work has to be carried out.  The texts must be
	standardized, so that they are all in the same format for the computer. Then
	they must be 1) tokenized (finding word boundaries and deciding what to do
	about punctuation marks); 2) lemmatized (if one wants the computer to find
	"take", "takes", "taking", "took", and "taken" in response to a user's
	inquiry about "take"); and 3) word-class tagged (so that, for example, the
	computer can separate "report", noun, from "report", verb). Finally, each
	word in each text must be indexed (a highly technical procedure), so that
	the computer can instantly retrieve the information that users ask for
	about a word, phrase, or other linguistic item.
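
The tokenizing, lemmatizing, and indexing steps described above can be illustrated with a very small Python sketch. Everything in it is an assumption made for illustration: the `LEMMAS` table is a hypothetical toy lookup rather than a real lexicon, the `\w+` tokenizer is far cruder than what a production corpus tool would use, and word-class tagging is left out entirely because it normally needs a trained tagger.

```python
import re
from collections import defaultdict

# Hypothetical toy lemma table (an assumption for illustration only);
# a real system would rely on a full morphological lexicon or analyser.
LEMMAS = {"takes": "take", "taking": "take", "took": "take", "taken": "take"}


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def lemmatize(token):
    """Return the base form of a token, or the token itself if unknown."""
    return LEMMAS.get(token, token)


def build_index(texts):
    """Map each lemma to a list of (text_id, token_position) pairs."""
    index = defaultdict(list)
    for text_id, text in enumerate(texts):
        for position, token in enumerate(tokenize(text)):
            index[lemmatize(token)].append((text_id, position))
    return index


texts = ["She took the report.", "He takes notes, taking care."]
index = build_index(texts)

# A query for "take" finds "took", "takes", and "taking" as well.
print(index["take"])  # [(0, 1), (1, 1), (1, 3)]
```

Storing positions rather than bare counts is what would let a corpus tool jump straight to each occurrence and show it in context; real indexes are of course far more elaborate than this dictionary of lists.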
	
	Fortunately, there are now some experts who specialize in
building corpora
	of any kind in any language.  The language does not matter,
because the
	procedures for processing words (letters, symbols) are
	language-independent -- i.e. they are, in principle, the same
for any
	language. Among the best are Pavel Rychly and Adam Kilgarriff (see
	http://www.sketchengine.co.uk/ ).
	
	5. Why build a corpus anyway?
	
	Linguists and lexicographers are divided between those who believe that one
	can get all the evidence one needs by consulting the intuitions of a native
	speaker (oneself, for example), and those who believe that some external
	source of evidence is necessary. I worked in lexicography in the 1970s,
	before there were corpora, and I can attest from personal experience that
	the evidence of a large corpus provides important insights into words and
	meanings which cannot be obtained by introspection (however hard one tries).
	So I firmly believe that a corpus (and tools for corpus analysis) is
	necessary for modern lexicography.
	
	6. Oxford Dictionaries
	
	One clarification re your question: OED is not a corpus-based dictionary.
	The original 14-volume Oxford English Dictionary (1878-1928) (OED) was a
	great historical investigation into the origin and history of every English
	word, based on a 19th-century collection of citations, each of which was
	written out on a slip of paper by volunteers. (Phew!)  It was compiled long
	before computers were invented, but in the 1980s OED was loaded onto a
	computer. It is now being very thoroughly revised by a large team of
	lexicographers in Oxford. An on-line version is available. The OED editors
	have to take account of many facts (philological and historical) in
	addition to corpus evidence.
	
	Oxford University Press is a vast publishing organization with
several
	divisions, which are run independently as separate businesses.
The Oxford
	Advanced Learner's Dictionary of Current English (OALDCE) is
published by the
	English Language Teaching Division.  It was not originally --
but is now -- 
	a corpus-based dictionary. It was completely revised and
rewritten in the
	1990s in the light of corpus evidence. It has nothing to do with
the Oxford
	English Dictionary (OED), other than the fact that it is
published by the
	same publisher.
	
	The one-volume "Oxford Dictionary of English" is a corpus-based
dictionary.
	It is aimed at native speakers of English (but not at historical
scholars).
	So it lies somewhere between the OED and OALDCE. I was involved
in creating
	the first edition of the Oxford Dictionary of English book, so
naturally I
	think it is the best dictionary ever!
	
	I hope these remarks are helpful, and that you will take the
initiative in
	creating a corpus of Urdu. Let me know if I can help in any way.
	
	Best wishes,
	
	
	Patrick Hanks
	
	
	----- Original Message ----- 
	From: "ali72678" <ali72678 at yahoo.com>
	To: <lexicographylist at yahoogroups.com>
	Sent: Saturday, May 06, 2006 7:10 PM
	Subject: [Lexicog] Corpora Planning
	
	
	> Hi All
	> I want to Know these things:
	> 1--What is Corpora planning?
	> 2-what is the corpora planning of OED and other learner
dictionaries?
	> Tell me and oblige.
	> Ali
	
	
	
	
