<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.2900.2873" name=GENERATOR></HEAD>
<BODY>
<DIV dir=ltr align=left><SPAN class=172314717-19052006><FONT face=Arial
color=#0000ff size=2>Dear Patrick,</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=172314717-19052006><FONT face=Arial
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=172314717-19052006><FONT face=Arial
color=#0000ff size=2>Thank you for your detailed reply to Ali's question. I
found it very informative as well, as I'm just beginning to learn about corpus
linguistics.</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=172314717-19052006><FONT face=Arial
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=172314717-19052006><FONT face=Arial
color=#0000ff size=2>Might I ask whether a list of all (known) corpus projects,
or at least the main ones for each language, has been compiled and published on
the WWW? Does Great Britain have a central coordinating body for
corpora of Britain's main Germanic and Celtic home languages? Or indeed of all
languages used by large communities in Britain?</FONT></SPAN></DIV>
<DIV> </DIV><!-- Converted from text/plain format -->
<P><FONT size=2><!-- Converted from text/plain format --></P>
<P><FONT size=2>Alexander Justice<BR>Reference Librarian<BR><BR>Von der Ahe
Library<BR>Loyola Marymount University<BR>One LMU Drive<BR>Los Angeles, CA
90045<BR><BR>310.338.5947<BR>ajustice@lmu.edu<BR><BR><A
href="http://www.lmu.edu/library">http://www.lmu.edu/library</A> </FONT></P>
<P><FONT face=Arial color=#0000ff></FONT><BR><BR></FONT></P>
<DIV> </DIV><BR>
<BLOCKQUOTE style="MARGIN-RIGHT: 0px">
<DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> lexicographylist@yahoogroups.com
[mailto:lexicographylist@yahoogroups.com] <B>On Behalf Of </B>Patrick
Hanks<BR><B>Sent:</B> Tuesday, May 09, 2006 3:41 AM<BR><B>To:</B>
lexicographylist@yahoogroups.com<BR><B>Cc:</B> Driss El-Khattab; Pavel
Rychly<BR><B>Subject:</B> Re: [Lexicog] Corpora Planning<BR></FONT><BR></DIV>
<DIV></DIV><BR>Dear Ali (and other friends)<BR><BR>No one seems to have
answered your question posted on lexicographylist about<BR>corpora, so I will
attempt a brief summary, as this lies in my area of<BR>expertise. I'll
start with the obvious -- so forgive me if I tell you what<BR>you know already
-- and then I'll move on to some particular issues.<BR><BR>1. What are
corpora?<BR><BR>The word "Corpora" is the plural of "corpus" -- originally a
Latin word<BR>meaning "body". In modern usage, a corpus is simply a collection
of texts in<BR>electronic (machine-readable) form, for processing by computer.
Corpora<BR>provide evidence for how words are actually used. A famous (and
freely<BR>available) corpus for English is the British National Corpus (BNC)
of 100<BR>million words. It's ten years old now, but still a useful resource
for<BR>finding out how English words are used. See <A
href="http://www.natcorp.ox.ac.uk/">http://www.natcorp.ox.ac.uk/</A>
.<BR>Recently, Oxford University Press announced the "Oxford English Corpus"
of 1<BR>billion words (an order of magnitude bigger than the BNC). See<BR><A
href="http://www.askoxford.com/oec/">http://www.askoxford.com/oec/</A><BR><BR>2.
Planning a Corpus<BR><BR>Many languages now have at least one general corpus -- but
nevertheless new corpora are<BR>still being planned and built in these
languages. In English, for example,<BR>special subject corpora are now
being planned and built -- so someone<BR>interested in the language of
medicine will build a corpus of medical texts.<BR>Another example: a corpus of
historical texts provides evidence for how<BR>words were used in the
past.<BR><BR>Other languages do not yet have a general corpus at all, so
corpus planners<BR>must start from scratch. Typically, a general corpus
will consist of lots<BR>of different kinds of texts --- some journalism
(electronic versions of<BR>newspapers and journals are easy to obtain in many
languages), some text<BR>books, some academic writing, some fiction, some web
pages, and some<BR>transcripts of unscripted conversation (though the
last of these is difficult<BR>to get hold of, and can also be difficult to
interpret).<BR><BR>Corpus planners generally avoid poetry and plays, as these
are texts in<BR>which language is often used in unusual ways.<BR><BR>A good
introduction to corpus planning, though somewhat out of date now,<BR>is an
article by Sue Atkins, Jeremy Clear, and Nicholas Ostler, 1992:<BR>"Corpus
Design Criteria" in the journal Literary and Linguistic Computing.<BR>See <A
href="http://llc.oxfordjournals.org/cgi/content/abstract/7/1/1">http://llc.oxfordjournals.org/cgi/content/abstract/7/1/1</A><BR><BR>Nowadays,
the Internet has made corpus building much easier. Indeed, for<BR>some
purposes, the whole of the Internet is sometimes regarded as one
vast<BR>multilingual corpus. See a special issue of the journal
"Computational<BR>Linguistics" edited by Adam Kilgarriff and Gregory
Grefenstette:<BR><A
href="http://www.mitpressjournals.org/doi/abs/10.1162/089120103322711569">http://www.mitpressjournals.org/doi/abs/10.1162/089120103322711569</A>
.<BR><BR>3. Getting Permissions<BR><BR>Someone building a corpus for general
use must get permission from each<BR>author (or copyright owner -- typically,
the publisher) before adding a<BR>text to a corpus. This can raise
difficult questions (e.g. "Who owns the<BR>text?" and "What can I say that
will persuade the text owner to give<BR>permission?")<BR><BR>4. Building a
Corpus<BR><BR>Once the texts have been obtained, along with permission to use
them, some<BR>basic computational work has to be carried out. The texts
must be<BR>standardized, so that they are all in the same format for the
computer, then<BR>they must be 1) tokenized (finding word boundaries, deciding
what to do<BR>about punctuation marks), 2) lemmatized (if one wants the
computer<BR>to find "take, takes, taking, took, and/or taken" in response to a
user's<BR>inquiry about "take"); and 3) word-class tagged (so that, for
example, the<BR>computer can separate "report", noun, from "report", verb).
Then each word<BR>in each text must be indexed (a highly technical procedure),
so that the<BR>computer can instantly retrieve the information that users ask
for about a<BR>word or phrase or other linguistic item.<BR><BR>Fortunately,
there are now some experts who specialize in building corpora<BR>of any kind
in any language. The language does not matter, because the<BR>procedures
for processing words (letters, symbols) are<BR>language-independent -- i.e.
they are, in principle, the same for any<BR>language. Among the best are Pavel
Rychly and Adam Kilgarriff (see<BR><A
href="http://www.sketchengine.co.uk/">http://www.sketchengine.co.uk/</A>
)<BR><BR>5. Why build a corpus anyway?<BR><BR>Linguists and lexicographers
are divided between those who believe that one<BR>can get all the evidence one
needs by consulting the intuitions of a native<BR>speaker (oneself, for
example),<BR>and those who believe that some source of evidence is
necessary. I worked<BR>in lexicography in the 1970s, before there were
corpora, and I can attest<BR>from personal experience that the evidence of a
large corpus provides<BR>important insights into words and meanings which
cannot be obtained by<BR>introspection (however hard one tries). So I firmly
believe that a corpus<BR>(and tools for corpus analysis) are necessary for
modern lexicography.<BR><BR>6. Oxford Dictionaries<BR><BR>One clarification re
your question: OED is not a corpus-based dictionary.<BR>The original 14-volume
Oxford English Dictionary (1878-1928) (OED) was a<BR>great historical
investigation into<BR>the origin and history of every English word, based
on a 19th-century<BR>collection of citations, each of which was written out on a
slip of paper by<BR>volunteers. (Phew!) It was compiled long before
computers were invented,<BR>but in the 1980s OED was loaded onto a computer.
It is now being very<BR>thoroughly revised by a large team of lexicographers
in Oxford. An on-line<BR>version is available. The OED editors have to take
account of many facts<BR>(philological and historical) in addition to corpus
evidence.<BR><BR>Oxford University Press is a vast publishing organization
with several<BR>divisions, which are run independently as separate businesses.
The Oxford<BR>Advanced Learner's Dictionary of Current English (OALDCE) is
published by the<BR>English Language Teaching Division. It was not
originally -- but is now -- <BR>a corpus-based dictionary. It was completely
revised and rewritten in the<BR>1990s in the light of corpus evidence. It has
nothing to do with the Oxford<BR>English Dictionary (OED), other than the fact
that it is published by the<BR>same publisher.<BR><BR>The one-volume "Oxford
Dictionary of English" is a corpus-based dictionary.<BR>It is aimed at native
speakers of English (but not at historical scholars).<BR>So it lies somewhere
between the OED and OALDCE. I was involved in creating<BR>the first edition of
the Oxford Dictionary of English, so naturally I<BR>think it is the best
dictionary ever!<BR><BR>I hope these remarks are helpful, and that you will
take the initiative in<BR>creating a corpus of Urdu. Let me know if I can help
in any way.<BR><BR>Best wishes,<BR><BR><BR>Patrick Hanks<BR><BR><BR>-----
Original Message ----- <BR>From: "ali72678" &lt;ali72678@yahoo.com&gt;<BR>To:
&lt;lexicographylist@yahoogroups.com&gt;<BR>Sent: Saturday, May 06, 2006 7:10
PM<BR>Subject: [Lexicog] Corpora Planning<BR><BR><BR>> Hi All<BR>> I
want to Know these things:<BR>> 1--What is Corpora planning?<BR>> 2-what
is the corpora planning of OED and other learner dictionaries?<BR>> Tell me
and oblige.<BR>>
Ali<BR><BR>
</BODY></HTML>