<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML><HEAD>

<META http-equiv=Content-Type content="text/html; charset=us-ascii">

<META content="MSHTML 6.00.2900.2873" name=GENERATOR></HEAD>

<BODY>


<DIV dir=ltr align=left><SPAN class=172314717-19052006><FONT face=Arial 

color=#0000ff size=2>Dear Patrick,</FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=172314717-19052006><FONT face=Arial 

color=#0000ff size=2></FONT></SPAN> </DIV>

<DIV dir=ltr align=left><SPAN class=172314717-19052006><FONT face=Arial 

color=#0000ff size=2>Thank you for your detailed reply to Ali's question. I 

found it very informative as well, as I'm just beginning to learn about corpus 

linguistics.</FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=172314717-19052006><FONT face=Arial 

color=#0000ff size=2></FONT></SPAN> </DIV>

<DIV dir=ltr align=left><SPAN class=172314717-19052006><FONT face=Arial 

color=#0000ff size=2>Might I ask if there is a list compiled and 

published on the WWW of all (known) corpora projects, or at least the main ones 

for each language? Does Great Britain have a central coordinating body for 

corpora of Britain's main Germanic and Celtic home languages? Or indeed of all 

languages used by large communities in Britain?</FONT></SPAN></DIV>

<DIV> </DIV><!-- Converted from text/plain format -->

<P><FONT size=2><!-- Converted from text/plain format --></P>

<P><FONT size=2>Alexander Justice<BR>Reference Librarian<BR><BR>Von der Ahe 

Library<BR>Loyola Marymount University<BR>One LMU Drive<BR>Los Angeles, CA 

90045<BR><BR>310.338.5947<BR>ajustice@lmu.edu<BR><BR><A 

href="http://www.lmu.edu/library">http://www.lmu.edu/library</A> </FONT></P>

<P><FONT face=Arial color=#0000ff></FONT><BR><BR></FONT></P>

<DIV> </DIV><BR>

<BLOCKQUOTE style="MARGIN-RIGHT: 0px">

  <DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left>

  <HR tabIndex=-1>

  <FONT face=Tahoma size=2><B>From:</B> lexicographylist@yahoogroups.com 

  [mailto:lexicographylist@yahoogroups.com] <B>On Behalf Of </B>Patrick 

  Hanks<BR><B>Sent:</B> Tuesday, May 09, 2006 3:41 AM<BR><B>To:</B> 

  lexicographylist@yahoogroups.com<BR><B>Cc:</B> Driss El-Khattab; Pavel 

  Rychly<BR><B>Subject:</B> Re: [Lexicog] Corpora Planning<BR></FONT><BR></DIV>

  <DIV></DIV><BR>Dear Ali (and other friends)<BR><BR>No one seems to have 

  answered your question posted on lexicographylist about<BR>corpora, so I will 

  attempt a brief summary, as this lies in my area of<BR>expertise.  I'll 

  start with the obvious -- so forgive me if I tell you what<BR>you know already 

  -- and then I'll move on to some particular issues.<BR><BR>1. What are 

  corpora?<BR><BR>The word "Corpora" is the plural of "corpus" -- originally a 

  Latin word<BR>meaning "body". In modern usage, a corpus is simply a collection 

  of texts in<BR>electronic (machine-readable) form, for processing by computer. 

  Corpora<BR>provide evidence for how words are actually used. A famous (and 

  freely<BR>available) corpus for English is the British National Corpus (BNC) 

  of 100<BR>million words. It's ten years old now , but still a useful resource 

  for<BR>finding out how English words are used. See <A 

  href="http://www.natcorp.ox.ac.uk/">http://www.natcorp.ox.ac.uk/</A> 

  .<BR>Recently, Oxford University Press announced the "Oxford English Corpus" 

  of 1<BR>billion words (an order of magnitude bigger than BNC).  See<BR><A 

  href="http://www.askoxford.com/oec/">http://www.askoxford.com/oec/</A><BR><BR>2. 

  Planning a Corpus<BR><BR>Many languages now have at least one corpus -- but 

  nevertheless new corpora are<BR>still being planned and built in these 

  languages. In English, for example,<BR>special subject corpora  are now 

  being planned and built -- so someone<BR>interested in the language of 

  medicine will build a corpus of medical texts.<BR>Another example: a corpus of 

  historical texts provides evidence for how<BR>words were used in the 

  past.<BR><BR>Other languages do not yet have a general corpus at all, so 

  corpus planners<BR>must start from scratch.  Typically, a general corpus 

  will consist of lots<BR>of different kinds of texts --- some journalism 

  (electronic versions of<BR>newspapers and journals are easy to obtain in many 

  languages), some text<BR>books, some academic writing, some fiction, some web 

  pages, and some<BR>transcripts of unscripted conversation (though the 

  latter is difficult<BR>to get hold of -- and can also be difficult to 

  interpret).<BR><BR>Corpus planners generally avoid poetry and plays, as these 

  are texts in<BR>which language is often used in unusual ways.<BR><BR>A good 

  introduction to corpus planning, though somewhat out of date now,<BR>is an 

  article by Sue Atkins, Jeremy Clear, and Nicholas Ostler, 1992:<BR>"Corpus 

  Design Criteria" in the journal Literary and Linguistic Computing.<BR>See <A 

  href="http://llc.oxfordjournals.org/cgi/content/abstract/7/1/1">http://llc.oxfordjournals.org/cgi/content/abstract/7/1/1</A><BR><BR>Nowadays, 

  the Internet has made corpus building much easier. Indeed, for<BR>some 

  purposes, the whole of the Internet is sometimes regarded as one 

  vast<BR>multilingual corpus. See a special issue of the journal 

  "Computational<BR>Linguistics" edited by Adam Kilgarriff and Gregory 

  Grefenstette:<BR><A 

  href="http://www.mitpressjournals.org/doi/abs/10.1162/089120103322711569">http://www.mitpressjournals.org/doi/abs/10.1162/089120103322711569</A>  

  .<BR><BR>3. Getting Permissions<BR><BR>Someone building a corpus for general 

  use must get permission from each<BR>author (or copyright owner -- typically, 

  the publisher)  before adding a<BR>text to a corpus. This can raise 

  difficult questions (e.g. "Who owns the<BR>text?" and "What can I say that 

  will persuade the text owner to give<BR>permission?")<BR><BR>4. Building a 

  Corpus<BR><BR>Once the texts have been obtained, along with permission to use 

  them, some<BR>basic computational work has to be carried out.  The texts 

  must be<BR>standardized, so that they are all in the same format for the 

  computer, then<BR>they must be 1) tokenized (finding word boundaries, deciding 

  what to do<BR>about punctuation marks), 2) lemmatized (if one wants the 

  computer<BR>to find "take", "takes", "taking", "took", and/or "taken" in response to a 

  user's<BR>inquiry about "take"), and 3) word-class tagged (so that, for 

  example, the<BR>computer can separate "report", noun, from "report", verb). 

  Then each word<BR>in each text must be indexed (a highly technical procedure), 

  so that the<BR>computer can instantly retrieve the information that users ask 

  for about a<BR>word or phrase or other linguistic item.<BR><BR>Fortunately, 

  there are now some experts who specialize in building corpora<BR>of any kind 

  in any language.  The language does not matter, because the<BR>procedures 

  for processing words (letters, symbols) are<BR>language-independent -- i.e. 

  they are, in principle, the same for any<BR>language. Among the best are Pavel 

  Rychly and Adam Kilgarriff (see<BR><A 

  href="http://www.sketchengine.co.uk/">http://www.sketchengine.co.uk/</A> 

  ).<BR><BR>5. Why build a corpus anyway?<BR><BR>Linguists and lexicographers 

  are divided between those who believe that one<BR>can get all the evidence one 

  needs by consulting the intuitions of a native<BR>speaker (oneself, for 

  example),<BR>and those who believe that some source of evidence is 

  necessary.  I worked<BR>in lexicography in the 1970s, before there were 

  corpora, and I can attest<BR>from personal experience that the evidence of a 

  large corpus provides<BR>important insights into words and meanings which 

  cannot be obtained by<BR>introspection (however hard one tries). So I firmly 

  believe that a corpus<BR>(and tools for corpus analysis) is necessary for 

  modern lexicography.<BR><BR>6. Oxford Dictionaries<BR><BR>One clarification re 

  your question: OED is not a corpus-based dictionary.<BR>The original 14-volume 

  Oxford English Dictionary (1878-1928) (OED) was a<BR>great historical 

  investigation into<BR>the origin and history of every English word, based 

  on a 19th-century<BR>collection of citations, each of which was written out on a 

  slip of paper by<BR>volunteers. (Phew!)  It was compiled long before 

  computers were invented,<BR>but in the 1980s OED was loaded onto a computer. 

  It is now being very<BR>thoroughly revised by a large team of lexicographers 

  in Oxford. An on-line<BR>version is available. The OED editors have to take 

  account of many facts<BR>(philological and historical) in addition to corpus 

  evidence.<BR><BR>Oxford University Press is a vast publishing organization 

  with several<BR>divisions, which are run independently as separate businesses. 

  The Oxford<BR>Advanced Learners Dictionary of Current English (OALDCE) is 

  published by the<BR>English Language Teaching Division.  It was not 

  originally -- but is now -- <BR>a corpus-based dictionary. It was completely 

  revised and rewritten in the<BR>1990s in the light of corpus evidence. It has 

  nothing to do with the Oxford<BR>English Dictionary (OED), other than the fact 

  that it is published by the<BR>same publisher.<BR><BR>The one-volume "Oxford 

  Dictionary of English" is a corpus-based dictionary.<BR>It is aimed at native 

  speakers of English (but not at historical scholars).<BR>So it lies somewhere 

  between the OED and OALDCE. I was involved in creating<BR>the first edition of 

  the Oxford Dictionary of English book, so naturally I<BR>think it is the best 

  dictionary ever!<BR><BR>I hope these remarks are helpful, and that you will 

  take the initiative in<BR>creating a corpus of Urdu. Let me know if I can help 

  in any way.<BR><BR>Best wishes,<BR><BR><BR>Patrick Hanks<BR><BR><BR>----- 

  Original Message ----- <BR>From: "ali72678" &lt;ali72678@yahoo.com&gt;<BR>To: 

  &lt;lexicographylist@yahoogroups.com&gt;<BR>Sent: Saturday, May 06, 2006 7:10 

  PM<BR>Subject: [Lexicog] Corpora Planning<BR><BR><BR>> Hi All<BR>> I 

  want to Know these things:<BR>> 1--What is Corpora planning?<BR>> 2-what 

  is the corpora planning of OED and other learner dictionaries?<BR>> Tell me 

  and oblige.<BR>> 

  Ali<BR><BR>
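  <BR>As a rough illustration of the steps listed under "4. Building a Corpus" (tokenizing, lemmatizing, word-class tagging, and indexing), here is a minimal Python sketch. It is only a sketch under stated assumptions: the tiny lemma table, the one-rule tagger, and all function names are hypothetical and are not taken from any real corpus tool. A real system would rely on a full lexicon and a trained tagger.<BR><BR>

<PRE>
```python
import re
from collections import defaultdict

# Hypothetical toy lexicon: maps inflected forms to a lemma.
LEMMAS = {"takes": "take", "taking": "take", "took": "take", "taken": "take",
          "reports": "report"}

# Hypothetical toy tagging rule: a word right after one of these is a noun.
DETERMINERS = {"the", "a", "an"}


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation marks."""
    return re.findall(r"[a-z']+", text.lower())


def lemmatize(token):
    """Return the lemma for a token, falling back to the token itself."""
    return LEMMAS.get(token, token)


def tag(tokens):
    """Assign a crude word-class tag: NOUN after a determiner, else OTHER."""
    tags = []
    previous = None
    for token in tokens:
        tags.append("NOUN" if previous in DETERMINERS else "OTHER")
        previous = token
    return tags


def build_index(texts):
    """Map each (lemma, tag) pair to the (text number, position) pairs where it occurs."""
    index = defaultdict(list)
    for text_id, text in enumerate(texts):
        tokens = tokenize(text)
        for position, (token, word_class) in enumerate(zip(tokens, tag(tokens))):
            index[(lemmatize(token), word_class)].append((text_id, position))
    return index


texts = ["She took the report.", "They report what he takes."]
index = build_index(texts)
print(index[("take", "OTHER")])    # [(0, 1), (1, 4)]
print(index[("report", "NOUN")])   # [(0, 3)]
print(index[("report", "OTHER")])  # [(1, 1)]
```
</PRE>

  With the two sample sentences, looking up the lemma "take" finds both "took" and "takes", and "report" as a noun is kept apart from "report" in other uses, which is the kind of retrieval the indexing step is meant to support.<BR><BR>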




</BODY></HTML>