[Corpora-List] LEXICOGRAPHIC SOFTWARE: courses

Adam Kilgarriff adam.kilgarriff at itri.brighton.ac.uk
Wed Oct 2 05:33:30 UTC 2002


		     ===========================
			Lexicographic software
			    Short Courses
			      Responses
		     ===========================

J. L. DeLucca writes:
> We would like to hear from you WHAT computational tools you use at the
> present time for developing your LEXICOGRAPHIC projects.

This email provoked a number of responses on the lists above.  It is a
topic of which we have deep (and sometimes bitter) experience, and one
we will be addressing in depth in an upcoming short course:

 LCM04 Computers and Lexicography
 11-14 November 2002
 ITRI, Brighton, England
 Tutors:
    Adam Kilgarriff
    David Tugwell
 Guest lectures from:
    Steve Crowdy, Longman Dictionaries / Pearson Education
    Laura Elliot, Oxford University Press

For details and bookings see

    http://www.itri.brighton.ac.uk/lexicom

In response to some of the earlier replies to the mailout:

(1) Ramesh Krishnamurthy presents desiderata for both corpus resources
and corpus querying, and for a "Dictionary Writing System" (attached
below).  While Ramesh's list is useful as a starting point, it is
clearly not a full specification and does not present requirements for,
e.g., critical database issues such as 'sort' and 'cross-reference'
functionality.  Steve Crowdy has worked extensively on two full
specifications, both implemented and used in large-scale dictionary
production environments, which he will be talking about.  (One of these
is now commercially available.)
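
To give a concrete, deliberately tiny illustration of the kind of
'cross-reference' check a Dictionary Writing System has to run
routinely, here is a sketch in Python.  The entries and field layout
are invented for illustration; real DWS databases are of course far
richer.

    # Toy entry store: headword -> list of headwords it cross-refers to.
    entries = {
        "colour": ["color"],
        "color": ["colour"],
        "lexicography": ["dictionary"],
        "dictionary": [],          # no outgoing cross-references
        "thesaurus": ["synonym"],  # 'synonym' has no entry: dangling
    }

    def dangling_cross_references(db):
        """Return (headword, target) pairs whose target entry is missing."""
        return [(hw, target)
                for hw, targets in db.items()
                for target in targets
                if target not in db]

    for headword, target in dangling_cross_references(entries):
        print(f"{headword}: cross-reference to missing entry '{target}'")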

As John Williams notes (also attached below), it is useful to distinguish
the Dictionary Writing System from the Corpus Query System (a read-only
package, from the lexicographer's viewpoint, into which language corpora
are loaded and can be viewed flexibly).  The course mentioned above
covers the former.  Another Brighton course (website and bookings as
above) covers the latter (a short illustrative sketch follows the
course details):

 LCM07 Corpus Design and Use
 2-5 December 2002
 Tutors:
    Adam Kilgarriff
    Michael Rundell
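
To illustrate the Corpus Query System side: at its core it gives the
lexicographer read-only, keyword-in-context (KWIC) views of a corpus.
A minimal sketch follows; the toy corpus is invented, and real systems
add lemmatised search, regular-expression queries, sorting on left or
right context, frequency breakdowns by genre, and much more.

    # Minimal keyword-in-context (KWIC) concordancer over a tokenised corpus.
    corpus = (
        "the lexicographer checked the corpus before writing the entry . "
        "a large corpus yields thousands of examples for every common word ."
    ).split()

    def kwic(tokens, keyword, width=4):
        """Yield (left context, keyword, right context) for each hit."""
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                yield left, token, right

    for left, kw, right in kwic(corpus, "corpus"):
        print(f"{left:>40}  [{kw}]  {right}")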

(2) Baden Hughes and others listed the software they use, for example:

> >Languages:
> >Perl, C++, NLP++, VB, Java, Tcl
> >
> >Applications & Utilities:
> >TeX, sed, awk, grep, FileMaker Pro, MySQL, Excel, Word
>

While these are all relevant to various aspects of dictionary-making,
they fall far short of an environment designed to help lexicographers
efficiently produce a large, coherent and consistent dictionary.  Most
lexicographers are not programmers: they want a single tool for writing
a dictionary which takes care of the growing lexical database in such a
way that they need not think about it, and can get on with the job of
analysing meaning and writing entries.
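
To make that concrete, here is a minimal sketch of the sort of
structured record such a tool manages on the lexicographer's behalf,
with the simple per-entry checks it can run silently as entries are
saved.  The field names are invented for illustration and are not taken
from any particular Dictionary Writing System.

    from dataclasses import dataclass, field

    @dataclass
    class Sense:
        definition: str
        examples: list = field(default_factory=list)

    @dataclass
    class Entry:
        headword: str
        wordclass: str
        senses: list = field(default_factory=list)

        def problems(self):
            """Per-entry checks a DWS could run as the entry is saved."""
            if not self.senses:
                yield "entry has no senses"
            for n, sense in enumerate(self.senses, 1):
                if not sense.examples:
                    yield f"sense {n} has no example"
                for ex in sense.examples:
                    if self.headword not in ex:
                        yield f"sense {n}: example lacks the headword"

    entry = Entry("concordance", "noun",
                  [Sense("an alphabetical index of the words in a text, "
                         "each shown in its context",
                         ["she searched the concordance for unusual collocates"])])
    print(list(entry.problems()))   # [] means nothing to fix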


                         ?? WORKSHOP ??

If there is sufficient interest in the topic, we could append a
workshop, for pooling ideas and experiences of Dictionary Writing
Systems, to the LCM04 short course.  If you would be interested and in a
position to come to Brighton for it (most likely dates: around 15/16
Nov), do let me know, with the dates you could make; if there are
enough responses, I'll organise it.



   Adam Kilgarriff

   also for Sue Atkins, Michael Rundell (Lexicography Masterclass Ltd)
   and Lexicom group, University of Brighton


Ramesh Krishnamurthy writes:
 > Dear Dr De Lucca
 >
 > I have drawn up a checklist from my 15 years' experience in corpus-based computational lexicography.
 > I hope this helps.
 >
 > If you are going to create software for the whole process from raw data to publishing
 > of a dictionary/reference book, I think these would be my requirements.
 > Every process should be automated to the maximum, with allowance for human intervention
 > or input of preferences.
 >
 > 1. for monolingual dictionaries, a large corpus of L1
 > 2. for bilingual dictionaries, a large corpus of L1 and L2, with pointers in both directions to find
 > suggested equivalent words and phrases
 > 3. lemmatized frequency lists, to decide which words are important enough to include in the dictionary,
 > and which forms are significant, etc
 > 4. based on the frequency lists, a spelling checker, giving variant spellings
 > 5. pronunciation, with regional variations; concordanced tone units to hear word pronunciation in context
 > 6. statistics for regional variations
 > 7. statistics for genre distribution: is the wordform used in all types of text, or mainly in speech,
 > mainly in newspapers, mainly in novels, etc
 > 8. grammar - wordclass identification, colligation, grammar patterns (valency, complementation, etc);
 > with frequencies, regional variations, and genre-distribution
 > 9. collocation: individual collocates, lexical phrases, etc; with frequencies,  regional variations, and genre-distribution
 > 10. semantics - hypernyms, hyponyms, synonyms (i.e. thesaurus), antonyms
 > 11. pragmatics - any relevant information
 > 12. selected examples for each point from 3 onwards; large corpora yield hundreds or thousands of examples, so
 > 13. spoken data: typical speaker, context, interlocutor, etc
 > 14. concordancer to allow access to raw data and ability to check the information given from point 3 onwards
 > 15. automatic cut-and-paste to dictionary or reference book database
 > 16. customizable database templates for reference books
 > 17. validation routines to ensure database entry fields contain correct information and are in correct sequence
 > 18. ability to interrogate database on any field or subfield, to count entries, check that editorial policies have been followed,
 > check cross-references, check that examples contain the headword, etc
 > 19. automatic conversion from database to typesetting formats - columnation, page numbering, headers and footers, widows and orphans, typefaces, etc
 > 20. progress monitoring - which processes have been completed (e.g. compilation, editing, proofreading), which words have been done, who did them, when, etc
 >
 > All the tools should be flexible, to allow users to cater for local variations in any feature, from orthographic form (capitalization, punctuation, contractions, etc)
 > to size of field in the databases, etc.
 >
 > Best wishes
 > Ramesh
 >
 > Ramesh Krishnamurthy
 > Consultant, Collins Cobuild and Bank of English Corpus;
 > Honorary Research Fellow, Centre for Corpus Linguistics, University of Birmingham;
 > Honorary Research Fellow, Computational Linguistics Research Group, University of Wolverhampton.
 >
 >
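
To make points 3 and 9 of Ramesh's list concrete, here is a rough
sketch of a (crudely) lemmatised frequency list and of raw collocate
counts in a small window.  The corpus and lemma table are toys; a real
system would use a proper lemmatiser and an association measure such as
MI or log-likelihood rather than raw counts.

    from collections import Counter

    corpus = ("the dictionaries were compiled from corpora ; "
              "the dictionary was compiled by a large team ; "
              "corpora were consulted at every stage").split()

    # Toy lemma table; in practice this comes from a morphological analyser.
    lemmas = {"dictionaries": "dictionary", "were": "be", "was": "be",
              "compiled": "compile", "corpora": "corpus",
              "consulted": "consult"}
    lemmatised = [lemmas.get(w, w) for w in corpus]

    # Point 3: lemmatised frequency list.
    print(Counter(lemmatised).most_common(5))

    # Point 9: collocates of 'compile' within two words either side.
    window = 2
    collocates = Counter()
    for i, lemma in enumerate(lemmatised):
        if lemma == "compile":
            start = max(0, i - window)
            stop = min(len(lemmatised), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    collocates[lemmatised[j]] += 1
    print(collocates.most_common(5))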

John Williams writes:
 >
 > Dear Dr De Lucca,
 >
 > As a former colleague of Ramesh, I haven't got much to add to his very
 > comprehensive checklist. But you may wish to consider to what extent the
 > data requirements (approx. points 1-14 of Ramesh's list) need to be
 > integrated with the compilation package proper (approx. points 15-20).
 > For maximum reusability, you may want to separate the two components,
 > and maybe your brief only covers the latter.
 >
 > I would add a couple of things:
 > - since many big dictionary projects today are compiled by dispersed
 > teams working on their own computers, the software ideally needs to be
 > platform-independent, and include some kind of networking facility for
 > ease of file transfer;
 > - again, for maximum reusability and flexibility, the software should
 > allow the project manager to define his/her own tagset (though a basic
 > tagset should be included initially). There's a Croatian package called
 > Softlex that allows precisely this.
 >
 > Best wishes,
 >
 > John
 >
 >
 >
 > --
 >
 > John Williams
 >
 > Freelance Lexicographer
 >
 > Tel/Fax: (+44) (0)151 733 5459
 > Mobile: (+44) (0)7968 027829
 >
 > Web: http://www.eflex-mcmail.com
 >
 > E-mail: johnw at whoever.com
 >
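
On John's point about user-definable tagsets: one simple way to achieve
this is to treat the tagset as data supplied by the project manager,
and have the compilation software validate entries against whatever
tagset is loaded rather than hard-wiring one.  (I do not know how
Softlex does this internally; the sketch below, with invented entries,
is only an illustration.)

    DEFAULT_TAGSET = {"noun", "verb", "adjective", "adverb"}

    def unknown_wordclasses(entries, tagset=DEFAULT_TAGSET):
        """Report (headword, wordclass) pairs not covered by the tagset."""
        return [(hw, wc) for hw, wc in entries if wc not in tagset]

    entries = [("corpus", "noun"), ("lemmatise", "verb"), ("kwic", "abbrev")]

    # With the basic tagset, 'abbrev' is flagged ...
    print(unknown_wordclasses(entries))
    # ... but the project manager can extend the tagset and rerun the check.
    print(unknown_wordclasses(entries, DEFAULT_TAGSET | {"abbrev"}))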

 > ----- Original Message -----
 > From: delucca at nilc.icmc.usp.br
 > To: corpora at hd.uib.no
 > Cc: delucca at usp.br
 > Subject: [Corpora-List] Dictionary Creation Software
 >
 > Dear Colleagues,
 >
 > We are a team of researchers in Computational Linguistics and, at the
 > present time, we are working on constructing software tools for making
 > dictionaries.
 >
 > We would like to hear from those who have experience with compiling
 > dictionaries and vocabularies on the following: WHAT would you like,
 > need, and hope for from Dictionary Creation Software?  What types of
 > tools would be essential for making dictionaries, vocabularies and any
 > other type of reference work?  A concordancer?  A spelling checker?
 > Pronunciation?
 >
 > We look forward to hearing from you with great interest.
 >
 > Thank you very much in advance for your advice.
 >
 > Sincerely
 >
 >
 >
 >
 > J.L. DeLucca, PhD
 >
 > Interinstitutional Center for Research and Development in Computational
 > Linguistics (NILC)
 > Sao Paulo University
 >
 >

--
NEW!! MSc and Short Courses in Lexical Computing and Lexicography
Info at

http://www.itri.brighton.ac.uk/lexicom

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Senior Research Fellow                         tel: (44) 1273 642919
Information Technology Research Institute           (44) 1273 642900
University of Brighton                         fax: (44) 1273 642908
Lewes Road
Brighton BN2 4GJ         email:      Adam.Kilgarriff at itri.bton.ac.uk
UK                       http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


