[Corpora-List] Re: Dictionary Creation Software
Ramesh Krishnamurthy
ramesh at easynet.co.uk
Wed Sep 18 15:03:14 UTC 2002
Dear Dr De Lucca
I have drawn up a checklist from my 15 years' experience in corpus-based computational lexicography.
I hope this helps.
If you are going to create software for the whole process from raw data to publishing
of a dictionary/reference book, I think these would be my requirements.
Every process should be automated to the maximum, with allowance for human intervention
or input of preferences.
1. for monolingual dictionaries, a large corpus of L1
2. for bilingual dictionaries, a large corpus of L1 and L2, with pointers in both directions to find
suggested equivalent words and phrases
3. lemmatized frequency lists, to decide which words are important enough to include in the dictionary,
and which forms are significant, etc
4. based on the frequency lists, a spelling checker, giving variant spellings
5. pronunciation, with regional variations; concordanced tone units to hear word pronunciation in context
6. statistics for regional variations
7. statistics for genre distribution: is the wordform used in all types of text, or mainly in speech,
mainly in newspapers, mainly in novels, etc
8. grammar - wordclass identification, colligation, grammar patterns (valency, complementation, etc);
with frequencies, regional variations, and genre-distribution
9. collocation: individual collocates, lexical phrases, etc; with frequencies, regional variations, and genre-distribution
10. semantics - hypernyms, hyponyms, synonyms (i.e. thesaurus), antonyms
11. pragmatics - any relevant information
12. selected examples for each point from 3 onwards; large corpora yield hundreds or thousands of examples, so
13. spoken data: typical speaker, context, interlocutor, etc
14. concordancer to allow access to raw data and ability to check the information given from point 3 onwards
15. automatic cut-and-paste to dictionary or reference book database
16. customizable database templates for reference books
17. validation routines to ensure database entry fields contain correct information and are in correct sequence
18. ability to interrogate database on any field or subfield, to count entries, check that editorial policies have been followed,
check cross-references, check that examples contain the headword, etc
19. automatic conversion from database to typesetting formats - columnation, page numbering, headers and footers, widows and orphans, typefaces, etc
20. progress monitoring - which processes have been completed (e.g. compilation, editing, proofreading), which words have been done, who did them, when, etc
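As a rough illustration of points 17 and 18, here is a minimal Python sketch of a validation pass over dictionary entries. The `Entry` record, its field names, and the `validate` function are hypothetical and exist only for this example; a real dictionary database would have its own schema. The headword check is a naive substring match, so an example that uses only an inflected form (e.g. "ran" for "run") would be flagged and would need a human or a lemmatizer to resolve.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    # Hypothetical entry record; the field names are assumptions for this sketch.
    headword: str
    wordclass: str
    examples: list = field(default_factory=list)
    cross_refs: list = field(default_factory=list)


def validate(entries):
    """Return a list of (headword, problem) pairs found in the entries."""
    problems = []
    headwords = {e.headword for e in entries}
    for e in entries:
        # Required field must not be empty.
        if not e.wordclass:
            problems.append((e.headword, "missing wordclass"))
        # Each example should contain the headword (naive substring test).
        for ex in e.examples:
            if e.headword.lower() not in ex.lower():
                problems.append((e.headword, f"example lacks headword: {ex!r}"))
        # Each cross-reference should point at an existing entry.
        for ref in e.cross_refs:
            if ref not in headwords:
                problems.append((e.headword, f"dangling cross-reference: {ref}"))
    # Entries should be in alphabetical sequence (case-insensitive).
    for prev, cur in zip(entries, entries[1:]):
        if cur.headword.lower() < prev.headword.lower():
            problems.append((cur.headword, "out of alphabetical order"))
    return problems


if __name__ == "__main__":
    sample = [
        Entry("bank", "noun", ["She sat on the river bank."], ["shore"]),
        Entry("apple", "", ["He ate fruit."], []),
    ]
    for headword, problem in validate(sample):
        print(f"{headword}: {problem}")
```

Returning a list of problems rather than stopping at the first error leaves room for the human intervention mentioned above: an editor can review the whole report and decide which items are real faults.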
All the tools should be flexible, to allow users to cater for local variations in any feature, from orthographic form (capitalization, punctuation, contractions, etc)
to size of field in the databases, etc.
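To make the idea of user-adjustable tools concrete, here is a small Python sketch of a lemmatized frequency list (point 3) in which the treatment of capitalization and contractions is left to the user. The function name `frequency_list`, its parameters, and the toy `lemma_map` are assumptions made for this example; a real system would take its lemma mappings from a proper lemmatizer and would need a more careful tokenizer.

```python
import re
from collections import Counter


def frequency_list(text, lemma_map=None, fold_case=True, keep_contractions=True):
    """Count lemmas in text, with user-selectable orthographic handling.

    lemma_map         : dict mapping word forms to lemmas (hypothetical input)
    fold_case         : treat 'The' and 'the' as the same form
    keep_contractions : keep "don't" as one token instead of splitting it
    """
    letters = r"[^\W\d_]+"  # runs of Unicode letters
    pattern = letters + r"(?:'" + letters + r")*" if keep_contractions else letters
    tokens = re.findall(pattern, text)
    if fold_case:
        tokens = [t.lower() for t in tokens]
    lemma_map = lemma_map or {}
    # Forms without a mapping are counted under their own spelling.
    return Counter(lemma_map.get(t, t) for t in tokens)


if __name__ == "__main__":
    text = "The dogs don't run. The dog ran."
    lemmas = {"dogs": "dog", "ran": "run"}
    for lemma, count in frequency_list(text, lemmas).most_common():
        print(lemma, count)
```

Keeping the orthographic choices as parameters, rather than fixing them in the code, is one simple way to let each project apply its own local conventions.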
Best wishes
Ramesh
Ramesh Krishnamurthy
Consultant, Collins Cobuild and Bank of English Corpus;
Honorary Research Fellow, Centre for Corpus Linguistics, University of Birmingham;
Honorary Research Fellow, Computational Linguistics Research Group, University of Wolverhampton.
----- Original Message -----
From: delucca at nilc.icmc.usp.br
To: corpora at hd.uib.no
Cc: delucca at usp.br
Subject: [Corpora-List] Dictionary Creation Software
Dear Colleagues,
We are a team of researchers in Computational Linguistics and, at the
present time, we are working on constructing software tools for making
dictionaries.
We would like to hear from those who have experience of compiling
dictionaries
and vocabularies about the following: WHAT you would like, would need, and would
hope for from Dictionary Creation Software. What types of tools would be essential
for making dictionaries, vocabularies and any other type of reference work? A
concordancer? A spelling checker? Pronunciation?
We look forward to hearing from you with great interest.
Thank you very much in advance for your advice.
Sincerely
J.L. DeLucca, PhD
Interinstitutional Center for Research and Development in Computational
Linguistics (NILC)
Sao Paulo University