[Lexicog] data format

billposer at ALUM.MIT.EDU billposer at ALUM.MIT.EDU
Tue Jun 30 04:05:53 UTC 2009


Exactly what you should do depends on what you want to do with the information.
However, if what you want initially is to have the information on-line in
a format that allows you easily to manipulate it and in particular easily
to reproduce, with corrections or additions, or in a modified format, something
like the original print dictionary, or a web equivalent thereof, then
I would recommend that you:

(a) decide how many distinct pieces of information are present in the text
	(e.g. English word, language X gloss, English example sentence,
	language X translation of example sentence);

(b) choose tags for these pieces of information, other things being equal
	probably guided by the Standard Dictionary Format tags;

(c) enter the data in a Shoebox-style format.

When I say Shoebox-style, I mean with entries separated by blank lines
and with fields identified by tags, but with your choice of field-initiator
and tag-value separator. Shoebox/Toolbox uses backslash for the field-initiator,
which is fine unless you work with Unix-type tools, where backslash is the
usual escape character and so is a huge pain in the neck. I use %. I know of
someone who uses "==". It doesn't much matter what it is so long as it isn't
likely to occur in your data.
Shoebox/Toolbox uses whitespace for the tag-value separator. You may prefer
a single visible character, which reduces the likelihood of some kinds of
errors and is slightly easier to parse. In other words, you could have
something like this:

\w able 
\e a small child is able to walk
\t silo ala meqora ta o rovea talio

or something like this:

%W:able
%E:a small child is able to walk
%T:silo ala meqora ta o rovea talio

These are easily interconvertible, so it doesn't make a lot of difference.
However, if it makes no difference to you, you are probably best off
using Shoebox/Toolbox format since you can then use those tools and other
tools that can read that format without any further work.
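
To illustrate how simple the conversion is, here is a minimal Python sketch
that goes between the two example formats above. The function names are
hypothetical, and the upper/lower-casing of tags is only an assumption based
on the examples (\w versus %W); tags of mixed case would not survive a round
trip. Lines that do not begin with the field-initiator are passed through
unchanged.

```python
import re

def to_percent(text):
    """Convert '\\tag value' lines to '%TAG:value' lines (assumed convention)."""
    out = []
    for line in text.splitlines():
        m = re.match(r'\\(\S+)\s*(.*)', line)
        if m:
            # Tag is upper-cased only to mirror the example in the text.
            out.append('%' + m.group(1).upper() + ':' + m.group(2).rstrip())
        else:
            out.append(line)  # blank lines and continuation lines pass through
    return '\n'.join(out)

def to_backslash(text):
    """Convert '%TAG:value' lines back to '\\tag value' lines."""
    out = []
    for line in text.splitlines():
        m = re.match(r'%([^:\s]+):(.*)', line)
        if m:
            out.append('\\' + m.group(1).lower() + ' ' + m.group(2).strip())
        else:
            out.append(line)
    return '\n'.join(out)
```

Calling to_percent on the first example should give the second, and
to_backslash should take it back again (minus the trailing space).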

Strictly speaking, Shoebox/Toolbox does not require blank lines between
records but instead requires that each record begin with a designated
tag. If you want to be sure of being able to use tools designed for this
format, you should make sure to place that field at the beginning
of the record. 
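
As a rough sketch of what splitting on a designated tag looks like, the
following Python function starts a new record whenever a line begins with
the record tag and ignores blank lines altogether. The function name is
hypothetical, and \w is assumed as the record tag only because it is the
first field in the example above; substitute whatever tag your tools expect.

```python
def split_records(lines, record_tag='\\w'):
    """Group lines into records, starting a new one at each record_tag line."""
    records = []
    current = []
    for line in lines:
        line = line.rstrip('\n')
        if not line.strip():
            continue  # blank lines are optional and carry no meaning here
        tag = line.split(None, 1)[0]
        if tag == record_tag and current:
            records.append(current)
            current = []
        current.append(line)
    if current:
        records.append(current)
    return records
```

Because it accepts any iterable of lines, it can be handed an open file
directly, e.g. split_records(open('dict.txt', encoding='utf-8')), where
dict.txt is a placeholder file name.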
 
My recommendation is, at this stage, not to use ANY of the dictionary programs.
For bulk data entry of this type, with simple record structure, the best
data entry method is your favorite text editor (or word processor, if you make
sure to export as plain text, without word processor cruft). I have a fair
amount of experience with this. When I first started work on Carrier, I entered
the entire existing dictionary this way so as to be able to search it.
Later, when I decided how I wanted to handle verbs, I entered masses of
paradigms, about 500 entries per day. For tasks such as these I speeded things
up a bit by creating an editor macro that inserted a record template for me
to fill in.

Your mileage may vary, but in my experience for this kind of material
using a dictionary program for bulk data entry slows you down
considerably. Their data entry tools are designed to lead you through
the process step-by-step and to keep things consistent when adding entries.
By all means use them if you want the hand-holding, but consider whether
you really need it.

Once you have your data in a format like this, you can read it into
any of a number of tools that understand it, or write your own tools
(or have them written for you).
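
To give an idea of how little is involved in writing your own tool, here is
a minimal Python sketch that reads blank-line-separated records and searches
one field for a substring. The function names are hypothetical, and it
assumes whitespace as the tag-value separator and treats any line that does
not start with the field-initiator as a continuation of the previous field.

```python
import re

def parse_records(text, initiator='\\'):
    """Return a list of records, each a list of (tag, value) pairs."""
    records = []
    for block in re.split(r'\n\s*\n', text.strip()):
        fields = []
        for line in block.splitlines():
            if line.startswith(initiator):
                parts = line[len(initiator):].split(None, 1)
                tag = parts[0] if parts else ''
                value = parts[1].strip() if len(parts) > 1 else ''
                fields.append((tag, value))
            elif fields and line.strip():
                # Assumed convention: a continuation of the previous field.
                tag, value = fields[-1]
                fields[-1] = (tag, (value + ' ' + line.strip()).strip())
        if fields:
            records.append(fields)
    return records

def search(records, tag, needle):
    """Return the records in which field `tag` contains `needle`."""
    return [r for r in records
            if any(t == tag and needle in v for t, v in r)]
```

For instance, search(parse_records(text), 'e', 'walk') would return the
records whose \e field mentions "walk". Passing initiator='%' handles a
different field-initiator, though a non-whitespace tag-value separator such
as ":" would need its own splitting rule.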

Bill

 

