[Lexicog] data format

lengosi pcunger at MSN.COM
Tue Jun 30 04:21:50 UTC 2009


Thanks, Bill. That was very helpful. Simple text format it is. For now . . . ;-)

Paul

--- In lexicographylist at yahoogroups.com, billposer at ... wrote:
>
> Exactly what you should do depends on what you want to do with the information.
> However, if what you want initially is to have the information on-line in
> a format that allows you easily to manipulate it and in particular easily
> to reproduce, with corrections or additions, or in a modified format, something
> like the original print dictionary, or a web equivalent thereof, then
> I would recommend that you:
> 
> (a) decide how many distinct pieces of information are present in the text
> 	(e.g. English word, language X gloss, English example sentence,
> 	language X translation of example sentence);
> 
> (b) choose tags for these pieces of information, other things being equal
> 	probably guided by the Standard Dictionary Format tags;
> 
> (c) enter the data in a Shoebox-style format.
> 
> When I say Shoebox-style, I mean with entries separated by blank lines
> and with fields identified by tags, but with your choice of field-initiator
> and tag-value separator. Shoebox/Toolbox uses backslash for the field-initiator,
> which is fine unless you work with Unix-type tools, in which case it is a
> huge pain-in-the-neck. I use %. I know of someone who uses "==". It doesn't
> much matter what it is so long as it isn't likely to occur in your data.
> Shoebox/Toolbox uses whitespace for the tag-value separator. You may prefer
> a single visible character, which reduces the likelihood of some kinds of
> errors and is slightly easier to parse. In other words, you could have
> something like this:
> 
> \w able
> \e a small child is able to walk
> \t silo ala meqora ta o rovea talio
> 
> or something like this:
> 
> %W:able
> %E:a small child is able to walk
> %T:silo ala meqora ta o rovea talio
> 
> These are easily interconvertible, so it doesn't make a lot of difference.
> However, if it makes no difference to you, you are probably best off
> using Shoebox/Toolbox format since you can then use those tools and other
> tools that can read that format without any further work.
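
A rough sketch of the interconversion mentioned above, going from the backslash style to the %TAG:value style. This is not code from the quoted message: the function name backslash_to_percent is made up for illustration, and upper-casing the tag is an assumption based only on the two samples shown (\w becomes %W).

```python
import re

# Matches a Shoebox-style field line: backslash, tag, whitespace, value.
FIELD = re.compile(r"^\\(\w+)\s*(.*)$")


def backslash_to_percent(text):
    """Rewrite '\\tag value' lines as '%TAG:value' (hypothetical helper)."""
    out = []
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            tag, value = match.groups()
            # Assumption: tags are upper-cased, as in the %W/%E/%T sample.
            out.append(f"%{tag.upper()}:{value.rstrip()}")
        else:
            # Blank lines (entry separators) and anything else pass through.
            out.append(line)
    return "\n".join(out)


sample = (
    "\\w able \n"
    "\\e a small child is able to walk\n"
    "\\t silo ala meqora ta o rovea talio"
)
print(backslash_to_percent(sample))
```

Going the other way is the same idea with the pattern and the output format swapped.
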
> 
> Strictly speaking, Shoebox/Toolbox does not require blank lines between
> records but instead requires that each record begin with a designated
> tag. If you want to be sure of being able to use tools designed for this
> format, you should make sure to place that field at the beginning
> of the record. 
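
To illustrate splitting records on a designated record-initial tag rather than on blank lines, here is a minimal sketch. It assumes \w is the record-initial tag (as in the sample above) and that a line not starting with a backslash continues the previous field; split_records is a hypothetical name, not part of Shoebox or Toolbox.

```python
def split_records(text, record_tag="w"):
    """Group backslash-tagged fields into records (hypothetical helper).

    A new record starts whenever a field with `record_tag` is seen;
    blank lines are ignored rather than treated as separators.
    """
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("\\"):
            parts = line[1:].split(None, 1)
            if not parts:
                continue  # a lone backslash carries no tag
            tag = parts[0]
            value = parts[1].strip() if len(parts) > 1 else ""
            if tag == record_tag:
                records.append([])
            if records:  # fields before the first record tag are skipped
                records[-1].append((tag, value))
        elif records and records[-1]:
            # Assumption: untagged lines continue the previous field.
            tag, value = records[-1][-1]
            records[-1][-1] = (tag, (value + " " + line).strip())
    return records


sample = (
    "\\w able \n"
    "\\e a small child is able to walk\n"
    "\\t silo ala meqora ta o rovea talio\n"
)
for record in split_records(sample):
    print(record)
```
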
>  
> My recommendation is, at this stage, not to use ANY of the dictionary programs.
> For bulk data entry of this type, with simple record structure, the best
> data entry method is your favorite text editor (or word processor, if you make
> sure to export as plain text, without word processor cruft). I have a fair
> amount of experience with this. When I first started work on Carrier, I entered
> the entire existing dictionary this way so as to be able to search it.
> Later, when I decided how I wanted to handle verbs, I entered masses of
> paradigms, about 500 entries per day. For tasks such as these I speeded things
> up a bit by creating an editor macro that inserted a record template for me
> to fill in.
> 
> Your mileage may vary, but in my experience for this kind of material
> using a dictionary program for bulk data entry slows you down
> considerably. Their data entry tools are designed to lead you through
> the process step-by-step and to keep things consistent when adding entries.
> By all means use them if you want the hand-holding, but consider whether
> you really need it.
> 
> Once you have your data in a format like this, you can read it into
> any of a number of tools that understand it, write your own tools,
> or have them written for you.
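
As an example of what a small home-made tool might look like, here is a sketch that reads the %TAG:value variant, with entries separated by blank lines, and searches one field. The names parse_entries and search are hypothetical, and the initiator and separator are parameters because the message leaves that choice to the user.

```python
import re


def parse_entries(text, initiator="%", separator=":"):
    """Parse blank-line-separated entries into dicts of tag -> list of values."""
    entries = []
    for block in re.split(r"\n\s*\n", text):
        entry = {}
        for line in block.splitlines():
            if line.startswith(initiator) and separator in line:
                tag, value = line[len(initiator):].split(separator, 1)
                # A tag may repeat within an entry, so values are kept in a list.
                entry.setdefault(tag, []).append(value.strip())
        if entry:
            entries.append(entry)
    return entries


def search(entries, tag, needle):
    """Return the entries in which some value of `tag` contains `needle`."""
    return [e for e in entries if any(needle in v for v in e.get(tag, []))]


sample = (
    "%W:able\n"
    "%E:a small child is able to walk\n"
    "%T:silo ala meqora ta o rovea talio\n"
)
entries = parse_entries(sample)
for hit in search(entries, "E", "walk"):
    print(hit["W"], hit["E"])
```
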
> 
> Bill
>



