[Corpora-List] Format for context info

John F. Sowa sowa at bestweb.net
Thu May 12 12:58:15 UTC 2011


On 5/12/2011 7:48 AM, Kilian Evang wrote:
>> I was thinking just to give them as tab separated txt files as that
>> seems the most universal, e.g. something like:
>>
>> how[tab]are[tab]54
>
> I think that's a good idea. Google's huge n-gram corpus is also released
> in this format (though I'm not sure if they use tabs or spaces):

CSV (Comma Separated Values) has been a format for text files since
prehistoric times (i.e., before the Internet).
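
As a rough sketch, the tab-separated layout quoted above can be read
with Python's standard csv module. The sample rows and the function name
read_bigram_counts are made up for illustration, and the sketch assumes
every line has exactly three fields (word, word, count):

```python
import csv
import io

# Hypothetical sample data in the word[tab]word[tab]count layout.
sample = "how\tare\t54\nare\tyou\t31\n"

def read_bigram_counts(stream):
    """Return a dict mapping (word1, word2) to an integer count."""
    counts = {}
    # QUOTE_NONE keeps quotation marks in corpus tokens as literal text.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for word1, word2, count in reader:
        counts[(word1, word2)] = int(count)
    return counts

counts = read_bigram_counts(io.StringIO(sample))
print(counts[("how", "are")])
```

For a real file, the io.StringIO object would be replaced by
open(path, newline="", encoding="utf-8").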

But context info might need more complex structures than just CSV.
A widely used format is JSON, which is the next step up beyond CSV.
For example, a list of items would be represented as:

    [a, b, c, d, e, f, g]

A list of tagged items would be represented:

    {tag1: a, tag2: b, tag3: c}

And it's possible to nest these two formats arbitrarily deep.
That might be useful for contexts that contain other contexts.
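
A minimal sketch of such nesting, using Python's standard json module.
The field names (source, items, subcontext) are hypothetical and not part
of any agreed format. Note that the examples above are schematic: actual
JSON requires double quotes around tags and around string values.

```python
import json

# Hypothetical context record; the field names are illustrative only.
context = {
    "source": "example",
    "items": ["a", "b", "c"],
    # A context that contains another context.
    "subcontext": {"source": "nested", "items": ["d", "e"]},
}

text = json.dumps(context)    # serialize to a JSON string
restored = json.loads(text)   # parse it back into Python objects

print(text)
print(restored["subcontext"]["items"])
```

The round trip returns the same nested structure, so lists and tagged
items can be mixed to whatever depth the context info needs.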

The syntax for JSON is expressed on the first page of the
JSON web site:

    http://www.json.org

John


