[Corpora-List] Format for context info

hans christensen hc.corpus at gmail.com
Thu May 12 20:39:48 UTC 2011


John and Kilian,
Thanks a lot for your replies. I think for now I'll just go with the simple
tab-separated txt, but I'll definitely look into JSON for future updates
(for now I'm just working on basic n-grams, but I'm researching more
advanced models).
Thanks,
Hans
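
A minimal sketch of the tab-separated layout discussed in this thread
(how[tab]are[tab]54), assuming one n-gram per line with the count in the
last column. The helper names format_ngram and parse_ngram are illustrative
and do not come from the thread.

```python
def format_ngram(tokens, count):
    """Return one line: the tokens, then the count, separated by tabs."""
    return "\t".join([*tokens, str(count)])


def parse_ngram(line):
    """Split a line back into (tuple of tokens, integer count)."""
    # Assumption: tokens never contain a tab, and the last field is the count.
    *tokens, count = line.rstrip("\n").split("\t")
    return tuple(tokens), int(count)


line = format_ngram(("how", "are"), 54)
ngram, count = parse_ngram(line)
```

Because the count is always the last field, the same reader works for
unigrams, bigrams, and longer n-grams without any change.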


> On 5/12/2011 7:48 AM, Kilian Evang wrote:
> >> I was thinking just to give them as tab separated txt files as that
> >> seems the most universal, e.g. something like:
> >>
> >> how[tab]are[tab]54
> >
> > I think that's a good idea. Google's huge n-gram corpus is also released
> > in this format (though I'm not sure if they use tabs or spaces):
>
> CSV (Comma Separated Values) has been a format for txt files since
> prehistoric times (i.e., before the Internet).
> But context info might need more complex structures than just CSV. A widely
> used format is JSON, which is the next step up beyond CSV. For example a
> list of items would be represented:
> [a, b, c, d, e, f, g]
> A list of tagged items would be represented:
> {tag1: a, tag2: b, tag3: c}
> And it's possible to nest these two formats arbitrarily deep. That might be
> useful for contexts that contain other contexts.
> The syntax for JSON is expressed on the first page of the JSON web site:
> http://www.json.org
> John
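
As a rough illustration of the nesting John describes, the sketch below
stores one n-gram with its count and a list of contexts, using Python's
standard json module. The field names (ngram, count, contexts, source, left,
right) and the sample values are hypothetical. Nobody in the thread proposed
them. Note that real JSON requires double quotes around strings and keys,
so the schematic [a, b, c] would be written ["a", "b", "c"].

```python
import json

# Hypothetical record: a list (the n-gram), a number (its count), and a
# list of tagged items (the contexts), nested inside one tagged item.
record = {
    "ngram": ["how", "are"],
    "count": 54,
    "contexts": [
        {"source": "doc-001", "left": ["so"], "right": ["you"]},
        {"source": "doc-002", "left": [], "right": ["things"]},
    ],
}

text = json.dumps(record)    # serialize to a JSON string
restored = json.loads(text)  # parse it back into Python objects
```

Writing one such record per line keeps the file as easy to stream as the
tab-separated version while leaving room for nested context information.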