[Corpora-List] Format for context info

Kilian Evang poststelle at texttheater.net
Thu May 12 11:48:13 UTC 2011


Hi Hans,

On 05/11/2011 10:18 PM, hans christensen wrote:
> I'm kinda new to the "scene" so I'm not really familiar with what
> standards are commonly used (if such exist). So, my question is: I want
> to make context information for the HC Corpora
> <http://corpora.heliohost.org/> available for download (for now I'm
> looking at 2-gram and 3-grams). I was wondering if there are any
> standard way of doing this?
> I was thinking just to give them as tab separated txt files as that
> seems the most universal, e.g. something like:
>
> how[tab]are[tab]54

I think that's a good idea. Google's huge n-gram corpus is also released 
in this format (though I'm not sure if they use tabs or spaces):

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
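
For what it's worth, here is a minimal Python sketch of how such files could be produced, with one n-gram per line, tokens separated by tabs, and the count in the last column. The function names and the toy sentence are made up for illustration and are not taken from the HC Corpora or from Google's release. It also assumes the tokens themselves contain no tabs or newlines.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def format_ngrams(counts):
    """Return lines of the form token<TAB>...<TAB>count, most frequent first."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ["\t".join(ngram) + "\t" + str(count) for ngram, count in ordered]


# Toy input; a real corpus would be tokenized and streamed from disk.
tokens = "how are you and how are they".split()
bigrams = count_ngrams(tokens, 2)
trigrams = count_ngrams(tokens, 3)

for line in format_ngrams(bigrams):
    print(line)
```

Since the count is always the last column, the same reader works for 2-grams and 3-grams alike: split each line on tabs and treat the final field as the frequency.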

Best,
Kilian
