[Corpora-List] question about storage of corpora

Adam Radziszewski kocikikut at gmail.com
Mon May 30 17:55:07 UTC 2011


> For archiving and data interchange, XML (the file format) is unsurpassed
> for any size of corpus, as it is software-independent, based on plaintext,
> and the tags are human-readable and to some degree self-describing. A CWB
> index or a database would *not* be a good format for this purpose, by
> contrast, because they are binary formats based on non-self-describing
> column-and-row input.
>

I beg to differ with the suggestion that using XML entails
software-independence and interoperability. XML is just a ‘shell’ around any
arbitrary format, providing a slight abstraction over very low-level
encoding details and imposing only soft constraints on the shape of the
actual format (there are alternative formats, such as JSON or YAML, which
are basically as interoperable as XML).

The same unfortunately applies to TEI to some degree. TEI seems to be a
meta-format rather than a particular format in itself. This makes the
interoperability only virtual: given two fully TEI-compliant corpora, one
is not guaranteed to be able to read both with the same software.
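
To make the point concrete, here is a minimal Python sketch. The two
fragments are made up for illustration: they are not actual TEI or NKJP
markup, and the element and attribute names (tok, orth, pos, w, form, tag)
are my assumptions. Both are well-formed XML carrying the same two tokens,
yet a reader written for one finds nothing in the other.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding A: annotation carried in attributes.
DOC_A = '<s><tok orth="Ala" pos="subst"/><tok orth="ma" pos="fin"/></s>'

# Hypothetical encoding B: the same annotation carried in child elements.
DOC_B = (
    "<s>"
    "<w><form>Ala</form><tag>subst</tag></w>"
    "<w><form>ma</form><tag>fin</tag></w>"
    "</s>"
)


def read_a(xml_text):
    """Reader that only knows the conventions of encoding A."""
    root = ET.fromstring(xml_text)
    return [(t.get("orth"), t.get("pos")) for t in root.iter("tok")]


def read_b(xml_text):
    """Reader that only knows the conventions of encoding B."""
    root = ET.fromstring(xml_text)
    return [(w.findtext("form"), w.findtext("tag")) for w in root.iter("w")]


tokens_a = read_a(DOC_A)  # each reader works on its own encoding
tokens_b = read_b(DOC_B)
cross = read_a(DOC_B)     # parses without error, but finds no tokens

print(tokens_a)
print(tokens_b)
print(cross)
```

The generic XML parser accepts both documents; what differs is the layer
above it, and that layer is exactly what has to be rewritten per corpus.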

By the way, my calculations were probably oversimplified, since I counted
only the ann_morphosyntax file, which in turn references ann_segmentation
and text_structure. If they are to be included, the average bytes/token
ratio for TEI/NKJP reaches 1485.58 (meaning that a 1-million-token corpus
would take about 1.4 GB). This does not include any metadata (and neither
does any of the alternative formats mentioned).
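
For anyone who wants to redo the arithmetic, a small Python sketch follows.
The ratio is the one quoted above; the function and variable names are only
illustrative. Note that the 1.4 figure corresponds to binary units (GiB);
in decimal gigabytes the same size comes out closer to 1.5.

```python
BYTES_PER_TOKEN = 1485.58  # average bytes/token for TEI/NKJP, as quoted above
TOKENS = 1_000_000         # a 1-million-token corpus


def corpus_size_bytes(bytes_per_token, tokens):
    """Estimated on-disk size: average bytes per token times token count."""
    return bytes_per_token * tokens


size = corpus_size_bytes(BYTES_PER_TOKEN, TOKENS)
size_gb = size / 10**9   # decimal gigabytes
size_gib = size / 2**30  # binary gibibytes

print(f"{size_gb:.2f} GB (decimal), {size_gib:.2f} GiB (binary)")
```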

Best,
Adam