[Corpora-List] question about storage of corpora

chris brew cbrew at acm.org
Mon May 30 18:41:59 UTC 2011


XML is a well-established international standard, with strong corporate and
academic backing well beyond the world of software development. TEI was
carefully designed by people who think deeply about librarianship, archives
and document preservation. This is why the facilities for encoding metadata
in TEI are so elaborate. These concerns may not be immediately relevant to
people like me who see corpora as resources for machine learning, but they
are crucial to the long-term future of the documents, since they will
determine whether researchers in 200 years' time can keep track of the
documents and their significance.

If you are building a corpus for the sake of its lasting archival value, it
is natural to create a version of your corpus that uses the TEI.
Essentially, you are betting that a format designed by current librarians
and document experts will still make sense to similar people far in the
future, and that there will be such people.


JSON and YAML (which I think are great) do not have the deep institutional
support that would make them suitable as archival formats. If you use them
for that purpose, you are essentially betting that JavaScript or something
like it will make sense to programmers far in the future. I find it a stretch
to predict how people will be thinking about programming in 10 years, never
mind 200. However, as data interchange formats for _now_, JSON and YAML
have much more going for them.

If I were building a corpus intended for long-term use, I would use TEI to
define the archival format, then cater for other needs by mechanically
translating from the central TEI format into what is easy to process or to
search. Like the ANC, I would offer multiple download formats.
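
As a rough sketch of what such a mechanical translation might look like, here
is a small Python example that reads a toy TEI file and writes a JSON view of
it. Everything specific here is an assumption made for illustration: the
sample document, the use of <s> and <w> with a lemma attribute for sentences
and tokens, and the function name tei_to_records. A real corpus would need a
converter written against its own TEI customization.

```python
import json
import xml.etree.ElementTree as ET

# The TEI P5 namespace, bound to a prefix for use in path expressions.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical miniature document, invented for this example. A real TEI
# corpus would carry a far richer header and its own token markup.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample text</title></titleStmt>
      <publicationStmt><p>Unpublished example</p></publicationStmt>
      <sourceDesc><p>Invented for illustration</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>
        <s><w lemma="corpus">Corpora</w> <w lemma="last">last</w></s>
      </p>
    </body>
  </text>
</TEI>"""


def tei_to_records(tei_xml):
    """Derive a plain dict (easy to dump as JSON) from the archival TEI."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(
        ".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS
    )
    sentences = []
    for s in root.iterfind(".//tei:text//tei:s", TEI_NS):
        # One record per <w> element directly inside the sentence.
        tokens = [
            {"form": w.text or "", "lemma": w.get("lemma")}
            for w in s.iterfind("tei:w", TEI_NS)
        ]
        sentences.append(tokens)
    return {"title": title, "sentences": sentences}


derived = tei_to_records(SAMPLE_TEI)
print(json.dumps(derived, indent=2))
```

The point of the design is that the TEI file stays the single source of
truth, and the JSON (or any other download format) can be regenerated from it
whenever the derived format needs to change.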

Chris





On Mon, May 30, 2011 at 1:55 PM, Adam Radziszewski <kocikikut at gmail.com> wrote:

>
>>
>> For archiving and data interchange, XML (the file format) is unsurpassed
>> for any size of corpus, as it is software-independent, based on plaintext,
>> and the tags are human-readable and to some degree self-describing. A CWB
>> index or a database would *not* be a good format for this purpose, by
>> contrast, because they are binary formats based on non-self-describing
>> column-and-row input.
>>
>
> I beg to differ with the suggestion that using XML entails
> software-independence and interoperability. XML is just a ‘shell’ around any
> arbitrary format, providing a slight abstraction over very low-level
> encoding details and imposing only soft constraints on the shape of the
> actual format (there are alternative formats, such as JSON or YAML, which
> are basically as interoperable as XML).
>
> The same unfortunately applies to TEI to some degree. TEI seems to be a
> meta-format rather than a particular format in itself. This renders the
> interoperability only virtual, since given two fully TEI-compliant corpora
> one is not guaranteed to be able to use the same software to read both.
>
> By the way, my calculations were probably oversimplified, since I counted
> only the ann_morphosyntax file, which in turn references ann_segmentation
> and text_structure. If they are included, the average bytes/token ratio
> for TEI/NKJP reaches 1485.58 (meaning that a 1-million-token corpus would
> take about 1.4 GB). This does not include any metadata (and none of the
> alternative formats mentioned does either).
>
> Best,
> Adam
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>