XML is a well-established international standard, with strong corporate and academic backing well beyond the world of software development. TEI was carefully designed by people who think deeply about librarianship, archives and document preservation. This is why the facilities for encoding metadata in TEI are so elaborate. These concerns may not be immediately relevant to people like me who see corpora as resources for machine learning, but they are crucial to the long future of the documents, since they will determine whether researchers in 200 years' time can keep track of the documents and their significance.
<div><br></div><div>If you are building a corpus for the sake of its lasting archival value, it is natural to create a version of your corpus that uses the TEI. Essentially, you are betting that a format designed by current librarians and document experts will still make sense to similar people far in the future, and that there will be such people.</div>
<div><br><div><div><br></div><div>JSON and YAML (which I think are great) do not have the deep institutional support that would make them suitable as archival formats. If you use them for that purpose, you are essentially betting that JavaScript or something like it will make sense to programmers far in the future. I find it a stretch to predict how people will be thinking about programming in 10 years, never mind 200. However, as data interchange formats for _now_, JSON and YAML have much more going for them.</div>
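<div><br></div><div>As a rough illustration of how an XML master copy and a JSON working copy can coexist, here is a minimal sketch in Python. The fragment below is a made-up, TEI-like example (not the NKJP or ANC schema), and the names <font face="'courier new', monospace">SAMPLE_TEI</font> and <font face="'courier new', monospace">tei_to_records</font> are hypothetical; a real corpus would need its own mapping.</div>

```python
import json
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical TEI-like fragment, for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <s n="1">
      <w lemma="corpus">Corpora</w>
      <w lemma="be">are</w>
      <w lemma="useful">useful</w>
    </s>
  </body></text>
</TEI>"""


def tei_to_records(tei_xml):
    """Flatten <s>/<w> elements into a list of plain dicts."""
    root = ET.fromstring(tei_xml)
    records = []
    for sent in root.iterfind(".//tei:s", TEI_NS):
        tokens = [
            {"form": w.text, "lemma": w.get("lemma")}
            for w in sent.iterfind("tei:w", TEI_NS)
        ]
        records.append({"sentence": sent.get("n"), "tokens": tokens})
    return records


records = tei_to_records(SAMPLE_TEI)
# The JSON view is derived; the XML stays the authoritative copy.
print(json.dumps(records, indent=2))
```

<div>The point of the sketch is only that the derived format is disposable: it can be regenerated from the XML whenever processing needs change.</div>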
<div><br></div><div>If I were building a corpus intended for long-term use, I would use TEI to define the archival format, then cater for other needs by mechanically translating from the central TEI format into whatever is easy to process or to search. Like the ANC, I would offer multiple download formats.</div><div><br></div><div>Chris</div><div><br></div><div><br><div><br></div><div><br></div><div><br><div class="gmail_quote">On Mon, May 30, 2011 at 1:55 PM, Adam Radziszewski <span dir="ltr">&lt;<a href="mailto:kocikikut@gmail.com">kocikikut@gmail.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="gmail_quote"><div class="im"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br></blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
For archiving and data interchange, XML (the file format) is unsurpassed for any size of corpus, as it is software-independent, based on plaintext, and the tags are human-readable and to some degree self-describing. A CWB index or a database would *not* be a good format for this purpose, by contrast, because they are binary formats based on non-self-describing column-and-row input.<br>
</blockquote><div><br></div></div><div>I beg to differ with the suggestion that using XML entails software-independence and interoperability. XML is just a ‘shell’ around any arbitrary format, providing a slight abstraction over very low-level encoding details and imposing only soft constraints on the shape of the actual format (there are alternative formats, such as JSON or YAML, which are basically as interoperable as XML).</div>
<div><br></div><div>The same unfortunately applies to TEI to some degree. TEI seems to be a meta-format rather than a particular format in itself. This makes the interoperability only virtual, since given two fully TEI-compliant corpora one is not guaranteed to be able to use the same software to read both.</div>
<div><br></div><div>By the way, my calculations were probably oversimplified, since I counted only the <font face="'courier new', monospace">ann_morphosyntax</font> file, which in turn references <font face="'courier new', monospace">ann_segmentation</font> and <font face="'courier new', monospace">text_structure</font>. If they are included, the average bytes/token ratio for TEI/NKJP reaches 1485.58 (meaning that a 1-million-token corpus would take about 1.4 GB). This does not include any metadata (and none of the alternative formats mentioned includes it either).</div>
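<div><br></div><div>The size estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the ratio is the one quoted above, while the variable names and the choice to print both decimal and binary gigabytes are assumptions made for illustration.</div>

```python
# Average bytes per token for TEI/NKJP, as quoted above.
BYTES_PER_TOKEN = 1485.58
# Corpus size assumed in the estimate: one million tokens.
TOKENS = 1_000_000

total_bytes = BYTES_PER_TOKEN * TOKENS

# Decimal gigabytes (10**9 bytes) and binary gibibytes (2**30 bytes).
size_gb = total_bytes / 10**9
size_gib = total_bytes / 2**30

print(f"{total_bytes:.0f} bytes")
print(f"{size_gb:.2f} GB (decimal)")
print(f"{size_gib:.2f} GiB (binary)")
```

<div>Measured in binary units the figure comes out just under 1.4; in decimal units it is closer to 1.5.</div>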
<div><br></div><div>Best,</div><div>Adam</div><div><br></div></div>
<br></blockquote></div><br></div></div></div></div>