<div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div style="word-wrap:break-word"><div><div class="im"><div><br></div></div><div>TEI XML, using the oXygen XML editor, and storing the XML files in, for example, BaseX is the solution. At least that is how we have been doing the editing and annotation for the Croatian Language Corpus (<a href="http://riznica.ihjj.hr/" target="_blank">http://riznica.ihjj.hr/</a>) so far. I use BaseX for my own purposes, but plan to provide a new search front-end with it as a backend. The current online search front-end of the CLC is a modified PhiloLogic that takes raw TEI XML files (see the link above for the interface).</div>
<div><br></div></div></div></blockquote><div><br></div><div>I made a rough calculation using <a href="http://nlp.ipipan.waw.pl/TEI4NKJP/">the current proposal for TEI encoding</a> of the National Corpus of Polish (<a href="http://nkjp.pl/">NKJP</a>). I considered only morphosyntax, no higher annotation levels. Here are the results*:</div>
<div><br></div><div><b>TEI</b>: <font class="Apple-style-span" color="#cc0000">1355.75</font> bytes/token</div><div><b>XCES XML</b> (<a href="http://korpus.pl/index.php?lang=en&page=welcome">IPI PAN Corpus</a> dialect): <font class="Apple-style-span" color="#cc0000">277.10</font> bytes/token</div>
<div><b>simple tab-separated text format</b>: <font class="Apple-style-span" color="#cc0000">110.08</font> bytes/token</div>
<div>simple tab-separated with no ambiguity info: <font class="Apple-style-span" color="#cc0000">38.67</font> bytes/token (this format is lossy in that only one contextually-appropriate tag–lemma pair is selected per token)</div>
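For reference, the per-token figures above translate into whole-corpus sizes with nothing more than a multiplication; a minimal sketch (using 1 MiB = 2^20 bytes):

```python
# Project whole-corpus sizes from the measured bytes/token figures above.
bytes_per_token = {
    "TEI": 1355.75,
    "XCES XML": 277.10,
    "tab-separated": 110.08,
    "tab-separated, no ambiguity": 38.67,
}

tokens = 1_000_000  # a one-million-token corpus

for fmt, bpt in bytes_per_token.items():
    mib = bpt * tokens / 2**20  # binary megabytes
    print(f"{fmt}: {mib:.0f} MiB")
# TEI comes out around 1293 MiB (~1.3 GB), plain txt ~105 MiB,
# and the no-ambiguity variant ~37 MiB.
```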
<div><br></div><div>This means that a one-million-token corpus would take 1.3 GB in TEI, but only 105 MB in simple txt (37 MB in the no-ambiguity txt format).</div><div><br></div><div><br></div><div>*<i>How did I make this?</i> I downloaded the ann_morphosyntax example from the ‘file in NKJP’ column on the TEI4NKJP site. I used two tools for the conversion:</div>
<div>• the wypluwka2morph.py script bundled with <a href="http://code.google.com/p/pantera-tagger/">the Pantera tagger</a> to convert from TEI/NKJP to XCES XML</div><div>• <a href="http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki">maca-convert</a> to convert from XCES XML to the other formats</div>
<div><br></div><div>Note that NKJP's annotation includes information about ambiguity in the corpus; that is, each token is annotated with:</div>
<div>• one tag–lemma pair marked as the contextually appropriate interpretation (as chosen by a ‘human MSD tagger’) and</div><div>• a set of tag–lemma pairs which could theoretically be appropriate in other contexts (e.g. morphological analyser output).</div>
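To see why the ambiguity set inflates the size, here is a hypothetical comparison of one token line with and without the alternative readings. The tag strings below are NKJP-tagset-like, but the exact layout is my own illustration, not the actual XCES/txt format:

```python
# Hypothetical illustration of one token line: the chosen tag-lemma pair
# plus the full set of analyser readings, versus the chosen pair only.
# (Tag strings and column layout are made up for illustration.)
with_ambiguity = "drogi\tdroga\tsubst:pl:nom:f\tdrogi\tadj:sg:nom:m1:pos\tdroga\tsubst:sg:gen:f\n"
chosen_only = "drogi\tdroga\tsubst:pl:nom:f\n"

print(len(with_ambiguity.encode("utf-8")), "vs", len(chosen_only.encode("utf-8")), "bytes")
```

Dropping the alternative readings for every token is what makes the no-ambiguity variant roughly three times smaller in the measurements above (38.67 vs 110.08 bytes/token).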
<div>This makes the file larger. If only contextually-appropriate interpretations are important, then the file may be way smaller (this is the ‘no-ambiguity’ variant of the txt file).</div><div><br></div><div>Best,</div>
<div>
Adam Radziszewski</div><div><br></div></div>