[Corpora-List] question about storage of corpora

Mon May 30 16:54:20 UTC 2011

>
>
> TEI XML, edited with the oXygen XML editor, with the XML files stored in,
> for example, BaseX, is the solution. At least, that is how we have done the
> editing and annotation so far for the Croatian Language Corpus
> (http://riznica.ihjj.hr/). I use BaseX for my own purposes, but I do plan to
> provide a new search front-end with it as a backend. The current online
> search front-end of the CLC is a modified PhiloLogic that takes raw TEI XML
> files (see the link above for the interface).
>
>
I made a rough calculation using the current proposal for the TEI
encoding <http://nlp.ipipan.waw.pl/TEI4NKJP/> of
the National Corpus of Polish (NKJP <http://nkjp.pl/>). I considered only
the morpho-syntactic layer, no higher annotation levels. Here are the results*:

*TEI*: 1355.75 bytes/token
*XCES XML* (IPI PAN Corpus
<http://korpus.pl/index.php?lang=en&page=welcome> dialect):
277.10 bytes/token
*simple tab-separated text format*: 110.08 bytes/token
*simple tab-separated with no ambiguity info*: 38.67 bytes/token (this format
is lossy in that only one contextually-appropriate tag–lemma pair is
selected per token)

This means that a one-million-token corpus would take about 1.3 GB in TEI,
but only about 105 MB in simple txt (37 MB in the no-ambiguity txt format).
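
The arithmetic behind these estimates can be sketched in a few lines of
Python. This is only an illustration: the function names are made up, and it
assumes binary units (1 MB = 1024 × 1024 bytes), which is what the rounded
sizes above appear to correspond to.

```python
# Bytes-per-token figures as quoted in this message.
BYTES_PER_TOKEN = {
    "TEI": 1355.75,
    "XCES XML": 277.10,
    "tab-separated": 110.08,
    "tab-separated, no ambiguity": 38.67,
}


def corpus_size_bytes(bytes_per_token, tokens):
    """Estimated size of a corpus of `tokens` tokens, in bytes."""
    return bytes_per_token * tokens


def human(n_bytes):
    """Format a byte count using binary units (assumption: 1 KiB = 1024 B)."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n_bytes < 1024 or unit == "TiB":
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1024


for name, bpt in BYTES_PER_TOKEN.items():
    print(f"{name}: {human(corpus_size_bytes(bpt, 1_000_000))}")
```

Scaling to another corpus size only requires changing the token count passed
to corpus_size_bytes.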

*How did I get these numbers? I downloaded the ann_morphosyntax example from
the ‘file in NKJP’ column on the TEI4NKJP site. I used two tools for the
conversion:
• the wypluwka2morph.py script bundled with the Pantera
tagger <http://code.google.com/p/pantera-tagger/> to
convert from TEI/NKJP to XCES XML
• maca-convert <http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki> to
convert from XCES XML to the other formats
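
A bytes/token figure for a converted file can be obtained roughly as follows.
This is a minimal sketch, not the exact procedure used for the numbers above:
it assumes that tokens in the XCES file are marked with <tok> elements, and
the function name and default marker are illustrative.

```python
import os


def bytes_per_token(path, token_marker="<tok>"):
    """Return file size divided by the number of token markers in the file.

    Assumption: every token is introduced by exactly one `token_marker`.
    """
    size = os.path.getsize(path)
    with open(path, encoding="utf-8") as f:
        tokens = sum(line.count(token_marker) for line in f)
    if tokens == 0:
        raise ValueError(f"no {token_marker!r} markers found in {path}")
    return size / tokens
```

For a line-oriented text format the token count would have to be taken
differently (for example, one per token line), depending on the layout.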

Note that NKJP's annotation includes information about ambiguity in the
corpus; that is, each token is annotated with:
• one tag–lemma pair being the interpretation marked as contextually
appropriate (as chosen by a ‘human MSD tagger’) and
• a set of tag–lemma pairs which could theoretically be appropriate in
other contexts (e.g. morphological analyser output).
This makes the file larger. If only the contextually-appropriate
interpretations are important, then the file may be much smaller (this is
the ‘no-ambiguity’ variant of the txt file).
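
To illustrate why dropping the ambiguity information shrinks the file, here
is a small sketch. The data classes and the one-line-per-token layout are
hypothetical (this is not the actual maca output format), and the example
token is made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Interpretation:
    lemma: str
    tag: str
    chosen: bool = False  # True for the contextually-appropriate reading


@dataclass
class Token:
    orth: str
    interpretations: list = field(default_factory=list)


def serialize(tokens, keep_ambiguity=True):
    """One line per token: orth, then lemma/tag pairs (chosen pair first)."""
    rows = []
    for tok in tokens:
        chosen = [i for i in tok.interpretations if i.chosen]
        others = [i for i in tok.interpretations if not i.chosen]
        kept = chosen + others if keep_ambiguity else chosen
        fields = [tok.orth]
        for interp in kept:
            fields += [interp.lemma, interp.tag]
        rows.append("\t".join(fields))
    return "\n".join(rows) + "\n"


# Illustrative token: Polish "mam" with three possible readings.
mam = Token("mam", [
    Interpretation("mieć", "fin:sg:pri:imperf", chosen=True),
    Interpretation("mama", "subst:pl:gen:f"),
    Interpretation("mamić", "impt:sg:sec:imperf"),
])

full = serialize([mam], keep_ambiguity=True)
lossy = serialize([mam], keep_ambiguity=False)
print(len(full.encode("utf-8")), len(lossy.encode("utf-8")))
```

The lossy variant keeps only the chosen pair, so its size per token no longer
depends on how many alternative readings the analyser proposed.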

Best,
Adam Radziszewski