[Corpora-List] question about storage of corpora

Adam Przepiorkowski adamp at ipipan.waw.pl
Wed Jun 1 14:55:21 UTC 2011


Dear Tine,

As already advocated by various people, and also mentioned by Adam
Radziszewski, we are using TEI P5 for the National Corpus of Polish
(http://nkjp.pl/).  Of course, TEI P5 is a huge collection of very specific
recommendations from which you pick only those you need, so each
TEI P5 schema will be different.  Ours is documented here:

http://nlp.ipipan.waw.pl/TEI4NKJP/

and it takes care of metadata, text structure, segmentation into
sentences and word-like tokens, morphosyntactic and syntactic levels,
Named Entities and some Word Sense Disambiguation.  In terms of
gigabytes, this format takes a lot of space[*], but disk space is cheap
these days and XML files compress very well, so it hasn't been too much
of a problem so far.
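
As a concrete illustration, here is a minimal Python sketch of pulling
per-token features out of such an annotation file.  The element and
attribute names used below (seg, f, symbol, string, "orth", "pos") and the
file name are generic TEI feature-structure conventions chosen for the
example only, not the actual TEI4NKJP layout, which is specified in the
documentation above:

  import xml.etree.ElementTree as ET

  TEI = "{http://www.tei-c.org/ns/1.0}"   # standard TEI P5 namespace

  def iter_segments(path):
      """Yield one {feature name: value} dict per <seg> element."""
      for seg in ET.parse(path).iter(TEI + "seg"):
          feats = {}
          for f in seg.iter(TEI + "f"):   # TEI feature-structure entries
              sym = f.find(TEI + "symbol")
              if sym is not None:
                  feats[f.get("name")] = sym.get("value")
              else:
                  feats[f.get("name")] = (f.findtext(TEI + "string") or "").strip()
          yield feats

  if __name__ == "__main__":
      for feats in iter_segments("ann_morphosyntax.xml"):  # illustrative file name
          print(feats.get("orth"), feats.get("pos"))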

Now, we haven't tried any native XML databases because of our previous
experiences, but Damir Ćavar mentioned in this thread that things have
changed recently.  Still, I would be interested to hear whether anybody is using
native XML databases for corpora which are 1) in the range of *billions*
of tokens (ours currently has about 1 450 000 000 tokens) and 2)
linguistically annotated at least at the morphosyntactic level.

What we do instead in the National Corpus of Polish is compile the XML
files into a purpose-designed binary format used by our search engine,
Poliqarp (http://poliqarp.sourceforge.net/), and, independently, convert
them to a relational database.[**]  Admittedly, this compilation and
conversion take time (in the range of days), which is a nuisance, but
not a major obstacle, as it is something that needs to be done only
occasionally.
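
Just to give the flavour of the relational side (a toy layout for
positional annotations, not the schema we actually use), such a conversion
boils down to something like one row per token plus indexes on the columns
you expect to query:

  import sqlite3

  conn = sqlite3.connect("corpus.db")
  conn.executescript("""
  CREATE TABLE IF NOT EXISTS token (
      id       INTEGER PRIMARY KEY,
      doc_id   TEXT NOT NULL,     -- source document identifier
      sent_id  INTEGER NOT NULL,  -- sentence number within the document
      position INTEGER NOT NULL,  -- token offset within the sentence
      orth     TEXT NOT NULL,     -- orthographic form
      lemma    TEXT,
      pos      TEXT               -- morphosyntactic tag
  );
  CREATE INDEX IF NOT EXISTS idx_token_orth ON token(orth);
  CREATE INDEX IF NOT EXISTS idx_token_pos  ON token(pos);
  """)

  def load(tokens):
      """Bulk-insert (doc_id, sent_id, position, orth, lemma, pos) tuples."""
      with conn:
          conn.executemany(
              "INSERT INTO token (doc_id, sent_id, position, orth, lemma, pos) "
              "VALUES (?, ?, ?, ?, ?, ?)", tokens)

  load([("doc1", 0, 0, "Ala", "Ala", "subst"),
        ("doc1", 0, 1, "ma", "mieć", "fin")])
  print(conn.execute("SELECT COUNT(*) FROM token WHERE pos = 'subst'").fetchone()[0])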

All best,

Adam


*  Specifically, around 240 GB for almost 1.5 billion words
   morphosyntactically annotated (with information about all possible
   interpretations for all tokens, about which one is selected in
   context and about the tool performing the morphosyntactic annotation,
   and with additional marking of text structure, segmentation at
   various levels and rich metadata); a rough per-token calculation
   follows these notes.

** Two search engines, with quite different functionalities, are
   employed here: http://nkjp.pl/index.php?page=6&lang=1.
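
As a back-of-the-envelope check on footnote [*], that works out to roughly
160-180 bytes of annotated XML per token, depending on how you count a
gigabyte:

  size_bytes = 240 * 10**9          # footnote [*], with 1 GB = 10**9 bytes
  tokens = 1_450_000_000            # current size of the corpus
  print(round(size_bytes / tokens), "bytes of annotated XML per token")  # ~166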


Tine Lassen <tine.lassen at tdcadsl.dk>:

> Hi,
>
> I am in the process of compiling a series of domain corpora, and once the present text-gathering phase is completed, of course I need to store the texts somehow. The texts need to be annotated with e.g. parts of speech and possibly
> phrase boundaries for term extraction purposes. 
>
> My questions are: Would it be wiser to store the texts as XML or in a relational database format?
> Does a generally accepted corpus annotation XML schema exist? And do tools for annotating and searching such files exist?
> How do you store your corpora?
>
> Any thoughts or ideas regarding the questions are very welcome :)
>
> Best,
> Tine Lassen
> Copenhagen Business School

-- 
Adam Przepiórkowski                          ˈadam ˌpʃɛpjurˈkɔfskʲi
http://clip.ipipan.waw.pl/ ____ Computational Linguistics in Poland
http://nlp.ipipan.waw.pl/ ____________ Linguistic Engineering Group
http://nkjp.pl/ _________________________ National Corpus of Polish
