[Corpora-List] question about storage of corpora

Michal Ptaszynski ptaszynski at media.eng.hokudai.ac.jp
Thu Jun 2 14:23:05 UTC 2011


Hi All,

> Still, I would be interested whether anybody is using
> native XML databases for corpora which are 1) in the range of *billions*
> of tokens (ours currently has about 1 450 000 000 tokens) and 2)
> linguistically annotated at least at the morphosyntactic level.

We are storing a somewhat large blog corpus in XML.

We needed to modify the XML slightly, though, to allow easier querying.

The corpus consists of over 350 million sentences. At the moment it is  
annotated with different types of information for different kinds of  
research: 1. POS tags, 2. dependency structure, and 3. emotion tags.

However, although XML is good for storing the files, the raw files are very  
big (100 MB each), and querying the raw XML was a horror. So recently we  
indexed it with HyperEstraier. It works like Google now :)

Details about the corpus are in here:
Jacek Maciejewski, Michal Ptaszynski, Pawel Dybala, "Developing a  
Large-Scale Corpus for Natural Language Processing and Emotion Processing  
Research in Japanese", In Proceedings of the International Workshop on  
Modern Science and Technology (IWMST), Kitami, Japan/September 2010, pp.  
192-195.
Link to the paper:
http://tnij.org/corpus_paper

Best,

Michal


-----------------------------
From: Adam Przepiorkowski <adamp at ipipan.waw.pl>
Cc: corpora at hd.uib.no
To: Tine Lassen <tine.lassen at tdcadsl.dk>
Date: Wed, 01 Jun 2011 16:55:21 +0200
Subject: Re: [Corpora-List] question about storage of corpora

Dear Tine,

As already advocated by various people, and also mentioned by Adam
Radziszewski, we are using TEI P5 for the National Corpus of Polish
(http://nkjp.pl/).  Of course, that's a huge collection of very specific
recommendations from which you pick only those that you need, so each
TEI P5 schema will be different.  Ours is documented here:

http://nlp.ipipan.waw.pl/TEI4NKJP/

and it takes care of metadata, text structure, segmentation into
sentences and word-like tokens, morphosyntactic and syntactic levels,
Named Entities and some Word Sense Disambiguation.  In terms of
gigabytes, this format takes a lot of space[*], but disk space is cheap
these days and XML files compress very well, so it hasn't been too much
of a problem so far.
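The compressibility claim is easy to verify: verbose annotation markup is highly repetitive, which is exactly what gzip exploits. A quick stand-alone illustration (the XML snippet is invented, not the TEI4NKJP format):

```python
# Why verbose annotation XML is cheap to store: the markup repeats,
# so general-purpose compression achieves very high ratios.
import gzip

# 10,000 copies of a typical per-token annotation line (made-up tags).
xml = b'<tok><orth>word</orth><ctag>subst:sg:nom:m1</ctag></tok>\n' * 10000
packed = gzip.compress(xml)
ratio = len(xml) / len(packed)
print(f"{len(xml)} -> {len(packed)} bytes (x{ratio:.0f})")
```

Real corpora compress less dramatically than this artificial repetition, of course, but ratios well above plain text are typical for token-level XML annotation.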

Now, we haven't tried any native XML databases because of our previous
experiences, but Damir Ćavar mentioned in this thread that things have
changed recently.  Still, I would be interested whether anybody is using
native XML databases for corpora which are 1) in the range of *billions*
of tokens (ours currently has about 1 450 000 000 tokens) and 2)
linguistically annotated at least at the morphosyntactic level.

What we do instead in the National Corpus of Polish, is we compile XML
files into a purpose-designed binary format used by our search engine,
Poliqarp (http://poliqarp.sourceforge.net/), and – independently – we
convert them to a relational database.[**]  Admittedly, this compilation  
and
conversion takes time (in the range of days), which is a nuisance, but
not a major obstacle, as this is something that needs to be done only
occasionally.
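The relational route can be sketched in a few lines: per-token morphosyntactic annotation maps naturally onto sentence and token tables with indexed columns. The schema and tag strings below are illustrative only, not the actual NKJP database layout.

```python
# Sketch: loading token-level annotation into a relational database
# (SQLite here for self-containment). Column names and tags are
# invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sentence (id INTEGER PRIMARY KEY, text_id TEXT);
    CREATE TABLE token (
        id INTEGER PRIMARY KEY,
        sentence_id INTEGER REFERENCES sentence(id),
        position INTEGER,
        orth TEXT,        -- surface form
        base TEXT,        -- lemma
        ctag TEXT         -- morphosyntactic tag
    );
    CREATE INDEX token_base ON token(base);
""")

conn.execute("INSERT INTO sentence VALUES (1, 'doc001')")
conn.executemany(
    "INSERT INTO token (sentence_id, position, orth, base, ctag) "
    "VALUES (?, ?, ?, ?, ?)",
    [(1, 0, 'Ala', 'Ala', 'subst:sg:nom:f'),
     (1, 1, 'ma', 'mieć', 'fin:sg:ter:imperf'),
     (1, 2, 'kota', 'kot', 'subst:sg:acc:m2')],
)

# Once loaded, corpus-wide queries are index lookups, not file scans:
rows = conn.execute(
    "SELECT orth, ctag FROM token WHERE base = 'kot'"
).fetchall()
print(rows)  # [('kota', 'subst:sg:acc:m2')]
```

The one-off cost of the conversion buys fast ad-hoc querying afterwards, which matches the trade-off described above.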

All best,

Adam


*  Specifically, around 240 GB for almost 1.5 billion words
    morphosyntactically annotated (with info about all possible
    interpretations for all tokens, about which one is selected in the
    context and about the tool performing the morphosyntactic annotation,
    and with additional marking of text structure, segmentation at
    various levels and rich metadata).

** The two search engines, with quite different functionalities, are
    employed here: http://nkjp.pl/index.php?page=6&lang=1.


Tine Lassen <tine.lassen at tdcadsl.dk>:

> Hi,
>  I am in the process of compiling a series of domain corpora, and once  
> the present text gathering phase is completed, of course I need to store  
> the texts somehow. The texts need to be annotated with e.g. parts of  
> speech and possibly phrase boundaries for term extraction purposes.
> My questions are: Would it be wiser to store the texts as XML or in a  
> relational database format? Does a generally accepted corpus annotation  
> XML schema exist? And do tools for annotation of and search in such  
> files exist? How do you store your corpora?
>  Any thoughts or ideas regarding the questions are very welcome.
> Best,
> Tine Lassen
> Copenhagen Business School

-- 
Adam Przepiórkowski                          ˈadam ˌpʃɛpjurˈkɔfskʲi
http://clip.ipipan.waw.pl/ ____ Computational Linguistics in Poland
http://nlp.ipipan.waw.pl/ ____________ Linguistic Engineering Group
http://nkjp.pl/ _________________________ National Corpus of Polish

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

