[Corpora-List] question about storage of corpora

Fco. Mario Barcala Rodríguez mario.barcala at mundo-r.com
Mon May 30 09:39:52 UTC 2011


Hi:

We store the texts of our corpora as XML, and this has worked fine for
us for more than a decade. We then build relational databases from them
to support different search applications (http://corpus.cirp.es/corga and
http://corpus.cirp.es/corgaetq).

TEI (http://www.tei-c.org) or XCES (http://www.xces.org) can give you
a starting point.

We adapted an XML editor with a stylesheet to do part-of-speech
annotation. It's not the best solution, but it has worked for us for
years. For searching, we build ad hoc relational databases from the XML
files.
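
A minimal sketch of the XML-to-database step described above, for
illustration only and not the code behind the applications mentioned.
It assumes TEI-like markup with <s> sentences and <w> tokens; the "pos"
and "lemma" attribute names, the table layout, and the helper names
build_index and find_by_pos are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample: TEI-like markup, one <w> element per token.
SAMPLE_XML = """<text>
  <s n="1"><w pos="DET" lemma="the">The</w><w pos="NOUN" lemma="corpus">corpus</w><w pos="VERB" lemma="grow">grows</w></s>
  <s n="2"><w pos="NOUN" lemma="term">Terms</w><w pos="VERB" lemma="matter">matter</w></s>
</text>"""


def build_index(xml_text, conn):
    """Load every <w> token into a flat table that SQL can search."""
    conn.execute(
        "CREATE TABLE token (sentence INTEGER, position INTEGER, "
        "form TEXT, lemma TEXT, pos TEXT)"
    )
    root = ET.fromstring(xml_text)
    for s in root.iter("s"):
        for position, w in enumerate(s.iter("w")):
            conn.execute(
                "INSERT INTO token VALUES (?, ?, ?, ?, ?)",
                (int(s.get("n")), position, w.text, w.get("lemma"), w.get("pos")),
            )
    # An index on the tag column keeps tag-based queries fast on large corpora.
    conn.execute("CREATE INDEX token_pos ON token (pos)")
    conn.commit()


def find_by_pos(conn, pos):
    """Return the word forms carrying the given part-of-speech tag, in text order."""
    rows = conn.execute(
        "SELECT form FROM token WHERE pos = ? ORDER BY sentence, position",
        (pos,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
build_index(SAMPLE_XML, conn)
nouns = find_by_pos(conn, "NOUN")
```

The XML files stay the master copy in this arrangement; the database is
a derived index that can be dropped and rebuilt whenever the annotation
changes.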

You can find all the details and other related issues in my PhD
work. The full PDF file (in Galician) and an extended summary of
it (in English) can be downloaded from my home page:

http://www.xente.mundo-r.com/barcala/publicacions_english.html

Feel free to ask me any questions you have.

Regards,

  Mario Barcala

On Fri, May 27, 2011 at 03:14:25PM +0200, Tine Lassen wrote:
> Hi,
> I am in the process of compiling a series of domain corpora, and once the
> present text gathering phase is completed, of course I need to store the
> texts somehow. The texts need to be annotated with e.g. parts of speech
> and possibly phrase boundaries for term extraction purposes.
> My questions are: Would it be wiser to store the texts as XML or in a
> relational database format? Does a generally accepted corpus annotation
> XML schema exist? And do tools for annotation of and search in such files
> exist? How do you store your corpora?
> Any thoughts or ideas regarding the questions are very welcome :)
> Best,
> Tine Lassen
> Copenhagen Business School
> 

> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

