[Corpora-List] question about storage of corpora

Damir Cavar dcavar at indiana.edu
Mon May 30 09:55:33 UTC 2011


Hi Tine,

On May 27, 2011, at 3:14 PM, Tine Lassen wrote:

> I am in the process of compiling a series of domain corpora, and once the present text gathering phase is completed, of course I need to store the texts somehow. The texts need to be annotated with e.g. parts of speech and possibly phrase boundaries for term extraction purposes. 
> 
> My questions are: Would it be wiser to store the texts as XML or in a relational database format?
> Does a generally accepted corpus annotation XML schema exist? And do tools for annotation of and search in such files exist?
> How do you store your corpora?

TEI XML, edited with the oXygen XML editor, with the XML files stored in, for example, BaseX, is the solution. At least, that is how we have done the editing and annotation so far for the Croatian Language Corpus (http://riznica.ihjj.hr/). I use BaseX for my own purposes, but I plan to provide a new search front-end with it as the backend. The current online search front-end of the CLC is a modified PhiloLogic, which takes raw TEI XML files (see the link above for the interface).
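
To illustrate what POS-annotated TEI can look like and how it can be searched, here is a minimal sketch in Python using only the standard library. The sample sentence, the tag values (DET, NOUN, ...) and the helper names are made up for illustration, and the use of "lemma" and "pos" attributes on <w> elements is an assumption about the encoding; it is not the actual markup of the Croatian Language Corpus. In an XML database the same lookup would be written as an XPath/XQuery expression rather than in Python.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The TEI P5 namespace; the prefix "tei" is only a local alias.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample: one sentence with word-level annotation.
# The attribute names and tag values are assumptions for illustration.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s n="1">
      <w lemma="the" pos="DET">The</w>
      <w lemma="corpus" pos="NOUN">corpora</w>
      <w lemma="be" pos="VERB">are</w>
      <w lemma="large" pos="ADJ">large</w>
    </s>
  </p></body></text>
</TEI>"""


def tokens_with_pos(xml_text, pos):
    """Return the surface forms of all <w> elements carrying the given POS tag."""
    root = ET.fromstring(xml_text)
    path = ".//tei:w[@pos='%s']" % pos
    return [w.text for w in root.iterfind(path, TEI_NS)]


def pos_counts(xml_text):
    """Count how often each POS tag occurs on <w> elements."""
    root = ET.fromstring(xml_text)
    return Counter(w.get("pos") for w in root.iterfind(".//tei:w", TEI_NS))


if __name__ == "__main__":
    print(tokens_with_pos(SAMPLE, "NOUN"))
    print(pos_counts(SAMPLE))
```

For term extraction, the same pattern can be extended to select sequences of tags (for example adjective plus noun) within each <s> element.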

So why bother storing all that in relational DBs? The current XML DBs are quite efficient and fast:

TEI
http://www.tei-c.org/

PhiloLogic
http://sites.google.com/site/philologic3/home

BaseX
http://basex.org/


The only commercial product in this list is:

oXygen
http://www.oxygenxml.com/


best wishes
DC



--
Dr. Damir Cavar
http://web.me.com/dcavar/
mobile +49 176 60928748
office +49 7531 885357
private (US): +1 (734) 330-2902
FaceTime: dcavar at me.com
