[Corpora-List] question about storage of corpora
Damir Cavar
dcavar at indiana.edu
Mon May 30 09:55:33 UTC 2011
Hi Tine,
On May 27, 2011, at 3:14 PM, Tine Lassen wrote:
> I am in the process of compiling a series of domain corpora, and once the present text gathering phase is completed, of course I need to store the texts somehow. The texts need to be annotated with e.g. parts of speech and possibly phrase boundaries for term extraction purposes.
>
> My questions are: Would it be wiser to store the texts as XML or in a relational database format?
> Does a generally accepted corpus annotation XML schema exist? And do tools for annotation of and search in such files exist?
> How do you store your corpora?
TEI XML, edited with the oXygen XML editor, with the XML files stored in, for example, BaseX, is the solution. At least, that is how we have done the editing and annotation so far for the Croatian Language Corpus (http://riznica.ihjj.hr/). I use BaseX for my own purposes, but I do plan to provide a new search front-end with it as the backend. The current online search front-end of the CLC is a modified PhiloLogic that takes raw TEI XML files (see the link above for the interface).
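To make the idea of searching annotated TEI files concrete, here is a minimal Python sketch that pulls all tokens with a given part-of-speech tag out of a TEI-style fragment. It is only an illustration under stated assumptions: the sample sentence, the use of a pos attribute on <w> elements, the tag values, and the helper name words_with_pos are all made up for this example and are not taken from the CLC; projects differ in how they encode such annotation. The fragment is also simplified and is not a complete, valid TEI document (it has no teiHeader).

```python
import xml.etree.ElementTree as ET

# TEI elements live in this namespace, so queries need a prefix mapping.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, simplified fragment: one sentence with word-level POS tags.
# The attribute name "pos" and the tag values are assumptions for this sketch.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p>
        <s>
          <w pos="DET">The</w>
          <w pos="ADJ">relational</w>
          <w pos="NOUN">database</w>
          <w pos="VERB">stores</w>
          <w pos="NOUN">texts</w>
        </s>
      </p>
    </body>
  </text>
</TEI>"""


def words_with_pos(tei_xml, pos):
    """Return the text of every <w> element whose pos attribute equals `pos`."""
    root = ET.fromstring(tei_xml)
    path = ".//tei:w[@pos='%s']" % pos
    return [w.text for w in root.iterfind(path, NS)]


nouns = words_with_pos(SAMPLE, "NOUN")
print(nouns)
```

The selection is expressed as a path with an attribute predicate, which is the same kind of expression an XML database evaluates over a whole collection of files; the standard library here only handles one document in memory.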
So why bother storing all that in relational DBs? Current XML databases are quite efficient and fast:
TEI
http://www.tei-c.org/
PhiloLogic
http://sites.google.com/site/philologic3/home
BaseX
http://basex.org/
The only commercial product in this list is:
oXygen
http://www.oxygenxml.com/
best wishes
DC
--
Dr. Damir Cavar
http://web.me.com/dcavar/
mobile +49 176 60928748
office +49 7531 885357
private (US): +1 (734) 330-2902
FaceTime: dcavar at me.com