[Corpora-List] corpus software

Christian-Emil Ore c.e.s.ore at iln.uio.no
Sun Apr 18 12:20:36 UTC 2010


Dear all,

I am leading a project to build a text corpus of medieval Norwegian.
The project is under the umbrella of Menota (Medieval Nordic Text
Archive, www.menota.org), and the texts are encoded in the Menota
extension of TEI P5 (see www.menota.org for the Menota handbook).

The corpus will consist of 1.5 million running words (which is a lot
when transcribed from manuscripts and not from editions), of which
1.0 million will be given a morphosyntactic encoding, of which
0.5 million will be encoded as syntactic trees (a treebank). The
treebank XML format will follow the University of Stuttgart's TIGER
format.

In Menota (as in all corpora I have been involved in developing), the
Corpus Workbench (CWB/CQP) from the University of Stuttgart is the
standard choice of corpus search system. However, CWB/CQP is old and
has only been maintained, not developed, over the last 10 years (I know
about the Open Corpus Workbench initiative). For example, the Unicode
support is meager.

Do you have any suggestions for a more up-to-date system, e.g. one with
full Unicode support? Could Lucene be a candidate?

Chr-Emil




