[Corpora-List] corpus software
Christian-Emil Ore
c.e.s.ore at iln.uio.no
Sun Apr 18 12:20:36 UTC 2010
Dear all,
I am leading a project to build a text corpus of medieval
Norwegian. The project is under the Menota umbrella (Medieval Nordic Text
Archive, www.menota.org), and the texts are encoded in the Menota extension
of TEI P5 (see www.menota.org for the Menota handbook).
The corpus will consist of 1.5 million running words (which is a lot
when transcribed from manuscripts and not from editions), of which
1.0 million will be given a morphosyntactic encoding. Of these, 0.5 million
will also be encoded as syntactic trees (a treebank). The treebank XML format
will follow the Univ. of Stuttgart's TIGER format.
In Menota (as in all corpora I have been involved in the development
of), the Corpus Linguist Workbench (CLW/CQP) from the Univ. of Stuttgart is
the standard choice of corpus search system. However, CLW/CQP is old
and has only been maintained, not developed, over the last 10 years (I know
about the Open Corpus Workbench initiative). For example, the Unicode
support is meager.
Do you have any suggestions for a more up-to-date system, e.g. one with full
Unicode support? Could Lucene be a candidate?
Chr-Emil
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora