Corpora: XML programmes and tagging
Tom Emerson
tree at cymru.basistech.com
Fri May 19 14:34:57 UTC 2000
Gabriella Rundblad writes:
> 1) As far as I understand, it is today recommended to use
> XML for tagging purposes. For this I'll need user-friendly
> programme(s), the question is which. I know there are both
> free ware, share ware and commercial products out there,
> though I've never tried (yet) either of them and don't
> know how user-friendly they are. I know HTML and use
> Hotmetal Pro for this (great!) and there is obviously an
> XML equivalent (XMetal). Could you advise what programme(s)
> to use?! Is XMetal good for a never-before-tagger?!
I've been building large monolingual and parallel corpora for
Simplified and Traditional Chinese texts (SC<>TC parallel) and have
ended up using GNU Emacs for all of my editing. I have not been able
to find another tool that allows me to create and edit documents using
Unicode (the only way to handle SC and TC within a single document).
There are SGML and XML modes for Emacs that are useful, though for my
purposes (and with my DTD, see below) I just insert the markup
manually or with the help of various Python scripts I put together to
massage the various source texts.
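
To illustrate the point about needing Unicode for mixed SC/TC text, here
is a minimal Python sketch. It assumes Python's standard "gb2312" and
"big5" codecs as stand-ins for the usual SC and TC legacy encodings. The
helper name encodable and the two sample characters are chosen purely
for illustration and are not taken from any tool or script mentioned
above.

```python
def encodable(text, encoding):
    """Return True if every character in text can be encoded in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Sample characters (illustrative choice): the simplified and the
# traditional form of the same character, "han".
SC = "\u6c49"  # simplified form
TC = "\u6f22"  # traditional form
mixed = SC + TC  # a "document" containing both scripts

# Report, per encoding, whether the mixed text can be represented.
for enc in ("gb2312", "big5", "utf-8"):
    print(enc, encodable(mixed, enc))
```

Each legacy codec should accept only its own form of the character,
while a Unicode encoding such as UTF-8 accepts both in one document.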
I ruled out XMetaL when it was first released because of their refusal
to fully support Unicode, which is essential for my purposes. For your
needs it probably is not a problem: eth and thorn (upper- and
lowercase) are both in ISO 8859-1 (Latin-1). If you need Yogh then
you're out of luck.
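
The Latin-1 claim is easy to check for any character you care about. The
following is a small sketch, not a full coverage tool: it uses Python's
standard "iso-8859-1" codec and the unicodedata module, and the helper
name in_latin1 is hypothetical.

```python
import unicodedata


def in_latin1(ch):
    """Return True if the character ch is representable in ISO 8859-1."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# Eth and thorn (upper- and lowercase), followed by yogh (upper- and
# lowercase), written as escapes so the script itself stays ASCII.
candidates = "\u00d0\u00f0\u00de\u00fe\u021c\u021d"

for ch in candidates:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {in_latin1(ch)}")
```

The same loop can be pointed at any other character inventory to see
whether a Latin-1-only editor would be sufficient.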
> 2) The tagging I would like to do (I'm reading up on TEI
> etc) is a tagging of phrases and clauses, not parts of
> speech. What's been done on this earlier? Any lists of tags
> etc?
Take a look at the Corpus Encoding Standard,
http://www.cs.vassar.edu/CES/
and its XML counterpart, XCES,
http://www.cs.vassar.edu/XCES
For my purposes I couldn't use these because they lack support for
Eastern languages, and right now I don't need the complexity for my
internal work. So I rolled my own DTD which works fine for me. In the
long term I would like to move to XCES. Unfortunately, my attempts to
get involved in that effort have gone unanswered.
-tree
--
Tom Emerson Basis Technology Corp.
Language Hacker http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"