Corpora: XML programmes and tagging

Tom Emerson tree at cymru.basistech.com
Fri May 19 14:34:57 UTC 2000


Gabriella Rundblad writes:
 > 1) As far as I understand, it is today recommended to use
 > XML for tagging purposes. For this I'll need user-friendly
 > programme(s), the question is which. I know there are both
 > free ware, share ware and commercial products out there,
 > though I've never tried (yet) either of them and don't
 > know how user-friendly they are. I know HTML and use
 > Hotmetal Pro for this (great!) and there is obviously an
 > XML equivalent (XMetal). Could you advice what programme(s)
 > to use?! Is XMetal good for a never-before-tagger?!

I've been building large monolingual and parallel corpora for
Simplified and Traditional Chinese texts (SC<>TC parallel) and have
ended up using GNU Emacs for all of my editing. I have not been able
to find another tool that allows me to create and edit documents using
Unicode (the only way to handle SC and TC within a single document).

There are useful SGML and XML modes for Emacs, though for my
purposes (and with my DTD, see below) I just insert the markup
manually or with the help of a few Python scripts I put together to
massage the source texts.
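
As a rough illustration of the kind of script meant here (a sketch
only, not one of the actual scripts): the function below wraps
blank-line-separated paragraphs of plain text in markup. The element
names "text" and "p", the "lang" attribute, and the function name are
all made up for this example and are not taken from the DTD mentioned
above.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def wrap_paragraphs(raw_text, lang):
    """Wrap blank-line-separated paragraphs in hypothetical <p> elements.

    The <text>/<p> element names and the lang attribute are assumptions
    for illustration only.
    """
    # Split on one or more blank lines, then drop empty blocks.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", raw_text)]
    blocks = [b for b in blocks if b]

    lines = ["<text lang=%s>" % quoteattr(lang)]
    for block in blocks:
        # escape() protects &, < and > so the output stays well-formed.
        lines.append("  <p>%s</p>" % escape(block))
    lines.append("</text>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(wrap_paragraphs("a & b\n\nc < d\n", "zh-Hans"))
```

Line breaks inside a paragraph are left alone rather than joined with
spaces, since inserting spaces would be wrong for Chinese text.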

I ruled out XMetaL when it was first released because its vendor
refused to fully support Unicode, which is essential for my purposes.
For your needs this is probably not a problem: eth and thorn (upper-
and lowercase) are both in ISO 8859-1 (Latin-1). If you need yogh,
though, you're out of luck, since it is not in Latin-1.
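
A quick way to check which letters fit in Latin-1 is to try encoding
them. This is a minimal sketch; the helper name fits_latin1 is invented
for the example, and the code points are the standard Unicode ones for
eth, thorn and yogh.

```python
def fits_latin1(ch):
    """Return True if the character can be encoded in ISO 8859-1."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# Standard Unicode code points for the letters discussed above.
LETTERS = {
    "capital eth": "\u00d0",
    "small eth": "\u00f0",
    "capital thorn": "\u00de",
    "small thorn": "\u00fe",
    "capital yogh": "\u021c",
    "small yogh": "\u021d",
}

if __name__ == "__main__":
    for name, ch in LETTERS.items():
        status = "in Latin-1" if fits_latin1(ch) else "NOT in Latin-1"
        print("%-14s U+%04X  %s" % (name, ord(ch), status))
```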

 > 2) The tagging I would like to do (I'm reading up on TEI
 > etc) is a tagging of phrases and clauses, not parts of
 > speech. What's been done on this earlier? Any lists of tags
 > etc?

Take a look at the Corpus Encoding Standard,

    http://www.cs.vassar.edu/CES/

and its XML counterpart, XCES,

    http://www.cs.vassar.edu/XCES

For my purposes I couldn't use these because they lack support for
Eastern languages, and right now I don't need their complexity for my
internal work. So I rolled my own DTD, which works fine for me. In the
long term I would like to move to XCES. Unfortunately, my attempts to
become involved in that effort have gone unanswered.

       -tree

--
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


