[Corpora-List] annotation of aligned texts

Nancy Ide ide at cs.vassar.edu
Sun Jul 21 23:41:41 UTC 2002


On Thursday, July 18, 2002, at 09:59 AM, pamela forner wrote:
> We are working with parallel texts aligned at word level and we are now 
> facing the problem of encoding the alignment information. We’d like the 
> annotation to be as conformant as possible to XCES standards for 
> parallel texts alignment, but we only found examples at sentence level. 
> Could anybody provide further information about XCES standards or is 
> anybody aware of other accepted conventions for annotation of texts 
> aligned at word level?


It is true that there are examples only for the sentence level in the 
current (CES) documentation. However, we now have on-line (although as 
yet unannounced) a suite of XCES schemas to replace the DTDs. Using 
these, you can link to anything you want to--whether it is tagged (for 
words, this would be with <w> tags as per the XCES doc conventions) or 
not (in which case you use offset information in the xlink). Please have 
a look at the new XCES schemas at http://www.xml-ces.org.
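
As a rough illustration of the two linking modes described above (linking to
tagged <w> elements by id, or to untagged text by offset), here is a minimal
Python sketch. The element and attribute names (linkGrp, link, xtargets,
targType, fromDoc, toDoc), the file names en.xml and it.xml, and the word ids
are all assumptions modeled loosely on CES-style alignment markup. They are
not taken from the XCES schemas, so please check the schemas themselves for
the actual names.

```python
import xml.etree.ElementTree as ET


def build_word_links(pairs, src_doc, tgt_doc):
    """Build a stand-off link group that aligns word ids across two documents.

    `pairs` is a list of (source_ids, target_ids) tuples, so one-to-many and
    many-to-one word alignments can be expressed. All element and attribute
    names here are hypothetical, not confirmed against the XCES schemas.
    """
    grp = ET.Element(
        "linkGrp", {"targType": "w", "fromDoc": src_doc, "toDoc": tgt_doc}
    )
    for src_ids, tgt_ids in pairs:
        # Source ids and target ids are separated by " ; " (an assumed convention).
        targets = " ".join(src_ids) + " ; " + " ".join(tgt_ids)
        ET.SubElement(grp, "link", {"xtargets": targets})
    return grp


def word_offsets(text):
    """Yield (start, end, token) character offsets for whitespace-separated
    tokens, for the case where the text carries no <w> tags to point at."""
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        yield start, end, tok
        pos = end


# Hypothetical alignment: w1 <-> w1, and the two-word unit w2 w3 <-> w2.
pairs = [(["w1"], ["w1"]), (["w2", "w3"], ["w2"])]
grp = build_word_links(pairs, "en.xml", "it.xml")
xml_text = ET.tostring(grp, encoding="unicode")

# Offsets that could stand in for word ids when the text is untagged.
offsets = list(word_offsets("the white house"))
```

Keeping the links in a separate stand-off structure leaves both texts
untouched, and the same structure can carry either word ids or offsets.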

The schemas have not yet been made fully public for two reasons: (1) the 
new schemas for spoken data are not as yet finalized; and (2) there are 
some problems with various XML schema parsers, which are unfortunately 
inconsistent in their ability to handle data encoded according to the W3C 
specs. This means that our use of various features is not always 
accepted by a given parser, and we want to be able to make concrete 
recommendations about parsers etc. before going public. However, the 
XCES schemas as they exist now on the web site are reasonably robust, 
and there should be no problem with "upward compatibility" once we 
announce the official versions.

Please contact me or suderman at cs.vassar.edu (the schema developer) if 
you have any problems with or questions about the schemas--we are 
anxious to help out anyone who is using them!

Nancy Ide

=======================================================

Nancy Ide

Professor and Chair
Department of Computer Science, Vassar College
Poughkeepsie, NY 12604-0520 USA
Tel: +1 845 437-5988 Fax: +1 845 437-7498
ide at cs.vassar.edu

Chercheur Associe
Equipe Langue et Dialogue, LORIA/CNRS
Campus Scientifique - BP 239
54506 Vandoeuvre-les-Nancy FRANCE
Tel: +33 (0)3 83 59 20 47 Fax: +33 (0)3 83 41 30 79
ide at loria.fr

=======================================================