[Corpora-List] XML annotation guidelines

Chris Brew cbrew at ling.ohio-state.edu
Fri Jun 6 15:52:20 UTC 2003


On Fri, Jun 06, 2003 at 09:35:15AM -0400, Simpson, Rita wrote:
>
>    Dear Corporist Colleagues,
>
>    We are in the process of converting our corpus of transcribed
>    academic speech from SGML to XML, and adding additional annotation.
>    Can anyone point us to some standards or (preferably) precedents
>    for XML-ized annotation of:
>
>    1) POS tagging
>    and
>    2) pragmatic markup (e.g., text segments manually identified as
>    'narrative', 'disagreement', 'request', etc.)
>
>    Within the TEI guidelines (P4), we've found some suggestions for the
>    POS tagging, (but nothing yet for something like our pragmatic
>    categories), e.g.
>
>    <s type="sentence">
>       <w ana="at">The</w>
>       <w ana="nn1">victim</w>
>       <m ana="gen">'s</m>
>       <w ana="nn2">friends</w>
>    ...
>    </s>
>
>    But somehow this seems a bit more verbose than it needs to be.
>    Is this format standard, or are there other XML-style annotation
>    formats in use?


1) Yes, it is standard. Why is verbosity a problem? If you want a
compact format, you might choose to define your own. But if you do
that, it is a good idea to also define a systematic,
information-preserving automatic mapping from your compact format to a
specific XML format and back. That way you get the benefit of XML's
tools for transformation and validation, as well as whatever other
benefits you obtain from working with your compact format.
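
To make the idea of a round-trip mapping concrete, here is a minimal
sketch in Python. The compact notation is invented purely for
illustration (each token is written text/tag, and a leading "+" marks
an <m> element rather than a <w>), and the function names
compact_to_xml and xml_to_compact are likewise hypothetical. It
assumes that tokens contain no whitespace and do not themselves begin
with "+".

```python
import xml.etree.ElementTree as ET


def compact_to_xml(line):
    """Build an <s> element from a line such as "The/at victim/nn1 +'s/gen"."""
    s = ET.Element("s", {"type": "sentence"})
    for token in line.split():
        # Invented convention: a leading '+' means <m>, anything else is <w>.
        if token.startswith("+"):
            name, token = "m", token[1:]
        else:
            name = "w"
        # Split on the last '/' so that the text itself may contain '/'.
        text, tag = token.rsplit("/", 1)
        child = ET.SubElement(s, name, {"ana": tag})
        child.text = text
    return s


def xml_to_compact(s):
    """Inverse mapping: turn an <s> element back into the compact line."""
    parts = []
    for child in s:
        prefix = "+" if child.tag == "m" else ""
        parts.append(f"{prefix}{child.text}/{child.get('ana')}")
    return " ".join(parts)


line = "The/at victim/nn1 +'s/gen friends/nn2"
sentence = compact_to_xml(line)
print(ET.tostring(sentence, encoding="unicode"))

# The mapping preserves information if the round trip returns the input.
assert xml_to_compact(sentence) == line
```

A check of this kind, run over the whole corpus, is one way to gain
confidence that nothing is lost in either direction.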

2) You may want to look at the choices made in the part-of-speech
tagging of the British National Corpus. One thing I noticed in your
format is that applying a tool like Edinburgh's textonly to it would
yield either

Thevictim'sfriends...

or

The
victim
's
friends
...

with the difference arising from the choice of whether to interpret the
newlines after the </w> tags as part of the document or not.
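
The two outcomes can be reproduced with a few lines of Python. This is
only a stand-in that uses the standard library's ElementTree, not
Edinburgh's textonly itself, and the function name text_only and its
keep_whitespace switch are invented for the illustration. The sample
omits the indentation and the "..." line of the original example.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<s type="sentence">\n'
    '<w ana="at">The</w>\n'
    '<w ana="nn1">victim</w>\n'
    '<m ana="gen">\'s</m>\n'
    '<w ana="nn2">friends</w>\n'
    '</s>'
)


def text_only(xml_string, keep_whitespace):
    """Return the character data of the document with all tags removed."""
    root = ET.fromstring(xml_string)
    chunks = root.itertext()
    if not keep_whitespace:
        # Treat whitespace-only text between elements as not part of
        # the document, and drop it.
        chunks = (chunk for chunk in chunks if chunk.strip())
    return "".join(chunks)


# Newlines between elements discarded: the words run together.
print(repr(text_only(SAMPLE, keep_whitespace=False)))

# Newlines between elements kept: one token per line.
print(repr(text_only(SAMPLE, keep_whitespace=True)))
```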

If you care about document layout, you may need to do something rather
more complex. The BNC has a plausible solution to this problem (they
include whitespace in the <W> </W> elements), but this complicates the
problem of counting words. Whether that matters to you depends, I
suppose, on what kinds of document you want to represent and why.
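
As a rough illustration of one way the counting problem can show up,
the sketch below puts the whitespace inside the word elements and
then counts word forms. The markup is a simplified, made-up sample in
the style of your example, not actual BNC markup, and the function
name count_words is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Whitespace lives inside the elements, so stripping the tags gives
# back correctly spaced running text.
SAMPLE = (
    '<s type="sentence">'
    '<w ana="at">the </w>'
    '<w ana="nn1">cat </w>'
    '<w ana="vvd">saw </w>'
    '<w ana="at">the</w>'
    '</s>'
)


def count_words(xml_string, strip):
    """Count word forms in <w> elements, optionally trimming whitespace."""
    root = ET.fromstring(xml_string)
    forms = (w.text for w in root.iter("w"))
    if strip:
        forms = (form.strip() for form in forms)
    return Counter(forms)


root = ET.fromstring(SAMPLE)
print(repr("".join(root.itertext())))

# Without trimming, "the " and "the" are counted as different forms.
print(count_words(SAMPLE, strip=False))

# With trimming, both occurrences are counted together.
print(count_words(SAMPLE, strip=True))
```

So the layout is preserved, but every program that counts or compares
word forms has to remember to normalise the whitespace first.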

Chris


