Corpora: Style guides for hand tagging POS

David A. Campbell campbed at flux.cpmc.columbia.edu
Wed Apr 12 07:10:57 UTC 2000


Hi,
    I've been using the Penn Treebank Project Guidelines for POS tagging
of English text (Beatrice Santorini).  I'm tagging raw (unedited and
uncorrected) text and I've had some problems assigning tags in some
cases:
    1.  Misspellings, especially when the misspelling produces another
English word:  "He was (d)one eating."
    2.  Compound nouns that should be hyphenated, but aren't.  "I had a
follow up yesterday" vs. "I had a follow-up yesterday"
    3.  Tokenization of dates.  Should 3/5/00 be tokenized into 3 / 5 /
00 and each marked up individually or should it be kept as is?
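
To make the two options in item 3 concrete, here is a minimal Python sketch.
The function name, the split_dates flag, and the date pattern are only
illustrative assumptions; they do not come from the Penn Treebank guidelines
or from any existing tokenizer.

```python
import re

# Hypothetical pattern for numeric dates such as 3/5/00 or 12/25/2000.
DATE = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")

def tokenize(text, split_dates):
    """Whitespace-tokenize text, optionally splitting numeric dates."""
    tokens = []
    for word in text.split():
        if split_dates and DATE.fullmatch(word):
            # Option A: numbers and slashes become separate tokens,
            # so each one would receive its own tag.
            tokens.extend(re.findall(r"\d+|/", word))
        else:
            # Option B: the date (like any other word) stays one token.
            tokens.append(word)
    return tokens

print(tokenize("seen on 3/5/00", split_dates=True))
print(tokenize("seen on 3/5/00", split_dates=False))
```

With split_dates=True the date yields five tokens (3 / 5 / 00); with
split_dates=False it stays a single token that would get one tag.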

Can someone point me to a guide for tagging when there are errors in the
text?

Thank you,

David Campbell
Department of Medical Informatics
Columbia University


