Corpora: Style guides for hand tagging POS
David A. Campbell
campbed at flux.cpmc.columbia.edu
Wed Apr 12 07:10:57 UTC 2000
Hi,
I've been using the Penn Treebank Project Guidelines for POS tagging
of English text (Beatrice Santorini). I'm tagging raw (unedited and
uncorrected) text and I've had some problems assigning tags in some
cases:
1. Misspellings, especially when the misspelling is itself another
English word: "He was (d)one eating."
2. Compound nouns that should be hyphenated, but aren't. "I had a
follow up yesterday" vs. "I had a follow-up yesterday"
3. Tokenization of dates. Should 3/5/00 be tokenized into 3 / 5 /
00 with each piece tagged individually, or should it be kept as a
single token?
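To make the two options in item 3 concrete, here is a minimal Python sketch of both tokenizations. The function names and the date pattern are hypothetical and only for illustration. The pattern assumes numeric dates written with slashes; nothing here comes from the Penn Treebank guidelines.

```python
import re

# Assumed pattern (not from any guideline): numeric dates with slashes,
# e.g. 3/5/00 or 12/31/1999.
DATE = r"\d{1,2}/\d{1,2}/\d{2,4}"


def tokenize_keep_dates(text):
    """Option A: keep each numeric date as a single token."""
    # The date alternative comes first so it wins over the generic word rule.
    return re.findall(DATE + r"|\w+|[^\w\s]", text)


def tokenize_split_dates(text):
    """Option B: split dates into separate number and slash tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


if __name__ == "__main__":
    sentence = "Seen on 3/5/00."
    print(tokenize_keep_dates(sentence))
    print(tokenize_split_dates(sentence))
```

With option A the date stays one token and would receive one tag; with option B each number and each slash would need its own tag.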
Can someone point me to a guide for tagging when there are errors in the
text?
Thank you,
David Campbell
Department of Medical Informatics
Columbia University