[Corpora-List] Tag-set conversion

Detmar Meurers dm at ling.ohio-state.edu
Fri Jan 31 05:51:01 UTC 2003


> > Does anybody know of an existing tool to translate between the BNC C5
> > tag-set and the Penn Tree Bank tag-set?
>
> [...]
> You could alternatively just retag the BNC using a Penn-style tagger, of
> course, given that the BNC data was for the most part automatically tagged.

I'd be very careful there. The 2-million-word BNC core corpus was
hand-corrected, which according to Leech (1997) reduced the error
rate to less than 0.3%. For the full 100-million-word BNC, that guide
reports an error rate of 1.7% (of all words, excluding punctuation
marks). For the BNC2, the "BNC2 POS-tagging Manual" that comes with
the corpus estimates the overall error rate at 1.15% (cf. also the
BNC Tagging Enhancement Project). So simply retagging the data
automatically with a Penn-style tagger is likely to double or triple
your error rate.
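
To put those percentages into absolute numbers, here is a rough
back-of-the-envelope sketch. It assumes the round figure of 100 million
words and the 1.15% BNC2 estimate quoted above; the 2x and 3x factors
are only my estimate from this message, not measured tagger results,
and the function name is just for illustration.

```python
# Rough arithmetic only: the corpus size is the round figure from the
# message, and the 2x/3x factors are an estimate, not measured values.
CORPUS_WORDS = 100_000_000   # "100 million word BNC"
BNC2_ERROR_RATE = 0.0115     # 1.15%, as estimated in the BNC2 POS-tagging Manual


def expected_errors(words: int, error_rate: float) -> int:
    """Expected number of mistagged words at a given per-word error rate."""
    return round(words * error_rate)


for factor in (1, 2, 3):
    rate = BNC2_ERROR_RATE * factor
    errors = expected_errors(CORPUS_WORDS, rate)
    print(f"{factor}x: {rate:.2%} error rate, about {errors:,} mistagged words")
```

Even the existing tagging leaves over a million questionable tags under
these assumptions, so any added error from retagging matters at this
scale.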

Best regards,
Detmar


@Manual{leech:97,
  title        = {A Brief Users' Guide to the Grammatical Tagging of the
                  British National Corpus},
  author       = {Geoffrey Leech},
  organization = {UCREL, Lancaster University},
  year         = 1997,
  note         = {\url{http://www.hcu.ox.ac.uk/BNC/what/gramtag.html}}
}
			
--
Detmar Meurers                              Fax: Int + 614 292-8833
The Ohio State University                   Tel: Int + 614 292-0461
Department of Linguistics                   E-Mail: dm at ling.osu.edu
1712 Neil Avenue, Oxley Hall     Homepage: http://ling.osu.edu/~dm/
Columbus OH 43210-1298, USA    PGP key on web page (use encouraged)

"It is a capital mistake to theorize before one has data. Insensibly
one begins to twist facts to suit theories, instead of theories to
suit facts." Sherlock Holmes in "A Scandal in Bohemia" (A. C. Doyle)


