Corpora: POS tagger
Martin Wynne
martin.wynne at ota.ahds.ac.uk
Tue Apr 30 15:35:57 UTC 2002
I think that, as you suspected, the specification you require is Utopia! I
can see at least four problems with it:
(i) Complete accuracy in automatic procedures is not possible because there
are (in most types of text) a significant number of cases where semantic
ambiguity (mapping onto POS ambiguity) can only be resolved by contextual
knowledge;
(ii) POS-tagging is not a straightforward task with clearly defined
procedures and rules which are accepted and agreed on in the community. Part
of speech categories are based on a bundle of different levels of
categorisation (semantic, syntactic, morphological), and there are many
different theories underlying these systems as well as different ways of
applying the theories. Even if you decide on the theoretical underpinning,
there will be conflicts between different levels of classifying words (e.g.
"it looks like a verb, but it functions like an adjective, yet its meaning
is like a noun"). So tagging can't help being in some senses arbitrary and
inconsistent, so even if you claim >99% accuracy, no-one else will agree
with you. And if you were thinking of having the same or compatible tagsets
for all the languages, then you are multiplying these problems. (This isn't
an argument against POS-tagging per se, rather a plea for clear guidelines
and good documentation);
(iii) You would need not one but four programs for tagging four languages.
While you might get some success with one tagging algorithm, there will be
important resources such as lexicons, morphological rules, transition
probabilities, algorithms for identifying word boundaries, sentence
boundaries, foreign words, proper nouns etc. and these will be different for
each language. And if you want to achieve an extremely high level of
accuracy, it is actually unlikely that the same approach would work for the
four languages you mention. I think you'd get better results by using the
best tagger for each individual language.
(iv) You'll probably find that some of the important resources in this field
don't work under Windows, as they will have been developed under Unix.
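
To make point (ii) concrete, here is a minimal sketch in Python of how a reported accuracy figure depends on whose guidelines supply the reference tags. Everything in it is an assumption made for illustration: the example sentence, the coarse tag names (DET, VERB, NOUN, ADJ) and the two "guidelines" are invented and do not come from any real tagset or tagger.

```python
# Hypothetical illustration: the sentence, tag names and both guidelines
# are invented for this sketch and do not reflect any real tagset.

def agreement(tags_a, tags_b):
    """Return the fraction of token positions where two tag sequences match."""
    if len(tags_a) != len(tags_b):
        raise ValueError("tag sequences must have the same length")
    if not tags_a:
        raise ValueError("tag sequences must not be empty")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)

tokens = ["the", "running", "water", "is", "cold"]

# Guideline A tags by form: "running" looks like a verb participle.
guideline_a = ["DET", "VERB", "NOUN", "VERB", "ADJ"]

# Guideline B tags by function: "running" modifies a noun, like an adjective.
guideline_b = ["DET", "ADJ", "NOUN", "VERB", "ADJ"]

score = agreement(guideline_a, guideline_b)
print(f"Agreement between the two guidelines: {score:.0%}")
```

In this toy case the two defensible analyses of "running" already differ on one token in five, so a tagger that matched guideline A perfectly would score well below 99% when measured against guideline B.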
Sorry if this is all rather negative - good luck!
Best,
Martin
__
Martin Wynne
martin.wynne at ota.ahds.ac.uk
Linguistics Officer
Oxford Text Archive
Oxford University Computing Services
13 Banbury Road
Oxford
UK - OX2 6NN
Tel: +44 1865 283299
Fax: +44 1865 273275
-----
From: Nicole Baumgarten [mailto:se9d030 at rrz.uni-hamburg.de]
Sent: 30 April 2002 15:30
To: corpora at hd.uib.no
Subject: Corpora: POS tagger
Dear all,
does anybody know of an automatic, plus-99 per cent accuracy (utopia?),
unidiosyncratic, easy-to-apply POS tagger that can handle German, English
(French, Spanish) and works in an ordinary Windows environment?
ANY ideas are greatly appreciated!
All the best
Nicole.
------------------------------------------
Nicole Baumgarten
SFB 538 Mehrsprachigkeit
Covert Translation
Max-Brauer-Allee 60
22765 Hamburg
nicole.baumgarten at uni-hamburg.de
++49-40-42838 6453