Corpora: Question about a Brown Corpus tag

Dirk Ludtke i60x0378 at ip.media.kyoto-u.ac.jp
Thu Sep 14 07:22:41 UTC 2000


This thread is already a month old, but I still have some open
questions.

-----------------------

on 16 Aug 2000 David Campbell wrote:

> 'Who' and 'That' are tagged by Brown as 'Wh'
> pronouns (WPS) when introducing relative
> clauses, but 'which' retains its determiner
> tag WDT. I am at a loss as to why.

on 17 Aug 2000 Eric S Atwell wrote:

> Some tag definitions in Brown were clearly
> decided by what TAGGIT found computable;
> I *guess* linguistic inconsistencies in tagging
> some words may be down to drawing boundaries on
> grounds of computational tractability rather than
> purely linguistic reasons

on 17 Aug 2000 Andrew Harley wrote:

> This explains how so many taggers can claim 95% or higher success rates!

> I also know taggers that tagged IN as "preposition
> or conjunction" on the same grounds.

------------------------

So what could the linguistic reasons be that Eric mentioned? To me
(with a rather limited linguistic background), the "traditional" criteria
for POS determination look quite arbitrary or, let's say, heuristic.

I cannot, for instance, see any advantage in separating "until" in:
* until tomorrow (preposition)
* until the morning comes (subordinating conjunction)

while not separating "and" in:
* you and me (coordinating conjunction)
* I go and see (coordinating conjunction)

or "with" in:
* to see with a telescope    (preposition)
* the man with the telescope (preposition).

Or why should I call the German "entlang" (along) a PREposition,
even though it follows the noun phrase:
* den Fluss entlang (along the river)

--------------------------

But I am sure that there is theoretical linguistic work on POS
categorization without these kinds of inconsistencies. And I am almost
sure that people who tag corpora think not only about the accuracy of
their results, but also about the needs of future users, or at least
about linguistic credibility.

And therefore I don't understand why connective parts of speech (like
relative pronouns, conjunctions, conjunctive adverbs...) are modelled
so carelessly in all the corpora I have seen so far.

Or are there perhaps approaches I am not aware of?
Or is it perhaps too difficult, or even impossible, to do this well?

--------------------------

Dirk Ludtke

Language Media Lab
Kyoto University


