Corpora: Question about a Brown Corpus tag

E S Atwell eric at comp.leeds.ac.uk
Thu Sep 14 10:37:44 UTC 2000


Dirk,
I can see fairly simple "linguistic common sense criteria" to explain the
distinction you query:

- a preposition introduces a noun phrase
- a subordinating conjunction introduces a subordinate clause
- SOME words can belong to more than one class, e.g. "until" can
introduce both
- but since there are many other words which only introduce NPs (e.g.
"with") or only introduce clauses (e.g. "unless"), we need separate
classes for these 2 cases

- coordinating conjunctions "and", "or" (and arguably "but") can connect 2
words/phrases of virtually any class (and even 2 different classes, e.g.
"until tomorrow and the morning comes"), so you might suggest there should
be separate PoS for NP_coord and Phrase_coord
- but there aren't lots of words (at least in English) which are ONLY
NP_coord or ONLY Phrase_coord, so there's no point in creating separate
classes and saying "and", "or", "but" are all ambiguous between the two.
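
As a rough illustration of the two arguments above, here is a minimal
Python sketch. The word lists are just the examples from this message, and
the dictionaries, the labels "NP", "CLAUSE" and "PHRASE", and the function
name split_is_worthwhile are all hypothetical, not part of the Brown tagset
or any real tagger. It only encodes the rule of thumb that a split into two
classes pays off when each class has at least one word exclusive to it.

```python
# Toy lexicons built from the examples in this message (illustrative only).
# Each word maps to the set of things it can introduce or connect.
INTRODUCES = {
    "with": {"NP"},             # only introduces noun phrases
    "unless": {"CLAUSE"},       # only introduces subordinate clauses
    "until": {"NP", "CLAUSE"},  # can introduce both
}

CONNECTS = {
    "and": {"NP", "PHRASE"},
    "or": {"NP", "PHRASE"},
    "but": {"NP", "PHRASE"},    # "arguably", as noted above
}


def split_is_worthwhile(lexicon, class_a, class_b):
    """Return True if some word is exclusive to class_a and some word is
    exclusive to class_b, i.e. two separate tags would each have members
    that are not simply ambiguous between the two."""
    only_a = [w for w, uses in lexicon.items() if uses == {class_a}]
    only_b = [w for w, uses in lexicon.items() if uses == {class_b}]
    return bool(only_a) and bool(only_b)


def ambiguous_words(lexicon):
    """Return, in sorted order, the words that belong to more than one class."""
    return sorted(w for w, uses in lexicon.items() if len(uses) > 1)


if __name__ == "__main__":
    print(split_is_worthwhile(INTRODUCES, "NP", "CLAUSE"))  # preposition vs. subordinator
    print(split_is_worthwhile(CONNECTS, "NP", "PHRASE"))    # NP_coord vs. Phrase_coord
    print(ambiguous_words(INTRODUCES))
```

With these toy entries the preposition/subordinator split has exclusive
members on both sides, while the NP_coord/Phrase_coord split would leave
every word ambiguous, which is the point made in the two lists above.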

I'm not a German linguist, but my guess about "entlang" is that if you
accept the more general definition "a preposition introduces a noun
phrase" then it's covered; it just happens that PoS-tagging seems to have
got off the ground first for English, and consequently PoS-tagsets for
other languages have adapted English PoS categories and nomenclature.

I'm not really a theoretical linguist at all - I hope there's a
theoretical linguist out there who can give a better explanation than
mine!
      Eric

Eric Atwell, Distributed Multimedia Systems MSc Tutor & SOCRATES Tutor
School of Computing, University of Leeds, LEEDS LS2 9JT
TEL: (44)113-2335430  FAX: (44)113-2335468
WWW: http://www.comp.leeds.ac.uk/eric  EMAIL: eric at comp.leeds.ac.uk

cf:

>
> So what could be the linguistic reasons that Eric was mentioning? For me
> (with a rather limited linguistic background) the "traditional" criteria
> for POS determination look quite arbitrary or let's say heuristic.
>
> I cannot, for instance, see any advantage of separating "until" in:
> * until tomorrow (preposition)
> * until the morning comes (subordinating conjunction)
>
> while not separating "and" in:
> * you and me (coordinating conjunction)
> * I go and see (coordinating conjunction)
>
> or "with" in:
> * to see with a telescope    (preposition)
> * the man with the telescope (preposition).
>
> Or why should I call the German "entlang" (along) a PREposition,
> even if it is behind the noun phrase:
> * den Fluss entlang (along the river)
>
> --------------------------
>
> But, I am sure that there is theoretic linguistic work about POS
> categorization without these kinds of inconsistencies. And I am almost
> sure that people who tag corpora not only think about the accuracy of
> their results, but also about the needs of future users or at least
> about linguistic credibility.
>
> And therefore I don't understand why connective Parts of Speech (like
> relative pronouns, conjunctions, conjunctive adverbs... ) are modelled
> in such a neglectful way in all the corpora I have seen so far.
>
> Or are there maybe approaches I am not aware of?
> Or is it maybe too difficult or even impossible to make it "good"?
>
> --------------------------
>
> Dirk Ludtke
>
> Language Media Lab
> Kyoto University
>
>
>


