[Corpora-List] Keyness across Texts

marcor at cce.ufsc.br marcor at cce.ufsc.br
Tue Jul 10 17:15:40 UTC 2007


Quoting Mike Scott <mike at lexically.net>:

Elaborating on Mike's remarks, I supervised a completed MA-level research
project in which the Portuguese preposition *de*, commonly translated as *of*
when out of context, was selected as key by the Keywords tool when comparing a
study corpus of mechanical engineering trainee reports to (what is deemed to
be) a general and quite large corpus of the Portuguese language. The word *de*
was the most frequent word in the wordlists previously generated for each of
the two corpora. It was thus unexpected that it should also be selected as key
in the comparison.
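As a rough illustration of how a word that tops both wordlists can still come
out as key, here is a minimal sketch of a two-corpus log-likelihood
comparison, a statistic often used for keyness in corpus linguistics. The
function name and all the counts are invented for illustration; they are not
the figures from the study, and the Keywords tool may have been run with a
different statistic or different settings.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-corpus log-likelihood (G2) for a single word.

    freq_* are the word's raw counts, size_* the corpus sizes in tokens.
    """
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected counts if the word were equally frequent in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    g2 = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Hypothetical counts: *de* is the top word in both corpora, but makes up
# 8% of the study corpus against 5% of the reference corpus.
study_freq, study_size = 8_000, 100_000
ref_freq, ref_size = 500_000, 10_000_000

score = log_likelihood(study_freq, study_size, ref_freq, ref_size)
print(f"relative frequency, study: {study_freq / study_size:.3f}")
print(f"relative frequency, reference: {ref_freq / ref_size:.3f}")
print(f"log-likelihood: {score:.1f}")
```

The point of the sketch is only that keyness depends on the difference in
relative frequency between the two corpora, not on rank within either
wordlist, so even the most frequent word can stand out.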

The fact that *de* is the most frequent word of the Portuguese language may be
explained on the basis of

1. dates, which in Portuguese are usually written as 14 de outubro de 2006 (14
of October of 2006);

2. noun phrases such as *analista de sistemas* (literally, *analyst of systems*,
corresponding to *systems engineer* in English); unlike in English, nouns
are not routinely used as adjectives in Portuguese, and thus the natural way of
saying *glass door* is *porta de vidro*, literally *door of glass*;

3. a variety of common phrases such as *de carro* (*of car*, meaning *by car*)
and *de verdade* (*of truth*, meaning *real* or *true* in phrases such as *uma
cientista de verdade*, *a-FEM scientist of truth*, meaning *a-FEM true
scientist*).

Together, these uses make *de* the most frequent word in Portuguese corpora
in general.

Further investigation of corpus material revealed that the engineering reports
showed a substantially higher relative frequency of *de* because there was a
higher incidence of technical terms which were noun phrases with a *N de N* or
*N de N de N* structure. This allowed us to characterise the typical technical
term in Portuguese as having this structure and to propose, although this has
not been tested, that this information might be used to identify automatically
both candidate technical terms in Portuguese texts and candidate technical
texts, as distinguished from other text types.
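To make the (untested) proposal a little more concrete, here is a minimal
sketch of how *N de N* and *N de N de N* candidates might be pulled out of raw
text. It is only an assumption about how one could start: the function name
and the sample sentence are invented, no part-of-speech tagging is done (any
alphabetic word counts as an "N"), and contracted forms such as *do* and *da*
are ignored, so it will overgenerate and would need filtering.

```python
import re


def de_chains(text):
    """Return word chains of the form 'X de Y' or 'X de Y de Z ...'.

    Crude approximation: X, Y, Z are any alphabetic tokens other than 'de'.
    Numbers and punctuation break a chain, so dates such as
    '14 de outubro de 2006' are not returned.
    """
    # Keep numbers and punctuation as tokens so they can act as boundaries.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())

    def is_slot(token):
        return token.isalpha() and token != "de"

    chains = []
    i = 0
    while i + 2 < len(tokens):
        if is_slot(tokens[i]) and tokens[i + 1] == "de" and is_slot(tokens[i + 2]):
            j = i + 2
            # Extend the chain while another 'de' + word follows.
            while (j + 2 < len(tokens) and tokens[j + 1] == "de"
                   and is_slot(tokens[j + 2])):
                j += 2
            chains.append(" ".join(tokens[i:j + 1]))
            i = j + 1
        else:
            i += 1
    return chains


# Invented sample sentence, not taken from the engineering reports.
sample = ("O cano de ferro e o pino de sustentação da bomba de óleo de motor. "
          "Em 14 de outubro de 2006.")
print(de_chains(sample))
```

Counting such chains per thousand tokens in each text would be one simple way
to try out the idea of separating technical from non-technical texts, but that
step is a suggestion of mine here and not something the study carried out.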

It also yielded results regarding the semantics of *de* in these noun phrases,
summarised in a list which included:

cano de ferro (*pipe of iron*, meaning *iron pipe*, material which some object
is made of)
pino de sustentação (*pin of sustainment*, meaning *supporting pin*, purpose of
an object)
máquina de qualidade (*machine of quality*, meaning *(good-)quality machine*,
evaluation of an object)

and a variety of other semantic relations.

None of this would have been easily detected without the selection of *de* as
key.

Best,

Marco Rocha

> I agree with Jin-Dong Kim's points 99% -- with one little proviso,
> namely that "verbs like 'be' or 'observed' as keywords which will be
> hardly accepted as keywords" depends on what one wants to accept, so I
> am less confident of Kim's "hardly".
> In Siena recently at a conference on keyness, those present considered
> suggested possibilities that a) a key word (or phrase) must definitely
> be a noun, and b) that a key word definitely could not be a function
> word (like "the" or "do"). My own position was that a machine-generated
> key word can be a word like "do" or "it", that when it is such a word
> (and I agree a human would never consider them as potentially key) it is
> likely to be extremely interesting and to merit further investigation as
> to why it has stood out. In that way, "be" could be key of a certain
> text or set of texts and could actually point not directly but
> indirectly to aboutness.
> BE is not as "about-y" a word as ELEPHANT, because I cannot picture BE
> but I can imagine an elephant -- but in any case to decide that ELEPHANT
> reflects aboutness surely is to assume a dodgy kind of naive semantics,
> rather like BACHELOR being +MALE -MARRIED etc.
> I am happy to agree that BE cannot point straight to "be-ness", whatever
> that might be, but it could point to some other pattern involving "be"
> which might well tell us what the focus texts were about, as my
> Shakespeare examples involving DO in Othello can.
>
> Cheers -- Mike
>
>



