[Corpora-List] foreign words in German

chris brew cbrew at acm.org
Thu Sep 29 13:47:03 UTC 2011


>
> Well, that's true, but I don't think any theoretical representations
> are as rich as speakers' internalised ones. We have yet to
> parsimoniously model the reason why 'strong' goes with 'coffee' and
> 'powerful' with 'engine'. A scalar metric of 'foreignness' would be
> more accurate (and probably more useful) than a categorical division,
> even if it failed to capture all the nuances, and probably more useful
> (at least for some tasks) than sticking with a fully multidimensional
> representation.

I agree with all this, except the word "accurate". I fully agree with "useful".
To me, accuracy is about measuring something. An accurate measurement
of an underlying scalar should be one-dimensional, an accurate
measurement of a two-dimensional vector (for example, the velocity of
an object moving in a plane) should be two-dimensional, and so on.
All I am saying is that mental representations are multi-dimensional,
and complicated. So I don't want to say that simplified models of them
are accurate or inaccurate. That would be to compare apples not just
to oranges but to the whole fruit department of the supermarket.


We have great freedom in how we choose to build data-driven models of
the lexicon. But every time that
we use a simplified model, we are focusing attention on some aspect
that we want to model. For some
reason (possibly a deep childhood trauma to do with sounding educated
when reading) discussions of
etymology, for me, evoke the task of using guesses about language
origin to choose how to pronounce a
string of letters like "hearth", "passe", "lexicon", "sandhi" or
"Volkswagen". Matters to me, maybe not so much to others.
For this task, I want to map from words to pronunciation models, which
will, at least to some extent, be indexed by language origin. Neither
a dichotomous decision nor a scalar representing generic foreignness
meets my need.
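To make the idea concrete, here is a minimal sketch of that routing step. The origin guesses and model names are illustrative placeholders, not a real grapheme-to-phoneme system; a serious implementation would replace the table lookup with a classifier over letter substrings.

```python
# Illustrative only: route each word to a pronunciation model keyed by
# a guessed language origin. The origin labels below are hard-coded
# guesses for the example words in the text.

ORIGIN_GUESS = {
    "hearth": "english",
    "passe": "french",
    "lexicon": "greek",
    "sandhi": "sanskrit",
    "volkswagen": "german",
}

def pronunciation_model(word: str) -> str:
    """Pick a pronunciation model indexed by guessed language origin.

    Unknown words fall back to a default model.
    """
    origin = ORIGIN_GUESS.get(word.lower(), "default")
    return f"g2p-{origin}"
```

The point of the sketch is the shape of the mapping: words index into origin-specific models, which is neither a binary foreign/native flag nor a single foreignness score.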

For another task, I might choose a different representation. The
decision would be task-specific. For purposes of measuring the reading
level of a
document, a mere count of the number of words that pass some
threshold of blatant foreignness might be adequate. Or perhaps a
scalar representation would be much better than a dichotomous
one. [and yes, I do know that it is a simplification to pretend that a
document always HAS a reading level]
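The dichotomous version of that reading-level feature can be sketched in a few lines. The threshold here is a deliberately crude placeholder (a few letter sequences rare in native English spelling), not a serious etymological test:

```python
import re

def looks_foreign(word: str) -> bool:
    """Crude, illustrative foreignness threshold: flag words containing
    letter sequences rare in native English spelling. The patterns are
    placeholders chosen for the example, not a real etymological test."""
    return bool(re.search(r"(sch|dh|eux$)", word.lower()))

def blatant_foreign_count(text: str) -> int:
    """Count tokens passing the crude threshold: the dichotomous
    version of the reading-level feature discussed above."""
    words = re.findall(r"[A-Za-z]+", text)
    return sum(looks_foreign(w) for w in words)
```

Swapping `looks_foreign` for a function returning a score, and summing scores instead of counting hits, gives the scalar variant of the same feature.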


For a shared lexical resource it is trickier to work out what to put
in, because the task is vague. In that case I would probably just fall
back on
established practice and provide the same kind of language origin
labels that traditional dictionaries do. Not because these are
perfect, but because
a diverse group of readers have previously found them useful.
