Excluding Basque data

Wed Mar 8 10:22:32 UTC 2000

Roz Frank writes:

>  Since the topic of criteria for inclusion and exclusion of Basque data has
>  surfaced again on the list, perhaps the following information will be of
>  interest.

[snip my discussion of cutoff dates]

>  I've prepared the following background summary in order to aid those
>  members of the list who might be somewhat unfamiliar with the source
>  materials for Basque. Hopefully, the summary also will bring into focus
>  some of the problems that inevitably arise when one attempts to choose a
>  "more restrictive early date" for the cut-off, whether that be set at 1600,
>  1700 or even 1800.

>  1) There is essentially no Medieval Basque 'literature' of any kind.

Indeed, apart from a few songs transmitted orally before being finally
written down.

[snip summary of medieval Basque materials]

These medieval materials are more substantial than is sometimes realized.
Michelena's 1964 Textos Arcaicos Vascos is a dense 200-page book summarizing
them.  These materials are of immense linguistic value.  And Michelena
himself was fond of reminding us that the linguistic history of Basque
begins in the 10th century, and not, as is often supposed, with the
literary works of the 16th century.

I can't stress this too strongly.  A great deal of Basque vocabulary,
plus some phonology and morphology, and even a little syntax, is recorded
in those medieval materials.

[snip summary of 16th-century materials, which by the way omits one very
important work: Landucci's Basque dictionary of 1562]

>  Criteria.
>
>  Larry Trask has stated that he would prefer 1600 as the cut-off date for
>  "early attestation".  As one can see from the information given above, if
>  the cut off date is set at 1600, it's pretty slim pickins.

Oh, no -- not at all.

In spite of the modest body of material before 1600, the proportion of
the basic Basque vocabulary recorded in it is very high.

Take a look at Sarasola's 1996 dictionary, which reports dates of first
attestation.  Every page I glance at lists between two and eight words
first recorded before 1600.  The average seems to be about four or five
such words.  So, since the dictionary has 800 pages, that means that the
total of such words must be somewhere around 3000-4000.  Not bad.
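As a rough check on that arithmetic, the estimate can be sketched in a few lines of Python. The function name is purely illustrative, and the per-page figures are the eyeballed ones quoted above, not actual counts from the dictionary.

```python
def estimate_total(pages: int, per_page_low: float, per_page_high: float) -> tuple[int, int]:
    """Return a (low, high) estimate of words first recorded before 1600."""
    return round(pages * per_page_low), round(pages * per_page_high)


# Figures quoted above: 800 pages, an average of about four or five words per page.
typical = estimate_total(800, 4, 5)

# Outer bounds, assuming every page sat at the per-page extremes (two or eight).
extreme = estimate_total(800, 2, 8)

print(typical)  # (3200, 4000)
print(extreme)  # (1600, 6400)
```

The typical range lands close to the "3000-4000" figure; the extreme range only shows how far off the estimate could be if the pages sampled were unrepresentative.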

Of course, there are further words first recorded between 1600 and 1700.
But these are less numerous, and, more importantly, *practically all*
of them are transparent compounds or derivatives, or sometimes obvious loan
words, and hence will be excluded from my list in any case.  In fact, on
browsing through the dictionary just now, I couldn't find *a single word*
first recorded between 1600 and 1700 which was neither obviously
polymorphemic nor an obvious loan word.  I think this is a telling point:
the kind of vocabulary I'm interested in is almost invariably recorded
before 1600, if it's recorded at all.

>  And furthermore, if one applies Larry Trask's other criterion to the same
>  corpus, namely, that for an item to be included in his list it must be
>  recorded early in all dialects or most of the dialects, we're fried
>  (although this might not be the way that Larry intends the dialectal
>  criterion to be applied to the data).

No; it is not.  My criteria are (1) that the word is recorded *somewhere*
before 1600, (2) that it is recorded *at some time* in all or nearly all
dialects, (3) that it is not obviously polymorphemic, and (4) that it does
not appear to be shared with any neighboring languages.
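For readers who find it easier to see the four criteria as a filter, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Entry` fields, the reading of "all or nearly all dialects" as four out of five, and the placeholder entries, which are invented and are not real Basque data. Only the 1785 date and the polymorphemic status of <ikusmen> come from this message.

```python
from dataclasses import dataclass

CUTOFF_YEAR = 1600  # criterion 1: recorded somewhere before this year
MIN_DIALECTS = 4    # assumption: "all or nearly all" read as four of five dialects


@dataclass
class Entry:
    word: str
    first_attested: int          # year of the earliest record, in any dialect
    dialects_attested: int       # dialects with a record at any time
    polymorphemic: bool          # obviously a compound or derivative
    shared_with_neighbors: bool  # appears to be shared with a neighboring language


def is_candidate(e: Entry) -> bool:
    """Apply the four criteria in the order they are stated."""
    return (
        e.first_attested < CUTOFF_YEAR
        and e.dialects_attested >= MIN_DIALECTS
        and not e.polymorphemic
        and not e.shared_with_neighbors
    )


# Placeholder entries with invented values, chosen so that each one
# exercises a different criterion.
entries = [
    Entry("placeholder_1", 1545, 5, False, False),  # passes all four
    Entry("placeholder_2", 1596, 1, False, False),  # too few dialects
    Entry("placeholder_3", 1571, 5, False, True),   # shared with a neighbor
    # Date and polymorphemic status as reported in this message;
    # the dialect count and shared flag are placeholders.
    Entry("ikusmen", 1785, 5, True, False),
]

candidates = [e.word for e in entries if is_candidate(e)]
print(candidates)  # ['placeholder_1']
```

Note that the date test and the dialect test are independent: a single early record in one dialect satisfies the first, and later records elsewhere can satisfy the second.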

>  The database in question simply
>  doesn't provide a wide sampling. In other words, serious difficulties would
>  arise if one were to apply the second criterion of widespread dialectal use
>  to items attested prior to 1600: the two books mentioned above and three
>  collections of proverbs do not cover all the dialects. Thus, a strict
>  application of Larry's second rule would actually eliminate all the words
>  found in these works.

No.  This is a misunderstanding.

>  Again, he may assume that if the word is attested
>  early (prior to 1600) in one or more of the northern dialects that would be
>  sufficient.

It is sufficient for me if the word is recorded *anywhere* before 1600.
There's nothing special about the northern dialects.  And, for historical
work, it would be difficult to name a single more important text than
the Refranes y Sentencias of 1596 -- written in the Bizkaian dialect.

>  Earlier Larry Trask has stated that in his opinion, when finished his list
>  would end up containing only 200 "native" Basque words. He may have stated
>  'a couple hundred' (sorry I don't have the exact citation).

No; I have never named any such precise number, nor any number this small.
What I suggested was "a few hundred" or "several hundred" words.  In fact,
even this estimate is perhaps too cautious, but I'm trying to err on the
safe side until I've done the work.

>  By my
>  calculations, there might be even fewer, unless he means that he would
>  include a word if the word can be attested prior to 1600 in one dialect and then
>  rediscovered in the nineteenth and twentieth centuries in four of the five
>  dialects.

Yes, that's roughly what I mean.

>  Thus, there is the question of how the sample is skewed because of the
>  following facts:
>
>  1) that many works are translations from Latin by clergymen;
>  2) that when the works are not translations they are nonetheless books or
>  treatises written by priests about religious themes; and
>  3) during the 16th and 17th centuries the works represent primarily one
>  dialectal zone, Lapurdi.

But none of these is a problem for me.  All may well be issues in other
kinds of historical work.  For example, they are *certainly* issues in
working on early Basque syntax.  But, for basic vocabulary, they are
simply not a problem.

Even if he's merely translating a religious text, a Basque writer is not
going to translate things like 'hand', 'eye', 'cow', 'two', 'sister' or
'go' with anything other than the ordinary Basque word.  The *additional*
presence of a mass of religious vocabulary is no obstacle.  Nor is the
likely absence of words like 'somersault' or 'otter' a problem, since
these less usual words are almost invariably polymorphemic or borrowed.

Moreover, it is not quite true that our 16th-century texts are
overwhelmingly Lapurdian: the R&S is Bizkaian, and Landucci's
dictionary is Alavese.

Finally, the sheer size of a work is also not necessarily a big issue.
Axular's huge 1643 book Gero contains a total of only about 4300 different words,
and this total includes proper names and derivatives: for example, it
includes <jakin> 'know' and six derivatives of this word, all counted
separately.  And this total is not so much greater than the perhaps
2000 words recorded in the tiny 16th-century collections of proverbs.

There just *isn't* that much Basque vocabulary which is native,
ancient and monomorphemic.  And what there is keeps turning up over
and over again, in text after text, regardless of how big the text
is or of how many other words it contains.

>  With respect to the first point, we note that both of the major books from
>  the 16th century were written by members of the clergy.

But Landucci was not a clergyman, and his 143-page dictionary of 1562
is not religious in nature.  Among other delights, this book contains
the first known occurrences of the words <alu> and <zakil>, which I
shall delicately gloss here as 'vulva' and 'virile member' -- admittedly
words which most Basque clergymen were reluctant to commit to print.

[snip summary of subsequent Basque literature]

>  So if my understanding is correct, Larry Trask's "more restrictive early
>  date" would admit one translation of the New Testament with a catechism and
>  tables for calculating moveable feasts, 1 short book of poems, a letter
>  from Mexico, and three brief collections of proverbs as his data base to
>  which he would add the miscellaneous citations, epigraphs, songs, place
>  names, proper names, a couple of very short word lists compiled by
>  non-Basques and some random words and phrases found in works written in
>  Romance prior to 1700.

Plus Landucci.  But Roz makes it sound as though this were a feeble and
inadequate body of materials.  It is not.  Not only does it contain
thousands of words, it contains practically all of the words which have
any chance of meeting my other criteria.

>  Should such a database be considered a representative sample?

For my purposes, absolutely.

If you doubt this, then accept a little challenge: name me three Basque
words which are first recorded only after 1600, which are attested more
or less throughout the country, which are not transparently polymorphemic,
and which do not appear to be shared with any neighboring languages.
Please report back to me when you've found them. ;-)

>  Keep reading, there are more surprises!!

For who?

>  List members unfamiliar with the highly oral nature of Basque culture might
>  be surprised to know that the date set for the beginning of Basque
>  literature is 1879. And to give people a better idea of just how few texts
>  there really are I would like to reproduce (actually summarize) information
>  in the form of three charts. In their original form the charts also
>  indicate which dialects the works were written in, but I've left that
>  information out. The cut off date for the statistical tabulation is 1879.
>  To qualify as a work, the text had to be a non-periodical and at least 48
>  pages long, a standard definition taken from UNESCO.

[snip tables]

All irrelevant, I'm afraid.

First, whether some text does or does not qualify as "literature" by
somebody's definition is irrelevant.  All that matters is that it is
written in Basque by somebody who knows Basque.  For this purpose, a
laundry list is as good as a novel, except that the laundry list is
likely to be shorter.  In fact, in some respects, a laundry list is
preferable to a novel, since it's more likely to represent ordinary
everyday usage than is a work of self-conscious literature.

Second, while later texts obviously add greatly to the record of
Basque words, they do *not* add significantly to the body of words
I'm interested in: the best candidates for native, ancient and
monomorphemic status.  Finding <ikusmen> '(sense of) vision', 'eyesight'
only in 1785 is of no interest to me: the word is plainly polymorphemic,
and the suffix is borrowed.

>  I believe that these three charts help explain some of the reason that Jon
>  Patrick and I have repeatedly argued in favor of including Azkue's
>  dictionary as a legitimate and necessary addition to any database for
>  Euskera.

No.  Not to *any* database.  There is practically nothing in Azkue which
is relevant to me but which is not more readily available elsewhere.
Remember what I'm doing.

>  In addition, Azkue was meticulous in noting the dialect, even
>  indicating the name of the village, in which he collected the item.
>  Moreover, he utilized some 150 Basque texts as part of his database and he
>  indicates precisely which text each item comes from, citing the entire
>  sentence in which it occurred so that the reader has the contextualization
>  of the entry.

I've already discussed Azkue's virtues and vices on this list.  His
meticulous citing of sources is a big plus, and I make heavy -- but
cautious -- use of him for this purpose.  However, as I've explained
before, Azkue contains many errors, and the dictionary cannot be taken
at face value for historical work.

>  In conclusion, keeping in mind the question of whether religious texts,
>  primarily translations, are appropriate (or the best) data sources for our
>  purposes,

For my purposes, it is the *earliest* texts which are crucial.  I don't
care whether those texts are about religion or about alien abductions.
As long as they are written in Basque, that's all that matters.

[snip passage on the religious nature of early Basque literature]

>  In short, from my point of view, given the nature of the facts set out
>  above, to assign the cut-off point for the database at 1600 is not
>  particularly logical;

Sorry; I can't agree.  This is a fine choice, for my purposes.

>  and it would be only slightly more logical to assign
>  the cut-off to 1700. This is particularly so if we keep in mind that our
>  aim is to reconstruct a stage of the language roughly 2000 years earlier,
>  i.e., prior to its first contacts with the Roman invaders who entered the
>  Peninsula in 218 BC.

Careful.  It makes a big difference just *what* we are trying to reconstruct.

Thanks to Michelena, we already have an excellent reconstruction of the
phoneme system of the Pre-Basque of some 2000 years ago.

Now I want to move on to a reconstruction of the morpheme-structure
constraints applying to monomorphemic lexical items in Pre-Basque.
To do that, I need to identify the best attested candidates for such
lexical items.  And those best candidates, unsurprisingly, are largely
to be found in the basic everyday vocabulary, the words which recur
constantly in Basque texts at all periods.

>  Considering the intended purpose of the database, it
>  is unlikely that any changes that the language would have undergone in the
>  hundred-year period, i.e., from 1599-1699, would affect the outcome of the
>  study in any significant way.

I agree, Roz.  But this is an argument for taking 1600 as the cutoff
date -- now isn't it?

However, we don't have to choose the cutoff date arbitrarily.  As I've
pointed out above, a quick scan of the evidence suggests that preferring
a later date to 1600 is not going to increase the number of relevant
words to any significant extent.

And, once again, let me remind you wearily that I am *not* trying to
identify *all possible* candidates for native and ancient status --
at least, not at this stage.  I am only trying to identify the *best*
candidates.

>  By the way is there a chronological cut-off point for words in Romance
>  languages? Or in Slavic? Just curious.

Depends on the purpose.

Look.  I'm not suggesting that my criteria are the best possible
criteria for doing *any work* on Basque.  I'm only suggesting that
they're the most suitable criteria for *my particular purposes*.
Other tasks will very likely call for different criteria.  But so what?

Larry Trask
COGS
University of Sussex
Brighton BN1 9QH
UK

larryt at cogs.susx.ac.uk