new Welsh corpus
Brian MacWhinney
macw at
Fri Nov 24 22:03:26 UTC 2000
Dear Info-CHILDES,
I am happy to announce the addition to CHILDES of a new corpus on the
acquisition of Welsh contributed by Bob Jones of Aberystwyth. These data
represent a large (500 children) sample of children between the ages of 3
and 7 that was collected in the 1970s, but which was just recently
transcribed into CHAT by Dr. Jones and colleagues. The corpus can be found
in the /celtic folder on There are now two Welsh
corpora from Bob Jones. This one is labeled "Welsh2". The readme file
for this corpus is as follows:
This database in Childes format was produced by a project which was funded
by the Economic and
Social Research Council (ESRC) of the UK with an award of £60,611
(R000237978). The project
ran from the 1st of July 1999 until the 30th of June 2000. It was directed
by Bob Morris Jones
and staffed by two researchers, Merris Griffiths and Mared Roberts, in the
Department of
Education, University of Wales, Aberystwyth, Ceredigion SY23 2AX, Wales, UK.
The data is based on the spontaneous recordings of children between the
ages of three and
seven years of age, speaking Welsh. They were recorded in schools
Wales in
undirected play situations, mainly playing in pairs with various toys in a
box of sand. The
children are from different school, socio-economic, regional, and
The original recordings were collected during the period 1974-1977 by a
project which was
located in the same department, funded by the Welsh Office, directed by
Professor C.J. Dodson,
run by Bob Morris Jones, and staffed at various times by Brec'hed Piette,
Hefin Jones,
John Jones, Wyn James, Christine James, and Nesta Dodson.
There are two cohorts: children from three to five, and children from five
to seven. The first
digit in the names of the files which make up the database gives the age of
the children. The
file names of the five year olds of the older cohort are distinguished by
the letter 'a'
after the first digit. The remaining digits complete the file name in all
cases.
The scale of the database can be indicated by the following summary:
three year olds: 25 files (c3001 - c3025), 418kb, 42 children
four year olds: 31 files (c4001 - c4031), 498kb, 62 children
five year olds: 39 files (c5001 - c5039), 859kb, 77 children
five 'a' year olds: 44 files (c5a001 - c5a044), 855kb, 87 children
six year olds: 48 files (c6001 - c6048), 1.00mb, 96 children
seven year olds: 52 files (c7001 - c7052), 1.14mb, 104 children
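The naming convention above can be sketched in Python. This is only an illustration: it assumes bare file names exactly as listed in the summary (no extension), and the names FileInfo, parse_file_name, and cohort are hypothetical helpers, not part of the corpus or of CLAN.

```python
import re
from typing import NamedTuple


class FileInfo(NamedTuple):
    age: int           # first digit of the file name
    five_a: bool       # True for the five year olds of the older cohort
    serial: int        # remaining digits of the file name


# Assumed pattern: 'c', one age digit (3-7), optional 'a', three digits.
_NAME = re.compile(r"c([3-7])(a?)(\d{3})")


def parse_file_name(name: str) -> FileInfo:
    """Split a data file name such as 'c5a044' into its parts."""
    m = _NAME.fullmatch(name)
    if m is None:
        raise ValueError(f"not a recognised data file name: {name!r}")
    age = int(m.group(1))
    five_a = m.group(2) == "a"
    if five_a and age != 5:
        raise ValueError(f"'a' is only used for five year olds: {name!r}")
    return FileInfo(age, five_a, int(m.group(3)))


def cohort(info: FileInfo) -> str:
    """Return '3-5' for the younger cohort and '5-7' for the older one."""
    if info.age < 5 or (info.age == 5 and not info.five_a):
        return "3-5"
    return "5-7"
```

For example, parse_file_name("c5a044") gives age 5 with the 'a' flag set, so cohort() places it in the older cohort, while "c5039" falls in the younger one.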
Personal names, local place-names, and local places of work have been made
anonymous by using random nonsense strings of letters: all begin with an
initial capital, and the place names have a final 0. The names of public
figures, fictional characters, and more distant places have been retained.
Making names anonymous loses some information about word-forms, especially
about mutations (where they occur) and word-play.
The children produced many noises while playing, and some attempt has been
made to transcribe
these, although they are not intended to capture the phonetic details. They
have the suffix
@sn. Nonsense forms, in word-play for instance, have the suffix @gl. Both
are declared in the
00depadd.cut file.
English is also spoken by various children to different degrees in the
database. Single English
words - either by themselves or within a Welsh utterance - are not marked.
But phrases or
sentences of English words are enclosed in scope symbols < ... >, and are
followed by the
comment [% Saesneg] - 'Saesneg' being the Welsh word for 'English'.
Similarly, phrases and sentences which are from songs, nursery rhymes, and
similar material
are enclosed within < ... > and are followed by the comment [% ca:n] -
'ca:n' (or 'cân', to
use the circumflex - see below) is the Welsh for 'song'.
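As a rough illustration of how these scoped passages might be recovered, the Python sketch below pulls out every < ... > span that is followed by [% Saesneg] or [% ca:n]. The function name scoped_spans and the sample utterance in the usage note are invented for the example; the sketch also assumes that scoped spans are not nested.

```python
import re

# A span in angle brackets followed by one of the two comments
# described in this readme. Nested scopes are not handled.
_SCOPED = re.compile(r"<([^<>]+)>\s*\[%\s*(Saesneg|ca:n)\s*\]")

_LABELS = {"Saesneg": "english", "ca:n": "song"}


def scoped_spans(utterance: str) -> list[tuple[str, str]]:
    """Return (label, text) pairs for English and song passages."""
    return [
        (_LABELS[m.group(2)], m.group(1).strip())
        for m in _SCOPED.finditer(utterance)
    ]
```

Applied to an invented line such as "dw i isio <come here> [% Saesneg]", the function returns one pair labelled "english" containing "come here". Single unmarked English words are, by design of the transcription, not found this way.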
Unfinished words (that is, fragments and not shortened words) are indicated
by an initial &.
There are many homonyms, many of which come about through phonological
processes of elision
and assimilation in spontaneous speech. Digits and the apostrophe are used
to distinguish
different word-forms which otherwise have the same spelling. The lexicon
gives the lexeme to
which they belong. The apostrophe is declared in the 00depadd.cut file to
cater for
word-initial occurrences.
In spontaneous speech, patterns of a Welsh copula followed by a personal
subject pronoun
occur as a pronoun only. Such pronouns are indicated by a final apostrophe.
There are
instances, mainly of directive-like utterances within the context of a
where it is
not entirely clear what the pattern is. But these instances have likewise
been given a final apostrophe.
Welsh orthography contains circumflexed letters: 'âêîô' and also
circumflexed 'w' and 'y', for which there is no ASCII provision.
Circumflexed letters are not stable across different applications, as is
well known. Consequently, they are represented as 'a: e: i: o:', a
convention which can then be conveniently extended to 'w:' and 'y:'.
This convention is mainly used where
ambiguity would
otherwise occur. Welsh also makes limited use of the diaeresis and the
diacritics, but
it has not been necessary to cater for these separately.
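The colon convention for circumflexed vowels can be converted mechanically. The following Python sketch is an assumption about how one might do this, not a tool supplied with the corpus; to_circumflex and to_colon are hypothetical names. It is meant for individual word-forms rather than whole transcript lines, since other colons in a CHAT file (for example after speaker codes) have a different function.

```python
import re
import unicodedata

# Vowel letters that take the colon in this database's convention.
_CIRCUMFLEX = {"a": "â", "e": "ê", "i": "î", "o": "ô", "w": "ŵ", "y": "ŷ"}

_COLON = re.compile(r"([aeiowy]):")

# Reverse table for str.translate: circumflexed letter -> letter + colon.
_TO_COLON = {ord(v): k + ":" for k, v in _CIRCUMFLEX.items()}


def to_circumflex(word: str) -> str:
    """Replace 'a:' etc. in a lower-case word-form with the circumflexed letter."""
    return _COLON.sub(lambda m: _CIRCUMFLEX[m.group(1)], word)


def to_colon(word: str) -> str:
    """Replace circumflexed letters with the colon notation."""
    # Normalise first so that decomposed input (letter + combining
    # circumflex) is handled the same way as precomposed input.
    return unicodedata.normalize("NFC", word).translate(_TO_COLON)
```

With these definitions, to_circumflex("ca:n") yields "cân" and to_colon("cân") yields "ca:n". Only lower-case letters are handled, which is a simplification.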
The data files contain utterances by children and adults. The former are
identified as
Target_child or Child on the @Participant header line in the data files;
the latter are
identified as Investigators and Teachers. The utterances of the adults have
been transcribed
in full, but not as painstakingly as those of the children; in particular,
homonyms have not
all been disambiguated through transcription.
The lexicon contains the word-forms produced by the children. It does not
contain word-forms
produced by adult participants. The lexicon contains all the Welsh words and
English words which occur within a Welsh utterance or by themselves. It does
not contain English words which are in English phrases or sentences. It does
not contain proper names or the spellings of noises or nonsense words; these
can be identified in the data by an initial capital, the suffix @sn, and the
suffix @gl, respectively. Neither does it contain xxx (for indecipherable
material) or unfinished fragments which begin with &.
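The exclusions just described can be approximated with a simple token filter. The Python sketch below is illustrative only: classify_token and lexicon_candidates are hypothetical names, the sample tokens in the usage note are invented, and the sketch does not deal with adult utterances or with English phrases inside < ... > [% Saesneg], which would have to be removed beforehand.

```python
def classify_token(token: str) -> str:
    """Label a transcribed token using the markers described in the readme."""
    if token == "xxx":
        return "unintelligible"   # indecipherable material
    if token.startswith("&"):
        return "fragment"         # unfinished word
    if token.endswith("@sn"):
        return "noise"
    if token.endswith("@gl"):
        return "nonsense"
    if token[:1].isupper():
        return "proper_name"      # includes anonymised names
    return "lexical"


def lexicon_candidates(tokens: list[str]) -> list[str]:
    """Keep only the tokens that would be expected in the main lexicon."""
    return [t for t in tokens if classify_token(t) == "lexical"]
```

For an invented token list such as ["brrm@sn", "&ca", "Xyzq", "xxx", "ci", "bloop@gl"], only "ci" is kept.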
The categories and their codes in the lexicon are as follows:
?? = multi-category form which is ambiguous in context
a1 = pro-form place adjuncts like FANNA 'there', FAMA 'here', FANCW 'yonder'
ab = conjuncts and disjuncts like HEFYD 'also', FELLY 'therefore'
ad = other adjuncts
ag = aspect markers YN 'progressive', WEDI 'perfective'
an = adjectives
ar = prepositions
as = adverbs ALLAN 'out', YMLAEN 'onwards', I-FFWRDD 'away', I-LAWR 'down',
at = adverbs beginning with TU - TU-ALLAN 'outside', TU-OL 'behind', etc.
b4 = Welsh finite verb with English inflection
bd = English verbs in "-ed", "-en" or equivalent e.g. 'crashed', 'drunk'
be = verbnoun forms (compare English plain infinitive) including
but not BOD 'be'
bf = finite-verb forms (including the imperative forms) except BOD 'be'
bg = English verbs in "-ing"
bp = English plain infinitive forms
cd = co-ordinating conjunctions
ce = verbnoun (compare English plain infinitive) of BOD 'be'
cf = finite forms of BOD 'be'
cm = MWY 'more' as a comparative particle before adjectives
cn = greetings and farewells
cy = subordinating conjunctions like ACHOS 'because'
eb = standard exclamations like AA 'ah', OO 'oh'
en = nouns
er = the post-modifying words ARALL 'other' and ERAILL 'others'
es = EISIAU 'wants, needs' - a nominal form
g1 = nominal wh- words - BETH 'what', PWY 'who'
g2 = adverbial wh- words - PRYD 'when', PAM 'why', SUT 'how'
g3 = the wh- word PA 'which'
g4 = compounds involving wh- words like BETH+BYNNAG 'whatever', PRYD+BYNNAG
g5 = the wh- word FAINT 'how much/many'
ga = grammatically invariant answer words IE 'yes', NAGE 'no', DO 'yes' and
NADDO 'no'.
gc = the comparative particle NA 'than'
gd = demonstrative words DYNA 'there/that is', DYMA 'here/this is', DACW
'yonder is'
gg = intensifiers like RHY 'too', GO 'fairly', MOR 'so'.
gm = quantifiers like DIGON 'enough', LLAWER 'much/many', MWY 'more'
gr = preverbal particles like MI, FE, NI and focussing particles like MAI,
gt = the predicatival particle YN
ll = pro-form adjuncts YNA 'there', YMA 'here' and ACW 'yonder'
ly = letters of the alphabet
mo = words indicating epistemic modality EFALLAI 'perhaps', HWYRACH
ne = the negator DIM 'no/not' both as quantifier and adverb
on = onomatopoeic-type forms
pa = politeness expressions
pe = determiners
pi = forms of PIAU, used to indicate ownership
qq = for obscure forms
r1 = personal pronouns
r2 = demonstrative pronouns
r3 = indefinite pronouns like RHYWUN 'someone'
r4 = negative pronouns
r5 = reflexive pronouns
r6 = reciprocal pronouns
r7 = conjunctive pronouns like FINNAU 'me too'
r8 = prefixed (possessive) pronouns
r9 = the 'alternative' pronoun LLALL 'other', LLEILL 'others'
rd = RHAID 'must, necessity'
ri = numbers
rp = universal pronouns like PAWB 'everyone'
rq = indefinite phrases like BETH+'NA 'thingie', LLE+'NA, BE+TI'+'N+GALW
'what do you call it'
sg = standard verbal pauses like YMM 'uhm'
sy = standard paralinguistic forms like HY-HY 'uh-uh', MM-MM 'uhm-uhm'
ya = manner-adverbial particle YN e.g. YN GYFLYM 'quickly'
Multi-membership, if found in the corpus, is indicated by the Childes
convention for this, that
is, a backward slash after the first entry, followed on the succeeding
line(s) by another entry.
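A reader for multi-membership entries might look like the Python sketch below. The layout of the lexicon lines is an assumption made for the example (a word-form, whitespace, a two-character category code, and an optional trailing backslash); the readme itself only describes the backslash convention, and parse_lexicon is a hypothetical name. The actual .lex files should be checked before relying on it.

```python
from collections import defaultdict


def parse_lexicon(lines: list[str]) -> dict[str, list[str]]:
    """Collect category codes per word-form, following backslash continuations."""
    entries: dict[str, list[str]] = defaultdict(list)
    current_form = None
    continued = False  # True if the previous line ended with a backslash
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        has_more = line.endswith("\\")
        if has_more:
            line = line[:-1].rstrip()
        fields = line.split()
        if len(fields) == 2:
            form, code = fields
        elif len(fields) == 1 and continued and current_form is not None:
            # Assumed variant: a continuation line that gives only the code.
            form, code = current_form, fields[0]
        else:
            raise ValueError(f"unrecognised lexicon line: {raw!r}")
        entries[form].append(code)
        current_form, continued = form, has_more
    return dict(entries)
```

Given the invented lines "yn ag \", "yn gt \", "yn ya", the form "yn" is returned with the three codes ag, gt, and ya, matching the three categories listed above for YN.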
These categories serve only to identify data which can be recovered for
analysis. They are not
intended to represent probing analyses.
This latter point applies to all transcriptional conventions in this
database - they serve
as ways of recovering data for analysis.
The files supplied for this database are as follows:
data files:    c3001 - c3025
               c4001 - c4031
               c5001 - c5039
               c5a001 - c5a044
               c6001 - c6048
               c7001 - c7052
lexicon files: welsh3_7.lex (the main lexicon)
               gl.lex (nonsense words)
               sn.lex (noises)
others:        00depadd.cut
               00readme.cdc (this file)
Bob Morris Jones (19/8/2000)
e-mail: bmj at
personal homepage:
project homepage: