Corpora: A Welsh lexical database and frequency count

Nick Ellis n.ellis at bangor.ac.uk
Wed Jan 16 10:52:14 UTC 2002


Cronfa Electroneg o Gymraeg (CEG)

A 1 million word lexical database and frequency count for Welsh

Please circulate to those interested

This is a word frequency analysis of 1,079,032 words of written Welsh
prose, based on 500 samples of approximately 2000 words each,
selected from a representative range of text types to illustrate
modern (mainly post-1970) Welsh prose writing. It was conceived as
providing a Welsh parallel to the Kucera and Francis analysis for
American English, and the LOB corpus for British English, in the
expectation that such an analysed corpus would provide research tools
for a number of academic disciplines: psychology and
psycholinguistics, child and second language acquisition, general
linguistics, and the linguistics of Modern Welsh, including literary
analysis.

     The sample included novels and short stories, religious writing,
children's literature (both factual and fictional), non-fiction
materials in the fields of education, science, business, leisure
activities, etc., public lectures, newspapers and magazines (both
national and local), reminiscences, academic writing, and general
administrative materials (letters, reports, minutes of meetings).

     The resultant corpus was analysed to produce frequency counts
both of words in their raw form and of lemmas, where each token is
demutated and tagged to its root. This analysis also yields basic
information on the frequencies of different word classes,
inflections, mutations, and other grammatical features.
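
     To illustrate the difference between raw and lemma counts, here
is a minimal sketch in Python. It is not the CEG procedure: the
tiny LEXICON table, the tokenizer, and the function names are all
hypothetical, and real demutation needs a full lexicon plus
disambiguation, since many surface forms are ambiguous.

```python
import re
from collections import Counter

# Hypothetical hand-made lookup: surface form -> (lemma, mutation type).
# A real analysis would need a full lexicon and manual disambiguation.
LEXICON = {
    "cath": ("cath", "none"),
    "gath": ("cath", "soft"),
    "chath": ("cath", "aspirate"),
    "nghath": ("cath", "nasal"),
}


def tokenize(text):
    """Lower-case the text and split it into word tokens, keeping
    apostrophe-joined forms such as "a'i" together."""
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())


def count_frequencies(text):
    """Return three Counters: raw forms, lemmas, and mutation types."""
    raw, lemmas, mutations = Counter(), Counter(), Counter()
    for token in tokenize(text):
        raw[token] += 1
        # Forms missing from the lookup are treated as their own lemma.
        lemma, mutation = LEXICON.get(token, (token, "none"))
        lemmas[lemma] += 1
        mutations[mutation] += 1
    return raw, lemmas, mutations


sample = "y gath a'i chath fy nghath cath"
raw, lemmas, mutations = count_frequencies(sample)
print(raw.most_common())
print(lemmas.most_common())
print(mutations.most_common())
```

     In the raw count the four surface forms of "cath" are separate
entries with one occurrence each; in the lemma count they collapse
into a single entry, and the mutation counter records which mutation
each token carried.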

     Available on-line:

     Ellis, N. C., O'Dochartaigh, C., Hicks, W., Morgan, M., &
Laporte, N.  (2001). Cronfa Electroneg o Gymraeg (CEG): A 1 million
word lexical database and frequency count for Welsh. [On-line],
Available: http://www.bangor.ac.uk/ar/cb/ceg/ceg_eng.html

-------------------------------------------------------------------

Cronfa Electroneg o Gymraeg (CEG)


Cronfa ddata eirfaol o filiwn o eiriau sy'n cyfrif amlder defnydd
geiriau yn y Gymraeg

A wnewch chi gylchredeg hwn i bawb sydd â diddordeb ynddo.




  Mae hwn yn ddadansoddiad amlder geiriau o 1,079,032 o eiriau o
ryddiaith Gymraeg ysgrifenedig, a seiliwyd ar 500 o samplau o tua
2000 o eiriau yr un. Fe'u detholwyd o ystod gynrychioliadol o
destunau rhyddiaith Gymraeg gyfoes (o 1970 ymlaen yn bennaf). Y nod
oedd cynnig rhywbeth cyffelyb i ddadansoddiad Kucera a Francis o
Saesneg Americanaidd, a'r corpws LOB o Saesneg Prydeinig. Y disgwyl
oedd y byddai corpws a ddadansoddwyd fel hyn yn cynnig offer ymchwil
ar gyfer nifer o ddisgyblaethau academaidd:

*	seicoleg a seicoieithyddiaeth
*	plant yn caffael ail iaith
*	ieitheg gyffredinol
*	ieitheg y Gymraeg Cyfoes, gan gynnwys dadansoddi llenyddol.

     Roedd y sampl yn cynnwys:

*	deunyddiau o feysydd nofelau a straeon byrion
*	ysgrifennu crefyddol
*	llenyddiaeth plant (ffeithiol a dychmygol)
*	deunyddiau ym meysydd addysg, gwyddoniaeth, busnes,
gweithgareddau hamdden, etc.
*	darlithoedd cyhoeddus
*	papurau newydd a chylchgronau - cenedlaethol a lleol
*	atgofion
*	ysgrifennu academaidd
*	deunyddiau gweinyddu cyffredinol (yn llythyrau, adroddiadau,
cofnodion cyfarfodydd)

     Dadansoddwyd y corpws i gynhyrchu cyfrifon amlder geiriau yn eu
ffurf grai yn ogystal â chyfrifon o lemata lle mae pob arwydd wedi ei
ddad-dreiglo a'i dagio yn ôl ei wreiddyn. Rhydd y dadansoddiad yma
hefyd wybodaeth sylfaenol am amlder y gwahanol ddosbarthiadau
geiriol, ffurfdroadau, treigliadau a nodweddion gramadegol eraill.

     Ar gael ar-lein:
     Ellis, N. C., O'Dochartaigh, C., Hicks, W., Morgan, M., &
Laporte, N.  (2001). Cronfa Electroneg o Gymraeg (CEG): A 1 million
word lexical database and frequency count for Welsh. [On-line],
Available: http://www.bangor.ac.uk/ar/cb/ceg/ceg_cym.html