[Corpora-List] syllable contact frequency - CELEX

Wed Oct 22 12:51:54 UTC 2008

Dear Katharina,

If you take the CELEX file gpl.cd ("German phonology lemma"), you will 
find the transcription in DISC format in the fourth column. You can then 
use the following UN*X pipeline to extract, e.g., the number of all 
lemmas in which "p" is followed by "t" across a syllable boundary 
(optionally with an accent marker in between):

	cut -d"\\" -f4 gpl.cd | grep "p-'*t" | wc -l

(The result should be 181.)

The first part of the pipeline (cut -d"\\" -f4 gpl.cd) extracts the 
fourth column of the file gpl.cd, using the backslash as the column 
delimiter. The second part (grep "p-'*t") keeps only those lines of the 
extracted column that match a regular expression: "p", then the 
syllable boundary "-", then zero or more accent markers "'", then "t". 
The last part (wc -l) counts the lines that remain, i.e. the lemmas 
that match the regular expression. Simply change the second part to 
suit your search.
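
If you need counts for many segment pairs, the pipeline above could be 
wrapped in a small shell function. This is only a sketch: the function 
name count_contact and the file sample.cd are made up for illustration, 
and the sample lines are invented strings in the same backslash-separated 
layout, not real CELEX entries. It uses grep -c instead of wc -l, which 
gives the same count. Note that the two segments are inserted into the 
regular expression unescaped, so a segment symbol that is special to 
grep (e.g. "$" or "*") would have to be escaped by hand.

```shell
# Hypothetical helper: count the lines (lemmas) in which segment $1 is
# followed by segment $2 across a syllable boundary ("-"), optionally
# with an accent marker ("'") right after the boundary.
# $3 is a backslash-separated file with the transcription in column 4.
count_contact() {
    cut -d'\' -f4 "$3" | grep -c -- "$1-'*$2"
}

# Invented test data (NOT real CELEX lines); only column 4 matters here.
cat > sample.cd <<'EOF'
1\aaa\0\ap-t@
2\bbb\0\ap-'tit
3\ccc\0\apt
4\ddd\0\ap-s
EOF

count_contact p t sample.cd   # prints 2 (lines 1 and 2 match)
```

With the real file you would call it as "count_contact p t gpl.cd", 
which runs the same search as the pipeline above.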

Keep in mind, though, that CELEX is not a corpus but a lexicon. So the 
numbers you get are type frequencies, i.e. they tell you how many 
_lemmas_ listed in CELEX contain your search pattern, not how often 
the pattern occurs in running text.

Hope this helps
Caren.

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora