[Corpora-List] syllable contact frequency - CELEX
Caren Brinckmann
caren at brinckmann.de
Wed Oct 22 12:51:54 UTC 2008
Dear Katharina,
If you take the CELEX file gpl.cd ("German phonology lemma"), you will
find the transcription in DISC format in the fourth column. You can then
use the following UN*X pipeline to extract, e.g., the number of all
lemmas in which "p" is followed by "t" across a syllable boundary (and
possibly an accent marker):
cut -d"\\" -f4 gpl.cd | grep "p-'*t" | wc -l
(The result should be 181.)
The first part of the pipeline (cut -d"\\" -f4 gpl.cd) splits each line
of gpl.cd at the backslashes and extracts the fourth column. The second
part (grep "p-'*t") searches the extracted column with a regular
expression: "p", then the syllable boundary "-", then zero or more
accent markers "'", then "t". The last part (wc -l) counts the number of
lines (i.e. lemmas in this case) that match the regular expression.
Simply change the second part to suit your search.
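If you need counts for several contacts, the pipeline can be wrapped in
a small shell function. This is only a sketch: the name count_contact
and its argument order are my own invention, and it uses grep -c in
place of grep | wc -l, which gives the same line count. It assumes each
segment is a single DISC character. If that character has a special
meaning in regular expressions (for example "$" or "*"), escape it with
a backslash first.

```shell
# Hypothetical helper (not part of CELEX): count the lemmas in which
# segment $2 is followed by segment $3 across a syllable boundary "-",
# optionally with accent markers "'" in between.
# $1 is a CELEX-style file with backslash-separated columns and the
# DISC transcription in column 4 (as in gpl.cd).
count_contact() {
    cut -d'\' -f4 "$1" | grep -c -- "$2-'*$3"
}

# Example call, run only if gpl.cd is in the current directory.
if [ -f gpl.cd ]; then
    count_contact gpl.cd p t
fi
```

Called as "count_contact gpl.cd p t", this should print the same number
as the pipeline above.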
Keep in mind, though, that CELEX is not a corpus but a lexicon. So the
numbers you get are type frequencies, i.e. they tell you how many
_lemmas_ listed in CELEX contain your search pattern.
Hope this helps,
Caren.
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora