[Lexicog] help with N-grams

Sun Oct 26 17:55:06 UTC 2008

Hi J.L.,
Thank you for your kind interest.
Please find attached a .txt sample. Each word is presented as a sequence
of three lines:
- Line 1 provides the full graphemic string (English surnames in this 
corpus), followed by its phonemic transcription, in which phonemes are 
separated by a space.
- Lines 2 and 3 provide the graphemic and phonemic alignments. Note that 
conjoined letters are indicated with the "+" sign in graphemic clusters:

---- aagaard  = "eI g A: d ----
    a+a    g    a+a+r    d
    "eI    g    A:    d

In the example above (the English surname Aagaard), the clusters are
<a+a> and <a+a+r>.
Note that primary stress is marked with a double quote < " > and
secondary stress with the percent sign < % >, each placed directly
before the stressed vowel.
Do note that the phonemic symbols are case-sensitive.
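
For anyone who wants to read the sample programmatically, here is a minimal Python sketch of how one three-line record might be parsed. It assumes every record looks exactly like the Aagaard example above (a header wrapped in hyphens with "=" between spelling and transcription, then whitespace-separated alignment slots), and that both alignment lines have the same number of slots. The names AlignedWord, parse_record and SAMPLE are invented for illustration and are not part of the corpus or of any tool. Stress marks stay attached to their phoneme and case is left untouched.

```python
from typing import List, NamedTuple


class AlignedWord(NamedTuple):
    """One three-line record (hypothetical container, for illustration only)."""
    spelling: str
    transcription: List[str]
    graphemes: List[str]   # clusters keep their "+" joins, e.g. "a+a+r"
    phonemes: List[str]    # stress marks stay attached, case is preserved


def parse_record(lines: List[str]) -> AlignedWord:
    """Parse the header line and the two alignment lines of one record."""
    header, grapheme_line, phoneme_line = lines
    # Assumption: the header is wrapped in hyphens and spaces, and no phoneme
    # symbol at the very end of the transcription is itself a hyphen.
    body = header.strip("- ")
    spelling, _, transcription = body.partition("=")
    graphemes = grapheme_line.split()
    phonemes = phoneme_line.split()
    # Assumption: the alignment is one-to-one, slot by slot.
    if len(graphemes) != len(phonemes):
        raise ValueError("misaligned record: " + spelling.strip())
    return AlignedWord(spelling.strip(), transcription.split(), graphemes, phonemes)


# The only record quoted in this message.
SAMPLE = [
    '---- aagaard  = "eI g A: d ----',
    '    a+a    g    a+a+r    d',
    '    "eI    g    A:    d',
]

if __name__ == "__main__":
    word = parse_record(SAMPLE)
    for grapheme, phoneme in zip(word.graphemes, word.phonemes):
        print(grapheme, "->", phoneme)
```

Reading a whole file would then be a matter of feeding the non-blank lines to parse_record three at a time, provided the file really contains nothing but such records.
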
Just ask if there is anything else you need to know.
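
As for the tabulation asked about in the original message (how often a grapheme is realised as a given phoneme depending on the neighbouring graphemes), one possible approach is a trigram count over grapheme clusters. The Python sketch below is only that, a sketch: the function names, the "#" word-boundary symbol and the choice of a one-cluster window on each side are my own assumptions, and the only data used is the Aagaard record. Phoneme symbols are never case-folded, and stress marks can optionally be stripped before counting.

```python
from collections import Counter, defaultdict

STRESS_MARKS = ('"', "%")   # primary and secondary stress
BOUNDARY = "#"              # assumed not to occur as a grapheme


def split_stress(phoneme):
    """Return (stress_mark, bare_phoneme); the mark is '' if there is none."""
    if phoneme[:1] in STRESS_MARKS:
        return phoneme[0], phoneme[1:]
    return "", phoneme


def context_counts(words, strip_stress=False):
    """Count phoneme realisations per (previous, current, next) grapheme.

    `words` is an iterable of (graphemes, phonemes) pairs of equal length.
    """
    counts = defaultdict(Counter)
    for graphemes, phonemes in words:
        padded = [BOUNDARY] + list(graphemes) + [BOUNDARY]
        for i, phoneme in enumerate(phonemes, start=1):
            if strip_stress:
                phoneme = split_stress(phoneme)[1]
            context = (padded[i - 1], padded[i], padded[i + 1])
            counts[context][phoneme] += 1
    return counts


def rates(counter):
    """Turn the raw counts for one context into relative frequencies."""
    total = sum(counter.values())
    return {phoneme: n / total for phoneme, n in counter.items()}


# The single aligned record quoted in this message.
WORD = (["a+a", "g", "a+a+r", "d"], ['"eI', "g", "A:", "d"])

if __name__ == "__main__":
    table = context_counts([WORD], strip_stress=True)
    for context, counter in table.items():
        print(context, rates(counter))
```

With a single record every rate is trivially 1.0; the figures only become informative once the whole corpus is fed in. Widening the window, or counting over phoneme contexts instead, would follow the same pattern.
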
With kind regards,
Marc

J.L. DeLucca wrote:
>
> Hi Marc,
>  
> I have a software tool for doing n-grams (bi-, tri-, tetra- and
> pentagrams), but I know you are looking for something more precise.
> Could you send me a short piece of your database or your text?
>  
> Best for now,
>
> J. L. De Lucca
> Universidad Politécnica de Valencia
> Departamento de Linguistica Aplicada
>
> --- On Sat, 10/25/08, Marc FRYD <marc.fryd at univ-poitiers.fr> wrote:
>
>     From: Marc FRYD <marc.fryd at univ-poitiers.fr>
>     Subject: [Lexicog] help with N-grams
>     To: lexicographylist at yahoogroups.com
>     Date: Saturday, October 25, 2008, 12:49 AM
>
>     Hi all,
>     I wonder if anyone could help a linguist with moderate programming
>     abilities with the following task.
>     I am currently working on a corpus of aligned grapheme-to-phoneme
>     isolated words.
>     I would like to produce an N-gram parsing of both levels of data (the
>     graphemic and the phonemic) with a view to extracting trends favouring
>     particular realisations (i.e. this grapheme will be realised as that
>     phoneme at a rate of x if preceded/followed by such and such
>     graphemes). The db is currently c. 3000 words, but it will keep growing.
>     Cheers,
>     Marc
>
>     -- 
>     Dr. Marc FRYD
>     Senior Lecturer in English Linguistics
>
>     Faculté des Lettres et des Langues
>     Université de Poitiers
>     95 avenue du Recteur Pineau
>     86022, Poitiers, France
>
>     Office: 05 49 45 48 11
>     Cell: 06 76 28 18 50
>
>
>  

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: surnames.alignment_sample.txt
URL: <http://listserv.linguistlist.org/pipermail/lexicography/attachments/20081026/1882cfab/attachment.txt>