LL-L "Resources" 2003.10.04 (10) [E]
Lowlands-L
lowlands-l at lowlands-l.net
Sun Oct 5 21:55:35 UTC 2003
======================================================================
L O W L A N D S - L * 05.OCT.2003 (10) * ISSN 189-5582 * LCSN 96-4226
http://www.lowlands-l.net * lowlands-l at lowlands-l.net
Rules & Guidelines: http://www.lowlands-l.net/index.php?page=rules
Posting Address: lowlands-l at listserv.linguistlist.org
Server Manual: http://www.lsoft.com/manuals/1.8c/userindex.html
Archives: http://listserv.linguistlist.org/archives/lowlands-l.html
Encoding: Unicode (UTF-8) [Please switch your view mode to it.]
=======================================================================
You have received this because you have been subscribed upon request.
To unsubscribe, please send the command "signoff lowlands-l" as message
text from the same account to listserv at listserv.linguistlist.org or
sign off at http://linguistlist.org/subscribing/sub-lowlands-l.html.
=======================================================================
A=Afrikaans Ap=Appalachian B=Brabantish D=Dutch E=English F=Frisian
L=Limburgish LS=Lowlands Saxon (Low German) N=Northumbrian
S=Scots Sh=Shetlandic V=(West)Flemish Z=Zeelandic (Zeêuws)
=======================================================================
From: Kenneth Rohde Christiansen <kenneth at gnu.org>
Subject: LL-L "Resources" 2003.10.04 (08) [E]
You will find some links here:
http://www.pbk.dk/~pbk1807/dialecten.htm
http://www.pbk.dk/~pbk1807/Oostfreesk.html
These pages are only located here temporarily and might disappear at any
time - so copy them if you want to keep them.
Please don't comment on the correctness of the information available on
the first site - it is not finished.
Cheers, Kenneth
----------
From: Sandy Fleming [sandy at scotstext.org]
Subject: "Resources"
> From: R. F. Hahn <sassisch at yahoo.com>
> Subject: Resources
>
> Jan,
>
> I'd be happy to do some test driving and might be able to contribute a few
> URLs.
>
> How about coming up with an automatic generator of alternative spelling,
> for instance written in Perl, PHP or such? Such an "engine" would be very
> useful for many of us, for instance if it were available online for
> automated transformation into different orthographic systems (which sounds
> like a project right there). Maybe Sandy, Mathieu and others could act as
> advisers on that.
Below is a Perl subroutine I wrote a few months ago to enable me to match
variant spellings in collections of Scots proverbs.
The basic idea (which works well in Scots but may need to be re-hashed for
Lowland Saxon) is to:
o spell all high and secondary vowels as <i>;
o spell all low vowels as <a>;
o spell all rounded vowels as <o>;
o spell all diphthongs as <y>;
o reduce all double letters to a single letter;
o alter consonants to bring them closer to a phonemic spelling.
This results in, for example, the words of:
"Wee sleekit, couerin, timrous beastie."
and
"Wee sleikit, cowerin, timrous baesty"
both being returned as:
"wi slikt koirn timros bisti"
when each word is passed in on its own, without punctuation, at a fuzziness
of 1 (the subroutine's second parameter; at the default of 2 the rounded
vowels are turned into "i" as well, giving "kiirn" and "timris"). Many words
may match where a few fail. For example, if one writer had written
"timorous", that word would fail to match ("timoros" would be returned), but
I find I get a pretty high strike rate matching one word at a time. Note
that this form of the subroutine will only work properly if you pass in one
word at a time, since several of the substitutions are anchored to the start
or end of the string.
So here it is (further notes on it below):
sub match_scots {
local $_ = shift; # Tak the parameter string...
$_ = "\L$_"; # ...an pit it intae wee letters.
my $fuzziness = shift || 2; # A guid level for maist applications.
# We'll uize capital Y for the consonant form o y whaur we can deteck it.
# We'll uize capital C for the tch sound whaur we can deteck it.
my $hard = '[^ei]';
s/([bcdfghjklmnpqrstvwxz])i([bcdfghjklmnpqrstvwxz][ie])/$1y$2/g;
s/(.)\1+/$1/g;
s/ful$/fu/;
s/'d$/t/;
s/'//g;
s/c($hard)/k$1/g;
s/kh/ch/g;
s/c$/k/;
s/ck/k/g;
s/dg/j/g;
s/qu/q/g;
s/qh/wh/g;
s/nch/nsh/g;
s/rch/rC/g;
s/tch/C/;
s/^ch/C/;
s/^gh/g/;
s/^y/Y/;
s/y(e?u(k|ch))/$1/g;
s/([aeiouy][bcdfghjklmnpqrstvwxYz]+)(ed|it)$/$1t/;
s/^([bcdfghjklmnpqrstvwxYz]+)a?e$/$1i/;
s/e$//;
s/^([bcdfghjklmnpqrstvwxz]+)y$/$1Y/;
s/yi/Yi/g;
s/oi/oy/g;
s/o[uw]/o/g;
s/y[aeiou]+/y/g;
s/[aeiou]+y/y/g;
s/y$/i/;
s/(ea|ae)/i/g;
s/a[aouw]+/a/g;
s/[eu]/i/g;
s/iw/i/g;
s/ph/f/g;
s/gh/ch/g;
s/([gck])w/$1/g;
s/^sch/sk/;
s/nd/n/g;
s/mb/m/g;
s/([aeiouy][bcdfghjklmnpqrstvwxYz]+)i([lmn])$/$1$2/;
s/(.)\1+/$1/g;
s/[ou]/i/g if ($fuzziness >= 2); # match roondit vowels wi hiegh.
s/a/i/g if ($fuzziness >= 3); # match heigh an roondit vowels wi laich.
s/y/i/g if ($fuzziness >= 4); # match aa vowels wi diphthongs.
return $_;
}
This takes a Scots word and returns it with the "fuzzy" spelling. If you
call it first with the user's spelling and then the spelling from the
searched Web page, and then compare the results, you should very often
succeed in matching the same word with variant spellings. For example, some
common spellings of my name in Scots are "Sandy", "Sandie", "Sanny" and
"Sawnie". This subroutine returns the string "sani" in every case, and so
succeeds in matching all the spellings.
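The compare-the-results step described above could be sketched roughly as
follows. This is only an illustration: build_index and lookup are
hypothetical helper names, and toy_key is a deliberately tiny stand-in for
match_scots so that the sketch runs on its own. In real use you would pass
a reference to match_scots instead.

```perl
use strict;
use warnings;

# Hypothetical stand-in for match_scots, so this sketch is self-contained:
# lower-case, collapse doubled letters, turn a final "ie" or "y" into "i".
sub toy_key {
    my ($word) = @_;
    my $key = lc $word;
    $key =~ s/(.)\1+/$1/g;
    $key =~ s/(?:ie|y)$/i/;
    return $key;
}

# Build a hash from fuzzy key to the list of spellings that share that key.
sub build_index {
    my ($keyer, @words) = @_;
    my %index;
    for my $word (@words) {
        push @{ $index{ $keyer->($word) } }, $word;
    }
    return \%index;
}

# Return every indexed spelling whose key equals the key of the query word.
sub lookup {
    my ($keyer, $index, $query) = @_;
    return @{ $index->{ $keyer->($query) } || [] };
}

# Words from the "searched page", then a user's spelling to look up.
my $index = build_index(\&toy_key, qw(Sanny bonnie muckle bonny));
print join(', ', lookup(\&toy_key, $index, 'Bonie')), "\n";
```

Indexing the searched text once and looking up each query by key avoids
re-running the substitutions on every page word for every query.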
Since I haven't commented the substitutions, here's an explanation of what
they do:
my $hard = '[^ei]';
sets up a regular expression matching all letters except those which
"soften" a preceding "c".
s/([bcdfghjklmnpqrstvwxz])i([bcdfghjklmnpqrstvwxz][ie])/$1y$2/g;
replaces (consonant)i(consonant) with (consonant)y(consonant) when an "e" or
"i" follows, eg "time" becomes "tyme".
s/(.)\1+/$1/g;
reduces all double (or even multiple) letters to single letters.
s/ful$/fu/;
replaces eg "awful" with "awfu".
s/'d$/t/;
replaces eg "stop'd" with "stopt".
s/'//g;
gets rid of all remaining apostrophes.
s/c($hard)/k$1/g;
replaces "hard c" with "k".
s/kh/ch/g;
undoes the previous substitution, restoring "k" to "c", when followed by an
"h".
s/c$/k/;
replaces "c" with "k" at the end of a word.
s/ck/k/g;
s/dg/j/g;
s/qu/q/g;
s/qh/wh/g;
I think the above are obvious enough.
s/nch/nsh/g;
replaces, eg, "french" with "frensh"
s/rch/rC/g;
s/tch/C/;
s/^ch/C/;
using a capital "C" to represent the "/tS/" sound after "r", "t" and at the
beginning of a word.
s/^gh/g/;
turning words like "ghaist" into "gaist".
s/^y/Y/;
if a "y" is at the beginning of a word, we assume it's a consonant.
s/y(e?u(k|ch))/$1/g;
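removes the "y" from, eg, "hyeuk" and "hyuk" (giving "heuk" and "huk") and
from "hyeuch" and "hyuch" (giving "heuch" and "huch"), so that once the
later vowel substitutions have run they match "heuk" and "heuch".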
s/([aeiouy][bcdfghjklmnpqrstvwxYz]+)(ed|it)$/$1t/;
changes -it and -ed endings into -t as long as there's more than one
syllable in the word.
s/^([bcdfghjklmnpqrstvwxYz]+)a?e$/$1i/;
changes eg "dae", "be", "we", "tae" into "di", "bi", "wi", "ti" to save them
from the following substitution.
s/e$//;
gets rid of all "e"s at the end of a word.
s/^([bcdfghjklmnpqrstvwxz]+)y$/$1Y/;
s/yi/Yi/g;
deals with vowel "y" in some special positions by pretending it's a
consonant.
s/oi/oy/g;
because we're going to use "y" to represent all diphthongs.
s/o[uw]/o/g;
because we're going to represent all rounded vowels as "o" we change "ou"
and "ow" to "o".
s/y[aeiou]+/y/g;
s/[aeiou]+y/y/g;
turning remaining diphthongs into vowel-y's.
s/y$/i/;
...but if it's at the end of a word, make it "i".
s/(ea|ae)/i/g;
because we want to represent all high vowels as "i".
s/a[aouw]+/a/g;
...and all low vowels as "a".
s/[eu]/i/g;
it works best in Scots to treat secondary vowels as high vowels.
s/iw/i/g;
since "ew" will have become "iw" by now and we want to treat it as a high
vowel.
s/ph/f/g;
s/gh/ch/g;
obvious enough.
s/([gck])w/$1/g;
deals with some features of dialects around Aberdeen, deleting the "w" in,
eg, "skweel", "gweed", "cweet" (note that I haven't tried to take this
further by changing "wh" to "f" - might be useful as an option).
s/^sch/sk/;
eg "schuil" -> "skuil"
s/nd/n/g;
s/mb/m/g;
obvious if you're familiar enough with Scots.
s/([aeiouy][bcdfghjklmnpqrstvwxYz]+)i([lmn])$/$1$2/;
deletes the "i" from before syllabic consonants, so that the spellings, eg
"muckle" and "mukkil" will match.
s/(.)\1+/$1/g;
once again deletes all duplicated characters, since some more will have
built up by now.
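Because the order of the substitutions matters, it can help to watch one
word pass through them. The sketch below is an illustration only: it uses
just five of the substitutions explained above, in their original order,
and trace_word and the rule labels are hypothetical names. It records the
string after every rule that changes it.

```perl
use strict;
use warnings;

# A SUBSET of the substitutions explained above, in their original order.
# Each rule edits its argument in place through the @_ alias.
my @rules = (
    [ 'collapse doubles',   sub { $_[0] =~ s/(.)\1+/$1/g } ],
    [ 'drop final e',       sub { $_[0] =~ s/e$// } ],
    [ 'final y -> i',       sub { $_[0] =~ s/y$/i/ } ],
    [ 'a + a/o/u/w -> a',   sub { $_[0] =~ s/a[aouw]+/a/g } ],
    [ 'nd -> n',            sub { $_[0] =~ s/nd/n/g } ],
);

# Hypothetical helper: run one word through the rules, noting each change.
sub trace_word {
    my ($word) = @_;
    my $s     = lc $word;
    my @steps = ($s);
    for my $rule (@rules) {
        my ($label, $apply) = @$rule;
        my $before = $s;
        $apply->($s);
        push @steps, "$label: $s" if $s ne $before;
    }
    return ($s, @steps);
}

for my $name (qw(Sandy Sandie Sanny Sawnie)) {
    my ($key, @steps) = trace_word($name);
    print join(' -> ', @steps), "\n";
}
```

With only these five rules, all four spellings of the name already end up
as "sani". The full subroutine applies many more rules than this subset.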
At this point the string can be used for matching with another
similarly-treated string, but the second parameter to the subroutine can be
used to increase fuzziness and get more matches:
s/[ou]/i/g if ($fuzziness >= 2); # match roondit vowels wi hiegh.
s/a/i/g if ($fuzziness >= 3); # match heigh an roondit vowels wi laich.
s/y/i/g if ($fuzziness >= 4); # match aa vowels wi diphthongs.
The default fuzziness is "2" since this gives the best results in most
cases.
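To see what each fuzziness level adds, the three optional substitutions can
be tried on their own against strings that have already been through the
earlier steps. This is a sketch only: widen is a hypothetical name, and
"ban", "bon" and "byn" are made-up keys chosen to show one vowel class
each.

```perl
use strict;
use warnings;

# Apply only the optional widening steps from the end of the subroutine.
# The input is assumed to have been through the earlier substitutions.
sub widen {
    my ($key, $fuzziness) = @_;
    $fuzziness ||= 2;                       # same default as in the post
    $key =~ s/[ou]/i/g if $fuzziness >= 2;  # rounded vowels join the high ones
    $key =~ s/a/i/g    if $fuzziness >= 3;  # low vowels join them as well
    $key =~ s/y/i/g    if $fuzziness >= 4;  # diphthongs join them as well
    return $key;
}

for my $level (1 .. 4) {
    printf "%d: %s %s %s\n", $level,
        widen('ban', $level), widen('bon', $level), widen('byn', $level);
}
```

Each level folds one more vowel class into "i", so higher levels find more
matches at the cost of more false ones.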
As you see, it's all very heuristic and takes an understanding of how the
language works in different orthographies and dialects. It takes a lot of
time and testing to figure out a good scheme, but I've used this one in
Scots and it gets good results. It can never be perfect however, since
people sometimes spell different words the same way, but it's pretty good. I
would imagine the same principles could be adapted to Lowland Saxon.
Hmmmm... maybe I ought to add a few lines to replace "-ing" with "-in" where
required!
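One possible shape for such a rule is sketched below. It is a guess at what
"where required" might mean, not part of the subroutine above: ing_to_in is
a hypothetical name, and the assumption is that "-ing" should only be
reduced when a vowel comes earlier in the word, so that one-syllable words
such as "king" are left alone. It would presumably need to run early, while
the original vowels are still in place.

```perl
use strict;
use warnings;

# Hypothetical extra rule: turn a final "-ing" into "-in", but only when a
# vowel (plus any consonants) comes before it, so "king" stays as it is.
sub ing_to_in {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/([aeiouy][bcdfghjklmnpqrstvwxz]*)ing$/$1in/;
    return $word;
}

print ing_to_in($_), "\n" for qw(walking walkin king);
```

Requiring an earlier vowel is a crude stand-in for "more than one
syllable", in the same spirit as the -it/-ed rule.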
Sandy
http://scotstext.org/
================================END===================================
* Please submit postings to lowlands-l at listserv.linguistlist.org.
* Postings will be displayed unedited in digest form.
* Please display only the relevant parts of quotes in your replies.
=======================================================================