LL-L "Resources" 2003.10.06 (04) [E]

Lowlands-L lowlands-l at lowlands-l.net
Mon Oct 6 16:43:55 UTC 2003


======================================================================
L O W L A N D S - L * 06.OCT.2003 (04) * ISSN 189-5582 * LCSN 96-4226
http://www.lowlands-l.net * lowlands-l at lowlands-l.net
Rules & Guidelines: http://www.lowlands-l.net/index.php?page=rules
Posting Address: lowlands-l at listserv.linguistlist.org
Server Manual: http://www.lsoft.com/manuals/1.8c/userindex.html
Archives: http://listserv.linguistlist.org/archives/lowlands-l.html
Encoding: Unicode (UTF-8) [Please switch your view mode to it.]
=======================================================================
You have received this because you have been subscribed upon request.
To unsubscribe, please send the command "signoff lowlands-l" as message
text from the same account to listserv at listserv.linguistlist.org or
sign off at http://linguistlist.org/subscribing/sub-lowlands-l.html.
=======================================================================
A=Afrikaans Ap=Appalachian B=Brabantish D=Dutch E=English F=Frisian
L=Limburgish LS=Lowlands Saxon (Low German) N=Northumbrian
S=Scots Sh=Shetlandic V=(West)Flemish Z=Zeelandic (Zeêuws)
=======================================================================

From: Sandy Fleming [sandy at scotstext.org]
Subject: "Resources"

> From: Kenneth Rohde Christiansen <kenneth at gnu.org>
> Subject: LL-L "Resources" 2003.10.04 (10) [E]
>
> This seems like a very expensive function (lots of regular expressions).
> Does that scale well? I know the regex implementation in Perl is
> excellent, but how many words can you check per minute?
>
> Have you done tests with a simple fuzzy matching algorithm? It would be
> nice to know the result. Here is a bit of C code - I can port it to Perl
> if you want.

Is it important to get the wrong answers quickly, then?  :)

I'm familiar with the various fuzzy matching algorithms but I don't see how
a generalised fuzzy matching algorithm can get anything like good results
when we're dealing with texts in a language with many variant spellings.

For example, in Scots you get writers with attitudes to orthography ranging
from "scholarly" to "naive" to "exotic", so you find "Sandy" written as:

"Sandy", "Sawnie", "Sanny" or "Sandie"

and "Santa" as:

"Santa", "Saunta", "Santae" ir "Santy".

The method I gave matches all the members within each of the two sets while
still finding a mismatch between words from different sets.

This is only possible for an algorithm that "knows" something about Scots
phonology and orthography. For example, the above works because it knows
that a <t> is not a <d> in Scots internally to a word, but that <nd> can be
thought of as an alternative spelling of <n>; that a <y> is the same as an
<ie> at the end of a word, depending on the writer; and that <a>, <aw> and
<au> are easily confused by writers. It can match words that have few
letters in common, such as <schuil> and <skweel>, yet distinguish words
that look similar but aren't the same word, such as <pairch> and <painch>.

Yes, a function that has to know all this will be expensive, but that's not
a problem if you don't call it too often! Since this function can precompile
words for matching, when used with indexing as Jan was suggesting, it should
be faster than an on-the-fly fuzzy matching architecture: the words can be
stored in their fuzzy form in the index, and then exact matching can be
used. Only a few keywords (the ones entered by the user) need to be run
through the algorithm live.
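
As a sketch of what I mean (the names are illustrative, and I'm assuming a
fuzzifying routine like the normalise_scots above):

my %index;   # fuzzy form => pages that contain a word with that form

sub index_word
{
    my ($word, $page) = @_;
    push @{ $index{ normalise_scots($word) } }, $page;
}

sub search
{
    my ($keyword) = @_;
    # only the user's keyword is run through the algorithm live;
    # everything in the index is already stored in its fuzzy form
    my $pages = $index{ normalise_scots($keyword) };
    return $pages ? @$pages : ();
}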

This works just as well in smaller applications as in larger ones. For
example, I wrote this to compare a list of thousands of Scots proverbs, to
weed out duplicates and near-duplicates. Since each proverb has to be
compared with every other, I needed to do hundreds of millions of
comparisons (hundreds rather than tens, since there's at least a handful of
words in each proverb). This doesn't mean I needed to do hundreds of
millions of fuzzy matches - I just stored the proverbs in their fuzzy
spelling and then did exact matches - so while hundreds of millions of
comparisons had to be done, I only had to run the subroutine a few tens of
thousands of times to do it. This was fast enough for the purpose, although
in Perl you could make the subroutine itself much faster by requesting that
the regular expressions be precompiled.
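
In outline it worked something like this (simplified, and again reusing
the illustrative normalise_scots):

my @proverbs = <STDIN>;   # one proverb per line
chomp @proverbs;

# precompute: each proverb goes through the fuzzifier exactly once
my @fuzzy = map {
    join ' ', map { normalise_scots($_) } split ' ', lc $_;
} @proverbs;

# compare: the pairwise comparisons are now cheap exact string tests,
# with no regular expression work at all
for my $a (0 .. $#fuzzy - 1)
{
    for my $b ($a + 1 .. $#fuzzy)
    {
        print "possible duplicates:\n  $proverbs[$a]\n  $proverbs[$b]\n"
            if $fuzzy[$a] eq $fuzzy[$b];
    }
}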

The system can be made much faster still - for example, a cache can be used
so that each word only has to be run through the algorithm once. Thus when
it comes to indexing web pages, you should reach a point where the
subroutine itself only needs to be invoked a few times to store an entire
page, most of the "fuzzying" being done by lookup in the cache.
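
The cache itself is just a hash in front of the subroutine, along these
lines (names illustrative again):

my %cache;

sub fuzzify_cached
{
    my ($word) = @_;
    # each distinct word is run through the expensive routine only once
    $cache{$word} = normalise_scots($word)
        unless exists $cache{$word};
    return $cache{$word};
}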

So yes, it's scalable!

I hope this helps Jan, although I don't want to get into a comp-sci
discussion on the list - rather, let's stick to discussing how to do
matching on actual Lowlands languages!

Sandy
http://scotstext.org/

----------

From: Kenneth Rohde Christiansen <kenneth at gnu.org>
Subject: LL-L "Resources" 2003.10.04 (10) [E]

This seems like a good idea - and can be used for Low Saxon as well.

I thought a bit about this (it is written a bit quickly, so it is probably
full of bad English and typos :)

# First you need to normalize the words to compensate for
# orthography differences and different developments with
# regard to vowels, diphthongs, etc.
#
# Or you could do as Sandy does - but the diphthong rule might not work
# so well for Low Saxon:
#
#    o    spell all high and secondary vowels as <i>;
#    o    spell all low vowels as <a>;
#    o    spell all rounded vowels as <o>;
#    o    spell all diphthongs as <y>;
#    o    reduce all double letters to a single letter;
#    o    alter consonants to bring them closer to a phonemic spelling.
#
# After this you can do fuzzy matching, but if you need
# to look up data from a source this isn't going to be so
# simple - fuzzy lookup could be interesting, and positive
# lookups could be added to the "database".
#
# The big question now is how to store the words you want to
# match against. You probably want them in memory to be able
# to do quick lookups - but the memory used might quickly grow
# too large for this to be possible. What kind of data structure
# do you want to use? Here you need to list pros and cons for the
# different choices, depending on things such as memory use,
# lookup time, and whether you can add fuzzy lookup to them.
#
# When this is done you need to be able to feed your program
# pages in Low Saxon so it can normalize all the words and
# add them to the "database". Of course you are only interested
# in what is in <body> and you need to remove everything but
# text. You probably also want to be able to dump the "database"
# and build it again from the dump.
#
# So how will you determine whether a page is Low Saxon or not?
# It is probably okay to only check the first 100 words within <body>
# and then check whether a certain percentage of the words match.
# (A sketch of this step follows after this comment block.)
#
# You probably also want to store all links in order to "crawl"
# further. You probably don't want to store ones that are already
# indexed.
#
# Even though this fuzzy compare cannot be used in this incarnation,
# I thought it was fun to play with. ;-)
#
# Normalizing without knowing whether it is a German- or Dutch-based
# orthography creates some problems. Dutch u is close to ö in the
# Dutch orthographies, but normalizing German u to ö or just to o might
# make it match way too much. There are a lot of these issues in this
# code - an analysis should be done, and then a lot of testing.
# Doing a good test is HARD - if you don't know how, buy a book on it.
#
# It should be possible to determine whether a Dutch- or German-based
# orthography is used with one loop through the text. This can
# be done quickly and efficiently with one regular expression over
# the whole text - don't split the text up first. Match for specific
# words or special letters - I have yet to see a ü in a Dutch-based
# orthography - but! it might occur in a link on the page, or in
# a name - but a threshold should take care of that. (This step is
# also sketched below.)
#
# Since regular expressions are compiled, it is better to do as much
# as possible with as few regexes as possible.
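
Here is a quick sketch in Perl of the two detection steps mentioned in the
comments above (all names and thresholds are illustrative, and
NormalizeFuzzy stands for whatever normalization routine ends up being
used):

# is a page Low Saxon? check the first 100 words inside <body>
# against the "database" of normalized words
sub looks_like_low_saxon
{
    my ($html, $database) = @_;

    my ($body) = $html =~ m{<body[^>]*>(.*)</body>}is;
    return 0 unless defined $body;
    $body =~ s/<[^>]*>//g;                 # remove everything but text

    my @words = split ' ', lc $body;
    @words = @words[0 .. 99] if @words > 100;
    return 0 unless @words;

    my $hits = grep { exists $database->{ NormalizeFuzzy($_) } } @words;
    return $hits / @words > 0.3;           # threshold is a guess
}

# German- or Dutch-based orthography? one regular expression pass over
# the whole text; a threshold absorbs the odd ü in a link or a name
sub orthography_hint
{
    my ($text) = @_;
    my $umlauts = () = $text =~ /[üöä]/g;
    my $words   = () = $text =~ /\S+/g;
    return 'unknown' unless $words;
    return $umlauts / $words > 0.01 ? 'german-based' : 'dutch-based';
}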

I suggest coding this in a good, maintainable programming language, but
if you like Perl, it is okay to do a test implementation with it.

Here's some perl test code I just wrote:

#! /usr/bin/env perl

use strict;
use warnings;

# warning: this is just a test implementation to play around with.
sub CompareFuzzy
{
    my ($str1t, $str2t) = @_;

    # print "Testing if '$str1t' matches '$str2t'\n";

    # maybe we should double tense vowels when written as one char?

    # e -> ae/a: werk -> waerk -> wark (waark)

    # remove double vowels/consonants - maybe this should be done later,
    # but it also shortens the strings for the further checks
    $str1t =~ s/(.)\1/$1/g;
    $str2t =~ s/(.)\1/$1/g;

    # abend / avend, haben / haven
    $str1t =~ s/(b|v)e?n/vn/g;
    $str2t =~ s/(b|v)e?n/vn/g;

    # this also takes care of nk vs ng
    $str1t =~ s/sch|sk|ck|gh|ch|g/k/g;
    $str2t =~ s/sch|sk|ck|gh|ch|g/k/g;

    # German/Dutch differences, u - ü differences
    $str1t =~ s/oe|ü/u/g;
    $str2t =~ s/oe|ü/u/g;

    # we don't like Dehnungs-h, do we?
    $str1t =~ s/h//g;
    $str2t =~ s/h//g;

    # different spellings for the oa sound.
    # Stellingwarvian ae is ao in other dialects
    # what to do about 'o' here?
    # ex. mor/mar, ofnemen
    $str1t =~ s/ae|oa|ao/a/g;
    $str2t =~ s/ae|oa|ao/a/g;

    $str1t =~ s/o$|ö|eu|ou|ui|au/o/g;
    $str2t =~ s/o$|ö|eu|ou|ui|au/o/g;

    $str1t =~ s/ay|ey|ai|ei|y|ie|ij|i'j|i/e/g;
    $str2t =~ s/ay|ey|ai|ei|y|ie|ij|i'j|i/e/g;

    # what about oin for ein in phalian?

    # can probably be done smarter
    my @str1 = split (//, $str1t);
    my @str2 = split (//, $str2t);

    print "modified: '$str1t' vs '$str2t'\n";

    my $i = 0;
    my $j = 0;
    my $is_matching = 1;

    # walk both strings in step, tolerating one transposed or
    # extra character at each position
    while ($i < $#str1 && $j < $#str2 && $is_matching)
    {
        $is_matching = 0;

        if ($str1[$i] eq $str2[$j])
        {
            $is_matching = 1;
        }
        else
        {
            # transposed pair: ab vs ba
            if ($str1[$i] eq $str2[$j+1] &&
                $str1[$i+1] eq $str2[$j])
            {
                $is_matching = 1;
                $i++;
                $j++;
            }

            # extra character in the second string
            if (!$is_matching && $i == $j &&
                $str1[$i] eq $str2[$j+1])
            {
                $is_matching = 1;
                $j++;
            }

            # extra character in the first string
            if (!$is_matching && $i == $j &&
                $str1[$i+1] eq $str2[$j])
            {
                $is_matching = 1;
                $i++;
            }
        }
        $i++;
        $j++;
    }

    my $distance = $#str1 - $#str2;

    # print "DISTANCE: $distance\n";

    # maybe test with a percentage of the length?
    if ($distance > 3 || $distance < -3)
    {
        $is_matching = 0;
    }

    # print "MATCH:    $is_matching\n";

    return $is_matching;
}

my $f = 0;

# so I can test efficiency by making this number high :-)
# not really a good test!
while ($f < 1)
{
    $f++;
    print "Result: " . CompareFuzzy ($ARGV[0], $ARGV[1]) . "\n";
}

# neet nich
# moaken maken
# aenske enchede
# koning konick
# geiht, gaait

# time test: date ; ./fuzzy.pl ziekenhuus zaiknhoes > /dev/null ; date
# 25 sec for 10.000 checks

Comment for Sandy:

Uh, I don't know if you know how regular expressions are implemented, but
each of these regular expressions is compiled. The more you can do in one
regular expression the better: fewer compilations, quicker substitution.

>    s/c$/k/;
>    s/ck/k/g;

could be s/c$|ck/k/g;
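
If the rules are kept in a data structure instead of being written out one
by one, qr// makes sure each pattern is compiled only once (a sketch of
the idea, not a rewrite of your code):

my @rules = (
    [ qr/c$|ck/, 'k' ],
    [ qr/aw|au/, 'a' ],
    # ... one entry per substitution
);

sub normalize
{
    my ($word) = @_;
    foreach my $rule (@rules)
    {
        $word =~ s/$rule->[0]/$rule->[1]/g;
    }
    return $word;
}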

Cheers, Kenneth

--
Kenneth Rohde Christiansen <kenneth at gnu.org>

================================END===================================
* Please submit postings to lowlands-l at listserv.linguistlist.org.
* Postings will be displayed unedited in digest form.
* Please display only the relevant parts of quotes in your replies.
* Commands for automated functions (including "signoff lowlands-l") are
  to be sent to listserv at listserv.linguistlist.org or at
  http://linguistlist.org/subscribing/sub-lowlands-l.html.
=======================================================================


