LL-L "Resources" 2003.10.06 (08) [E]

Lowlands-L lowlands-l at lowlands-l.net
Mon Oct 6 23:34:42 UTC 2003


======================================================================
L O W L A N D S - L * 06.OCT.2003 (08) * ISSN 189-5582 * LCSN 96-4226
http://www.lowlands-l.net * lowlands-l at lowlands-l.net
Rules & Guidelines: http://www.lowlands-l.net/index.php?page=rules
Posting Address: lowlands-l at listserv.linguistlist.org
Server Manual: http://www.lsoft.com/manuals/1.8c/userindex.html
Archives: http://listserv.linguistlist.org/archives/lowlands-l.html
Encoding: Unicode (UTF-8) [Please switch your view mode to it.]
=======================================================================
You have received this because you have been subscribed upon request.
To unsubscribe, please send the command "signoff lowlands-l" as message
text from the same account to listserv at listserv.linguistlist.org or
sign off at http://linguistlist.org/subscribing/sub-lowlands-l.html.
=======================================================================
A=Afrikaans Ap=Appalachian B=Brabantish D=Dutch E=English F=Frisian
L=Limburgish LS=Lowlands Saxon (Low German) N=Northumbrian
S=Scots Sh=Shetlandic V=(West)Flemish Z=Zeelandic (Zeêuws)
=======================================================================

From: Kenneth Rohde Christiansen <kenneth at gnu.org>
Subject: LL-L "Resources" 2003.10.06 (04) [E]

Hi Sandy,

> Is it important to get the wrong answers quickly, then?  :)

No, and I wasn't answering very well. Of course you have to do
normalization of some kind. I was just thinking more about comparing
words than about looking words up, so I was imagining a totally
different situation.

> Yes, a function that has to know all this will be expensive, but
> that's not a problem if you don't call it too often! Since this
> function can precompile words for matching, when used with indexing
> as Jan was suggesting it should be faster than an on-the-fly fuzzy
> matching architecture because the words can be stored in their fuzzy
> form in the index and then exact matching can be used. Only a few
> keywords (the ones entered by the user) need to be run through the
> algorithm live.

Yeah, I see what you are getting at, and you will notice that I came to
the same conclusions after having thought about it a bit. It was clearly
late and I didn't really think much about the necessary lookup :)

> So yes, it's scaleable!

Yeah, I see. I should read mails properly before answering them, rather
than just thinking of my past experience with the development of
translation tools.
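
For what it's worth, here is a rough Java sketch of that kind of
index-time architecture. The fuzzyKey() function is just a stand-in for
whatever normalization ends up being used, and a plain in-memory map
stands in for a real indexer:

import java.util.*;

// Sketch: every word is "precompiled" into a fuzzy/normalized key at
// index time; at query time only the user's keywords are run through
// the same function, and lookup is an exact match on the stored keys.
public class FuzzyIndex {

    // Stand-in for the real normalization (see the rule-based ideas
    // discussed elsewhere in this thread); here it only lowercases.
    static String fuzzyKey(String word) {
        return word.toLowerCase();
    }

    // Inverted index: fuzzy key -> ids of documents containing it.
    private final Map<String, Set<Integer>> index = new HashMap<>();

    void addDocument(int docId, String text) {
        for (String word : text.split("\\W+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(fuzzyKey(word), k -> new HashSet<>()).add(docId);
            }
        }
    }

    Set<Integer> search(String keyword) {
        return index.getOrDefault(fuzzyKey(keyword), Collections.emptySet());
    }
}

The expensive normalization then runs only once per word at index time
and once per query keyword, which is what makes it scale.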

--
Kenneth Rohde Christiansen <kenneth at gnu.org>

----------

From: Jan Strunk <strunkjan at hotmail.com>
Subject: LL-L "Resources" 2003.10.04 (08) [E]

Hello folks,

Thanks for the input so far, especially Kenneth and Sandy!

Kenneth Rohde Christiansen wrote:
> To be honest, this is not an easy task. You can easily say, just do it
> like this or like that... but you have to bear in mind how much memory
> it will consume per page indexed (what about huge pages? if the system
> starts swapping while it indexes huge pages, you're in trouble), how
> quickly it can index, and how precise it has to be. There are a lot of
> other things to think about, so please make a good analysis before
> starting.
I know it is not an easy task. I will first and foremost be concerned
with the university project, and I hope that after that is completed I
will be able to produce a Low Saxon search engine, which will possibly
take quite some time. I will do some planning and analysis. Right now,
my project, as I plan to carry it out, is this:
            1. Gather a lot of Low Saxon (and partly Low Saxon) webpages
                into a pool that is cached offline. I have already
                collected 980 documents (by hand), mostly from the
                Lowlands-L list archive.

By the way, Ron, would it be OK if I use these documents along with
"normal webpages" for testing different algorithms?

            2. Index these documents using no normalization of spelling,
                as a baseline approach for comparison.
            3. Develop and implement unsupervised vs. partly supervised
                (with linguistic knowledge) algorithms for aligning words
                and deriving (stochastic) rules for normalization. One of
                these could be a regular-expression approach like/inspired
                by Sandy's (see the sketch after this list).
            4. Test these different approaches. I will probably do the
                larger part myself, namely the more objective test: Is a
                word that is found in a document really the one searched
                for? How many results are given back, how many are
                correct, etc.? However, I would also like to test "user
                satisfaction", so I hope to build a small, ugly web
                interface where interested users (hopefully some of you)
                can test queries on the different versions and rate their
                performance.
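
To make step 3 a little more concrete, here is a minimal Java sketch of
a rule-based normalizer of the kind a regular-expression approach could
produce. The rules shown are invented placeholders for illustration
only, not actual Low Saxon spelling correspondences; learned or
hand-written rules would go in their place:

// Minimal rule-based normalizer: an ordered list of regex rewrite
// rules maps spelling variants onto one "common" form.  The rules
// below are placeholders for illustration only, not real Low Saxon
// correspondences.
public class RuleNormalizer {

    private static final String[][] RULES = {
        { "ck",            "k"  },  // placeholder: collapse <ck> to <k>
        { "([aeiouy])\\1", "$1" },  // placeholder: shorten doubled vowels
        { "v",             "f"  },  // placeholder: merge <v> into <f>
    };

    public static String normalize(String word) {
        String w = word.toLowerCase();
        for (String[] rule : RULES) {
            w = w.replaceAll(rule[0], rule[1]);
        }
        return w;
    }

    public static void main(String[] args) {
        for (String w : args) {
            System.out.println(w + " -> " + normalize(w));
        }
    }
}

The learning step would then try to induce such rules (or weights for
them) from aligned word pairs instead of writing them by hand.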

This, so far, would be my university project. It is all offline and does
not yet constitute a web search engine. After that, I would like to get
this thing into use, that is, really index pages from the web. This,
however, requires a few more things, as Kenneth already pointed out.
First, I need a stable crawler, though people could also send me URLs to
index directly. Moreover, I need a text classifier that tells me whether
a document (or some part of it) is indeed some kind of Low Saxon. This is
a really difficult problem. I have implemented some standard language
recognition algorithms (n-gram statistics, keyword spotting, word
frequency-rank comparison) and still have to try them out. They can
distinguish German and Dutch quite well, but it is certainly a more
difficult task to distinguish between Low Saxon, Dutch, German,
Limburgish, Luxembourgish, Frisian, etc. Another problem is that many Low
Saxon texts on the web are actually embedded in German, Dutch or English
texts, so checking the first hundred words of a document, as Kenneth
proposed, would probably miss many Low Saxon texts. But this is all no
longer part of my university project, so I can take more of an
engineering perspective and try to make it work with heuristics, etc. We
could even try to build it as a cooperative activity.
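
For the n-gram statistics part, here is a rough Java sketch of
rank-order profile comparison. It assumes that training text for each
language is supplied by the caller, and the profile size and penalty are
arbitrary choices:

import java.util.*;

// Character-n-gram language identification: each language gets a rank
// profile of its most frequent trigrams; a document is assigned to the
// language whose profile is closest by the "out-of-place" distance.
public class NgramLanguageId {

    private static final int N = 3;         // trigrams
    private static final int PROFILE = 300; // keep the top-ranked n-grams

    // Build a rank profile: n-gram -> rank (0 = most frequent).
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = " " + text.toLowerCase().replaceAll("\\s+", " ") + " ";
        for (int i = 0; i + N <= s.length(); i++) {
            counts.merge(s.substring(i, i + N), 1, Integer::sum);
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((a, b) -> counts.get(b) - counts.get(a));
        Map<String, Integer> ranks = new HashMap<>();
        for (int r = 0; r < ranked.size() && r < PROFILE; r++) {
            ranks.put(ranked.get(r), r);
        }
        return ranks;
    }

    // "Out-of-place" distance between a document and a language profile.
    static int distance(Map<String, Integer> doc, Map<String, Integer> lang) {
        int d = 0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            Integer r = lang.get(e.getKey());
            d += (r == null) ? PROFILE : Math.abs(r - e.getValue());
        }
        return d;
    }

    private final Map<String, Map<String, Integer>> languages = new HashMap<>();

    void train(String language, String sampleText) {
        languages.put(language, profile(sampleText));
    }

    String classify(String text) {
        Map<String, Integer> doc = profile(text);
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> e : languages.entrySet()) {
            int d = distance(doc, e.getValue());
            if (d < bestDist) { bestDist = d; best = e.getKey(); }
        }
        return best;
    }
}

Since Low Saxon passages are often embedded in German, Dutch or English
pages, such a classifier would probably have to be run per paragraph or
per block rather than only over the first hundred words of a document.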

Now some more technical details of the concept I have right now:
The computer science department at Stanford largely uses Java. As it
would be good for me to learn it anyway, I have decided to implement as
much as possible in Java. If it gets too complicated, I might use Perl to
do the statistical normalization rule learning, as I have been doing
statistical natural language processing in Perl for two years now. Java
also has the advantage that I can use the open-source indexer Lucene and
don't have to worry that much about storing the index. Lucene seems to
use quite effective compression techniques...
I plan to use a concept similar to Sandy's (if I understand him
correctly). I will build an index of the normalized wordforms only; this
should make it more compact, because I guess the vocabulary size of all
Low Saxon texts on the web should still be manageable. The main problem
is the variation. I will do the normalization during both indexing and
retrieval. I guess that applying the normalization rules will not
actually consume that much time on average, because webpages (in Low
Saxon) tend to be rather short.
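
Here is a sketch of how the normalized-forms index could look with
Lucene. It assumes the current Lucene 1.x API (IndexWriter,
Field.Keyword/UnStored, Hits), so the exact calls should be checked
against the release actually used, and it reuses the hypothetical
RuleNormalizer sketched above:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

// The page text is normalized *before* it reaches Lucene, and query
// terms are run through the same normalizer, so the index only ever
// contains and matches the normalized word forms.
public class NormalizedLuceneIndex {

    public static void indexPage(String indexDir, String url, String text,
                                 boolean createIndex) throws Exception {
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), createIndex);
        Document doc = new Document();
        doc.add(Field.Keyword("url", url));                                  // stored, not tokenized
        doc.add(Field.UnStored("contents", RuleNormalizer.normalize(text))); // indexed only
        writer.addDocument(doc);
        writer.close();
    }

    public static Hits search(String indexDir, String keyword) throws Exception {
        IndexSearcher searcher = new IndexSearcher(indexDir);
        Term term = new Term("contents", RuleNormalizer.normalize(keyword));
        return searcher.search(new TermQuery(term));
    }
}

Opening and closing a writer per page is of course wasteful; a real
crawler would keep one writer open over a whole batch of pages.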

Alternatively, I might try a quicker solution (and even evaluate it
against the solution above) by building a tool that expands queries with
alternative forms and sends these on to Google as a disjunctive query,
filtering Google's output with a language recognizer. Hopefully, once I
have learned rules to normalize words to a "common" form, I could also
use them to generate the alternatives for expansion.
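
A tiny sketch of what the query-expansion variant could look like. The
expandWord() function is a placeholder for applying the learned rules
"in reverse" to generate spelling variants; actually sending the query
to Google and filtering the hits through the language recognizer is
left out:

import java.util.*;

// Builds one disjunctive (OR) query out of the expanded alternatives
// for each keyword the user typed.
public class QueryExpander {

    // Placeholder: would generate plausible alternative spellings for a
    // word, e.g. by applying the normalization rules in reverse.
    static Set<String> expandWord(String word) {
        return new LinkedHashSet<>(Arrays.asList(word, word.toLowerCase()));
    }

    static String disjunctiveQuery(String userQuery) {
        List<String> parts = new ArrayList<>();
        for (String word : userQuery.trim().split("\\s+")) {
            parts.add("(" + String.join(" OR ", expandWord(word)) + ")");
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        // Prints the expanded disjunctive query for the words given on
        // the command line.
        System.out.println(disjunctiveQuery(String.join(" ", args)));
    }
}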

So, again: thank you, Sandy and Kenneth. I will look through your
proposals in detail later. You have already helped me a lot by pointing
out important issues and by giving me examples that I could try to
follow.

Thank you very much!

Jan Strunk
strunk at linguistics.ruhr-uni-bochum.de
jstrunk at stanford.edu

----------

From: R. F. Hahn <lowlands-l at lowlands-l.net>
Subject: Resources

Jan,

> By the way, Ron would it be ok if I use these documents along with "normal
> webpages" for testing different algorithms?

That would be fine, Jan.  Those pages are public domain information,
something that comes with being hosted by LINGUIST.  (The basic idea is to
add them to a free and universally accessible linguistic database.)
However, if you publish them or anything about them, I ask that you
acknowledge their origin.

Success with your project!
Reinhard/Ron

================================END===================================
* Please submit postings to lowlands-l at listserv.linguistlist.org.
* Postings will be displayed unedited in digest form.
* Please display only the relevant parts of quotes in your replies.
* Commands for automated functions (including "signoff lowlands-l") are
  to be sent to listserv at listserv.linguistlist.org or at
  http://linguistlist.org/subscribing/sub-lowlands-l.html.
=======================================================================


