[Corpora-List] (no subject)

Thu Jan 17 08:09:56 UTC 2013

Thank you very much for your answer. But suppose I have two scripts for a word. After hand-validating the output lists, the first script generates 358 units, of which 107 are correct, and the second script generates 497 units, of which 471 are correct. I get the lists using "print lower-words" (this command lets me write the output to a .txt file, because the UTF-8 characters aren't displayed in xfst). Does this mean that the first script is not a correct one? Which of these two scripts is better?
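
The two hand-validation results above can be compared as precision, that is, correct forms divided by generated forms. What follows is only a minimal Python sketch using the counts quoted in this message; the helper name `precision` is made up for illustration, and the figure says nothing about valid forms that a script fails to generate.

```python
def precision(correct: int, generated: int) -> float:
    """Share of generated forms that were judged correct by hand."""
    if generated <= 0:
        raise ValueError("generated must be a positive count")
    if not 0 <= correct <= generated:
        raise ValueError("correct must lie between 0 and generated")
    return correct / generated


# Counts taken from the message: (correct, generated) per script.
scripts = {"first": (107, 358), "second": (471, 497)}

for name, (correct, generated) in scripts.items():
    share = precision(correct, generated)
    print(f"{name} script: {correct}/{generated} = {share:.1%}")
```

On these counts the first script comes out at roughly 30% and the second at roughly 95%, so the second over-generates far less. Whether it also covers all the forms it should is a separate question that this number does not answer.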

Thank you in advance,

Irina L

________________________________
From: Mike Maxwell <maxwell at umiacs.umd.edu>
To: Eirini LS <eirini_ls at yahoo.com>
Cc: "corpora at uib.no" <corpora at uib.no>
Sent: Thursday, January 17, 2013 3:58 AM
Subject: Re: [Corpora-List] (no subject)

On 1/15/2013 12:28 PM, Eirini LS wrote:
> I was a bit confused when a person who has created an analyzer
> (using xerox calculus, lexc) argued that the module works only
> for analysis, but doesn't generate anything and nobody can use
> it in the other direction (using lookup, recall). It is not right to
> read a list that it generates using a command "print lower-words".
> Is it right? How can I check the quality of an analyzer?

Since no one has responded to this, I'll try.

The Xerox Finite State Tools (both lexc and xfst) are inherently bidirectional; if you can analyze words, you can also generate from whatever underlying representation the writer of the parser code has chosen.  That is, if 'cats' analyzes as 'cat+PL', then you can input 'cat+PL' in generate mode, and it will give you 'cats'.
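
To make the bidirectionality concrete, here is a toy Python sketch. It is not how xfst or lexc work internally, and none of the names in it come from those tools: a transducer is modelled simply as a list of (upper, lower) pairs, and the same pairs are indexed in both directions. The pairs other than 'cat+PL'/'cats' are invented for illustration. In xfst itself the two directions correspond, as far as I recall, to the "apply up" and "apply down" commands.

```python
from collections import defaultdict

# Toy relation of (upper, lower) pairs. A real transducer encodes such a
# relation compactly as a network instead of listing every pair.
PAIRS = [
    ("cat+PL", "cats"),  # pair taken from the message above
    ("cat+SG", "cat"),   # hypothetical extra pairs for illustration
    ("dog+PL", "dogs"),
]


def build_index(pairs, key_side):
    """Index the pairs by their upper or lower side."""
    index = defaultdict(set)
    for upper, lower in pairs:
        if key_side == "upper":
            index[upper].add(lower)
        else:
            index[lower].add(upper)
    return index


GENERATE = build_index(PAIRS, "upper")  # analysis string -> surface forms
ANALYZE = build_index(PAIRS, "lower")   # surface form -> analysis strings


def analyze(surface):
    """Look a surface form 'up' to its analyses."""
    return sorted(ANALYZE.get(surface, ()))


def generate(analysis):
    """Look an analysis 'down' to its surface forms."""
    return sorted(GENERATE.get(analysis, ()))


print(analyze("cats"))     # ['cat+PL']
print(generate("cat+PL"))  # ['cats']
```

Both lookups read the same set of pairs, which is why a network that can analyze can also generate.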

What the person you talked to may have been referring to is the fact that (if I'm remembering correctly) the standard version of lexc (and xfst) places a limit on how many "words" it will print with "print words" (I wasn't thinking there was a limit on print-lower-words, but I may be wrong). As I understand it, this has to do with the fact that Xerox was trying to protect its investment in the code that produced upper/lower pairs from a lexicon plus rules--otherwise, you could compile a transducer using lexc and/or xfst, dump the upper/lower pairs, and input those pairs into some simple-to-build and unlicensed FST which had no compilation capability.  There was a commercial version of the tools which cost considerably more, and which could be used to build commercial and distributable FSTs.  But I am not a lawyer, and my memory of that is fuzzy.  If you need more information, you should contact Lauri Karttunen and Ken Beesley, who wrote the book on xfst and lexc (literally and figuratively).

Also, there is now an open source tool, foma, which does most of what xfst did, with the exception of compile-replace (used for some kinds of reduplication); but I believe foma has a work-around for this.  The compile-replace algorithm was patented.

Checking the quality of a morph analyzer like xfst/lexc (or any other such tools) is a different question.  There are lots of ways to do it; one we used was to run test cases (words to be parsed) through xfst and hand-validate the output.  The input/output pairs were stored in a version control system, so as to allow regression testing.  There are other ways as well.
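
The regression-testing idea can be sketched roughly as follows. This is an illustrative Python outline, not the setup we actually used: `toy_analyze` is a hypothetical stand-in for whatever calls the real analyzer, and the `GOLD` table stands for the hand-validated input/output pairs that would be kept under version control.

```python
def run_regression(analyze, gold):
    """Compare analyzer output with hand-validated analyses.

    `analyze` maps a word to an iterable of analysis strings and `gold`
    maps each test word to its validated analyses. Returns a list of
    (word, expected, actual) tuples for every word whose output differs.
    """
    failures = []
    for word in sorted(gold):
        expected = set(gold[word])
        actual = set(analyze(word))
        if actual != expected:
            failures.append((word, sorted(expected), sorted(actual)))
    return failures


# Hypothetical hand-validated test cases.
GOLD = {
    "cats": {"cat+PL"},
    "dogs": {"dog+PL"},
}


def toy_analyze(word):
    """Stand-in analyzer with one deliberate error, for demonstration."""
    table = {"cats": ["cat+PL"], "dogs": ["dog+SG"]}
    return table.get(word, [])


for word, expected, actual in run_regression(toy_analyze, GOLD):
    print(f"FAIL {word}: expected {expected}, got {actual}")
```

Rerunning such a check after every change to the lexicon or rules shows at once which previously correct analyses have broken.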

For the record, I would not use "print lower-words" for testing the parser, since that doesn't tell you whether you get the *correct* analysis.
--     Mike Maxwell
    maxwell at umiacs.umd.edu
    "My definition of an interesting universe is
    one that has the capacity to study itself."
        --Stephen Eastmond