[Corpora-List] (no subject)

Mike Maxwell maxwell at umiacs.umd.edu
Thu Jan 17 00:58:06 UTC 2013


On 1/15/2013 12:28 PM, Eirini LS wrote:
> I was a bit confused when a person who had created an analyzer
> (using the Xerox calculus, lexc) argued that the module works only
> for analysis, doesn't generate anything, and nobody can use
> it in the other direction (using lookup, recall). According to this
> person, it is not right to read off a list of what it generates with
> the command "print lower-words". Is that right? How can I check the
> quality of an analyzer?

Since no one has responded to this, I'll try.

The Xerox Finite State Tools (both lexc and xfst) are inherently bidirectional; if you can analyze 
words, you can also generate from whatever underlying representation the writer of the parser code 
has chosen.  That is, if 'cats' analyzes as 'cat+PL', then you can input 'cat+PL' in generate mode, 
and it will give you 'cats'.
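
To make that concrete, here is a toy example of my own (the file name, the tags and the lexicon 
layout are invented for illustration, and the foma session is abridged; xfst behaves the same way, 
apart from minor differences in how it reads lexc files):

   ! cats.lexc -- toy lexicon: 'cat' with an optional plural -s
   Multichar_Symbols +SG +PL

   LEXICON Root
   cat   Number ;

   LEXICON Number
   +SG:0   # ;
   +PL:s   # ;

   foma[0]: read lexc cats.lexc
   foma[1]: apply up cats
   cat+PL
   foma[1]: apply down cat+PL
   cats
   foma[1]: print lower-words
   cat
   cats
   foma[1]: save stack cats.fst

The same network on the stack answers both "apply up" (analysis) and "apply down" (generation); the 
grammar writer does not have to do anything extra to make generation work.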

What the person you talked to may have been referring to is the fact that (if I'm remembering 
correctly) the standard version of lexc (and xfst) places a limit on how many "words" it will print 
with "print words" (I don't think there was a limit on "print lower-words", but I may be wrong).
As I understand it, this has to do with the fact that Xerox was trying to protect its investment in 
the code that produced upper/lower pairs from a lexicon plus rules--otherwise, you could compile a 
transducer using lexc and/or xfst, dump the upper/lower pairs, and input those pairs into some 
simple-to-build and unlicensed FST which had no compilation capability.  There was a commercial 
version of the tools which cost considerably more, and which could be used to build commercial and 
distributable FSTs.  But I am not a lawyer, and my memory of that is fuzzy.  If you need more 
information, you should contact Lauri Karttunen and Ken Beesley, who wrote the book on xfst and lexc 
(literally and figuratively).

Also, there is now an open source tool, foma, which does most of what xfst did, with the exception 
of compile-replace (used for some kinds of reduplication); but I believe foma has a work-around for 
this.  The compile-replace algorithm was patented.

Checking the quality of a morphological analyzer built with xfst/lexc (or any other such tool) is a 
different question.  There are lots of ways to do it; one we used was to run test cases (words to be 
parsed) through xfst and hand-validate the output.  The validated input/output pairs were stored in 
a version control system, so as to allow regression testing.  There are other ways as well.
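
For what it's worth, here is a minimal sketch of that kind of regression test, using foma's 
command-line lookup tool flookup; the file names and the gold-file format below are my own 
assumptions, not a description of the setup we actually used:

   #!/bin/sh
   # gold.tsv: hand-validated lines of the form "surface_form<TAB>analysis",
   # kept under version control (one line per analysis if a word is ambiguous).
   # cats.fst: the compiled transducer, e.g. from "save stack cats.fst" above.
   # flookup reads one word per line from stdin and prints "input<TAB>analysis"
   # for each analysis it finds (and "input<TAB>+?" if it finds none).
   cut -f1 gold.tsv | flookup cats.fst | grep -v '^$' | sort -u > current.tsv
   sort -u gold.tsv > expected.tsv
   diff expected.tsv current.tsv && echo "no regressions"

Any analysis that appears, disappears, or changes after an edit to the grammar shows up in the diff, 
which is what regression testing is about.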

For the record, I would not use "print lower-words" for testing the parser, since that doesn't tell 
you whether you get the *correct* analysis.
-- 
	Mike Maxwell
	maxwell at umiacs.umd.edu
	"My definition of an interesting universe is
	one that has the capacity to study itself."
         --Stephen Eastmond



