[Corpora-List] Licensing output of a GPL'd morphological analyser

Francis Tyers ftyers at prompsit.com
Sat Jan 16 20:14:16 UTC 2010


On Sat, 16 Jan 2010 at 17:13 +0000, Andras Kornai wrote:
> On the production side of this, we (Budapest Institute of Technology
> Media Research Lab) always release our tools under an LGPL rather than
> GPL (http://en.wikipedia.org/wiki/GNU_Lesser_General_Public_License)
> precisely to avoid putting our users in this quandary. Depending on
> your language, we may even have a morphological analyzer lying around,
> as we are in the process of creating one for each of the top 31
> Wikipedia languages with 100k entries (for a list see 
> http://meta.wikimedia.org/wiki/List_of_Wikipedias) that needs one (e.g.
> Chinese needs a different set of tools, not a morphological analyzer as such). 

WOW!

This is really exciting! I had the same idea, and we (in the Apertium
project) have been working on many morphological analysers for some
time. We would love to share data with you. They are encoded as
finite-state transducers in XML, but you can easily export a full-form
list.
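
For anyone who wants to try that, here is a minimal Python sketch,
assuming lttoolbox's lt-expand is installed and on the PATH; the
dictionary filename below is just a placeholder:

    import subprocess

    # Placeholder: any Apertium monolingual dictionary (.dix) file
    DIX = "apertium-xx.xx.dix"

    # lt-expand prints one "surface:lexical-form" pair per line
    out = subprocess.run(["lt-expand", DIX],
                         capture_output=True, text=True, check=True)

    # Keep just the surface side (simplified: ignores the :<: / :>:
    # separators that direction-restricted entries use)
    forms = sorted({line.split(":", 1)[0]
                    for line in out.stdout.splitlines() if ":" in line})
    print(len(forms), "surface forms")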

Current analysers that have > 80% coverage:

http://paste2.org/p/615078

Other stuff:

http://apertium.svn.sourceforge.net/viewvc/apertium/incubator/

It's also worth noting that for Finnish there is omorfi:

http://wiki.apertium.org/wiki/Omorfi

And for the Sámi languages, Giellatekno has many high-coverage
HFST-based analysers:

https://victorio.uit.no/langtech/trunk/gt

Best regards,

Fran

> Andras Kornai
> 
> On Fri, Jan 15, 2010 at 10:02:50PM -0500, Mike Maxwell wrote:
> > Matthew Honnibal wrote:
> > > I've always wondered about the limits of this. What if there were an
> > > annotated corpus with a restrictive license, but the text were public
> > > domain. A tool is trained on the corpus and provided under a
> > > restrictive license. I then turn around and run it back over its
> > > training data. In the limit case, it's a memory-based learner that
> > > will achieve 100% accuracy on its training corpus. Surely this isn't
> > > legal, but where might the boundary line be?
> > 
> > I haven't been following this thread closely, so the following may have 
> > already been said.  And IANAL either.  But to give another option: 
> > Create a list of the types (not tokens) in the corpus, and run the 
> > parser over that.  I can't imagine how such a list of words could be 
> > copyrighted, regardless of the status of the original corpus.
> > 
> > It does not make sense to run a tagger on a list of types, and some of 
> > the words will likely parse ambiguously.
> > 
> > Alternatively: with some FSTs, it is possible to dump the entire list of 
> > words they recognize (assuming the FST is not cyclic--so if there is 
> > compounding in the language, you need to somehow limit the length of the 
> > compounds).  Here you don't even need a corpus.  Of course, the 
> > resulting list may be very long...an FST is nothing but a compressed 
> > form of such a list.
> > -- 
> >     Mike Maxwell
> >     What good is a universe without somebody around to look at it?
> >     --Robert Dicke, Princeton physicist
> > 
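
(A quick sketch of Mike's type-list idea, for anyone who wants to try
it with an Apertium analyser: it pulls the unique word types out of a
plain-text corpus and runs them through lt-proc. Both filenames are
placeholders.)

    import re
    import subprocess

    CORPUS = "corpus.txt"         # placeholder: any plain-text corpus
    ANALYSER = "xx.automorf.bin"  # placeholder: a compiled lttoolbox analyser

    with open(CORPUS, encoding="utf-8") as f:
        types = sorted({w.lower() for w in re.findall(r"\w+", f.read())})

    # lt-proc analyses running text from stdin; one type per line keeps
    # the ^word/analysis$ output easy to line up with the input
    result = subprocess.run(["lt-proc", ANALYSER],
                            input="\n".join(types),
                            capture_output=True, text=True, check=True)
    print("\n".join(result.stdout.splitlines()[:20]))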



_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


