[Corpora-List] Licensing output of a GPL'd morphological analyser

Andras Kornai andras at kornai.com
Sat Jan 16 17:13:53 UTC 2010


On the production side of this, we (Budapest Institute of Technology
Media Research Lab) always release our tools under the LGPL rather than
the GPL (http://en.wikipedia.org/wiki/GNU_Lesser_General_Public_License)
precisely to avoid putting our users in this quandary. Depending on
your language, we may even have a morphological analyzer lying around,
as we are in the process of creating one for each of the top 31
Wikipedia languages with 100k entries (for a list see
http://meta.wikimedia.org/wiki/List_of_Wikipedias) that need one (e.g.
Chinese needs a different set of tools, not a morphological analyzer as such).

Andras Kornai

On Fri, Jan 15, 2010 at 10:02:50PM -0500, Mike Maxwell wrote:
> Matthew Honnibal wrote:
> > I've always wondered about the limits of this. What if there were an
> > annotated corpus with a restrictive license, but the text were public
> > domain. A tool is trained on the corpus and provided under a
> > restrictive license. I then turn around and run it back over its
> > training data. In the limit case, it's a memory-based learner that
> > will achieve 100% accuracy on its training corpus. Surely this isn't
> > legal, but where might the boundary line be?
> 
> I haven't been following this thread closely, so the following may have 
> already been said.  And IANAL either.  But to give another option: 
> Create a list of the types (not tokens) in the corpus, and run the 
> parser over that.  I can't imagine how such a list of words could be 
> copyrighted, regardless of the status of the original corpus.
> 
> It does not make sense to run a tagger on a list of types, and some of 
> the words will likely parse ambiguously.
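
The type-list idea above could be sketched roughly as follows. This is only an illustration: the function name word_types, the regular-expression tokenizer, and the lowercasing step are assumptions made here, not anything specified in the thread, and a real corpus would need language-appropriate tokenization (lowercasing in particular may be wrong for languages where case is morphologically informative).

```python
import re

def word_types(text):
    """Return the sorted list of distinct word forms (types) in text.

    Assumption: a "word" is a maximal run of Unicode word characters,
    and case is folded. Both choices are illustrative only.
    """
    tokens = re.findall(r"\w+", text.lower())
    # Collapsing tokens into a set discards order and frequency,
    # leaving only the vocabulary of the corpus.
    return sorted(set(tokens))

corpus = "The cat saw the dog. The dog saw the cat."
print(word_types(corpus))  # ['cat', 'dog', 'saw', 'the']
```

The resulting list carries no word order or counts from the corpus, which is the property the suggestion relies on; each entry can then be fed to the analyzer on its own.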
> 
> Alternatively: with some FSTs, it is possible to dump the entire list of 
> words that it recognizes (assuming the FST is not cyclic--so if there is 
> compounding in the language, you need to somehow limit the length of the 
> compounds).  Here you don't even need a corpus.  Of course, the 
> resulting list may be very long...an FST is nothing but a compressed 
> form of such a list.
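
Dumping the word list of an automaton, with a bound to cope with cycles such as compounding, might look like the sketch below. It is a toy under stated assumptions: the automaton is a plain acceptor stored as a Python dict (only the surface side of a transducer), and the names enumerate_words, transitions, finals and max_depth are hypothetical, not the API of any FST toolkit.

```python
def enumerate_words(transitions, start, finals, max_depth):
    """Yield every string accepted using at most max_depth transitions.

    transitions maps a state to a list of (label, next_state) pairs.
    The depth bound keeps the search finite even if the automaton
    has cycles (e.g. a compounding loop).
    """
    stack = [(start, "", 0)]
    while stack:
        state, prefix, depth = stack.pop()
        if state in finals:
            yield prefix
        if depth < max_depth:
            for label, nxt in transitions.get(state, []):
                stack.append((nxt, prefix + label, depth + 1))

# Toy lexicon: two stems, each optionally followed by one suffix.
transitions = {
    0: [("walk", 1), ("talk", 1)],
    1: [("s", 2), ("ed", 2)],
}
finals = {1, 2}
print(sorted(enumerate_words(transitions, 0, finals, max_depth=2)))
# ['talk', 'talked', 'talks', 'walk', 'walked', 'walks']
```

If the automaton has several paths spelling the same string, that string is yielded more than once, so collect the results in a set when duplicates matter. For a cyclic automaton the output grows quickly with max_depth, which matches the remark that the resulting list may be very long.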
> -- 
>     Mike Maxwell
>     What good is a universe without somebody around to look at it?
>     --Robert Dicke, Princeton physicist
> 
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
> 
