[Corpora-List] Licensing output of a GPL'd morphological analyser
Andras Kornai
andras at kornai.com
Sat Jan 16 17:13:53 UTC 2010
On the production side of this, we (Budapest Institute of Technology
Media Research Lab) always release our tools under an LGPL rather than
GPL (http://en.wikipedia.org/wiki/GNU_Lesser_General_Public_License)
precisely to avoid putting our users in this quandary. Depending on
your language, we may even have a morphological analyzer lying around,
as we are in the process of creating one for each of the top 31
Wikipedia languages with 100k entries (for a list see
http://meta.wikimedia.org/wiki/List_of_Wikipedias) that need one (e.g.
Chinese needs a different set of tools, not a morphological analyzer as such).
Andras Kornai
On Fri, Jan 15, 2010 at 10:02:50PM -0500, Mike Maxwell wrote:
> Matthew Honnibal wrote:
> > I've always wondered about the limits of this. What if there were an
> > annotated corpus with a restrictive license, but the text were public
> > domain. A tool is trained on the corpus and provided under a
> > restrictive license. I then turn around and run it back over its
> > training data. In the limit case, it's a memory-based learner that
> > will achieve 100% accuracy on its training corpus. Surely this isn't
> > legal, but where might the boundary line be?
>
> I haven't been following this thread closely, so the following may have
> already been said. And IANAL either. But to give another option:
> Create a list of the types (not tokens) in the corpus, and run the
> parser over that. I can't imagine how such a list of words could be
> copyrighted, regardless of the status of the original corpus.
>
> It does not make sense to run a tagger on a list of types, and some of
> the words will likely parse ambiguously.
>
> Alternatively: with some FSTs, it is possible to dump the entire list of
> words that it recognizes (assuming the FST is not cyclic--so if there is
> compounding in the language, you need to somehow limit the length of the
> compounds). Here you don't even need a corpus. Of course, the
> resulting list may be very long...an FST is nothing but a compressed
> form of such a list.
> --
> Mike Maxwell
> What good is a universe without somebody around to look at it?
> --Robert Dicke, Princeton physicist
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
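To make the two suggestions quoted above concrete, here is a minimal
Python sketch. It is an illustration only, not the code of any actual
analyzer: the dictionary-based automaton, the names corpus_types and
enumerate_words, and the toy lexicon are all hypothetical, and a plain
acceptor stands in for a real FST. The max_len bound plays the role of
the limit on compound length, so the enumeration terminates even when
the automaton is cyclic.

```python
from typing import Dict, Iterator, List, Set, Tuple

# Hypothetical toy representation: state -> list of (symbol, next_state).
Transitions = Dict[int, List[Tuple[str, int]]]


def corpus_types(tokens: List[str]) -> List[str]:
    """Return the sorted distinct word forms (types) of a token list."""
    return sorted(set(tokens))


def enumerate_words(
    transitions: Transitions,
    start: int,
    finals: Set[int],
    max_len: int,
) -> Iterator[str]:
    """Yield every accepted string of at most max_len symbols.

    Depth-first walk over the automaton. The depth bound guarantees
    termination even if the automaton contains cycles.
    """
    stack = [(start, "", 0)]  # (state, string so far, symbols consumed)
    while stack:
        state, prefix, depth = stack.pop()
        if state in finals:
            yield prefix
        if depth < max_len:
            for symbol, nxt in transitions.get(state, []):
                stack.append((nxt, prefix + symbol, depth + 1))


# Toy lexicon (assumed for illustration): cat, car, cats, cars.
TOY: Transitions = {
    0: [("c", 1)],
    1: [("a", 2)],
    2: [("t", 3), ("r", 3)],
    3: [("s", 4)],
}
TOY_FINALS = {3, 4}

if __name__ == "__main__":
    print(corpus_types(["the", "cat", "the", "mat"]))
    print(sorted(enumerate_words(TOY, 0, TOY_FINALS, max_len=4)))
```

The list from corpus_types would be what gets fed to the analyzer in the
first suggestion; enumerate_words corresponds to the second, where no
corpus is needed at all. For a real lexicon the enumerated list can of
course be very long.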