[Corpora-List] Licensing output of a GPL'd morphological analyser
Mike Maxwell
maxwell at umiacs.umd.edu
Sat Jan 16 03:02:50 UTC 2010
Matthew Honnibal wrote:
> I've always wondered about the limits of this. What if there were an
> annotated corpus with a restrictive license, but the text were public
> domain. A tool is trained on the corpus and provided under a
> restrictive license. I then turn around and run it back over its
> training data. In the limit case, it's a memory-based learner that
> will achieve 100% accuracy on its training corpus. Surely this isn't
> legal, but where might the boundary line be?
I haven't been following this thread closely, so the following may have
already been said, and IANAL. But to give another option:
Create a list of the types (not tokens) in the corpus, and run the
parser over that. I can't imagine how such a list of words could be
copyrighted, regardless of the status of the original corpus.
Admittedly, it does not make sense to run a tagger on a list of types,
since a tagger relies on context, and some of the words will likely
parse ambiguously.
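To make the types-not-tokens idea concrete, here is a minimal sketch in Python. The names extract_types, analyze_types, and toy_analyze are made up for illustration; toy_analyze is only a stand-in for whatever analyser you actually have, and its tag strings are invented. Note that an ambiguous word simply comes back with more than one analysis.

```python
def extract_types(tokens):
    """Return the sorted list of distinct word forms (types) in a token stream."""
    return sorted(set(tokens))


def analyze_types(types, analyze):
    """Map each type to the list of analyses returned by `analyze`."""
    return {t: analyze(t) for t in types}


def toy_analyze(word):
    """Hypothetical stand-in for a real morphological analyser."""
    lexicon = {
        "saw": ["saw+N+Sg", "see+V+Past"],  # ambiguous out of context
        "the": ["the+Det"],
        "dogs": ["dog+N+Pl"],
    }
    return lexicon.get(word, [])


tokens = "the dogs saw the saw".split()
types = extract_types(tokens)            # word order and frequency are discarded
analyses = analyze_types(types, toy_analyze)
```

The analyser only ever sees the word list, never the running text, which is the point of the suggestion.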
Alternatively: with some FSTs, it is possible to dump the entire list of
words that they recognize (assuming the FST is not cyclic; if there is
compounding in the language, you need to somehow limit the length of the
compounds). Here you don't even need a corpus. Of course, the
resulting list may be very long; an acyclic FST is essentially a
compressed form of such a list.
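As a rough sketch of what dumping the word list amounts to, the Python below enumerates the strings accepted by a small automaton. Everything here is an assumption for illustration: the automaton is a plain dict from state to (symbol, next state) pairs with single-character symbols, only the recognized (surface) side is modelled, and enumerate_words is a made-up name rather than any toolkit's API. The max_len bound is one crude way to limit compounds when the network is cyclic.

```python
def enumerate_words(transitions, start, finals, max_len):
    """Yield every string of at most max_len symbols accepted by the automaton.

    transitions: dict mapping state -> list of (symbol, next_state) pairs
    finals:      set of accepting states
    max_len:     length cap, so enumeration terminates even with cycles
    """
    stack = [(start, "")]
    while stack:
        state, prefix = stack.pop()
        if state in finals:
            yield prefix
        if len(prefix) < max_len:
            for symbol, nxt in transitions.get(state, []):
                stack.append((nxt, prefix + symbol))


# Toy acyclic network recognizing: cat, cats, dog, dogs
toy = {
    0: [("d", 1), ("c", 2)],
    1: [("o", 3)],
    2: [("a", 4)],
    3: [("g", 5)],
    4: [("t", 5)],
    5: [("s", 6)],
}
words = sorted(enumerate_words(toy, 0, {5, 6}, max_len=10))

# Toy cyclic network (a self-loop); the cap keeps the list finite
loop = {0: [("a", 0)]}
bounded = sorted(enumerate_words(loop, 0, {0}, max_len=3))
```

If the network is nondeterministic, the same word may be yielded more than once, so collect the results in a set if that matters.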
--
Mike Maxwell
What good is a universe without somebody around to look at it?
--Robert Dicke, Princeton physicist