[Corpora-List] Licensing output of a GPL'd morphological analyser

Sat Jan 16 03:02:50 UTC 2010

Matthew Honnibal wrote:
> I've always wondered about the limits of this. What if there were an
> annotated corpus with a restrictive license, but the text were public
> domain. A tool is trained on the corpus and provided under a
> restrictive license. I then turn around and run it back over its
> training data. In the limit case, it's a memory-based learner that
> will achieve 100% accuracy on its training corpus. Surely this isn't
> legal, but where might the boundary line be?

I haven't been following this thread closely, so the following may have 
already been said.  And IANAL either.  But to give another option: 
Create a list of the types (not tokens) in the corpus, and run the 
parser over that.  I can't imagine how such a list of words could be 
copyrighted, regardless of the status of the original corpus.

Admittedly, it does not make much sense to run a tagger on a list of 
types, since a tagger relies on sentence context, and some of the words 
will likely parse ambiguously in isolation.
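
To make the types-versus-tokens distinction concrete, here is a minimal 
sketch in Python.  The function name extract_types, the regular 
expression used as a tokenizer, and the lowercasing default are all 
assumptions made for illustration; a real corpus would need its own 
tokenization rules.

```python
import re


def extract_types(text, lowercase=True):
    """Return the sorted list of distinct word forms (types) in text.

    Hypothetical helper: the tokenizer is a crude regex that keeps runs
    of letters, allowing an internal apostrophe or hyphen.
    """
    tokens = re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    # A set discards repeats, so token order and frequency are lost.
    return sorted(set(tokens))


if __name__ == "__main__":
    print(extract_types("The cat saw the other cat."))
    # ['cat', 'other', 'saw', 'the']
```

The resulting list carries no word order or counts from the source 
text, and it is this list that would be fed to the analyser one word 
at a time.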

Alternatively: with some FSTs, it is possible to dump the entire list of 
words that the FST recognizes (assuming the FST is not cyclic; if there 
is compounding in the language, you need to somehow limit the length of 
the compounds).  Here you don't even need a corpus.  Of course, the 
resulting list may be very long; an acyclic FST is essentially a 
compressed form of such a list.
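
As a rough illustration of dumping the word list, the sketch below 
walks a toy automaton stored as a plain Python dict.  Everything in it 
is hypothetical: the name enumerate_words, the dict layout, the toy 
lexicon, and the max_arcs cutoff (which stands in for "somehow limit 
the length of the compounds").  It assumes there are no epsilon arcs 
and looks only at one side of the transducer; actual FST toolkits have 
their own formats and commands for this.

```python
def enumerate_words(transitions, start, finals, max_arcs):
    """Yield every string accepted using at most max_arcs transitions.

    transitions maps a state to a list of (symbol, next_state) pairs.
    The max_arcs bound keeps the walk finite even if the machine has
    cycles (for example, a compounding loop).
    """
    stack = [(start, "", 0)]
    while stack:
        state, prefix, depth = stack.pop()
        if state in finals:
            yield prefix
        if depth >= max_arcs:
            continue  # do not extend paths beyond the bound
        for symbol, nxt in transitions.get(state, ()):
            stack.append((nxt, prefix + symbol, depth + 1))


# Toy lexicon: two stems, an optional plural "s", and a "+" arc that
# loops back to the start state to form compounds.
TOY_TRANSITIONS = {
    0: [("cat", 1), ("dog", 1)],
    1: [("s", 2), ("+", 0)],
}
TOY_FINALS = {1, 2}

if __name__ == "__main__":
    for word in sorted(enumerate_words(TOY_TRANSITIONS, 0, TOY_FINALS, 3)):
        print(word)
```

With max_arcs set to 2 the toy machine yields only the four simple 
forms; raising it to 3 adds the two-stem compounds, which shows how 
quickly the list grows once the compounding loop is allowed.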
-- 
    Mike Maxwell
    What good is a universe without somebody around to look at it?
    --Robert Dicke, Princeton physicist
