[Corpora-List] POS-tagger maintenance and improvement

Andras Kornai andras at kornai.com
Thu Feb 26 00:17:51 UTC 2009


On Thu, Feb 26, 2009 at 12:12:52AM +0100, Francis Tyers wrote:
> > is that the GPL basically stands in the way of industry-academia
> > partnerships, FSF claims to the contrary notwithstanding. 
> 
> (Insert BSD vs. GPL flame war here)

Right.
 
> There are many counter examples to this, e.g. the previously mentioned
> GrammarSoft, whose VISLCG is GPL and which has disambiguation grammars
> available under a range of licences. There are also plenty of companies
> which make a living using and providing services for GPL software.

This last sentence is what I call "FSF claims to the contrary". Yes, you can
make a small consulting business based on GPL software, or if you are IBM
you may even be able to build a large consulting business that way. (Note
that Red Hat and other GPL champions are tiny dots on the map of software --
the entire market capitalization of Red Hat, at about $2.7B, is comparable
to the annual revenue of SAS, at about $2.26B.)

Let us grant the point that one can build a business on GPL software.
However, our typical users are telcos, ISPs, and other companies whose
primary business is not software, let alone software consulting, and they
are totally opposed to the idea of opening up their codebase (in part for
security-by-obscurity reasons, the subject of another worthy flamewar).
Perhaps they are wrong-headed and should open up. But we feel absolutely no
reason to fight this war: our business is NLP, not free software
evangelism.
 
> The problem at any rate is not with code, there are probably hundreds of
> POS taggers out there under a wide variety of licences. The problem is
> with data. 
> 
> You can train a free part-of-speech tagger on a proprietary corpus, or
> you can train a proprietary part-of-speech tagger on a free corpus... or
> you could if they existed -- creating POS tagged corpora for a range of
> languages using either Wikipedia (for you GFDL / CC-BY-SA fans) or
> Gutenberg (for the public domain / BSD minded) would be a great place to
> start.

The data problem, even for copyrighted data, is far smaller for NLP than it
is usually made out to be. We at the Budapest Institute of Technology
assembled a large aligned Hungarian-English corpus (Hunglish, LDC2008T01)
which is chock-full of copyrighted text, by the simple expedient of merging
all the text files into one, leaving a portion of the sentences out, and
sorting the rest alphabetically, which makes it more labor-intensive to
restore the original documents than to key them in from scratch. This
method (blessed by the UPenn lawyers) destroys the value of the corpus for
discourse analysis or for convergence studies like Curran and Osborne 2002,
but 95% of what we as computational linguists do is at or below the
sentence level.
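
For the curious, here is a minimal sketch of the scrambling step in Python.
This is not our actual pipeline: the directory layout, the
one-sentence-per-line assumption, and the 5% drop fraction are all made up
for illustration.

    # Toy illustration of the scrambling step (not the actual Hunglish
    # pipeline): merge per-document files, leave out a random portion of
    # the sentences, and sort the rest alphabetically, so that document
    # order and discourse structure are destroyed.
    import glob
    import random

    DROP_FRACTION = 0.05  # hypothetical: the "portion left out"

    sentences = []
    for path in glob.glob("corpus/*.txt"):  # assumed: one sentence per line
        with open(path, encoding="utf-8") as f:
            sentences.extend(line.strip() for line in f if line.strip())

    random.shuffle(sentences)               # so the omitted portion is random
    kept = sentences[int(len(sentences) * DROP_FRACTION):]

    with open("scrambled.txt", "w", encoding="utf-8") as out:
        for s in sorted(kept):              # alphabetical order; originals unrecoverable
            out.write(s + "\n")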

I would not like the main thrust of what I said to be lost in the noise of
the flamewar: some kind of clearinghouse for corrected data would be
useful. I didn't offer to set one up because I'm not sure the Budapest
Institute of Technology has the resources (the bottleneck is not server
space but the effort it takes to curate the data, and wikis are not great
for this), but we'd be happy to contribute.

Andras Kornai
 
> Fran
> 
> PS. One of the things that we've done is decide to use _free_ text for
> performing evaluations. So if you want to, e.g., evaluate your MT system
> using post-editing, then instead of taking news text from whichever
> newspaper, take the text from Wikipedia; you can then translate,
> post-edit, and distribute the resulting aligned parallel corpus freely
> for others to use.
