[Corpora-List] POS-tagger maintenance and improvement

Andras Kornai andras at kornai.com
Fri Feb 27 02:39:23 UTC 2009


On Thu, Feb 26, 2009 at 10:36:51PM +0000, Jimmy O'Regan wrote:
> I think you'll find that their statement was carefully worded to
> merely portray the issues in the area without giving any direct,
> specific advice: these kinds of legal analyses are quite often given
> to law students to perform: they are not lawyers, and cannot give
> legal advice.

Not sure what you mean, they _are_ lawyers, pretty high-powered ones, 
paid by U Penn. 
 
> You only focus on the 'economic harm' aspect; you should also consider
> that, if any of the publishers also produce corpora, 

Who does? In general publishers are lousy about publishing their
material in any but the most traditional format. Some content
providers (WSJ, Reuters, etc) have a more enlightened attitude, but
this is rare, and it is trivial to avoid stepping on their toes. 

> or if any of the translators sell their translation memories, 

After some looking I located exactly one vendor selling TM content,
http://www.tmmarketplace.com, and their white paper seems to make the
exact same legal argument that the UPenn lawyers made, check it out.
I may even check out their English-Hungarian material, which made me curious,
but of course I wouldn't dream of including it in our corpus. (On the
other hand I wouldn't be shocked to find they are repackaging and
selling our material, well, they are welcome.)

> then they have a very real case where you are causing them economic harm.
> 
> Economic harm is far from the only factor in copyright; at best, you
> simply won't be held liable to pay a large amount in damages. Who
> wants to put the work into compiling a corpus, only to be hit with a
> cease and desist notice?

You sound as if you speak from vast experience about corpus linguists
getting hit with all kinds of notices, being held liable for vast
amounts of damages, and in the end getting tarred, feathered, and run
out of town. I would be interested in hearing about any such cases. 

As for the "who wants to" question, there are always reasons not to do
something. What if one of the sentences is offensive to some group of
people, incites violence, or advocates breaking the law? This is
quite possible; we certainly didn't check over 2m sentences by hand. 
What if the corpus contains defamatory statements or somebody's trade 
secrets? Oh, the possibilities! (TM:-)

We did it, very real lawyers said it was OK (a similar opinion, also
coming from real lawyers, was discussed in Corpora #4162) and I
recommend this course to anyone who prefers to get work done to
getting bogged down in phantom speculation based on armchair
lawyering.  One rarely, if ever, sees scholars sued for publishing
their corpus; the risk seems to be bearable. 

A more real issue, familiar from all branches of science, is that
people are often reluctant to part with their data (which took a lot
of effort to gather) before they have fully exploited it themselves, and by
the time they are done the material is often stale. I see a lot of
debate in biology about the sharing of sequence data (which has, let
us not delude ourselves, orders of magnitude more commercial value
than the texts we tend to work with). I'm sure many of us have asked
other workers in the field for some of the data they created and got
nothing in response.  

Andras Kornai

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


