The premise is probably wrong. Annotated training data is expensive to produce,<div>especially if the skills needed by the annotators are substantial. The value of</div><div>an annotated data set is the information that it contains about what</div>

<div>the correct answers are. There are probably a large number of learning</div><div>algorithms and approaches that would be roughly as effective as each </div><div>other at drawing this information out of the data set and making it available</div>

<div>for deployment in an application.</div><div><br></div><div>The creators of the data have two advantages: the first is access, and the</div><div>second is the possibility that the data might have been organized into a form</div>

<div>that supports efficient learning with the particular software and algorithms that</div><div>they have in mind to deploy. The first is a clear commercial advantage that </div><div>cannot easily be nullified. But the second (the form of the data) is exactly the </div>

<div>sort of thing that a clever programmer can easily adjust, so that advantage</div><div>is easily nullified.</div><div><br></div><div>There is no obvious reason why an ordinary commercial company that has prepared such a data set</div>

<div><div><div><div>would give away the competitive advantage associated with access to the data. Sometimes</div><div>big companies such as Netflix, Google, Yahoo or Microsoft do choose to do this anyway, because</div><div>

they are interested in directing the attention of the research community to the problems that</div><div>matter to them. This may pay off in recruitment, among other things. I think the</div><div>big companies are also betting that their advantages in infrastructure and sheer</div>

<div>numbers of deployable staff will outweigh any risks of giving away competitive advantage.</div><div><br></div><div><br></div><div><br><div class="gmail_quote">On Wed, Oct 22, 2008 at 9:28 AM, Seth Grimes <span dir="ltr">&lt;<a href="mailto:grimes@altaplana.com">grimes@altaplana.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">Have others considered the competitive value of training data?<br>

<br>

I'm referring to data that would be usable for commercial purposes, unlike<br>

data provided through the Linguistic Data Consortium (LDC) for research<br>

purposes.  The trade-off for a commercial organization is the opportunity<br>

to recapture the expense of annotating a data set against the risk of<br>

accelerating time to market, or promoting a sale at one's own expense, of<br>

a competing product or service.<br>

<br>

My premise is that a software system's greatest value lies in what it can<br>

do with the training data rather than in the training data itself.  But<br>

what considerations do others see?<br>

<br>

Thanks,<br>

<br>

                                        Seth<br>

<br>

<br>

<br>

--<br>

Seth Grimes   Alta Plana Corp, analytical computing &amp; data management<br>

               Intelligent Enterprise magazine (CMP), Contributing Editor<br>

<a href="mailto:grimes@altaplana.com">grimes@altaplana.com</a>       <a href="http://altaplana.com" target="_blank">http://altaplana.com</a>    +1 301-270-0795<br>

<br>

_______________________________________________<br>

Corpora mailing list<br>

<a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>

<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>

</blockquote></div><br></div></div></div></div>