[Corpora-List] Commercial/competitive value of training data

chris brew cbrew at acm.org
Wed Oct 22 15:28:46 UTC 2008


The premise is probably wrong. Annotated training data is expensive to
produce, especially if the skills needed by the annotators are
substantial. The value of an annotated data set is the information that
it contains about what the correct answers are. There are probably a
large number of learning algorithms and approaches that would be roughly
as effective as each other at drawing this information out of the data
set and making it available for deployment in an application.

The creators of the data have two advantages: the first is access, and
the second is the possibility that the data might have been organized
into a form that supports efficient learning with the particular
software and algorithms that they have in mind to deploy. The first is a
clear commercial advantage that cannot easily be nullified. But the
second (the form of the data) is exactly the sort of thing that a clever
programmer can easily adjust, so that advantage is easily nullified.

There is no obvious reason why an ordinary commercial company that has
prepared such a data set would give away the competitive advantage
associated with access to the data. Sometimes big companies such as
Netflix, Google, Yahoo or Microsoft do choose to do this anyway, because
they are interested in directing the attention of the research community
to the problems that matter to them. This has a payoff in terms of
recruitment and so on (perhaps). I think the big companies are also
betting that their advantages in terms of infrastructure and sheer
numbers of deployable staff will outweigh any risks of giving away
competitive advantage.



On Wed, Oct 22, 2008 at 9:28 AM, Seth Grimes <grimes at altaplana.com> wrote:

> Have others considered the competitive value of training data?
>
> I'm referring to data that would be usable for commercial purposes, unlike
> data provided through the Linguistic Data Consortium (LDC) for research
> purposes.  The trade-off for a commercial organization is the opportunity
> to recapture the expense of annotating a data set against the risk of
> accelerating time to market, or promoting a sale at one's own expense, of
> a competing product or service.
>
> My premise is that a software system's greatest value lies in what it can
> do with the training data rather than in the training data itself.  But
> what considerations do others see?
>
> Thanks,
>
>                                        Seth
>
>
>
> --
> Seth Grimes   Alta Plana Corp, analytical computing & data management
>               Intelligent Enterprise magazine (CMP), Contributing Editor
> grimes at altaplana.com       http://altaplana.com    +1 301-270-0795
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>