[Corpora-List] survey/feedback regarding "Million American Annotation Effort" proposal

Tom O'Hara thomas.paul.ohara at gmail.com
Thu Oct 4 18:53:06 UTC 2012


I am working on a proposal to the Obama administration to sponsor a
large-scale annotation effort, as one way to ameliorate the high
unemployment rate in the United States. Thinking large, it would be billed
as the "The Million American Annotation Effort". A plaint text version is
shown at the end of this message, and the formatted version of the proposal
can be downloaded via http://cs.txstate.edu/~to17/temp
/million-american-annotation-effort-draft.docx.

I would like to get feedback on the proposal as well as input on some
aspects (e.g., target areas for annotations). Below is a brief survey to be
filled out, along with space for general comments. To minimize traffic to
the Corpora list, please just reply to me (thomas.paul.ohara at gmail.com). I
will summarize the responses in a few weeks, maintaining anonymity.

The survey is also intended to gauge the extent to which the research
community feels such an annotation effort would be worthwhile as well as to
see if others might be interested in collaborating on it. Naturally,
getting full approval for such a proposal is a long shot to say the least.
However, it might be possible to get a pilot study funded, especially in
the context of exploring novel approaches to addressing the unemployment
problem (e.g., as part of a campaign promise).

Best,
Tom

----------

Survey:
[x] Keep anonymous (e.g., with respect to 'Other' or 'General Feedback')

1. Overall assessment of proposal's merit
Choose one:
a. ( ) Basically meritless
b. ( ) Not at all practical
c. ( ) Feasible if sufficiently constrained
d. ( ) Entirely worthwhile
e. (*) No response

2. Level of expected participation by responder
Choose one:
a. ( ) Interested in organizational work
b. ( ) Interested in consulting role (e.g., resource development)
c. ( ) Can't participate directly but can provide moral support
d. ( ) No interest whatsoever (e.g., see 1a above)
e. (*) No response

3. Target inventory for word-sense annotations
Choose one or more:
a. [ ] WordNet (http://wordnet.princeton.edu)
b. [ ] FrameNet (https://framenet.icsi.berkeley.edu)
c. [ ] Dante (http://www.webdante.com)
d. [ ] Other: ________________________________________
e. [x] No response

4. Additional natural language annotation target areas
Choose one or more:
a. [ ] Machine translation
b. [ ] Semantic roles (e.g., FrameNet)
c. [ ] Information retrieval
d. [ ] Other: ________________________________________
e. [x] No response

5. Other target areas
Choose one or more:
a. [ ] Image analysis (e.g., object recognition)
b. [ ] Music information retrieval (e.g., verse-level annotations)
c. [ ] Web page annotation (e.g., for Semantic Web)
d. [ ] Other: ________________________________________
e. [x] No response

General Feedback:
________________________________________________________________________________
________________________________________________________________________________
________________________________________________________________________________
________________________________________________________________________________
________________________________________________________________________________
[Use as much space as desired]

 ===============================================================================

*The Million American Annotation Effort: DRAFT*
note: See attachment for formatted version (e.g., hyperlinks with pointers
to further information)

Tom O'Hara
Adjunct Professor
Computer Science Department
Texas State University
2 October 2012

Idea

The fallout from the 2008 economic crisis is still being felt, particularly
with respect to unemployment. A novel way to reduce unemployment would be
to hire up to one million Americans to perform annotations of data in
support of intelligent computer applications. For example, the field of
natural language processing (NLP) endeavors to get computers to understand
English to facilitate intelligent search and other applications. NLP
commonly exploits learning by example, a popular technique in artificial
intelligence (AI), which aims to make computers achieve human-level
intelligence. A large-scale effort to annotate data, providing training
examples for NLP and other areas of AI, could spur high-tech innovation
(e.g., in support of the Semantic Web). This can be viewed as a stimulus
package that promotes both employment and high-tech investment.

Motivation

Exploiting human annotations via example-based learning has led to
significant advancements in NLP. Sample application areas include the
following: 1) grammatical parsing, which derives syntactic parse trees, a
generalization of the sentence diagrams used in grade school; 2) word-sense
disambiguation, such as in choosing which dictionary sense definition best
fits a word in context; and 3) semantic role tagging, which indicates how
phrases contribute to sentence meaning (e.g., the who, what, where).
Previously, development of such systems would require significant
programming and knowledge engineering efforts, often yielding systems
tailored to particular domains to achieve better precision. With
example-based machine learning, the engineering efforts are focused on
extracting features from tagged data. An advantage is that such systems can
be readily adapted to different domains by using different training data.
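
As a concrete illustration of example-based learning, below is a minimal
sketch (not part of the proposal) of a supervised word-sense classifier
trained on annotated examples; it assumes the scikit-learn library, and the
sentences and sense labels are invented for illustration.

    # Minimal sketch of example-based word-sense disambiguation for "bank".
    # Training sentences and sense labels are invented for illustration.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    # Each annotated example pairs a sentence with a sense label for "bank".
    sentences = [
        "she deposited the check at the bank on main street",
        "the bank approved the small business loan",
        "they had a picnic on the bank of the river",
        "erosion slowly wore away the muddy bank of the stream",
    ]
    senses = ["bank/financial", "bank/financial", "bank/river", "bank/river"]

    # Bag-of-words features feed a simple classifier; the learning step
    # replaces hand-crafted disambiguation rules.
    model = Pipeline([
        ("features", CountVectorizer()),
        ("classifier", LogisticRegression()),
    ])
    model.fit(sentences, senses)

    # A new usage; with more training data such predictions become reliable.
    print(model.predict(["a walk along the bank of the river"]))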

Producing some NLP annotation types requires a technical background in
linguistics, as with the parse trees in the Penn Treebank, but word-sense
disambiguation (WSD) can readily be done by native English speakers without
specialized training. Currently, a small subset of the common English
vocabulary has been sufficiently annotated to allow for accurate WSD
(roughly 1,000 distinct words[1]), so a large-scale effort is critical to
ensure much broader coverage (> 50,000 words). Assuming 10,000 instances for
each word with five different annotators per example (for quality assurance
purposes), there would need to be roughly 2.5 billion total word-sense
annotations for basic coverage of English. Thus, a large-scale annotation
effort would be instrumental in achieving this goal.
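
For reference, the annotation volume implied above is just the product of
the assumed parameters; the quick calculation below uses only the figures
from the preceding paragraph.

    # Back-of-the-envelope annotation volume from the figures assumed above.
    target_words = 50_000           # distinct words for basic coverage
    instances_per_word = 10_000     # annotated example sentences per word
    annotators_per_instance = 5     # redundant labels for quality assurance

    total = target_words * instances_per_word * annotators_per_instance
    print(f"{total:,} word-sense annotations")   # 2,500,000,000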

Having such a large annotated corpus of word senses would be indispensable
for the Semantic Web, a long-term effort now under way to make web pages
explicit regarding the entities they discuss. For example, a web page
discussing dogs as pets would use internal labels indicating that the web
page pertains to Canis familiaris (rather than, say, hot dogs). These
technical usage labels would normally be hidden from end users, who would
still query using the more natural keyword approach (e.g., simple English
phrases like "kid-friendly dog"). Therefore, making full use of the Semantic
Web would require search engines to be able to disambiguate word senses. For
instance, if a web page is explicitly tagged with category labels unrelated
to canines, searches for "smart guide dog" should generally omit that page,
even if all the keywords match. The latter could happen with web pages
containing slang usages (e.g., "lucky dog").
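
To illustrate the kind of sense-aware filtering described above, the sketch
below assumes web pages carry explicit sense tags; the page contents, tag
names, and sense identifiers are all hypothetical.

    # Keyword search that also checks explicit sense tags (all data invented).
    pages = [
        {"url": "pets.example/guide-dogs",
         "text": "choosing a smart guide dog for your family",
         "sense_tags": {"dog/canine"}},
        {"url": "slang.example/lucky-dog",
         "text": "that smart guide called him a lucky dog",
         "sense_tags": {"dog/fortunate_person"}},
    ]

    def search(keywords, required_sense):
        """Pages matching all keywords and tagged with the intended sense."""
        hits = []
        for page in pages:
            if all(word in page["text"] for word in keywords):
                if required_sense in page["sense_tags"]:
                    hits.append(page["url"])
        return hits

    # Both pages contain the keywords; only the canine-tagged page is kept.
    print(search(["smart", "guide", "dog"], "dog/canine"))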

Other areas exploiting example-based learning can benefit from having
annotations by large numbers of non-technical users. For example, this
would make it possible to have detailed topic and mood annotations covering
the entire Million Song Dataset, which is popular in music information
retrieval (MIR) research. Currently, detailed annotations have only been
done for datasets involving around 10,000 songs. Having complete
annotations for the Million Song Dataset would help maintain the American
competitive edge in the burgeoning MIR market. In addition, image
annotations can help with computer recognition of objects, and document
annotations can help with text categorization. Two or three such annotation
target areas will be selected, based on input from the research community.

Result

The end result will be a variety of human-tagged datasets that U.S.
companies can license for use in R&D. The data can be made available for
academic research for modest fees. Higher license fees can be assessed for
foreign corporations. (Data protection might be an issue, so special
safeguards might be needed, such as reserving certain portions for
commercial use only by companies meeting strict security criteria.)

Benefits

- A large number of unskilled workers can be hired to perform annotation
tasks requiring only basic reading and analysis skills (e.g., a high school
education).
- Welfare recipients can be required to perform annotations in proportion to
the amount of benefits received, such as full-time work for those receiving
the equivalent of full-time pay. For example, this would allow a single
mother to work at home while still caring for children. (Depending on
circumstances, reduced workloads might be allowed.)
- Minimal infrastructure will be required, as many people will already have
home computers and internet service, the main requirements for annotation
work from home. (The computers do not need to be state of the art, as the
annotation software would be run on a server.)

Proposal chances

- With unemployment still at high levels (> 8%), this can be attractive to
the White House as a novel way to create large numbers of jobs with minimal
infrastructure costs (unlike public works).
- Partisan politics might preclude the proposal from being approved
immediately, but a pilot study might be feasible. For example, $1 billion
would fund about 40,000 annotator jobs (at roughly $24,000 per annotator per
year, including benefits).

Related work

Several different approaches have been applied to produce word-sense
disambiguation annotations. The most common has been the use of
professional annotators, notably in support of the Senseval word-sense
disambiguation competitions (now SemEval). The approach used is typically
to have trained linguists concentrate on specific words, rather than trying
to annotate all words in a sentence at the same time, as was done in earlier
annotation efforts (e.g., SemCor for WordNet, the lexicon commonly used in
NLP). Other approaches have relied upon online users to provide ad hoc
annotations in the context of interaction with a web service (e.g.,
language games). Word-sense annotations are also done with respect to
FrameNet, which concentrates on semantic role tagging.  In addition, there
has been work on annotations for the American National Corpus.

More recently, crowdsourcing has developed into a cost-effective alternative
to traditional annotation, such as via Amazon's Mechanical Turk. This
involves soliciting humans to perform tasks that computers have difficulty
with, in return for payment upon successful completion. There has been
success in certain application areas (e.g., document relevancy judgments
for information retrieval). However, large-scale annotations have not been
produced for word-sense disambiguation. A drawback of crowdsourcing is
that there is no guarantee that the same pool of annotators will be used
throughout. In addition, as the payment model is typically based on
individual tasks, it is less suitable than traditional annotation in the
context of job creation.

Cost estimate

1. Annotators
- rate: $7.50-10/hour (minimum wage is $7.25/hour[2])
- wages (52 weeks; 40 hours/week): $15,600-20,800/person
- benefits, employer taxes, etc.: $7,000-10,000/person (???)
subtotal: Roughly $24 billion for 1M annotators full-time. See below for
cost sharing ideas.
note: Some annotators might only be working part-time, in which case a
higher wage might be offered to offset lack of benefits.

2. Low-level Management
note: It is unclear how much management would be needed. The following
assumes one manager (supervisor) per 100 annotators.
todo: rework as range (e.g., $250M - $1B)
- rate: $15-20/hour
- wages (52 weeks; 40 hours/week): $31,200-41,600/person
- benefits, employer taxes, etc.: $15,000/person (???)
- tasks: coordinating annotators; performing quality assurance (QA)
subtotal: roughly $500M (10,000 low-level managers)

3. Upper Management
note: Likewise, it is unclear how much upper management would be needed; the
following assumes 1 per 100 low-level managers. These managers should have a
strong quantitative background to
help in the overall data analysis.
- rate: $25-30/hour
- wages (52 weeks; 40 hours/week): $52,000-$62,400/person
- benefits, employer taxes, etc.: $20,000/person (???)
- tasks: Workload distribution for annotation task; data analysis
subtotal: roughly $8M (100 second-level managers)

4. Research Staff
note: Assuming 3 full-time principal investigators (co-PIs) and 4 half-time
graduate assistants (GAs) each. todo: add in technical support staff as
well; rework along the lines of a large NSF grant
- PI: $100K/per
- GA: $40K/per (1/2 mgt [$27.5K + 10K tuition + $2.5K other???])
- university costs listed below
subtotal: $780K

5. Infrastructure
note: annotators will provide their own computer and internet access (some
funds can be set aside to support the indigent)
servers (storage and analysis): 10 @ $2K each => $20K
admin: $50K (or done by GA's)
university costs (estimated via research staff costs [a la infrastructure
as half research grant]): $780K
subtotal: $850K

6. Resource Preparation
Certain annotation tasks would require preparation of external resources.
For example, with word-sense disambiguation, it is critical that the target
sense inventory matches actual word usage in practice. For Senseval, this
was achieved by modifying the WordNet lexicon to make the distinctions
clearer. In addition, in cases where alternative lexicons were used,
mappings into WordNet were developed, as the latter is the most common
lexicon used in computational linguistics.
subtotal: $500K ???
Total: ~$25B
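
As a rough cross-check on the two dominant subtotals above, the sketch below
recomputes them directly from the stated hourly rates, benefit figures, and
headcounts; no new numbers are introduced, and the $24 billion annotator
subtotal quoted above falls toward the low end of the resulting range.

    # Cross-check of the annotator and low-level management subtotals.
    HOURS_PER_YEAR = 40 * 52   # 2,080 hours

    def annual_cost(hourly_wage, benefits):
        """Yearly cost per person: full-time wages plus benefits and taxes."""
        return hourly_wage * HOURS_PER_YEAR + benefits

    annot_low  = 1_000_000 * annual_cost(7.50, 7_000)     # ~$22.6B
    annot_high = 1_000_000 * annual_cost(10.00, 10_000)   # ~$30.8B
    mgr_low    =    10_000 * annual_cost(15.00, 15_000)   # ~$462M
    mgr_high   =    10_000 * annual_cost(20.00, 15_000)   # ~$566M

    print(f"annotators: ${annot_low/1e9:.1f}B to ${annot_high/1e9:.1f}B")
    print(f"managers: ${mgr_low/1e6:.0f}M to ${mgr_high/1e6:.0f}M")
    # The remaining items (upper management, research staff, infrastructure,
    # resource preparation) add only about $10 million combined.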

Cost sharing ideas

- Annotation work can be required of welfare recipients, so the existing
Health and Human Services budget can be used to cover part of the costs
(e.g., 25-50% of the total).
- Likewise, extended unemployment benefits can be made conditional upon
annotation work (e.g., 10-25% of the total).
- Individual states can be required to fund a percentage of the cost in
proportion to the number of employees hired there (e.g., 5-10% of the total).

Potential criticisms

- Naive annotations unreliable.
  counter: Require multiple annotations per item (e.g., 5 or more).
- Niche market so can't recoup costs by licensing.
  counter: view as infrastructure support (i.e., intangible)
- Fiscal hawks will make it hard to fund the entire project ($25B),
especially in the short term.
  counter: just getting a trial study funded might be worthwhile in itself
- Other researchers might feel the cost is inordinate. For example, the
entire NSF budget is $6.9 billion per year[3], and that for the NIH is $26.4
billion per year[4].
  counter: only a small portion actually goes into research

Complications

- Converting welfare recipients to annotators might incur significant cost
overhead for other federal departments.
- Might be viewed as creating yet another federal bureaucracy (e.g.,
management structure required).

Miscellaneous

- For the trial, services like eLance or oDesk can be used to supply
time-tracking infrastructure. Both have a high surcharge (roughly 10%), so
perhaps rates can be negotiated to make them more cost-effective (less than
5%).
- Another advantage of targeting natural language annotations is that the
work can help the annotators improve their language skills. Furthermore, it
might even interest some of them in pursuing a career in linguistics.

Footnotes
1. See http://www.senseval.org/data.html for a representative sample of
word-sense annotation datasets.
2. www.dol.gov/dol/topic/wages/minimumwage.htm
3. http://en.wikipedia.org/wiki/National_Science_Foundation
4. http://en.wikipedia.org/wiki/National_Institutes_of_Health