[Corpora-List] Building corpora
Martin Wynne
martin.wynne at oucs.ox.ac.uk
Thu Jan 24 10:46:14 UTC 2008
Dear Arrieta,
It's an interesting question, and I'm afraid that I don't know of any
work that has been done on this. I don't think anyone will really be
able to help you in terms of amounts of money, as there are too many
variables, most importantly the cost of labour and overheads in your
country.
However, to start the discussion on this topic, I would suggest that you
need to work out the costs of the following:
- planning the project
- allocating or hiring staff, getting rooms, buying computers, arranging
all the infrastructure; or paying for the costs of these things to your
institution, or arranging permission to use them
- setting up a computing infrastructure for your project data and
communications (text storage, version control system, intranet, wiki, etc.)
- obtaining the texts (selection, obtaining permissions,
downloading/digitising, collecting and marking up metadata, quality control)
- transforming the texts to a common format appropriate for further
processing and annotation
- devising an annotation model
- training annotators
- manually annotating the corpus
- checking the corpus: size, sampling, text integrity, markup errors,
annotation accuracy and consistency
- documenting the corpus and annotation
- making the corpus available, and continuing support for its storage,
use and licensing
- long-term archiving and preservation
And when you've worked out the man-hours involved, double them for a
realistic estimate.
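
To make the arithmetic concrete, here is a rough sketch in Python of how
such an estimate might be put together. Everything in it is an assumption
for illustration only: the Phase class, the estimate_cost function, the
hours, the hourly rate and the fixed costs are all hypothetical
placeholders, not figures from any real project. The only rule taken from
the list above is the doubling of the estimated hours.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    """One line item from the checklist (all values are hypothetical)."""
    name: str
    person_hours: float       # your own first guess at the labour involved
    fixed_cost: float = 0.0   # non-labour costs, e.g. permissions or hardware


def estimate_cost(phases, hourly_rate, contingency=2.0):
    """Sum the hours, apply the contingency factor, and add fixed costs.

    contingency=2.0 encodes the "double them" rule of thumb; hourly_rate
    should include overheads and will vary greatly between countries.
    """
    raw_hours = sum(p.person_hours for p in phases)
    adjusted_hours = raw_hours * contingency
    labour_cost = adjusted_hours * hourly_rate
    fixed_cost = sum(p.fixed_cost for p in phases)
    return {
        "raw_hours": raw_hours,
        "adjusted_hours": adjusted_hours,
        "labour_cost": labour_cost,
        "fixed_cost": fixed_cost,
        "total_cost": labour_cost + fixed_cost,
    }


# Placeholder figures only; replace them with your own estimates.
phases = [
    Phase("planning the project", 80),
    Phase("obtaining the texts", 300, fixed_cost=5000.0),
    Phase("manually annotating the corpus", 1200),
]

result = estimate_cost(phases, hourly_rate=25.0)
for key, value in result.items():
    print(f"{key}: {value:,.2f}")
```

Listing the phases separately like this also shows at a glance which one
dominates the total for a given project.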
So, for some projects, the biggest cost will be the digitisation of the
materials. For others, it will be paying for permission to use them. For
some, it will be constructing a schema for annotation. For many, it will
be the annotation itself, and so on.
I welcome suggestions for the things that I've overlooked in the list above!
Martin
Kutz Arrieta wrote:
> Dear list members,
> I'm trying to calculate the cost of building annotated corpora, for a
> language with no pre-existing annotated data, of sufficient quality and
> volume to be used in initial versions of MT and translation memory.
> Any suggestions would be appreciated.
> KA
> karrieta at vicomtech.org
>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
--
Martin Wynne
Oxford Text Archive: http://www.ota.ox.ac.uk/
Oxford e-Research Centre: http://www.oerc.ox.ac.uk/
CLARIN: http://www.clarin.eu/
Oxford University Computing Services
13 Banbury Road
Oxford
UK - OX2 6NN
Tel: +44 1865 283299
Fax: +44 1865 273275
martin.wynne at oucs.ox.ac.uk