[Corpora-List] Building corpora

Thu Jan 24 10:46:14 UTC 2008

Dear Arrieta,

It's an interesting question, and I'm afraid that I don't know of any 
work that has been done on this. I don't think anyone will really be 
able to give you actual figures, as there are too many variables, most 
importantly the cost of labour and overheads in your country.

However, to start the discussion on this topic, I would suggest that you 
need to work out the costs of the following:

- planning the project
- allocating or hiring staff, getting rooms, buying computers, arranging 
all the infrastructure; or paying for the costs of these things to your 
institution, or arranging permission to use them
- setting up a computing infrastructure for your project data and 
communications (text storage, version control system, intranet, wiki, etc.)
- obtaining the texts (selection, obtaining permissions, 
downloading/digitising, collecting and marking up metadata, quality control)
- transforming the texts to a common format appropriate for further 
processing and annotation
- devising an annotation model
- training annotators
- manually annotating the corpus
- checking the corpus: size, sampling, text integrity, markup errors, 
annotation accuracy and consistency
- documenting the corpus and annotation
- making the corpus available, and continuing support for its storage, 
use and licensing
- long-term archiving and preservation

And when you've worked out the man-hours involved, double them for a 
realistic estimate.
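
As a rough illustration of that arithmetic, here is a minimal sketch in 
Python. Everything in it is an assumption made up for the example: the 
phase names are abbreviated from the list above, the hour figures, the 
hourly rate and the overhead rate are placeholders, and estimate_cost is 
a hypothetical helper, not an established costing method. The only rule 
taken from this message is the doubling of the estimated hours.

```python
# Placeholder person-hour estimates per phase; none of these figures
# come from a real project. Replace them with your own.
PHASE_HOURS = {
    "planning": 80,
    "staff and infrastructure": 120,
    "obtaining texts": 400,
    "format conversion": 200,
    "annotation model": 160,
    "training annotators": 100,
    "manual annotation": 2000,
    "checking": 300,
    "documentation": 120,
    "distribution and archiving": 100,
}


def estimate_cost(phase_hours, hourly_rate, overhead_rate=0.0, contingency=2.0):
    """Return a rough cost estimate from per-phase person-hours.

    contingency=2.0 applies the "double them" rule of thumb.
    overhead_rate is an assumed fraction added on top of labour cost.
    """
    base_hours = sum(phase_hours.values())
    adjusted_hours = base_hours * contingency
    labour_cost = adjusted_hours * hourly_rate
    total_cost = labour_cost * (1 + overhead_rate)
    # The phase with the most hours is the likely biggest cost driver.
    largest_phase = max(phase_hours, key=phase_hours.get)
    return {
        "base_hours": base_hours,
        "adjusted_hours": adjusted_hours,
        "labour_cost": labour_cost,
        "total_cost": total_cost,
        "largest_phase": largest_phase,
    }


if __name__ == "__main__":
    # hourly_rate and overhead_rate are placeholders for local figures.
    result = estimate_cost(PHASE_HOURS, hourly_rate=25.0, overhead_rate=0.4)
    for key, value in result.items():
        print(f"{key}: {value}")
```

The local hourly rate and overhead rate are the variables that nobody 
on the list can supply for you, which is why they are left as parameters.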

So, for some projects, the biggest cost will be the digitisation of the 
materials. For others, it will be paying for permission to use them. For 
some, it will be constructing a schema for annotation. For many, it will 
be the annotation itself.

I welcome suggestions for the things that I've overlooked in the list above!

Martin

Kutz Arrieta wrote:
> Dear list members,
> I'm trying to calculate what the cost of building annotated corpora for a
> language with no pre-existing annotated data with enough quality and volume
> to be used in initial versions of MT and translation memory would be.
> Any suggestions would be appreciated. 
> KA
> karrieta at vicomtech.org
> ------------------------------------------------------------------------
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>   

-- 
Martin Wynne

Oxford Text Archive: http://www.ota.ox.ac.uk/
Oxford e-Research Centre: http://www.oerc.ox.ac.uk/
CLARIN: http://www.clarin.eu/

Oxford University Computing Services
13 Banbury Road
Oxford
UK - OX2 6NN
Tel: +44 1865 283299
Fax: +44 1865 273275
martin.wynne at oucs.ox.ac.uk
