[Corpora-List] Annotation without lexicons

Miles Osborne miles at inf.ed.ac.uk
Tue Jan 28 11:11:16 UTC 2003


this could be tackled as a bootstrapping problem: given a (possibly limited)
annotated training set and two (or more) POS taggers initially trained on
that data, *cotrain* the taggers on the unannotated data, i.e. let each
tagger label the unannotated text and add its most confident labellings to
the training data of the other.
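
To make the idea concrete, here is a minimal sketch of such a co-training
loop. Everything in it is an assumption for illustration only: the toy
FeatureTagger class, the cotrain function, the confidence threshold and the
three-word Spanish examples are all hypothetical, and the two "taggers" are
just frequency tables over word forms and two-letter suffixes, not real POS
taggers.

```python
from collections import Counter, defaultdict


class FeatureTagger:
    """Toy tagger (hypothetical): picks the tag most often seen with a feature."""

    def __init__(self, feature, default_tag="NOUN"):
        self.feature = feature          # function mapping a word to a feature
        self.default_tag = default_tag  # fallback tag for unseen features
        self.counts = defaultdict(Counter)

    def train(self, tagged_sentences):
        # Retrain from scratch on a list of [(word, tag), ...] sentences.
        self.counts = defaultdict(Counter)
        for sentence in tagged_sentences:
            for word, tag in sentence:
                self.counts[self.feature(word)][tag] += 1

    def tag_word(self, word):
        # Return (tag, confidence); unseen features get confidence 0.0.
        seen = self.counts.get(self.feature(word))
        if not seen:
            return self.default_tag, 0.0
        tag, n = seen.most_common(1)[0]
        return tag, n / sum(seen.values())

    def tag(self, sentence):
        # Sentence confidence is that of the least certain token.
        results = [self.tag_word(w) for w in sentence]
        tagged = [(w, t) for w, (t, _) in zip(sentence, results)]
        confidence = min((c for _, c in results), default=0.0)
        return tagged, confidence


def cotrain(tagger_a, tagger_b, seed, unlabelled, rounds=5, threshold=0.9):
    """Each tagger hands its confident labellings to the other tagger."""
    train_a, train_b = list(seed), list(seed)
    pool = list(unlabelled)
    for _ in range(rounds):
        tagger_a.train(train_a)
        tagger_b.train(train_b)
        remaining = []
        for sentence in pool:
            tagged_a, conf_a = tagger_a.tag(sentence)
            tagged_b, conf_b = tagger_b.tag(sentence)
            if conf_a >= threshold and conf_a >= conf_b:
                train_b.append(tagged_a)      # A teaches B
            elif conf_b >= threshold:
                train_a.append(tagged_b)      # B teaches A
            else:
                remaining.append(sentence)    # neither is confident yet
        if len(remaining) == len(pool):       # nothing moved: stop early
            break
        pool = remaining
    tagger_a.train(train_a)
    tagger_b.train(train_b)
    return tagger_a, tagger_b


# Toy data (invented): one annotated sentence, one unannotated sentence.
seed = [[("el", "DET"), ("rey", "NOUN"), ("canta", "VERB")]]
unlabelled = [["el", "rey", "salta"]]

word_tagger = FeatureTagger(lambda w: w.lower())         # view 1: word form
suffix_tagger = FeatureTagger(lambda w: w[-2:].lower())  # view 2: suffix
word_tagger, suffix_tagger = cotrain(word_tagger, suffix_tagger,
                                     seed, unlabelled)
print(word_tagger.tag_word("salta"))  # ('VERB', 1.0)
```

The word-form tagger has never seen "salta" in annotated data, but the
suffix tagger is confident about "-ta" from "canta", so its labelling of the
unannotated sentence ends up in the word-form tagger's training set. With
real taggers one would want a less crude confidence measure and some check
that the taggers are not simply reinforcing each other's errors.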

at this year's EACL, we have a paper that in part deals with cross-genre
bootstrapping.  in your case, one could imagine treating Old Spanish as
belonging to a different genre from Modern Spanish (just!).

Miles

Quoting Mark Davies <mdavies at ilstu.edu>:

> Corpus annotation is of course usually done with the aid of a lexicon
> containing POS and lemma information.  But imagine that you need to tag
> and lemmatize a 1-2 million word corpus of a language for which you do
> not have a lexicon.  A variant of this might be the need to annotate a
> corpus from the older stage of a language -- e.g. Middle English or Old
> Spanish -- which is related to a modern language for which you do have
> a lexicon.  How is this best done?
>
> I've had to address this issue in creating several different corpora and
> have developed my own approach to the problem, but I'm interested in
> alternate approaches that others might have taken.  I realize that this
> might be a FAQ, but any pointers to relevant literature would be
> helpful.  Thanks in advance.
>
> Mark Davies
>
>
> ====================================================
> Mark Davies, Associate Professor, Spanish Linguistics
> 4300 Foreign Languages, Illinois State University, Normal, IL 61790-4300
> 309-438-7975 (voice) / 309-438-8083 (fax)
> http://mdavies.for.ilstu.edu
> ** Historical and dialectal Spanish and Portuguese syntax **
> ** Corpus design and use / Web-database scripting / Distance education **
> =====================================================


