[Corpora-List] quantities of publicly available parallel text?

Alexandre Rafalovitch arafalov at gmail.com
Wed Feb 27 19:01:15 UTC 2008


On Wed, Feb 27, 2008 at 10:46 AM, Adam Kilgarriff
<adam at lexmasterclass.com> wrote:
> But aren't all these official, centralised corpora both of rather peculiar
> genres, and rather small?  More interesting, to my mind, is Tiedemann and
> Nygard's work, based on the neat observations that

Peculiar genre, perhaps. But so are the legal and biomedical domains,
and those have been getting some recent attention.

As to small, what is considered to be too small? I have 5 million
(uncleaned) tokens for one language in one subtype of documents
(Resolutions of the General Assembly). Is that too small for the kind
of work you envisage?

If so, what would be a good number? Apologies if this question has
already been answered.

Regards,
   Alex.

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


