[Corpora-List] quantities of publicly available parallel text?
Mike Maxwell
maxwell at umiacs.umd.edu
Wed Feb 27 04:07:20 UTC 2008
Chris Dyer wrote:
> Is anyone aware of attempts to estimate how much machine-readable
> parallel text is publicly available? I'm trying to get a general
> sense of the scale of parallel data we currently have (and are likely
> to have in the future, assuming current growth trends). Does anyone
> have any statistics on this sort of thing?
I've tried to come up with these figures several times, with an emphasis on
languages other than the "high density" ones. One set of figures (from my talk
at the ACL 2005 Workshop on Building and Using Parallel Texts) for what
you could expect to find in the way of bitext, at a minimum, is the following:
------------
If the language is written: the New Testament (140k words (tokens) in Greek)
For languages which have the complete Bible (OT and NT): ~770k words
(tokens; ~30k types in English)
Other common sources: Universal Declaration of Human Rights (~1800 words in English)
------------
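The figures above distinguish tokens (running words) from types (distinct word forms). A minimal sketch of how the two counts differ, assuming naive whitespace tokenization and lowercasing; the function name and the sample sentence are made up for illustration, and real counts depend heavily on the tokenizer used:

```python
def token_and_type_counts(text: str) -> tuple[int, int]:
    """Return (tokens, types) under naive whitespace tokenization.

    Assumption: words are separated by whitespace and case is ignored.
    Punctuation is not stripped, so "dog." and "dog" count as two types.
    """
    tokens = text.lower().split()
    return len(tokens), len(set(tokens))


if __name__ == "__main__":
    # Hypothetical sample; "the" occurs twice, so types < tokens.
    n_tokens, n_types = token_and_type_counts("The cat saw the dog")
    print(n_tokens, n_types)
```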
There are no solid figures on how many languages are written, but I'd
guess in the neighborhood of 1500 (out of the roughly 7000 languages in
the world). Of course, not all of those texts are in electronic form,
but it wouldn't take a large effort to key them in. (I would guess that
OCR is probably not reliable enough.)
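To put those guesses together, here is a back-of-the-envelope sketch. It simply multiplies the per-language figures above by my guess of 1500 written languages, and it assumes that every written language really does have the New Testament and the Declaration, which is the premise of the "minimum" and not a measured fact. The token counts are the Greek and English ones, so target-language counts would differ. The function name is hypothetical.

```python
# Per-language figures quoted above (approximate).
NT_TOKENS = 140_000          # New Testament, tokens in Greek
UDHR_TOKENS = 1_800          # Declaration of Human Rights, words in English
WRITTEN_LANGUAGES = 1_500    # a guess, not a solid figure


def floor_tokens(n_languages: int, tokens_per_language: int) -> int:
    """Rough lower bound: one copy of the text per language."""
    return n_languages * tokens_per_language


if __name__ == "__main__":
    nt_floor = floor_tokens(WRITTEN_LANGUAGES, NT_TOKENS)
    udhr_floor = floor_tokens(WRITTEN_LANGUAGES, UDHR_TOKENS)
    print(f"NT floor:   {nt_floor:,} tokens")
    print(f"UDHR floor: {udhr_floor:,} tokens")
```

Under those assumptions the New Testament alone comes to about 210 million tokens across all written languages, but only about 140k per language, which is the number that matters for any one language pair.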
While we were at the LDC, around 2004, Bill Poser and I did a survey of
resources for low-density languages (LoDLs), specifically for all languages with at least a
million speakers (according to the Ethnologue), but leaving out most of
the European languages, as well as Japanese, Mandarin Chinese, Modern
Standard Arabic, and Korean. Our goal was not to find out how much of
each resource was available, but to see which languages had at least a
certain minimum level of resources. For bilingual text, the minimum
level was 100k words in electronic form, either in a corpus or estimated
to be available if you scrounged around on the internet. Of the 300 or
so languages with a million speakers, we got through around 150 before
we ran out of time. I don't have the figures right now (and more
importantly, they're badly out of date), but I think we came up with
fewer than 30 languages (maybe a lot fewer) that had that amount of
parallel text. Of course that leaves out the high density languages, so
you could add another 20 or 30, and I suspect the number is
substantially higher now. There are some surprises: Basque and Inuktitut,
for example, have substantial amounts of parallel text.
I suspect translation houses own a fair amount of bitext, but for
various reasons can't release it. (I don't know what genre it is.)
Later, when we worked on a project to create a set of resources for
LoDLs at the LDC, the scarcity of bilingual text "in the wild" made us
decide to create our bitext by contracting out to translation agencies
for most of the languages. The languages in question were Hungarian
(for which substantial bitext already existed), Uzbek, Bengali (=
Bangla), Urdu, Tigrinya, Yoruba (which hardly had any electronic text,
much less bitext in electronic form), and Tagalog (the Communist Party
of the Philippines had thoughtfully provided bitext for this and a
couple other Philippine languages, although I've heard that the
translations were a bit stilted). Some other languages were added
later, and NMSU did several too.
As for the trends, I think the short answer is "translation is
expensive; who will pay for it?" On that point, Mark Davis has an
interesting graph of GDP by language in Unicode Technical Note #13
(http://www.unicode.org/notes/tn13). It's not very encouraging. And
while there is a noticeable increase in computational resources for many
languages in the years since Bill and I looked at this, the standard has
also gotten a lot higher. 100k words is probably way too low as a
threshold, for example.
--
Mike Maxwell
What good is a universe without somebody around to look at it?
--Robert Dicke, Princeton physicist