[Corpora-List] quantities of publicly available parallel text?

Wed Feb 27 04:07:20 UTC 2008

Chris Dyer wrote:
> Is anyone aware of attempts to estimate how much machine-readable
> parallel text is publicly available?  I'm trying to get a general
> sense of the scale of parallel data we currently have (and are likely
> to have in the future, assuming current growth trends).  Does anyone
> have any statistics on this sort of thing?

I've tried to come up with these figures several times, with an
emphasis on languages other than the "high density" ones.  One set of
figures (from my talk at the ACL 2005 Workshop on Building and Using
Parallel Texts) of what you could expect to find in the way of bitext,
at a minimum, is the following:
------------
If the language is written, the New Testament (140k words (tokens) in Greek)
For languages which have the complete Bible (OT and NT): ~770k words 
(tokens; ~30k types in English)
Other common sources: Universal Declaration of Human Rights (~1800 words in English)
------------

There are no solid figures on how many languages are written, but I'd 
guess in the neighborhood of 1500 (out of the roughly 7000 languages in 
the world).  Of course, not all of those texts are in electronic form, 
but it wouldn't take a large effort to key them in.  (I would guess that 
OCR is probably not reliable enough.)
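
As a rough back-of-the-envelope sketch (not a measurement), the figures
above can be multiplied out to see what kind of floor they imply.  The
constant and function names below are made up for illustration, and the
code assumes that every written language has a New Testament of about
the Greek token count, which is certainly too tidy: token counts differ
from language to language, and 1500 is only a guess.

```python
# Figures quoted in the message above; treating them as exact is an assumption.
NT_TOKENS = 140_000         # New Testament, counted in Greek
WRITTEN_LANGUAGES = 1_500   # a guess from the message, not a measured figure


def floor_tokens(n_languages: int, tokens_per_language: int) -> int:
    """Total tokens if every language had exactly one text of the given size."""
    return n_languages * tokens_per_language


# One side of the bitext only; the other side would be of similar size.
nt_floor = floor_tokens(WRITTEN_LANGUAGES, NT_TOKENS)
print(f"NT-only floor: {nt_floor:,} tokens")
```

Under those assumptions the New Testament alone would come to something
on the order of 200 million tokens across all written languages, though
spread very thinly per language.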

While we were at the LDC, around 2004, Bill Poser and I did a survey of
resources for low-density languages (LoDLs), specifically for all
languages with at least a million speakers (according to the
Ethnologue), but leaving out most of
the European languages, as well as Japanese, Mandarin Chinese, Modern 
Standard Arabic, and Korean.  Our goal was not to find out how much of 
each resource was available, but to see which languages had at least a 
certain minimum level of resources.  For bilingual text, the minimum 
level was 100k words in electronic form, either in a corpus or estimated 
to be available if you scrounged around on the internet.  Of the 300 or 
so languages with a million speakers, we got through around 150 before 
we ran out of time.  I don't have the figures right now (and more 
importantly, they're badly out of date), but I think we came up with 
fewer than 30 languages (maybe a lot fewer) that had that amount of
parallel text.  Of course that leaves out the high density languages, so 
you could add another 20 or 30, and I suspect the number is 
substantially higher now.  There are some surprises--Basque and Inuit, 
for example, have substantial amounts of parallel text.

I suspect translation houses own a fair amount of bitext, but for 
various reasons can't release it.  (I don't know what genre it is.)

Later, when we worked on a project to create a set of resources for 
LoDLs at the LDC, the scarcity of bilingual text "in the wild" made us 
decide to create our bitext by contracting out to translation agencies 
for most of the languages.  The languages in question were Hungarian 
(for which substantial bitext already existed), Uzbek, Bengali (= 
Bangla), Urdu, Tigrinya, Yoruba (which hardly had any electronic text, 
much less bitext in electronic form), and Tagalog (the Communist Party 
of the Philippines had thoughtfully provided bitext for this and a 
couple of other Philippine languages, although I've heard that the
translations were a bit stilted).  Some other languages were added 
later, and NMSU did several too.

As for the trends, I think the short answer is "translation is 
expensive; who will pay for it?"  On that point, Mark Davis has an
interesting graph of GDP by language in Unicode Technical Note #13
(http://www.unicode.org/notes/tn13).  It's not very encouraging.  And
while there is a noticeable increase in computational resources for many 
languages in the years since Bill and I looked at this, the standard has 
also gotten a lot higher.  100k words is probably way too low as a 
threshold, for example.
-- 
    Mike Maxwell
    What good is a universe without somebody around to look at it?
    --Robert Dicke, Princeton physicist
