[Corpora-List] Re: Minor(ity) Language

Briony Williams b.williams at bangor.ac.uk
Thu Mar 9 11:47:12 UTC 2006


Mike Maxwell wrote:
> Reminds me of a project we (mostly Bill Poser and myself) did at the LDC 
> a few years back, in which we tried to quantify the resources available 
> for languages with at least a million speakers (of which the Ethnologue 
> reports something like 330).  We looked on the web for things like 100k 
> words of monolingual and bilingual text, bilingual lexicons, 
> morphological parsers (where relevant), etc.  We did _not_ try to 
> quantify more high-end things, such as syntactic parsers or MT programs 
> (although we recorded them if we found them).  Everything was 
> text-based: we did not look at speech resources.

This sounds similar to the BLARK concept ("Basic Language Resource Kit"), 
which was proposed by Steven Krauwer and developed by ELSNET and ELRA. See 
http://www.elda.org/blark - quote: "in the framework of the ENABLER thematic 
network ... ELDA elaborated a report defining a (minimal) set of LRs to be 
made available for as many languages as possible and mapping the actual gaps 
that should be filled in so as to meet the needs of the HLT field."

That website also contains "BLARK matrices", one per language, to be filled 
in similarly to the LDC project described above.

However, there are differences:

1) BLARK covers speech resources also (not just text resources).
2) BLARK does not set a minimum number of speakers for a language (hence it 
can cover lesser-used languages as well).
3) BLARK also includes "high-end" modules (e.g. syntactic parsers, sentence 
generation).
4) The BLARK matrix can be filled in with a greater degree of detail than 
"yes/no" - i.e. "irrelevant", "important", "very important", "essential".

The website asks researchers to fill in details for languages which they have 
knowledge of - all languages, not only European ones.  This is a much-needed 
project and should be encouraged.
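
To illustrate point 4 above, here is a rough sketch of how graded matrix 
entries differ from a plain yes/no checklist. The four levels are the ones 
named above; the resource names, the values assigned to them, and the helper 
functions are hypothetical and are not taken from any real BLARK matrix.

```python
from enum import IntEnum


class Importance(IntEnum):
    """The four grades mentioned above, ordered from lowest to highest."""
    IRRELEVANT = 0
    IMPORTANT = 1
    VERY_IMPORTANT = 2
    ESSENTIAL = 3


# Hypothetical entries for illustration only (not real BLARK data).
matrix = {
    "monolingual text corpus": Importance.ESSENTIAL,
    "bilingual lexicon": Importance.VERY_IMPORTANT,
    "syntactic parser": Importance.IMPORTANT,
    "speech corpus": Importance.IRRELEVANT,
}


def at_least(entries, threshold):
    """Return the resource names graded at or above threshold, sorted by name."""
    return sorted(name for name, level in entries.items() if level >= threshold)


def to_yes_no(entries):
    """Collapse the graded entries to yes/no, discarding the finer detail."""
    return {name: level > Importance.IRRELEVANT for name, level in entries.items()}


if __name__ == "__main__":
    print(at_least(matrix, Importance.VERY_IMPORTANT))
    print(to_yes_no(matrix))
```

Collapsing to yes/no treats "important" and "essential" alike, which is the 
detail that a graded matrix preserves.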

Best regards

Briony Williams


