[Corpora-List] Re: Minor(ity) Language
Briony Williams
b.williams at bangor.ac.uk
Thu Mar 9 11:47:12 UTC 2006
Mike Maxwell wrote:
> Reminds me of a project we (mostly Bill Poser and myself) did at the LDC
> a few years back, in which we tried to quantify the resources available
> for languages with at least a million speakers (of which the Ethnologue
> reports something like 330). We looked on the web for things like 100k
> words of monolingual and bilingual text, bilingual lexicons,
> morphological parsers (where relevant), etc. We did _not_ try to
> quantify more high-end things, such as syntactic parsers or MT programs
> (although we recorded them if we found them). Everything was
> text-based: we did not look at speech resources.
This sounds similar to the BLARK concept ("Basic Language Resource Kit"),
which was proposed by Steven Krauwer and developed by ELSNET and ELRA. See
http://www.elda.org/blark - quote: "in the framework of the ENABLER thematic
network ... ELDA elaborated a report defining a (minimal) set of LRs to be
made available for as many languages as possible and mapping the actual gaps
that should be filled in so as to meet the needs of the HLT field."
That website also contains "BLARK matrices", one per language, to be filled
in similarly to the LDC project described above.
However, there are differences:
1) BLARK covers speech resources also (not just text resources).
2) BLARK does not set a minimum number of speakers for a language (hence it
can cover lesser-used languages as well).
3) BLARK also includes "high-end" modules (e.g. syntactic parsers, sentence
generation).
4) The BLARK matrix can be filled in with a greater degree of detail than
"yes/no" - i.e. "irrelevant", "important", "very important", "essential".
The website asks researchers to fill in details for languages which they have
knowledge of - all languages, not only European ones. This is a much-needed
project and should be encouraged.
Best regards
Briony Williams