[Corpora-List] Re: Minor(ity) Language

Mike Maxwell maxwell at ldc.upenn.edu
Thu Mar 9 23:19:06 UTC 2006


Briony Williams wrote:
> This sounds similar to the BLARK concept ("Basic Language Resource 
> Kit"), which was proposed by Steven Krauwer and developed by ELSNET and 
> ELRA. See http://www.elda.org/blark - quote: "in the framework of the 
> ENABLER thematic network ... ELDA elaborated a report defining a 
> (minimal) set of LRs to be made available for as many languages as 
> possible and mapping the actual gaps that should be filled in so as to 
> meet the needs of the HLT field.".
> That website also contains "BLARK matrices", one per language, to be 
> filled in similarly to the LDC project described above.

Chris Cieri and I presented our project at one of the early 
meetings discussing the BLARK a couple of years ago (maybe it was the 
first one, I'm not sure).  Our reasons for setting the bounds on our own 
survey work (languages with >= 1M speakers, more or less binary 
decisions, etc.) were practical: we wanted to set a goal that we could 
achieve.  And even at that, we only got about halfway through our list 
of languages.

 > However, there are differences:
 >
 > 1) BLARK covers speech resources also (not just text resources).
 > 2) BLARK does not set a minimum number of speakers for a
 >    language (hence it can cover lesser-used languages as well).
 > 3) BLARK also includes "high-end" modules (e.g. syntactic parsers,
 >    sentence generation).
 > 4) The BLARK matrix can be filled in with a greater degree of detail
 >    than "yes/no" - i.e. "irrelevant", "important", "very
 >    important", "essential".

We left out virtually all the European languages, precisely because we 
felt we could rely on the European community to survey those 
languages--and also because it was obvious that most European languages 
were rapidly becoming at least "medium density" languages, if not high 
density, and our goal was to report on _low_ density languages.  At the 
other end, we didn't try to cover languages with fewer than a million 
speakers, because we had to set a limit somewhere (even if it was an 
arbitrary limit) if we were to have a doable project.  And the chances 
seemed very slim that a small language was going to have much in the way 
of resources.  (There are fortunate exceptions, of course, but we would 
have spent a lot of time looking for them.)

A couple of questions in our survey, while having binary answers, were 
more along the "irrelevant/essential" line (point (4) above).  For 
instance, we asked whether the language had a complex inflectional 
morphology, by which we meant roughly "significantly more complex than 
English."  We asked that because the answer determined whether we needed 
to ask a further question: whether there was a morphological parser for 
the language.
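
The dependency between those two survey questions can be sketched as 
follows.  This is only an illustration, not the LDC survey's actual 
format: the class, field, and function names are hypothetical, and the 
language names are placeholders rather than survey data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SurveyEntry:
    """One language's answers (hypothetical layout, for illustration)."""
    language: str
    complex_morphology: bool
    # None means the follow-up question was never asked.
    has_morph_parser: Optional[bool] = None


def needs_parser_question(entry: SurveyEntry) -> bool:
    """The parser question only applies to morphologically complex languages."""
    return entry.complex_morphology


def record_parser_answer(entry: SurveyEntry, answer: bool) -> None:
    """Store the follow-up answer, refusing it where the question is irrelevant."""
    if not needs_parser_question(entry):
        raise ValueError(
            f"{entry.language}: morphological parser question does not apply"
        )
    entry.has_morph_parser = answer


# Placeholder entries, not real survey results.
entry_a = SurveyEntry("Language A", complex_morphology=True)
record_parser_answer(entry_a, False)

entry_b = SurveyEntry("Language B", complex_morphology=False)
# entry_b.has_morph_parser stays None: the question was skipped.
```

Keeping "not asked" (None) distinct from "no" (False) preserves the 
difference between an irrelevant question and a missing resource.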

As for not looking for syntactic parsers, our feeling was that this was 
a survey of _low_ density languages, so almost by definition the answer 
would be "no".  (If no one has built a morphological parser for 
Tigrinya, then there won't be a syntactic parser.)  The same point 
largely holds for speech resources, although that may be changing now.

 > The website asks researchers to fill in details for languages
 > which they  have knowledge of - all languages, not only European
 > ones.  This is a much-needed project and should be encouraged.

I agree about the importance.  It looks like the website has just Modern 
Standard Arabic at this point, unless I missed something.  It would be 
great to expand this.

As I say, I've tried several times to revive (funding for) the sort of 
survey we did at the LDC, with improvements.  My feeling is that doing 
such a survey, and keeping it up-to-date, will require both training of 
multiple surveyors (I don't think it should be a two-person job, like 
ours was) and paying them to take the time to do a good job (and to do 
updates).

I have immense respect for open-source ventures like Wikipedia, but 
such projects are going to be hit-and-miss when it comes to languages: 
Wikipedia doesn't exist in 300 languages, and probably won't for a 
long time.  OTOH, you can find some language resources (particularly 
monolingual text, and sometimes dictionaries) for a lot of low density 
languages, either because there is some commercial market for them 
(newspapers), or because it's a one-person labor of love (some 
dictionaries).  But that's my personal opinion, and I would love to be 
proved wrong!

    Mike Maxwell
    CASL/ University of Maryland


