[Corpora-List] Re: Minor(ity) Language
Mike Maxwell
maxwell at ldc.upenn.edu
Thu Mar 9 23:19:06 UTC 2006
Briony Williams wrote:
> This sounds similar to the BLARK concept ("Basic Language Resource
> Kit"), which was proposed by Steven Krauwer and developed by ELSNET and
> ELRA. See http://www.elda.org/blark - quote: "in the framework of the
> ENABLER thematic network ... ELDA elaborated a report defining a
> (minimal) set of LRs to be made available for as many languages as
> possible and mapping the actual gaps that should be filled in so as to
> meet the needs of the HLT field.".
> That website also contains "BLARK matrices", one per language, to be
> filled in similarly to the LDC project described above.
We (Chris Cieri and I) presented our project at one of the early
meetings discussing the BLARK a couple of years ago (maybe it was the first
one, I'm not sure). Our reasons for setting the bounds on our own survey
work (languages with >= 1M speakers, more or less binary decisions,
etc.) were practical: we wanted to set a goal that we could achieve.
And at that, we only got about halfway through our list of languages.
> However, there are differences:
>
> 1) BLARK covers speech resources also (not just text resources).
> 2) BLARK does not set a minimum number of speakers for a
> language (hence it can cover lesser-used languages as well).
> 3) BLARK also includes "high-end" modules (e.g. syntactic parsers,
> sentence generation).
> 4) The BLARK matrix can be filled in with a greater degree of detail
> than "yes/no" - i.e. "irrelevant", "important", "very
> important", "essential".
We left out virtually all the European languages, precisely because we
felt we could rely on the European community to survey those
languages--and also because it was obvious that most European languages
were rapidly becoming at least "medium density" languages, if not high
density, and our goal was to report on _low_ density languages. At the
other end, we didn't try to cover languages with fewer than a million
speakers, because we had to set a limit somewhere (even if it was an
arbitrary limit) if we were to have a doable project. And the chances
seemed very slim that a small language was going to have much in the way
of resources. (There are fortunate exceptions, of course, but we would
have spent a lot of time looking for them.)
A couple of questions in our survey, while having binary answers, were more
along the "irrelevant/essential" line (point (4) above). For instance,
we asked whether the language had a complex inflectional morphology, by
which we meant roughly "significantly more complex than English." We
asked because the answer determined whether we needed to ask a further
question: whether there was a morphological parser for the language.
As for not looking for syntactic parsers, our feeling was that this was
a survey of _low_ density languages, so almost by definition the answer
would be "no". (If no one has built a morphological parser for
Tigrinya, then there won't be a syntactic parser.) The same point
largely holds for speech resources, although that may be changing now.
> The website asks researchers to fill in details for languages
> which they have knowledge of - all languages, not only European
> ones. This is a much-needed project and should be encouraged.
I agree about the importance. It looks like the website has just Modern
Standard Arabic at this point, unless I missed something. It would be great
to expand this.
As I say, I've tried several times to revive (funding for) the sort of
survey we did at the LDC, with improvements. My feeling is that doing
such a survey, and keeping it up-to-date, will require both training of
multiple surveyors (I don't think it should be a two-person job, like
ours was) and paying them to take the time to do a good job (and to do
updates).
I have immense respect for open-source ventures like Wikipedia, but
such projects are going to be hit-and-miss when it comes to languages:
Wikipedia doesn't exist in 300 languages, and probably won't for a
long time. OTOH, you can find some language resources (particularly
monolingual text, and sometimes dictionaries) for a lot of low density
languages, either because there is some commercial market for them
(newspapers), or because it's a one-person labor of love (some
dictionaries). But that's my personal opinion, and I would love to be
proved wrong!
Mike Maxwell
CASL/University of Maryland