ASL corpora: A proposal.

Dan Parvaz dparvaz at UNM.EDU
Mon Sep 11 18:51:25 UTC 2000


Fellow SLLING-Lers,

Over the past few years, I have had a recurring experience which I would
like to share with you. I get a hankering -- you know what I mean --
to tackle a particular topic, only to find that little to no empirical
data exist on the subject. My hankering eventually goes away, and I am
left feeling -- again, I'm going to assume that you know what I mean. :-)

My particular philosophical bent, which is by no means unique to me, is to
not rely exclusively on native intuition for grammaticality judgments; I
like to see attestations, and lots of them. Were I working in any written
language (and many spoken languages which are not written), I would have
access to corpora, or have the opportunity to "roll my own" from freely
available electronic texts. Apparently, no such resource is publicly
available for any sign language that I know of. Heck, for most studied SLs, I
can't even get a decent dictionary listing attested sentences for each
entry!

Which brings me to this proposal. Is there any way that we can come up
with a group corpus by combining our efforts? Call it the SLLING ASL
corpus (actually, there could be several corpora covering several SLs).
Following the open-source ethic in the software world, it would be owned
by everyone and no one. We (there are a few of us working in comp.
ling/literary computing... Christian, Thomas, d'Armand?) could even use
open source software components to implement it, eschewing proprietary
platforms.

Okay, now I'll pull my head out of the clouds and acknowledge that there
are a few obstacles that need overcoming. Agreeing on an orthography is an
issue, although I don't see why a single middle-of-the-road system
(somewhere between glosses and noting the degree of flexion on every
knuckle) would not be at least workable. At any rate,
machine-readability is a must; a good diagnostic might be "Can you grep
it?"
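
To make the "Can you grep it?" test concrete, here is a rough sketch of
what I have in mind. Everything in it is an assumption for illustration
only: the tab-separated line layout, the text IDs, the function name
grep_corpus, and the sample glosses (which are made-up placeholders, not
attested data). The point is just that one utterance per line in plain
text makes pattern searches trivial.

```python
import re

# Hypothetical line format (an assumption, not an agreed standard):
#   text_id <TAB> utterance_no <TAB> space-separated glosses
# The glosses below are invented placeholders, not attested signing.
SAMPLE = """\
story01\t1\tYESTERDAY IX-1 GO STORE
story01\t2\tIX-1 BUY BOOK
chat02\t1\tIX-2 WANT COFFEE
"""


def grep_corpus(text, pattern):
    """Return (text_id, utterance_no, glosses) for every line whose
    gloss field matches the regular expression `pattern`."""
    rx = re.compile(pattern)
    hits = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        text_id, utt_no, glosses = line.split("\t", 2)
        if rx.search(glosses):
            hits.append((text_id, int(utt_no), glosses))
    return hits


if __name__ == "__main__":
    # Find every utterance containing the (hypothetical) gloss IX-1.
    for hit in grep_corpus(SAMPLE, r"\bIX-1\b"):
        print(hit)
```

The same file would of course work with plain grep on the command line;
the Python version only adds the split into fields, so that hits can be
traced back to a text and an utterance number.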

Then there's the old what's-in-it-for-me issue. As a professor of mine put
it, "Contributing to this won't get [a newly-hired faculty member]
tenure." There are a few responses to this. We can:

1. get data from people who already have tenure, :-)
2. appeal to everyone's scholarly instincts by saying that such a corpus
   has the potential to raise the level of work done across the field (in
   my universe, this would work), or
3. appeal to everyone's [n|gr]eed for data and say that when everyone
   contributes, everyone benefits. It worked for CHILDES, right? And folks
   can do what they want with it (within reason). You want to tag the
   corpus for discourse information? Be our guest. Want to use it to
   configure a statistical parser? Dig in.

Last, but not least, is the time and effort angle on this. Transcribing
is *work*. Perhaps a grant would need to be written to get an initial
corporeal (?) mass that folks could then contribute to at their own
pace/convenience. Or is there another way to do this? The only thing is
that the material in the corpus should IMHO consist of continuous texts in
various genres, as opposed to contrived example sentences used in papers.

So, what do you think? Is this worth starting a conversation or two (or
ten) over?

Cheers,

Dan.

____________
,,,
.. .   D A N  P A R V A Z  --  Geek-in-Residence
 U    University of New Mexico Linguistics Dept
 -    dparvaz@{unm.edu,lanl.gov}   505.480.9638


