[Corpora-List] Linguistics, corpus linguistics, and diglossia

Thu Dec 16 07:04:00 UTC 2010

On 16.12.2010, at 07:06, Adam Kilgarriff wrote:

> It's all about getting the right corpus.  It's almost always harder to get informal than formal text types.  The spoken-conversation part of the BNC is a great role-model.

It might be all about creating your own right corpus...

> A delight of the web is that it has lots of informal language in it, specially in blogs and similar, so, with a little application, we can gather text of informal types.  Our work on web corpus collection always has that in mind.

The problem with the web is the source limitation. Even in highly developed countries with the highest internet penetration, perhaps 2/3 of the population are internet users (the US, for example); the share tends to drop to 1/4 in most developed countries, and towards 0 in the regions that might interest us most (say, Pakistan). But the problem everywhere is that only a small portion of internet users per language will contribute content in the form of text and language data, and the "informal language" they produce is rather specific, not necessarily colloquial (at least I haven't seen enough colloquial Ruhrpott data online, nor Zagreb slang, Chakavian, or Arbanasi).

> > saying that corpus linguistics was exactly the wrong way to build a dictionary
> 
> That's just a counsel of failure.  What does she propose doing instead? Guess (sorry, introspect - mustn't be rude)? Copy existing? Ask her friends?

I think this is right in one sense: when it comes to using existing corpora to generate dictionaries of colloquial language. Creating corpora from data collected via transcripts of questionnaires, recordings, interviews, etc. to get some quantitative results might be helpful. But qualitative data is what usually counts most in such compilations; corpora are less relevant here, and fieldwork is crucial, in particular in what Mike mentions, a diglossic situation. Corpora might be helpful as a way of organizing the collected data for extraction of further details, but they do not seem to be at the core of such an endeavor.

Damir

--
Dr. Damir Cavar
http://ling.unizd.hr/~dcavar/
Uni Konstanz: mobile +49 176 60928748 - office: +49 7531 885357
Uni Zadar: mobile +385 91 8837344
fax (e-mail): +385 23 400063
FaceTime: dcavar at me.com
