[Corpora-List] Linguistics, corpus linguistics, and diglossia

Sat Dec 18 12:48:56 UTC 2010

Hi Mike, John and others following this thread

I have followed much of the previous discussion - but felt less qualified to comment
while it seemed to be more concerned with linguistic definitions of diglossia.

Here are my initial thoughts:

a) all written languages are diglossic to some extent, i.e. display some differences between the two
major modes (speech and writing), with a spectrum of sub-modes in between: written-to-be-spoken
(e.g. a political speech, where there will be differences between what was written and what
was spoken, as well as the additional delivery components: stress, pronunciation, pauses, deviations
from script, etc.) and spoken-to-be-written (e.g. dictated letters/memos/news reportage, etc.,
where again there will be differences in the resulting written text). EAGLES, TEI, Cobuild, BNC, and others
have discussed problems of text-types/genres in this spectrum. Computer-mediated communication has
generated many new genres (with features closer in many respects to spoken language), which have been
discussed and categorised more recently.

b) recordings of speech can contain various problems, such as surrounding noises, multiple speakers, etc.,
just as historical manuscripts may contain smudges and stains, handwriting issues, etc.

c) the audio data corpus may be searchable by sounds (I'm not sure if this has been implemented yet, as I am not
a specialist in this area, but if not, I'm sure it will be: segment the audio data, and use voice-recognition to accept
the sound to be searched), or by phonetic symbols (if it has been phonetically transcribed; e.g. search for /yu:z/
to find instances of 'youse'). But the user would then have to disambiguate 'youse' from other occurrences of this sound,
e.g. in 'what he's told *you's* of no importance to me', or more likely 'use', which may be very numerous.

d) the same problems occur with text transcriptions of speech: whichever set of transcription conventions you use,
it will be extremely difficult to capture all the variations in the oral delivery.

e) but the biggest problem is that the more variations you transcribe, in greater detail and with greater accuracy, the more difficult
it will be for the user to find all the occurrences of an "item" (indeed, it requires us to re-define "item"); indexation for retrieval
becomes a non-trivial task.

f) This became apparent early on with lemmatization of modern data at Cobuild.

g) I encountered it again more forcefully when working with historical data, as in the Dictionary of TRADED GOODS & COMMODITIES
1550-1800 project at Wolverhampton University. There were so many spelling variations of each "item" that it became difficult to be
sure that one had retrieved all instances of that "item". Even alphabetized frequency lists were not much use if the variation was in the
first letter (enuff, inough, etc.). One needed a whole range of tools, plus major investment of time and linguistic expertise in the manual
scrutiny of outputs.

h) In GeWiss, a current research project on Spoken Academic Discourse funded by the Volkswagen Foundation
(http://www1.aston.ac.uk/lss/research/research-projects/gewiss-spoken-academic-discourse/),
we are therefore transcribing in several 'tiers': (i) the 'normalised spelling' tier, which includes established written spellings
for some pronunciation variations; this will allow us to search for all instances of 'want to', including the established 'wanna',
however it may have been pronounced on a particular occasion; (ii) a 'comment' tier, where transcribers can describe the exact
nature of the variation in as much detail as they have time to do; this will allow us to add 'wannoo' or 'wannae' to the retrievable
instances of 'want to', if that is how it is pronounced in particular cases, and if a researcher is interested in such variations;
(iii) other tiers, e.g. for translation of items from other languages in bilingual/multilingual speech, which is an increasingly common
phenomenon, or for editorial comments about subsequent modifications to information in any of the tiers.
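
To make point c) a little more concrete, here is a rough sketch in Python. It assumes the corpus has already been phonetically transcribed and that each token is aligned with an orthographic gloss; the token list, the ad hoc notation and the function name search_phonetic are all invented for illustration and do not describe any existing corpus tool. A search on the sound sequence returns every homophone, so the user still has to pick out the wanted item:

```python
from collections import defaultdict

# Hypothetical aligned tokens: (orthographic gloss, phonetic transcription).
# The notation follows the ad hoc /yu:z/ used above, not strict IPA.
TOKENS = [
    ("youse", "yu:z"),
    ("use", "yu:z"),    # verb
    ("you's", "yu:z"),  # 'what he's told you's of no importance'
    ("use", "yu:s"),    # noun, a different sound sequence
    ("use", "yu:z"),
    ("them", "Dem"),
]

def search_phonetic(tokens, target):
    """Return the positions of every token whose transcription equals
    `target`, grouped by orthographic gloss, so that the homophones
    can be told apart by a human reader."""
    hits = defaultdict(list)
    for position, (gloss, phones) in enumerate(tokens):
        if phones == target:
            hits[gloss].append(position)
    return dict(hits)

result = search_phonetic(TOKENS, "yu:z")
for gloss, positions in result.items():
    print(gloss, positions)
```

Even in this toy data the wanted 'youse' is outnumbered by 'use'; in a real corpus the imbalance would presumably be far greater.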
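
For point g), one of the simpler tools one might try is to rank every entry in a word list by its string similarity to a modern headword, and then scrutinise the top of the ranking by hand. The sketch below uses Python's standard difflib module. The word list, the function name candidate_variants and the cut-off of 0.5 are assumptions chosen for illustration; this is not the method used in the Dictionary project. Unlike an alphabetized list, the ranking does not care whether the variation is in the first letter:

```python
from difflib import SequenceMatcher

def candidate_variants(headword, wordlist, threshold=0.5):
    """Return (word, score) pairs whose similarity to `headword` is at
    least `threshold`, best match first. The output is only a list of
    candidates for manual scrutiny, not a final answer."""
    scored = []
    for word in wordlist:
        score = SequenceMatcher(None, headword, word).ratio()
        if score >= threshold:
            scored.append((word, score))
    return sorted(scored, key=lambda pair: -pair[1])

# Hypothetical word list; only 'enuff' and 'inough' come from the text above.
WORDLIST = ["enuff", "inough", "enough", "salt", "linen"]

candidates = candidate_variants("enough", WORDLIST)
for word, score in candidates:
    print(f"{word}\t{score:.2f}")
```

A lower threshold retrieves more variants but also more noise, which is exactly where the investment of time and linguistic expertise comes in.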
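
The tiered approach in point h) could be modelled roughly as follows. This is only a sketch under stated assumptions: the Segment class, its field names, the sample utterances and the function find_segments are invented here and do not reflect the actual GeWiss transcription format or tools. The idea is that a search normally runs over the normalised tier, and can optionally be widened to forms that transcribers have recorded in the comment tier:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    normalised: str        # 'normalised spelling' tier
    comment: str = ""      # transcriber's notes on the variation
    translation: str = ""  # one of the possible 'other' tiers

def find_segments(segments, forms, comment_forms=()):
    """Return segments whose normalised tier contains any of `forms`,
    or whose comment tier contains any of `comment_forms`."""
    hits = []
    for seg in segments:
        text = seg.normalised.lower()
        note = seg.comment.lower()
        if any(f in text for f in forms) or any(f in note for f in comment_forms):
            hits.append(seg)
    return hits

# Hypothetical data, invented for illustration.
SEGMENTS = [
    Segment("A", "I want to add one point"),
    Segment("B", "we wanna look at the data first"),
    Segment("C", "do you want to start", comment="'want to' pronounced wannae"),
    Segment("A", "thank you very much"),
]

all_want_to = find_segments(SEGMENTS, ["want to", "wanna"])
only_wannae = find_segments(SEGMENTS, [], comment_forms=["wannae"])
print(len(all_want_to), len(only_wannae))
```

Keeping the fine-grained variation out of the normalised tier means the common query stays simple, while a researcher interested in 'wannae' can still reach it through the comments.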

Best
Ramesh

Ramesh Krishnamurthy
Lecturer in English Studies, School of Languages and Social Sciences,
Aston University, Birmingham B4 7ET, UK
Tel: +44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766 [Room NX08, 10th
Floor, North Wing of Main Building]
http://www1.aston.ac.uk/lss/staff/krishnamurthyr/
Director, ACORN (Aston Corpus Network project): http://acorn.aston.ac.uk/
---

Message: 4

Date: Fri, 17 Dec 2010 12:45:51 -0500

From: Mike Maxwell <maxwell at umiacs.umd.edu>

Subject: Re: [Corpora-List] Linguistics, corpus linguistics, and diglossia

To: "Angus B. Grieve-Smith" <grvsmth at panix.com>

Cc: corpora at uib.no

On 12/15/2010 8:35 PM, Angus B. Grieve-Smith wrote:

> As I'm sure you're aware, corpus linguistics is fine; it's just that

> you need a corpus that's representative of what you're studying.

Ay, there's the rub.  What do you do when the corpora don't exist, because people have been educated not to write the way they talk?

I'm sure corpus linguists have pondered this.  How do they study things like "ain't", "y'all", "youse", "youse-uns", "might could", and other non-standard constructions?  Large scale transcription of spoken English?  (And English is barely diglossic, compared with languages like Arabic or Tamil.)

--

      Mike Maxwell

---

Message: 5

Date: Fri, 17 Dec 2010 13:30:54 -0500

From: "John F. Sowa" <sowa at bestweb.net>

Subject: Re: [Corpora-List] Linguistics, corpus linguistics, and diglossia

To: corpora at uib.no

On 12/17/2010 12:45 PM, Mike Maxwell wrote:

> I'm sure corpus linguists have pondered this.  How do they study

> things like "ain't", "y'all", "youse", "youse-uns", "might could", and

> other non-standard constructions?

A friend of mine, who was analyzing Hawaiian Creole, asked one of her students to put a tape recorder under the kitchen table at home.

In interviews, the native creole speakers would always "correct themselves", but they only spoke freely after they forgot that the recorder was running.

John Sowa
---