[Corpora-List] Linguistics, corpus linguistics, and diglossia

Sat Dec 18 16:30:30 UTC 2010

Hi,

A footnote to section c: Microsoft has made an attempt to commoditize audio
search with OneNote 2007 and 2010. 

I see this being used to e.g. index lecture recordings, but I am wondering
how useable/useful linguists find this audio search feature in MS-OneNote.

Thanks,

thomas

-- 
Dr. Thomas Plagwitz 
Language Learning Center Manager
Instructor of German 
Web: http://www.plagwitz.org/
Sitemap: http://plagwitz1.spaces.live.com/?_c11_BlogPart_BlogPart=summary&_c=BlogPart

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Krishnamurthy, Ramesh
Sent: Saturday, December 18, 2010 7:49 AM
To: maxwell at umiacs.umd.edu; sowa at bestweb.net
Cc: corpora at uib.no
Subject: [Corpora-List] Linguistics, corpus linguistics, and diglossia

Hi Mike, John and others following this thread

I have followed much of the previous discussion - but felt less qualified to
comment while it seemed to be more concerned with linguistic definitions of
diglossia.

Here are my initial thoughts:

a) all written languages are diglossic to some extent, i.e. display some
differences between the two major modes (speech and writing), with a
spectrum of sub-modes in between: written-to-be-spoken (for example a
political speech, where there will be differences between what was written
and what was spoken, as well as the additional delivery components - stress,
pronunciation, pauses, deviations from script, etc.) and spoken-to-be-written
(for example dictated letters/memos/news reportage, etc., where again there
will be differences in the resulting written text). EAGLES, TEI, Cobuild,
BNC, and others have discussed problems of text-types/genres in this
spectrum. Computer-mediated communication has generated many new genres
(with features closer in many respects to spoken language), which have been
discussed and categorised more recently.

b) recordings of speech can contain various problems, such as surrounding
noises, multiple speakers, etc., just as historical manuscripts may contain
smudges and stains, handwriting issues, etc.

c) the audio data corpus may be searchable by sounds (I'm not sure if this
has been implemented yet, as I am not a specialist in this area, but if not,
I'm sure it will be: segment the audio data, and use voice-recognition to
accept the sound to be searched), or by phonetic symbols (if it has been
phonetically transcribed; e.g. search for /yu:z/ to find instances of
'youse'), but the user would have to disambiguate 'youse' from other
occurrences of this sound, e.g. in 'what he's told *you's* of no importance
to me', or more likely 'use', which may be very numerous?

d) the same problems occur with text transcriptions of speech: whichever set
of transcription conventions you use, it will be extremely difficult to
capture all the variations in the oral delivery.

e) but the biggest problem is that the more variations you transcribe, in
greater detail and with greater accuracy, the more difficult it will be for
the user to find all the occurrences of an "item" (indeed, it requires us to
re-define "item"); indexation for retrieval becomes a non-trivial task.

f) This became apparent early on with lemmatization of modern data at
Cobuild.

g) I encountered it again more forcefully when working with historical data,
as in the Dictionary of TRADED GOODS & COMMODITIES 1550-1800 project at
Wolverhampton University. There were so many spelling variations of each
"item" that it became difficult to be sure that one had retrieved all
instances of that "item". Even alphabetized frequency lists were not much
use if the variation was in the first letter (enuff, inough, etc). One
needed a whole range of tools, plus major investment of time and linguistic
expertise in the manual scrutiny of outputs.

h) In GeWiss, a current research project on Spoken Academic Discourse funded
by the Volkswagen Foundation
(http://www1.aston.ac.uk/lss/research/research-projects/gewiss-spoken-academic-discourse/),
we are therefore transcribing in several 'tiers':

(i) the 'normalised spelling' tier, which includes established written
spellings for some pronunciation variations - this will allow us to search
for all instances of 'want to', including the established 'wanna', however
it may have been pronounced on a particular occasion;

(ii) a 'comment' tier, where transcribers can describe the exact nature of
the variation in as much detail as they have time to do - this will allow us
to add 'wannoo' or 'wannae' to the retrievable instances of 'want to', if
that is how it is pronounced in particular cases, and if a researcher is
interested in such variations;

(iii) other tiers are available - e.g. for translation of items from other
languages in bilingual/multilingual speech, which is an increasing
phenomenon in modern times, or for editorial comments about subsequent
modifications to information in any of the tiers.
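
To make the tiers in (h) a little more concrete, here is a minimal Python
sketch of how a search over two tiers might work. It is not the GeWiss
implementation: the Utterance record, the ESTABLISHED mapping and the
search() function are hypothetical names invented for illustration, and the
four sample utterances are made up. The idea is only that retrieval runs
against the normalised tier, while the comment tier is consulted afterwards
to narrow the hits to a particular pronunciation.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    speaker: str
    normalised: str    # 'normalised spelling' tier
    comment: str = ""  # free-text 'comment' tier


# Hypothetical table of established written spellings for one "item".
ESTABLISHED = {"want to": ("want to", "wanna")}

# Invented sample data, for illustration only.
utterances = [
    Utterance("S1", "I want to ask a question"),
    Utterance("S2", "I wanna add something", comment="pronounced 'wannae'"),
    Utterance("S3", "we wanna move on", comment="pronounced 'wannoo'"),
    Utterance("S1", "that is what I wanted"),
]


def search(data: List[Utterance], item: str,
           pronounced: Optional[str] = None) -> List[Utterance]:
    """Find an item on the normalised tier; optionally filter by comment tier."""
    forms = ESTABLISHED.get(item, (item,))
    # Whole-word match, so that 'want to' does not match 'want tomorrow'.
    patterns = [re.compile(r"\b" + re.escape(f) + r"\b", re.IGNORECASE)
                for f in forms]
    hits = [u for u in data if any(p.search(u.normalised) for p in patterns)]
    if pronounced is not None:
        # Keep only hits whose comment tier mentions the requested variant.
        hits = [u for u in hits if pronounced.lower() in u.comment.lower()]
    return hits


all_hits = search(utterances, "want to")
wannae_hits = search(utterances, "want to", pronounced="wannae")
```

In this sketch a plain search for 'want to' returns the first three
utterances, and adding pronounced="wannae" narrows that to the one whose
comment tier records that pronunciation. Keeping the variation out of the
normalised tier is what stops the "item" from fragmenting at retrieval time.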

Best

Ramesh

Ramesh Krishnamurthy
Lecturer in English Studies, School of Languages and Social Sciences,
Aston University, Birmingham B4 7ET, UK
Tel: +44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766 [Room NX08, 10th
Floor, North Wing of Main Building]
http://www1.aston.ac.uk/lss/staff/krishnamurthyr/
Director, ACORN (Aston Corpus Network project): http://acorn.aston.ac.uk/ 

---

Message: 4
Date: Fri, 17 Dec 2010 12:45:51 -0500
From: Mike Maxwell <maxwell at umiacs.umd.edu>
Subject: Re: [Corpora-List] Linguistics, corpus linguistics, and diglossia
To: "Angus B. Grieve-Smith" <grvsmth at panix.com>
Cc: corpora at uib.no

On 12/15/2010 8:35 PM, Angus B. Grieve-Smith wrote:

> As I'm sure you're aware, corpus linguistics is fine; it's just that
> you need a corpus that's representative of what you're studying.

Ay, there's the rub.  What do you do when the corpora don't exist, because
people have been educated not to write the way they talk?

I'm sure corpus linguists have pondered this.  How do they study things like
"ain't", "y'all", "youse", "youse-uns", "might could", and other
non-standard constructions?  Large scale transcription of spoken English?
(And English is barely diglossic, compared with languages like Arabic or
Tamil.)

-- 

      Mike Maxwell

---

Message: 5
Date: Fri, 17 Dec 2010 13:30:54 -0500
From: "John F. Sowa" <sowa at bestweb.net>
Subject: Re: [Corpora-List] Linguistics, corpus linguistics, and diglossia
To: corpora at uib.no

On 12/17/2010 12:45 PM, Mike Maxwell wrote:

> I'm sure corpus linguists have pondered this.  How do they study
> things like "ain't", "y'all", "youse", "youse-uns", "might could", and
> other non-standard constructions?

A friend of mine, who was analyzing Hawaiian Creole, asked one of her
students to put a tape recorder under the kitchen table at home.

In interviews, the native creole speakers would always "correct themselves",
but they only spoke freely after they forgot that the recorder was running.

John Sowa

---

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora