[Corpora-List] Linguistics, corpus linguistics, and diglossia
Thomas Plagwitz
thomas_plagwitz at hotmail.com
Sat Dec 18 16:30:30 UTC 2010
Hi,
A footnote to point c) below: Microsoft has made an attempt to commoditize
audio search with OneNote 2007 and 2010.
I can see this being used, e.g., to index lecture recordings, but I am
wondering how usable/useful linguists find this audio search feature in
MS-OneNote.
Thanks,
thomas
--
Dr. Thomas Plagwitz
Language Learning Center Manager
Instructor of German
Web: http://www.plagwitz.org/
Sitemap: http://plagwitz1.spaces.live.com/?_c11_BlogPart_BlogPart=summary&_c=BlogPart
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Krishnamurthy, Ramesh
Sent: Saturday, December 18, 2010 7:49 AM
To: maxwell at umiacs.umd.edu; sowa at bestweb.net
Cc: corpora at uib.no
Subject: [Corpora-List] Linguistics, corpus linguistics, and diglossia
Hi Mike, John and others following this thread
I have followed much of the previous discussion, but felt less qualified
to comment while it seemed to be more concerned with linguistic
definitions of diglossia.
Here are my initial thoughts:
a) All written languages are diglossic to some extent, i.e. they display
some differences between the two major modes (speech and writing), with a
spectrum of sub-modes in between. One example is written-to-be-spoken
text, such as a political speech, where there will be differences between
what was written and what was spoken, as well as the additional delivery
components (stress, pronunciation, pauses, deviations from script, etc.).
Another is spoken-to-be-written text, such as dictated letters, memos and
news reportage, where again there will be differences in the resulting
written text. EAGLES, TEI, Cobuild, BNC, and others have discussed
problems of text-types/genres in this spectrum. Computer-mediated
communication has generated many new genres (with features closer in many
respects to spoken language), which are now being discussed and
categorised.
b) Recordings of speech can contain various problems, such as background
noise, multiple speakers, etc., just as historical manuscripts may contain
smudges and stains, handwriting issues, etc.
c) The audio data corpus may be searchable by sounds (I'm not sure if this
has been implemented yet, as I am not a specialist in this area, but if
not, I'm sure it will be: segment the audio data, and use voice
recognition to accept the sound to be searched), or by phonetic symbols
(if it has been phonetically transcribed; e.g. search for /yu:z/ to find
instances of 'youse'). But the user would have to disambiguate 'youse'
from other occurrences of this sound, e.g. in 'what he's told *you's* of
no importance to me', or more likely 'use', which may well be very
numerous.
d) The same problems occur with text transcriptions of speech: whichever
set of transcription conventions you use, it will be extremely difficult
to capture all the variations in the oral delivery.
e) But the biggest problem is that the more variations you transcribe, in
greater detail and with greater accuracy, the more difficult it will be
for the user to find all the occurrences of an "item" (indeed, it requires
us to re-define "item"); indexation for retrieval becomes a non-trivial
task.
f) This became apparent early on with lemmatization of modern data at
Cobuild.
g) I encountered it again more forcefully when working with historical
data, as in the Dictionary of TRADED GOODS & COMMODITIES 1550-1800 project
at Wolverhampton University. There were so many spelling variations of
each "item" that it became difficult to be sure that one had retrieved all
instances of that "item". Even alphabetized frequency lists were not much
use if the variation was in the first letter (enuff, inough, etc.). One
needed a whole range of tools, plus a major investment of time and
linguistic expertise in the manual scrutiny of outputs.
h) In GeWiss, a current research project on Spoken Academic Discourse
funded by the Volkswagen Foundation
(http://www1.aston.ac.uk/lss/research/research-projects/gewiss-spoken-academic-discourse/),
we are therefore transcribing in several 'tiers':
(i) the 'normalised spelling' tier, which includes established written
spellings for some pronunciation variations - this will allow us to search
for all instances of 'want to', including the established 'wanna', however
it may have been pronounced on a particular occasion;
(ii) a 'comment' tier, where transcribers can describe the exact nature of
the variation in as much detail as they have time to do - this will allow
us to add 'wannoo' or 'wannae' to the retrievable instances of 'want to',
if that is how it is pronounced in particular cases, and if a researcher
is interested in such variations;
(iii) other tiers are available - e.g. for translation of items from other
languages in bilingual/multilingual speech, which is an increasing
phenomenon in modern times, or for editorial comments about subsequent
modifications to information in any of the tiers.
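
To make the tier idea in h) concrete, here is a minimal Python sketch of
how a 'normalised spelling' tier and a 'comment' tier might work together
for retrieval. Everything in it is an assumption for illustration: the
Segment record, the VARIANTS table and the sample data are hypothetical,
and the actual GeWiss transcription format and tools may look quite
different.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed stretch of speech with parallel tiers (hypothetical layout)."""
    normalised: str        # 'normalised spelling' tier, e.g. "want to" or "wanna"
    comment: str = ""      # 'comment' tier: observed pronunciation, if noted
    translation: str = ""  # optional translation tier for other-language items


# Hypothetical, hand-built table: search item -> established written spellings.
VARIANTS = {"want to": {"want to", "wanna"}}


def find_item(segments, item):
    """Return indexes of segments whose normalised tier realises `item`."""
    forms = VARIANTS.get(item, {item})
    return [i for i, seg in enumerate(segments) if seg.normalised in forms]


def pronunciations(segments, item):
    """Count realisations of `item`, preferring the comment tier when filled."""
    hits = find_item(segments, item)
    return Counter(segments[i].comment or segments[i].normalised for i in hits)


# Invented sample data, not taken from any corpus.
segments = [
    Segment("want to"),
    Segment("wanna"),
    Segment("wanna", comment="wannae"),
    Segment("want to", comment="wannoo"),
    Segment("use"),
]

hits = find_item(segments, "want to")
counts = pronunciations(segments, "want to")
```

Searching only the normalised tier keeps retrieval simple (all four
'want to' segments are found, and 'use' is not), while the comment tier
still lets a researcher separate 'wannae' from 'wannoo' afterwards.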
Best
Ramesh
Ramesh Krishnamurthy
Lecturer in English Studies, School of Languages and Social Sciences,
Aston University, Birmingham B4 7ET, UK
Tel: +44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766
[Room NX08, 10th Floor, North Wing of Main Building]
http://www1.aston.ac.uk/lss/staff/krishnamurthyr/
Director, ACORN (Aston Corpus Network project): http://acorn.aston.ac.uk/
---
Message: 4
Date: Fri, 17 Dec 2010 12:45:51 -0500
From: Mike Maxwell <maxwell at umiacs.umd.edu>
Subject: Re: [Corpora-List] Linguistics, corpus linguistics, and
diglossia
To: "Angus B. Grieve-Smith" <grvsmth at panix.com>
Cc: corpora at uib.no
On 12/15/2010 8:35 PM, Angus B. Grieve-Smith wrote:
> As I'm sure you're aware, corpus linguistics is fine; it's just that
> you need a corpus that's representative of what you're studying.
Ay, there's the rub. What do you do when the corpora don't exist, because
people have been educated not to write the way they talk?
I'm sure corpus linguists have pondered this. How do they study things like
"ain't", "y'all", "youse", "youse-uns", "might could", and other
non-standard constructions? Large scale transcription of spoken English?
(And English is barely diglossic, compared with languages like Arabic or
Tamil.)
--
Mike Maxwell
---
Message: 5
Date: Fri, 17 Dec 2010 13:30:54 -0500
From: "John F. Sowa" <sowa at bestweb.net>
Subject: Re: [Corpora-List] Linguistics, corpus linguistics, and
diglossia
To: corpora at uib.no
On 12/17/2010 12:45 PM, Mike Maxwell wrote:
> I'm sure corpus linguists have pondered this. How do they study
> things like "ain't", "y'all", "youse", "youse-uns", "might could", and
> other non-standard constructions?
A friend of mine, who was analyzing Hawaiian Creole, asked one of her
students to put a tape recorder under the kitchen table at home.
In interviews, the native creole speakers would always "correct themselves",
but they only spoke freely after they forgot that the recorder was running.
John Sowa