[Corpora-List] Linguistics, corpus linguistics, and diglossia

Janne Bondi Johannessen jannebj at iln.uio.no
Sat Dec 18 14:06:46 UTC 2010


Hi.
Of course, it all depends on what kind of corpus one has at hand.
Written corpora have their limitations whether there is a diglossic
situation or not. Spoken language corpora, especially if they contain
free, spontaneous conversations between people from the same group,
will often counteract many of the problems of written language
corpora.

I know first-hand that the topic of spoken vs written corpora has been
studied in recent years.

1) The journal Studia Linguistica had a special issue on spoken
language in 2008, which I edited: April 2008, Volume 62, Issue 1,
Pages 1–153
http://onlinelibrary.wiley.com/doi/10.1111/stul.2008.62.issue-1/issuetoc

2) Also in 2008, an anthology of linguistic studies using a spoken
language corpus was published, covering topics in lexicography,
sociolinguistics, syntax, and computational linguistics. Many of the
authors found, to their delight, that the spoken language corpus
revealed aspects of the language (lexical as well as grammatical) that
had been unknown until then (given the usual emphasis on written
language).
If you are amongst the 20 million who can read a Scandinavian
language, you can read this book:
Johannessen/Hagen (eds.): Språk i Oslo [Language in Oslo]
http://www.mamut.net/novus/shop/

Best wishes,
Janne.

2010/12/18 Krishnamurthy, Ramesh <r.krishnamurthy at aston.ac.uk>:
> Hi Mike, John and others following this thread
>
>
>
> I have followed much of the previous discussion – but felt less qualified
> to comment while it seemed to be more concerned with linguistic
> definitions of diglossia.
>
>
>
> Here are my initial thoughts:
>
>
>
> a) all written languages are diglossic to some extent, i.e. display some
> differences between the two major modes (speech and writing), with a
> spectrum of sub-modes in between (e.g. written-to-be-spoken, for example
> a political speech, where there will be differences between what was
> written and what was spoken, as well as the additional delivery
> components – stress, pronunciation, pauses, deviations from script, etc;
> and spoken-to-be-written, for example dictated letters/memos/news
> reportage, etc, where again there will be differences in the resulting
> written text). EAGLES, TEI, Cobuild, BNC, and others have discussed
> problems of text-types/genres in this spectrum. Computer-mediated
> communication has generated many new genres (with features closer in many
> respects to spoken language) which are being discussed and categorised
> more recently.
>
>
>
> b) recordings of speech can contain various problems, such as surrounding
> noises, multiple speakers, etc, just as historical manuscripts may contain
> smudges and stains, handwriting issues, etc
>
>
>
> c) the audio data corpus may be searchable by sounds (I’m not sure if this
> has been implemented yet, as I am not a specialist in this area, but if
> not, I’m sure it will be: segment the audio data, and use
> voice-recognition to accept the sound to be searched), or by phonetic
> symbols (if it has been phonetically transcribed; e.g. search for /yu:z/
> to find instances of ‘youse’), but the user would have to disambiguate
> ‘youse’ from other occurrences of this sound, e.g. in ‘what he’s told
> *you’s* of no importance to me’, or more likely ‘use’, which may be very
> numerous?
>
>
>
> d) the same problems occur with text transcriptions of speech: whichever
> set of transcription conventions you use, it will be extremely difficult
> to capture all the variations in the oral delivery.
>
>
>
> e) but the biggest problem is that the more variations you transcribe, in
> greater detail and with greater accuracy, the more difficult it will be
> for the user to find all the occurrences of an “item” (indeed, it requires
> us to re-define “item”); indexation for retrieval becomes a non-trivial
> task.
>
>
>
> f) This became apparent early on with lemmatization of modern data at
> Cobuild.
>
>
>
> g) I encountered it again more forcefully when working with historical
> data, as in the Dictionary of TRADED GOODS & COMMODITIES 1550-1800 project
> at Wolverhampton University. There were so many spelling variations of
> each “item”, that it became difficult to be sure that one had retrieved
> all instances of that “item”… even alphabetized frequency lists were not
> much use, if variation was in the first letter (enuff, inough, etc). One
> needed a whole range of tools, plus major investment of time and
> linguistic expertise in the manual scrutiny of outputs…
>
>
>
> h) In GeWiss, a current research project on Spoken Academic Discourse
> funded by the Volkswagen Foundation,
> (http://www1.aston.ac.uk/lss/research/research-projects/gewiss-spoken-academic-discourse/)
> we are therefore transcribing in several ‘tiers’: (i) the ‘normalised
> spelling’ tier, which includes established written spellings for some
> pronunciation variations – this will allow us to search for all instances
> of ‘want to’, including the established ‘wanna’, however it may have been
> pronounced on a particular occasion; (ii) a ‘comment’ tier, where
> transcribers can describe the exact nature of the variation in as much
> detail as they have time to do – this will allow us to add ‘wannoo’ or
> ‘wannae’ to the retrievable instances of ‘want to’, if that is how it is
> pronounced in particular cases, and if a researcher is interested in such
> variations; (iii) other tiers are available – e.g. for translation of
> items from other languages in bilingual/multilingual speech, which is an
> increasing phenomenon in modern times, or for editorial comments about
> subsequent modifications to information in any of the tiers
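
A minimal sketch, in Python, of how retrieval over such tiers might work.
Everything named here (the Segment class, the EQUIVALENTS table, the expand
and find functions, the sample utterances) is an illustrative assumption and
not part of GeWiss or of any real transcription tool. The idea is only that a
query such as 'want to' is matched word by word against the normalised tier,
while the comment tier keeps the detail of how the item was pronounced.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed utterance with two parallel tiers (hypothetical schema)."""
    speaker: str
    normalised: str    # 'normalised spelling' tier, e.g. "I wanna leave now"
    comment: str = ""  # 'comment' tier: free-text note on the pronunciation


# Hypothetical table linking established spellings to their full forms.
EQUIVALENTS = {"wanna": "want to", "gonna": "going to"}


def expand(text):
    """Lower-case, split into words, and expand established spellings."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(EQUIVALENTS.get(word, word).split())
    return tokens


def find(segments, query):
    """Return segments whose normalised tier contains the query as whole words."""
    phrase = expand(query)
    n = len(phrase)
    hits = []
    for seg in segments:
        tokens = expand(seg.normalised)
        if any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1)):
            hits.append(seg)
    return hits


segments = [
    Segment("A", "I want to go home"),
    Segment("B", "I wanna leave now", comment="pronounced 'wannae'"),
    Segment("C", "we want tomorrow off"),
]

hits = find(segments, "want to")
for seg in hits:
    print(seg.speaker, "|", seg.normalised, "|", seg.comment)
```

Matching on whole words rather than on raw substrings keeps 'want tomorrow'
out of the results for 'want to'. A researcher interested in the variants
themselves could then filter the hits on the comment tier.
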
>
>
>
> Best
>
> Ramesh
>
>
>
>
>
> Ramesh Krishnamurthy
> Lecturer in English Studies, School of Languages and Social Sciences,
> Aston University, Birmingham B4 7ET, UK
> Tel: +44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766 [Room NX08, 10th
> Floor, North Wing of Main Building]
> http://www1.aston.ac.uk/lss/staff/krishnamurthyr/
> Director, ACORN (Aston Corpus Network project): http://acorn.aston.ac.uk/
>
> ---
>
> Message: 4
> Date: Fri, 17 Dec 2010 12:45:51 -0500
> From: Mike Maxwell <maxwell at umiacs.umd.edu>
> Subject: Re: [Corpora-List] Linguistics, corpus linguistics, and diglossia
> To: "Angus B. Grieve-Smith" <grvsmth at panix.com>
> Cc: corpora at uib.no
>
>
>
> On 12/15/2010 8:35 PM, Angus B. Grieve-Smith wrote:
>
>> As I'm sure you're aware, corpus linguistics is fine; it's just that
>> you need a corpus that's representative of what you're studying.
>
>
>
> Ay, there's the rub.  What do you do when the corpora don't exist, because
> people have been educated not to write the way they talk?
>
>
>
> I'm sure corpus linguists have pondered this.  How do they study things like
> "ain't", "y'all", "youse", "youse-uns", "might could", and other
> non-standard constructions?  Large scale transcription of spoken English?
> (And English is barely diglossic, compared with languages like Arabic or
> Tamil.)
>
> --
>
>       Mike Maxwell
>
>
>
> ---
>
> Message: 5
> Date: Fri, 17 Dec 2010 13:30:54 -0500
> From: "John F. Sowa" <sowa at bestweb.net>
> Subject: Re: [Corpora-List] Linguistics, corpus linguistics, and diglossia
> To: corpora at uib.no
>
>
>
> On 12/17/2010 12:45 PM, Mike Maxwell wrote:
>
>> I'm sure corpus linguists have pondered this.  How do they study
>> things like "ain't", "y'all", "youse", "youse-uns", "might could", and
>> other non-standard constructions?
>
>
>
> A friend of mine, who was analyzing Hawaiian Creole, asked one of her
> students to put a tape recorder under the kitchen table at home.
>
>
>
> In interviews, the native creole speakers would always "correct themselves",
> but they only spoke freely after they forgot that the recorder was running.
>
>
>
> John Sowa
>
> ---
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>



-- 
Janne Bondi Johannessen
Professor, The Text Laboratory, ILN, http://www.hf.uio.no/tekstlab/
President, NEALT, http://omilia.uio.no/nealt/
University of Oslo
P.O.Box 1102 Blindern, N-0317 Oslo, Norway
Tel: +47 22 85 68 14, mob.: +47 928 966 34



