<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 12 (filtered medium)"><style><!--

/* Font Definitions */

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

@font-face

        {font-family:Tahoma;

        panose-1:2 11 6 4 3 5 4 4 2 4;}

@font-face

        {font-family:Consolas;

        panose-1:2 11 6 9 2 2 4 3 2 4;}

@font-face

        {font-family:Verdana;

        panose-1:2 11 6 4 3 5 4 4 2 4;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0in;

        margin-bottom:.0001pt;

        font-size:11.0pt;

        font-family:"Calibri","sans-serif";}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {mso-style-priority:99;

        color:purple;

        text-decoration:underline;}

p.MsoPlainText, li.MsoPlainText, div.MsoPlainText

        {mso-style-priority:99;

        mso-style-link:"Plain Text Char";

        margin:0in;

        margin-bottom:.0001pt;

        font-size:10.5pt;

        font-family:Consolas;}

p.MsoAcetate, li.MsoAcetate, div.MsoAcetate

        {mso-style-priority:99;

        mso-style-link:"Balloon Text Char";

        margin:0in;

        margin-bottom:.0001pt;

        font-size:8.0pt;

        font-family:"Tahoma","sans-serif";}

span.PlainTextChar

        {mso-style-name:"Plain Text Char";

        mso-style-priority:99;

        mso-style-link:"Plain Text";

        font-family:Consolas;}

span.BalloonTextChar

        {mso-style-name:"Balloon Text Char";

        mso-style-priority:99;

        mso-style-link:"Balloon Text";

        font-family:"Tahoma","sans-serif";}

span.EmailStyle21

        {mso-style-type:personal;

        font-family:"Calibri","sans-serif";

        color:windowtext;}

span.EmailStyle22

        {mso-style-type:personal-reply;

        font-family:"Calibri","sans-serif";

        color:#1F497D;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-size:10.0pt;}

@page WordSection1

        {size:8.5in 11.0in;

        margin:1.0in 1.0in 1.0in 1.0in;}

div.WordSection1

        {page:WordSection1;}

--></style><!--[if gte mso 9]><xml>

<o:shapedefaults v:ext="edit" spidmax="1026" />

</xml><![endif]--><!--[if gte mso 9]><xml>

<o:shapelayout v:ext="edit">

<o:idmap v:ext="edit" data="1" />

</o:shapelayout></xml><![endif]--></head><body lang=EN-US link=blue vlink=purple><div class=WordSection1><p class=MsoNormal><span style='color:#1F497D'>Hi,<o:p></o:p></span></p><p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span style='color:#1F497D'>A footnote to point c): Microsoft has attempted to commoditize audio search with OneNote 2007 and 2010. <o:p></o:p></span></p><p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span style='color:#1F497D'>I can see this being used, e.g., to index lecture recordings, but I wonder how usable and useful linguists find this audio search feature in MS OneNote.<o:p></o:p></span></p><p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span style='color:#1F497D'>Thanks,<o:p></o:p></span></p><p class=MsoNormal><span style='color:#1F497D'>thomas<o:p></o:p></span></p><p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p><div><p class=MsoNormal><span style='font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D'>-- <br>Dr. 
Thomas Plagwitz <br>Language Learning Center Manager<br>Instructor of German <br>Web: <u><a href="http://www.plagwitz.org/"><span style='color:#0066CC'>http://www.plagwitz.org/</span></a></u>, </span><u><span style='font-size:10.0pt;font-family:"Verdana","sans-serif";color:#0070C0'><a href="http://plagwitz1.spaces.live.com/?_c11_BlogPart_BlogPart=summary&_c=BlogPart"><span style='color:#0070C0'>Sitemap</span></a></span></u><u><span style='font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D'> </span></u><span style='color:#1F497D'><o:p></o:p></span></p></div><p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p><div><div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in'><p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> corpora-bounces@uib.no [mailto:corpora-bounces@uib.no] <b>On Behalf Of </b>Krishnamurthy, Ramesh<br><b>Sent:</b> Saturday, December 18, 2010 7:49 AM<br><b>To:</b> maxwell@umiacs.umd.edu; sowa@bestweb.net<br><b>Cc:</b> corpora@uib.no<br><b>Subject:</b> [Corpora-List] Linguistics, corpus linguistics, and diglossia<o:p></o:p></span></p></div></div><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><span lang=EN-GB>Hi Mike, John and others following this thread<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-GB>I have followed much of the previous discussion – but felt less qualified to comment <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>while it seemed to be more concerned with linguistic definitions of diglossia.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-GB>Here are my initial thoughts:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-GB>a) all written languages are 
diglossic to some extent, i.e. they display some differences between the two <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>major modes (speech and writing), with a spectrum of sub-modes in between (written-to-be-spoken, <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>for example a political speech, where there will be differences between what was written and what <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>was spoken, as well as the additional delivery components – stress, pronunciation, pauses, deviations <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>from script, etc.; and spoken-to-be-written, for example dictated letters/memos/news reportage, etc., <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>where again there will be differences in the resulting written text). EAGLES, TEI, Cobuild, BNC, and others <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>have discussed problems of text-types/genres in this spectrum. Computer-mediated communication has <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>generated many new genres (with features closer in many respects to spoken language), which have been <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>discussed and categorised more recently.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-GB>b) recordings of speech can contain various problems, such as surrounding noises, multiple speakers, etc.,<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>just as historical manuscripts may contain smudges and stains, handwriting issues, etc.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-GB>c) the audio data corpus may be searchable by sounds (I’m not sure if this has been implemented yet, as I am not<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>a specialist in this area, but if not, I’m sure it will be: 
segment the audio data, and use voice recognition to accept<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>the sound to be searched), or by phonetic symbols (if it has been phonetically transcribed; e.g. search for /yu:z/<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>to find instances of ‘youse’), but the user would have to disambiguate ‘youse’ from other occurrences of this sound,<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>e.g. in ‘what he’s told *<b><u>you’s</u></b>* of no importance to me’, or, more likely, from ‘use’, which may well be very numerous.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-GB>d) the same problems occur with text transcriptions of speech: whichever set of transcription conventions you use,<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>it will be extremely difficult to capture all the variations in the oral delivery. <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-GB>e) but the biggest problem is that the more variations you transcribe, in greater detail and with greater accuracy, the more difficult<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>it will be for the user to find all the occurrences of an “item” (indeed, it requires us to re-define “item”); indexation for retrieval<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>becomes a non-trivial task.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-GB>f) This became apparent early on with lemmatization of modern data at Cobuild.
<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-GB>g) I encountered it again more forcefully when working with historical data, as in the Dictionary of TRADED GOODS &amp; COMMODITIES <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>1550-1800 project at Wolverhampton University. There were so many spelling variations of each “item” that it became difficult to be <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>sure that one had retrieved all instances of that “item”… even alphabetized frequency lists were of little use when the variation was in the <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>first letter (enuff, inough, etc.). One needed a whole range of tools, plus a major investment of time and linguistic expertise in the manual <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>scrutiny of outputs…<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-GB>h) In GeWiss, a current research project on Spoken Academic Discourse funded by the Volkswagen Foundation<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>(<a href="http://www1.aston.ac.uk/lss/research/research-projects/gewiss-spoken-academic-discourse/">http://www1.aston.ac.uk/lss/research/research-projects/gewiss-spoken-academic-discourse/</a>),<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>we are therefore transcribing in several ‘tiers’: (i) the ‘normalised spelling’ tier, which includes established written spellings<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>for some pronunciation variations – this will allow us to search for all instances of ‘want to’, including the established ‘wanna’, <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>however it may have been pronounced on a particular occasion; (ii) a ‘comment’ tier, where transcribers can describe the exact <o:p></o:p></span></p><p class=MsoNormal><span 
lang=EN-GB>nature of the variation in as much detail as they have time to do – this will allow us to add  ‘wannoo’ or ‘wannae’  to the retrievable <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>instances of ‘want to’,  if that is how it is pronounced in particular cases, and if a researcher is interested in such variations<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>(iii) other tiers are available – e.g. for translation of items from other languages in bilingual/multilingual speech, which is an increasing<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>phenomenon in modern times, or for editorial comments about subsequent modifications to information in any of the tiers<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-GB>Best<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>Ramesh<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-GB style='font-size:12.0pt;font-family:"Times New Roman","serif"'>Ramesh Krishnamurthy<br>Lecturer in English Studies, School of Languages and Social Sciences,<br>Aston University, Birmingham B4 7ET, UK<br>Tel: +44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766 [Room NX08, 10th<br>Floor, North Wing of Main Building]<br><a href="http://www1.aston.ac.uk/lss/staff/krishnamurthyr/">http://www1.aston.ac.uk/lss/staff/krishnamurthyr/</a><br>Director, ACORN (Aston Corpus Network project): <a href="http://acorn.aston.ac.uk/">http://acorn.aston.ac.uk/</a> <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB style='font-size:12.0pt;font-family:"Times New Roman","serif"'>---<o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>Message: 4<o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>Date: Fri, 17 Dec 2010 12:45:51 -0500<o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>From: Mike Maxwell <<a 
href="mailto:maxwell@umiacs.umd.edu">maxwell@umiacs.umd.edu</a>><o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>Subject: Re: [Corpora-List] Linguistics, corpus linguistics, and<o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>      diglossia<o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>To: "Angus B. Grieve-Smith" <<a href="mailto:grvsmth@panix.com">grvsmth@panix.com</a>><o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>Cc: <a href="mailto:corpora@uib.no">corpora@uib.no</a><o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoPlainText><span lang=EN-GB>On 12/15/2010 8:35 PM, Angus B. Grieve-Smith wrote:<o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>> As I'm sure you're aware, corpus linguistics is fine; it's just that <o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>> you need a corpus that's representative of what you're studying.<o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoPlainText><span lang=EN-GB>Ay, there's the rub.  What do you do when the corpora don't exist, because people have been educated not to write the way they talk?<o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoPlainText><span lang=EN-GB>I'm sure corpus linguists have pondered this.  How do they study things like "ain't", "y'all", "youse", "youse-uns", "might could", and other non-standard constructions?  Large scale transcription of spoken English?  
(And English is barely diglossic, compared with languages like Arabic or Tamil.)<o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>-- <o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>      Mike Maxwell<o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-GB>---<o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>Message: 5<o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>Date: Fri, 17 Dec 2010 13:30:54 -0500<o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>From: "John F. Sowa" <<a href="mailto:sowa@bestweb.net">sowa@bestweb.net</a>><o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>Subject: Re: [Corpora-List] Linguistics, corpus linguistics, and<o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>      diglossia<o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>To: <a href="mailto:corpora@uib.no">corpora@uib.no</a><o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoPlainText><span lang=EN-GB>On 12/17/2010 12:45 PM, Mike Maxwell wrote:<o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>> I'm sure corpus linguists have pondered this.  
How do they study <o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>> things like "ain't", "y'all", "youse", "youse-uns", "might could", and <o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB>> other non-standard constructions?<o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoPlainText><span lang=EN-GB>A friend of mine, who was analyzing Hawaiian Creole, asked one of her students to put a tape recorder under the kitchen table at home.<o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoPlainText><span lang=EN-GB>In interviews, the native creole speakers would always "correct themselves", but they only spoke freely after they forgot that the recorder was running.<o:p></o:p></span></p><p class=MsoPlainText><span lang=EN-GB><o:p> </o:p></span></p><p class=MsoPlainText><span lang=EN-GB>John Sowa<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-GB>---<o:p></o:p></span></p></div></body></html>