[Corpora-List] Re: Error corpora / error types for Spanish and Italian

rita.calabrese at libero.it rita.calabrese at libero.it
Sun Sep 18 18:18:33 UTC 2011


Dear Eckhard,
here's the link to the International Corpus of Learner English:

http://www.uclouvain.be/en-277586.html

Rita Calabrese
University of Salerno
ITALY
>----Original message----
>From: corpora-request at uib.no
>Date: 15/09/2011 12.00
>To: <corpora at uib.no>
>Subject: Corpora Digest, Vol 51, Issue 16
>
>Today's Topics:
>
>   1.  Error corpora / error types for Spanish and Italian
>      (Eckhard Bick)
>   2. Re:  Frequency of the pronoun I (Ken Litkowski)
>   3. Re:  Frequency of the pronoun I (Marc Brysbaert)
>   4.  QUERY: Joomla 1.6-1.7 components for linguistic corpora
>      (Grokhovski)
>   5.  2nd, revised CFP: "TEI for Linguists" (special issue of
>      jTEI) (Piotr Bański)
>   6. Re:  Frequency of the pronoun I (Rich Cooper)
>   7. Re:  Frequency of the pronoun I (Mike Scott)
>   8.  Semi-conductor texts (Yuri Tambovtsev)
>
>
>----------------------------------------------------------------------
>
>Message: 1
>Date: Wed, 14 Sep 2011 18:04:50 +0200
>From: Eckhard Bick <eckhard.bick at mail.dk>
>Subject: [Corpora-List] Error corpora / error types for Spanish and
>	Italian
>To: Corpora List <corpora at uib.no>
>
>Hello,
>
>Is anybody aware of Spanish or Italian error corpora, ideally with some 
>systematic error type classification and markup? Be it learner corpora 
>or corpora targeting spell/grammar checking ...
>
>Thanks in advance,
>Eckhard Bick
>
>-- 
>Eckhard Bick,
>cand.med., dr.phil.
>University of Southern Denmark
>e-mail: eckhard.bick at mail.dk
>web: http://beta.visl.sdu.dk
>
>
>
>
>------------------------------
>
>Message: 2
>Date: Wed, 14 Sep 2011 12:00:47 -0400
>From: Ken Litkowski <ken at clres.com>
>Subject: Re: [Corpora-List] Frequency of the pronoun I
>To: corpora at uib.no
>
>This discussion has focused on only one aspect of James Pennebaker's 
>work, the 'I' frequency, and perhaps not as much on his many 
>contributions to content analysis, which may have even more relevance to 
>discussions on this list.
>
>Kyle Dent of Xerox has recently performed an analysis 
><http://www.parc.com/content/attachments/through-twitter-glass.pdf> of 
>2400 tweets, with the aim of classifying them into "Questions" and "Not 
>Questions". He developed an elaborate NLP system to deal with these 
>tweets. He kindly provided me with these data, so that I could examine 
>them with my content analysis program to see how well they could be 
>analyzed without all the NLP superstructure. I happened to run a first 
>analysis at the time of this thread. It simply compares the two sets as 
>a whole.
>
>The corpus size is 31,000 words (hardly the stature of BNC, COCA, or 
>OEC). But, curiously, both "i" and "the" hold the top two frequency 
>positions in both:
>
>Set                "the"    "I"
>Questions            400    327
>Not Questions        437    575
>
>Wow! Could this be a classification signature? Although this is not 
>likely, various other statistics in various combinations generated in 
>the program may very well be. So, here we have a micro-genre analysis 
>that confirms the other comments on this thread, much like the Known 
>Similarity Corpora of Adam Kilgarriff (15 years ago!).
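As a worked illustration of the table above, the I/the ratio for each set can be computed directly from the four reported counts. This is only a sketch: the names `counts` and `i_the_ratio` are illustrative and are not taken from Ken's content analysis program. On these figures the ratio comes out at roughly 0.82 for Questions and 1.32 for Not Questions.

```python
# Counts copied from the table above (Questions vs. Not Questions).
counts = {
    "Questions": {"the": 400, "i": 327},
    "Not Questions": {"the": 437, "i": 575},
}


def i_the_ratio(row):
    """Return the I/the ratio for one set of counts."""
    return row["i"] / row["the"]


# Round to two decimals for display only.
ratios = {name: round(i_the_ratio(row), 2) for name, row in counts.items()}
print(ratios)
```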
>
>Sentiment analysis is an emerging field, but is currently dominated by 
>heavy NLP techniques. I would suggest that techniques from content 
>analysis might provide a nice complement.
>
>     Ken
>-- 
>Ken Litkowski        TEL.: 301-482-0237
>CL Research          EMAIL: ken at clres.com
>9208 Gue Road        Home Page: http://www.clres.com
>Damascus, MD 20872-1025 USA Blog: http://www.clres.com/blog
>
>------------------------------
>
>Message: 3
>Date: Wed, 14 Sep 2011 18:47:02 +0200
>From: "Marc Brysbaert" <Marc.Brysbaert at UGent.be>
>Subject: Re: [Corpora-List] Frequency of the pronoun I
>To: "corpora at uib.no" <corpora at uib.no>
>
>For what it is worth, in line with what is written below it looks like  
>the you/the ratio makes an even cleaner distinction between the  
>different types of corpora. There may indeed be various reasons why  
>people include lots of Is in their text. m
>
>Source                        I/the   you/the
>COCA (academic)               0.04    0.02
>COCA (newspapers)             0.11    0.06
>Google (books)                0.12
>COCA (magazines)              0.13    0.11
>American blogs                0.31
>COCA (fiction)                0.35    0.20
>COCA (television programs)    0.39    0.37
>Shakespearean plays           1.31
>SUBTLEX (film subtitles)      1.36    1.42
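Ratios like those in the table can be derived from raw text along the following lines. This is a minimal sketch, not the procedure used for the figures above: the function name `pronoun_ratios` and the simple lowercase regex tokenizer are assumptions, and contracted forms such as "I'm" are deliberately not counted as "I".

```python
import re
from collections import Counter


def pronoun_ratios(text):
    """Return I/the and you/the ratios for a text, or None if it has no 'the'."""
    # Assumed tokenizer: lowercase, then runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    freq = Counter(tokens)
    the = freq["the"]
    if the == 0:
        return None  # ratio undefined without a denominator
    return {"I/the": freq["i"] / the, "you/the": freq["you"] / the}


sample = "I saw the cat and the dog. You and I left."
print(pronoun_ratios(sample))
```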
>
>
>
>Quoting "Alon Lischinsky" <alon.lischinsky at kultmed.umu.se>:
>
>> Richard,
>>
>>> It is striking how clearly your figures indicate how well that single
>> measure works as an indication of corpus character.  Thanks for a useful
>> metric.  It might even be used to identify a significant measure of
>> subjectivity in the corpus.
>>
>> Whatever it is that the FIRST_PERSON_PRONOUN/DEFINITE_ARTICLE ratio
>> measures, it is certainly not 'subjectivity' in any of its usual senses.
>>
>> I assume you mean what the OED glosses as '[t]he quality or condition of
>> resting upon subjective facts or mental representation; the character of
>> existing in the mind only'. However, it is unclear why this should correlate
>> with the frequency of explicit self-mention. First-person pronouns (FPPs)
>> can feature prominently in starkly objective accounts of past or present
>> material processes involving the self:
>>
>> 'Only time I have brown rice is before training and I was having white rice
>> after training, but now I am cutting out most carbs' (
>> http://anabolicminds.com/forum/mma/172158-cutting-weight-carbs.html)
>>
>> At the same time, they can be entirely absent from intensely subjective
>> appraisals:
>>
>> 'As the work developed (and it seemed as if it never would) the music grew
>> almost imperceptibly into a spiteful, clattering machine, only to end back
>> in the rapturous gossamer of impossibly high and blissful shards of sound.
>> There was an encore of some solo Bach - always welcome, but there's often
>> the feeling that offering an old favourite after some difficult contemporary
>> music is something of an apology to the intolerant few who can't help
>> coughing their guts up out of ignorance and boredom.' (
>> http://www.musicomh.com/classical/proms/2011-69_0911.htm)
>>
>> There is no systematic catalogue of the uses and functions of the first
>> person plural pronoun that I know of, but there's been quite extensive
>> discussion of the topic at Language Log (see
>> http://languagelog.ldc.upenn.edu/nll/?p=3155 for a list of relevant posts),
>> and the data suggest nothing like the simple correlation you posit.
>>
>> Cheers,
>>
>> A.
>>
>
>
>
>
>
>
>------------------------------
>
>Message: 4
>Date: Thu, 15 Sep 2011 02:57:10 +0400
>From: Grokhovski <plgr at mail.ru>
>Subject: [Corpora-List] QUERY: Joomla 1.6-1.7 components for
>	linguistic corpora
>To: CORPORA at UIB.NO
>
>Dear colleagues, 
>
>could you please advise whether there are components available for Joomla 1.6-1.7
>which allow web presentation of linguistic corpora and databases (lexical
>databases included)?
>
>Yours sincerely,
>
> Pavel Grokhovski,
> Associate Professor, PhD,
> Chair of Mongolian and Tibetan Studies,
> Saint-Petersburg University, Russia
>
>http://spbu.academia.edu/PavelGrokhovski
>http://orient.pu.ru/dept_mongol/grohovsky.php (page in Russian only)
>
>
>------------------------------
>
>Message: 5
>Date: Wed, 14 Sep 2011 15:02:17 +0200
>From: Piotr Bański <bansp at o2.pl>
>Subject: [Corpora-List] 2nd, revised CFP: "TEI for Linguists" (special
>	issue of jTEI)
>To: Corpora at uib.no
>
>^ Due to the requests for deadline extension that we have received
>^ during the vacation season, we extend the deadlines and publish a
>^ modified call for contributions.
>
>2nd, revised, call for contributions to jTEI, topic: "TEI for linguists"
>----------------------------------------------------------------------
>
>The Text Encoding Initiative, the publisher of the TEI Guidelines that
>have set the standards for Digital Humanities for the past 20 years, has
>recently launched a new peer-reviewed open-access journal, the jTEI
>(http://journal.tei-c.org/journal/), designed to become the primary
>platform for the dissemination of TEI-related content.
>
>The conveners of the TEI Special Interest Group "TEI for Linguists" have
>the pleasure to announce a revised call for papers for the third,
>special edition of jTEI, devoted to the topic of the use of the TEI
>Guidelines for linguistic purposes.
>
>While the Guidelines are an obvious encoding standard in Digital
>Humanities research, they are still not so obvious a choice for those
>working in linguistics. This is surprising, particularly in the field of
>computational and corpus linguistics, because the Guidelines address
>many issues relevant to creators and maintainers of digitalised
>collections of language data such as language corpora, transcriptions of
>spoken language or lexical databases, as well as to descriptions of this
>kind of data, in the form of electronic dictionaries, linguistic
>annotations, Feature-Structure-based modelling of information, or
>metadata catalogs. Moreover, with recent developments in data mining and
>text analysis, the needs of Digital Humanities researchers are becoming
>closely aligned with those working in the field of Natural Language
>Processing. The annotation scheme developed under the auspices of the
>Text Encoding Initiative has the potential to become one of the links
>between these disciplines.
>
>We invite contributions dealing with, in particular:
>
>* (un)suitability of the TEI for the annotation of linguistically
>  annotated corpora;
>* reasons for (not) adopting the TEI in the field of linguistics and
>  language-resource management;
>* the relationship between the TEI encoding scheme and the standards of
>  ISO TC37/SC4 "Language Resource Management";
>* the TEI as the common ground between the Humanities and NLP;
>* interoperability between data formats used in the field of
>  linguistics and in TEI annotations;
>* usefulness of TEI modules to linguists, e.g. for purposes of
>  transcribing speech or encoding feature structures;
>* the potential for rich structuring of documents that the TEI offers
>  vs. text mining / Information Extraction / text analysis -- is the
>  TEI a potential player in this field?
>
>
>Full papers are due on October 31. The notifications of acceptance will
>be sent on December 15.
>
>For further information and for submission and author guidelines, please see
>http://journal.tei-c.org/journal/about/submissions
>
>With any further questions, please e-mail journal at tei-c.org .
>
>Dates:
>-----
>* Submission of full papers for review: 31 October 2011
>* Notification of acceptance: 15 December 2011
>* Complete submissions due: 31 January 2012
>
>Guest Editors for this issue:
>----------------------------
>* Piotr Bański, University of Warsaw
>* Eleonora Litta Modignani Picozzi, King's College, London
>* Andreas Witt, Institut für Deutsche Sprache, Mannheim
>
>
>
>
>------------------------------
>
>Message: 6
>Date: Wed, 14 Sep 2011 10:26:14 -0700
>From: "Rich Cooper" <rich at englishlogickernel.com>
>Subject: Re: [Corpora-List] Frequency of the pronoun I
>To: "'Alon Lischinsky'" <alon.lischinsky at kultmed.umu.se>
>Cc: corpora at uib.no
>
>Dear Alon,
>
> 
>
>Thanks for your clarification.  By "subjectivity"
>I merely meant the view of a situation as seen
>from the observer's point of view - the "I", "me",
>"my", "mine" suite of words, regardless of what
>the "I" reports.  Whether what the "I" sees is
>emotional, factual, or tinged with motives of any
>kind, is for analysts to determine based on the
>text as a whole.  
>
> 
>
>HTH,
>
>-Rich
>
> 
>
>Sincerely,
>
>Rich Cooper
>
>EnglishLogicKernel.com
>
>Rich AT EnglishLogicKernel DOT com
>
>9 4 9 \ 5 2 5 - 5 7 1 2
>
>  _____  
>
>From: alischinsky at gmail.com
>[mailto:alischinsky at gmail.com] On Behalf Of Alon
>Lischinsky
>Sent: Wednesday, September 14, 2011 1:32 AM
>To: Rich Cooper
>Cc: corpora at uib.no
>Subject: Re: [Corpora-List] Frequency of the
>pronoun I
>
> 
>
>Richard,
>
>> It is striking how clearly your figures indicate
>how well that single measure works as an
>indication of corpus character.  Thanks for a
>useful metric.  It might even be used to identify
>a significant measure of subjectivity in the
>corpus. 
>
>Whatever it is that the
>FIRST_PERSON_PRONOUN/DEFINITE_ARTICLE ratio
>measures, it is certainly not 'subjectivity' in
>any of its usual senses.
>
>I assume you mean what the OED glosses as '[t]he
>quality or condition of resting upon subjective
>facts or mental representation; the character of
>existing in the mind only'. However, it is unclear
>why this should correlate with the frequency of
>explicit self-mention. First-person pronouns
>(FPPs) can feature prominently in starkly
>objective accounts of past or present material
>processes involving the self:
>
>'Only time I have brown rice is before training
>and I was having white rice after training, but
>now I am cutting out most carbs'
>(http://anabolicminds.com/forum/mma/172158-cutting-weight-carbs.html)
>
>At the same time, they can be entirely absent from
>intensely subjective appraisals:
>
>'As the work developed (and it seemed as if it
>never would) the music grew almost imperceptibly
>into a spiteful, clattering machine, only to end
>back in the rapturous gossamer of impossibly high
>and blissful shards of sound. There was an encore
>of some solo Bach - always welcome, but there's
>often the feeling that offering an old favourite
>after some difficult contemporary music is
>something of an apology to the intolerant few who
>can't help coughing their guts up out of ignorance
>and boredom.'
>(http://www.musicomh.com/classical/proms/2011-69_0911.htm)
>
>
>There is no systematic catalogue of the uses and
>functions of the first person plural pronoun that
>I know of, but there's been quite extensive
>discussion of the topic at Language Log (see
>http://languagelog.ldc.upenn.edu/nll/?p=3155 for a
>list of relevant posts), and the data suggest
>nothing like the simple correlation you posit.
>
>Cheers,
>
>A.
>
>
>------------------------------
>
>Message: 7
>Date: Thu, 15 Sep 2011 08:42:42 +0100
>From: Mike Scott <mike at lexically.net>
>Subject: Re: [Corpora-List] Frequency of the pronoun I
>To: corpora at uib.no
>
>There are also English texts without THE (lists of products, election 
>results etc.) so the computation either way would need to avoid dividing 
>by zero...
>
>What a useful discussion. Clarified a particularly cluttered and dusty 
>corner of my own thinking.
>
>Cheers -- Mike
>
>On 13/09/2011 19:19, Rich Cooper wrote:
>>
>> Using "the/I" can lead to infinite values in corpora (scientific lit, 
>> patents) that never use the pronoun "I".  It might be better practice 
>> to use the inverse, i.e. the "I/the" ratio, which would be 0.0 for 
>> such corpora.
>>
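The point made in this exchange, that either direction of the ratio can hit a zero denominator, can be handled with a small guard. The sketch below is one possible approach, not something proposed in the thread: the helper name `safe_ratio` and the choice to return `None` for an undefined ratio are assumptions, and the counts are made up for illustration.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is 0."""
    if denominator == 0:
        return None  # undefined rather than infinite or an exception
    return numerator / denominator


# Hypothetical patent-like text with no "I": I/the is 0.0, the/I is undefined.
print(safe_ratio(0, 120))
print(safe_ratio(120, 0))

# Hypothetical product list with no "the": I/the is undefined as well.
print(safe_ratio(3, 0))
```

Returning `None` keeps undefined cases distinguishable from a genuine 0.0, so they can be filtered out before any averaging or ranking.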
>
>-- 
>Mike Scott
>
>***
>If you publish research which uses WordSmith, do let me know so I can include it at
>http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm
>***
>University of Aston and Lexical Analysis Software Ltd.
>mike.scott at aston.ac.uk
>www.lexically.net
>
>
>------------------------------
>
>Message: 8
>Date: Thu, 15 Sep 2011 16:18:20 +0700
>From: "Yuri Tambovtsev" <yutamb at mail.ru>
>Subject: [Corpora-List] Semi-conductor texts
>To: <corpora at uib.no>
>
>Dear Corpora colleagues, Surely, the most frequent word in a corpus depends
>on the materials of the texts under study. In texts on semi-conductors the
>most frequent words are: the, of, and, in, a, for, etc. However, in
>personal messages "I" is more likely to be the most frequent word. So,
>what of it? Therefore, let us not force an open door. Be well, Yuri
>Tambovtsev, Novosibirsk
>
>----------------------------------------------------------------------
>Send Corpora mailing list submissions to
>	corpora at uib.no
>
>To subscribe or unsubscribe via the World Wide Web, visit
>	http://mailman.uib.no/listinfo/corpora
>or, via email, send a message with subject or body 'help' to
>	corpora-request at uib.no
>
>You can reach the person managing the list at
>	corpora-owner at uib.no
>
>When replying, please edit your Subject line so it is more specific
>than "Re: Contents of Corpora digest..."
>
>
>_______________________________________________
>Corpora mailing list
>Corpora at uib.no
>http://mailman.uib.no/listinfo/corpora
>
>
>End of Corpora Digest, Vol 51, Issue 16
>***************************************
>