[Corpora-List] Corpora Digest, Vol 53, Issue 24

Mon Nov 21 14:11:11 UTC 2011

Dear Sylviane
Hope all is well with you. The one that springs to mind, which you probably don't want, is the one put together by Marianne Hundt. Her advertisement corpus is a manually assembled collection of late 19th- and 20th-century American mail-order catalogues.
Love
Antoinette

________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of corpora-request at uib.no [corpora-request at uib.no]
Sent: 21 November 2011 11:00
To: corpora at uib.no
Subject: Corpora Digest, Vol 53, Issue 24

Today's Topics:

   1.  (no subject) (PANAYIOTA VATIKIOTI)
   2.  Significance test for TTR (CRuehlemann at aol.com)
   3.  Has anyone got a corpus of Arabic transliterated into Roman
      alphabet available or know where there is one? (Yorick Wilks)
   4. Re:  Significance test for TTR (Angus Grieve-Smith)
   5. Re:  Significance test for TTR (CRuehlemann at aol.com)
   6.  Corpus of advertisements (Sylviane Granger)
   7. Re:  Significance test for TTR (Georgios Mikros)
   8.  Call for Papers: Second Workshop on Computational
      Linguistics and Writing (CL&W 2012) (Michael Piotrowski)
   9.  Man bites dog (Mike Maxwell)
  10. Re:  Man bites dog (Mark Lybrand)
  11. Re:  Man bites dog (Mike Maxwell)
  12.  Language Acquisition (Mark Lybrand)
  13. Re:  Significance test for TTR (David L. Hoover)
  14. Re:  Language Acquisition (David Wible)

----------------------------------------------------------------------

Message: 1
Date: Sun, 20 Nov 2011 11:09:23 +0000 (GMT)
From: PANAYIOTA VATIKIOTI <vpanayiota79 at yahoo.com>
Subject: [Corpora-List] (no subject)
To: "Corpora at uib.no" <Corpora at uib.no>

Hello,

I wish to unsubscribe from the list.

Best regards,
Panayiota

------------------------------

Message: 2
Date: Sun, 20 Nov 2011 12:20:44 -0500 (EST)
From: CRuehlemann at aol.com
Subject: [Corpora-List] Significance test for TTR
To: CORPORA at UIB.NO

Hi all,

The type token ratio (TTR) is a measure of the lexical diversity of a
text/text type. If one finds in two texts/text types two widely differing TTRs,
one would like to assess the significance of this finding.

Which test is appropriate for differences between TTRs?

Best
Chris

------------------------------

Message: 3
Date: Sun, 20 Nov 2011 12:48:04 -0500
From: Yorick Wilks <Y.Wilks at dcs.shef.ac.uk>
Subject: [Corpora-List] Has anyone got a corpus of Arabic
        transliterated into Roman alphabet available or know where there is
        one?
To: corpora <corpora at uib.no>

Thanks for any information.
Yorick Wilks

On 11 Nov 2011, at 03:15, Yunqing Xia wrote:

> Dear colleagues,
>
> We recently started research on cross-/multi-lingual topic
> detection within short-text collections.  At this stage, we focus on
> Chinese and English. We would appreciate any feedback that points us to
> related work on algorithms and resources.
>
>
> Regards,
> Yunqing
>
> ---------------------------
> Yunqing Xia, Dr., A. Prof.
> Center for Speech and Language Technology
> Tsinghua University, Beijing
> ---------------------------
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

------------------------------

Message: 4
Date: Sun, 20 Nov 2011 12:48:23 -0500
From: Angus Grieve-Smith <grvsmth at panix.com>
Subject: Re: [Corpora-List] Significance test for TTR
To: CORPORA at UIB.NO

Good question!  Before you even bother testing for significance, you need a random sample - or in your case, random samples from two populations.  Are your samples random?  If not, a test of statistical significance cannot be interpreted meaningfully, so don't waste time on it.

- Angus B. Grieve-Smith
grvsmth at panix.com


------------------------------

Message: 5
Date: Sun, 20 Nov 2011 12:58:56 -0500 (EST)
From: CRuehlemann at aol.com
Subject: Re: [Corpora-List] Significance test for TTR
To: CORPORA at UIB.NO

Suppose the two samples are random and suppose they were controlled for
length: you would have x tokens in each sample but y1 types in sample 1 and
y2 types in sample 2. Is there a test for the significance of the difference?

------------------------------

Message: 6
Date: Sun, 20 Nov 2011 18:58:06 +0100
From: Sylviane Granger <sylviane.granger at uclouvain.be>
Subject: [Corpora-List] Corpus of advertisements
To: corpora <corpora at uib.no>

Dear colleagues,

Could anyone point me to a corpus of English and/or French advertisements?

Thanks a lot!

Best wishes,

Sylviane Granger

Professor Sylviane Granger
Director
Centre for English Corpus Linguistics
Université catholique de Louvain
Place Blaise Pascal 1
B-1348 Louvain-la-Neuve (Belgium)
http://www.uclouvain.be/en-cecl.html
http://www.uclouvain.be/sylviane.granger

------------------------------

Message: 7
Date: Sun, 20 Nov 2011 20:00:00 +0200
From: "Georgios Mikros" <gmikros at isll.uoa.gr>
Subject: Re: [Corpora-List] Significance test for TTR
To: <CRuehlemann at aol.com>,      <CORPORA at uib.no>

Dear Chris,

First things first. TTR is highly dependent on text length, so you have to
be sure that the measurements have been taken from equal-size text samples.
Otherwise you should use a more robust index such as Yule's K or Zipf's Z
(see [1] for a detailed description of this problem). Now, coming to your
original question: TTR is a continuous variable, so you can use the whole
range of parametric statistics. This means that you can use a t-test if you
want to check whether TTR is significantly different across two classes (e.g.
gender distinction in essays), or ANOVA if your independent variable has
many classes (e.g. text genre, text topic, etc.). You can also fit a
linear regression model with TTR as the dependent variable and, as
independent variables, the ones that describe your research hypothesis. In
all the above cases you need multiple TTR measurements, because inferential
statistics are based on the distribution parameters of the TTR. There is
also the option of comparing a single TTR value to a distribution of TTR
values using a one-sample location test (also called a Z test), which tells
you how far the specific TTR value lies from the mean of the TTRs.
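
A minimal Python sketch of the two-class comparison described above, assuming each class contributes several TTR values measured on equal-size samples. The function names and the numbers are hypothetical, and Welch's unequal-variance form of the t-test is a choice made for this sketch, not something specified in the thread. Only the standard library is used, so the sketch stops at the t statistic and its degrees of freedom rather than a p-value.

```python
import math
import statistics


def ttr(tokens):
    """Type-token ratio: distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom
    for two independent samples of TTR values."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance is the sample variance (n - 1 denominator)
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical TTRs, one per equal-size sample, for two classes of texts
class_1 = [0.52, 0.48, 0.55, 0.50, 0.47]
class_2 = [0.41, 0.44, 0.39, 0.45, 0.42]
t_stat, dof = welch_t(class_1, class_2)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

The statistic would then be looked up against a t distribution with `dof` degrees of freedom, for example in a statistics package.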

If the only things you know are just two TTR values, I don't think you can
compare them in any meaningful way.

Best

George Mikros

[1] Tweedie, Fiona J., & Baayen, R. Harald (1998). How variable may a
constant be? Measures of lexical richness in perspective. Computers and the
Humanities, 32(5), 323-352.
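
For readers unfamiliar with Yule's K, the index named in the message above, here is a short Python sketch using the commonly cited definition K = 10^4 * (sum over m of m^2 * V(m) - N) / N^2, where N is the number of tokens and V(m) the number of types occurring exactly m times. The function name and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's K = 10^4 * (sum_m m^2 * V(m) - N) / N^2,
    where V(m) is the number of types that occur exactly m times
    and N is the number of tokens."""
    n = len(tokens)
    type_freqs = Counter(tokens)             # type -> frequency
    spectrum = Counter(type_freqs.values())  # m -> V(m)
    s2 = sum(m * m * v_m for m, v_m in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


# Toy input; a real study would use a proper tokenizer and longer texts
sample = "the cat sat on the mat and the dog sat on the log".split()
print(round(yules_k(sample), 1))
```

Higher values of K indicate more repetition, that is, lower lexical diversity.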

____________________________
George K. Mikros
Associate Professor of Computational and Quantitative Linguistics
Department of Italian Language and Literature
School of Philosophy
National and Kapodistrian University of Athens
Panepistimioupoli Zografou, GR-15784
Athens, Greece
Tel: +30 210 7277491, +30 6976111742
Email: gmikros at isll.uoa.gr
Web: http://users.uoa.gr/~gmikros/


------------------------------

Message: 8
Date: Mon, 21 Nov 2011 00:00:58 +0100
From: Michael Piotrowski <mxp at cl.uzh.ch>
Subject: [Corpora-List] Call for Papers: Second Workshop on
        Computational Linguistics and Writing (CL&W 2012)
To: elsnet-list at elsnet.org, WPA-L at ASU.EDU, corpora at uib.no

Apologies if you receive multiple copies of this message.

Please distribute it to colleagues.

********************************************************************
CALL FOR PAPERS

Second Workshop on Computational Linguistics and Writing (CL&W 2012):
Linguistic and cognitive aspects of document creation and document
engineering

  Workshop at EACL 2012 http://eacl2012.org/

Web site: http://www.lingured.info/clw2012/

Workshop date: April 23 or 24, 2012

Location: Avignon, France

Submission deadline: January 27, 2012

*********************************************************************

Writing, whether professional, academic, or private, needs editors,
input tools and display devices, and involves the coordination of
cognitive, linguistic, and technical aspects.  Most texts composed in
the 21st century are probably created on electronic devices; people
compose texts in word processors, text editors, content management
systems, blogs, wikis, e-mail clients, and instant messaging
applications.  Texts are rendered and displayed on very small and very
large screens, they are meant to be read by experts and laypersons,
and they are supposed to be interactive and printable all at the same
time.

The production of documents has been researched from various
perspectives:

- Writing research has been concerned with text processing tools and
  cognitive processes since the 1970s.  The current rise of new
  writing environments and genres (e.g., blogging), as well as new
  possibilities to observe text production in the workplace, has
  prompted new studies in this area of research.

- Document engineering is concerned with aspects of rendering and
  displaying textual and other resources for the creation,
  maintenance, and management of documents.  Writers today use tools
  for layout design, collaborating with co-authors, and tracking
  changes in the production process with versioning systems---all of
  these are active research areas in document engineering.

- Computational linguistics has mostly been concerned with static or
  finished texts.  There is now a growing need to explore how
  computational linguistics can support human text production and
  interactive text processing.  Methods from natural language
  processing can also provide support for exploring data relevant for
  writing research (e.g., keystroke-logging data) and document
  engineering (e.g., tailoring documents to specific user needs).

CL&W 2010, held at NAACL 2010 in Los Angeles, was a successful
workshop, offering researchers from different but related disciplines
a platform for sharing findings and ideas.  This follow-on Workshop on
Computational Linguistics and Writing aims to bring together
researchers from the communities listed above to stimulate discussion
and cooperation between these areas of research.  We are interested in
research that explores writing processes, text production, and
document engineering principles as well as actual working systems that
support writers in one or more aspects when producing a document.

Submissions are invited which address questions like the following:

- How can the creation of texts and documents be supported by methods,
  resources, and tools from computational linguistics?  This includes
  NLP tools and techniques that can be used or have been used to
  support writing (e.g., grammar and style checking, document
  structuring, thematic segmentation, editing and revision aids).

- How can we get a better understanding of writing processes,
  strategies, and needs?  Which methods, resources, and tools from
  computational linguistics could support research in this area?

- How do high-level writing processes and the mechanics of writing
  relate to each other?

- How do writing tools influence composing?

- Is there a need for the development of new writing tools?  What can
  we learn from earlier approaches and tools like RUSKIN, Writer's
  Workbench or Augment, or from source code editors for programming
  languages?

- How can insights from writing research and methods from
  computational linguistics help writers with special needs?

- How can techniques from HCI research and psychology be used to gain
  new insights concerning the composing and writing process and to
  improve writing tools?

- How can methods and resources from computational linguistics help to
  scale from controlled lab experiments with only a few participants
  to workplace observation over a long period of time with dozens of
  writers?

- How can algorithms and methods from document engineering be used to
  support natural-language writing as the creation of content?

- How can aspects of document design be used for the development of
  (automatic) authoring aids or document processing?

Topics of interest for this workshop include, but are not limited to,
the following:

- Resources and tools to assist or support the creation of
  natural-language texts and documents

- Algorithms and techniques for authoring aids

- Supporting the authoring of multilingual, multimedia, and adaptive
  documents

- Interplay of cognitive processes, cognitive resources, and writing tools

- Observation of writing in natural settings and insights for
  improving authoring tools

- Experimental studies pertinent to writing tools

- User interface and HCI issues in current and future writing and
  document processing tools

- Predictive editing methods

- Authoring tools for less-resourced languages

- Evaluation of tools and resources

*Format of the Workshop*

We will have talks, posters, and a plenary discussion.  The plenary
discussion is intended to combine different perspectives, to identify
future directions for research, and to stimulate interdisciplinary
networking and cooperation between writing research, document
engineering, and computational linguistics.

*Submissions*

We invite researchers to submit full papers of up to 9 pages
(excluding references) or short papers of up to 4 pages (including
references).  These page limits must be strictly observed.
Submissions must be in English.

Reviewing of papers will be double-blind by the members of the program
committee, and all submissions will receive several independent
reviews.  Papers submitted at review stage must not contain the
authors' names, affiliations, or any information that may disclose the
authors' identity.  Furthermore, self-references that reveal the
author's identity, e.g., "We previously showed (Smith, 1991) ...",
should be avoided.  Instead, use citations such as "Smith previously
showed (Smith, 1991) ...".  Do not use anonymous citations.  Do not
include acknowledgments.  Papers that do not conform to these
requirements may be rejected without review.

Submission is electronic using the START submission system at:

  https://www.softconf.com/eacl2012/CLW2012/

Submissions must be uploaded to START by the submission deadline (see
below).

All submissions must be in PDF format.  Papers must follow the
two-column format of EACL 2012.  We strongly recommend the use of
the style files provided on the workshop Web site.  We reserve the
right to reject submissions that do not conform to these styles.

If you intend to submit your paper to several EACL 2012 workshops,
please contact the workshop chairs beforehand.

Authors of accepted papers will be invited to present their research
at the workshop.  Accepted papers will be published in the electronic
workshop proceedings.  The workshop proceedings will be part of the
EACL 2012 proceedings, published by ACL.

Full instructions for submissions and style files can be found on the
workshop Web site at http://lingured.info/clw2012/?Submissions.

*Date and Location*

Location: EACL 2012 in Avignon, France
Date: April 23 or 24, 2012

*Important Dates*

Deadline for submission: January 27, 2012
Notification of acceptance: February 24, 2012
Revised version of papers: March 9, 2012
Workshop: April 23 or 24, 2012

*Organizers*

Michael Piotrowski (University of Zurich, Switzerland), mxp at cl.uzh.ch
Cerstin Mahlow (University of Basel, Switzerland), cerstin.mahlow at unibas.ch
Robert Dale (Macquarie University, Australia), robert.dale at mq.edu.au

*Program Committee*

      * Gerd Bräuer (University of Education Freiburg, Germany)
      * Jill Burstein (ETS, USA)
      * Rickard Domeij (The Language Council of Sweden, Sweden)
      * Kevin Egan (University of Southern California, USA)
      * Caroline Hagège (Xerox Research Centre Europe, France)
      * Sofie Johansson Kokkinakis (University of Gothenburg, Sweden)
      * Ola Karlsson (The Language Council of Sweden, Sweden)
      * Ola Knutsson (Stockholm University, Sweden)
      * Eva Lindgren (Umeå University, Sweden)
      * Aurélien Max (LIMSI, France)
      * Guido Nottbusch (University of Potsdam, Germany)
      * Daniel Perrin (Zurich University of Applied Sciences, Switzerland)
      * Martin Reynaert (Tilburg University, The Netherlands)
      * Gert Rijlaarsdam (University of Amsterdam, The Netherlands)
      * Koenraad de Smedt (University of Bergen, Norway)
      * Eric Wehrli (University of Geneva, Switzerland)
      * Carl Whithaus (UC Davis, USA)
      * Michael Zock (CNRS, France)

*Further Information*

http://www.lingured.info/clw2012/

*Workshop Contact Address*

clw2012 at lingured.info

--
Dr.-Ing. Michael Piotrowski, M.A. <mxp at cl.uzh.ch>
Institute of Computational Linguistics, University of Zurich
Phone +41 44 63-54313 | OpenPGP public key ID 0x1614A044
* OUT NOW: Systems and Frameworks for Computational Morphology
* <http://www.springer.com/computer/ai/book/978-3-642-23137-7>

------------------------------

Message: 9
Date: Sun, 20 Nov 2011 22:15:58 -0500
From: Mike Maxwell <maxwell at umiacs.umd.edu>
Subject: [Corpora-List] Man bites dog
To: corpora <corpora at uib.no>

In LILT 6 (http://elanguage.net/journals/index.php/lilt/issue/current),
"Zipf's Law and l'Arbitraire du Signe," Martin Kay discusses statistical
MT, and says (p.22):

    Notice that a language model would, and should, guarantee
    that the French "homme mord chien" would be translated into
    English as "dog bites man", rather than "man bites dog",
    which is what it really means.

I once proposed this exact example (with Spanish rather than French) to
a computational linguist who knew more about MT than I do.  (People who
know more about MT than I do are quite common.  Ok, they're quite common
among computational linguists :-).)  That person suggested I needed to
learn more about MT.

It would be nice to find myself making the same mistake that Martin Kay
made.  It would be even nicer if it weren't a mistake.

Is Kay's claim correct?  The context is of course pure statistical MT,
not hybrid rule/statistical systems.  Assume that the pair "homme mord
chien"/"man bites dog" never occurs in the training data, but that the
reverse does (or at least that "dog bites man" appears on the English
side, presumably with some significant frequency).
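
To make the scenario concrete, here is a toy Python sketch of noisy-channel scoring (translation-model score plus language-model score). Everything in it is hypothetical: the tiny English corpus, the word translation probabilities, and above all the assumption that the translation model has no word-order component. It only illustrates how a language model could tip the balance under those assumptions; it does not settle whether a real statistical MT system behaves this way.

```python
import math
from collections import Counter

# Hypothetical English training sentences: "dog bites man" is frequent,
# "man bites dog" never occurs.
corpus = ["dog bites man"] * 5 + ["man walks dog", "dog sees man"]

bigrams, unigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(words[:-1])            # counts of bigram contexts
    bigrams.update(zip(words, words[1:]))
vocab_size = len(set(unigrams) | {"</s>"})


def lm_logprob(sentence):
    """Add-one smoothed bigram log-probability of an English sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(words, words[1:])
    )


# Toy word-for-word translation model with no word-order component,
# so both English orders receive the same translation score.
t_prob = {("homme", "man"): 0.9, ("mord", "bites"): 0.9, ("chien", "dog"): 0.9}


def tm_logprob(french, english):
    """Bag-of-words translation score: best match for each French word."""
    e_words = english.split()
    return sum(
        math.log(max(t_prob.get((f, e), 1e-6) for e in e_words))
        for f in french.split()
    )


def score(french, english):
    """Noisy-channel score: log P(f|e) + log P(e)."""
    return tm_logprob(french, english) + lm_logprob(english)


source = "homme mord chien"
for candidate in ("man bites dog", "dog bites man"):
    print(candidate, round(score(source, candidate), 3))
```

With an order-blind translation model the two candidates tie on the translation score, so the language model alone decides between them. A translation model that does encode word order or alignment could pull in the other direction.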
--
        Mike Maxwell
        maxwell at umiacs.umd.edu
        "My definition of an interesting universe is
        one that has the capacity to study itself."
         --Stephen Eastmond

------------------------------

Message: 10
Date: Sun, 20 Nov 2011 19:22:59 -0800
From: Mark Lybrand <mlybrand at gmail.com>
Subject: Re: [Corpora-List] Man bites dog
To: Mike Maxwell <maxwell at umiacs.umd.edu>
Cc: corpora <corpora at uib.no>

My French is rusty, but Spanish would disambiguate by prefixing the
accusative with a preposition:

Hombre muerde a perro. (Articles omitted to correspond better with the
example)

Mark

------------------------------

Message: 11
Date: Sun, 20 Nov 2011 22:45:47 -0500
From: Mike Maxwell <maxwell at umiacs.umd.edu>
Subject: Re: [Corpora-List] Man bites dog
To: Mark Lybrand <mlybrand at gmail.com>
Cc: corpora <corpora at uib.no>

On 11/20/2011 10:22 PM, Mark Lybrand wrote:
> My French is rusty, but Spanish would disambiguate by prefixing
> the accusative with a preposition:
> Hombre muerde a perro.

I may not remember what I actually said (and possibly I didn't even use
another language); and this is a little beside the point.  But--I'm not
sure the Spanish would necessarily use the 'a' for an animal.  The
examples that are usually given where the 'a' marker is needed are
almost always with humans, not animals.  Aissen ("Differential object
marking: Iconicity vs. economy", fn 24) writes:
     In Spanish, object marking is optional for animate
     (non-human) definites and for human indefinites.
At any rate, this is supposed to be headline language, so articles are
omitted, and it seems plausible that 'a' could be omitted too.  But I
don't want to get too deeply into these questions, as my real question
was about statistical MT.
--
        Mike Maxwell
        maxwell at umiacs.umd.edu
        "My definition of an interesting universe is
        one that has the capacity to study itself."
         --Stephen Eastmond

------------------------------

Message: 12
Date: Sun, 20 Nov 2011 20:19:02 -0800
From: Mark Lybrand <mlybrand at gmail.com>
Subject: [Corpora-List] Language Acquisition
To: corpora <corpora at uib.no>

Okay, so this is probably not a "corpora" issue.  Forgive me please, as I
am an NLP piker.  The question that is plaguing me right now is whether
there is any research on using AI to mimic language acquisition.  Rather,
have there been attempts to create a rational agent that uses typical human
strategies to learn a new language?  It would seem that such an approach
could be helpful in creating assistive technologies for learners of a
foreign language.  Can you guys steer me in the right direction?

Thanks. Feel free to just ignore me altogether if this is completely OT.

--
Mark :)

------------------------------

Message: 13
Date: Sun, 20 Nov 2011 22:43:54 -0500
From: "David L. Hoover" <david.hoover at nyu.edu>
Subject: Re: [Corpora-List] Significance test for TTR
To: corpora at uib.no

Dear Chris,

George has given a good explanation of some of the problems. A much more
severe problem is that lexical diversity/vocabulary richness is simply
not a very reliable statistic for differentiating texts/authors.
Although Tweedie and Baayen conclude that it can be used with caution,
my own research has shown that lexical diversity shows extreme
fluctuation within the works of a single author and even between
different sections of the same text. Perhaps there might be a more
systematic and reliable difference between text types than between
authors or texts, but lexical diversity is so variable that even this
doesn't seem very likely. For more detail, see my
"Another Perspective on Vocabulary Richness." Computers and the
Humanities, 37(2), 2003: 151-78.
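
One simple way to look at the within-text fluctuation described above is to compute TTR over consecutive equal-size windows of a single text and inspect the spread. The Python sketch below is only an illustration with a made-up toy text and a hypothetical function name; it is not the method used in the cited article.

```python
import statistics


def window_ttrs(tokens, window_size):
    """TTR of each consecutive, non-overlapping window of window_size tokens.
    A trailing partial window is dropped so all windows have equal length."""
    return [
        len(set(tokens[i:i + window_size])) / window_size
        for i in range(0, len(tokens) - window_size + 1, window_size)
    ]


# Toy "text"; in practice the tokens would come from a real tokenizer
text = ("the cat sat on the mat and the dog sat on the log "
        "a quick brown fox jumps over one lazy sleeping hound today").split()
values = window_ttrs(text, 6)
print("per-window TTR:", [round(v, 2) for v in values])
print("range:", round(max(values) - min(values), 2),
      "stdev:", round(statistics.stdev(values), 3))
```

A large spread across windows of the same text would suggest that a difference between two single TTR values should be read with caution.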

Best,
David Hoover


------------------------------

Message: 14
Date: Mon, 21 Nov 2011 14:27:53 +0800
From: David Wible <wible at stringnet.org>
Subject: Re: [Corpora-List] Language Acquisition
To: Mark Lybrand <mlybrand at gmail.com>
Cc: corpora <corpora at uib.no>

Mark mentions the possibility of AI-spawned rational agents contributing to
the design of language learning technologies. Let me say something about
that, at least wrt second language acquisition/learning. (Sorry this is
ignoring Mark's original question. And what follows is not at all aimed at
Mark.)

To me, the current area in which to look for breakthroughs in language
acquisition research (at least the kind that hopes to be relevant to
educators in trying to foster language acquisition) is research that
*enriches* the scope of contextual factors that matter to situated, human
learners, not research that further decontextualizes its 'models' of
learning (scrubbing them clean of the messy variables that muck up the
research design). In my experience, the most lamentable aspect of efforts
to create 'assistive technologies for learners' is the development of such
technologies in sanitized, lab-like conditions (or in the sanitized
conditions of certain technologists' own minds) without the benefit of any
front-line classroom experience with living, breathing learners or teachers
(or parents or administrators). I have spent a hefty chunk of my academic
life trying to develop technologies that assist in language learning, so
I'm all for it. But my own quirky main conclusion from all these years is
that the stuff which is designed and made in 'idealized' conditions is
often hopelessly detached from what would take hold in actual learning
ecologies, and because of that, it won't 'scale up' beyond the stage of lab
toys. What portion of teachers who allow their students to be used as
subjects in testing out these technologies are glad when it's over and,
short of coercion, would never touch the stuff again? R&D efforts in
language learning technology need, from the earliest stages, more
'anthropologists' and 'ethnographers' and teachers from the 'trenches'
where the technologies are hoping to make a contribution, not more
decontextualized, sanitized models of language acquisition.

Maybe there will be a day when AI's rational agents can feel peer pressure,
can feel 'face' and loss of 'face', the urge to be a member of a social
group, a day when an AI rational agent draws its very identity from the
'culture' it 'belongs to', (or, for that matter, can feel an identity of
any sort) and can 'feel' the high 'personal' (robotic?) stakes of stepping
out of that cultural identity to risk entry into a different one, risk
being rejected, experience being excluded or admitted to that 'speech
community' based on one's competence in using another language. (To me,
these human attributes are central rather than peripheral to explanations
of (2nd) language learning.) When that day comes, when AI's rational agents
can be designed with those attributes, then I'll be the first to want them
in my R&D team developing language learning technologies. Until then, where
are the anthropologists (and where are the ... (fill in the blank; who else
do we need to join in our efforts?))?

Sorry to ramble.

David Wible,
National Central University
Taiwan


----------------------------------------------------------------------
Send Corpora mailing list submissions to
        corpora at uib.no

To subscribe or unsubscribe via the World Wide Web, visit
        http://mailman.uib.no/listinfo/corpora
or, via email, send a message with subject or body 'help' to
        corpora-request at uib.no

You can reach the person managing the list at
        corpora-owner at uib.no

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Corpora digest..."

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

End of Corpora Digest, Vol 53, Issue 24
***************************************