25.4689, Software: Computational Linguistics; Morphology; Text/Corpus Linguistics: types2: Type and Hapax Accumulation Curves

The LINGUIST List via LINGUIST linguist at listserv.linguistlist.org
Fri Nov 21 17:30:25 UTC 2014


LINGUIST List: Vol-25-4689. Fri Nov 21 2014. ISSN: 1069 - 4875.

Moderators: Damir Cavar, Indiana U <damir at linguistlist.org>
            Malgorzata E. Cavar, Indiana U <gosia at linguistlist.org>

Reviews: reviews at linguistlist.org
Anthony Aristar <aristar at linguistlist.org>
Helen Aristar-Dry <hdry at linguistlist.org>
Sara Couture, Indiana U <sara at linguistlist.org>

Homepage: http://linguistlist.org

Editor for this issue: Andrew Lamont <alamont at linguistlist.org>
================================================================


Date: Fri, 21 Nov 2014 12:30:15
From: Tanja Säily [tanja.saily at helsinki.fi]
Subject: Computational Linguistics; Morphology; Text/Corpus Linguistics: types2: Type and Hapax Accumulation Curves

types2 is a free tool for visualizing and assessing the statistical
significance of differences in word frequencies across corpora and other data
sets. It is especially useful for analysing variation in the frequencies of
types and hapax legomena, which are common measures of morphological
productivity and lexical diversity. The previous version, types1, was
introduced in 2009; the new version facilitates comparisons through
interactive visualization and adjusts the significance for multiple hypothesis
testing.

The software can analyse data sets from the perspective of the following
statistics:
- number of words: the total number of running words in the text corpus
- number of tokens: the words of interest in our study
- number of types: how many distinct tokens we have seen
- number of hapaxes: how many types have occurred only once
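
As a rough illustration of the four statistics listed above, here is a minimal
Python sketch. It is not taken from types2: the function name
count_statistics, the is_of_interest predicate and the toy sentence are all
hypothetical, and the "-ness" suffix merely stands in for whatever defines the
words of interest in a given study.

```python
from collections import Counter

def count_statistics(words, is_of_interest):
    """Count words, tokens, types and hapaxes in a list of running words.

    `is_of_interest` is a hypothetical predicate that selects the tokens
    under study (for example, words formed with a particular suffix).
    """
    tokens = [w for w in words if is_of_interest(w)]
    freq = Counter(tokens)  # frequency of each distinct token (type)
    return {
        "words": len(words),    # all running words
        "tokens": len(tokens),  # words of interest
        "types": len(freq),     # distinct tokens
        "hapaxes": sum(1 for c in freq.values() if c == 1),  # types seen once
    }

# Toy data, invented for this example only.
text = "the happiness and the sadness of darkness beat the happiness of light".split()
stats = count_statistics(text, lambda w: w.endswith("ness"))
print(stats)
```

In this toy sentence "happiness" occurs twice, so it counts as one type but
not as a hapax, whereas "sadness" and "darkness" are both hapaxes.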

The tool can be employed for visualization, statistical hypothesis testing,
and exploratory data analysis. To enhance the reliability of the results, it
uses robust, nonparametric statistics (more specifically, Monte Carlo
permutation tests). The only modelling assumption is that, under the null
hypothesis, individual "samples" are exchangeable.
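
The general idea of a Monte Carlo permutation test over exchangeable samples
can be sketched as follows. This is a simplified illustration, not the types2
implementation: the names type_count and permutation_p_value are hypothetical,
each sample is assumed to be a list of tokens, and random subsets are matched
to the observed subcorpus by number of samples rather than by number of
running words.

```python
import random

def type_count(samples):
    """Number of distinct tokens across a collection of samples."""
    return len({tok for sample in samples for tok in sample})

def permutation_p_value(samples, subset_indices, n_iter=10000, seed=0):
    """Estimate how often a random set of samples has at most as many
    types as the observed subcorpus (one-sided, lower tail).

    Assumption: under the null hypothesis the samples are exchangeable,
    so any subset of the same size is equally likely.
    """
    rng = random.Random(seed)
    k = len(subset_indices)
    observed = type_count([samples[i] for i in subset_indices])
    at_most = 0
    for _ in range(n_iter):
        drawn = rng.sample(samples, k)  # random subset of k samples
        if type_count(drawn) <= observed:
            at_most += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (at_most + 1) / (n_iter + 1)

# Toy data, invented for this example only: six samples of tokens.
samples = [["a", "b"], ["a", "c"], ["a"], ["a"], ["d", "e"], ["f"]]
observed, p = permutation_p_value(samples, subset_indices=[2, 3])
print(observed, p)
```

A small p-value here would suggest that the chosen subcorpus has unusually few
types compared with random subcorpora of the same size; counting draws with at
least as many types instead would test the upper tail.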

The software was written by Jukka Suomela, and the system was designed and
developed in collaboration with Tanja Säily. It has been tested on Windows,
Macintosh and Unix platforms. The output is provided in three formats: web
pages, PDF images and raw statistics in a database. The software is freely
available at http://users.ics.aalto.fi/suomela/types2/ and
http://dx.doi.org/10.5281/zenodo.9868


Linguistic Field(s): Computational Linguistics
                     Morphology
                     Text/Corpus Linguistics





