20.297, Diss: Lexicography/Semantics/Text/Corpus Ling: Arppe: 'Univariate, ...'

Fri Jan 30 22:38:45 UTC 2009

LINGUIST List: Vol-20-297. Fri Jan 30 2009. ISSN: 1068 - 4875.

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews: Randall Eggert, U of Utah  
       <reviews at linguistlist.org> 

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, 
and donations from subscribers and publishers.

Editor for this issue: Evelyn Richter <evelyn at linguistlist.org>
================================================================  

===========================Directory==============================  

1)
Date: 30-Jan-2009
From: Antti Arppe < antti.arppe at helsinki.fi >
Subject: Univariate, Bivariate, and Multivariate Methods in Corpus-Based Lexicography: A study of synonymy

-------------------------Message 1 ---------------------------------- 
Date: Fri, 30 Jan 2009 17:36:55
From: Antti Arppe [antti.arppe at helsinki.fi]
Subject: Univariate, Bivariate, and Multivariate Methods in Corpus-Based Lexicography: A study of synonymy

Institution: University of Helsinki 
Program: Department of General Linguistics 
Dissertation Status: Completed 
Degree Date: 2008 

Author: Antti Arppe

Dissertation Title: Univariate, Bivariate, and Multivariate Methods in
Corpus-Based Lexicography: A study of synonymy 

Dissertation URL:  http://urn.fi/URN:ISBN:978-952-10-5175-3

Linguistic Field(s): Lexicography
                     Semantics
                     Text/Corpus Linguistics

Subject Language(s): Finnish (fin)

Dissertation Director(s):
Fred Karlsson
Lauri Carlson
Martti Vainio
Juhani Järvikivi
Urho Määttä

Dissertation Abstract:

In this dissertation, I present an overall methodological framework for
studying linguistic alternations, focusing specifically on lexical
variation in denoting a single meaning, that is, synonymy. As the practical
example, I employ the synonymous set of the four most common Finnish verbs
denoting THINK, namely ajatella, miettiä, pohtia and harkita 'think,
reflect, ponder, consider'. As a continuation of previous work, I describe
in considerable detail the extension of statistical methods from
dichotomous linguistic settings (e.g., Gries 2003; Bresnan et al. 2007) to
polytomous ones, that is, settings with more than two possible alternative
outcomes.

The applied statistical methods are arranged into a succession of stages
of increasing complexity, proceeding from univariate via bivariate to
multivariate techniques. As the central multivariate method, I
argue for the use of polytomous logistic regression and demonstrate its
practical application to the studied phenomenon, thus extending the work
by Bresnan et al. (2007), who applied simple (binary) logistic regression
to a dichotomous structural alternation in English.

The results of the various statistical analyses confirm that a wide range
of contextual features across different categories are indeed associated
with the use and selection of the studied THINK lexemes; however, a
substantial part of these features are not exemplified in current Finnish
lexicographical descriptions. The multivariate analysis results indicate
that the semantic classifications of syntactic argument types are on
average the most distinctive feature category, followed by overall semantic
characterizations of the verb chains, and then syntactic argument types
alone, with morphological features pertaining to the verb chain and
extra-linguistic features relegated to the last position.

In terms of overall performance of the multivariate analysis and modeling,
the prediction accuracy seems to reach a ceiling at a Recall rate of
roughly two-thirds of the sentences in the research corpus. The analysis of
these results suggests a limit to what can be explained and determined
within the immediate sentential context when applying the conventional
descriptive and analytical apparatus based on currently available
linguistic theories and models.

The results also support Bresnan's (2007) and others' (e.g., Bod et al.
2003) probabilistic view of the relationship between linguistic usage and
the underlying linguistic system, in which only a minority of linguistic
choices are categorical, given the known context - represented as a feature
cluster - that can be analytically grasped and identified. Instead, most
contexts exhibit degrees of variation as to their outcomes, resulting in
proportionate choices over longer stretches of usage in texts or speech. 
