[Corpora-List] BAAL Corpus Linguistics Special Interest Group - April 28th

K.A.OHalloran K.A.Ohalloran at open.ac.uk
Thu Apr 13 18:00:37 UTC 2006


 
Dear all

Here are the programme and abstracts for the annual meeting of the BAAL (British Association for Applied Linguistics) Corpus Linguistics Special Interest Group, to be held at the Open University, Milton Keynes, UK, on April 28th 2006.

 

Topic: 'Text analysis using corpora - methodological issues'

 

Programme

10.00 Introduction (Kieran O'Halloran)
10.15 - 11.15 Susan Hunston (Guest Speaker) Text and Intertextuality: Debating the Issues
11.15 - 12.15 Guy Cook (Guest Speaker) "It just says 'could'. Yes I just spotted that." Corpus facts in discourse analysis
12.15 - 12.45 Bettina Starcke, University of Trier Corpus Linguistic Evidence and Criteria for its Evaluation

Lunch 

Problems with Data Identification 

1.45 - 2.15 Lynne Cameron and Alice Deignan, University of Leeds Emergentism, metaphor and text analysis
2.15 - 2.45 David Oakey, University of Birmingham Phraseology beyond the bundle: finding a way in
2.45 - 3.15 Duncan Hunter and Richard Smith, University of Warwick Identifying keywords and charting their development: Methodological issues in corpus-based historical research

Coffee / Tea 

Issues around Spoken Data 

3.45 - 4.15 Svenja Adolphs and Dawn Knight, Nottingham University Analysing spoken corpora: methodological issues and technological challenges 
4.15 - 4.45 Nuria Hernandez, Freiburg University Dialect corpora and orthographic dialect transcriptions - some methodological considerations 

4.45 - 5.00 Summary and final discussion 

 

Abstracts 

Text and Intertextuality: Debating the Issues

Susan Hunston, University of Birmingham 

This paper will tackle three main issues. First, when corpus investigation techniques are used in the service of other disciplines, whose research questions take precedence? This issue will be discussed largely in the context of corpora used in literary studies, where two different methodological approaches will be compared. Secondly, possible objections to the assumptions underlying corpus investigations (of the type often described as corpus-driven) are explored. The question asked is: to what extent does a focus on intertext make such methodologies inevitable? The final and most extensive part of the paper will report my own efforts to integrate text and corpus approaches to the study of evaluation. This will comment on quantitative and qualitative techniques and consider their application to a comparison between three texts on the topic of avian flu. 

________________________________

"It just says 'could'. Yes I just spotted that." Corpus facts in discourse analysis

Guy Cook, The Open University 

Corpus linguistics often presents itself as a replacement for other kinds of analysis. It is claimed in particular that the "facts" of a corpus analysis are superior to those produced by intuition. This talk seeks to find a middle way between the two poles inherent in this opposition. It advances a constructive critique of some accepted corpus wisdom. It argues that:

	*	the objectification of "actual" rather than idealised language should not be bought at the cost of idealising language users
	*	for use in discourse analysis, corpus analysis needs a theory and evidence of whether its facts are consciously or subconsciously noticed by actual language users
	*	certain discourse qualities such as eloquence and salience are beyond the reach of corpus analysis
	*	intuition remains the basis of key operations in corpus analysis, including corpus construction, and the interpretation of semantic prosody and keyword lists.

The argument does not, however, dismiss corpus analysis. On the contrary, it acknowledges the invaluable insights it allows. But it is suggested that corpus analysis is strongest when it presents itself as a component rather than the totality of discourse analysis, and works in conjunction with investigation of what is salient and valuable to actual users.

The argument is illustrated with reference to the speaker's use of corpus analysis in four research projects on the language of controversies over food politics: one on food labels, two on the GM food debate, and one on organic food marketing. In each of these projects, analysis of a corpus of language use was combined with intensive analysis of short texts, and with interview and focus-group data, to produce a rounded view of how specific linguistic choices reflect the values and strategies of real writers, and affect the views and behaviour of real readers. 

________________________________

Corpus Linguistic Evidence and Criteria for its Evaluation 

Bettina Starcke, University of Trier 

Whether corpus linguistics generates objective linguistic evidence is a central question in the evaluation of corpus linguistic analyses. Arguments in favour are that corpus linguists use corpora or texts that are not subject to change once the analyses have started, and that the use of software contributes to the objectivity of the analyses. On the other hand, the choice of corpus is subject to the analyst's personal or professional interests, and the software and its settings are selected by the researcher. Finally, the interpretation of the data generated by the software is a subjective process. Objective and subjective features in a corpus linguistic analysis are therefore interdependent.

This means that in order to evaluate the reliability of an analysis, we need fixed criteria to test its scientific rigour and the validity of the conclusions drawn from the data. The four criteria I suggest for this process are growth of knowledge, replicability, checkability and innovation.

The question of whether an analysis enhances our knowledge with regard to the original research purpose is essential. Its answer and evaluation should take into account the probabilistic and comparative nature of corpus linguistic evidence. 

To allow for the replication of an analysis, documentation of the research process is required. This includes a description (or the inclusion) of the data and the software with its settings used for the analysis in the report on the research. Documenting these parameters allows other researchers to identify decisions taken in the original research and to question them. And, more importantly, the resulting transparency facilitates a better understanding of the purpose and reasoning of the research. 

Checkability expands replicability as described above. In addition to facilitating a better understanding of the original research, it also requires researchers to make their analysis transparent to an extent that enables others to test the techniques and hypotheses on different data, software, theoretical premises etc. This allows for an evaluation as to whether the results from the source study can be generalized and hold if checked with different data. Again, the probabilistic nature of corpus analyses has to be considered.

Asking whether a linguistic study is innovative and brings new insights into the field of study is the fourth criterion. This entails the question of whether the choice of method was appropriate and whether the evidence generated is the best possible evidence.

________________________________

Emergentism, metaphor and text analysis

Lynne Cameron and Alice Deignan, University of Leeds 

Patterns of metaphor use found in a small, hand-searchable corpus of transcribed talk were subjected to further investigation in a large computerized corpus, using the now well-established technique of combining small and large corpora (for example, Cameron & Deignan 2003). In this study, the small corpus consists of conversations between the daughter of a man killed by a bomb planted by the Irish Republican Army, and the perpetrator of the bombing. The large corpus was the spoken section of the Bank of English. Close analysis of metaphor in the small corpus reveals a number of semi-fixed multi-word expressions with non-literal meaning. These expressions - for example, walk away from - are not 'idioms' and yet are idiomatic in some sense; they are not fixed and yet show some levels of fixedness; they are not completely predictable in use and yet are far from random; they are not clearly metaphorical, often being metonymic or ambiguous. As such, they present problems for analysis at both a formal and semantic level. It appears that there is a bundle of linguistic, semantic, pragmatic and affective patterns of use that constrain metaphorical production and interpretation of an expression like walk away from. An analysis of the concordance for walk* away from in the larger corpus confirmed this. It also suggested a close link between the exact linguistic form of walk away from, and its semantic and pragmatic features.

Corpus researchers have found that expressions like these form a sizeable section of many concordances, and yet they are often left to one side because there is no way to categorise or account for them. Our talk offers a first attempt at such an account. We adopt an emergentist perspective (MacWhinney 1999; Larsen-Freeman & Cameron, in press). This sees human systems, including language, as complex dynamic systems, and the language repertoires of individuals and social groups as emergent phenomena, resulting from processes of adaptation and change over time. We offer the term 'metaphoreme' as a descriptor for the pattern bundles found in our corpora, suggesting how metaphoremes emerge from, and contribute to, language use. 
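
For readers unfamiliar with concordancing, the kind of search mentioned in this abstract (walk* away from) can be sketched roughly as follows. This is only an illustration, not the authors' procedure: the function name, the reading of walk* as "walk plus any word characters", and the sample sentences are all assumptions made up for the example, and the Bank of English is of course queried with its own tools.

```python
import re

def concordance(text, pattern, width=30):
    """Return keyword-in-context lines for every regex match in text."""
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the node phrase lines up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Assumption: walk* covers walk, walks, walked, walking.
WALK_AWAY_FROM = r"\bwalk\w*\s+away\s+from\b"

# Invented sample text, not taken from either corpus.
sample = (
    "She could not just walk away from it. "
    "He walked away from the table, and nobody walks away from that easily."
)

for line in concordance(sample, WALK_AWAY_FROM):
    print(line)
```

Sorting the resulting lines by the words to the right or left of the node phrase is the usual next step when looking for the pattern bundles the abstract describes.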

________________________________

Phraseology beyond the bundle: finding a way in

David Oakey, University of Birmingham 

This is a talk about my work in progress, looking at problems in identifying discontinuous and semi-fixed phrases in corpora against the background of previous work on collocation (cf. Sinclair), fixed expressions and idioms (cf. Moon), (cf. Nattinger and DeCarrico), metadiscourse (cf. Hyland) and lexical bundles (cf. Biber et al.). I describe the methodology by which a commonly-used phraseological item was identified from a 40 million-word comparative corpus of research articles from eight academic disciplines. I then look in detail into the textual environments of this particular phraseological item and present examples of the variations in its use across the disciplines. This leads to some remarks about how quantitative, lexical, semantic, syntactic, and pragmatic features need to be taken into account when making statements on the nature of phrase boundaries in academic discourse. Audience feedback would be much appreciated.

________________________________

Identifying keywords and charting their development: Methodological issues in corpus-based historical research

Duncan Hunter and Richard Smith, University of Warwick 

The focus of this paper will be a discussion of methodological issues surrounding the investigation of keywords within corpora, in particular in the context of historical research into the discursive construction of particular academic or professional communities. We begin by describing two different approaches to the investigation of keywords using corpus tools. The first, developed by Stubbs (1996), initially identifies a set of keywords through a process of intuition, and then applies concordancing techniques and statistical procedures to describe features of their collocation. The second, referred to notably by Fairclough in his (2000) study of the language of New Labour, deploys corpus tools during the initial stage of keyword selection, based on their frequencies relative to a larger corpus. Further analysis of the terms' collocation and semantic prosody is then carried out using techniques similar to those of Stubbs. 

We will suggest that, methodologically, Fairclough's approach to the selection of keywords has the advantage of being more empirically reliable, in that the selection of keywords is related to evidence of frequency within the corpus. We shall also demonstrate the results of some preliminary corpus investigation, following the model of Fairclough's research. However, we shall also discuss the drawbacks of a purely statistical approach to the selection of keywords, showing that there may be advantages to combining statistical analysis with more intuitive methods. 

Although the main focus of our paper is on appropriate procedures for corpus-based selection of keywords, we also wish to touch on some of the specific requirements and benefits of historical corpus-based research, considering, for example, methodological issues relating to the identification of keywords at different points in time, and the potential advantages of tracking keywords diachronically in enabling in-depth understanding of the evolution of a particular discipline or profession. 

References

Fairclough, Norman (2000). New Labour, New Language? London: Routledge. 

Stubbs, Michael (1996). Text and Corpus Analysis: Computer-Assisted Studies of Language and Culture. Oxford: Blackwell.
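
As a rough illustration of selecting keywords "based on their frequencies relative to a larger corpus", here is a minimal sketch. The abstract does not say which statistic is used; the log-likelihood score below is one commonly used keyness measure and is an assumption, as are the function names and the toy token lists.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score.

    a, b: frequency of the word in the study and reference corpus
    c, d: total number of tokens in the study and reference corpus
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study_tokens, reference_tokens, top=5):
    """Rank words that are relatively more frequent in the study corpus."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]
        if a / c > b / d:  # keep positive keywords only
            scored.append((log_likelihood(a, b, c, d), word))
    scored.sort(reverse=True)
    return [(word, round(score, 2)) for score, word in scored[:top]]

# Invented toy data, far too small for real keyness analysis.
study = "the method of the new method for the teaching method".split()
reference = "the cat sat on the mat and the dog sat on the rug".split()
print(keywords(study, reference))
```

A purely statistical ranking like this one says nothing about whether a word matters to the community under study, which is where the more intuitive selection the authors discuss would come in.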

________________________________

Analysing spoken corpora: methodological issues and technological challenges 

Svenja Adolphs and Dawn Knight, Nottingham University

The difficulties associated with the development of spoken corpora large enough to yield stable analytical results have meant that much of corpus linguistics has focused on the analysis of written discourse. However, alongside the large-scale studies of lexico-grammar on the basis of mainly written corpora, there has been a consistent effort in the exploration of spoken discourse using a corpus-based approach. Spoken corpora provide a particularly valuable resource for both quantitative and qualitative types of analysis of specific pragmatic functions. As such they can help in the re-evaluation of claims and concepts that originate in more philosophical traditions where the conceptualisation of pragmatic functions has arguably received most attention. 

However, one of the key differences between written and spoken corpus analysis is that current spoken corpora tend to be mediated records, textual renderings of events which are multi-modal in nature, and thus capture only a limited and limiting aspect of the reality of that event. As a result, analyses of pragmatic functions in spoken corpora tend to exclude the exploration of the interplay between gesture and language and therefore neglect a core element in the construction of meaning in interaction.

This presentation reports on the development of a multi-modal spoken corpus at the University of Nottingham and explores the implications of a multi-modal corpus analysis for our understanding of pragmatic categories. Using as an example the category of active listenership in conversation, the presentation focuses on the way in which corpus-based descriptions of functional categories might be affected by the systematic exploration of a multi-modal resource. Technological and methodological issues with regard to data capture and representation will be discussed alongside possible areas of application within the field of applied linguistics. 

________________________________

Dialect corpora and orthographic dialect transcriptions - some methodological considerations

Nuria Hernandez, Freiburg University 

This paper elaborates on some practical and theoretical issues that might be encountered when working with dialect data. Based on experiences with FRED, a corpus of English dialects recently compiled at Freiburg University, I will consider some practical problems concerning the searchability of dialect transcripts as well as general restrictions on linguistic claims based on dialect corpora.

FRED is currently one of the largest databases for English dialects, with 300 hours of speech recordings and over 2.5 million words of corresponding transcripts. It consists of casual oral history interviews with native speakers from all over the British Isles. According to age and other social factors, these informants qualify as traditional dialect speakers. With its 370 interviews from 9 major dialect areas (including Wales, Scotland and the Hebrides) FRED is a valuable database for diatopic variation. Nevertheless, it represents but a section of possible varieties of English and its significance for linguistic generalisations is therefore restricted. As is the case with other dialect corpora, researchers might have to reconsider the data at hand and decide where to place them on a standard - non-standard continuum. Different accounts of the same data may vary, depending on the definition of terms like 'dialect', 'spoken standard', etc. and the linguist's estimation of potentially influencing factors such as the interview situation. Depending on the phenomenon under investigation, transcription guidelines may complicate a classification. FRED, which was collected for morpho-syntactic research purposes, consists of easy-to-read orthographic transcripts that were partly standardized, and nonstandard pronunciations such as h-dropping are not always reproduced. However, pronunciation variants do occur, and we need to know them before being able to search and analyse them.

My aim is to draw attention to the necessity (i) of clearly establishing the type of variety on which any linguistic study is based as well as its distance/proximity to a previously defined standard and (ii) of paying special attention to the degree of standardization that the data might have undergone from speech recording to transcript. I will propose that, for orthographically transcribed dialect corpora like FRED, a consistent inline annotation scheme (myself [miself]) comprising both the dialect variant and the standard form presents an optimal solution, preserving readability and searchability as well as rendering a more adequate picture of the amount of variation found in the corpus.
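
To make the idea of an inline annotation scheme more concrete, the following is a small sketch of how such markup could be searched. It assumes, going only by the single example myself [miself], that the standard form comes first and the dialect variant follows in square brackets; the actual FRED conventions may differ. The function names and the sample line (including the variant "hoam") are invented for illustration.

```python
import re

# Assumed markup: standard form, a space, then the dialect variant in brackets.
ANNOTATED = re.compile(r"(\w+) \[(\w+)\]")

def parse(transcript):
    """Return (standard, variant) pairs for each annotated token."""
    return ANNOTATED.findall(transcript)

def standard_reading(transcript):
    """Drop the bracketed variants, keeping the readable standard forms."""
    return ANNOTATED.sub(r"\1", transcript)

def dialect_reading(transcript):
    """Replace each standard form with its bracketed dialect variant."""
    return ANNOTATED.sub(r"\2", transcript)

# Invented sample line, not taken from FRED.
line = "I done it myself [miself] and he come home [hoam] after"

print(parse(line))
print(standard_reading(line))
print(dialect_reading(line))
```

With both forms present, a search on the standard spelling finds every occurrence of the word, while the bracketed variants can still be counted separately. Variants containing apostrophes or hyphens would need a wider pattern than \w+.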

 

More information on the day can be found at:

http://corpus-sig-baal.org.uk/ou_sem06.htm

 

or through contacting: 

	Lia Blaj ( L.L.Blaj at open.ac.uk ) 
	Institute of Educational Technology, 
	Room 199, Geoffrey Crowther Building, 
	The Open University, 
	Milton Keynes MK7 6AA
	UK

 

The local organisers are: Dr Kieran O'Halloran, Dr Caroline Coffin (Centre for Language and Communications, The Open University), Lia Blaj (Institute of Educational Technology, The Open University) in concert with Dr Paul Thompson (University of Reading), the Corpus Linguistics SIG convenor. 

 
Regards...Kieran
 
Dr Kieran O'Halloran
Centre for Language and Communications
Faculty of Education and Language Studies
Open University
Walton Hall
Milton Keynes
MK7 6AA
UK


