11.2537, Review: Botley et al: Multilingual Corpora

The LINGUIST Network linguist at linguistlist.org
Sat Nov 25 05:52:06 UTC 2000


LINGUIST List:  Vol-11-2537. Sat Nov 25 2000. ISSN: 1068-4875.

Subject: 11.2537, Review: Botley et al: Multilingual Corpora

Moderators: Anthony Aristar, Wayne State U.<aristar at linguistlist.org>
            Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>
            Andrew Carnie, U. of Arizona <carnie at linguistlist.org>

Reviews: Andrew Carnie: U. of Arizona <carnie at linguistlist.org>

Editors: Karen Milligan, Wayne State U. <karen at linguistlist.org>
         Michael Appleby, E. Michigan U. <michael at linguistlist.org>
         Rob Beltz, E. Michigan U. <rob at linguistlist.org>
         Lydia Grebenyova, E. Michigan U. <lydia at linguistlist.org>
         Jody Huellmantel, Wayne State U. <jody at linguistlist.org>
         Marie Klopfenstein, Wayne State U. <marie at linguistlist.org>
         Naomi Ogasawara, E. Michigan U. <naomi at linguistlist.org>
         James Yuells, Wayne State U. <james at linguistlist.org>
         Ljuba Veselinova, Stockholm U. <ljuba at linguistlist.org>

Software: John Remmers, E. Michigan U. <remmers at emunix.emich.edu>
          Gayathri Sriram, E. Michigan U. <gayatri at linguistlist.org>

Home Page:  http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.


Editor for this issue: Andrew Carnie <carnie at linguistlist.org>
 ==========================================================================

What follows is another discussion note contributed to our Book Discussion
Forum.  We expect these discussions to be informal and interactive; and
the author of the book discussed is cordially invited to join in.

If you are interested in leading a book discussion, look for books
announced on LINGUIST as "available for discussion."  (This means that
the publisher has sent us a review copy.)  Then contact Andrew Carnie at
     carnie at linguistlist.org

=================================Directory=================================

1)
Date:  Wed, 8 Nov 2000 15:31:29 +0530 (IST)
From:  Niladri Sekhar Dash <niladri at www.isical.ac.in>
Subject:  Reply

-------------------------------- Message 1 -------------------------------



Simon Philip Botley, Anthony Mark McEnery and Andrew Wilson
(2000) Multilingual Corpora in Teaching and Research. Rodopi: Amsterdam -
Atlanta. Binding:  Paperback. ISBN: 90-420-0551-3
(bound). Pages: 220. Price - ???

Reviewed by Niladri Sekhar Dash, Indian Statistical Institute, Calcutta,
India.

Synopsis

	The publication of this volume signals the maturity of corpus
linguistics as a sub-area within linguistics and draws attention to the
multiple applications of corpora in research and education. It is an
important addition to other volumes on corpus linguistics by Aarts and
Meijs (1984, 1986), Sinclair (1991), Svartvik (ed.) (1992), Barnbrook
(1996), McEnery and Wilson (1996), Kennedy (1998), Biber et al. (1998),
Ooi (1998) and many others.

	Though most books on corpus linguistics deal with different
aspects of corpus design and development, types of corpora, annotation
schemes, corpus tools, and general applications of corpora, some are
narrowly focused on a specific goal or application. Among these, the
volume of Boguraev and Pustejovsky (1996) deals with corpus processing for
lexical acquisition, that of Ooi (1998) investigates corpus-based
lexicography, and that of Oakes (1998) explores statistical techniques and
corpus applications. This volume is also designed with a specific purpose:
to show how multilingual parallel corpora can be used in teaching and
research. However, almost equal emphasis is placed on the design and
application of alignment technology, an essential tool for extracting
appropriate information from parallel corpora.

	In the introductory chapter (pp. 1-37) Michael Oakes and Tony
McEnery of Lancaster University, UK, summarise the main methods used for
bilingual text alignment, referring in some detail to the various
statistical and linguistic techniques used by different scholars engaged
in the task. The chapter presents a general, up-to-date overview of
current work in the development of alignment technology, and provides a
thematic classification of the contents of the following chapters.

	In Chapter 2 (pp. 38-64) Michel Simard, George Foster,
Marie-Louise Hannan, Elliott Macklovitch and Pierre Plamondon of Centre
d'Innovation en Technologies de l'Information, Canada, deal with bilingual
text alignment as a part of translation analysis (TA), which refers to the
reconstruction of correspondences between segments of a source text and
the segments of its translation. In the introduction they note the recent
upsurge of text alignment technology and identify its application areas in
both the academic and commercial sectors.

	After differentiating among the methods used by Brown et
al. (1991), Gale and Church (1991, 1993), Dagan et al. (1993), Melamed
(1996) and others, they describe JACAL (Just Another Cognate Alignment
Program), which they developed to look for similar patterns of characters
in order to produce a reliable bi-text map independent of the texts'
logical divisions. However, it compares only a fraction of the characters
(the first 4 characters) of cognate words in a pair of texts. As a
stand-alone program its robustness is evaluated by running it on a
collection of bilingual texts, with satisfactory results.
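
	The prefix comparison mentioned above can be illustrated with a
minimal sketch. It is not the JACAL implementation: the function names,
the lower-casing step and the rule that shorter words are ignored are all
assumptions made here for illustration only.

```python
def is_cognate_candidate(word_a, word_b, prefix_len=4):
    """Treat two words as cognate candidates when their first
    `prefix_len` characters are identical (case-insensitive).
    Words shorter than `prefix_len` are never matched (an assumption)."""
    a, b = word_a.lower(), word_b.lower()
    if len(a) < prefix_len or len(b) < prefix_len:
        return False
    return a[:prefix_len] == b[:prefix_len]


def cognate_points(source_tokens, target_tokens, prefix_len=4):
    """Return (source index, target index) pairs of candidate cognates.
    Such points are the raw material for a bi-text map."""
    return [
        (i, j)
        for i, s in enumerate(source_tokens)
        for j, t in enumerate(target_tokens)
        if is_cognate_candidate(s, t, prefix_len)
    ]


if __name__ == "__main__":
    english = ["the", "parliament", "session"]
    french = ["la", "session", "parlementaire"]
    print(cognate_points(english, french))
```

A real bi-text mapper would also have to filter out chance matches; the
sketch only shows the matching criterion itself.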

	The next part describes SALIGN - a program for sentence alignment
based on the model of Brown et al. (1993). The search engine (called
TransSearch) allows a user to search a large corpus of bilingual texts
for specific expressions in one or both languages. It offers two
advantages: "first, it guarantees that the system's bi-text output
contains both the query and its translation; second, it provides a
coherent context for the presentation of results, enabling the user to
evaluate the relevance of each resulting translation in terms of the
problem at hand" (p. 48). The program is tested on its own and with JACAL
on a collection of bilingual texts to find that "SALIGN consistently
records much higher recall (a traditional evaluation measure for
information retrieval systems) figures than the other
programs" (p. 54). Despite its encouraging robustness, the overall result
of the program is quite disappointing, pointing to its inability to
account for translation omissions, insertions or segmentation errors.
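
	For readers unfamiliar with the recall measure quoted above, the
following sketch shows how recall (and its companion, precision) is
usually computed for a set of proposed alignments against a hand-checked
reference. The representation of an alignment as a (source sentence
index, target sentence index) pair is an assumption made here; the
chapter's own evaluation set-up may differ.

```python
def precision_recall(proposed, reference):
    """Compare proposed alignment links with reference links.

    precision = correct links / links proposed
    recall    = correct links / links in the reference
    Both arguments are sets of (source_index, target_index) pairs."""
    correct = len(proposed & reference)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    reference = {(0, 0), (1, 1), (2, 2), (3, 3)}
    proposed = {(0, 0), (1, 1), (2, 3)}
    p, r = precision_recall(proposed, reference)
    print(f"precision={p:.2f} recall={r:.2f}")
```

A program can score high on recall simply by proposing many links, which
is why the two figures are normally read together.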

	The next part describes TMALIGN - a word alignment program
designed after Brown et al.'s stochastic translation model (1993) to
calculate bi-textual correspondences at the word level. It is run over a
bilingual corpus drawn from the Canadian Hansards. The result is
significantly better, probably because the input data is quite large and
clean, and the program works on corpora already aligned at the sentence
level.

	Thus they describe their effort to align linguistic units in
bilingual texts at the character, sentence and word levels. Finally, they
point to future work on the proper segmentation of input texts into
paragraphs, sentences and words; the advent of a number of translation
support tools (e.g. translation memories, bilingual concordances,
translation checkers, translation dictation machines etc.); and the
development of more elaborate and robust models to support full-range,
high-quality machine translation.

	In Chapter 3 (pp. 65-69) Pernilla Danielsson and Daniel Ridings of
Göteborgs Universitet, Sweden, briefly describe how they conduct a course
that prepares students to select appropriate terminology for translation
purposes. The corpus they use includes both translated and non-translated
texts in Swedish and English covering the same domains and genres. Using
Gale and Church's (1993) alignment algorithm they align texts at the
sentence level, store them in TEI format, and use the Normalised SGML
Library for alignment annotation.

	As their aim is to train the students to search out appropriate
terms from the source and target texts for translation, they decide that
in both texts "words or phrases having a special meaning or frequent use
in a specific domain/genre/area" (p. 68) should be searched. However,
their experience shows that (i) a term in the source text may not have an
appropriate corresponding term in the target text, (ii) translation 'per
se' is probably impossible since a word in one culture may not correspond
to a single word in another culture, (iii) a translated term may be a
hyponym of the term in the source text, and (iv) culturally specific terms
in the source language may not have a correspondence in the target
language. The students are, therefore, trained to identify the correct
domain of use of each term and locate it in Swedish contexts in parallel
with its translation in English contexts. For problematic terms they are
advised to browse through the record of previous translations.

	In Chapter 4 (pp. 73-85) Carol Peters, Eugenio Picchi and Lisa
Biagini of Istituto di Linguistica Computazionale, Pisa, Italy, describe
how a Bilingual Corpus System is implemented for processing and searching
both parallel and comparable text archives for language teaching and
learning. Though they use both parallel text corpora (consisting of sets
of translationally equivalent texts) and comparable text corpora
(comprising sets of texts from pairs of languages suited to contrastive
and comparative study), they feel that "parallel text corpus is most
likely to provide useful data for the average second language student, who
is mainly looking for information on ways in which a given word or phrase
can be translated acceptably in another language" (p. 74). To achieve
their goal they use an Italian/English bilingual lexical database and
morphological analysers and generators to process and search parallel
corpora. Their system operates in two distinct stages: in the first stage
"a bilingual electronic dictionary and morphological components are used
to link pairs of English/Italian texts on the basis of L1/L2 translation
equivalents" (p. 76), and in the second stage the "L1/L2 links are used by
the bilingual text query system to construct parallel contexts for any
form or co-occurrence of forms searched in either of the two sets of
texts" (p. 76).

	After the parallel bilingual text archives are processed by the
synchronisation procedure, all the links are obtained and stored for the
parallel query system. When a source text word is searched, the parallel
concordances for the target text are constructed and the associated links,
if any, are looked up. With these, the user can search for a single word;
can search for all the word forms of a given lemma by using the
morphological generator; and can extract relevant information on the
translation of idioms and collocations.

	In the next part of the chapter they describe how they apply a
different approach to processing comparable corpora. For this they consider
texts from the same domain or on the same topic from Italian and English
mainly focusing on nouns with the conviction that "in domain-specific
corpora it is mainly the nouns that bear the weight of topic-specificity,
i.e. technical message, the verbs tend to have a more general
meaning" (p. 80). Given a particular term or set of terms found in the
texts of the source language, the aim is to identify contexts which treat
the same argument in the texts of the target language. To do this they
attempt to isolate the vocabulary or context related to that term in the
source corpus, assuming that the word will be surrounded by a similar
vocabulary or context in the target language. Next, using their lexical
tools (morphological analysers and generators, a bilingual lexical
database etc.), they construct equivalent vocabulary for both languages
and create sets of translation equivalents. Finally, the system searches the
target corpus in order to identify words and expressions that can be
considered as in some way lexically equivalent to the selected term in the
source language.

	Finally, they try to implement a function that will allow them "to
query on the combinations of more than one term (collocates and
compounds), essential for studies on terminology as terms generally appear
in the form of multiword units" (p. 83). They also intend to refine the
search criteria; increase the efficiency of the algorithm in order to
improve performance and to increase precision of retrieval eliminating
noise; and test alternative methods of the MI (Mutual Information) index
to see whether the results change substantially.
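
	As a reminder of what an MI index measures, the sketch below
computes the pointwise formulation that is common in corpus work, MI(x, y)
= log2( P(x, y) / (P(x) P(y)) ), from raw counts. Whether the chapter uses
exactly this formulation is an assumption; the function name and the
treatment of zero co-occurrence are likewise illustrative.

```python
import math


def mutual_information(pair_count, x_count, y_count, total):
    """Pointwise mutual information (base 2) from raw counts.

    pair_count: times x and y co-occur
    x_count, y_count: individual frequencies of x and y
    total: number of observations in the corpus
    Returns -inf when the pair never co-occurs (an illustrative choice)."""
    if pair_count == 0:
        return float("-inf")
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))


if __name__ == "__main__":
    # x occurs 20 times, y 50 times, together 10 times, in 1000 observations
    print(round(mutual_information(10, 20, 50, 1000), 3))
```

A score near zero means the two items co-occur about as often as chance
would predict; large positive scores suggest a collocation, although the
measure is known to overrate rare items.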

	In Chapter 5 (pp. 86-91) Rene Meyer, Mary Ellen Okurowski and
Thérèse Hand of New Mexico State University, USA, describe how their
adult-centred approach to language training is made effective by the use
of authentic corpora and language tools. For this purpose they use OLEADA
(meaning 'tidal wave' in Spanish), a multilingual software environment
developed in-house that integrates on-line multilingual text corpora,
information retrieval and language analysis tools. A single user interface
allows smooth access to the texts and tools in ten languages. Users can
study the language of retrieved texts by using OLEADA's different language
analysis tools such as on-line dictionaries and references, XConcord,
parallel text alignment, segmentor, frequency count and user-generated
annotations. In future, further tools such as lexical collocation
identification, part-of-speech tagging, parsing and entity identification
are expected to be incorporated in the system.

	The system has strong application potential for language trainers,
classroom instructors and independent learners as it provides
"just-in-time training with texts on demand, tasks that parallel
professional needs, and comparative feedback for self evaluation on the
part of the learner" (p. 87). Classroom instructors can use it to respond
to learner needs, to reply to queries, to retrieve topical information, to
find references, and to carry out many similar teaching tasks. Trainers
can use frequency counts to assess which elements should be emphasised
and taught first, and to compare the frequency of words within a document
with the frequency of the same words within a corpus or user-specified
sub-corpus. With KWIC they can locate and display terms and phrases with
all their contexts of use in the corpora. With XConcord they can present,
compare and discuss the contexts of important items to improve students'
expectancy skills by studying the actual content domains of items. The
parallel search tools enable them to access foreign language texts and the
corresponding English translations, together with all the other tools
built into OLEADA. Moreover, the annotation facility provides clues,
definitions, grammatical explanations and similar information about the
text to learners while they work through particular tasks. For independent
learners OLEADA serves in the same way as it does for the other two types
of users. Moreover, it enables them to identify work-related texts,
prepare relevant assignments, and discover and study new language
phenomena. On-line dictionaries provide them with language-specific
linguistic resources; segmentors help them with the automatic segmentation
of written texts into paragraphs, sentences or words; and XConcord helps
them to examine context-specific samples and grammatical structures.
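
	Two of the operations mentioned above, a KWIC (key word in
context) display and a relative frequency count, are simple enough to
sketch. The code below is a generic illustration and makes no claim about
how OLEADA or XConcord work internally; the function names, the
whitespace tokenisation and the context width are assumptions.

```python
from collections import Counter


def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit,
    with up to `width` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def relative_frequencies(tokens):
    """Map each lower-cased word to its share of all tokens, so that a
    document can be compared with a corpus of a different size."""
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


if __name__ == "__main__":
    text = "the cat sat on the mat near the door".split()
    for left, key, right in kwic(text, "the", width=2):
        print(f"{left:>10} [{key}] {right}")
    print(relative_frequencies(text)["the"])
```

Comparing a word's relative frequency in one document with its relative
frequency in a larger corpus is the kind of check the trainers are
described as making when deciding what to teach first.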

	Thus OLEADA resources support learners through all stages of the
learning process, including the final feedback. The OLEADA corpora also
contain written speech read aloud as well as transcribed colloquial speech
with accompanying audio to enhance listening comprehension. All these
utilities put together enable learners to study as they work and work as
they study.

	In Chapter 6 (pp. 92-105) Jennifer Pearson of Dublin City
University, Ireland, describes an approach to the teaching of terminology
using electronic resources. For terminological research the resources
available to students are of three types: a large collection of full texts
(articles from newspapers, journals and encyclopaedias), a large
collection of abstracts (of articles from science, computing and
business) and single texts (in more than one language, with a similar
communicative function and dealing with a particular topic: mostly
lectures as an aid to specialised translation).

	For the selection of resources a number of criteria are
considered, such as author-reader relationship (expert-expert or
expert-naive communication), published texts (because they ensure some
degree of acceptance of the terminology within a particular field), text
origin (product of an individual or a collaborative venture), constitution
of texts (single or composite), factuality (the text must be factual),
technicality (the text may be technical or semi-technical), and intended
outcome of the text (informative, as an article in a newspaper; didactic,
as used in the teaching of a subject; or stipulative, as standard or
regulatory texts prescribing and defining terms used in particular subject
domains).

	While searching for a term students look for a number of different
categories of information. To situate the term in a conceptual hierarchy
they look for evidence of genus-species relations (the term is preceded or
followed by its superordinate term or general language equivalent),
part-whole relations (the term is either a part or the whole of the
preceding or following term), quasi-synonymous relations (the term is
explained using an equivalent term or phrase) and other similar relations.
Moreover, they need to know the characteristics of the term such as its
purpose, origin, function, inputs, outputs, properties etc. Once they are
equipped with all these clues they are asked to identify the meaning of a
term (or two culture-specific terms); to establish equivalence across
languages; and to identify appropriate collocations and related terms.

	Their experience in identifying the meaning of culture-specific
terms reveals that terms "which occur quite frequently in financial,
political and economic texts are particularly problematic for
translators" (p. 99). To them the best solution for this is to "use the
document itself to find clues to the meaning of the term and, if this
fails, to consult other documents on the subject" (p. 99). For
establishing equivalence across languages they propose (as they did for
English and French) identifying the meaning of a term in the source
language as well as the meaning of a similar term in the target language
with the help of a dictionary. Once potential equivalents are identified,
their appropriateness can be confirmed by the use of a parallel corpus
(bilingual texts dealing with the same subject area and with a similar
communicative function). In conclusion they underline the importance of
finding related terms (as they may provide a basis for additional glossary
entries) and collocations (as this information is rarely found in
dictionaries) for retrieving terminological definitions from texts.

	In Chapter 7 (pp. 106-115) Michael Barlow of Rice University, USA,
describes how parallel texts can be used in language teaching. In his
opinion, by using a corpus and a text analysis program students can learn
a language better than by using a dictionary, thesaurus or grammar,
because corpora provide the learner with "a rich and adaptable research
environment in which the data are selected examples of language use,
embedded in their linguistic context" (p. 106). He cites some case studies
on the treatment of reflexive forms and the use of certain lexical items
in English corpora to substantiate that corpus-based investigations are
better able to reveal the complexities and fine-grained patterns of use of
lexical items in a language. He postulates that "it is likely that the
bulk of language acquisition is the result of inductive rather than
deductive learning mechanisms, a fact which, if true, has far-reaching
consequences for the teaching of languages" (p. 109).

	In the second part of the chapter he describes research based on
the analysis of parallel texts, some uses of parallel texts in the
language classroom, and ParaConc, a simple parallel text concordance
program for searching words and phrases in parallel corpora. Searching
through parallel texts he finds that some of the reflexives in English are
not translated with a reflexive in French. Similarly, the collocations and
polysemy structure of particular lexical items of English contrast
strongly with their French correlates. He is, however, able to locate the
areas of divergence and identify the reasons for such differences.

	Finally, he argues that students can use parallel corpora in the
classroom 'for the feel of a second language'; to obtain some concrete
knowledge of correspondences; to explore the richness of context of a
particular lexical item not available in a bilingual dictionary; to gather
important information concerning the relative frequency of different
constructions and collocations; to understand the distinctions of meaning
expressed by particular terms in both source and target language; to know
how the context in terms of discourse and genre can provide clues to the
appropriate meanings etc.

	In Chapter 8 (pp. 116-133) David Woolls of Birmingham University,
UK, describes the development of a user-driven multilingual parallel
concordancer as a tool for use in the classroom. The system, developed as
part of the Lingua project, works with Danish, English, French, German,
Greek and Italian. The corpus includes a set of texts covering children's
literature, fiction, non-fiction and general scientific writings, and
conforms to the guidelines of the TEI (Text Encoding Initiative). It uses
Minmark - a highly reduced mark-up program designed after SGML (Standard
Generalised Mark-up Language) - as corpora with detailed mark-up pose
problems in handling the texts and translations. The alignment algorithm
of Gale and Church (1993) is simplified to make it user-friendly and to
work by reference to paragraph and sentence boundaries. The advantage of
the method is that whatever dual linear/length relationship exists between
languages, the algorithm can be considered language independent. It is
considerably simpler and yet produces results comparable to those of the
Gale and Church algorithm.
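
	The idea behind length-based sentence alignment can be shown with
a deliberately simplified sketch. It is neither the Gale and Church (1993)
algorithm, which scores candidate pairings with a probabilistic model of
length ratios, nor the simplification used in the chapter. Here the cost
of a pairing is just the absolute difference in character length plus a
fixed penalty for anything other than a one-to-one match; the function
name and the penalty values are assumptions chosen for illustration.

```python
def align_by_length(src, tgt, skip_penalty=50, merge_penalty=10):
    """Align two lists of sentences by character length alone.

    Returns a list of (source indices, target indices) groups covering
    both texts in order. Allowed groupings: 1-1, 1-0, 0-1, 2-1, 1-2."""
    n, m = len(src), len(tgt)
    ls = [len(s) for s in src]
    lt = [len(t) for t in tgt]
    inf = float("inf")
    # cost[i][j]: cheapest alignment of src[:i] with tgt[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    # (source sentences used, target sentences used, fixed penalty)
    groupings = [(1, 1, 0), (1, 0, skip_penalty), (0, 1, skip_penalty),
                 (2, 1, merge_penalty), (1, 2, merge_penalty)]
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in groupings:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                gap = abs(sum(ls[i:ni]) - sum(lt[j:nj]))
                candidate = cost[i][j] + gap + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Walk back from the end to recover the chosen groupings.
    path = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        path.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    path.reverse()
    return path


if __name__ == "__main__":
    english = ["Hello.", "This is a long sentence indeed.", "Bye."]
    french = ["Bonjour.", "Ceci est une", "longue phrase en effet.", "Salut."]
    print(align_by_length(english, french))
```

Because only lengths are compared, no dictionary or language-specific
knowledge is needed, which is the sense in which such methods are called
language independent.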

	The program makes a difference to users when they search, sort
and test the data. They are able to move between any pair of languages,
study contrastive translations, and select languages and files quite
easily. While searching for contexts of particular items it is assumed
that the contexts should appear within a proximity of up to six words to
the left, right or either side of the search item, or anywhere in the same
sentence, or anywhere in the same paragraph. The proximity option is as
useful as in other concordancers, the sentence option is useful "where
examples of linguistic features prone to extremely distribution are
sought" (p. 130), and the paragraph option is useful "to identify
paragraphs where two characters in a text are interacting" (p. 130). For
quick and practical classroom operation the system needs good search speed
and reasonable accuracy, which leads to a modification of the standard
concept of alignment and encoding.
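
	The proximity option described above amounts to a window search
around each occurrence of the search item. The sketch below is a generic
illustration, not the concordancer's own code; the function name, the
`side` parameter and the whitespace tokenisation are assumptions.

```python
def within_proximity(tokens, target, collocate, window=6, side="either"):
    """Return the positions of `target` that have `collocate` within
    `window` tokens to the left, to the right, or on either side."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() != target.lower():
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        if side == "left":
            span = left
        elif side == "right":
            span = right
        else:
            span = left + right
        if any(t.lower() == collocate.lower() for t in span):
            hits.append(i)
    return hits


if __name__ == "__main__":
    text = "she gave the book to him and he read the book quickly".split()
    print(within_proximity(text, "book", "read", window=2, side="left"))
    print(within_proximity(text, "book", "read", window=6, side="right"))
```

The sentence and paragraph options would follow the same pattern, with the
window replaced by the boundaries of the enclosing unit.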

	In Chapter 9 (pp. 134-147) Stig Johansson of the University of
Oslo, Norway, and Knut Hofland of the University of Bergen, Norway,
describe their current work on contrastive analysis and translation
studies with the English-Norwegian parallel corpus, and focus on new
directions of research. The size of the corpus is approximately 2.5
million words, consisting of comparable original texts in each language
and their translations into the other language. It is encoded according to
the TEI guidelines and aligned at the sentence level. At present the
corpus is being used to study the "presentative constructions in English
and Norwegian, word order in English and Norwegian, expressions of
possibility in English and Norwegian, Norwegian discourse particles and
their English correspondences" (p. 134). However, the chapter only
presents some comparative studies of the occurrence of some linguistic
items in parallel corpora and their respective translations and "the
expansion of the corpus to other languages for use in multilingual
research" (p. 135).

	Navigating through the parallel corpus they find that the
Norwegian modal auxiliary 'skal' is far more widely used than the
etymologically related English 'shall', while the Norwegian modal particle
'nok' may correspond to a wide range of forms (adverbs, verbs and
clauses) in English. The plausible interpretation of these results may be
the availability or non-availability of appropriate terms in translations,
or some language or culture specific factors.

	The second part of the chapter deals with analysis and contrastive
studies of multilingual corpora comprising six English fiction texts
aligned with their translations into German and Norwegian. Using the Dice
similarity measure (McEnery and Oakes 1995) they extract cognates from the
English original texts and the translations to show that "while Norwegian
shares a lot of vocabulary both with English and German, the latter have
far less in common" (p. 141). They also study specific text-based
constructions (e.g. equative structures, cleft constructions, analogous
one-clause constructions, non-analogous one-clause constructions, initial
'which' etc.) across three languages to show that "English reversed
pseudo-clefts are almost always conveyed by other types of constructions
in German and Norwegian translations" (p. 146). In conclusion they opine
that the emergence of parallel multilingual corpora would give new insight
into language research; supply important input for the production of
teaching materials and the writing of contrastive grammars and bilingual
dictionaries; and provide a bridge between language description and
language use.
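
	The Dice similarity measure mentioned above is defined for two
sets X and Y as 2|X ∩ Y| / (|X| + |Y|). For cognate detection it is
commonly applied to the character bigrams of two words; that choice, like
the function names below, is an assumption made for this sketch and may
not match the exact procedure followed in the chapter.

```python
def bigrams(word):
    """Return the list of adjacent character pairs in a lower-cased word."""
    w = word.lower()
    return [w[i:i + 2] for i in range(len(w) - 1)]


def dice(word_a, word_b):
    """Dice coefficient over character bigrams: twice the number of
    shared bigrams divided by the total number of bigrams in both words.
    Repeated bigrams are matched one-for-one."""
    a, b = bigrams(word_a), bigrams(word_b)
    if not a or not b:
        return 0.0
    remaining = list(b)
    shared = 0
    for gram in a:
        if gram in remaining:
            remaining.remove(gram)
            shared += 1
    return 2 * shared / (len(a) + len(b))


if __name__ == "__main__":
    for pair in [("night", "nacht"), ("colour", "color"), ("house", "tree")]:
        print(pair, round(dice(*pair), 3))
```

Word pairs scoring above some chosen threshold would be treated as cognate
candidates; the threshold itself has to be set empirically.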

	In Chapter 10 (pp. 148-156) Raphael Salkie of the University of
Brighton, UK, describes how small and medium-sized multilingual corpora
(SMEMUCs: a new acronym coined by the author) are compiled in the
INTERSECT project and used as a source for language research and
teaching. For the task at hand they give the comparable corpora (texts
from English, French and German) a computer-readable form; take some
decisions regarding the editing of texts for easy handling, storage and
retrieval; align the texts at sentence and paragraph level; correct
typographical (spelling) errors in the aligned texts; and save the files
in text-only (ASCII) format.

	The corpora, prepared thus, are used for contrastive linguistic
research (e.g. use of epistemic modality in French and English, use of
English 'but' vs. French 'mais', or English 'allege' vs. German 'sollen'
etc.); for studying different aspects of grammar and vocabulary; for
"comparing corpus data with the entries in bilingual
dictionaries" (p. 154); and for teaching translation.

	In Chapter 11 (pp. 157-176) Josef Schmied and Barbara Fink of the
University of Chemnitz, Germany, describe a contrastive lexicological
study based on English-German parallel texts and translations. The chapter
highlights the use of English 'with' and its German translation
equivalents in a sub-corpus comprising texts from tourist brochures,
publications by the European Union, scientific textbooks and literary
texts.

	In the first half they identify the prepositional and prototypical
use of 'with' from the corpus texts, note its semantic diversity, and
search for its syntactic categories. They also observe the distribution of
'with' across text types and syntactic functions in the English corpus to
show that "whereas adnominal and clausal 'with' are particularly frequent
in tourist texts, literature uses more adverbial 'with'" (p. 163). For the
English preposition 'with', German has many translation equivalents, among
which 'mit' is the most frequent, followed by other prepositions like
'bei', adjectives with an adnominal function like 'beschmt', or
zero-translations etc. In some cases in German either the entire sense
element is omitted or other solutions are sought to express the sense of
'with'. Most of these changes are caused by the content of the
translations, text types and the translator's language choice, besides
other language-specific grammatical or syntactic factors. They argue that
the prototypical equation 'with' = 'mit' is unsatisfactory in many
respects, quantitatively and qualitatively. Therefore, it is better to try
a functional grammar approach as a primary classification of 'with',
because meaning-based categorisation leaves so many cases in between.

	In conclusion they hope that contrastive corpus linguistics can
(i) show that simple word-class based equivalents found in bilingual
dictionaries are not sufficient for translations; (ii) expose how
different innovations spread across text-types until they permeate the
entire language structure; or (iii) provide more detailed empirical
description of a language in a typological perspective.

	So far the volume emphasises Western European languages, which
are mostly genetically related. It gives little attention to the
possibility of developing parallel corpora or text alignment algorithms
for genetically or typologically unrelated languages. This aspect is,
however, taken into consideration in the concluding chapter (pp. 177-191),
where Tony McEnery, Scott Piao and Xu Xin of Lancaster University, UK,
present some work on experimental corpus building in two 'un-related'
languages, English and Chinese, comprising texts from general science,
letters, poetry, fiction and social service leaflets. They also try to
design an annotation scheme for parts-of-speech encoding, and develop an
algorithm based upon bi-variate distributions to align sentences of
parallel texts at word level.

	By modifying existing techniques where necessary and devising
some new ones for the problem at hand, they succeed in demonstrating that
their new alignment technique, based on the correlation between English
and Chinese pairs, is effective and that the results are quite stable
across the corpora. However, their attempt raises a demand for more such
work among various language pairs so that existing "alignment technology
can be tested and refined, enabling a wide range of work between large
number of languages" (p. 189).

A critical evaluation

It is a rare experience to come across a volume where all the chapters are
so well written with ample scope for laymen to get acquainted with this
new area of language research and training. In fact such a difficult area
would not have been so nicely handled if the authors were not well versed
in their respective fields. The book shows how corpora (either monolingual
or bilingual or parallel or comparable or aligned) can be used for
teaching as well as for new research and understanding the language.

	As a reviewer I am delighted to have read this book, and I believe
anyone interested in applying corpora in language teaching and research
can gather from this book many novel and exciting ideas for exploiting
corpora: a treasure house of linguistic properties. However, a few
observations, which I would like to note here, may be considered for the
next edition of the book.

(i) Barring the last chapter, the discussions and experiments in all other
chapters are centred on pairs of closely related languages. Because of
their genealogical, typological, orthographic and many other similarities,
the corpora of these languages are probably easier to align, which may not
be so for corpora of languages that differ strongly in their respective
features. It would be interesting to see what new approaches would have to
be employed for aligning corpora of English-Hindi, English-Japanese,
Bangla-Chinese or Arabic-Japanese.

(ii) The application of multilingual corpora is not confined to teaching
and research, the focus announced in the title of the book. In fact, almost
all the contributors identify many more application areas for corpora. The
multi-functional utility of corpora is probably best perceived by
Svartvik (1986), who envisages that corpora can be used in "lexicography,
lexicology, syntax, semantics, word-formation, parsing, question-answer
synthesis, software development, spelling checkers, speech synthesis and
recognition, text-to-speech conversion, pragmatics, text linguistics,
language teaching and learning, stylistics, machine translation, child
language, psycholinguistics, sociolinguistics, theoretical linguistics,
corpus clones in other languages such as Arabic and Spanish - well, even
language and sex". The scope of application of corpora is further expanded
in the observations of Atkins et al. (1992), Leech and Fligelstone (1992),
McEnery and Wilson (1996), Rundell (1996), Barlow (1996), Biber et
al. (1998), Kennedy (1998), Teubert (2000) and many other experts in this
area.

(iii) There are a few minor mistakes in orthography and cross-referencing:
on page 62, line 39, 'tem' should be 'term'; on page 179, line 22, 'in
noted' should be deleted; and 'Section 3.1.1', mentioned on page 179, line
23, is not found in chapter 3.

(iv) In some cases the full forms of abbreviated terms such as LL (page
14, line 10), LDB (page 76, line 17), OLEADA (page 86, line 3) and
INTERSECT (page 149, line 6) are not given in the text.

(v) A glossary of the terms used in the book, most of them new, would
have been a fine addition to the volume as well as a good help to
readers.

	Nevertheless, with its comprehensive introduction to the subject,
the book is a good reference work for all corpus users as well as for
language researchers, instructors and teachers. General readers with a
liking for language and linguistics will also find it interesting. The
quality of paper, printing and binding is of international standard.

E) Bibliography

Aarts, J. and Meijs, W. (eds.) (1984) Corpus Linguistics. Amsterdam: Rodopi.
Aarts, J. and Meijs, W. (eds.) (1986) Corpus Linguistics II. Amsterdam: Rodopi.
Atkins, S., J. Clear and N. Ostler. (1992) "Corpus Design Criteria",
Literary and Linguistic Computing. 7(1): 1-16.
Barlow, M. (1996) "Corpora for Theory and Practice", International Journal
of Corpus Linguistics. 1(1): 1-38.
Barnbrook, G. (1996) Language and Computers. Edinburgh: Edinburgh
University Press.
Biber, D., S. Conrad, and R. Reppen (1998) Corpus Linguistics: Investigating Language Structure and
Use. Cambridge: Cambridge University Press.
Boguraev, B. and J. Pustejovsky (1996) Corpus Processing for Lexical
Acquisition. Cambridge, Mass.: MIT Press.
Brown, P. F., Lai, J. and Mercer, R. (1991) "Aligning Sentences in
Parallel Corpora", in Proceedings of ACL-91, Berkeley.
Brown, P. F., Della Pietra, S. A., Della Pietra, V. J. and Mercer,
R. L. (1993) "The Mathematics of Statistical Machine
Translation: Parameter Estimation", Computational Linguistics,
19(2): 263-312.
Dagan, I., Church, K. W. and Gale, W. A. (1993) "Robust Bilingual Word
Alignment for Machine Aided Translation", in Proceedings of the Workshop
on Very Large Corpora: Academic and Industrial Perspectives, Columbus,
Ohio.
Gale, W. A. and Church, K. W. (1991) "A Program for Aligning Sentences in
Bilingual Corpora", in Proceedings of ACL-91, Berkeley.
Gale, W. A. and Church, K. W. (1993) "A Program for Aligning Sentences in
Bilingual Corpora", Computational Linguistics, 19(1): 75-102.
Kennedy, G. (1998) An Introduction to Corpus Linguistics. London: Addison-Wesley Longman.
Leech, G. and S. Fligelstone (1992) "Computers and Corpus Analysis" in
C. S. Butler (ed.) Computers and Written Texts. Oxford: Blackwell
Publishers. 115-140.
McEnery, A. M. and Oakes, M. P. (1995) "Sentence and word alignment in
the CRATER project: Methods and assessment", in S. Armstrong-Warwick and
E. Tzoukermann (eds.) Proceedings of the EACL-SIGDAT Workshop, Dublin,
pp. 77-86.
McEnery, T. and A. Wilson (1996) Corpus Linguistics. Edinburgh: Edinburgh
University Press.
Melamed, I. D. (1996) "A Geometric Approach to Mapping Bi-text
Correspondence", in Proceedings of the Conference on Empirical Methods in
Natural Language Processing, Philadelphia.
Oakes, M. P. (1998) Statistics for Corpus Linguistics. Edinburgh:
Edinburgh University Press.
Ooi, V. B. Y. (1998) Computer Corpus Lexicography. Edinburgh: Edinburgh
University Press.
Rundell, M. (1996) "The Corpus of the Future and the Future of the
Corpus". Talk at a special conference on New Trends in Reference Science
at Exeter, UK (a hand out).
Sinclair, J. (1991) Corpus, Concordance, Collocation. Oxford: Oxford
University Press.
Svartvik, J. (1986) "For Nelson Francis", ICAME News. No. 10: 8-9.
Svartvik, J. (ed.) (1992) Directions in Corpus Linguistics: Proceedings
of Nobel Symposium 82. Berlin: Mouton de Gruyter.
Teubert, W. (2000) "Corpus Linguistics - A Partisan view", International
Journal of Corpus Linguistics. 4(1): 1-16.

A short biography of the reviewer
==================================
Niladri Sekhar Dash received his MA in Linguistics from Calcutta University
in 1991. In 1994 he completed ANLP at the Indian Institute of Technology,
Kanpur. From 1992 to 1995 he worked as a Language Analyst in the TDIL (Text
Development in Indian Languages) project of the Ministry of Information
and Technology, Govt. of India. From 1995 to 1997 he worked as a Technical
Assistant in Computational Linguistics and NLP at the Computer Vision and
Pattern Recognition Unit of the Indian Statistical Institute, Calcutta.
Since 1997 he has worked as a Scientific Assistant at the same institute.
He has submitted his thesis on corpus design and development for language
processing for the Ph.D. degree to Calcutta University. His present areas
of research are corpus design and development, word processing,
parts-of-speech tagging, morphological processing and word sense
disambiguation. His contact address is: Computer Vision and Pattern
Recognition Unit, Indian Statistical Institute, 203 B. T. Road, Calcutta
700035, India. Emails: <niladri at isical.ac.in> (Off.),
<niladrisekhar at hotmail.com> (Res.).


---------------------------------------------------------------------------

If you buy this book please tell the publisher or author
that you saw it reviewed on the LINGUIST list.

---------------------------------------------------------------------------
LINGUIST List: Vol-11-2537