Corpora: CL course - THANK YOU!

Tue Apr 16 11:13:58 UTC 2002

Dear list members,

I'd like to thank all of you who responded to my query on 'How to
organise a corpus linguistics course'! I knew that there were some nice
corpus linguists out there, but I didn't know there were so many of you!
A big "THANK YOU" goes to:
Petra Maier
Tylman Ule
Nadja Nesselhauf
Tony Berber Sardinha
Damon Allen Davison
Bilge Say
V. V. Rykov
Francois Maniez
Frank H. Müller
Oliver Mason
Geoffrey Williams
Anke Lüdeling
Detmar Meurers.

The comments you made helped me to make decisions about texts to choose
and about how to structure the course. Attached to this email

[From listadm: removed, but available at:
http://www.hit.uib.no/corpora/tentativeXscheduleXnew.doc]

you will find the tentative schedule I handed out in the first session
yesterday (which was fun actually: a nice small group of students who had
no idea what corpus linguistics might be and didn't know why they had
chosen the course but who seemed to be eager to learn everything about it
and who all volunteered to do oral presentations).

As for the advice list members gave me, I've added or paraphrased parts
of their emails below.

Thanks very much for your help again!
All the best from Cologne,
Ute

ute.roemer at uni-koeln.de

Petra Maier (pmaier at cis.uni-muenchen.de):
I have been giving a CL1 course (4 hours/week, of which 2 hours/week are
reserved for oral talks) for several semesters now. Jurafsky/Martin's
Speech and Language Processing turned out to be a good source for short
oral talks. We started with 10-20 students, which was very good, but now
we have more than 60 students, and it has turned out that oral talks are
no longer feasible!

Tylman Ule (ule at sfs.uni-tuebingen.de):
May I recommend as a tool for querying corpora the TIGERSearch engine
(http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/).  It is,
admittedly, a query tool that mainly targets highly annotated corpora
(Penn Treebank, NEGRA, SUSANNE), but then, it is free, and has a
powerful query language designed with the linguist in mind.  It is also
available for a number of platforms (including Mac and Windows).  There
are corpus samplers that come with the tool, and any of the supported
corpora may be imported if you decide to buy them.
(It is a kind of plug, because it was partly developed in the DEREKO
project, and I was in the DEREKO team.)
As for high-volume data, I think the BNC still has no competitor with
respect to the fine-grained categories that let you do research on
differences in, e.g., gender/age/origin of speaker/writer, and, of
course, text type.  The SARA search engine that comes with it is
definitely not the only way to access it, although I guess it should be
simpler to install than any other solution.  (I installed the BNC a long
while ago, and decided to extract the data immediately for using it in
Xlex for, e.g., concordancing
(http://santana.uni-muenster.de/XlexPublic/) - sorry, this is another
shameless plug).

Nadja Nesselhauf (nadja.nesselhauf at unibas.ch) recommended choosing
chapters from introductory textbooks (Biber/Conrad/Reppen 1998, Sinclair
1991), as well as Charles Fillmore's 1991 article and Inge de Mönnink's
1999 article, for student presentations.

Tony Berber Sardinha (tony4 at uol.com.br):
I've been giving Corpus Linguistics courses here in Brazil to non-native
speakers of English for three years now, in a postgraduate department of
Applied Linguistics. Most students are EFL teachers and teacher trainers.
I stick to Windows - teaching students how to use Linux would just take
too long. What I'd recommend in terms of software is WordSmith Tools
(about 60 pounds for an individual license) and MicroConcord (free).
WordSmith is powerful and relatively easy to use, although some would
object to this and say that it has a steep learning curve, which is true
only if you want to do the most 'advanced' stuff, such as key key words,
indexing, clumps, etc. A tagger such as QTAG or WinBrill is also helpful
(both free). As far as equipment goes, you might want to get hold of a
computer projector for your computer room, so that students can follow you
as you click along in WordSmith Tools or any other software. Some students
do tend to get lost in the many windows that WS Tools opens. As for
content, one of the things that has struck me over the years is how hard
students find it to analyze concordances, and so I devote at least four
3-hour sessions to concordance analysis workshops, so that students begin
to get a grip on how to identify patterns in concordances, represent these
patterns consistently and evaluate their importance.

You can see some of my course materials at
http://lael.pucsp.br/~tony/cursos/

I apologize in advance for bad links on that page, since this website is
being transferred from another location.

Damon Allen Davison (davison at socal.rr.com):
You should send Prof. Dr. Achim Stein an e-mail
(achim.stein at po.uni-stuttgart.de).  He taught a course on French corpus
linguistics a few years ago in the Romanisches Seminar in Cologne.  I
thought his organization was quite good.  (I am not entirely unbiased,
though, because I was a tutor for that course...)  He had a lot of
materials in electronic form (everything, I think).  But with English
studies students, you might just use McEnery/Wilson as your text.  They
are online, of course:
http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/contents.htm
BTW, we did use Cygwin/GNU/Linux under Windows because Achim's tools
were actually bash scripts.  Michael Barlow's MonoConc Pro is really
good, though.  I know that he sometimes offers special licenses for his
software.
http://www.ruf.rice.edu/~barlow/corpus.html
http://www.athel.com/mono.html

V. V. Rykov (rykov at narod.ru):
As for me, I plan to follow McEnery's CL book and Catherine Ball's
course; both are on the WWW.
My only problem is this: I consider the BUC the best corpus for study, and
I used to have it for free on mainframe tape. But I cannot get it any
more, because there is now a fee for it.

Bilge Say (bsay at ii.metu.edu.tr) sent me his own course schedule and
wrote:
I am attaching the course outline of the course "Using Corpora for
Language Research", hoping that it might help somewhat with the
organization. Since my interest is in NLP and my students are cognitive
science students, this is not strictly a Corpus Linguistics course,
though. Some chapters of Biber's and McEnery's books might make
presentation materials.

Francois Maniez (fmaniez at wanadoo.fr):
I would also add the TACT concordancing software to the list, as well as
the Amalgam POS-tagger, to which you can e-mail texts to be tagged (there
is a choice of eight different tagsets). It is available at
http://agora.leeds.ac.uk/amalgam/.

Frank H. Müller (fhm at sfs.nphil.uni-tuebingen.de) recommended choosing
chapters from introductory textbooks rather than specific research
articles. He also mentioned the books "Working with German Corpora" and
"Computerlinguistik und Sprachtechnologie", edited by Kai-Uwe Carstensen
et al., and an online course written by one of his colleagues and
accessible via
http://gross.sfs.nphil.uni-tuebingen.de:8080/release/ (link:
"textbook").

Oliver Mason (oliver at ccl.bham.ac.uk) wrote:
I wouldn't spend too much time on technical issues, as most UGs will
probably not have to deal with that a lot.  Geoff's "Language and
Computers" gives a good intro to how to get your data.  Annotation is also
something that I feel is a bit overrated, as most people want to do
analysis, not annotation, and you also have technical issues to deal with
then.
> Then I still need to find some more short articles (not longer than
> 10-12 pages) I could give to my students to prepare short oral
> presentations. Will selected sections from (introductory) CL textbooks
> be more useful for this purpose than descriptions of specific research
> projects?
Not sure about CL textbooks; I'd rather use more applied texts, which show
the benefit of corpus methods.  I did a couple of sessions on lexicography
recently, and Looking Up was a useful source which the students felt gave
them good reasons why you would want to use a corpus.

Geoffrey Williams (geoffrey.williams at wanadoo.fr):
I teach a course in corpus linguistics in Nantes for students following
the licence Sciences du Langage programme. They are studying general
linguistics and will have had a course in English applied linguistics
taught by myself in the first semester. I use the latter to give them the
necessary background to contextualism. The CL course is optional in the
second semester and consists of twelve 2-hour blocks in a computer room. I find
that these students need a rapid hands-on approach whilst being given some
background to humanities computing. The level of computer knowledge is
highly variable, often nil, so it is necessary to be clear as to the
difference between a Word doc and a text file as they do not always see
the difference. This is all done through a mixture of self discovery and
teaching. Like Tony, I do not use Linux, much as I would like to, as it is
not readily available to these students. I quickly get onto text and use
the concordancer generously provided free by Darmstadt, WConcord, at
http://www.ifs.tu-darmstadt.de/sprachlit/wconcord.htm . This runs under
Windows and does all the basic tasks needed in discovering language in
context. I show them WordSmith as this is the best, but the faculty is too
stingy to buy anything. I do not go into POS tagging due to lack of time,
but concentrate on what a concordancer can show when working on plain
text. Once they have seen a concordancer at work I go into more detail as
to what constitutes a corpus etc. For background reading I recommend:
Tognini Bonelli, E. 2001. Corpus Linguistics at Work. Amsterdam: John
Benjamins.
Partington, A. 1998. Patterns and Meanings. Amsterdam: John Benjamins.
Kennedy, G. 1998. An Introduction to Corpus Linguistics. Longman.
and of course
Sinclair, J. McH. 1991. Corpus, Concordance, Collocation. Oxford: Oxford
University Press.

Anke Lüdeling (aluedeli at uos.de):
If you want to work on German in your corpus class: there is a very nice
tool for tagging and morphological analysis called MORPHY which you
could use. I like it a lot for teaching purposes because it is easy to
understand and to use - you can change a number of parameters and
directly see the consequences. Another bonus is that the documentation
on MORPHY is very well-written. The two texts given below are so clear
that students will be able to understand them without too much prior
knowledge.
MORPHY can be downloaded from
http://www-psycho.uni-paderborn.de/lezius
The documentation can be downloaded from
http://www.ims.uni-stuttgart.de/~lezius
For students I would especially recommend Rapp and Lezius (2001) and
Lezius, Rapp & Wettler (1998).
If you are still looking for (short) research papers, look at the
proceedings of the Corpus Linguistics Conference in Lancaster, 2001:
there are a number of very interesting topics.

Detmar Meurers (dm at ling.osu.edu) stressed the importance of focussing on
theoretical aspects within corpus linguistics.
As for short articles for student presentations he recommended "Corpus
Annotation" by Roger Garside, Geoffrey Leech and Tony McEnery. He also
recommended McEnery/Wilson's "Corpus Linguistics" as an introductory
textbook.