[Corpora-List] Punctuation
Jane A. Edwards
edwards at ICSI.Berkeley.EDU
Wed Jan 12 01:54:11 UTC 2005
One cautionary note (though perhaps it is obvious):
the clearest cases for punctuation analysis will be those drawn
from *written* language corpora (e.g., Brown and LOB).
Although spoken language corpora contain punctuation marks, these do
not necessarily follow the conventions of written language, but rather
are sometimes strategic choices for encoding prosody to some extent
within the constraints of standard keyboards (i.e., without resorting
to special characters).
Also, how many punctuation marks are used, and which ones in
particular, can have a large impact on the "meaning" of any particular
one of them. (I've written on this elsewhere if of interest.)
This point is of course partly related to Eric Atwell's point:
> (usage depends on original sources so there is no corpus-wide
> standardised punctuation)
which is also important.
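
One way to check this before comparing corpora is to tally which marks
each text actually uses. The following is only a minimal sketch in
Python; the function name and the two sample strings are hypothetical
and are not drawn from any corpus mentioned in this thread.

```python
from collections import Counter
import string

def punctuation_inventory(text):
    """Count each ASCII punctuation character that occurs in text."""
    return Counter(ch for ch in text if ch in string.punctuation)

# Hypothetical samples: a written-style sentence and a transcript-style
# rendering in which commas are doing prosodic work.
written = 'He said, "Wait; I am not sure." Then he left.'
spoken = "he said wait, i am not sure, then, he left."

written_counts = punctuation_inventory(written)
spoken_counts = punctuation_inventory(spoken)

# The set of distinct marks is the inventory; a smaller inventory means
# each mark is likely to be covering more functions.
print(sorted(written_counts), sorted(spoken_counts))
```

Only ASCII marks are counted here, so typographic quotes or dashes in a
given corpus would need to be added to the character set.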
I can't resist mentioning two important works for background lit:
1) Quirk, et al. A Comprehensive grammar of the English language.
London ; New York : Longman, 1985. x, 1779 p. : 26 cm.
Everyone knows this source, but I think the sections on punctuation
are not given nearly the attention they deserve.
2) For punctuation in historical context, I would also recommend
the following:
Parkes, M. B. (Malcolm Beckwith)
Pause and effect : an introduction to the history of punctuation
in the West / M.B. Parkes.
Berkeley : University of California Press, c1993.
Parkes is often overlooked, but is fascinating, and full of plates
which go all the way back to ancient texts (Greek and Latin).
He makes a very strong point to the effect that punctuation has served
very different functions at different points in time, depending on
the nature of the audience for which the text was put into writing.
In Ancient Greece, one important use of writing was to preserve
spoken language and help students become better orators.
The claim is made that people didn't read silently until much later.
Another point perhaps of interest: the amount of punctuation
in the Bible varied greatly from one era to the next depending on the
intended readership. When it was a homogeneous readership (native
speaking monks), there was less punctuation; later on, when it
was a more heterogeneous readership in far-flung countries, there
tended to be more punctuation per page.
-Jane Edwards
From owner-corpora at lists.uib.no Tue Jan 11 15:48:57 2005
Cc: "Grant, T." <tg21 at leicester.ac.uk>, Nancy Ide <ide at cs.vassar.edu>,
Keith Suderman <suderman at cs.vassar.edu>
From: Nancy Ide <ide at cs.vassar.edu>
Subject: Re: [Corpora-List] Punctuation
Date: Tue, 11 Jan 2005 18:22:07 -0500
To: corpora at lists.uib.no
The American National Corpus is being represented using an XML format
in which the original formatting is preserved in attributes, so in
general you should be able to determine where scare quotes were used.
The ANC First Release of 11 million words is available from the
Linguistic Data Consortium (ldc at ldc.upenn.edu) for $75 for research
use. However, within a couple of months a second release of approx. 20
million words, which includes the 11 million words of the First
release, will be available. The 1st release data included in the 2nd
release will be much "cleaner" and many errors will have been fixed.
Also, very soon (within a month) Mark Davies' web-based search and
retrieval software for the BNC will handle the ANC 1st release as well.
The URL for his software is http://view.byu.edu.
Nancy Ide
On Jan 11, 2005, at 11:56 AM, Eric Atwell wrote:
> Tim,
> most English corpora since pioneering Brown and LOB in 1960s have
> included punctuation, so any of these might do.
> The British National Corpus from 1990s has the advantage of www-based
> trial search, you can "try before you buy" at
> http://sara.natcorp.ox.ac.uk/lookup.html
>
> For example I tried search term {'|"}
> - regular expression finding all occurrences of ' or "
> (usage depends on original sources so there is no corpus-wide
> standardised punctuation)
>
> I'm not sure how to identify all and only scare quotes via such regular
> expressions... good luck!
>
> Eric Atwell, School of Computing, Leeds University
>
>
> On Tue, 11 Jan 2005, Grant, T. wrote:
>
>> I'm looking for a freely accessible English language corpus which
>> allows analysis of punctuation marks - I'm interested for example in
>> examining the use of scare quotes.
>>
>> Any ideas gratefully received.
>>
>> Tim
>>
>> ______________________________________
>> Tim Grant
>> Forensic Section - School of Psychology
>> University of Leicester
>> 106 New Walk
>> Leicester LE1 7EA
>> UK
>>
>> TG21 at leicester.ac.uk
>> http://www.le.ac.uk/psychology/tg21/
>>
>> + 44(0)116 252 3658 (Direct Line) - + 44(0)116 252 2451 (Secretary) -
>> + 44(0)116 252 3994 (Fax)
>>
>>
>>
>
> --
> Eric Atwell, Senior Lecturer, Computer Vision and Language research
> group,
> School of Computing, University of Leeds, LEEDS LS2 9JT, England
> TEL: +44-113-2335430 FAX: +44-113-2335468
> http://www.comp.leeds.ac.uk/eric
>
>
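
For anyone working offline on plain text rather than through the BNC
lookup page, the quote search described above can be narrowed to scare
quote candidates. What follows is only a rough heuristic sketch in
Python: the pattern, the three-word limit, the function name and the
sample sentence are all assumptions made for illustration, and none of
it is the query syntax of the BNC or ANC tools.

```python
import re

# Hypothetical heuristic: a span of one to three words enclosed in
# matching straight quotes. The lookbehind (?<!\w) skips apostrophes
# inside words such as "It's"; the lookahead (?!\w) requires that the
# closing quote is not followed directly by a word character.
CANDIDATE = re.compile(r'''(?<!\w)(["'])(\w+(?:[ -]\w+){0,2})\1(?!\w)''')

def scare_quote_candidates(text):
    """Return the short quoted spans found in text, without the quotes."""
    return [m.group(2) for m in CANDIDATE.finditer(text)]

# Hypothetical sample sentence, not taken from any corpus.
sample = "The data will be much \"cleaner\" now. It's a so-called 'free' trial."
candidates = scare_quote_candidates(sample)
print(candidates)
```

This will not find all and only scare quotes: short direct quotations
and cited titles also match, and typographic (curly) quotes are ignored,
so the output is a list of candidates for manual checking.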
=======================================================
Nancy Ide
Professor of Computer Science
Vassar College
Poughkeepsie, NY 12604-0520 USA
Tel: +1 845 437-5988 Fax: +1 845 437-7498
ide at cs.vassar.edu
Chercheur Associe
Equipe Langue et Dialogue, LORIA/CNRS
Campus Scientifique - BP 239
54506 Vandoeuvre-les-Nancy FRANCE
Tel: +33 (0)3 83 59 20 47 Fax: +33 (0)3 83 41 30 79
ide at loria.fr
=======================================================