[Corpora-List] Punctuation

Wed Jan 12 01:54:11 UTC 2005

One cautionary note (though perhaps it is obvious):
the clearest cases for punctuation analysis will be those drawn
from *written* language corpora (e.g., Brown and LOB).
Although spoken language corpora contain punctuation marks, these do
not necessarily follow the conventions of written language, but rather
are sometimes strategic choices for encoding prosody to some extent
within the constraints of standard keyboards (i.e., without resorting
to special characters).

Also, how many distinct punctuation marks are in use, and which ones
in particular, can have a large impact on the "meaning" of any one
of them.  (I've written on this elsewhere, if of interest.)
This point is of course partly related to Eric Atwell's point:
	> (usage depends on original sources so there is no corpus-wide
	>  standardised punctuation)
which is also important.
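
To make the point about inventories concrete, here is a minimal sketch
(in Python, which nobody in this thread has specified) of how one might
tabulate which punctuation marks a given text actually uses before
comparing sources. The function names and the two sample strings are
hypothetical illustrations, not taken from any corpus or tool mentioned
here.

```python
import unicodedata
from collections import Counter


def punctuation_inventory(text):
    """Count every character whose Unicode general category is punctuation (P*).

    Note: this counts characters, not functions. An apostrophe in "don't"
    and a single quotation mark are the same character and are not told apart.
    """
    return Counter(ch for ch in text if unicodedata.category(ch).startswith("P"))


def relative_shares(inventory):
    """Return each mark's share of all punctuation characters in the text."""
    total = sum(inventory.values())
    if total == 0:
        return {}
    return {mark: count / total for mark, count in inventory.items()}


# Hypothetical samples: one written-style sentence, one transcript-style line.
written = 'He said, "Wait; the so-called \'expert\' left."'
spoken = "so, um... i went, like, there..."

for label, sample in (("written", written), ("spoken", spoken)):
    inventory = punctuation_inventory(sample)
    print(label, sorted(inventory.items()))
    print(label, relative_shares(inventory))
```

A small inventory (only commas and full stops, say) suggests each mark
is carrying more functions than it would in a text that also uses
semicolons, dashes and quotation marks, so counts of any single mark
are best compared only between sources with similar inventories.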

I can't resist mentioning two important works for background lit:
1) Quirk, et al.  A Comprehensive Grammar of the English Language.
     London ; New York : Longman, 1985.  x, 1779 p. : 26 cm.
Everyone knows this source, but I think the sections on punctuation
are not given nearly the attention they deserve.
2) For punctuation in historical context, I would also recommend
the following:
     Parkes, M. B. (Malcolm Beckwith)
       Pause and effect : an introduction to the history of punctuation
         in the West / M.B. Parkes.
       Berkeley : University of California Press, c1993.

Parkes is often overlooked, but is fascinating, and full of plates
which go all the way back to ancient texts (Greek and Latin).
He makes a very strong point to the effect that punctuation has served
very different functions at different points in time, depending on
the nature of the audience for which the text was put into writing.
In Ancient Greece, one important use of writing was to preserve
spoken language and help students become better orators.
The claim is made that people didn't read silently until much later.

Another point perhaps of interest:  the amount of punctuation
in the Bible varied greatly from one era to the next depending on the
intended readership.  When it was a homogeneous readership (native-
speaking monks), there was less punctuation; later on, when it
was a more heterogeneous readership in far-flung countries, there
tended to be more punctuation per page.

-Jane Edwards

	From owner-corpora at lists.uib.no Tue Jan 11 15:48:57 2005
	Cc: "Grant, T." <tg21 at leicester.ac.uk>, Nancy Ide <ide at cs.vassar.edu>,
	        Keith Suderman <suderman at cs.vassar.edu>
	From: Nancy Ide <ide at cs.vassar.edu>
	Subject: Re: [Corpora-List] Punctuation
	Date: Tue, 11 Jan 2005 18:22:07 -0500
	To: corpora at lists.uib.no

	The American National Corpus is being represented using an XML format
	in which the original formatting is preserved in attributes, so in
	general you should be able to determine where scare quotes were used.

	The ANC First Release of 11 million words is available from the
	Linguistic Data Consortium (ldc at ldc.upenn.edu) for $75 for research
	use. However, within a couple of months a second release of approx. 20
	million words, which includes the 11 million words of the First
	release, will be available. The 1st release data included in the 2nd
	release will be much "cleaner" and many errors will have been fixed.

	Also, very soon (within a month) Mark Davies' web-based search and
	retrieval software for the BNC will also handle the ANC 1st release.
	The URL for his software is http://view.byu.edu.

	Nancy Ide

	On Jan 11, 2005, at 11:56 AM, Eric Atwell wrote:

	> Tim,
	> most English corpora since pioneering Brown and LOB in 1960s have
	> included punctuation, so any of these might do.
	> The British National Corpus from 1990s has the advantage of www-based
	> trial search, you can "try before you buy" at
	> http://sara.natcorp.ox.ac.uk/lookup.html
	>
	> For example I tried search term {'|"}
	> - regular expression finding all occurrences of ' or "
	> (usage depends on original sources so there is no corpus-wide
	>  standardised punctuation)
	>
	> I'm not sure how to identify all and only scare quotes via such regular
	> expressions... good luck!
	>
	> Eric Atwell, school of Computing, Leeds University
	>
	>
	> On Tue, 11 Jan 2005, Grant, T. wrote:
	>
	>> I'm looking for a freely accessible English language corpus which
	>> allows analysis of punctuation marks - I'm interested for example in
	>> examining the use of scare quotes.
	>>
	>> Any ideas gratefully received.
	>>
	>> Tim
	>>
	>> ______________________________________
	>> Tim Grant
	>> Forensic Section - School of Psychology
	>> University of Leicester
	>> 106 New Walk
	>> Leicester LE1 7EA
	>> UK
	>>
	>> TG21 at leicester.ac.uk
	>> http://www.le.ac.uk/psychology/tg21/
	>>
	>> + 44(0)116 252 3658 (Direct Line) - + 44(0)116 252 2451 (Secretary) -
	>> + 44(0)116 252 3994 (Fax)
	>>
	>>
	>>
	>
	> --
	> Eric Atwell, Senior Lecturer, Computer Vision and Language research
	> group,
	> School of Computing, University of Leeds, LEEDS LS2 9JT, England
	> TEL: +44-113-2335430  FAX: +44-113-2335468
	> http://www.comp.leeds.ac.uk/eric
	>
	>
	=======================================================

	Nancy Ide

	Professor  of Computer Science
	Vassar College
	Poughkeepsie, NY 12604-0520 USA
	Tel: +1 845 437-5988 Fax: +1 845 437-7498
	ide at cs.vassar.edu

	Chercheur Associe
	Equipe Langue et Dialogue, LORIA/CNRS
	Campus Scientifique - BP 239
	54506 Vandoeuvre-les-Nancy FRANCE
	Tel: +33 (0)3 83 59 20 47 Fax: +33 (0)3 83 41 30 79
	ide at loria.fr

	=======================================================