[Corpora-List] Punctuation

Jane A. Edwards edwards at ICSI.Berkeley.EDU
Thu Jan 13 23:54:02 UTC 2005


Hello Keith,

Oops!  The fact you are addressing me personally makes me think
my posting may have been construed as critical of ANC.  Not intended.
It was only by coincidence that I leapt into the flow after Nancy Ide.

My intent was to focus on the interpretation of punctuation by users of
any corpus, and on two references which I found interesting with
reference to punctuation in general (though not scare quotes per se).

Thank you very much for your posting, though, as I had not known
of those great properties of ANC:
- inclusion of written, spoken, AND written to be spoken;
- move toward compatibility with Xaira as search tool;
- preserving the integrity of the data source's punctuation conventions.
Wonderful!

	> I hope this helps,
	> Keith

Thanks,

-Jane

	From suderman at cs.vassar.edu Thu Jan 13 09:59:34 2005
	Date: Thu, 13 Jan 2005 13:02:09 -0500
	From: Keith Suderman <suderman at cs.vassar.edu>
	Subject: Re: [Corpora-List] Punctuation
	X-Sender: suderman at pop.cs.vassar.edu (Unverified)
	To: "Jane A. Edwards" <edwards at ICSI.Berkeley.EDU>, corpora at lists.uib.no,
	        ide at cs.vassar.edu
	Cc: suderman at cs.vassar.edu, tg21 at leicester.ac.uk
	MIME-version: 1.0
	Content-transfer-encoding: 7BIT

	Hello Jane,

	All of the texts in the ANC are classified as written, spoken, or written
	to be spoken.  The texts are also classified in other ways (domain,
	audience, etc.), so depending on the sophistication of the search tool you
	are using it is possible to query any subset of the corpus you wish.  While
	we do not supply our own search tool, any tool that works with an XML or
	TEI compliant corpus should also work with the ANC.  In particular, we are
	working closely with the folks at the BNC to ensure that the ANC is
	compatible with their search tool Xaira
	(http://www.oucs.ox.ac.uk/rts/xaira/index.xml).

	 > Also, the number of punctuation marks which are used and which ones
	 > in particular can have a large impact on the "meaning" of any particular
	 > one of them.

	Whenever possible we have left the punctuation as is.  For example, various
	texts may represent double quotes with the double quote character, two
	single quotes, two back ticks (``) followed by two single quotes, or some
	other ISO character that looks similar to the double quote character.  Our
	goal is to provide the raw data, as we receive it, and let more informed
	people make the "strategic choices".
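
To illustrate the point above: a corpus user who wants one query to match
all of these double-quote variants could normalize a copy of the text at
search time and leave the raw data untouched.  The following is only a
minimal sketch and is not part of the ANC; the function name and the list
of variants are assumptions chosen for illustration.

```python
import re

# Assumed set of double-quote variants: two back ticks, two single quotes,
# and a few Unicode characters that resemble the ASCII double quote.
_QUOTE_VARIANTS = re.compile(r"``|''|[\u201C\u201D\u201E\u00AB\u00BB]")


def normalize_double_quotes(text: str) -> str:
    """Return a copy of text with the assumed variants mapped to '"'.

    The input string is not modified, so the original punctuation
    stays available for anyone who needs it.
    """
    return _QUOTE_VARIANTS.sub('"', text)
```

Which variants to fold together is itself one of the "strategic choices"
mentioned above, so the pattern would need adjusting for each source.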

	Unfortunately, you will find punctuation in some of our spoken texts.  This
	is because those texts already had the punctuation when we received the
	files and removing it would just be another manipulation of the data.

	I hope this helps,
	Keith


	At 05:54 PM 1/11/2005 -0800, you wrote:
	>One cautionary note (though perhaps it is obvious):
	>the clearest cases for punctuation analysis will be those drawn
	>from *written* language corpora (e.g., Brown and LOB).
	>Although spoken language corpora contain punctuation marks, these do
	>not necessarily follow the conventions of written language, but rather
	>are sometimes strategic choices for encoding prosody to some extent
	>within the constraints of standard keyboards (i.e., without resorting
	>to special characters).
	>
	>Also, the number of punctuation marks which are used and which ones
	>in particular can have a large impact on the "meaning" of any particular
	>one of them.  (I've written on this elsewhere if of interest.)
	>This point is of course partly related to Eric Atwell's point:
	>         > (usage depends on original sources so there is no corpus-wide
	>         >  standardised punctuation)
	>which is also important.
	>
	>I can't resist mentioning two important works for background lit:
	>1) Quirk, et al.  A Comprehensive grammar of the English language.
	>      London ; New York : Longman, 1985.  x, 1779 p. : 26 cm.
	>Everyone knows this source, but I think the sections on punctuation
	>are not given nearly the attention they deserve.
	>2) For punctuation in historical context, I would also recommend
	>the following:
	>      Parkes, M. B. (Malcolm Beckwith)
	>        Pause and effect : an introduction to the history of punctuation
	>          in the West / M.B. Parkes.
	>        Berkeley : University of California Press, c1993.
	>
	>Parkes is often overlooked, but is fascinating, and full of plates
	>which go all the way back to ancient texts (Greek and Latin).
	>He makes a very strong point to the effect that punctuation has served
	>very different functions at different points in time, depending on
	>the nature of the audience for which the text was put into writing.
	>In Ancient Greece, one important use of writing was to preserve
	>spoken language and help students become better orators.
	>The claim is made that people didn't read silently until much later.
	>
	>Another point perhaps of interest:  the amount of punctuation
	>in the Bible varied greatly from one era to the next depending on the
	>intended readership.  When it was a homogeneous readership (native
	>speaking monks), there was less punctuation; later on, when it
	>was a more heterogeneous readership in far-flung countries, there
	>tended to be more punctuation per page.
	>
	>-Jane Edwards

	--------------------------------------------------
	Keith Suderman
	Technical Specialist
	American National Corpus
	suderman at cs.vassar.edu
	http://americannationalcorpus.org


