[Corpora-List] Punctuation
Keith Suderman
suderman at cs.vassar.edu
Thu Jan 13 18:02:09 UTC 2005
Hello Jane,
All of the texts in the ANC are classified as written, spoken, or written
to be spoken. The texts are also classified in other ways (domain,
audience, etc.), so depending on the sophistication of the search tool you
are using, it is possible to query any subset of the corpus you wish. While
we do not supply our own search tool, any tool that works with an XML- or
TEI-compliant corpus should also work with the ANC. In particular, we are
working closely with the folks at the BNC to ensure that the ANC is
compatible with their search tool, Xaira
(http://www.oucs.ox.ac.uk/rts/xaira/index.xml).
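To illustrate what selecting a subset by classification might look like,
here is a minimal Python sketch. The header snippets, the "class" element,
and the "mode" and "domain" attribute names are all invented for
illustration; they are not the actual ANC header format, and
select_documents is a hypothetical helper, not part of any ANC tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-document headers. The element and attribute names are
# assumptions made for this sketch, NOT the real ANC header schema.
HEADERS = {
    "doc1.xml": '<header><class mode="written" domain="fiction"/></header>',
    "doc2.xml": '<header><class mode="spoken" domain="telephone"/></header>',
    "doc3.xml": '<header><class mode="written-to-be-spoken" domain="speech"/></header>',
}


def select_documents(headers, **criteria):
    """Return the names of documents whose classification matches every criterion."""
    selected = []
    for name, xml_text in headers.items():
        cls = ET.fromstring(xml_text).find("class")
        # Keep the document only if each requested attribute has the requested value.
        if cls is not None and all(cls.get(k) == v for k, v in criteria.items()):
            selected.append(name)
    return selected
```

For example, select_documents(HEADERS, mode="written") would keep only the
documents classified as written, which is the kind of subset suited to
punctuation analysis.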
> Also, the number of punctuation marks which are used and which ones
> in particular can have a large impact on the "meaning" of any particular
> one of them.
Whenever possible we have left the punctuation as is. For example, various
texts may represent double quotes with the double quote character, two
single quotes, two back ticks (``) followed by two single quotes, or some
other ISO character that looks similar to the double quote character. Our
goal is to provide the raw data, as we receive it, and let more informed
people make the "strategic choices".
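For anyone who does want to collapse these variants, a minimal Python
sketch follows. It covers the forms mentioned above (two back ticks, two
single quotes) and, as an assumption on my part about which "similar"
characters might occur, a few Unicode curly double quotes. The function
name is hypothetical, and this is one possible choice, not something the
ANC applies to its data.

```python
import re

# Variants of the double quote: two back ticks, two single quotes, and a few
# curly double quotes (the curly set is an assumption for illustration).
_DOUBLE_QUOTE_VARIANTS = re.compile("``|''|[\u201c\u201d\u201e\u201f]")


def normalize_double_quotes(text: str) -> str:
    """Replace each double-quote variant with the plain ASCII double quote."""
    return _DOUBLE_QUOTE_VARIANTS.sub('"', text)
```

A single apostrophe, as in "don't", is left untouched because only paired
single quotes are matched.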
Unfortunately, you will find punctuation in some of our spoken texts. This
is because those texts already had the punctuation when we received the
files and removing it would just be another manipulation of the data.
I hope this helps,
Keith
At 05:54 PM 1/11/2005 -0800, you wrote:
>One cautionary note (though perhaps it is obvious):
>the clearest cases for punctuation analysis will be those drawn
>from *written* language corpora (e.g., Brown and LOB).
>Although spoken language corpora contain punctuation marks, these do
>not necessarily follow the conventions of written language, but rather
>are sometimes strategic choices for encoding prosody to some extent
>within the constraints of standard keyboards (i.e., without resorting
>to special characters).
>
>Also, the number of punctuation marks which are used and which ones
>in particular can have a large impact on the "meaning" of any particular
>one of them. (I've written on this elsewhere if of interest.)
>This point is of course partly related to Eric Atwell's point:
> > (usage depends on original sources so there is no corpus-wide
> > standardised punctuation)
>which is also important.
>
>I can't resist mentioning two important works for background lit:
>1) Quirk, et al. A Comprehensive grammar of the English language.
> London ; New York : Longman, 1985. x, 1779 p. : 26 cm.
>Everyone knows this source, but I think the sections on punctuation
>are not given nearly the attention they deserve.
>2) For punctuation in historical context, I would also recommend
>the following:
> Parkes, M. B. (Malcolm Beckwith)
> Pause and effect : an introduction to the history of punctuation
> in the West / M.B. Parkes.
> Berkeley : University of California Press, c1993.
>
>Parkes is often overlooked, but is fascinating, and full of plates
>which go all the way back to ancient texts (Greek and Latin).
>He makes a very strong point to the effect that punctuation has served
>very different functions at different points in time, depending on
>the nature of the audience for which the text was put into writing.
>In Ancient Greece, one important use of writing was to preserve
>spoken language and help students become better orators.
>The claim is made that people didn't read silently until much later.
>
>Another point perhaps of interest: the amount of punctuation
>in the Bible varied greatly from one era to the next depending on the
>intended readership. When it was a homogeneous readership (native
>speaking monks), there was less punctuation; later on, when it
>was a more heterogeneous readership in far-flung countries, there
>tended to be more punctuation per page.
>
>-Jane Edwards
--------------------------------------------------
Keith Suderman
Technical Specialist
American National Corpus
suderman at cs.vassar.edu
http://americannationalcorpus.org