[Corpora-List] Punctuation

Nancy Ide ide at cs.vassar.edu
Tue Jan 11 23:22:07 UTC 2005


The American National Corpus is being represented using an XML format
in which the original formatting is preserved in attributes, so in
general you should be able to determine where scare quotes were used.
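
For plain text without such markup, a rough heuristic in the spirit of the
regular-expression search Eric describes below might serve as a starting
point. This is only a sketch under stated assumptions, not part of the ANC
or BNC tools: the function name `scare_quote_candidates` and the `max_words`
cutoff are hypothetical, and it assumes that short double-quoted spans are
more likely to be scare quotes than reported speech.

```python
import re

# Matches a span enclosed in straight double quotes on a single line.
# Single quotes are ignored here because they collide with apostrophes.
QUOTED = re.compile(r'"([^"\n]+)"')


def scare_quote_candidates(text, max_words=3):
    """Return short double-quoted spans as scare-quote candidates.

    Assumption (hypothetical heuristic): a quoted span of at most
    `max_words` words is a candidate; longer spans are treated as
    reported speech and skipped.
    """
    candidates = []
    for match in QUOTED.finditer(text):
        inner = match.group(1).strip()
        if inner and len(inner.split()) <= max_words:
            candidates.append(inner)
    return candidates


sample = (
    'The data will be much "cleaner" now. '
    'She said, "I will be there at noon tomorrow."'
)
print(scare_quote_candidates(sample))
```

The output would still need manual checking: unbalanced quotation marks can
pair up wrongly, and short titles or one-word direct quotations will also be
returned.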

The ANC First Release of 11 million words is available from the
Linguistic Data Consortium (ldc at ldc.upenn.edu) for $75 for research
use. However, within a couple of months a second release of approx. 20
million words, which includes the 11 million words of the First
Release, will be available. The 1st release data included in the 2nd
release will be much "cleaner" and many errors will have been fixed.

Also, very soon (within a month), Mark Davies' web-based search and
retrieval software for the BNC will handle the ANC 1st release as well.
The URL for his software is http://view.byu.edu.

Nancy Ide

On Jan 11, 2005, at 11:56 AM, Eric Atwell wrote:

> Tim,
> most English corpora since the pioneering Brown and LOB in the 1960s
> have included punctuation, so any of these might do.
> The British National Corpus from the 1990s has the advantage of a
> www-based trial search; you can "try before you buy" at
> http://sara.natcorp.ox.ac.uk/lookup.html
>
> For example I tried search term {'|"}
> - regular expression finding all occurrences of ' or "
> (usage depends on original sources so there is no corpus-wide
>  standardised punctuation)
>
> I'm not sure how to identify all and only scare quotes via such regular
> expressions... good luck!
>
> Eric Atwell, School of Computing, Leeds University
>
>
> On Tue, 11 Jan 2005, Grant, T. wrote:
>
>> I'm looking for a freely accessible English language corpus which
>> allows analysis of punctuation marks - I'm interested for example in
>> examining the use of scare quotes.
>>
>> Any ideas gratefully received.
>>
>> Tim
>>
>> ______________________________________
>> Tim Grant
>> Forensic Section - School of Psychology
>> University of Leicester
>> 106 New Walk
>> Leicester LE1 7EA
>> UK
>>
>> TG21 at leicester.ac.uk
>> http://www.le.ac.uk/psychology/tg21/
>>
>> + 44(0)116 252 3658 (Direct Line) - + 44(0)116 252 2451 (Secretary) -
>> + 44(0)116 252 3994 (Fax)
>>
>>
>>
>
> --
> Eric Atwell, Senior Lecturer, Computer Vision and Language research group,
> School of Computing, University of Leeds, LEEDS LS2 9JT, England
> TEL: +44-113-2335430  FAX: +44-113-2335468
> http://www.comp.leeds.ac.uk/eric
>
>
=======================================================

Nancy Ide

Professor of Computer Science
Vassar College
Poughkeepsie, NY 12604-0520 USA
Tel: +1 845 437-5988 Fax: +1 845 437-7498
ide at cs.vassar.edu

Chercheur Associe
Equipe Langue et Dialogue, LORIA/CNRS
Campus Scientifique - BP 239
54506 Vandoeuvre-les-Nancy FRANCE
Tel: +33 (0)3 83 59 20 47 Fax: +33 (0)3 83 41 30 79
ide at loria.fr

=======================================================


