[Corpora-List] News from the British National Corpus

Lou Burnard lou.burnard at computing-services.oxford.ac.uk
Fri Feb 17 12:41:39 UTC 2006


Some news items from the BNC, hopefully of interest to this list (please 
redistribute)

1. BNC Goes XML
2. New BNC website
3. New release of Xaira
4. A postscript on licensing

1. BNC Goes XML

A decade or more after its first appearance, the British National Corpus 
(BNC) is still the most widely available general-purpose, fully annotated 
English language corpus and is still very widely used. Technology moves 
on, however, and the SGML format which we used in 1994, state of the art 
as it then was, is looking increasingly ancient. More significantly, 
SGML software is not so easy to find or deploy.

For that reason, we have long planned to re-issue the corpus in XML 
format. XML is close enough to SGML for the migration to be painless and 
automatic. Moreover, the range of software available for XML is 
increasing day by day -- very probably even the software you are using 
to read this message can handle it; certainly more and more NLP-related 
tools and resources are produced in it.

The BNC Baby sampler we produced last year was an experimental step in 
the direction of producing BNC-XML. We are now ready to make the big 
leap forward, by converting the whole corpus to XML. The plan is to 
complete this in the next few months and to start distribution of a 
third edition of the BNC early this summer.

Naturally, we would also like to take this opportunity to fix as many as 
possible of the known errors and identifiable glitches in the existing 
corpus. 
We don't have the resources to add more texts or to do a manual proofing 
and correction of the entire corpus, but we can (and will) fix known 
systematic markup errors, tidy up misclassifications, remove duplicate 
texts and so on. Our aim is to fix as many as possible of the errors 
which impair the usefulness of the BNC as a source for generalizations 
about the lexicon, for example where the input stream has been wrongly 
segmented. Because every sentence in the BNC has a unique identifier 
(the combination of text name and sentence number), we think that many 
such errors can be fixed without the need for manual intervention.
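
To illustrate why a unique sentence identifier makes unattended fixes 
feasible, here is a minimal Python sketch. It is not the procedure we 
will actually use: the in-memory dictionaries and the function name 
apply_corrections are hypothetical, and a correction is applied only 
when the text to be replaced occurs exactly once in the sentence.

```python
# Hypothetical data: sentences keyed by (text name, sentence number).
sentences = {
    ("A6B", 891): "<w DPS>his <w NN1>horse'<c PUN>.",
    ("A6B", 892): "<w AT0>an <w AJ0>untouched <w NN1>sentence<c PUN>.",
}

# Hypothetical corrections: identifier -> (text to find, replacement).
corrections = {
    ("A6B", 891): ("<w NN1>horse'", "<w NN1>horse<c PUQ>&equo;"),
    ("A6B", 892): ("no such text", "irrelevant"),
}


def apply_corrections(sentences, corrections):
    """Return (corrected copy, identifiers that need manual review)."""
    fixed = dict(sentences)
    unresolved = []
    for key, (old, new) in corrections.items():
        current = fixed.get(key)
        # Only act when the target is present and unambiguous.
        if current is not None and current.count(old) == 1:
            fixed[key] = current.replace(old, new)
        else:
            unresolved.append(key)
    return fixed, unresolved


fixed, unresolved = apply_corrections(sentences, corrections)
print(fixed[("A6B", 891)])
print(unresolved)
```

Anything the function cannot match unambiguously is returned for a 
human to look at rather than being changed silently.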

You can help us by providing us with information about errors you've 
already noticed. We'd also much appreciate any comments you have about 
overall ways of improving the BNC in its new XML guise. We have plans 
already in hand to address the most frequently voiced concerns (e.g. "how 
do I get rid of the tags?") and will be posting a list of the planned 
changes on the new website in due course.
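
As a rough illustration of how simple "getting rid of the tags" becomes 
once the corpus is XML, the following Python sketch uses only the 
standard library. The fragment is invented for the example: the element 
and attribute names (s, w, c, n, type) are assumptions and should not be 
read as the final BNC-XML markup scheme.

```python
import xml.etree.ElementTree as ET

# Invented fragment; element and attribute names are assumptions.
SAMPLE = (
    '<s n="891">'
    '<w type="DPS">his </w>'
    '<w type="NN1">horse</w>'
    '<c type="PUN">.</c>'
    '</s>'
)


def plain_text(xml_fragment):
    """Return the text content of an XML fragment with all markup removed."""
    root = ET.fromstring(xml_fragment)
    # itertext() yields every piece of character data in document order.
    return "".join(root.itertext())


print(plain_text(SAMPLE))  # prints: his horse.
```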

If you want to send us notice of specific errors and typos, please send 
them by email to natcorp at oucs.ox.ac.uk, preferably in a consistent 
format. Something like the following (for example) would be an ideal way 
of pointing out that the apostrophe after "horse" in s-unit number 891 
of text A6B is in the wrong place:

A6B  891
FOR   <w DPS>his <w NN1>horse'<c PUN>.
READ   <w DPS>his <w NN1>horse<c PUQ>&equo;<c PUN>.
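
A consistent format matters because it lets reports be read by program. 
The Python sketch below shows one way the three-line layout above could 
be parsed; the names ErrorReport and parse_reports are hypothetical, and 
it assumes that each report is a header line (text name, then s-unit 
number) followed by one FOR line and one READ line.

```python
import re
from typing import NamedTuple


class ErrorReport(NamedTuple):
    text: str    # text name, e.g. A6B
    s_unit: int  # s-unit (sentence) number
    wrong: str   # content of the FOR line
    right: str   # content of the READ line


HEADER = re.compile(r"^(\S+)\s+(\d+)$")
FOR_LINE = re.compile(r"^FOR\s+(.+)$")
READ_LINE = re.compile(r"^READ\s+(.+)$")

SAMPLE = r"""
A6B  891
FOR   <w DPS>his <w NN1>horse'<c PUN>.
READ   <w DPS>his <w NN1>horse<c PUQ>&equo;<c PUN>.
"""


def parse_reports(message):
    """Extract every header/FOR/READ triple from the body of a message."""
    lines = [line.strip() for line in message.splitlines() if line.strip()]
    reports = []
    for i in range(len(lines) - 2):
        header = HEADER.match(lines[i])
        wrong = FOR_LINE.match(lines[i + 1])
        right = READ_LINE.match(lines[i + 2])
        if header and wrong and right:
            reports.append(ErrorReport(header.group(1), int(header.group(2)),
                                       wrong.group(1), right.group(1)))
    return reports


for report in parse_reports(SAMPLE):
    print(report.text, report.s_unit)  # prints: A6B 891
```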

Reports of more general errors are also very welcome, of course, and 
should be sent to the same address.

Deadline for sending in reports of BNC errors and typos: 15 March 2006.

  **** There will be a Prize draw for all those who contribute error ****
  **** reports! Be first to get a (free!) copy of the new BNC!       ****

(Yes, if you have already reported a mistake in the past, you can send 
it to us again to be entered for the prize draw!)

2. New BNC website

We (Ylva mostly) have also been working hard on bringing the BNC website 
up to date. This is now also managed in XML, which makes maintaining a 
consistent design easier as well as simplifying the authoring task. 
Please take a look at http://www-dev.natcorp.ox.ac.uk
and give us your feedback at natcorp at oucs.ox.ac.uk -- all being well, we 
will switch the current address to point to this new site within a week 
or two.

3. New release of Xaira

A new release of Xaira, the software which developed out of SARA into a 
general purpose XML corpus query tool, is now available for download 
from  http://xaira.sf.net

This (1.17) is the version we will be using to index the BNC XML 
edition, and which we will distribute with it. Xaira can be used to 
index any XML corpus, not just the BNC; it has also been used for XML 
corpora in Chinese, Sanskrit, Hungarian, and many other languages. 
Xaira will work with any kind of XML markup, not just BNC style. It 
also includes a number of new features which were not possible in SARA, 
notably better facilities for collocation searching and subcorpus 
manipulation.

Xaira will run standalone or networked on 32 bit versions of Windows 
(W2K, XP). A range of interfaces is available for other platforms: the 
server has been installed on various flavours of Unix, including Mac 
OSX. Simple PHP and Java clients are included, demonstrating how Xaira 
can be built in to a web services architecture.


4. A postscript on licensing

* Xaira is open source software licensed under the GNU General Public 
Licence.

* The BNC XML edition will be distributed under the same licensing 
conditions and pricing structure as the current BNC World edition.

* If you took out a licence for the BNC World Edition within six months 
of the date of release of the BNC XML Edition, you will receive a free 
upgrade to the XML Edition and a new licence.

* We expect to maintain support for the BNC World Edition for six months 
after the release date of the BNC XML Edition. Licences for the BNC 
World edition will then start to expire.

Lou Burnard and Ylva Berglund
British National Corpus
Oxford University Computing Services
13 Banbury Rd
Oxford OX2 6NN

Email: natcorp at oucs.ox.ac.uk
Fax: +44 (0)1865 273 275


