[Corpora-List] COMPARA version 5.0 - announcement
Santos Diana
Diana.Santos at sintef.no
Mon Nov 10 09:46:56 UTC 2003
Dear all,
We are pleased to announce version 5.0 of COMPARA, with over one million
words of English and Portuguese parallel texts.
COMPARA is an extensible bidirectional parallel corpus of English and
Portuguese that is freely accessible at http://www.linguateca.pt/COMPARA/.
The corpus has been continuously improved since its first version back in
2000. Version 5.0 is the result of an extensive revision of the corpus and
its encoding.
The corpus is encoded in the IMS Corpus Workbench system and is searchable
via the DISPARA Web interface. Alignment is based on the source-text
sentence and allows users to search for sentences that have been joined,
split, added to, deleted from, and reordered in translation. Other
searchable features are translators' notes, foreign words, titles, emphasis
and named entities.
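As a rough illustration of the alignment categories mentioned above, the sketch below classifies an alignment unit by how many source and translation sentences it contains. The AlignmentUnit class and the classify function are hypothetical names used only for this example; they are not COMPARA's actual encoding or the DISPARA interface, and reordering is not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignmentUnit:
    """Hypothetical alignment unit: source sentences and their translations."""
    source: List[str]  # sentences from the original text
    target: List[str]  # sentences from the translation


def classify(unit: AlignmentUnit) -> str:
    """Label a unit by the number of sentences on each side (assumed scheme)."""
    n_src, n_tgt = len(unit.source), len(unit.target)
    if n_src == 0 and n_tgt > 0:
        return "added"       # translation text with no source sentence
    if n_src > 0 and n_tgt == 0:
        return "deleted"     # source sentence left untranslated
    if n_src == 1 and n_tgt == 1:
        return "one-to-one"
    if n_src == 1 and n_tgt > 1:
        return "split"       # one source sentence became several
    if n_src > 1 and n_tgt == 1:
        return "joined"      # several source sentences became one
    return "other"


# Invented example sentences, for illustration only.
units = [
    AlignmentUnit(["He left."], ["Ele saiu."]),
    AlignmentUnit(["She smiled and waved."], ["Ela sorriu.", "Depois acenou."]),
    AlignmentUnit(["It rained.", "We stayed in."], ["Choveu e ficámos em casa."]),
    AlignmentUnit(["Quite."], []),
]
labels = [classify(u) for u in units]
```

Searching for split or joined sentences then amounts to filtering units by label.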
Version 5.0 contains 39 aligned text extracts of published fiction by 27
different authors from Angola, Brazil, Mozambique, Portugal, South Africa,
the United Kingdom and the United States, and 25 more texts are in the
processing queue.
New features in COMPARA version 5.0 include:
- the encoding of single and double quotes was revised in all texts
(quotes are now distinct from apostrophes)
- a new semantics was given to the structural markup <foreign>,
<title> and <emph>, and a new category was added, <named> (for named
entities)
- a new procedure for sentence definition, regarding the colon, was
enforced
- a better and more complete display of the results, as well as of the
corpus overview, was implemented
- the random selection of hits to be displayed was improved
- a new search and display feature was added: original vs. translated
text
Ana Frankenberg-Garcia & Diana Santos
compara at linguateca.pt
www.linguateca.pt/COMPARA/
====================================
Diana Santos, Diana.Santos at sintef.no
Linguateca, http://www.linguateca.pt
SINTEF Telecom & Informatics
Pb 124 Blindern, N-0314 Oslo, Norway