[Corpora-List] ELRA - Language Resources Catalogue - Update

ELRA ELDA Information info at elda.org
Wed Apr 4 16:23:16 UTC 2012


Our apologies if you have received multiple copies of this announcement.

*****************************************************************
ELRA - Language Resources Catalogue - Update
*****************************************************************

ELRA is happy to announce that 1 new Monolingual Lexicon, 3 new Speech 
Resources and 3 new Evaluation Packages are now available in its catalogue.
Moreover, updated versions of the ESTER Corpus, ESTER Evaluation Package 
and Bulgarian WordNet have also been released.

*1) New Language Resources:*

*ELRA-L0088 Arabic Morphological Dictionary*
The Arabic Morphological Dictionary contains 7,912,551 entries, 
including 6,247,291 nouns, 1,537,499 verbs, 127,563 adjectives and 198 
grammatical words. All files are provided as plain text in UTF-8 
character encoding, amounting to about 154 MB of data.
For more information, see: 
http://catalog.elra.info/product_info.php?products_id=1163

*ELRA-S0338 ESTER 2 Corpus*
The ESTER 2 Corpus, produced within the ESTER 2 evaluation campaign, 
consists of a manually transcribed radio broadcast news corpus amounting 
to about 100 hours and quick transcriptions of African radio broadcasts 
amounting to about 6 hours. Named entity annotation is provided within 
the development data (about 6 hours).
For more information, see: 
http://catalog.elra.info/product_info.php?cPath=37_46&products_id=1167 

*ELRA-S0339 Acoustic database for Polish unit selection speech synthesis*
This database contains parliamentary statements and newspaper reviews 
read by a semi-professional male speaker. It consists of a selection of 
2150 annotated and manually verified sentences, including 100 rare 
phonemes in words. The total duration of the recordings is 3.45 hours. 
The database is phonetically annotated and manually corrected, and 
represents a lexicon of 11761 words with phonetic transcription.
For more information, see: 
http://catalog.elra.info/product_info.php?cPath=37_39&products_id=1164 

*ELRA-S0342 Acoustic database for Polish concatenative speech synthesis*
This database consists of 1443 nonsense words including all the 
diphones for the Polish language. The database includes information such 
as: the name of the diphone, the context of the diphone, the phonetic 
transcription in SAMPA, the identifier of the wave file where it is 
placed, and three numbers: the beginning, the middle and the end of the 
diphone.
For more information, see: 
http://catalog.elra.info/product_info.php?cPath=37_39&products_id=1168

*ELRA-E0035 DEFT'08 Evaluation Package*
DEFT (DEfi Fouille de Texte -- Text Mining Challenge) organizes 
evaluation campaigns in the field of text mining. The topic of the DEFT 
2008 edition is the classification of texts by topic and genre. The 
DEFT'08 Evaluation Package makes it possible to compare two corpora of 
different genres (a corpus of newspaper articles extracted from Le Monde 
and a corpus of encyclopaedic articles extracted from the free online 
encyclopaedia Wikipedia) on the basis of the same set of pre-defined 
categories.
For more information, see: 
http://catalog.elra.info/product_info.php?products_id=1165

*ELRA-E0039 CLEF QAST (2007-2009) -- Evaluation Package*
The CLEF QAST (2007-2009) Evaluation Package contains the data used for 
the Question Answering on Speech Transcripts tracks of the CLEF 
campaigns carried out from 2007 to 2009. These tracks tested the 
performance of monolingual Question Answering systems on collections of 
audio transcriptions.
For more information, see: 
http://catalog.elra.info/product_info.php?products_id=1162

*ELRA-E0040 MEDAR Evaluation Package*
The MEDAR Evaluation Package was produced within the project MEDAR 
(MEDiterranean ARabic language and speech technology), supported by the 
European Commission's ICT programme. It aims to enable the evaluation of 
SLT/MT (Machine Translation) systems on translation tasks in the 
English-to-Arabic direction.
For more information, see: 
http://catalog.elra.info/product_info.php?cPath=42_43&products_id=1166 

*2) Updated Language Resources:*

*ELRA-S0241 ESTER Corpus*
/This new release contains 100 hours of orthographically transcribed 
news broadcasts (instead of 60 hours in the previous release)./
The ESTER Corpus is a subset of the ESTER Evaluation Package (catalogue 
ref. ELRA-E0021), which was produced within the French national project 
ESTER (Evaluation of Broadcast News enriched transcription systems), as 
part of the Technolangue programme funded by the French Ministry of 
Research and New Technologies (MRNT). The ESTER project made it possible 
to carry out a campaign for the evaluation of Broadcast News enriched 
transcription systems for French.
For more information, see: 
http://catalog.elra.info/product_info.php?products_id=999

*ELRA-E0021 ESTER Evaluation Package*
/This new release contains 100 hours of orthographically transcribed 
news broadcasts (instead of 60 hours in the previous release)./
The ESTER Evaluation Package was produced within the French national 
project ESTER (Evaluation of Broadcast News enriched transcription 
systems), as part of the Technolangue programme funded by the French 
Ministry of Research and New Technologies (MRNT). The ESTER project 
made it possible to carry out a campaign for the evaluation of Broadcast 
News enriched transcription systems for French.
This package includes the material that was used for the ESTER 
evaluation campaign. It includes resources, protocols, scoring tools, 
results of the campaign, etc., that were used or produced during the 
campaign. The aim of these evaluation packages is to enable external 
players to evaluate their own system and compare their results with 
those obtained during the campaign itself.
The campaign is distributed over three actions: orthographic 
transcription, segmentation and information extraction (named entity 
tracking).
For more information, see: 
http://catalog.elra.info/product_info.php?products_id=995

*ELRA-M0041 Bulgarian WordNet*
/This new release contains 38209 synsets (instead of 23715 synsets in 
the previous release)./
The Bulgarian WordNet is a network of lexical-semantic relations, an 
electronic thesaurus with a structure modelled on that of the Princeton 
WordNet and those constructed in the EuroWordNet and BalkaNet projects. 
The Bulgarian WordNet describes the meaning of a lexical unit by placing 
it within a network of semantic relations, such as hypernymy, meronymy, 
antonymy, etc. It contains 38209 synsets, 83493 literals and 89242 
relations (including 58095 semantic relations and 4172 extralinguistic 
relations).
For more information, see: 
http://catalog.elra.info/product_info.php?cPath=42_45&products_id=802 


For more information on the catalogue, please contact Valérie Mapelli 
(mapelli at elda.org).

Visit our On-line Catalogue: http://catalog.elra.info
Visit the Universal Catalogue: http://universal.elra.info
Archives of ELRA Language Resources Catalogue Updates: 
http://www.elra.info/LRs-Announcements.html