[Corpora-List] News from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Tue Aug 25 15:09:29 UTC 2009
- LDC at Interspeech 2009 in Brighton, UK -

- Arabic English Newswire Translation Collection
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T22> -

- BioProp Version 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T04> -

The Linguistic Data Consortium (LDC) would like to provide information
on our upcoming conference participation and announce the availability
of two new publications.
------------------------------------------------------------------------
*LDC at Interspeech 2009 in Brighton, UK, September 6-10, 2009*
LDC is pleased to announce its participation at Interspeech 2009 in
Brighton, UK. LDC researchers will present papers on the following
topics (conveniently in the same session):
* XTrans: A Speech Annotation and Transcription Tool
  Thursday 10 September 2009, Session 2-O4, 13.30 (paper #3)
* The Broadcast Narrow Band Speech Corpus: A New Resource Type for
  Large Scale Language Recognition
  Thursday 10 September 2009, Session 2-O4, 13.30 (paper #6)
Two papers co-authored by LDC's director, Mark Liberman, will also be
presented:
* Automatic Formant Extraction for Sociolinguistic Analysis of Large
  Corpora (co-authors Keelan Evanini, Stephen Isard)
  Wednesday 9 September 2009, Session 1-P1, 10:00 (paper #3)
* Investigating /l/ Variation in English through Forced Alignment
  (co-author Jiahong Yuan)
  Wednesday 9 September 2009, Session 3-O2, 16:00 (paper #5)
Visit our display in the exhibition hall at the Brighton Centre on
Kings' Road for a special giveaway or just to say hello.
Follow the link for more information on Interspeech 2009
<http://www.interspeech2009.org/>.
*New Publications*
(1) The Arabic English Newswire Translation Collection
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T22>
consists of approximately 550,000 words of Arabic newswire text and its
English translation from Agence France Presse (France), An Nahar
(Lebanon) and Assabah (Tunisia). The source Arabic text was used in
LDC's Arabic Treebank, specifically, in Part 1 (Part 1 v. 2.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T06>;
Part 1 v. 3.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T02>),
Part 3 (Part 3 v. 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T11>;
Part 3 v. 2.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T20>)
and Part 4 (Part 4 v. 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T30>).
A subset of Agence France Presse (AFP) source text from Arabic Treebank:
Part 1 v. 2.0 was previously translated and released by LDC in Arabic
Treebank: Part 1 - 10K-word English Translation, LDC2003T07
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T07>.
The English translations in this corpus were provided by translation
agencies using LDC's Arabic Translation Guidelines.
The number of stories and their epochs for each source are as follows:

Source      Stories   Epoch
AFP         734       July 2000 - November 2000
An Nahar    600       January 2002 - December 2002
Assabah     397       September 2004 - November 2004
Total       1731
Word count of Arabic tokens by source is shown in the following table:

Source      Arabic tokens
AFP         102,564
An Nahar    299,681
Assabah     149,259
-------------------------
Total       551,504
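The per-source figures above can be cross-checked against the stated totals with a few lines of Python. This is only an illustrative sketch: the variable names are made up here, and the tokens-per-story ratio is a derived figure that does not appear in the announcement.

```python
# Figures copied from the two tables above.
stories = {"AFP": 734, "An Nahar": 600, "Assabah": 397}
arabic_tokens = {"AFP": 102_564, "An Nahar": 299_681, "Assabah": 149_259}

total_stories = sum(stories.values())
total_tokens = sum(arabic_tokens.values())

# Derived figure (not stated in the announcement): mean Arabic tokens per story.
tokens_per_story = {src: arabic_tokens[src] / stories[src] for src in stories}

print("stories:", total_stories)
print("tokens:", total_tokens)
for src, ratio in tokens_per_story.items():
    print(f"{src}: {ratio:.1f} tokens per story")
```

Both sums agree with the totals given in the tables (1731 stories, 551,504 tokens).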
The original source files used different encodings for the Arabic
characters, including UTF8 and ASMO. SGML tags were used for marking
sentence and paragraph boundaries and for annotating other information
about each story. All Arabic source data was converted to UTF and most
SGML tags were removed or replaced by "plain text" markers.
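
As a rough illustration of the kind of encoding conversion described above, the sketch below re-encodes legacy Arabic bytes as UTF-8 in Python. It assumes that "ASMO" refers to ASMO 708, which corresponds to ISO 8859-6 (Python codec "iso8859_6"); the function name and sample string are hypothetical, and this is not LDC's actual conversion tooling.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Hypothetical sample: the Arabic word "salam" (seen, lam, alef, meem).
sample = "\u0633\u0644\u0627\u0645"

# Assumption: the legacy data is ASMO 708 / ISO 8859-6, one byte per letter.
legacy_bytes = sample.encode("iso8859_6")
converted = to_utf8(legacy_bytes, "iso8859_6")

print(len(legacy_bytes), "bytes before,", len(converted), "bytes after")
```

Files that were already UTF-8 would need no conversion, so a real pipeline would first have to know or detect the encoding of each source file.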
(2) BioProp Version 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T04>
was developed by researchers at Academia Sinica
<http://www.sinica.edu.tw/main_e.shtml>, Taipei, Taiwan. It consists of
proposition bank-style annotations for approximately 500 English
biomedical journal abstracts. The source abstracts, annotated in
accordance with Penn Treebank II <http://www.cis.upenn.edu/%7Etreebank/>
guidelines, are contained in the GENIA Treebank (GTB). The GTB was
developed at the Tsujii Laboratory
<http://www-tsujii.is.s.u-tokyo.ac.jp/> at the University of Tokyo
<http://www.u-tokyo.ac.jp/index_e.html>.
The purpose of the GENIA Project
<http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA> is to develop tools and
resources for the automatic extraction of biomedical information.
One result of that work is the GENIA corpus, a collection
of 2000 biomedical journal abstracts containing semantic class
annotation for biomedical terms, part-of-speech (POS) tags and
coreferences. The GTB is a subset of that corpus. BioProp Version 1.0
adds a proposition bank to the GTB.
Proposition Bank
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T14>
(PropBank) contains annotations of predicate argument structures and
semantic roles in a treebank schema in the newswire domain. To construct
BioProp Version 1.0, a semantic role labeling (SRL) system trained on
PropBank was used to annotate the GTB. SRL, also called shallow semantic
parsing, is a popular semantic analysis technique. In SRL, sentences are
represented by one or more predicate-argument structures (PAS), also
known as propositions. Each PAS is composed of a predicate (e.g., a
verb) and several arguments (e.g., noun phrases) that have different
semantic roles, including main arguments such as agent and patient, and
adjunct arguments, such as time, manner and location. The term
"argument" refers to a syntactic constituent of the sentence related to
the predicate, and the term "semantic role" refers to the semantic
relationship between a sentence's predicate and argument.
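
The predicate-argument structure described above can be pictured as a small data structure. The Python sketch below is illustrative only: the class names, the example sentence and its labels are hypothetical, the role labels merely follow PropBank conventions (numbered Arg labels for main arguments, ArgM-* for adjuncts), and none of this reflects the actual BioProp file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Argument:
    role: str  # PropBank-style label, e.g. "Arg0", "Arg1", "ArgM-LOC"
    text: str  # the syntactic constituent filling that role


@dataclass
class Proposition:
    predicate: str
    arguments: List[Argument] = field(default_factory=list)

    def main_arguments(self) -> List[Argument]:
        """Numbered arguments such as agent (Arg0) and patient (Arg1)."""
        return [a for a in self.arguments if not a.role.startswith("ArgM")]

    def adjuncts(self) -> List[Argument]:
        """Adjunct arguments such as time, manner and location."""
        return [a for a in self.arguments if a.role.startswith("ArgM")]


# Hypothetical sentence: "The protein activates the gene in T cells."
pas = Proposition(
    predicate="activates",
    arguments=[
        Argument("Arg0", "The protein"),     # agent
        Argument("Arg1", "the gene"),        # patient
        Argument("ArgM-LOC", "in T cells"),  # location adjunct
    ],
)

for arg in pas.arguments:
    print(f"{arg.role:9s} {arg.text}")
```

A sentence with several predicates would simply yield several Proposition objects, one per predicate.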
BioProp Version 1.0 consists of approximately 150,000 words. Each line
in the corpus provides a PAS annotation that can be mapped to a sentence
in the GTB.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu