[Corpora-List] News from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Tue Aug 25 15:09:29 UTC 2009


-  LDC at Interspeech 2009 in Brighton, UK  -

-  Arabic English Newswire Translation Collection
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T22>  -

-  BioProp Version 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T04>  -

The Linguistic Data Consortium (LDC) would like to provide information 
on our upcoming conference participation and announce the availability 
of two new publications.

------------------------------------------------------------------------

*LDC at Interspeech 2009 in Brighton, UK, September 6-10, 2009*


LDC is pleased to announce its participation at Interspeech 2009 in 
Brighton, UK. LDC researchers will present papers on the following 
topics (conveniently in the same session):

    * XTrans: A Speech Annotation and Transcription Tool

            Thursday 10 September 2009, Session 2-O4, 13:30 (paper #3)

    * The Broadcast Narrow Band Speech Corpus: A New Resource Type for
      Large Scale Language Recognition

            Thursday 10 September 2009, Session 2-O4, 13:30 (paper #6)

Two papers co-authored by LDC's director, Mark Liberman, will also be 
presented:

    * Automatic Formant Extraction for Sociolinguistic Analysis of Large
      Corpora (co-authors Keelan Evanini, Stephen Isard)

            Wednesday 9 September 2009, Session 1-P1, 10:00 (paper #3)

    * Investigating /l/ Variation in English through Forced Alignment
      (co-author Jiahong Yuan)

            Wednesday 9 September 2009, Session 3-O2, 16:00 (paper #5)

Visit our display in the exhibition hall at the Brighton Centre on 
King's Road for a special giveaway or just to say hello.

Follow the link for more information on Interspeech 2009 
<http://www.interspeech2009.org/>.


*New Publications*


(1) The Arabic English Newswire Translation Collection 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T22> 
consists of approximately 550,000 words of Arabic newswire text and its 
English translation from Agence France Presse (France), An Nahar 
(Lebanon) and Assabah (Tunisia). The source Arabic text was used in 
LDC's Arabic Treebank, specifically, in Part 1 (Part 1 v. 2.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T06>; 
Part 1 v. 3.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T02>), 
Part 3 (Part 3 v. 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T11>; 
Part 3 v. 2.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T20>) 
and Part 4 (Part 4 v. 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T30>). 
A subset of Agence France Presse (AFP) source text from Arabic Treebank: 
Part 1 v. 2.0 was previously translated and released by LDC in Arabic 
Treebank: Part 1 - 10K-word English Translation, LDC2003T07 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T07>. 
The English translations in this corpus were provided by translation 
agencies using LDC's Arabic Translation Guidelines.

The number of stories and their epochs for each source are as follows:

Source      Stories   Epoch
AFP             734   July 2000 - November 2000
An Nahar        600   January 2002 - December 2002
Assabah         397   September 2004 - November 2004
Total          1731

Word count of Arabic tokens by source is shown in the following table:

Source      Arabic tokens
AFP               102,564
An Nahar          299,681
Assabah           149,259
-------------------------
Total             551,504

The original source files used different encodings for the Arabic 
characters, including UTF-8 and ASMO. SGML tags were used for marking 
sentence and paragraph boundaries and for annotating other information 
about each story. All Arabic source data was converted to UTF-8 and most 
SGML tags were removed or replaced by "plain text" markers.
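
As a rough illustration of the encoding step described above, the 
following Python sketch re-encodes bytes from a legacy Arabic encoding 
as UTF-8. It is not LDC's actual conversion code. It assumes that "ASMO" 
refers to ASMO 708, which corresponds to ISO-8859-6 (Python codec 
"iso8859_6"); the function name to_utf8 and the sample bytes are 
hypothetical.

```python
# Sketch only: assumes "ASMO" means ASMO 708, i.e. the ISO-8859-6 layout.
# The helper name and the sample data are illustrative, not from the corpus.

def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw bytes from source_encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# The Arabic word "salam" in ISO-8859-6: seen, lam, alef, meem.
sample = bytes([0xD3, 0xE4, 0xC7, 0xE5])
converted = to_utf8(sample, "iso8859_6")

# Input that is already UTF-8 passes through unchanged.
already_utf8 = "\u0633\u0644\u0627\u0645".encode("utf-8")
unchanged = to_utf8(already_utf8, "utf-8")

print(converted.decode("utf-8"))
```

In practice each source file's encoding would have to be known or 
detected before a call like this, since decoding with the wrong codec 
silently yields the wrong characters.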



(2) BioProp Version 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T04> 
was developed by researchers at Academia Sinica 
<http://www.sinica.edu.tw/main_e.shtml>, Taipei, Taiwan. It consists of 
proposition bank-style annotations for approximately 500 English 
biomedical journal abstracts. The source abstracts, annotated in 
accordance with Penn Treebank II <http://www.cis.upenn.edu/%7Etreebank/> 
guidelines, are contained in the GENIA Treebank (GTB). The GTB was 
developed at the Tsujii Laboratory 
<http://www-tsujii.is.s.u-tokyo.ac.jp/> at the University of Tokyo 
<http://www.u-tokyo.ac.jp/index_e.html>.

The purpose of the GENIA Project 
<http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA> is to develop tools and 
resources for the automatic extraction of biomedical information. One 
result of that work is the GENIA corpus, a collection 
of 2000 biomedical journal abstracts containing semantic class 
annotation for biomedical terms, part-of-speech (POS) tags and 
coreferences. The GTB is a subset of that corpus. BioProp Version 1.0 
adds a proposition bank to the GTB.

Proposition Bank 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T14> 
(PropBank) contains annotations of predicate argument structures and 
semantic roles in a treebank schema in the newswire domain. To construct 
BioProp Version 1.0, a semantic role labeling (SRL) system trained on 
PropBank was used to annotate the GTB. SRL, also called shallow semantic 
parsing, is a popular semantic analysis technique. In SRL, sentences are 
represented by one or more predicate-argument structures (PAS), also 
known as propositions. Each PAS is composed of a predicate (e.g., a 
verb) and several arguments (e.g., noun phrases) that have different 
semantic roles, including main arguments such as agent and patient, and 
adjunct arguments, such as time, manner and location. The term 
"argument" refers to a syntactic constituent of the sentence related to 
the predicate, and the term "semantic role" refers to the semantic 
relationship between a sentence's predicate and argument.
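
To make the notion of a predicate-argument structure concrete, here is 
a minimal Python sketch of one way to hold a PAS in memory. The class 
names, the example sentence and its labeling are hypothetical 
illustrations; this is not the BioProp file format. The role labels 
(ARG0, ARG1, ARGM-LOC) follow PropBank's naming convention.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Argument:
    role: str  # semantic role label, e.g. "ARG0", "ARG1", "ARGM-LOC"
    text: str  # the syntactic constituent that fills the role


@dataclass
class PAS:
    predicate: str  # usually a verb
    arguments: List[Argument] = field(default_factory=list)

    def roles(self) -> Dict[str, str]:
        """Map each role label to the text of its constituent."""
        return {arg.role: arg.text for arg in self.arguments}


# Hypothetical sentence: "IL-2 activates T cells in vitro"
pas = PAS(
    predicate="activates",
    arguments=[
        Argument("ARG0", "IL-2"),          # main argument: agent
        Argument("ARG1", "T cells"),       # main argument: patient
        Argument("ARGM-LOC", "in vitro"),  # adjunct argument: location
    ],
)

print(pas.roles()["ARG0"])
```

A sentence with several predicates would simply yield several such PAS 
objects, one per predicate.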

BioProp Version 1.0 consists of approximately 150,000 words. Each line 
in the corpus provides a PAS annotation that can be mapped to a sentence 
in the GTB.

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu
