[Corpora] [Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Wed Oct 22 19:31:41 UTC 2014


New publications:

- Chinese Discourse Treebank 0.5

- GALE Arabic-English Word Alignment -- Broadcast Training Part 2

- United Nations Proceedings Speech
------------------------------------------------------------------------
New publications

(1) Chinese Discourse Treebank 0.5 
<https://catalog.ldc.upenn.edu/LDC2014T21> was developed at Brandeis 
University as part of the Chinese Treebank Project 
<http://www.cs.brandeis.edu/%7Eclp/ctb/> and consists of approximately 
73,000 words of Chinese newswire text annotated for discourse relations. 
It follows the lexically grounded approach of the Penn Discourse 
Treebank (PDTB) (LDC2008T05 <https://catalog.ldc.upenn.edu/LDC2008T05>) 
with adaptations based on the linguistic and statistical characteristics 
of Chinese text. Discourse relations are lexically anchored by discourse 
connectives (e.g., because, but, therefore), which are viewed as 
predicates that take abstract objects such as propositions, events and 
states as their arguments. Along with PDTB-style schemes for English, 
Turkish, Hindi and Czech, Chinese Discourse Treebank provides an 
additional perspective on how the PDTB approach can be extended for 
cross-lingual annotation of discourse relations.

Data was selected from the newswire material in Chinese Treebank 8.0 
(LDC2013T21 <https://catalog.ldc.upenn.edu/LDC2013T21>), specifically, 
from Xinhua News Agency stories. There are approximately 5,500 
annotation instances. Following the PDTB format, each annotation 
instance consists of 27 vertical bar delimited fields. The fields 
specify the attributes of the discourse relation as a whole, as well as 
the attributes of its two arguments. Not all fields are filled in this 
release. Filled fields are indicated by a pair of angle brackets; the 
remaining fields are placeholders for future releases.
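
As a rough illustration of the record layout described above, the following Python sketch splits one vertical-bar-delimited line and reports which positions carry a value. It only assumes the 27-field count stated above; the sample record and its values are hypothetical, no field names are implied, and "filled" here simply means non-empty. The release documentation remains the authority on how filled fields are actually marked.

```python
EXPECTED_FIELDS = 27  # field count stated in the announcement


def split_record(line):
    """Split one annotation line on vertical bars and check the field count."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(
            "expected %d fields, got %d" % (EXPECTED_FIELDS, len(fields))
        )
    return fields


def filled_positions(fields):
    """Return the zero-based positions of fields that are not empty."""
    return [i for i, value in enumerate(fields) if value.strip()]


# Hypothetical record: the values and their positions are invented for
# illustration and do not reflect the real field order in the corpus.
sample = "|".join(["Explicit", "", "", "because"] + [""] * 23)

fields = split_record(sample)
print(len(fields), filled_positions(fields))
```

Checking the field count first makes a truncated or mis-split line fail loudly instead of silently shifting later fields.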



(2) GALE Arabic-English Word Alignment -- Broadcast Training Part 2 
<https://catalog.ldc.upenn.edu/LDC2014T22> was developed by LDC and 
contains 215,923 tokens of word aligned Arabic and English parallel text 
enriched with linguistic tags. This material was used as training data 
in the DARPA GALE (Global Autonomous Language Exploitation) program. 
Some approaches to statistical machine translation include the 
incorporation of linguistic knowledge in word aligned text as a means to 
improve automatic word alignment and machine translation quality. This 
is accomplished with two annotation schemes: alignment and tagging. 
Alignment identifies minimum translation units and translation relations 
by using minimum-match and attachment annotation approaches. A set of 
word tags and alignment link tags are designed in the tagging scheme to 
describe these translation units and relations. Tagging adds contextual, 
syntactic and language-specific features to the alignment annotation.

This release consists of Arabic source broadcast news and broadcast 
conversation data collected by LDC from 2007 to 2009. The Arabic word 
alignment tasks consisted of the following components:

- Normalizing tokenized tokens as needed

- Identifying different types of links

- Identifying sentence segments not suitable for annotation

- Tagging unmatched words attached to other words or phrases


(3) United Nations Proceedings Speech 
<https://catalog.ldc.upenn.edu/LDC2014S08> was developed by the United 
Nations <http://www.un.org/> (UN) and contains approximately 8,500 hours 
of recorded proceedings in the six official UN languages, Arabic, 
Chinese, English, French, Russian and Spanish. The data was recorded in 
2009-2012 from sessions 64-66 of the General Assembly 
<http://www.un.org/en/ga/> (GA) and First Committee 
<http://www.un.org/en/ga/first/> (FC) (Disarmament and International 
Security), and meetings 6434-6763 of the Security Council 
<http://www.un.org/en/sc/>.

Recordings were made using a customized system, following a daily 
instruction circulated internally by the Meetings Management Section 
<http://www.un.org/depts/DGACM/mms.shtml>. Most of the subjects and 
information related to a particular meeting or session are published in 
the UN Journal, which can be found here 
<http://www.un.org/en/documents/journal.asp>.

Data is presented as either mp3 or flac compressed wav. Files are 16-bit 
single channel at either 22,050 or 8,000 Hz and are organized by committee 
and session number, then language. The folder labeled "Floor" indicates 
the microphone used by the particular speaker. Those files may include 
other languages, for instance, if the speaker's language was not among 
the six official UN languages.
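
For readers who want to sanity-check audio against the format described above, here is a minimal Python sketch using only the standard library. It assumes a file has already been decoded to plain WAV (decoding mp3 or flac needs an external tool and is not shown), and it checks the header against the stated properties: 16-bit, single channel, 22,050 or 8,000 Hz. The function names are hypothetical, and the test audio is synthetic silence generated in memory, not corpus data.

```python
import io
import wave

ALLOWED_RATES = {8000, 22050}  # sample rates stated in the announcement


def matches_stated_format(fileobj):
    """Return True if a WAV stream is 16-bit, mono, at an allowed rate."""
    with wave.open(fileobj, "rb") as wav:
        return (
            wav.getnchannels() == 1
            and wav.getsampwidth() == 2  # 2 bytes per sample = 16-bit
            and wav.getframerate() in ALLOWED_RATES
        )


def make_silent_wav(rate, channels=1, seconds=0.1):
    """Build a short 16-bit silent WAV in memory for testing the check."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))
    buf.seek(0)
    return buf


print(matches_stated_format(make_silent_wav(8000)))
print(matches_stated_format(make_silent_wav(44100)))
```

The same check can be pointed at a real decoded file by passing its path to `matches_stated_format` in place of the in-memory buffer.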



------------------------------------------------------------------------

-- 

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu
