[Corpora] [Corpora-List] New from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed Oct 22 19:31:41 UTC 2014
New publications:
- Chinese Discourse Treebank 0.5
- GALE Arabic-English Word Alignment -- Broadcast Training Part 2
- United Nations Proceedings Speech
------------------------------------------------------------------------
*New publications*
(1) Chinese Discourse Treebank 0.5
<https://catalog.ldc.upenn.edu/LDC2014T21> was developed at Brandeis
University as part of the Chinese Treebank Project
<http://www.cs.brandeis.edu/%7Eclp/ctb/> and consists of approximately
73,000 words of Chinese newswire text annotated for discourse relations.
It follows the lexically grounded approach of the Penn Discourse
Treebank (PDTB) (LDC2008T05 <https://catalog.ldc.upenn.edu/LDC2008T05>)
with adaptations based on the linguistic and statistical characteristics
of Chinese text. Discourse relations are lexically anchored by discourse
connectives (e.g., because, but, therefore), which are viewed as
predicates that take abstract objects such as propositions, events and
states as their arguments. Along with PDTB-style schemes for English,
Turkish, Hindi and Czech, Chinese Discourse Treebank provides an
additional perspective on how the PDTB approach can be extended for
cross-lingual annotation of discourse relations.
Data was selected from the newswire material in Chinese Treebank 8.0
(LDC2013T21 <https://catalog.ldc.upenn.edu/LDC2013T21>), specifically,
from Xinhua News Agency stories. There are approximately 5,500
annotation instances. Following the PDTB format, each annotation
instance consists of 27 vertical-bar-delimited fields. The fields
specify the attributes of the discourse relation as a whole, as well as
the attributes of its two arguments. Not all fields are filled in this
release. Filled fields are indicated by a pair of angle brackets; the
remaining fields are placeholders for future releases.
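
As a rough illustration of the format described above, the following Python sketch splits one annotation instance into its 27 vertical-bar-delimited fields and reports which positions are filled. The sample line and its values are made up for the example, and the sketch simply treats any non-empty field as filled; the actual field names, order and filled-field marking should be taken from the corpus documentation.

```python
EXPECTED_FIELDS = 27  # field count stated in the announcement


def split_annotation(line):
    """Split one annotation instance into its vertical-bar-delimited fields."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
    return fields


def filled_positions(fields):
    """Return the 0-based positions of fields that are not empty.

    Assumption: an unfilled placeholder field is an empty string.
    """
    return [i for i, value in enumerate(fields) if value.strip()]


# Hypothetical example line: 27 fields, only three of them filled.
sample_fields = [""] * EXPECTED_FIELDS
sample_fields[0] = "value-a"
sample_fields[1] = "value-b"
sample_fields[5] = "value-c"
sample_line = "|".join(sample_fields)

fields = split_annotation(sample_line)
print(filled_positions(fields))
```

Checking the field count before reading any value makes a malformed line fail early instead of silently shifting every later field by one position.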
(2) GALE Arabic-English Word Alignment -- Broadcast Training Part 2
<https://catalog.ldc.upenn.edu/LDC2014T22> was developed by LDC and
contains 215,923 tokens of word-aligned Arabic and English parallel text
enriched with linguistic tags. This material was used as training data
in the DARPA GALE (Global Autonomous Language Exploitation) program.
Some approaches to statistical machine translation include the
incorporation of linguistic knowledge in word-aligned text as a means to
improve automatic word alignment and machine translation quality. This
is accomplished with two annotation schemes: alignment and tagging.
Alignment identifies minimum translation units and translation relations
by using minimum-match and attachment annotation approaches. A set of
word tags and alignment link tags are designed in the tagging scheme to
describe these translation units and relations. Tagging adds contextual,
syntactic and language-specific features to the alignment annotation.
This release consists of Arabic source broadcast news and broadcast
conversation data collected by LDC from 2007-2009. The Arabic word
alignment tasks consisted of the following components:
- Normalizing tokenized tokens as needed
- Identifying different types of links
- Identifying sentence segments not suitable for annotation
- Tagging unmatched words attached to other words or phrases
(3) United Nations Proceedings Speech
<https://catalog.ldc.upenn.edu/LDC2014S08> was developed by the United
Nations <http://www.un.org/> (UN) and contains approximately 8,500 hours
of recorded proceedings in the six official UN languages: Arabic,
Chinese, English, French, Russian and Spanish. The data was recorded in
2009-2012 from sessions 64-66 of the General Assembly
<http://www.un.org/en/ga/> (GA) and First Committee
<http://www.un.org/en/ga/first/> (FC) (Disarmament and International
Security), and meetings 6434-6763 of the Security Council
<http://www.un.org/en/sc/>.
Recordings were made using a customized system following a daily,
internally circulated instruction from the Meetings Management Section
<http://www.un.org/depts/DGACM/mms.shtml>. Most of the subjects and
information related to a particular meeting or session are published in
a UN Journal, which can be found here
<http://www.un.org/en/documents/journal.asp>.
Data is presented as either mp3 or flac-compressed wav in 16-bit
single-channel files sampled at either 22,050 or 8,000 Hz, organized by
committee and session number, then language. The folder labeled "Floor"
corresponds to the microphone used by the particular speaker. Those
files may include other languages, for instance, if the speaker's
language was not among the six official UN languages.
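
For readers who decompress a file to plain wav, the Python sketch below shows one way to check that it matches the properties listed above (16-bit, single channel, 22,050 or 8,000 Hz). It uses only the standard-library wave module, which does not read mp3 or flac directly, and it runs on two short synthetic files with hypothetical names rather than on corpus data.

```python
import os
import tempfile
import wave

ALLOWED_RATES = {22050, 8000}  # sampling rates named in the announcement


def matches_description(path):
    """Return True if a wav file is 16-bit, single channel, at an allowed rate."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1
            and wav.getsampwidth() == 2  # 2 bytes per sample = 16 bits
            and wav.getframerate() in ALLOWED_RATES
        )


def write_silence(path, rate, channels=1, seconds=0.1):
    """Write a short silent 16-bit wav file, used here only as test input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))


# Synthetic files with made-up names: one matching, one at a different rate.
tmpdir = tempfile.mkdtemp()
good = os.path.join(tmpdir, "good.wav")
bad = os.path.join(tmpdir, "bad.wav")
write_silence(good, 8000)
write_silence(bad, 44100)

print(matches_description(good), matches_description(bad))
```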
------------------------------------------------------------------------
--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu