[Corpora-List] New from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Tue Jan 29 16:52:06 UTC 2008
LDC2008T03 - ACE 2005 English SpatialML Annotations
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T03>
LDC2008S01 - CSLU: Portland Cellular Telephone Speech Version 1.3
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S01>
LDC2008T01 - Hungarian-English Parallel Text, Version 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T01>
The Linguistic Data Consortium (LDC) is pleased to announce the
availability of three new publications.
------------------------------------------------------------------------
New Publications
(1) The ACE (Automatic Content Extraction) program focuses on
developing automatic content extraction technology to support automatic
processing of human language in text form. The kind of information
recognized and extracted from text includes entities, values, temporal
expressions, relations and events. SpatialML is a mark-up language for
representing spatial expressions in natural language documents.
SpatialML's focus is primarily on geography and culturally-relevant
landmarks, rather than biology, cosmology, geology, or other regions of
the spatial language domain. The goal is to allow for potentially better
integration of text collections with resources such as databases that
provide spatial information about a domain, including gazetteers,
physical feature databases and mapping services. In ACE 2005 English
SpatialML Annotations
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T03>,
the authors applied SpatialML tags to the English training data
(originally annotated for entities, relations and events) in ACE 2005
Multilingual Training Corpus, LDC2006T06
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T06>.
The main SpatialML tag is the PLACE tag. The central goal of SpatialML
is to map PLACE information in text to data from gazetteers and other
databases to the extent possible. Therefore, semantic attributes such as
country abbreviations, country subdivision and dependent area
abbreviations (e.g., US states), and geo-coordinates are used to help
establish such a mapping. LINK and PATH tags express relations between
places, such as inclusion relations and trajectories of various kinds.
To the extent possible, SpatialML leverages ISO and other standards
towards the goal of making the scheme compatible with existing and
future corpora. The SpatialML guidelines are compatible with existing
guidelines for spatial annotation and existing corpora within the ACE
research program.
(2) CSLU: Portland Cellular Telephone Speech Version 1.3
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S01>
was created by the Center for Spoken Language Understanding (CSLU) at
OGI School of Science and Engineering, Oregon Health and Science
University, Beaverton, Oregon. It consists of cellular telephone speech
and corresponding transcripts, specifically, 7,571 utterances from 515
speakers who made calls in the Portland, Oregon area using cellular
telephones.
Speakers called the CSLU data collection system on cellular telephones,
and they were asked to repeat certain phrases and to respond to other
prompts. Two prompt protocols were used: an In Vehicle Protocol for
speakers calling from inside a vehicle and a Not in Vehicle Protocol for
those calling from outside a vehicle. The protocols shared several
questions, but each protocol contained distinct queries designed to
probe the conditions of the caller's in-vehicle or not-in-vehicle
surroundings. Not every caller provided a response to each prompt.
The text transcriptions were produced using the non-time-aligned
word-level conventions described in The CSLU Labeling Guide, which is
included in the documentation for this release. The corpus contains both
orthographic and phonetic transcriptions of corresponding speech files.
(3) Hungarian-English Parallel Text, Version 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T01>
(also known as the "Hunglish Corpus") is a sentence-aligned
Hungarian-English parallel corpus consisting of approximately two
million sentence pairs. The corpus contains additional language
resources for the Hungarian text, including a monolingual corpus,
morphological toolset and aligner. Hungarian-English Parallel Text,
Version 1.0 is a joint work of the Media Research and Education Center
<http://mokk.bme.hu/index_html-en?set_language=en&cl=en> at the Budapest
University of Technology and Economics (BUTE) <http://www.bme.hu/en> and
the Corpus Linguistics Department
<http://www.nytud.hu/depts/corpus/index.html> at the Hungarian Academy
of Sciences Institute of Linguistics <http://www.nytud.hu/eng/index.html>.
Sentence pair (.bi) files consist of tab-separated, matching sentence
pairs. The .bi files do not contain segments where deletion or
contraction occurred. They are also filtered based on quality, so the
raw texts cannot be fully reconstructed from them. Some .bi files were
sorted alphabetically and therefore do not preserve the original
sentence order.
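As a minimal sketch of reading the sentence-pair format described above, the following Python function loads tab-separated pairs and rejects lines that do not have exactly two non-empty fields. The function name read_bi is hypothetical, and the column order (Hungarian first, English second) is an assumption carried over from the .lad description; this announcement does not state it for .bi files, so check the release documentation.

```python
from typing import Iterable, List, Tuple


def read_bi(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Collect sentence pairs from .bi-style lines (hypothetical helper).

    Assumption: each non-blank line holds exactly two tab-separated,
    non-empty fields, Hungarian first and English second.
    """
    pairs = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines, e.g. at the end of a file
        fields = line.split("\t")
        if len(fields) != 2 or not all(field.strip() for field in fields):
            raise ValueError(
                f"line {number}: expected two non-empty tab-separated fields"
            )
        pairs.append((fields[0], fields[1]))
    return pairs
```

The function accepts any iterable of lines, so an open file object can be passed directly. The file encoding is not stated in this announcement and would need to be confirmed before opening the files.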
Alignment "ladder" (.lad) files preserve both input texts in full and in
their original order, including segments that were not successfully
aligned. In .lad files, every line is tab-separated into two columns.
The first is a segment of the Hungarian text. The second is a
(supposedly corresponding) segment of the English text. A segment will
generally consist of exactly one sentence on each side, but it can also
consist of zero sentences or of more than one.
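Because a .lad line may have an empty side, a reader for this format has to accept missing segments instead of rejecting them. The sketch below splits each line on its first tab and counts how many lines carry text on both sides, on the Hungarian side only, or on the English side only. The names read_lad and summarize are hypothetical, and treating an empty column as a zero-sentence segment is an assumption; how multi-sentence segments are delimited is not described here, so the sketch does not try to count sentences.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_lad(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (hungarian, english) segments from .lad-style lines.

    Assumption: an empty column means that side has no segment.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip completely blank lines
        # Split on the first tab only; a missing tab leaves english empty.
        hungarian, _, english = line.partition("\t")
        yield hungarian.strip(), english.strip()


def summarize(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count lines by which side(s) actually contain text."""
    counts = Counter()
    for hungarian, english in pairs:
        if hungarian and english:
            counts["both"] += 1
        elif hungarian:
            counts["hungarian_only"] += 1
        elif english:
            counts["english_only"] += 1
        else:
            counts["empty"] += 1
    return counts
```

Passing the output of read_lad to summarize gives a rough picture of how much of a text pair was left unaligned, which is the information the .bi files discard.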
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu