[Corpora-List] New from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Tue Jan 29 16:52:06 UTC 2008


LDC2008T03
-  ACE 2005 English SpatialML Annotations
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T03>

LDC2008S01
-  CSLU: Portland Cellular Telephone Speech Version 1.3
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S01>

LDC2008T01
-  Hungarian-English Parallel Text, Version 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T01>

The Linguistic Data Consortium (LDC) is pleased to announce the 
availability of three new publications.

------------------------------------------------------------------------

*New Publications*

(1)  The ACE (Automatic Content Extraction) program develops content 
extraction technology to support automatic processing of human language 
in text form. The kind of information 
recognized and extracted from text includes entities, values, temporal 
expressions, relations and events. SpatialML is a mark-up language for 
representing spatial expressions in natural language documents. 
SpatialML's focus is primarily on geography and culturally relevant 
landmarks, rather than biology, cosmology, geology, or other regions of 
the spatial language domain. The goal is to allow for potentially better 
integration of text collections with resources such as databases that 
provide spatial information about a domain, including gazetteers, 
physical feature databases and mapping services. In ACE 2005 English 
SpatialML Annotations 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T03>, 
the authors applied SpatialML tags to the English training data 
(originally annotated for entities, relations and events) in ACE 2005 
Multilingual Training Corpus, LDC2006T06 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T06>.

The main SpatialML tag is the PLACE tag. The central goal of SpatialML 
is to map PLACE information in text to data from gazetteers and other 
databases to the extent possible. Therefore, semantic attributes such as 
country abbreviations, country subdivision and dependent area 
abbreviations (e.g., US states), and geo-coordinates are used to help 
establish such a mapping. LINK and PATH tags express relations between 
places, such as inclusion relations and trajectories of various kinds. 
To the extent possible, SpatialML builds on ISO and other standards so 
that the scheme is compatible with existing and future corpora. The 
SpatialML guidelines are compatible with existing 
guidelines for spatial annotation and existing corpora within the ACE 
research program.
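
As a rough illustration of how PLACE annotations might be read 
programmatically, the sketch below pulls PLACE elements out of a small 
inline sample. The sample document, the DOC wrapper element, and the 
attribute names (id, country, state) are assumptions made for this 
example only and are not taken from the SpatialML guidelines or from 
the corpus files, which should be consulted for the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the wrapper element and attribute names are
# illustrative assumptions, not copied from the SpatialML specification.
SAMPLE = (
    '<DOC>He flew from <PLACE id="p1" country="US" state="OR">Portland'
    '</PLACE> to <PLACE id="p2" country="HU">Budapest</PLACE>.</DOC>'
)


def extract_places(xml_text):
    """Return one dict per PLACE element: its text plus its attributes."""
    root = ET.fromstring(xml_text)
    return [
        {"text": "".join(el.itertext()), **el.attrib}
        for el in root.iter("PLACE")
    ]


if __name__ == "__main__":
    for place in extract_places(SAMPLE):
        print(place)
```

Attributes such as country codes are what a downstream tool would use 
to look a place up in a gazetteer; that lookup is left out here.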



(2)  CSLU: Portland Cellular Telephone Speech Version 1.3 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S01> 
was created by the Center for Spoken Language Understanding (CSLU) at 
OGI School of Science and Engineering, Oregon Health and Science 
University, Beaverton, Oregon. It consists of cellular telephone speech 
and corresponding transcripts: 7,571 utterances from 515 
speakers who made calls in the Portland, Oregon area using cellular 
telephones.

Speakers called the CSLU data collection system on cellular telephones, 
and they were asked to repeat certain phrases and to respond to other 
prompts. Two prompt protocols were used: an In Vehicle Protocol for 
speakers calling from inside a vehicle and a Not in Vehicle Protocol for 
those calling from outside a vehicle. The protocols shared several 
questions, but each protocol contained distinct queries designed to 
probe the conditions of the caller's in-vehicle or not-in-vehicle 
surroundings. Not every caller provided a response to each prompt.

The text transcriptions were produced using the non-time-aligned 
word-level conventions described in The CSLU Labeling Guide, which is 
included in the documentation for this release. The corpus contains both 
orthographic and phonetic transcriptions of corresponding speech files.


(3)  Hungarian-English Parallel Text, Version 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T01> 
(also known as the "Hunglish Corpus") is a sentence-aligned 
Hungarian-English parallel corpus consisting of approximately two 
million sentence pairs. The corpus contains additional language 
resources for the Hungarian text, including a monolingual corpus, 
morphological toolset and aligner.  Hungarian-English Parallel Text, 
Version 1.0 is a joint work of the Media Research and Education Center 
<http://mokk.bme.hu/index_html-en?set_language=en&cl=en> at the Budapest 
University of Technology and Economics (BUTE) <http://www.bme.hu/en> and 
the Corpus Linguistics Department 
<http://www.nytud.hu/depts/corpus/index.html> at the Hungarian Academy 
of Sciences Institute of Linguistics <http://www.nytud.hu/eng/index.html>.

Sentence pair (.bi) files consist of tab-separated, matching sentence 
pairs. The .bi files do not contain segments where deletion or 
contraction occurred. They are also filtered based on quality, so the 
raw texts cannot be fully reconstructed from them. Some .bi files were 
shuffled (sorted alphabetically).
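
A minimal sketch of reading such a file, assuming one sentence pair per 
line with the Hungarian sentence in the first column (the column order 
is stated above only for .lad files, so it is an assumption here). The 
function name read_bi is hypothetical, and the input is expected to be 
already decoded text, since the file encoding is not stated here.

```python
def read_bi(lines):
    """Yield (hungarian, english) tuples from tab-separated lines.

    Assumes the Hungarian sentence comes first; lines that do not
    contain exactly two columns are skipped.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        columns = line.split("\t")
        if len(columns) != 2:
            continue
        yield columns[0], columns[1]


if __name__ == "__main__":
    sample = ["Jó napot.\tGood day.\n", "Köszönöm.\tThank you.\n"]
    for hungarian, english in read_bi(sample):
        print(hungarian, "|", english)
```

Any iterable of strings works as input, including an open file object.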

Alignment "ladder" (.lad) files preserve both input texts in full and 
in their original order, including segments that were not successfully 
aligned. In .lad files, every line is tab-separated into two columns. 
The first is a segment of the Hungarian text. The second is a 
(supposedly corresponding) segment of the English text. A segment on 
either side usually consists of exactly one sentence, but can also 
consist of zero sentences or more than one.
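
The two-column layout can be read with a few lines of code. The sketch 
below keeps every row, including rows where one side is empty (a 
segment of zero sentences), and shows one way to filter down to rows 
with text on both sides. The function names read_lad and aligned_rows 
are hypothetical, and the code does not attempt to count sentences 
within a segment, because no sentence delimiter is specified here.

```python
def read_lad(lines):
    """Yield (hungarian_segment, english_segment) for every line.

    A missing or empty column becomes an empty string, so segments
    that have no counterpart on the other side are preserved.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        columns = line.split("\t")
        hungarian = columns[0]
        english = columns[1] if len(columns) > 1 else ""
        yield hungarian, english


def aligned_rows(rows):
    """Keep only rows where both sides contain some text."""
    return [(hu, en) for hu, en in rows if hu.strip() and en.strip()]


if __name__ == "__main__":
    sample = ["Szia.\tHi.\n", "Csak magyarul.\t\n", "\tEnglish only.\n"]
    rows = list(read_lad(sample))
    print(len(rows), "rows,", len(aligned_rows(rows)), "with both sides")
```

Because read_lad preserves order and unaligned segments, joining the 
first column of every row gives back the Hungarian side of the input.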


------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu
