[Corpora-List] News from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Fri Jul 27 20:07:06 UTC 2012
*- LDC 20th Anniversary Workshop -*
/New publications:/
*- American English Nickname Collection -*
*- Arabic Treebank - Broadcast News v1.0 -*
*- Catalan TimeBank 1.0 -*
------------------------------------------------------------------------
*LDC 20th Anniversary Workshop*
LDC announces its *20th Anniversary Workshop on Language Resources*, to
be held in Philadelphia on September 6-7, 2012. The event will
commemorate our anniversary, reflect on the beginning of language data
centers and address the future of language resources.
Workshop themes will include:
- the developments in human language technologies (HLT) and associated
  resources that have brought us to our current state;
- the language resources required by the technical approaches taken, and
  the impact of those resources on HLT progress;
- the applications of HLT and language resources to other disciplines,
  including law, medicine, economics, political science and psychology;
- the impact of HLTs and related technologies on linguistic analysis and
  on novel approaches in fields as varied as phonetics, semantics,
  language documentation, sociolinguistics and dialect geography;
- and finally, the impact of all of these developments on the ways in
  which language resources are created, shared and exploited, and on the
  specific resources required.
Stay tuned for further details.
*New publications*
(1) American English Nickname Collection
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T11>
was developed by Intelius, Inc. <http://www.intelius.com/corp/> and is a
compilation of American English nickname-to-given-name mappings based on
information in US government records, public web profiles, and financial
and property reports. The corpus is intended as a tool for the
quantitative study of nickname usage in the United States, for example
in demographic and sociological studies.
The American English Nickname Collection contains 331,237 distinct
mappings encompassing millions of names. The data was collected and
processed through a record linkage pipeline whose steps were (1) data
cleaning, (2) blocking, (3) pairwise linkage and (4) clustering. In the
cleaning step, material was categorized, processed to remove junk and
spam records, and normalized to an approximately common representation.
The blocking step grouped records by shared properties in order to
determine which record pairs should be examined by the pairwise linker
as potential duplicates. The linkage step assigned a score to each
candidate record pair using a supervised pairwise machine learning
model. The clustering step combined linked record pairs into connected
components and further partitioned each component to remove inconsistent
pairwise links. The result is that input records were partitioned into
disjoint sets called profiles, where each profile corresponds to a
single person.
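To make the pipeline shape concrete, here is a minimal Python sketch of
the same general idea: blocking on shared properties, scoring candidate
pairs, and clustering linked pairs into connected components with a
small union-find. The toy records, the blocking key, the string
similarity stand-in for the supervised model and the 0.5 threshold are
all illustrative assumptions, not a description of the Intelius system,
and the final repartitioning of components to remove inconsistent links
is omitted.

    # Toy record-linkage pipeline: blocking, pairwise scoring, clustering.
    # Field names, the blocking key, the similarity heuristic and the 0.5
    # threshold are assumptions for illustration only.
    from collections import defaultdict
    from difflib import SequenceMatcher
    from itertools import combinations

    records = [
        {"id": 1, "first": "Robert",  "last": "Smith", "zip": "19104"},
        {"id": 2, "first": "Bob",     "last": "Smith", "zip": "19104"},
        {"id": 3, "first": "Roberta", "last": "Smyth", "zip": "19104"},
        {"id": 4, "first": "Alice",   "last": "Jones", "zip": "02139"},
    ]

    # Blocking: only records sharing a key (last-name initial + ZIP here)
    # are compared, so not every possible pair has to be scored.
    blocks = defaultdict(list)
    for r in records:
        blocks[(r["last"][0], r["zip"])].append(r)

    def pair_score(a, b):
        """Stand-in for a supervised pairwise model: string similarity."""
        return (SequenceMatcher(None, a["first"], b["first"]).ratio()
                + SequenceMatcher(None, a["last"], b["last"]).ratio()) / 2

    # Pairwise linkage: score candidate pairs within each block.
    links = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if pair_score(a, b) >= 0.5:      # assumed decision threshold
                links.append((a["id"], b["id"]))

    # Clustering: union-find over linked pairs yields connected
    # components, i.e. disjoint "profiles", one per (intended) person.
    parent = {r["id"]: r["id"] for r in records}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in links:
        parent[find(a)] = find(b)

    profiles = defaultdict(list)
    for r in records:
        profiles[find(r["id"])].append(r["id"])
    print(list(profiles.values()))   # two profiles: [1, 2, 3] and [4]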
The material is presented as a comma-delimited text file. Each line
contains a first name, a nickname or alias, its conditional probability
and its frequency. The conditional probability for each nickname is
derived from the base data using an algorithm that calculates both the
probability that a given alias refers to a particular first name and a
threshold below which a mapping is most likely an error. This threshold
eliminates typographic errors and other noise from the data.
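As a small usage sketch, the snippet below reads a comma-delimited file
of that shape and keeps only mappings whose conditional probability
clears a cut-off. The file name, the absence of a header row, the exact
column order and the 0.01 cut-off are assumptions based on the
description above rather than documentation of the release.

    # Hypothetical reader for a nickname file as described above.
    # Assumed layout: first name, alias, conditional probability,
    # frequency; no header row; cut-off of 0.01 chosen for illustration.
    import csv
    from collections import defaultdict

    nicknames = defaultdict(list)   # first name -> [(alias, prob, freq)]
    with open("nicknames.csv", newline="", encoding="utf-8") as f:
        for given, alias, prob, freq in csv.reader(f):
            p = float(prob)
            if p >= 0.01:           # drop likely-noise mappings
                nicknames[given.lower()].append((alias.lower(), p, int(freq)))

    # Rank the aliases of one name by conditional probability.
    for alias, p, freq in sorted(nicknames.get("robert", []),
                                 key=lambda t: t[1], reverse=True)[:5]:
        print(alias, p, freq)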
The collection is being made available at no charge.
(2) Arabic Treebank - Broadcast News v1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T07>
was developed at LDC. It consists of 120 transcribed Arabic broadcast
news stories with part-of-speech, morphology, gloss and syntactic tree
annotation in accordance with the Penn Arabic Treebank (PATB)
Morphological and Syntactic Annotation Guidelines
<http://projects.ldc.upenn.edu/ArabicTreebank/>. The ongoing PATB
project supports research in Arabic-language natural language processing
and human language technology development.
This release contains 432,976 source tokens before clitics were split,
and 517,080 tree tokens after clitics were separated for treebank
annotation. The source materials are Arabic broadcast news stories
collected by LDC during the period 2005-2008 from the following sources:
Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Al Baghdadya TV, Al
Fayha, Alhurra, Al Iraqiyah, Aljazeera, Al Ordiniyah, Al Sharqiyah,
Dubai TV, Kuwait TV, Lebanese Broadcasting Corp., Oman TV, Radio Sawa,
Saudi TV and Syria TV. The transcripts were produced by LDC.
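For readers new to treebank data, the sketch below shows how a
Penn-style bracketed tree can be loaded with NLTK. The toy tree (in
Buckwalter transliteration) is invented for the example, and reading the
annotation as bracketed parses is an assumption for illustration; the
actual file layout and tag set are defined by the PATB guidelines and
the release documentation.

    # Illustrative only: loading a Penn-style bracketed parse with NLTK.
    # The toy tree is invented; "zAr Alr}ys dm$q" roughly transliterates
    # "the president visited Damascus".
    from nltk.tree import Tree

    bracketed = "(S (VP (PV zAr) (NP-SBJ (NOUN Alr}ys)) (NP-OBJ (NOUN dm$q))))"
    tree = Tree.fromstring(bracketed)

    # Terminals with their part-of-speech tags, and the tree structure.
    print(tree.pos())   # [('zAr', 'PV'), ('Alr}ys', 'NOUN'), ('dm$q', 'NOUN')]
    tree.pretty_print()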
(3) Catalan TimeBank 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T10>
was developed by researchers at Barcelona Media
<http://www.barcelonamedia.org/> and consists of Catalan texts in the
AnCora corpus <http://clic.ub.edu/corpus/en/ancora> annotated with
temporal and event information according to the TimeML specification
language <http://www.timeml.org/site/index.html>.
TimeML is a schema for annotating eventualities and time expressions in
natural language, as well as the temporal relations among them, thus
facilitating the extraction, representation and exchange of temporal
information. Catalan TimeBank 1.0 is annotated on three levels, marking
events, time expressions and event metadata. The TimeML annotation
scheme was tailored to the specifics of the Catalan language: Catalan
presents distinctions of verbal mood (e.g., indicative, subjunctive,
conditional) and grammatical aspect (e.g., imperfective) that are absent
in English.
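To make the scheme concrete, the snippet below pulls EVENT, TIMEX3 and
TLINK elements out of a small inline TimeML-style fragment with Python's
standard XML parser. The fragment, its attribute names beyond eid/class
and its values are invented for illustration; the actual release
distributes stand-off annotation files whose exact layout is described
in its documentation.

    # Toy TimeML-style fragment; invented for illustration only.
    import xml.etree.ElementTree as ET

    fragment = """
    <TimeML>
      El president va <EVENT eid="e1" class="OCCURRENCE" mood="INDICATIVE">visitar</EVENT>
      Barcelona el <TIMEX3 tid="t1" type="DATE" value="2000-03-15">15 de marc de 2000</TIMEX3>.
      <TLINK lid="l1" eventInstanceID="e1" relatedToTime="t1" relType="IS_INCLUDED"/>
    </TimeML>
    """

    root = ET.fromstring(fragment.strip())
    for ev in root.iter("EVENT"):
        print("event:", ev.get("eid"), ev.get("class"), ev.get("mood"), ev.text)
    for tx in root.iter("TIMEX3"):
        print("timex:", tx.get("tid"), tx.get("type"), tx.get("value"), tx.text)
    for tl in root.iter("TLINK"):
        print("tlink:", tl.get("eventInstanceID"), tl.get("relType"), tl.get("relatedToTime"))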
Catalan TimeBank 1.0 contains stand-off annotations for 210 documents
with over 75,800 tokens (including punctuation marks) and 68,000 tokens
(excluding punctuation). The source documents are from the EFE news
agency <http://www.efe.com/principal.asp?opcion=0&idioma=CATALAN>, the
Catalan News Agency (ACN) <http://www.catalannewsagency.com/aboutus> and
the Catalan edition of the newspaper El Periódico
<http://www.elperiodico.cat/ca/>, and span the period from January to
December 2000.
The AnCora corpus is the largest multilayer annotated corpus of Spanish
and Catalan. AnCora contains 400,000 words in Spanish and 275,000 words
in Catalan. The AnCora documents are annotated on many linguistic
levels, including structure, syntax, dependencies, semantics and
pragmatics. That information is not included in this release, but it can
be mapped to the present annotations. The corpus is freely available
from the Centre de Llenguatge i Computació (CLiC)
<http://clic.ub.edu/ancora>.
The collection is being made available at no charge.
------------------------------------------------------------------------
--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu