<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p class="MsoNormal" align="center"><b>- <a href="#work">LDC 20th
Anniversary Workshop</a> -</b></p>
<p class="MsoNormal" align="center"><i>New publications:</i></p>
<p class="MsoNormal" align="center"> <b>- <a href="#name">American
English Nickname Collection</a> -</b></p>
<p class="MsoNormal" align="center"> <b>- <a href="#atb">Arabic
Treebank - Broadcast News v1.0</a> -</b></p>
<p class="MsoNormal" align="center"> <b>- <a href="#cat">Catalan
TimeBank 1.0</a> -</b></p>
<div class="MsoNormal" style="text-align:center" align="center">
<hr align="center" size="2" width="100%"> </div>
<p class="MsoNormal" align="center"><a name="work"></a><b>LDC 20th
Anniversary Workshop </b></p>
<p class="MsoNormal">LDC announces its <b>20th Anniversary Workshop
on Language Resources</b>, to be held in Philadelphia on
September 6-7, 2012. The event will commemorate our anniversary,
reflect on the beginning of language data centers and address the
future of language resources. </p>
<p class="MsoNormal">Workshop themes will include: the developments
in human language technologies (HLT) and associated resources that
have brought us to our current state; the language resources required
by the technical approaches taken and the impact of these
resources on HLT progress; the applications of HLT and resources
to other disciplines including law, medicine, economics, the
political sciences and psychology; the impact of HLT and related
technologies on linguistic analysis and novel approaches in fields
as wide-ranging as phonetics, semantics, language documentation,
sociolinguistics and dialect geography; and finally, the impact of
any of these developments on the ways in which language resources
are created, shared and exploited and on the specific resources
required.</p>
<p class="MsoNormal">Stay tuned for further details.</p>
<p class="MsoNormal" align="center"><b>New publications </b></p>
<p class="MsoNormal"><a name="name"></a>(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T11">American
English
Nickname Collection</a> was developed by <a
href="http://www.intelius.com/corp/">Intelius, Inc.</a> and is a
compilation of American English nickname-to-given-name mappings
based on information in US government records, public web profiles
and financial and property reports. This corpus is intended as a
tool for the quantitative study of nickname usage in the United
States, for example in demographic and sociological studies. </p>
<p class="MsoNormal">The American English Nickname Collection
contains 331,237 distinct mappings encompassing millions of names.
The data was collected and processed through a record linkage
pipeline. The steps in the pipeline were (1) data cleaning, (2)
blocking, (3) pairwise linkage and (4) clustering. In the
cleaning step, material was categorized, processed to remove junk
and spam records and normalized to an approximately common
representation. The blocking step grouped records by shared
properties to determine which record pairs the pairwise linker
should examine as potential duplicates. The linkage step assigned
a score to each record pair using a supervised pairwise machine
learning model. The clustering step combined record pairs into
connected components and further partitioned each connected
component to remove inconsistent pairwise links. As a result, the
input records were partitioned into disjoint sets called profiles,
where each profile corresponded to a single person.</p>
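<p class="MsoNormal">As a rough illustration of the four steps
described above, the following Python sketch runs a toy version of
such a pipeline. It is not the Intelius implementation: the record
fields, the blocking key (last name), the hand-written scoring rule
standing in for the supervised model and the 0.75 threshold are all
assumptions made for illustration, and the further partitioning of
connected components is omitted.</p>

```python
from collections import defaultdict
from itertools import combinations

# Made-up toy records: (record_id, first_name, last_name, zip_code, birth_year).
RECORDS = [
    (1, " William ", "Smith", "19104", 1970),
    (2, "bill", "smith", "19104", 1970),
    (3, "Will", "Smith", "19104", 1970),
    (4, "robert", "jones", "10001", 1985),
    (5, "Bob", "Jones", "10001", 1985),
    (6, "alice", "jones", "10001", 1990),
]


def clean(record):
    """Step 1: normalize text fields to a common representation."""
    rid, first, last, zip_code, year = record
    return (rid, first.strip().lower(), last.strip().lower(), zip_code.strip(), year)


def block(records):
    """Step 2: group records by a shared property (assumed key: last name)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[2]].append(rec)
    return blocks


def score(a, b):
    """Step 3: hypothetical stand-in for the supervised pairwise model."""
    same_zip = a[3] == b[3]
    same_year = a[4] == b[4]
    return 0.5 * same_zip + 0.5 * same_year


def cluster(record_ids, links):
    """Step 4: merge linked pairs into connected components (union-find)."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for rid in record_ids:
        groups[find(rid)].append(rid)
    return sorted(sorted(g) for g in groups.values())


def run_pipeline(records, threshold=0.75):
    """Run all four steps and return profiles as sorted lists of record ids."""
    cleaned = [clean(r) for r in records]
    links = []
    for members in block(cleaned).values():
        # Only pairs inside the same block are scored.
        for a, b in combinations(members, 2):
            if score(a, b) >= threshold:
                links.append((a[0], b[0]))
    return cluster([r[0] for r in cleaned], links)


print(run_pipeline(RECORDS))  # prints [[1, 2, 3], [4, 5], [6]]
```

<p class="MsoNormal">In this toy run, the first profile groups the
records for "william", "bill" and "will", which is the kind of
co-occurrence from which nickname mappings could then be counted.</p>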
<p class="MsoNormal">The material is presented in the form of a
comma-delimited text file. Each line contains a first name, a
nickname or alias, its conditional probability and its frequency.
The conditional probability for each nickname is derived from the
base data using an algorithm that calculates both the probability
that an alias refers to a given name and a threshold below
which the mapping is most likely an error. This threshold
eliminates typographic errors and other noise from the data.</p>
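<p class="MsoNormal">A file in this format could be read along the
following lines. This is a minimal sketch that assumes the four
fields appear in the order listed above, with no header row; the
sample rows, probabilities and function names are invented for
illustration and are not taken from the corpus.</p>

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows in the assumed column order:
# given name, nickname, conditional probability, frequency.
SAMPLE = """\
william,bill,0.80,400
robert,bill,0.02,10
robert,bob,0.95,950
william,will,0.90,450
"""


def load_mappings(text):
    """Return {nickname: [(given_name, probability, frequency), ...]}."""
    mappings = defaultdict(list)
    for given, nick, prob, freq in csv.reader(io.StringIO(text)):
        mappings[nick].append((given, float(prob), int(freq)))
    return mappings


def most_likely_given_name(mappings, nickname):
    """Pick the given name with the highest probability, or None if unknown."""
    candidates = mappings.get(nickname)
    if not candidates:
        return None
    return max(candidates, key=lambda row: row[1])[0]


mappings = load_mappings(SAMPLE)
print(most_likely_given_name(mappings, "bill"))  # prints william
```

<p class="MsoNormal">For a real file, <code>io.StringIO(text)</code>
would be replaced by an open file handle; the actual file name and
layout should be checked against the corpus documentation.</p>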
<p class="MsoNormal">The collection is being made available at no charge.</p>
<p class="MsoNormal" align="center">*</p>
<p class="MsoNormal"><a name="atb"></a>(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T07">Arabic
Treebank
- Broadcast News v1.0</a> was developed at LDC. It consists of
120 transcribed Arabic broadcast news stories with part-of-speech,
morphology, gloss and syntactic tree annotation in accordance with
the <a href="http://projects.ldc.upenn.edu/ArabicTreebank/">Penn
Arabic Treebank (PATB) Morphological and Syntactic Annotation
Guidelines</a>. The ongoing PATB project supports research in
Arabic-language natural language processing and human language
technology development. </p>
<p class="MsoNormal">This release contains 432,976 source tokens
before clitics were split, and 517,080 tree tokens after clitics
were separated for treebank annotation. The source materials are
Arabic broadcast news stories collected by LDC during the period
2005-2008 from the following sources: Abu Dhabi TV, Al Alam News
Channel, Al Arabiya, Al Baghdadya TV, Al Fayha, Alhurra, Al
Iraqiyah, Aljazeera, Al Ordiniyah, Al Sharqiyah, Dubai TV, Kuwait
TV, Lebanese Broadcasting Corp., Oman TV, Radio Sawa, Saudi TV and
Syria TV. The transcripts were produced by LDC.</p>
<br>
<p class="MsoNormal" align="center">*</p>
<p class="MsoNormal"><a name="cat"></a>(3) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T10">Catalan
TimeBank
1.0</a> was developed by researchers at <a
href="http://www.barcelonamedia.org/">Barcelona Media</a> and
consists of Catalan texts in the <a
href="http://clic.ub.edu/corpus/en/ancora">AnCora corpus</a>
annotated with temporal and event information according to the <a
href="http://www.timeml.org/site/index.html">TimeML
specification language</a>. </p>
<p class="MsoNormal">TimeML is a schema for annotating eventualities
and time expressions in natural language as well as the temporal
relations among them, thus facilitating the extraction,
representation and exchange of temporal information. Catalan
TimeBank 1.0 is annotated at three levels, marking events, time
expressions and event metadata. The TimeML annotation scheme was
tailored to the specifics of the Catalan language. Temporal
relations in Catalan present distinctions of verbal mood (e.g.,
indicative, subjunctive, conditional) and grammatical aspect
(e.g., imperfective) which are absent in English. </p>
<p class="MsoNormal">Catalan TimeBank 1.0 contains stand-off
annotations for 210 documents with over 75,800 tokens (including
punctuation marks) and 68,000 tokens (excluding punctuation). The
source documents are from the <a
href="http://www.efe.com/principal.asp?opcion=0&idioma=CATALAN">EFE
news agency</a>, the <a
href="http://www.catalannewsagency.com/aboutus">ACN</a> Catalan
news agency and the Catalan version of the <a
href="http://www.elperiodico.cat/ca/">El Periódico</a>
newspaper, and span the period from January to December 2000. </p>
<p class="MsoNormal">The AnCora corpus is the largest multilayer
annotated corpus of Spanish and Catalan. AnCora contains 400,000
words in Spanish and 275,000 words in Catalan. The AnCora
documents are annotated on many linguistic levels including
structure, syntax, dependencies, semantics and pragmatics. That
information is not included in this release, but it can be mapped
to the present annotations. The corpus is freely available from
the <a href="http://clic.ub.edu/ancora">Centre de Llenguatge i
Computació (CLiC)</a>.</p>
<p class="MsoNormal">The collection is being made available at no charge.</p>
<div class="MsoNormal" style="text-align:center" align="center">
<hr align="center" size="2" width="100%"> </div>
<pre class="moz-signature" cols="72">--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                  Fax: 1 (215) 573-2175
3600 Market St., Suite 810                  <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA                  <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>
</pre>
</body>
</html>