<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <p class="MsoNormal" align="center"><b>- <a href="#work">LDC 20th

          Anniversary Workshop</a>  -</b></p>

    <p class="MsoNormal" align="center"><i>New publications:</i></p>

    <p class="MsoNormal" align="center"> <b>-  <a href="#name">American

          English Nickname Collection</a>  -</b></p>

    <p class="MsoNormal" align="center"> <b>-  <a href="#atb">Arabic

          Treebank - Broadcast News v1.0</a>  -</b></p>

    <p class="MsoNormal" align="center"> <b>-  <a href="#cat">Catalan

          TimeBank 1.0</a>  -</b></p>

    <div class="MsoNormal" style="text-align:center" align="center">

      <hr align="center" size="2" width="100%"> </div>

    <p class="MsoNormal" align="center"><a name="work"></a><b>LDC 20th

        Anniversary Workshop </b></p>

    <p class="MsoNormal">LDC announces its <b>20th Anniversary Workshop

        on Language Resources</b>, to be held in Philadelphia on

      September 6-7, 2012. The event will commemorate our anniversary,

      reflect on the beginning of language data centers and address the

      future of language resources. </p>

    <p class="MsoNormal">Workshop themes will include: the developments

      in human language technologies and associated resources that have

      brought us to our current state; the language resources required

      by the technical approaches taken and the impact of these

      resources on HLT progress; the applications of HLT and resources

      to other disciplines including law, medicine, economics, the

      political sciences and psychology; the impact of HLTs and related

      technologies on linguistic analysis and novel approaches in fields

      as widespread as phonetics, semantics, language documentation,

      sociolinguistics and dialect geography; and finally, the impact of

      any of these developments on the ways in which language resources

      are created, shared and exploited and on the specific resources

      required.</p>

    <p class="MsoNormal">Stay tuned for further details.</p>

    <p class="MsoNormal" align="center"><b>New publications </b></p>

    <p class="MsoNormal"><a name="name"></a>(1) <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T11">American

English

        Nickname Collection</a> was developed by <a

        href="http://www.intelius.com/corp/">Intelius, Inc</a>. and is a

      compilation of American English nickname-to-given-name mappings

      based on information in US government records, public web profiles

      and financial and property reports. This corpus is intended as a

      tool for the quantitative study of nickname usage in the United

      States, for example in demographic and sociological studies. </p>

    <p class="MsoNormal">The American English Nickname Collection

      contains 331,237 distinct mappings encompassing millions of names.

      The data was collected and processed through a record linkage

      pipeline. The steps in the pipeline were (1) data cleaning, (2)

      blocking, (3) pairwise linkage and (4) clustering. In the

      cleaning step, material was categorized, processed to remove junk

      and spam records and normalized to an approximately common

      representation. The blocking process utilized an algorithm to

      group records by shared properties for determining which record

      pairs should be examined by the pairwise linker as potential

      duplicates. The linkage step assigned a score to record pairs

      using a supervised pairwise machine learning model. The

      clustering step combined record pairs into connected components

      and further partitioned each connected component to remove

      inconsistent pairwise links. The result is that input records were

      partitioned into disjoint sets called profiles, where each profile

      corresponded to a single person.</p>
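    <p class="MsoNormal">The four steps above can be sketched in a few
      lines of Python. This is only an illustrative toy, not the
      pipeline used to build the collection: the records, the blocking
      key (last name), and the <code>score</code> function (a simple
      same-city check standing in for the supervised model) are all
      assumptions made for the example, and the final step of
      re-partitioning components to remove inconsistent links is
      omitted.</p>

```python
from collections import defaultdict
from itertools import combinations

# Invented toy records: (record id, first name, last name, city).
RAW_RECORDS = [
    (0, " William", "Smith", "Seattle"),
    (1, "Bill", "smith", "Seattle"),
    (2, "Billy", "Smith", "seattle "),
    (3, "Wanda", "Smith", "Boston"),
    (4, "Robert", "Jones", "Denver"),
    (5, "Bob", "Jones", "Denver"),
    (6, "", "", "Denver"),  # junk record, dropped during cleaning
]


def clean(records):
    """Step 1: normalize case and whitespace, drop records with no name."""
    cleaned = []
    for rec_id, first, last, city in records:
        first, last, city = (s.strip().lower() for s in (first, last, city))
        if first and last:
            cleaned.append((rec_id, first, last, city))
    return cleaned


def block(records):
    """Step 2: group records by a shared property (assumed here: last name)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[2]].append(rec)
    return blocks


def score(a, b):
    """Step 3: toy stand-in for a trained pairwise model (same city or not)."""
    return 1.0 if a[3] == b[3] else 0.0


def cluster(records, links):
    """Step 4: merge linked pairs into connected components (union-find)."""
    parent = {rec[0]: rec[0] for rec in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for rec in records:
        groups[find(rec[0])].append(rec[0])
    return sorted(sorted(ids) for ids in groups.values())


def build_profiles(raw_records, threshold=0.5):
    """Run all four steps and return lists of record ids, one per profile."""
    records = clean(raw_records)
    links = [
        (a[0], b[0])
        for members in block(records).values()
        for a, b in combinations(members, 2)
        if score(a, b) >= threshold
    ]
    return cluster(records, links)


profiles = build_profiles(RAW_RECORDS)
```

    <p class="MsoNormal">Only pairs inside the same block are scored,
      which is what keeps the number of comparisons manageable on large
      inputs. Each list in <code>profiles</code> holds the ids of the
      records assigned to one person.</p>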

    <p class="MsoNormal">The material is presented in the form of a

      comma-delimited text file. Each line contains a first name, a

      nickname or alias, its conditional probability and its frequency.

      The conditional probability for each nickname is derived from the
      base data using an algorithm that calculates both the probability
      that an alias refers to a given name and a threshold below
      which the mapping is most likely an error. Applying this threshold
      eliminates typographic errors and other noise from the data.</p>

    <p class="MsoNormal">The collection is being made available at no charge.</p>

    <p class="MsoNormal" align="center">*</p>

    <p class="MsoNormal"><a name="atb"></a>(2) <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T07">Arabic

Treebank

        - Broadcast News v1.0</a> was developed at LDC. It consists of

      120 transcribed Arabic broadcast news stories with part-of-speech,

      morphology, gloss and syntactic tree annotation in accordance with

      the <a href="http://projects.ldc.upenn.edu/ArabicTreebank/">Penn

        Arabic Treebank (PATB) Morphological and Syntactic Annotation

        Guidelines</a>. The ongoing PATB project supports research in

      Arabic-language natural language processing and human language

      technology development. </p>

    <p class="MsoNormal">This release contains 432,976 source tokens

      before clitics were split, and 517,080 tree tokens after clitics

      were separated for treebank annotation. The source materials are

      Arabic broadcast news stories collected by LDC during the period

      2005-2008 from the following sources: Abu Dhabi TV, Al Alam News

      Channel, Al Arabiya, Al Baghdadya TV, Al Fayha, Alhurra, Al

      Iraqiyah, Aljazeera, Al Ordiniyah, Al Sharqiyah, Dubai TV, Kuwait

      TV, Lebanese Broadcasting Corp., Oman TV, Radio Sawa, Saudi TV and

      Syria TV. The transcripts were produced by LDC.</p>

    <br>

    <p class="MsoNormal" align="center">*</p>

    <p class="MsoNormal"><a name="cat"></a>(3) <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T10">Catalan

TimeBank

        1.0</a> was developed by researchers at <a

        href="http://www.barcelonamedia.org/">Barcelona Media</a> and

      consists of Catalan texts in the <a

        href="http://clic.ub.edu/corpus/en/ancora">AnCora corpus</a>

      annotated with temporal and event information according to the <a

        href="http://www.timeml.org/site/index.html">TimeML

        specification language</a>. </p>

    <p class="MsoNormal">TimeML is a schema for annotating eventualities

      and time expressions in natural language as well as the temporal

      relations among them, thus facilitating the task of extraction,

      representation and exchange of temporal information. Catalan

      TimeBank 1.0 is annotated at three levels, marking events, time

      expressions and event metadata. The TimeML annotation scheme was

      tailored for the specifics of the Catalan language. Temporal

      relations in Catalan present distinctions of verbal mood (e.g.,

      indicative, subjunctive, conditional) and grammatical aspect

      (e.g., imperfective) which are absent in English. </p>

    <p class="MsoNormal">Catalan TimeBank 1.0 contains stand-off

      annotations for 210 documents with over 75,800 tokens (including

      punctuation marks) and 68,000 tokens (excluding punctuation). The

      source documents are from the <a

        href="http://www.efe.com/principal.asp?opcion=0&amp;idioma=CATALAN">EFE

        news agency</a>, the <a

        href="http://www.catalannewsagency.com/aboutus">ACN</a> Catalan

      news agency and the Catalan version of the <a

        href="http://www.elperiodico.cat/ca/">El Periódico</a>

      newspaper, and span the period from January to December 2000. </p>

    <p class="MsoNormal">The AnCora corpus is the largest multilayer

      annotated corpus of Spanish and Catalan. AnCora contains 400,000

      words in Spanish and 275,000 words in Catalan. The AnCora

      documents are annotated on many linguistic levels including

      structure, syntax, dependencies, semantics and pragmatics. That

      information is not included in this release, but it can be mapped

      to the present annotations. The corpus is freely available from

      the <a href="http://clic.ub.edu/ancora">Centre de Llenguatge i

        Computació (CLiC)</a>.</p>

    <p class="MsoNormal">The collection is being made available at no charge.</p>

    <div class="MsoNormal" style="text-align:center" align="center">

      <hr align="center" size="2" width="100%"> </div>


    <pre class="moz-signature" cols="72">-- 

Ilya Ahtaridis

Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                  Phone: 1 (215) 573-1275

University of Pennsylvania                    Fax: 1 (215) 573-2175

3600 Market St., Suite 810                        <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                 <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>

</pre>

  </body>

</html>