<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div align="center"><i>New publications:</i></div>

    <p class="MsoNormal" align="center"> LDC2012T09<br>

      <b>- <a href="ae">Arabic-Dialect/English Parallel Text</a> -</b><br>

      <br>

      LDC2012T08<br>

      <b>- <a href="ce">Prague Czech-English Dependency Treebank 2.0</a> 

        -</b></p>

    <br>

    <hr size="2" width="100%">

    <p class="MsoNormal" align="center"><b>New publications</b></p>

    <p class="MsoNormal"><br>

      <a name="ae"></a>(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T09">Arabic-Dialect/English
        Parallel Text</a> was developed by <a
        href="http://www.bbn.com/">Raytheon BBN Technologies</a> (BBN),
      LDC and <a href="http://www.sakhr.com/">Sakhr Software</a>. It
      contains approximately 3.5 million tokens of Arabic dialect
      sentences and their English translations.</p>

    <p class="MsoNormal">The data in this corpus consists of Arabic web
      text obtained in two ways:</p>

    <p class="MsoNormal">1. Filtered automatically from large Arabic

      text corpora harvested from the web by LDC. The LDC corpora

      consisted largely of weblog and online user group text and amounted to

      around 350 million Arabic words. Documents that contained a large

      percentage of non-Arabic or Modern Standard Arabic (MSA) words

      were eliminated. A list of dialect words was manually selected by

      culling through the Levantine Fisher (<a

href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005S07">LDC2005S07</a>,

      <a

href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T03">LDC2005T03</a>,

      <a

href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2007S02">LDC2007S02</a>

      and <a

href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2007T04">LDC2007T04</a>)

      and Egyptian CALLHOME speech corpora (<a

href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC97S45">LDC97S45</a>,

      <a

href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2002S37">LDC2002S37</a>,

      <a

href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC97T19">LDC97T19</a>

      and <a

href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2002T38">LDC2002T38</a>)

      distributed by LDC. That list was then used to retain documents

      that contained a certain number of matches. The resulting subset

      of the web corpora contained around four million words. Documents

      were automatically segmented into passages using formatting

      information from the raw data.</p>
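    <p class="MsoNormal">As a rough illustration of the filtering
      described in item 1, the Python sketch below drops documents
      dominated by non-Arabic or MSA words and keeps those with enough
      dialect-word matches. It is not the procedure actually used: the
      two thresholds, the MSA word list and the function name
      keep_document are assumptions made for illustration, since the
      description above gives no concrete values or tools.</p>
<pre>
```python
import re

# Hypothetical thresholds: the announcement gives no actual values.
MAX_EXCLUDED_RATIO = 0.5   # assumed cutoff for non-Arabic or MSA words
MIN_DIALECT_MATCHES = 2    # assumed minimum number of dialect-word hits

# Basic Arabic Unicode block only; a simplification for illustration.
ARABIC_WORD = re.compile(r"[\u0600-\u06FF]+")


def keep_document(text, dialect_words, msa_words):
    """Return True if a document passes both illustrative filters."""
    tokens = text.split()
    if not tokens:
        return False
    # Count tokens that are non-Arabic or appear in the assumed MSA list.
    excluded = sum(
        1 for t in tokens
        if not ARABIC_WORD.fullmatch(t) or t in msa_words
    )
    if excluded / len(tokens) > MAX_EXCLUDED_RATIO:
        return False
    # Retain only documents with enough dialect-word matches.
    matches = sum(1 for t in tokens if t in dialect_words)
    return matches >= MIN_DIALECT_MATCHES
```
</pre>
    <p class="MsoNormal">Here dialect_words stands in for the manually
      selected word list, and msa_words for whatever resource identifies
      MSA vocabulary; both are passed as Python sets of strings.</p>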

    <p class="MsoNormal">2. Manually harvested by Sakhr Software from

      Arabic dialect web sites.</p>

    <p class="MsoNormal">Dialect classification, sentence segmentation
      (as needed) and translation into English were
      performed by BBN through <a

        href="https://www.mturk.com/mturk/welcome">Amazon's Mechanical

        Turk</a>. Arabic annotators from Mechanical Turk classified

      filtered passages as being either MSA or one of four regional

      dialects: Egyptian, Levantine, Gulf/Iraqi or Maghrebi. An

      additional "General" dialect option was allowed for ambiguous

      passages. The classification was applied to whole passages rather

      than individual sentences. Only the passages labeled Levantine and

      Egyptian were further processed. The segmented Levantine and

      Egyptian sentences were then translated. Annotators were

      instructed to translate completely and accurately and to

      transliterate Arabic names. They were also provided with examples.

      All segments of a passage were presented in the same translation

      task to provide context.</p>


    <p class="MsoNormal" align="center">*</p>

    <p class="MsoNormal"><br>

      <a name="ce"></a>(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T08">Prague
        Czech-English Dependency Treebank (PCEDT) 2.0</a> was developed
      by the <a href="http://ufal.mff.cuni.cz/">Institute of Formal and
        Applied Linguistics</a> at <a href="http://www.cuni.cz/">Charles
        University</a> in Prague, Czech Republic. It is a corpus of
      Czech-English parallel resources translated, aligned and manually
      annotated for dependency structure, semantic labeling, argument
      structure, ellipsis and anaphora resolution. This release updates
      Prague Czech-English Dependency Treebank 1.0 (<a
href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2004T25">LDC2004T25</a>)
      by adding English newswire texts, so that it now contains over two
      million words in close to 100,000 sentences.</p>

    <p class="MsoNormal">The principal new material in PCEDT 2.0 is the
      entire Wall Street Journal data from Treebank-3 (<a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42">LDC99T42</a>).
      The Reader's Digest material, the Czech monolingual corpus and
      the English-Czech dictionary from PCEDT 1.0 are not included.</p>

    <p class="MsoNormal">Each section is enhanced with a comprehensive
      manual linguistic annotation in the style of Prague Dependency
      Treebank 2.0 (<a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T01">LDC2006T01</a>).
      The main features of this annotation style are:</p>

    <blockquote>

      <p class="MsoNormal">- dependency structure of the content words
        and coordinating and similar structures (function words are
        attached as their attribute values)</p>
      <p class="MsoNormal">- semantic labeling of content words and types
        of coordinating structures</p>
      <p class="MsoNormal">- argument structure, including an argument
        structure ("valency") lexicon for both languages</p>
      <p class="MsoNormal">- ellipsis and anaphora resolution</p>

    </blockquote>

    <p class="MsoNormal">This annotation style is called

      tectogrammatical annotation, and it constitutes the

      tectogrammatical layer in the corpus. Please consult the PCEDT <a

        href="http://ufal.mff.cuni.cz/pcedt2.0/">website</a> for more

      information and documentation.</p>

    <br>

    <hr size="2" width="100%"><br>

    <pre class="moz-signature" cols="72">-- 


Ilya Ahtaridis

Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                  Phone: 1 (215) 573-1275

University of Pennsylvania                    Fax: 1 (215) 573-2175

3600 Market St., Suite 810                        <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                 <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>

</pre>

  </body>

</html>