<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff" text="#000000">
<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"
align="center"><i>In
this newsletter:</i></p>
<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"
align="center"><b>- </b><b><a href="#digging">LDC
and </a></b><st1:place><st1:placename><a href="#digging"><b>Oxford</b></a></st1:placename><a
href="#digging"><b>
</b><st1:placetype><b>University</b></st1:placetype></a></st1:place><a
href="#digging"><b>
Receive Digging into Data Challenge Grant</b></a> -<br>
</p>
<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"
align="center">-<span style=""> </span><b><a href="#break">LDC
to
Close for Winter
Break</a></b><b><span style=""> </span>-</b></p>
<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"
align="center"><i>New publications:</i></p>
<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"
align="center">LDC2009T29<br>
<span style=""></span><b><a href="#acl">- ACL
Anthology Reference
Corpus</a><span style=""></span></b><b><span style=""> </span>-</b><br>
<br>
LDC2009T30<br>
-<span style=""> </span><b><a href="#giga">Arabic
Gigaword Fourth
Edition</a></b><b><span style=""> </span>-</b><br>
</p>
<div class="MsoNormal" style="text-align: center;" align="center">
<hr align="center" size="2" width="100%"></div>
<p class="MsoNormal"><o:p> </o:p></p>
<div align="center"><a name="digging"></a><b>LDC and </b><st1:place><st1:placename><b>Oxford</b></st1:placename><b>
</b><st1:placetype><b>University</b></st1:placetype></st1:place><b>
Receive
Digging into Data Challenge Grant</b> <br>
</div>
<p class="MsoNormal">
<br>
LDC and its research team partner <st1:place><st1:placename>Oxford</st1:placename>
<st1:placetype>University</st1:placetype></st1:place> are one of eight
international research teams to have been awarded the first Digging
into Data
Challenge grants for projects that promote innovative humanities and
social
science research using large-scale data analysis. Four leading research
agencies sponsor the international competition: The Joint Information
Systems
Committee (JISC) from the <st1:country-region><st1:place>United Kingdom</st1:place></st1:country-region>,
the National Endowment for the Humanities and the National Science
Foundation
(NSF) from the <st1:country-region><st1:place>United States</st1:place></st1:country-region>
and the Social Sciences and Humanities Research Council from <st1:country-region><st1:place>Canada</st1:place></st1:country-region>.
<br>
<br>
LDC and Oxford University (with the participation of the British
Library)
have been funded by NSF and JISC, respectively, for a project entitled
“Mining
a Year of Speech,” which will focus on creating tools to enable rapid
and
flexible access to more than 9,000 hours of spoken audio files. Those
files
contain a wide variety of speech drawn from some of the leading British
and
American spoken word corpora, allowing for new kinds of linguistic
analysis. <br>
<br>
Further information about the Digging into Data Challenge can be found
on the
<a href="http://www.diggingintodata.org/">project website</a>.</p>
<p class="MsoNormal">[<a href="#top">
top </a>]
</p>
<p class="MsoNormal" style="text-align: center;" align="center"><a
name="break"></a><b>LDC
to Close for
Winter Break</b><o:p></o:p></p>
<p><span style="color: black;">LDC will be closed from </span><st1:date
year="2009" day="25" month="12"><span style="color: black;">Friday,
December 25, 2009</span></st1:date><span style="color: black;"> through
</span><st1:date year="2010" day="1" month="1"><span
style="color: black;">Friday, January 1, 2010</span></st1:date><span
style="color: black;"> in accordance with the </span><st1:place><st1:placetype><span
style="color: black;">University</span></st1:placetype><span
style="color: black;"> of </span><st1:placename><span
style="color: black;">Pennsylvania Winter Break Policy</span></st1:placename></st1:place><span
style="color: black;">. </span><a name="2"> Our offices will reopen on
Monday, January 4, 2010, when we will begin processing requests received
during the winter break.</a><o:p></o:p></p>
<p class="MsoNormal" style="margin-bottom: 12pt;"><span
style="color: black;">Best
wishes for a happy and safe holiday season!<br>
</span></p>
<p class="MsoNormal" style="margin-bottom: 12pt;">[<a href="#top">
top </a>]<br>
<span style="color: black;"></span><o:p></o:p></p>
<br>
<p class="MsoNormal" style="text-align: center;" align="center"><b>New
Publications</b><o:p></o:p></p>
<p>(1) <a name="acl"></a><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T29">ACL
Anthology Reference Corpus</a> is a digital archive of 10,921
research papers in computational linguistics sponsored by the
Association for
Computational Linguistics (ACL). Also <a
href="http://acl-arc.comp.nus.edu.sg/">available
from the ACL</a>, this release contains most of the papers that appear
up to
February 2007 in the web-based <a
href="http://aclweb.org/anthology-new/">ACL
Anthology</a>, a dynamic repository that currently hosts over 16,500
articles
drawn from a range of conferences and workshops as well as past issues
of the <em>Computational
Linguistics</em> journal. The ACL Anthology Reference Corpus is
designed to be
a standard, real-world digital collection testbed for experiments in
bibliographic and bibliometric research. <o:p></o:p></p>
<p>The ACL is the international scientific and professional society for
scholars working on problems involving natural language and
computation.
Membership includes the ACL quarterly journal, <em>Computational
Linguistics</em>,
reduced registration at most ACL-sponsored conferences, discounts on
ACL-sponsored publications and participation in ACL Special Interest
Groups.
Since 1988, <em>Computational Linguistics</em> has been the primary
forum for
research on computational linguistics and natural language processing. <o:p></o:p></p>
<p>The material in the ACL Anthology Reference Corpus was scanned at
600dpi
grayscale for archival storage, down-sampled to 300dpi black-and-white,
assembled into articles and stored in the PDF Image with Hidden Text
format.
Author and title metadata was extracted from the OCRed text and used to
build
HTML index pages. Older materials, such as conference proceedings from
the
1960s and early volumes of <em>Computational Linguistics</em>, were
manually
digitized from microfiche slides. <o:p></o:p></p>
<p>The ACL Anthology Reference Corpus includes: <o:p></o:p></p>
<ul type="disc">
<li class="MsoNormal" style="">10,921 PDF files in the
pdf/anthology-PDF tree.<o:p></o:p></li>
<li class="MsoNormal" style="">13,551 files with metadata described
in the metadata/anthology-XML tree<o:p></o:p></li>
<li class="MsoNormal" style="">84,542 pages in the PDF files<o:p></o:p></li>
</ul>
<p class="MsoNormal" style="margin-bottom: 12pt;"><br>
</p>
<p class="MsoNormal" style="margin-bottom: 12pt;">[<a href="#top">
top </a>]
</p>
<p class="MsoNormal" style="margin-bottom: 12pt;" align="center">* <o:p></o:p></p>
<p>(2) <a name="giga"></a><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T30">Arabic
Gigaword Fourth Edition</a> is a comprehensive archive of
Arabic newswire text that has been acquired over several years at LDC.
Arabic
Gigaword Fourth Edition includes all of the content of <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T40">Arabic
Gigaword Third Edition (LDC2007T40)</a> as well as newly-collected
data. In
addition, three new sources have been added in the fourth edition:
Al-Ahram,
Asharq Al-Awsat and Al-Quds Al-Arabi. <o:p></o:p></p>
<p>Nine distinct international sources of Arabic newswire are
represented here:<o:p></o:p></p>
<ul type="disc">
<li class="MsoNormal" style="">Al-Ahram (ahr_arb)<o:p></o:p></li>
<li class="MsoNormal" style="">Asharq Al-Awsat (aaw_arb)<o:p></o:p></li>
<li class="MsoNormal" style="">Agence France Presse (afp_arb)<o:p></o:p></li>
<li class="MsoNormal" style="">Assabah (asb_arb)<o:p></o:p></li>
<li class="MsoNormal" style="">Al Hayat (hyt_arb)<o:p></o:p></li>
<li class="MsoNormal" style="">An Nahar (nhr_arb)<o:p></o:p></li>
<li class="MsoNormal" style="">Al-Quds Al-Arabi (qds_arb)<o:p></o:p></li>
<li class="MsoNormal" style="">Ummah Press (umh_arb)<o:p></o:p></li>
<li class="MsoNormal" style="">Xinhua News Agency (xin_arb)<o:p></o:p></li>
</ul>
<p>The seven-character codes shown above serve as both the names of the
directories where the data files are found and the prefix that appears at
the beginning of every file name. Each code consists of a
three-character source ID and the three-character language code ("arb"),
separated by an underscore ("_") character.<o:p></o:p></p>
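<p>As a rough illustration of the naming convention described above, the following Python sketch splits the seven-character prefix of a file name into its source ID and language code. The function name <em>parse_prefix</em> and the sample file name are hypothetical; only the prefix layout and the nine source IDs come from this newsletter.</p>

```python
import re

# Source IDs and names as listed in this newsletter.
SOURCES = {
    "ahr": "Al-Ahram",
    "aaw": "Asharq Al-Awsat",
    "afp": "Agence France Presse",
    "asb": "Assabah",
    "hyt": "Al Hayat",
    "nhr": "An Nahar",
    "qds": "Al-Quds Al-Arabi",
    "umh": "Ummah Press",
    "xin": "Xinhua News Agency",
}

# Three-character source ID, an underscore, three-character language code.
PREFIX = re.compile(r"^([a-z]{3})_([a-z]{3})")


def parse_prefix(file_name):
    """Return (source_id, language_code, source_name) for a file name."""
    match = PREFIX.match(file_name)
    if match is None:
        raise ValueError("no seven-character prefix in " + repr(file_name))
    source_id, language = match.groups()
    if source_id not in SOURCES:
        raise ValueError("unknown source ID " + repr(source_id))
    return source_id, language, SOURCES[source_id]


if __name__ == "__main__":
    # "afp_arb_EXAMPLE" is a made-up name; only its prefix follows the text.
    print(parse_prefix("afp_arb_EXAMPLE"))
```

<p>The same function works on directory names, since the text states that the directory name and the file-name prefix are the same code.</p>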
<p>These news services all use Modern Standard Arabic (<st1:stockticker>MSA</st1:stockticker>),
so there should be a fairly limited scope for orthographic and lexical
variation due to regional Arabic dialects. <o:p></o:p></p>
<p class="MsoNormal">New in the Fourth Edition<o:p></o:p></p>
<ul type="disc">
<li class="MsoNormal" style="">New Sources<o:p></o:p></li>
</ul>
<p class="MsoNormal"> This release marks the first
edition of Arabic Gigaword to include content from Al-Ahram, Asharq
Al-Awsat
and Al-Quds Al-Arabi covering the period from November 2006 through
December
2008. <o:p></o:p></p>
<ul type="disc">
<li class="MsoNormal" style="">New Data for Existing Sources<o:p></o:p></li>
</ul>
<p class="MsoNormal"> This release contains all
data collected by LDC from January 2007 through December 2008, except
for Ummah
Press, for which data from January 2005 through December 2008 is
included. <o:p></o:p></p>
<p>The table below shows data quantity by source under the following
categories: data source (Source); the number of files per source
(#Files);
compressed file size (Gzip-MB); uncompressed file size (Totl-MB); the
number of
space-separated word tokens in the text, in thousands (K-wrds); and the
number of documents
per source (#DOCs).<o:p></o:p></p>
<table class="MsoNormalTable" style="width: 75%;" border="1"
cellpadding="0" width="75%">
<tbody>
<tr style="">
<td style="padding: 0.75pt;">
<p class="MsoNormal"><strong>Source</strong><o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">#<strong>Files</strong><o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal"><strong>Gzip-MB</strong><o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal"><strong>Totl-MB</strong><o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal"><strong>K-wrds</strong><o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal"><strong>#DOCs</strong><o:p></o:p></p>
</td>
</tr>
<tr style="">
<td style="padding: 0.75pt;">
<p class="MsoNormal">aaw_arb<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">26<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">114<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">386<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">36694<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">87506<o:p></o:p></p>
</td>
</tr>
<tr style="">
<td style="padding: 0.75pt;">
<p class="MsoNormal">afp_arb<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">176<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">530<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">1979<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">184631<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">930656<o:p></o:p></p>
</td>
</tr>
<tr style="">
<td style="padding: 0.75pt;">
<p class="MsoNormal">ahr_arb<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">26<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">114<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">131<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">42265<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">107187<o:p></o:p></p>
</td>
</tr>
<tr style="">
<td style="padding: 0.75pt;">
<p class="MsoNormal">asb_arb<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">52<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">45<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">149<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">14322<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">32794<o:p></o:p></p>
</td>
</tr>
<tr style="">
<td style="padding: 0.75pt;">
<p class="MsoNormal">hyt_arb<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">166<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">663<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">2224<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">209318<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">448335<o:p></o:p></p>
</td>
</tr>
<tr style="">
<td style="padding: 0.75pt;">
<p class="MsoNormal">nhr_arb<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">157<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">784<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">2662<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">253559<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">557151<o:p></o:p></p>
</td>
</tr>
<tr style="">
<td style="padding: 0.75pt;">
<p class="MsoNormal">qds_arb<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">26<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">62<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">198<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">18996<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">49352<o:p></o:p></p>
</td>
</tr>
<tr style="">
<td style="padding: 0.75pt;">
<p class="MsoNormal">umh_arb<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">68<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">9.3<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">31<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">2995<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">11350<o:p></o:p></p>
</td>
</tr>
<tr style="">
<td style="padding: 0.75pt;">
<p class="MsoNormal">xin_arb<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">91<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">245<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">890<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">85689<o:p></o:p></p>
</td>
<td style="padding: 0.75pt;">
<p class="MsoNormal">492664<o:p></o:p></p>
</td>
</tr>
<tr style="height: 17.25pt;">
<td style="padding: 0.75pt; height: 17.25pt;">
<p class="MsoNormal"><strong>Totals</strong><o:p></o:p></p>
</td>
<td style="padding: 0.75pt; height: 17.25pt;">
<p class="MsoNormal">788<o:p></o:p></p>
</td>
<td style="padding: 0.75pt; height: 17.25pt;">
<p class="MsoNormal">5018<o:p></o:p></p>
</td>
<td style="padding: 0.75pt; height: 17.25pt;">
<p class="MsoNormal">8650<o:p></o:p></p>
</td>
<td style="padding: 0.75pt; height: 17.25pt;">
<p class="MsoNormal">848469<o:p></o:p></p>
</td>
<td style="padding: 0.75pt; height: 17.25pt;">
<p class="MsoNormal">2716995<o:p></o:p></p>
</td>
</tr>
</tbody>
</table>
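<p>One way to read the table is to derive an average document length for each source from the K-wrds and #DOCs columns. The Python sketch below does this, assuming that K-wrds counts thousands of space-separated tokens; the function name <em>words_per_document</em> is hypothetical and the figures are copied from the table.</p>

```python
# (K-wrds, #DOCs) per source, copied from the table in this newsletter.
TABLE = {
    "aaw_arb": (36694, 87506),
    "afp_arb": (184631, 930656),
    "ahr_arb": (42265, 107187),
    "asb_arb": (14322, 32794),
    "hyt_arb": (209318, 448335),
    "nhr_arb": (253559, 557151),
    "qds_arb": (18996, 49352),
    "umh_arb": (2995, 11350),
    "xin_arb": (85689, 492664),
}


def words_per_document(k_words, docs):
    """Average tokens per document, treating K-wrds as thousands of tokens."""
    return k_words * 1000 / docs


if __name__ == "__main__":
    for source, (k_words, docs) in sorted(TABLE.items()):
        print(source, round(words_per_document(k_words, docs)))
```

<p>These averages are only approximate, since the word counts in the table are rounded to the nearest thousand.</p>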
<p class="MsoNormal" style="margin-bottom: 12pt;"><br>
<br>
</p>
<p class="MsoNormal" style="margin-bottom: 12pt;">[<a href="#top">
top </a>]<br>
<o:p></o:p></p>
<br>
<hr size="2" width="100%">
<div align="center"><font face="Courier New, Courier, monospace"><small><small><big>Ilya
Ahtaridis</big></small></small></font><br>
<font face="Courier New, Courier, monospace"><small><small><big>Membership
Coordinator</big></small></small></font><br>
<br>
<font face="Courier New, Courier, monospace"><small>--------------------------------------------------------------------</small></font><br>
</div>
<div align="center">
<pre class="moz-signature" cols="72"><font
face="Courier New, Courier, monospace">Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 <a
class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a
class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>
</div>
</body>
</html>