[Corpora-List] News from LDC

Tue Dec 22 15:14:39 UTC 2009

/In this newsletter:/

- *LDC and Oxford University Receive Digging into Data Challenge Grant* -

- *LDC to Close for Winter Break* -

/New publications:/

LDC2009T29
- *ACL Anthology Reference Corpus* -

LDC2009T30
- *Arabic Gigaword Fourth Edition* -

------------------------------------------------------------------------

*LDC and Oxford University Receive Digging into Data Challenge Grant*

LDC and its research partner Oxford University form one of eight 
international research teams awarded the first Digging into Data 
Challenge grants, which fund projects that promote innovative humanities 
and social science research using large-scale data analysis. Four 
leading research agencies sponsor the international competition: the 
Joint Information Systems Committee (JISC) from the United Kingdom, the 
National Endowment for the Humanities and the National Science 
Foundation (NSF) from the United States, and the Social Sciences and 
Humanities Research Council from Canada.

LDC and Oxford University (with the participation of the British 
Library) have been funded by NSF and JISC, respectively, for a project 
entitled "Mining a Year of Speech," which will focus on creating tools 
to enable rapid and flexible access to more than 9,000 hours of spoken 
audio files. Those files contain a wide variety of speech drawn from 
some of the leading British and American spoken word corpora, allowing 
for new kinds of linguistic analysis.

Further information about the Digging into Data Challenge can be found 
on the project website <http://www.diggingintodata.org/>.

*LDC to Close for Winter Break*

LDC will be closed from Friday, December 25, 2009 through Friday, 
January 1, 2010 in accordance with the University of Pennsylvania Winter 
Break Policy.  Our offices will reopen on Monday, January 4, 2010 when 
we will begin to process requests received during the winter break.

Best wishes for a happy and safe holiday season!

*New Publications*

(1)  ACL Anthology Reference Corpus 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T29> 
is a digital archive of 10,921 research papers in computational 
linguistics sponsored by the Association for Computational Linguistics 
(ACL). Also available from the ACL <http://acl-arc.comp.nus.edu.sg/>, 
this release contains most of the papers that appear up to February 2007 
in the web-based ACL Anthology <http://aclweb.org/anthology-new/>, a 
dynamic repository that currently hosts over 16,500 articles drawn from 
a range of conferences and workshops as well as past issues of the 
/Computational Linguistics/ journal. The ACL Anthology Reference Corpus 
is designed to be a standard, real-world digital collection testbed for 
experiments in bibliographic and bibliometric research.

The ACL is the international scientific and professional society for 
scholars working on problems involving natural language and computation. 
Membership includes the ACL quarterly journal, /Computational 
Linguistics/, reduced registration at most ACL-sponsored conferences, 
discounts on ACL-sponsored publications and participation in ACL Special 
Interest Groups. Since 1988, /Computational Linguistics/ has been the 
primary forum for research on computational linguistics and natural 
language processing.

The material in the ACL Anthology Reference Corpus was scanned at 600dpi 
grayscale for archival storage, down-sampled to 300dpi black-and-white, 
assembled into articles and stored in the PDF Image with Hidden Text 
format. Author and title metadata was extracted from the OCRed text and 
used to build HTML index pages. Older materials, such as conference 
proceedings from the 1960s and early volumes of /Computational 
Linguistics/, were manually digitized from microfiche slides.

The ACL Anthology Reference Corpus includes:

    * 10,921 PDF files in the pdf/anthology-PDF tree
    * 13,551 files with metadata described in the metadata/anthology-XML
      tree
    * 84,542 pages in the PDF files

(2)  Arabic Gigaword Fourth Edition 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T30> 
is a comprehensive archive of Arabic newswire text that has been 
acquired over several years at LDC. Arabic Gigaword Fourth Edition 
includes all of the content of Arabic Gigaword Third Edition 
(LDC2007T40) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T40> 
as well as newly collected data. In addition, three new sources have 
been added in the fourth edition: Al-Ahram, Asharq Al-Awsat and Al-Quds 
Al-Arabi.

Nine distinct international sources of Arabic newswire are represented here:

    * Al-Ahram (ahr_arb)
    * Asharq Al-Awsat (aaw_arb)
    * Agence France Presse (afp_arb)
    * Assabah (asb_arb)
    * Al Hayat (hyt_arb)
    * An Nahar (nhr_arb)
    * Al-Quds Al-Arabi (qds_arb)
    * Ummah Press (umh_arb)
    * Xinhua News Agency (xin_arb)

The seven-character codes shown above serve as both the names of the 
directories where the data files are found and the prefix that begins 
every file name. Each code consists of a three-character source ID and 
the three-character language code ("arb"), separated by an underscore 
("_") character.
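The naming scheme can be illustrated with a minimal Python sketch that splits a name into its source ID and language code. The function name parse_prefix and the sample file name are hypothetical and are not part of the release; only the codes and source names come from the list above.

```python
import re

# Source IDs and names as listed in this newsletter.
SOURCES = {
    "ahr": "Al-Ahram",
    "aaw": "Asharq Al-Awsat",
    "afp": "Agence France Presse",
    "asb": "Assabah",
    "hyt": "Al Hayat",
    "nhr": "An Nahar",
    "qds": "Al-Quds Al-Arabi",
    "umh": "Ummah Press",
    "xin": "Xinhua News Agency",
}

# Three-character source ID, underscore, three-character language code.
PREFIX = re.compile(r"^([a-z]{3})_([a-z]{3})")


def parse_prefix(name):
    """Return (source_id, language_code, source_name) for a directory or file name."""
    match = PREFIX.match(name)
    if match is None or match.group(1) not in SOURCES:
        raise ValueError(f"unrecognized prefix in {name!r}")
    source_id, language = match.groups()
    return source_id, language, SOURCES[source_id]


# Hypothetical file name; only the leading seven characters matter here.
print(parse_prefix("afp_arb_example.gz"))
```

Anything after the seven-character prefix is ignored, so the same function works for directory names and file names alike.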

These news services all use Modern Standard Arabic (MSA), so there 
should be a fairly limited scope for orthographic and lexical variation 
due to regional Arabic dialects.

New in the Fourth Edition

    * New Sources

      This release marks the first edition of Arabic Gigaword to include
      content from Al-Ahram, Asharq Al-Awsat and Al-Quds Al-Arabi, covering
      the period from November 2006 through December 2008.

    * New Data for Existing Sources

      This release contains all data collected by LDC from January 2007
      through December 2008, except for Ummah Press, for which data from
      January 2005 through December 2008 is included.

The table below shows data quantity by source under the following 
categories: data source (Source); the number of files per source 
(#Files); compressed file size (Gzip-MB); uncompressed file size 
(Totl-MB); the number of space-separated word tokens in the text, in 
thousands (K-wrds); and the number of documents per source (#DOCs).

Source    #Files  Gzip-MB  Totl-MB  K-wrds    #DOCs
aaw_arb       26      114      386   36694    87506
afp_arb      176      530     1979  184631   930656
ahr_arb       26      114      131   42265   107187
asb_arb       52       45      149   14322    32794
hyt_arb      166      663     2224  209318   448335
nhr_arb      157      784     2662  253559   557151
qds_arb       26       62      198   18996    49352
umh_arb       68      9.3       31    2995    11350
xin_arb       91      245      890   85689   492664
Totals       788     5018     8650  848469  2716995
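
As a rough illustration of how the figures above can be used, the Python sketch below derives the average number of word tokens per document for each source from the K-wrds and #DOCs columns. It assumes that K-wrds counts thousands of tokens; the helper name words_per_doc is hypothetical.

```python
# (source, K-wrds, #DOCs) copied from the table above.
# Assumption: K-wrds is expressed in thousands of word tokens.
ROWS = [
    ("aaw_arb", 36694, 87506),
    ("afp_arb", 184631, 930656),
    ("ahr_arb", 42265, 107187),
    ("asb_arb", 14322, 32794),
    ("hyt_arb", 209318, 448335),
    ("nhr_arb", 253559, 557151),
    ("qds_arb", 18996, 49352),
    ("umh_arb", 2995, 11350),
    ("xin_arb", 85689, 492664),
]


def words_per_doc(k_words, docs):
    """Average number of word tokens per document."""
    return k_words * 1000 / docs


for source, k_words, docs in ROWS:
    print(f"{source}  {words_per_doc(k_words, docs):8.1f}")
```

The averages are only approximate, since the word counts are rounded to the nearest thousand.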

------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu
