[Corpora-List] News from LDC
    Linguistic Data Consortium 
    ldc at ldc.upenn.edu
       
    Tue Dec 22 15:14:39 UTC 2009
    
    
  
/In this newsletter:/
- LDC and Oxford University Receive Digging into Data Challenge Grant
- LDC to Close for Winter Break
/New publications:/
- LDC2009T29: ACL Anthology Reference Corpus
- LDC2009T30: Arabic Gigaword Fourth Edition
------------------------------------------------------------------------
 
*LDC and Oxford University Receive Digging into Data Challenge Grant*
LDC and its research team partner, Oxford University, make up one of 
eight international research teams awarded the first Digging into Data 
Challenge grants for projects that promote innovative humanities and 
social science research using large-scale data analysis. Four leading 
research agencies sponsor the international competition: the Joint 
Information Systems Committee (JISC) from the United Kingdom; the 
National Endowment for the Humanities and the National Science 
Foundation (NSF) from the United States; and the Social Sciences and 
Humanities Research Council from Canada.
LDC and Oxford University (with the participation of the British 
Library) have been funded by NSF and JISC, respectively, for a project 
entitled "Mining a Year of Speech," which will focus on creating tools 
to enable rapid and flexible access to more than 9,000 hours of spoken 
audio files. Those files contain a wide variety of speech drawn from 
some of the leading British and American spoken word corpora, allowing 
for new kinds of linguistic analysis.
Further information about the Digging into Data Challenge can be found 
on the project website <http://www.diggingintodata.org/>.
*LDC to Close for Winter Break*
LDC will be closed from Friday, December 25, 2009, through Friday, 
January 1, 2010, in accordance with the University of Pennsylvania 
Winter Break Policy. Our offices will reopen on Monday, January 4, 2010, 
when we will begin to process requests received during the winter break.
Best wishes for a happy and safe holiday season!
*New Publications*
(1)  ACL Anthology Reference Corpus 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T29> 
is a digital archive of 10,921 research papers in computational 
linguistics sponsored by the Association for Computational Linguistics 
(ACL). Also available from the ACL <http://acl-arc.comp.nus.edu.sg/>, 
this release contains most of the papers that appear up to February 2007 
in the web-based ACL Anthology <http://aclweb.org/anthology-new/>, a 
dynamic repository that currently hosts over 16,500 articles drawn from 
a range of conferences and workshops as well as past issues of the 
/Computational Linguistics/ journal. The ACL Anthology Reference Corpus 
is designed to be a standard, real-world digital collection testbed for 
experiments in bibliographic and bibliometric research.
The ACL is the international scientific and professional society for 
scholars working on problems involving natural language and computation. 
Membership includes the ACL quarterly journal, /Computational 
Linguistics/, reduced registration at most ACL-sponsored conferences, 
discounts on ACL-sponsored publications and participation in ACL Special 
Interest Groups. Since 1988, /Computational Linguistics/ has been the 
primary forum for research on computational linguistics and natural 
language processing.
The material in the ACL Anthology Reference Corpus was scanned at 600dpi 
grayscale for archival storage, down-sampled to 300dpi black-and-white, 
assembled into articles and stored in the PDF Image with Hidden Text 
format. Author and title metadata was extracted from the OCRed text and 
used to build HTML index pages. Older materials, such as conference 
proceedings from the 1960s and early volumes of /Computational 
Linguistics/, were manually digitized from microfiche slides.
The ACL Anthology Reference Corpus includes:
    * 10,921 PDF files in the pdf/anthology-PDF tree
    * 13,551 files with metadata described in the metadata/anthology-XML
      tree
    * 84,542 pages in the PDF files
(2)  Arabic Gigaword Fourth Edition 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T30> 
is a comprehensive archive of Arabic newswire text that has been 
acquired over several years at LDC. Arabic Gigaword Fourth Edition 
includes all of the content of Arabic Gigaword Third Edition 
(LDC2007T40) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T40> 
as well as newly collected data. In addition, three new sources have 
been added in the fourth edition: Al-Ahram, Asharq Al-Awsat and Al-Quds 
Al-Arabi.
Nine distinct international sources of Arabic newswire are represented here:
    * Al-Ahram (ahr_arb)
    * Asharq Al-Awsat (aaw_arb)
    * Agence France Presse (afp_arb)
    * Assabah (asb_arb)
    * Al Hayat (hyt_arb)
    * An Nahar (nhr_arb)
    * Al-Quds Al-Arabi (qds_arb)
    * Ummah Press (umh_arb)
    * Xinhua News Agency (xin_arb)
The seven-character codes shown above serve as both the names of the 
directories where the data files are found and the seven-character 
prefix that appears at the beginning of every file name. Each code 
consists of a three-character source ID and the three-character 
language code ("arb"), separated by an underscore ("_") character.
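As an illustration of the naming convention just described, the following 
minimal Python sketch splits the seven-character prefix of a directory or 
file name into its source ID and language code. The function name 
parse_prefix and the sample file name are hypothetical and are not part of 
the release; only the nine source codes listed above come from this 
announcement.

```python
import re

# Source IDs taken from the list in this announcement.
KNOWN_SOURCES = {"ahr", "aaw", "afp", "asb", "hyt", "nhr", "qds", "umh", "xin"}

# Three lowercase letters, an underscore, three lowercase letters,
# anchored at the start of the name (seven characters in total).
_PREFIX = re.compile(r"^([a-z]{3})_([a-z]{3})")


def parse_prefix(name):
    """Return (source_id, language_code) for a directory or file name.

    Hypothetical helper: it only inspects the leading seven characters
    and makes no assumption about the rest of the name.
    """
    match = _PREFIX.match(name)
    if match is None:
        raise ValueError(f"no seven-character prefix in {name!r}")
    source_id, language = match.group(1), match.group(2)
    if source_id not in KNOWN_SOURCES:
        raise ValueError(f"unknown source ID {source_id!r}")
    return source_id, language


if __name__ == "__main__":
    # "afp_arb" is a directory name from the list above; the second
    # name is a made-up file name used purely for illustration.
    print(parse_prefix("afp_arb"))
    print(parse_prefix("xin_arb_example.gz"))
```

Grouping files by the returned source ID is one way to reproduce 
per-source tallies such as file counts.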
These news services all use Modern Standard Arabic (MSA), so there 
should be a fairly limited scope for orthographic and lexical variation 
due to regional Arabic dialects.
New in the Fourth Edition
    * New Sources
      This release marks the first edition of Arabic Gigaword to include
      content from Al-Ahram, Asharq Al-Awsat and Al-Quds Al-Arabi,
      covering the period from November 2006 through December 2008.
    * New Data for Existing Sources
      This release contains all data collected by LDC from January 2007
      through December 2008, except for Ummah Press, for which data from
      January 2005 through December 2008 is included.
The table below shows data quantity by source under the following 
categories: data source (Source); the number of files per source 
(#Files); compressed file size (Gzip-MB); uncompressed file size 
(Totl-MB); the number of space-separated word tokens in the text, in 
thousands (K-wrds); and the number of documents per source (#DOCs).
Source     #Files  Gzip-MB  Totl-MB   K-wrds     #DOCs
aaw_arb        26      114      386    36694     87506
afp_arb       176      530     1979   184631    930656
ahr_arb        26      114      131    42265    107187
asb_arb        52       45      149    14322     32794
hyt_arb       166      663     2224   209318    448335
nhr_arb       157      784     2662   253559    557151
qds_arb        26       62      198    18996     49352
umh_arb        68      9.3       31     2995     11350
xin_arb        91      245      890    85689    492664
Totals        788     5018     8650   848469   2716995
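
For readers who want to compute a word count of the kind reported in the 
table, here is a rough Python sketch that counts whitespace-separated 
tokens in a gzip-compressed text file. It rests on several assumptions not 
stated in this announcement: that the files are UTF-8 text, that splitting 
on any whitespace is an acceptable stand-in for "space-separated", and 
that no markup needs to be excluded before counting. The function names 
are hypothetical, and this is not LDC's own counting procedure.

```python
import gzip


def count_tokens(lines):
    """Count whitespace-separated tokens across an iterable of lines."""
    return sum(len(line.split()) for line in lines)


def count_tokens_in_gzip(path, encoding="utf-8"):
    """Count tokens in a gzip-compressed text file.

    The UTF-8 default is an assumption; pass another encoding if the
    data requires it.
    """
    # Text mode ("rt") decompresses and decodes line by line, so the
    # whole file never has to fit in memory.
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return count_tokens(handle)
```

Dividing the result by 1,000 gives a figure in thousands of words. Counts 
obtained this way may differ from the published ones if the files contain 
markup that the original tally excluded.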
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora