[Corpora-List] News from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Tue Dec 22 15:14:39 UTC 2009
/In this newsletter:/
- *LDC and Oxford University Receive Digging into Data Challenge Grant* -
- *LDC to Close for Winter Break* -
/New publications:/
LDC2009T29 - *ACL Anthology Reference Corpus* -
LDC2009T30 - *Arabic Gigaword Fourth Edition* -
------------------------------------------------------------------------
*LDC and Oxford University Receive Digging into Data Challenge Grant*
LDC and its research partner, Oxford University, form one of eight
international research teams to have been awarded the first Digging into
Data Challenge grants for projects that promote innovative humanities
and social science research using large-scale data analysis. Four
leading research agencies sponsor the international competition: The
Joint Information Systems Committee (JISC) from the United Kingdom, the
National Endowment for the Humanities and the National Science
Foundation (NSF) from the United States and the Social Sciences and
Humanities Research Council from Canada.
LDC and Oxford University (with the participation of the British
Library) have been funded by NSF and JISC, respectively, for a project
entitled "Mining a Year of Speech," which will focus on creating tools
to enable rapid and flexible access to more than 9,000 hours of spoken
audio files. Those files contain a wide variety of speech drawn from
some of the leading British and American spoken word corpora, allowing
for new kinds of linguistic analysis.
Further information about the Digging into Data Challenge can be found
on the project website <http://www.diggingintodata.org/>.
*LDC to Close for Winter Break*
LDC will be closed from Friday, December 25, 2009 through Friday,
January 1, 2010 in accordance with the University of Pennsylvania Winter
Break Policy. Our offices will reopen on Monday, January 4, 2010 when
we will begin to process requests received during the winter break.
Best wishes for a happy and safe holiday season!
*New Publications*
(1) ACL Anthology Reference Corpus
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T29>
is a digital archive of 10,921 research papers in computational
linguistics sponsored by the Association for Computational Linguistics
(ACL). Also available from the ACL <http://acl-arc.comp.nus.edu.sg/>,
this release contains most of the papers that appear up to February 2007
in the web-based ACL Anthology <http://aclweb.org/anthology-new/>, a
dynamic repository that currently hosts over 16,500 articles drawn from
a range of conferences and workshops as well as past issues of the
/Computational Linguistics/ journal. The ACL Anthology Reference Corpus
is designed to be a standard, real-world digital collection testbed for
experiments in bibliographic and bibliometric research.
The ACL is the international scientific and professional society for
scholars working on problems involving natural language and computation.
Membership includes the ACL quarterly journal, /Computational
Linguistics/, reduced registration at most ACL-sponsored conferences,
discounts on ACL-sponsored publications and participation in ACL Special
Interest Groups. Since 1988, /Computational Linguistics/ has been the
primary forum for research on computational linguistics and natural
language processing.
The material in the ACL Anthology Reference Corpus was scanned at 600dpi
grayscale for archival storage, down-sampled to 300dpi black-and-white,
assembled into articles and stored in the PDF Image with Hidden Text
format. Author and title metadata was extracted from the OCRed text and
used to build HTML index pages. Older materials, such as conference
proceedings from the 1960s and early volumes of /Computational
Linguistics/, were manually digitized from microfiche slides.
ACL Anthology Reference Corpus includes:
* 10,921 PDF files in the pdf/anthology-PDF tree
* 13,551 files with metadata described in the metadata/anthology-XML tree
* 84,542 pages in the PDF files
(2) Arabic Gigaword Fourth Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T30>
is a comprehensive archive of Arabic newswire text that has been
acquired over several years at LDC. Arabic Gigaword Fourth Edition
includes all of the content of Arabic Gigaword Third Edition
(LDC2007T40)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T40>
as well as newly-collected data. In addition, three new sources have
been added in the fourth edition: Al-Ahram, Asharq Al-Awsat and Al-Quds
Al-Arabi.
Nine distinct international sources of Arabic newswire are represented here:
* Al-Ahram (ahr_arb)
* Asharq Al-Awsat (aaw_arb)
* Agence France Presse (afp_arb)
* Assabah (asb_arb)
* Al Hayat (hyt_arb)
* An Nahar (nhr_arb)
* Al-Quds Al-Arabi (qds_arb)
* Ummah Press (umh_arb)
* Xinhua News Agency (xin_arb)
The seven-character codes shown above serve as both the names of the
directories where the data files are found and the prefix that appears
at the beginning of every file name. Each code consists of a
three-character source ID and the three-character language code
("arb"), separated by an underscore ("_").
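As an illustration of the naming scheme, the following minimal Python sketch splits a file name into its source ID and language code and looks up the source name. The source IDs and names are those listed above; the function name parse_prefix and the sample file name are hypothetical, since the newsletter specifies only the seven-character prefix and not the rest of the file name.

```python
import re

# Source IDs and names as listed in the newsletter.
SOURCE_NAMES = {
    "ahr": "Al-Ahram",
    "aaw": "Asharq Al-Awsat",
    "afp": "Agence France Presse",
    "asb": "Assabah",
    "hyt": "Al Hayat",
    "nhr": "An Nahar",
    "qds": "Al-Quds Al-Arabi",
    "umh": "Ummah Press",
    "xin": "Xinhua News Agency",
}

# Three-character source ID, an underscore, three-character language code.
PREFIX_PATTERN = re.compile(r"^([a-z]{3})_([a-z]{3})")


def parse_prefix(file_name):
    """Return (source_id, language_code, source_name) for a file name."""
    match = PREFIX_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"no seven-character prefix in {file_name!r}")
    source_id, language_code = match.groups()
    if source_id not in SOURCE_NAMES:
        raise ValueError(f"unknown source ID {source_id!r}")
    return source_id, language_code, SOURCE_NAMES[source_id]


# Hypothetical file name: only the "afp_arb" prefix comes from the newsletter.
print(parse_prefix("afp_arb_example.gz"))
```

Because the directory names use the same codes, the same function could be applied to a directory name such as "qds_arb".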
These news services all use Modern Standard Arabic (MSA), so there
should be a fairly limited scope for orthographic and lexical variation
due to regional Arabic dialects.
New in the Fourth Edition
* New Sources
This release marks the first edition of Arabic Gigaword to include
content from Al-Ahram, Asharq Al-Awsat and Al-Quds Al-Arabi covering the
period from November 2006 through December 2008.
* New Data for Existing Sources
This release contains all data collected by LDC from January 2007
through December 2008, except for Ummah Press, for which data from
January 2005 through December 2008 is included.
The table below shows data quantity by source under the following
categories: data source (Source); the number of files per source
(#Files); compressed file size (Gzip-MB); uncompressed file size
(Totl-MB); the number of space-separated word tokens in the text
(K-words); and the number of documents per source (#DOCs).
Source    #Files  Gzip-MB  Totl-MB  K-words    #DOCs
aaw_arb       26      114      386    36694    87506
afp_arb      176      530     1979   184631   930656
ahr_arb       26      114      131    42265   107187
asb_arb       52       45      149    14322    32794
hyt_arb      166      663     2224   209318   448335
nhr_arb      157      784     2662   253559   557151
qds_arb       26       62      198    18996    49352
umh_arb       68      9.3       31     2995    11350
xin_arb       91      245      890    85689   492664
Totals       788     5018     8650   848469  2716995
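
The word and document counts can be combined into a rough average document length per source. The Python sketch below is an illustration only: the figures are copied from the K-words and #DOCs columns of the table, the function name words_per_document is hypothetical, and it assumes that K-words counts thousands of space-separated word tokens.

```python
# (K-words, #DOCs) per source, copied from the table in the newsletter.
TABLE = {
    "aaw_arb": (36694, 87506),
    "afp_arb": (184631, 930656),
    "ahr_arb": (42265, 107187),
    "asb_arb": (14322, 32794),
    "hyt_arb": (209318, 448335),
    "nhr_arb": (253559, 557151),
    "qds_arb": (18996, 49352),
    "umh_arb": (2995, 11350),
    "xin_arb": (85689, 492664),
}

# Totals row as printed in the table.
TOTAL_K_WORDS = 848469
TOTAL_DOCS = 2716995


def words_per_document(k_words, docs):
    """Average words per document, assuming K-words means thousands of words."""
    return k_words * 1000 / docs


# Check that the per-source figures add up to the printed totals.
sum_k_words = sum(k_words for k_words, _ in TABLE.values())
sum_docs = sum(docs for _, docs in TABLE.values())
print("K-words match totals row:", sum_k_words == TOTAL_K_WORDS)
print("#DOCs match totals row:", sum_docs == TOTAL_DOCS)

# Report the average document length for each source.
for source, (k_words, docs) in sorted(TABLE.items()):
    average = words_per_document(k_words, docs)
    print(f"{source}: {average:.1f} words per document")
```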
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu