[Corpora-List] New Corpora from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Fri Jun 22 19:13:23 UTC 2007
LDC2007T22
2001 Topic Annotated Enron Email Data Set
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T22>
LDC2007T03
Tagged Chinese Gigaword
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T03>

The Linguistic Data Consortium (LDC) is pleased to announce the
availability of two new publications.
------------------------------------------------------------------------
New Publications
(1) The 2001 Topic Annotated Enron Email Data Set
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T22> contains
4,936 emails from Enron Corporation (Enron), manually
indexed into 32 topics. It is a subset of the original Enron Email Data
Set of 1.5 million emails that was posted on the Federal Energy
Regulatory Commission website
<http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp#>
as a matter of public record during the investigation of Enron. The
original set suffered from document integrity problems; attempts were
made to improve the quality of the data and to remove some sensitive and
private information. Dr. William Cohen of Carnegie Mellon University
<http://www.cs.cmu.edu/%7Eenron> took the lead in distributing the
improved corpus, consisting of 517,431 Enron employee emails that
covered the period 1999-2002.
This corpus is a subset of the Carnegie Mellon data set and covers the
period from January 2001 to December 2001. The email topics reflect the
business activities and interests of Enron employees in that year:
California energy problems and the subsequent state and Federal
investigations, Enron's downfall (newsfeeds and interoffice
communications), Enron's venture with the Dabhol India Power Company,
EnronOnline (Enron's trading infrastructure), competitors (Dynegy, El
Paso Pipeline) and even fantasy football and college football. The
manual indexing was performed in the summer of 2006 by two people who
worked closely together.
Having an annotated subset such as this one should provide text mining
researchers with a way to evaluate the accuracy of new algorithms for
clustering and classification. This data set can also be used to provide
communication context for researchers using the Enron Email Data Set in
social network analysis. This annotation can be used to qualify the
discussion topics between individuals and groups comprising a social
network of Enron employees.
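
As a rough illustration of the evaluation use mentioned above, the sketch below scores a clustering against manually assigned topics using purity (the share of emails that carry the majority topic of their cluster). It is only a minimal example under stated assumptions: this announcement does not describe the file format of the release, so the function takes plain Python lists, and the topic names and cluster ids in the toy data are hypothetical. Purity is one common measure among several, not necessarily the one a given study would choose.

```python
from collections import Counter, defaultdict


def cluster_purity(gold_topics, predicted_clusters):
    """Return the fraction of items that carry their cluster's majority gold topic."""
    if len(gold_topics) != len(predicted_clusters):
        raise ValueError("label lists must have the same length")
    if not gold_topics:
        raise ValueError("need at least one labeled item")

    # Count the gold topics that occur inside each predicted cluster.
    topics_by_cluster = defaultdict(Counter)
    for topic, cluster in zip(gold_topics, predicted_clusters):
        topics_by_cluster[cluster][topic] += 1

    # Each cluster is credited with the size of its most frequent gold topic.
    majority_total = sum(max(counts.values()) for counts in topics_by_cluster.values())
    return majority_total / len(gold_topics)


# Hypothetical toy data: these topic names and cluster ids are made up
# for illustration and are not taken from the actual annotation.
gold = ["california", "california", "dabhol", "football", "football", "dabhol"]
pred = [0, 0, 1, 2, 2, 0]
print(cluster_purity(gold, pred))
```

Purity rewards many small clusters, so it is usually reported alongside the number of clusters or a complementary measure.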
***
(2) Tagged Chinese Gigaword
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T03>
is the part-of-speech tagged version of the LDC's Chinese Gigaword
Second Edition LDC2005T14. It contains all of the data in Chinese
Gigaword Second Edition -- from Central News Agency (Taiwan), Xinhua
News Agency and Lianhe Zaobao -- annotated with full part-of-speech tags.
Documents from all sources have been categorized into four distinct "types":
  * *story*: This type of DOC represents a coherent report on a
    particular topic or event, consisting of paragraphs and full
    sentences.
  * *multi*: This type of DOC contains a series of unrelated "blurbs,"
    each of which briefly describes a particular topic or event;
    examples include "summaries of today's news," "news briefs in ..."
    (some general area like finance or sports), and so on.
  * *advis*: These are DOCs which the news service addresses to news
    editors; they are not intended for publication to the "end users."
  * *other*: These DOCs clearly do not fall into any of the above
    types; they include items such as lists of sports scores, stock
    prices, temperatures around the world, and so on.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium        Phone: (215) 573-1275
University of Pennsylvania        Fax: (215) 573-2175
3600 Market St., Suite 810        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu