[Corpora-List] New Corpora from the LDC

Fri Jun 22 19:13:23 UTC 2007

LDC2007T22
2001 Topic Annotated Enron Email Data Set
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T22>

LDC2007T03
Tagged Chinese Gigaword
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T03>

The Linguistic Data Consortium (LDC) is pleased to announce the
availability of two new publications.

------------------------------------------------------------------------

New Publications

(1) The 2001 Topic Annotated Enron Email Data Set 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T22>
contains 4,936 emails from Enron Corporation (Enron), manually
indexed into 32 topics. It is a subset of the original Enron Email Data 
Set of 1.5 million emails that was posted on the Federal Energy 
Regulatory Commission website 
<http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp#> 
as a matter of public record during the investigation of Enron. The 
original set suffered from document integrity problems; attempts were 
made to improve the quality of the data and to remove some sensitive and 
private information. Dr. William Cohen of Carnegie Mellon University 
<http://www.cs.cmu.edu/%7Eenron> took the lead in distributing the 
improved corpus, consisting of 517,431 Enron employee emails that 
covered the period 1999-2002.

This corpus is a subset of the Carnegie Mellon data set and covers the 
period from January 2001 to December 2001. The email topics reflect the 
business activities and interests of Enron employees in that year: 
California energy problems and the subsequent state and Federal 
investigations, Enron's downfall (newsfeeds and interoffice 
communications), Enron's venture with the Dabhol India Power Company, 
EnronOnline (Enron's trading infrastructure), competitors (Dynegy, El 
Paso Pipeline) and even fantasy football and college football. The 
manual indexing was performed in the summer of 2006 by two people who 
worked closely together.

Having an annotated subset such as this one should provide text mining 
researchers with a way to evaluate the accuracy of new algorithms for 
clustering and classification. This data set can also be used to provide 
communication context for researchers using the Enron Email Data Set in 
social network analysis: the annotation can be used to characterize the 
topics discussed between individuals and groups within a social 
network of Enron employees.
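
As a rough illustration of the evaluation use mentioned above, the sketch below scores a clustering against manual topic labels using cluster purity. Purity is only one of several possible measures and is chosen here for brevity; the function name, the toy cluster ids and the topic names are all hypothetical and are not taken from the corpus.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, topic_labels):
    """Fraction of items that carry the majority topic of their cluster."""
    if len(cluster_ids) != len(topic_labels):
        raise ValueError("inputs must have the same length")
    if not cluster_ids:
        raise ValueError("inputs must be non-empty")
    # Count how often each manual topic occurs inside each cluster.
    by_cluster = defaultdict(Counter)
    for cluster, topic in zip(cluster_ids, topic_labels):
        by_cluster[cluster][topic] += 1
    # Sum the size of the largest topic group in every cluster.
    majority_total = sum(max(c.values()) for c in by_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six emails, two clusters, made-up topic names.
clusters = [0, 0, 0, 1, 1, 1]
topics = ["california", "california", "dabhol",
          "football", "football", "football"]
score = purity(clusters, topics)
```

A score of 1.0 would mean every cluster contains emails from a single manual topic; lower values indicate clusters that mix topics.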

***

(2) Tagged Chinese Gigaword 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T03> 
is the part-of-speech tagged version of the LDC's Chinese Gigaword 
Second Edition LDC2005T14. It contains all of the data in Chinese 
Gigaword Second Edition -- from Central News Agency (Taiwan), Xinhua 
News Agency and Lianhe Zaobao -- annotated with full part-of-speech tags.

All sources have been categorized into four distinct "types":

    * *story*: This type of DOC represents a coherent report on a
      particular topic or event, consisting of paragraphs and full
      sentences.
    * *multi*: This type of DOC contains a series of unrelated "blurbs,"
      each of which briefly describes a particular topic or event;
      examples include "summaries of today's news," "news briefs in ..."
      (some general area like finance or sports), and so on.
    * *advis*: These are DOCs which the news service addresses to news
      editors; they are not intended for publication to the "end users."
    * *other*: These DOCs clearly do not fall into any of the above
      types; they include items such as lists of sports scores, stock
      prices, temperatures around the world, and so on.
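
To see how a file is distributed across the four types above, something like the following sketch could tally them. It assumes each document opens with a tag of the form <DOC id="..." type="story">; that markup is an assumption and should be checked against the documentation shipped with the release. The sample ids and the helper name are hypothetical.

```python
import re
from collections import Counter

# Assumed markup: <DOC id="..." type="story">. Verify against the
# corpus documentation before relying on this pattern.
DOC_OPEN = re.compile(r'<DOC\b[^>]*\btype="([^"]+)"')


def count_doc_types(text):
    """Return a Counter mapping each DOC type to its number of documents."""
    return Counter(DOC_OPEN.findall(text))


# Hypothetical sample with made-up document ids and no real content.
sample = """<DOC id="EXAMPLE_0001" type="story">
</DOC>
<DOC id="EXAMPLE_0002" type="multi">
</DOC>
<DOC id="EXAMPLE_0003" type="story">
</DOC>"""
counts = count_doc_types(sample)
```

Filtering on the type value in the same way would let a user keep only "story" documents, for example, when full sentences and paragraphs are needed.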

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu
