[Corpora-List] Announcement of the Columbia Summarization Corpus

William Y. Wang yww at cs.cmu.edu
Wed Nov 9 15:24:44 UTC 2011


Dear list,

I would like to introduce you to the recently released Columbia
Summarization Corpus
(freely available at: http://www.cs.columbia.edu/~kathy/Data/CSC.tar.gz).

The Columbia Summarization Corpus (CSC) was retrieved from the output of
the Newsblaster online news summarization system, which crawls the Web for
news articles, clusters them by topic, and produces multidocument
summaries for each cluster. We collected a total of 166,435 summaries
containing 2.5 million sentences and covering 2,129 days in the 2003-2011
period. The CSC can be used for, but is not limited to, the following
purposes:

* Event mining
* Language generation
* Summarization
* Information retrieval
* Information extraction
* Sentiment analysis and opinion mining
* Question answering
* Text mining and natural language processing applications
* Language modeling for text processing
* Lexicon and ontology development
* Machine learning (supervised, semi-supervised, and unsupervised learning)

Citation: William Yang Wang, Kapil Thadani, and Kathleen R. McKeown,
"Identifying Event Descriptions using Co-training with Online News
Summaries", in Proceedings of the 5th International Joint Conference on
Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, Nov. 8-13,
ACL-AFNLP.  http://www.cs.cmu.edu/~yww/papers/ijcnlp2011.pdf

If you have any further questions, feel free to let me know.

Cheers,
William