[Corpora-List] Announcement of the Columbia Summarization Corpus

William Y. Wang ww at cmu.edu
Wed Nov 9 16:16:35 UTC 2011


Dear list,

I would like to introduce the recently released Columbia Summarization
Corpus
(freely available at: http://www.cs.columbia.edu/~kathy/Data/CSC.tar.gz).

The Columbia Summarization Corpus (CSC) was retrieved from the output of
the Newsblaster online news summarization system (
http://newsblaster.cs.columbia.edu/) that crawls the Web for news articles,
clusters them on specific topics and produces multidocument summaries for
each cluster. We collected a total of 166,435 summaries containing 2.5
million sentences and covering 2,129 days in the 2003-2011 period. The CSC
can be used for purposes including, but not limited to, the following:

* Event mining
* Language generation
* Summarization
* Information retrieval
* Information extraction
* Sentiment analysis and opinion mining
* Question answering
* Text mining and natural language processing applications
* Language modeling for text processing
* Lexicon and ontology development
* Machine learning (supervised, semi-supervised, and unsupervised learning)

Citation:

William Yang Wang, Kapil Thadani, and Kathleen R. McKeown, "Identifying
Event Descriptions using Co-training with Online News Summaries", in
Proceedings of the 5th International Joint Conference on Natural Language
Processing (IJCNLP 2011), Chiang Mai, Thailand, Nov. 8-13, ACL-AFNLP.
http://www.cs.cmu.edu/~yww/papers/ijcnlp2011.pdf

Additional references on the Columbia Newsblaster summarizer can be found on
the Columbia NLP group publications page
(http://www1.cs.columbia.edu/nlp/papers.cgi).

If you have any further questions, feel free to let me know.

Cheers,
William

-- 
William Y. Wang
School of Computer Science,
Carnegie Mellon University.
http://www.cs.cmu.edu/~yww/