22.2068, Sum: Genre-Specific Corpora

Fri May 13 19:19:40 UTC 2011

LINGUIST List: Vol-22-2068. Fri May 13 2011. ISSN: 1068 - 4875.

Subject: 22.2068, Sum: Genre-Specific Corpora

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews: Veronika Drake, U of Wisconsin-Madison
         Monica Macaulay, U of Wisconsin-Madison
         Rajiv Rao, U of Wisconsin-Madison
         Joseph Salmons, U of Wisconsin-Madison
         Anja Wanner, U of Wisconsin-Madison
         <reviews at linguistlist.org>

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, 
and donations from subscribers and publishers.

Editor for this issue: Danielle St. Jean <danielle at linguistlist.org>
================================================================  

===========================Directory==============================  

1)
Date: 11-May-2011
From: Marina Santini [MarinaSantini.MS at gmail.com]
Subject: Genre-Specific Corpora

-------------------------Message 1 ---------------------------------- 
Date: Fri, 13 May 2011 15:17:48
From: Marina Santini [MarinaSantini.MS at gmail.com]
Subject: Genre-Specific Corpora

Query for this summary posted in LINGUIST Issue: 22.1852                                                                                                                                               

Editor's Note: Please note that some URLs included in this submission 
may carry over onto a second line. If you want to go to a specific 
website provided in this submission, please be sure that you have 
copied the whole URL.

Many thanks to Laura Christopherson, Cohan Sujay Carlos, Vineet 
Yadav, Jason Teeple, Leslie Barrett, Joakim Nordström, Bob Kuhns, 
Dong Wang, Dave Lewis, John Tait, and Loredana Cerrato.

Suggested Corpora and Resources in English if not stated otherwise 
(not all of them are free of charge)

Genre-specific corpora:
- Genre: SMS messages = NUS SMS corpus: 
http://wing.comp.nus.edu.sg:8080/SMSCorpus/ (English / Chinese)

- Genre: chatlogs = CODIAC chatlogs 
(http://data.eol.ucar.edu/codiac/dss/id=92.124; 
http://data.eol.ucar.edu/codiac/dss/id=88.044; 
http://data.eol.ucar.edu/codiac/dss/id=107.010) 

- Genre: chatlogs = Many Eyes datasets: some chatlogs can be found 
here: 
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets

- Genre: chats and Switchboard conversations = 
Samples of the Switchboard corpus and the NPS Chat corpus are 
included in NLTK data 
(http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml). The NPS 
Chat corpus (http://faculty.nps.edu/cmartell/NPSChat.htm) is a 
POS-tagged chat corpus, and the Switchboard corpus 
(http://spot.colorado.edu/~michaeli/Lexsubj/swbd.html) is a corpus of 
telephone conversations.

- The Linguistic Data Consortium has a good deal of telephone 
conversation data: many files in a variety of languages. See 
http://www.ldc.upenn.edu/Catalog/byType.jsp#lexicon,%20speech,%20text 
(not for free)

- Genre: blogs = The Corporate weblogs dataset in TREC datasets 
(http://ir.dcs.gla.ac.uk/test_collections/) is not for free. Helpful wiki: 
http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG

- Genre: corporate blogs = It is possible to pull corporate blog feeds 
or scrape the blogs from this list: 
http://www.debbieweil.com/blog/list-of-67-big-brand-corporate-blogs/

- The Göteborg Spoken Language Corpus and other corpora in 
Swedish (http://spraakbanken.gu.se/)

- Genre: tweets = The Twitter corpus associated with the paper 
www.stanford.edu/~alecmgo/papers/TwitterDistantSupervision09.pdf is 
here: https://sites.google.com/site/twittersentimenthelp/for-researchers

- Genre: tweets and other microblogs = MicroBlog track: 
http://sites.google.com/site/trecmicroblogtrack/ (not for free)

- Genre: newswires = Reuters newswire collections: 
http://trec.nist.gov/data/reuters/reuters.html

- Genre: emails = Enron corpus (http://www.cs.cmu.edu/~enron/); 
categorized Enron emails (http://sgi.nu/enron/corpora.php)

- Genre: emails = Junk email corpus 
(http://clg.wlv.ac.uk/resources/junk-emails/index.php)

- Genre: FAQs = 200 FAQs  
(http://www.itri.brighton.ac.uk/~Marina.Santini/#Download)
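
For the e-mail corpora listed above, a first processing step is usually 
to split each message into headers and body. The following is a minimal 
sketch using only the Python standard library. It assumes that each 
message is stored as a plain-text file with RFC 822-style headers; this 
should be checked against the copy of the corpus actually downloaded. 
The sample message and the helper name parse_message are invented for 
illustration and are not taken from any corpus.

```python
from email import message_from_string

# Hypothetical message, invented for illustration only.
SAMPLE = """\
Message-ID: <12345.example@example.com>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: alice@example.com
To: bob@example.com, carol@example.com
Subject: Meeting notes

Here are the notes from today's meeting.
Regards,
Alice
"""


def parse_message(raw: str) -> dict:
    """Split one raw e-mail into a few header fields and the body text."""
    msg = message_from_string(raw)
    # The To header may hold several comma-separated addresses.
    recipients = [
        addr.strip()
        for addr in (msg.get("To") or "").split(",")
        if addr.strip()
    ]
    return {
        "from": msg.get("From"),
        "to": recipients,
        "subject": msg.get("Subject"),
        # Assumes a single-part, plain-text message (payload is a string).
        "body": msg.get_payload().strip(),
    }


record = parse_message(SAMPLE)
print(record["subject"], len(record["to"]))
```

To process a whole corpus, one would read each message file into a 
string and pass it to the same function; multipart messages would need 
extra handling.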

Resources: 
- For words and concepts, there are two main resources for English. 
The first is WordNet, originally from Princeton; it is included in NLTK 
(and can also be obtained separately). It organizes English words 
according to their relationships: synonymy, hyponymy, part-whole 
relations, etc. The other resource is the Word Association Norms, 
available from the University of South Florida 
(http://w3.usf.edu/FreeAssociation/).
- Article: Hella Koo Finding: Twitter Dialect - 
http://blogs.wsj.com/ideas-market/2011/02/08/hella-koo-finding-twitter-dialect/ 
- Genre: tweets = The suggestion is to use the Twitter API to crawl a 
Twitter dataset. 
- DiscoverText is a program that makes it easy to collect Twitter 
feeds. Its website is here: 
http://discovertext.com/defaultDT2.aspx  
A free 30-day trial is available and can be used to collect a batch of 
Twitter messages.
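
Both the NPS Chat samples listed under the genre-specific corpora and 
WordNet are distributed with NLTK data, so they can be explored from 
Python. The sketch below is only illustrative: it assumes NLTK is 
installed and that the "nps_chat" and "wordnet" data packages can be 
downloaded. The helper names (load_nps_chat_posts, tag_frequencies, 
wordnet_relations) are made up for this example.

```python
from collections import Counter


def load_nps_chat_posts():
    """Return POS-tagged posts from NLTK's NPS Chat sample.

    Assumes NLTK is installed; downloads the data package if missing.
    """
    import nltk
    from nltk.corpus import nps_chat

    nltk.download("nps_chat", quiet=True)
    return nps_chat.tagged_posts()


def tag_frequencies(tagged_posts):
    """Count POS tags over an iterable of posts of (word, tag) pairs."""
    counts = Counter()
    for post in tagged_posts:
        counts.update(tag for _word, tag in post)
    return counts


def wordnet_relations(word):
    """Look up a few WordNet relations for the first sense of a word."""
    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet", quiet=True)
    synsets = wn.synsets(word)
    if not synsets:
        return {}
    first = synsets[0]
    return {
        "synonyms": first.lemma_names(),
        "hyponyms": [s.name() for s in first.hyponyms()],
        "part_meronyms": [s.name() for s in first.part_meronyms()],
    }


# Usage (requires NLTK and its data, so it is left commented out):
# print(tag_frequencies(load_nps_chat_posts()).most_common(5))
# print(wordnet_relations("tree"))
```

The tag counter works on any list of tagged posts, so it can be tried 
on a small hand-made sample before the real data is downloaded.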

Note: 
Genre: tweets = The Edinburgh Tweets corpus has been withdrawn: 
http://demeter.inf.ed.ac.uk/ 

Linguistic Field(s): Computational Linguistics
                     Text/Corpus Linguistics
