[Corpora-List] Source code of a latent variable model for short texts/sentences

Mon Oct 15 17:48:18 UTC 2012

Dear all,

We are pleased to announce the release of the Weighted Textual Matrix  
Factorization (WTMF) source code.

WTMF is a latent variable model that extracts nuanced and robust latent
vectors for short texts/sentences, such as tweets, SMS data, and short
forum posts/comments.  To overcome the sparsity problem in short
texts/sentences (e.g. 10 words in a sentence), we explicitly model the
missing words, a feature that LSA/LDA typically overlooks.
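
As a rough illustration of what "modeling the missing words" can look like,
here is a minimal sketch of a weighted matrix factorization fitted by
alternating least squares.  It is not the released package: the function
names (weighted_mf, weighted_loss), the toy sentences, the use of raw counts
in the matrix, and the values of w_missing and lam are all assumptions made
for this example.  The idea shown is that zero cells of the word-by-sentence
matrix are kept in the objective with a small weight instead of being dropped.

```python
import numpy as np

def weighted_loss(X, W, P, Q, lam):
    """Weighted squared reconstruction error plus L2 regularization."""
    err = X - P.T @ Q
    return float((W * err**2).sum() + lam * ((P**2).sum() + (Q**2).sum()))

def weighted_mf(X, dim=2, w_missing=0.01, lam=1.0, iters=20, seed=0):
    """Factorize X (words x sentences) as P.T @ Q with per-cell weights.

    Observed cells get weight 1; missing (zero) cells get the small weight
    w_missing, so they still constrain the latent vectors.
    All default values are illustrative placeholders.
    """
    rng = np.random.default_rng(seed)
    n_words, n_sents = X.shape
    W = np.where(X != 0, 1.0, w_missing)
    P = rng.normal(scale=0.1, size=(dim, n_words))   # word vectors
    Q = rng.normal(scale=0.1, size=(dim, n_sents))   # sentence vectors
    reg = lam * np.eye(dim)
    for _ in range(iters):
        # Update each word vector with the sentence vectors held fixed.
        for i in range(n_words):
            Qw = Q * W[i]                 # scale column j by weight W[i, j]
            P[:, i] = np.linalg.solve(Qw @ Q.T + reg, Qw @ X[i])
        # Update each sentence vector with the word vectors held fixed.
        for j in range(n_sents):
            Pw = P * W[:, j]              # scale column i by weight W[i, j]
            Q[:, j] = np.linalg.solve(Pw @ P.T + reg, Pw @ X[:, j])
    return P, Q, W

# Toy bag-of-words matrix built from three hypothetical short texts.
sentences = ["the cat sat", "a cat sat down", "stocks fell sharply"]
vocab = sorted({w for s in sentences for w in s.split()})
X = np.zeros((len(vocab), len(sentences)))
for j, s in enumerate(sentences):
    for w in s.split():
        X[vocab.index(w), j] += 1.0

P, Q, W = weighted_mf(X)
print(Q.T)   # one latent vector per sentence
```

Each column of Q is the latent vector of one sentence; similarity between
two short texts could then be computed, for instance, as the cosine of their
columns.  Setting w_missing to 0 would ignore the missing words entirely,
while setting it to 1 would treat them like observed words.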

The features of the model are:
1. An unsupervised approach.
2. A simple model -- only bag-of-words features for sentences/short  
texts.
3. No additional data required, and no specific format/genre
required.  In contrast, other approaches often use metadata such as
author/hashtag to help infer the topics of tweets.

The package can be downloaded from:
http://www.cs.columbia.edu/~weiwei

A detailed description of WTMF is provided in the following
publication:
@INPROCEEDINGS{Guo+Diab:12,
         AUTHOR    = {Weiwei Guo and Mona Diab},
         TITLE     = {Modeling Sentences in the Latent Space},
         BOOKTITLE = {Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics},
         YEAR      = {2012}
}
http://www.aclweb.org/anthology-new/P/P12/P12-1091v2.pdf

If you have any questions, feel free to send an email to:
weiwei at cs.columbia.edu

Weiwei Guo
Columbia University
http://www.cs.columbia.edu/~weiwei
