[Corpora-List] Source code of a latent variable model for short texts/sentences
Weiwei Guo
weiwei at cs.columbia.edu
Mon Oct 15 17:48:18 UTC 2012
Dear all,
We are pleased to announce the release of the Weighted Textual Matrix
Factorization (WTMF) source code.
WTMF is a latent variable model to extract nuanced and robust latent
vectors for short texts/sentences, such as tweets, SMS data, and short
forum posts/comments. To overcome the sparsity problem in short
texts/sentences (e.g. 10 words in a sentence), we explicitly model the
missing words, a feature that LSA/LDA typically overlooks.
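To illustrate what "modeling the missing words" can mean, here is a
minimal sketch of a weighted matrix factorization over a word-by-text
count matrix. It is not the released WTMF code. The function names, the
missing-word weight of 0.01, the regularization strength, and the use of
plain gradient descent are all assumptions made for this example; see
the publication cited in this message for the actual formulation.

```python
import random

def bag_of_words(texts):
    """Build a word-by-text count matrix X (rows: words, columns: texts)."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    row_of = {w: i for i, w in enumerate(vocab)}
    X = [[0.0] * len(texts) for _ in vocab]
    for j, text in enumerate(texts):
        for w in text.lower().split():
            X[row_of[w]][j] += 1.0
    return vocab, X

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_loss(X, P, Q, missing_weight, reg):
    """Weighted squared error plus L2 penalty.

    Observed cells (count > 0) get weight 1; missing cells (count 0)
    get the small weight `missing_weight`, so they still say, weakly,
    what the text is NOT about.
    """
    loss = reg * (sum(dot(p, p) for p in P) + sum(dot(q, q) for q in Q))
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            w = 1.0 if x != 0 else missing_weight
            loss += w * (dot(P[i], Q[j]) - x) ** 2
    return loss

def factorize(X, dim=2, missing_weight=0.01, reg=0.1,
              lr=0.02, steps=500, seed=0):
    """Batch gradient descent on weighted_loss (an assumed optimizer).

    Returns P (one latent vector per word) and Q (one per text).
    """
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in X]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in X[0]]
    for _ in range(steps):
        # Start each gradient with the L2 penalty term.
        grad_P = [[2 * reg * v for v in p] for p in P]
        grad_Q = [[2 * reg * v for v in q] for q in Q]
        for i, row in enumerate(X):
            for j, x in enumerate(row):
                w = 1.0 if x != 0 else missing_weight
                err = 2 * w * (dot(P[i], Q[j]) - x)
                for k in range(dim):
                    grad_P[i][k] += err * Q[j][k]
                    grad_Q[j][k] += err * P[i][k]
        for i in range(len(P)):
            for k in range(dim):
                P[i][k] -= lr * grad_P[i][k]
        for j in range(len(Q)):
            for k in range(dim):
                Q[j][k] -= lr * grad_Q[j][k]
    return P, Q

texts = [
    "the cat sat on the mat",
    "a cat on a mat",
    "stock prices fell sharply today",
]
vocab, X = bag_of_words(texts)
P, Q = factorize(X)
print("final loss:", weighted_loss(X, P, Q, 0.01, 0.1))
```

Each row of Q is the latent vector of one short text and can be
compared with another, for example by cosine similarity. Setting
missing_weight to 0 would ignore the missing words entirely, while
setting it to 1 would treat every absent word as strongly as an
observed one.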
The features of the model are:
1. An unsupervised approach.
2. A simple model -- only bag-of-words features for sentences/short
texts.
3. No additional data required, and no specific format/genre
required. In contrast, other approaches use metadata such as
author/hashtag to help infer the topics of tweets.
The package can be downloaded from:
http://www.cs.columbia.edu/~weiwei
A detailed description of WTMF is provided in the following
publication:
@INPROCEEDINGS {Guo+Diab:12,
AUTHOR = {Weiwei Guo and Mona Diab},
TITLE = {Modeling Sentences in the Latent Space},
BOOKTITLE = {Proceedings of the 50th Annual Meeting of the
Association for Computational Linguistics},
YEAR = {2012},
}
http://www.aclweb.org/anthology-new/P/P12/P12-1091v2.pdf
If you have any questions, feel free to send an email to:
weiwei at cs.columbia.edu
Weiwei Guo
Columbia University
http://www.cs.columbia.edu/~weiwei