[Corpora-List] Distributional and Morphological Word Clustering

Siddhartha Jonnalagadda sid.kgp at gmail.com
Sun Feb 12 16:51:35 UTC 2012


Hi Manaal,

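Since you have only one document, build a positional index, which derives
word vectors from co-occurrence within a sliding window of text rather than
from term-document counts:
http://code.google.com/p/semanticvectors/wiki/PositionalIndexes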
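
For intuition about what a sliding-window (positional) approach captures,
here is a minimal Python sketch. It is not the Semantic Vectors
implementation: the function name cooccurrence_vectors, the window size, and
the toy sentence are all made up for illustration, and it stores raw counts
rather than the reduced vectors the package produces. Python 3 strings are
Unicode, so the same code works for non-ASCII text.

```python
from collections import Counter, defaultdict


def cooccurrence_vectors(tokens, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the focus word itself
                vectors[word][tokens[j]] += 1
    return vectors


# Toy corpus (hypothetical); a real corpus would be read with encoding="utf-8".
tokens = "the cat sat on the mat the dog sat on the rug".split()
vectors = cooccurrence_vectors(tokens, window=2)
print(vectors["cat"])
print(vectors["dog"])
```

Words that appear in similar contexts ("cat" and "dog" above both occur near
"sat" and "on") end up with overlapping counts, which is what lets a
clustering step group them.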

Once you have the vectors, you can cluster them with the simple K-means
implementation in the ClusterResults class:
http://semanticvectors.googlecode.com/svn/javadoc/latest-stable/pitt/search/semanticvectors/ClusterResults.html

The last time I used it, I had to modify it for my own purpose.
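
If you end up writing or adapting the clustering step yourself, the basic
K-means loop is short. The following is a plain-Python sketch under my own
assumptions (Euclidean distance, random initial centroids drawn from the
data, a fixed number of iterations); it is not the ClusterResults code, and
the names kmeans, points, and labels are illustrative only.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster equal-length numeric vectors; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Two well-separated toy groups (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

For word vectors, cosine similarity is a common alternative to Euclidean
distance; normalising each vector to unit length before clustering gives a
similar effect with the code above.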

If you want to use a different clustering program, such as WEKA, you first
need to convert the vectors to text format using VectorStoreTranslater:
http://semanticvectors.googlecode.com/svn/javadoc/latest-stable/pitt/search/semanticvectors/VectorStoreTranslater.html
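
WEKA reads ARFF files, so after translating the vectors to text you would
still need a small conversion step. Below is a rough Python sketch that
assumes you have already parsed the text output into a dict of word to list
of floats (the parsing is not shown, since the exact text layout is not
described here). The function name to_arff, the relation name, and the
attribute names dim0, dim1, ... are my own choices.

```python
def to_arff(vectors, relation="wordvectors"):
    """Render {word: [floats]} as ARFF text: one string attribute plus numeric dims."""
    dim = len(next(iter(vectors.values())))
    lines = [f"@RELATION {relation}", "", "@ATTRIBUTE word STRING"]
    lines += [f"@ATTRIBUTE dim{i} NUMERIC" for i in range(dim)]
    lines += ["", "@DATA"]
    for word, vec in vectors.items():
        # Quote the word and escape backslashes and single quotes.
        quoted = "'" + word.replace("\\", "\\\\").replace("'", "\\'") + "'"
        lines.append(",".join([quoted] + [repr(float(x)) for x in vec]))
    return "\n".join(lines) + "\n"


# Hypothetical two-dimensional vectors, just to show the output shape.
arff = to_arff({"cat": [0.5, 1.0], "dog": [0.25, 2.0]})
print(arff)
```

Write the result with open(path, "w", encoding="utf-8") to keep non-ASCII
words intact. You may need to filter out the word attribute inside WEKA
before clustering, since clusterers typically work on numeric or nominal
attributes.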

Hope this helps.

Sincerely,
Siddhartha Jonnalagadda, Ph.D.
sjonnalagadda.wordpress.com




On Sun, Feb 12, 2012 at 3:02 AM, manaal faruqui <manaalfar at gmail.com> wrote:

> I am unable to find out how to make TermVectors. This package seems to
> take care of the term-document vectors, but I need something that
> captures word co-occurrence etc., so that similar words end up in a single
> group after clustering. Also, I have only one document as the whole corpus.
>
> M
>
>
> On Sun, Feb 12, 2012 at 12:10 AM, manaal faruqui <manaalfar at gmail.com> wrote:
>
>> Hi Siddhartha,
>>
>> I have installed the semanticvectors package and Lucene, and I have
>> indexed my corpus (a single document, around 330 MB) using Lucene. Now I
>> need to form a vector corresponding to every word and then cluster the
>> vectors using k-means.
>>
>> Can you let me know the required command for this?
>>
>> Thanks a lot,
>> Manaal
>>
>> On Sat, Feb 11, 2012 at 8:36 AM, Siddhartha Jonnalagadda <
>> sid.kgp at gmail.com> wrote:
>>
>>> Hi Manaal,
>>>
>>> The Semantic Vectors (code.google.com/p/semanticvectors) package
>>> assigns vectors to individual words, and you can then use K-means or an
>>> algorithm of your choice to cluster them. Sahlgren's dissertation
>>> (WordSpace...) talks about creating clusters. I have done that in my
>>> dissertation too (link on my webpage). Contact me if you would like more
>>> details.
>>>
>>> Sincerely,
>>> Siddhartha Jonnalagadda, Ph.D.
>>> sjonnalagadda.wordpress.com
>>>
>>>
>>>
>>>
>>> On Sat, Feb 11, 2012 at 5:23 AM, manaal faruqui <manaalfar at gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I need software (even a raw piece of code) that can cluster words
>>>> from a large untagged corpus into groups using their distributional and
>>>> morphological similarity.
>>>> One such program is provided by Alexander Clark (
>>>> http://www.cs.rhul.ac.uk/home/alexc/), but his code works only for
>>>> ASCII characters. I have used it before and it works pretty well.
>>>>
>>>> I need something that works with Unicode text.
>>>> I can manage even if the software doesn't take morphological
>>>> information into account.
>>>>
>>>> Thanks!
>>>> Manaal Faruqui
>>>> IIT Kharagpur, India
>>>>
>>>> _______________________________________________
>>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>>> Corpora mailing list
>>>> Corpora at uib.no
>>>> http://mailman.uib.no/listinfo/corpora
>>>>
>>>>
>>>
>>
>