[Corpora-List] implementing LDA

Toyin Popoola toyin_net at yahoo.com
Thu Sep 16 02:37:42 UTC 2010


Hi All,
I really appreciate the post by Marco Baroni with links to LDA implementations.
My challenge is in developing my own corpus for application to activity recognition. The 'words' in my case are features extracted from video, and the 'documents' are video clips.

1. What files will I need to provide as input to these LDA codes, and what is the data format?
I came across things like "LDA-C format" at Blei's site, where he says:
"The data is a file where each line is of the form:
[M] [term_1]:[count] [term_2]:[count] ...  [term_N]:[count]
where [M] is the number of unique terms in the document, and the
[count] associated with each term is how many times that term appeared
in the document."
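
To check my reading of that description, here is a minimal Python sketch that parses one such line. The function name parse_ldac_line and the example line are my own, purely for illustration, and I am assuming each term is given as an integer index into a vocabulary rather than as a string.

```python
def parse_ldac_line(line):
    """Parse one line of the quoted format into a {term_index: count} dict.

    Assumption: terms are integer indices, not strings.
    """
    fields = line.split()
    num_unique = int(fields[0])          # [M]: number of unique terms in this document
    counts = {}
    for pair in fields[1:]:
        term, count = pair.split(":")    # each field looks like [term]:[count]
        counts[int(term)] = int(count)
    if len(counts) != num_unique:
        raise ValueError("[M] does not match the number of term:count pairs")
    return counts


# Hypothetical document with 3 unique terms: term 0 twice, term 4 once, term 7 five times.
print(parse_ldac_line("3 0:2 4:1 7:5"))
```

If this reading is right, every line of the file describes exactly one document, and terms with a zero count are simply left out.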

I got confused because my understanding so far is that the 'words' are rows while the 'documents' are columns. Row i is therefore a vector holding the occurrences of word i in each document j.
But the expression "where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document" makes it sound like the rows of the matrix are documents while the columns are the words.
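
If the file really is one line per document, then a words-by-documents count matrix like mine would have to be read column by column. The following Python sketch shows what I think that conversion looks like; the function name term_doc_matrix_to_ldac and the small matrix are hypothetical, and I am assuming integer word indices starting at 0.

```python
def term_doc_matrix_to_ldac(matrix):
    """Turn a words-by-documents count matrix into one text line per document.

    matrix[i][j] is the number of times word i occurs in document j.
    Assumption: words are identified by their row index, starting at 0.
    """
    num_docs = len(matrix[0]) if matrix else 0
    lines = []
    for j in range(num_docs):
        # Keep only the words that actually occur in document j.
        pairs = [(i, row[j]) for i, row in enumerate(matrix) if row[j] > 0]
        fields = [str(len(pairs))] + [f"{i}:{c}" for i, c in pairs]
        lines.append(" ".join(fields))
    return lines


# Hypothetical counts: 4 'words' (video features) in rows, 3 'documents' (clips) in columns.
counts = [
    [2, 0, 1],
    [0, 3, 0],
    [1, 1, 0],
    [0, 0, 4],
]
for line in term_doc_matrix_to_ldac(counts):
    print(line)
```

Under these assumptions the first clip comes out as "2 0:2 2:1", that is, two unique words, word 0 seen twice and word 2 seen once.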

PLEASE can someone help me clarify this? I would really appreciate it if just a piece of any corpus that is already in the LDA-C format could be sent to my mail so I can use it as a template, including any other files I need to specify.
Thanks
Toyin Popoola
toyinpopoola at ieee.org
HEU



      