Hi Michal,

Thanks for your comments. Very interesting!

1. Social aspects. You have to consider that the reviews on Amazon are written by different authors, each with their own style of writing. Moreover, you have to consider different cultural backgrounds; for example, Americans and Englishmen use different words to express the same things, and Goethe used different words than a truck driver does. How can a classifier calculate the weight of a lexical feature if that feature is not present in the analyzed text?
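
To make the point concrete, here is a minimal sketch (not code from EmoText; the words and weights are invented): in a linear bag-of-words classifier, a learned weight only influences the decision if the corresponding lexical feature actually occurs in the analyzed text.

    # Hypothetical weights learned for the class "positive review"
    weights = {"brilliant": 1.2, "masterpiece": 0.9, "awful": -1.5}

    def score(tokens, weights):
        # Only features that occur in the text contribute; absent features add 0.0
        return sum(weights.get(token, 0.0) for token in tokens)

    print(score("the plot was brilliant".split(), weights))       # 1.2
    print(score("quite a decent film overall".split(), weights))  # 0.0 -- no overlap with the learned vocabulary

If a reviewer's vocabulary never overlaps with the features learned from other authors, the classifier effectively sees an empty text.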

In my demo, the author is an American, James Berardinelli. He has his own style of expressing opinions; another person would do it differently. In the case of Amazon reviewers, several people express their opinions about the same thing. Hence, the weights in statistical classifiers can be deceptive, because they are calculated over a community of different reviewers. I assume you either have to compose individual datasets for authors of each cultural background, or you have to use a majority or average vote to calculate a general vote.
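
By a general vote I mean something like the following sketch (the ratings are invented, not taken from my corpora):

    from collections import Counter

    # Hypothetical star ratings given by several reviewers to the same product
    ratings = [5, 4, 5, 2, 5]

    majority_vote = Counter(ratings).most_common(1)[0][0]  # the most frequent rating
    average_vote = round(sum(ratings) / len(ratings))      # the mean rating, rounded

    print(majority_vote)  # 5
    print(average_vote)   # 4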

Moreover, the datasets I used for learning are composed of grammatically correct texts, not of weblogs with their typical characteristics such as repetitions and so on. I describe the differences in more detail in my thesis. For example, I assume that POS tagging with TreeTagger works better on literary texts.

2. Sparse data. The datasets underlying my demo contain 215 instances for a 9-class problem. That is not much, which is why your impression and mine that the probabilistic NaiveBayes performs better may well be correct. It is in any case much quicker. An analytical classifier such as an SVM can make use of more texts, but then you have to consider overfitting.
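
For what it is worth, this intuition could be checked roughly as follows (a sketch under assumptions: it uses scikit-learn rather than the toolkit behind my demo, and load_corpus() is a hypothetical loader that returns the 215 review texts and their 9 class labels):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    texts, labels = load_corpus()  # hypothetical loader: 215 documents, 9 classes

    for clf in (MultinomialNB(), LinearSVC()):
        pipeline = make_pipeline(CountVectorizer(), clf)
        # 5-fold cross-validated accuracy; with ~24 instances per class,
        # the simpler probabilistic model often holds up surprisingly well
        scores = cross_val_score(pipeline, texts, labels, cv=5)
        print(type(clf).__name__, scores.mean())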

Best,
Alexander

2011/12/20 Michal Ptaszynski <ptaszynski@media.eng.hokudai.ac.jp>:
Hi Alexander,

A few comments on the statistical engine.

I tried a couple of reviews from Amazon. Among the different feature sets from 1 to 6, one is always close to Amazon's ranking, but unfortunately it is never one feature set in particular, rather a seemingly random one of the six.

Apart from the closest method, all the others are usually reversed (e.g., if the closest method gives 5 stars, all the others give 1). However, this might just have happened for the couple of examples I tried (reviews of the Kindle on Amazon).

NaiveBayes seems to be hitting closer.

SVMs are very slow or freeze (or perhaps it's just your machine getting busy with the traffic).

Best,

Michal

------------------
From: Alexander Osherenko <osherenko@gmx.de>
To: corpora@uib.no
Date: Mon, 19 Dec 2011 15:43:27 +0100
Subject: Re: [Corpora-List] EmoText - Software for opinion mining and lexical affect sensing

I have published a more comprehensive version of the statistical engine that additionally considers the BNC frequency list. The reason I did not do this before is that processing the BNC needs more computational power and is therefore slower. In the current version, processing is slower but still acceptable.

Hence, I process three sources of lexical features: the corpus frequency list, the BNC frequency list, and Whissell's DAL. In the PhD thesis I additionally used three lemmatized lists, but the performance was not much better, which is why I do not consider them in the current demo version. The stylistic difference: the corpus frequency list is tailored to the corpus and therefore only appropriate for opinion mining in that particular corpus, the BNC frequency list is general, and the DAL contains emotion words. In my opinion, this demo version is useful for answering the question we previously discussed on this mailing list about the types of features and the differences in votes.

The link: www.socioware.de/EmoTextDemoWithBNC