[Corpora-List] EmoText - Software for opinion mining and lexical affect sensing
Alexander Osherenko
osherenko at gmx.de
Tue Dec 20 15:10:01 UTC 2011
Hi Michal,
thanks for your comments. Very interesting!
1. Social aspects. You have to consider that the reviews from Amazon are
composed by different authors who each have their own style of writing.
Moreover, you have to consider different cultural backgrounds: for example,
Americans and Englishmen use different words to express the same things. Goethe
used different words than a truck driver does. How can a classifier calculate a
weight for a lexical feature if this lexical feature is not present in the
analyzed text?
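
To make the vocabulary problem concrete, here is a minimal sketch in Python. It measures what share of a text's tokens appear in a feature lexicon at all. The function name `lexicon_coverage`, the tiny lexicon, and both sample sentences are illustrative assumptions; they are not taken from EmoText or its datasets.

```python
import re

def lexicon_coverage(text, lexicon):
    """Return the fraction of tokens in `text` that occur in `lexicon`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)

# Hypothetical lexicon, as if learned from one American author's reviews.
lexicon = {"movie", "truck", "great", "awful"}

american = "The movie about a truck was great"
british = "The film about a lorry was brilliant"

print(lexicon_coverage(american, lexicon))
print(lexicon_coverage(british, lexicon))
```

In this toy case the second sentence has zero coverage, so a classifier relying on this lexicon would have no lexical evidence to weigh in it.
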
In my demo, the author is the American James Berardinelli. He has his own
style of expressing opinions. Another person would do it in a different manner.
In the case of Amazon reviewers, there are several people who express their
opinions about the same thing. Hence, the weights in the statistical
classifiers can be deceptive because they are calculated for a community of
different reviewers. I assume you have to compose individual datasets for
persons of each cultural background, or you have to use a majority or average
vote to calculate a general vote.
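
The two ways of combining votes can be sketched as follows. This is only an illustration in Python: the function names, the tie-breaking rule, and the sample star ratings are assumptions made for this sketch, not part of the demo.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most frequent label; ties go to the lowest label (an arbitrary choice)."""
    counts = Counter(votes)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)

def average_vote(votes):
    """Return the arithmetic mean of numeric votes, e.g. star ratings."""
    return sum(votes) / len(votes)

# Hypothetical star ratings from classifiers trained on separate datasets.
votes = [5, 1, 1, 1, 1, 1]

print(majority_vote(votes))
print(average_vote(votes))
```

The majority vote ignores a single dissenting classifier, while the average is pulled toward it; the average only makes sense when the classes are ordered, as star ratings are.
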
Moreover, the datasets I used for learning are composed of
grammatically correct texts, not weblogs with their
characteristics such as repetitions and so on. I describe the differences
in more detail in my thesis. For example, I assume that POS tagging
with TreeTagger works better on literary texts.
2. Sparse data. The datasets that underlie my demo contain 215 instances
for a 9-class problem. That's not much. That's why your feeling and mine
that the probabilistic NaiveBayes performs better can be correct. It is in any
case much quicker. A classifier such as the analytical SVM can use more texts,
but then you have to consider overfitting.
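
A quick calculation shows how thin the data is. The sketch below assumes, purely for illustration, that the 215 instances are spread evenly over the 9 classes and that 10-fold cross-validation is used; neither the real class distribution nor the evaluation setup is stated here.

```python
def instances_per_class(n_instances, n_classes):
    """Average number of instances per class, assuming a uniform distribution."""
    return n_instances / n_classes

def training_instances_per_class(n_instances, n_classes, n_folds):
    """Average per-class training instances in one round of k-fold cross-validation."""
    training_share = (n_folds - 1) / n_folds
    return n_instances * training_share / n_classes

print(round(instances_per_class(215, 9), 1))
print(round(training_instances_per_class(215, 9, 10), 1))
```

Under these assumptions each class has roughly 24 instances, of which about 21 or 22 are available for training in any one fold.
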
Best
Alexander
2011/12/20 Michal Ptaszynski <ptaszynski at media.eng.hokudai.ac.jp>
> Hi Alexander,
>
> A few comments on the statistical engine.
>
> I tried a couple of reviews from Amazon. Among the different feature sets from
> 1 to 6, one is always close to Amazon's ranking, but unfortunately it's
> never one feature set in particular, but rather a random one of the six.
>
> Besides the closest method, all others are usually reversed (e.g., if the
> closest method gives 5 stars, all others give 1). However, this might have
> just happened for the couple of examples I tried (reviews of the Kindle on Amazon).
>
> NaiveBayes seems to be hitting closer.
>
> SVMs are very slow or freeze (or perhaps it's just your machine getting
> busy with the traffic).
>
> Best,
>
> Michal
>
>
> ------------------
> From: Alexander Osherenko <osherenko at gmx.de>
> To: corpora at uib.no
> Date: Mon, 19 Dec 2011 15:43:27 +0100
> Subject: Re: [Corpora-List] EmoText - Software for opinion mining and
> lexical affect sensing
>
>
> I published a more comprehensive version of the statistical engine that
> additionally considers the BNC frequency list. The reason I didn't do
> it previously: BNC processing needs more computational power and is
> therefore slower. In the current version, processing is indeed slower but
> still acceptable.
>
> Hence, I process three sources of lexical features: the corpus frequency
> list, the BNC, and Whissell's DAL. In the PhD thesis, I additionally had three
> lemmatized lists, but the performance was not much better, which is why I don't
> consider them in the current demo version. Stylistic difference: the
> frequency list is possibly tailored to the corpus and therefore only
> appropriate for opinion mining in that corpus, the BNC frequency list
> is general, and the DAL contains emotion words. In my opinion, this demo
> version is useful for answering the question we discussed previously on this
> mailing list about the types of features and differences in votes.
>
> The link: http://www.socioware.de/EmoTextDemoWithBNC
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>