[Corpora-List] Corpus heterogeneity

Mon Nov 5 14:32:21 UTC 2012

Hi all,

I wonder whether there is a measure for assessing the heterogeneity of a
particular corpus, for example from a semantic or structural point of view.
The background to my question is my LinkedIn contribution (see below). I
would appreciate it if you would share your ideas. Thanks in advance!

Regards
Alexander

-------------------------------------------
Genre identification vs. opinion mining

Hi,

You may know that Marina Santini is working on automatic genre
identification; I explored opinion mining in my thesis. Recently, I made a
controversial statement: it doesn't matter much which classification
algorithm you use; the significant issue is feature extraction. In my
experiments, I found that NaiveBayes and SVM are both good enough. Sometimes
NaiveBayes is better, sometimes SVM, but these two are the usual suspects
for classification. What significantly changes classification results is
feature extraction.

Since genre identification and opinion mining are similar tasks (the data
are texts and the result is obtained through statistical analysis), I asked
Marina to give me her data
(http://www.nltg.brighton.ac.uk/home/Marina.Santini/) to test whether "my"
features give results on genre classification similar to those they gave in
opinion mining. For simplicity, I extracted only stopwords.

I performed a brief analysis of Marina's corpus. I used my InfoFramework to
process the data, which consist of 1400 HTML files corresponding to 7 genres:
BLOG, ESHOP, FAQS, FRONTPAGE, LISTING, PHP, SPAGE. I built my dataset
automatically by extracting features that correspond to 526 stopwords in
WEKA. I compared my results with those for Marina's dataset, where SMO gave
89% recall and 89.07% precision, and NaiveBayes gave 67.14% recall and
68.86% precision.
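
The stopword feature extraction described above can be sketched roughly as
follows. This is only an illustration under stated assumptions, not the
InfoFramework code: the six-word stopword list, the regex-based tag
stripping, and the use of relative frequencies as feature values are all
hypothetical choices made for the example.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list; the post used 526 stopwords.
STOPWORDS = ["the", "a", "of", "and", "to", "in"]

def strip_html(html):
    """Crudely remove tags; real HTML should go through a proper parser."""
    return re.sub(r"<[^>]+>", " ", html)

def stopword_features(html, stopwords=STOPWORDS):
    """Return the relative frequency of each stopword in the document."""
    tokens = re.findall(r"[a-z']+", strip_html(html).lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty documents
    return [counts[w] / total for w in stopwords]

doc = "<html><body><p>The price of the book and the CD</p></body></html>"
features = stopword_features(doc)
print(dict(zip(STOPWORDS, features)))
```

Each document becomes one fixed-length vector (one value per stopword),
which is the kind of input a classifier such as NaiveBayes or SMO expects.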

The main news: the corpus is very unusual. I have already analyzed several
corpora, and the result was always about three times chance level. So for a
corpus with 9 classes, the classification result was about 3 x 11.1% = 33.3%.
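
The chance-level arithmetic can be checked with a few lines of code. The
sketch assumes that "choice by chance" means uniform random guessing over
equally likely classes; the function names are made up for the example.

```python
def chance_level(num_classes):
    """Accuracy (in %) of uniform random guessing over num_classes classes."""
    return 100.0 / num_classes

def times_chance(score_percent, num_classes):
    """How many times better than uniform random guessing a score is."""
    return score_percent / chance_level(num_classes)

# 9 classes: chance is about 11.1%, three times chance is about 33.3%.
print(round(chance_level(9), 1), round(3 * chance_level(9), 1))

# 7 genres: chance is about 14.3%, three times chance is about 42.9%.
print(round(chance_level(7), 1), round(3 * chance_level(7), 1))
```

If the classes are not balanced, the majority-class rate would be a
stricter baseline than 100/N.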

With Marina's corpus, things are different. The results using SMO were
unexpectedly high: 65.27% recall and 71.27% precision, which is about five
times chance level. The results using NaiveBayes were almost the same:
55.79% recall and 59.67% precision. I optimized my dataset using FFS and
obtained 56.57% (-8.72%) recall and 65.64% precision using SMO, and 59.64%
(+3.85%) recall and 64.4% precision using NaiveBayes. Although I didn't
expect to be able to optimize Marina's dataset, I ran the FFS optimization
over it as well. For SMO, I got 87.79% (-1.21%) recall and 87.9% precision.
Surprisingly, NaiveBayes improved significantly, to 80.14% (+13%) recall and
79.77% precision; I even repeated the classification manually in the WEKA
GUI.
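
For readers unfamiliar with this kind of optimization, here is a minimal
sketch of greedy forward feature selection. It assumes that "FFS" stands for
forward feature selection, which the post does not spell out, and it is not
the procedure actually used in the experiments. The scoring function and the
usefulness values are hypothetical; in practice the score would be something
like the cross-validated recall of the classifier on the chosen features.

```python
def forward_feature_selection(features, score):
    """Greedily add the feature that most improves score(subset).

    Stops as soon as no remaining feature improves the current best score.
    """
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no single feature helps any more
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best

# Toy scorer (hypothetical numbers): "the" and "of" help, "and" hurts.
USEFULNESS = {"the": 0.30, "of": 0.20, "and": -0.05}

def toy_score(subset):
    return 0.14 + sum(USEFULNESS[f] for f in subset)

selected, best = forward_feature_selection(["and", "of", "the"], toy_score)
print(selected, round(best, 2))
```

Because the search is driven by whatever score is supplied, a feature subset
chosen this way can help one classifier and hurt another.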

The classification results for my dataset are surprisingly high considering
that my features are extracted in groups and Marina's are not. I assume it
is even possible to improve the classification results further. In my
opinion, such an improvement may be a result of the corpus composition;
however, I would appreciate it if you told me your opinion.

Alexander