Hi all,<div><br></div><div>I wonder whether there is a measure for assessing the heterogeneity of a particular corpus, for example from a semantic or structural point of view. The background to my question is my LinkedIn contribution (see below). I would appreciate it if you would share your ideas. Thanks in advance!</div>

<div><br></div><div>Regards</div><div>Alexander</div><div><br></div><div>-------------------------------------------</div><div><h3 class="groups title" style="margin:0px;padding:0px;border:0px;outline:0px;font-weight:inherit;font-size:16px;font-family:Arial,Helvetica,'Nimbus Sans L',sans-serif;vertical-align:baseline;color:rgb(51,51,51)">

Genre identification vs. opinion mining</h3><p class="summary" style="margin:0px;padding:0px;border:0px;outline:0px;font-size:13px;font-family:Arial,Helvetica,'Nimbus Sans L',sans-serif;vertical-align:baseline">Hi,<br>

<br>You may know that Marina Santini is working on automatic genre identification; I explored opinion mining in my thesis. Recently, I made a controversial statement: it doesn't matter much which classification algorithm you use -- the significant issue is feature extraction. In my experiments, I found that NaiveBayes or SVM is good enough -- sometimes NaiveBayes is better, sometimes SVM, but these are the two usual suspects for classification. What significantly changes classification results is feature extraction.<br>

<br>Since genre identification and opinion mining are similar tasks (the data are texts and the result is obtained through statistical analysis), I asked Marina to give me her data (<a target="blank" href="http://www.linkedin.com/redirect?url=http%3A%2F%2Fwww%2Enltg%2Ebrighton%2Eac%2Euk%2Fhome%2FMarina%2ESantini%2F&urlhash=7wkh&_t=tracking_anet" rel="nofollow" style="margin:0px;padding:0px;border:0px;outline:none;font-style:inherit;font-family:inherit;vertical-align:baseline;text-decoration:none;color:rgb(0,102,153)">http://www.nltg.brighton.ac.uk/home/Marina.Santini/</a>) to test whether "my" features give results on genre classification similar to those they gave in opinion mining. For simplicity, I extracted only stopwords.<br>

<br>I performed a brief analysis of Marina's corpus. I used my InfoFramework to process the data, which contain 1400 HTML files corresponding to 7 genres -- BLOG, ESHOP, FAQS, FRONTPAGE, LISTING, PHP, SPAGE. I built my dataset automatically, extracting features that correspond to 526 stopwords in WEKA. I compared my results with those for Marina's dataset, which were 89% recall and 89.07% precision using SMO, and 67.14% recall and 68.86% precision using NaiveBayes.<br>

<br>The main news: the corpus is very unusual. I have already analyzed several corpora, and the result was always about three times the chance level. So for a corpus with 9 classes, the classification result was about 3 x 11.1% = 33.3%.<br>

<br>In the case of Marina's corpus, it is different. The results using SMO were an unexpected 65.27% recall and 71.27% precision, which is about five times the chance level. The results using NaiveBayes were similar: 55.79% recall and 59.67% precision. I optimized my dataset using FFS and obtained 56.57% (-8.72%) recall and 65.64% precision using SMO, and 59.64% (+3.85%) recall and 64.4% precision using NaiveBayes. Although I didn't think I could optimize Marina's dataset, I ran FFS optimization over it as well. For SMO, I got 87.79% (-1.21%) recall and 87.9% precision. Incredibly, for NaiveBayes I got a significant improvement, to 80.14% (+13%) recall and 79.77% precision -- I even repeated the classification manually in the WEKA GUI.<br>

<br>The classification results for my dataset are amazingly high if we consider that my features are extracted in groups and Marina's are not. I assume it is even possible to improve the classification results further. In my opinion, such an improvement may result from the composition of the corpus; however, I would appreciate it if you told me your opinion.<br>

<br>Alexander</p></div>
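<div><br></div><div>For readers who want to check the "times chance" figures quoted above, here is a minimal sketch of the arithmetic. It assumes the chance level is uniform random guessing over equally frequent classes (the post does not state the class distribution), and the function names <code>chance_level</code> and <code>times_chance</code> are hypothetical, chosen only for illustration.</div>

```python
def chance_level(num_classes: int) -> float:
    """Percent score expected from uniform random guessing.

    Assumption: all classes are equally frequent; the post does not
    state the actual class distribution of either corpus.
    """
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    return 100.0 / num_classes


def times_chance(score_percent: float, num_classes: int) -> float:
    """How many times above the chance level a percent score lies."""
    return score_percent / chance_level(num_classes)


# Scores quoted in the post, paired with the number of classes.
quoted = {
    "earlier corpora, 9 classes": (33.3, 9),
    "SMO recall, 7 genres": (65.27, 7),
    "SMO precision, 7 genres": (71.27, 7),
    "NaiveBayes recall, 7 genres": (55.79, 7),
}

for label, (score, k) in quoted.items():
    ratio = times_chance(score, k)
    print(f"{label}: {score}% is {ratio:.2f} x chance ({chance_level(k):.1f}%)")
```

<div>Under that assumption, chance for 7 genres is about 14.3%, so the SMO figures fall between roughly 4.6 and 5 times chance, in line with the "about five times" reading in the post.</div>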