[Corpora-List] Corpus heterogeneity

Bill Louw louwfirth at yahoo.com
Mon Nov 5 18:42:24 UTC 2012


I think you have answered your own question. You now face the task of stating the answer in quasi-philosophical terms. Stop-words are so frequent that their domain is mostly the domain of logic. You are dealing with the preferred logic of different genres. You now need to answer Michael Dummett's question, posed in his 1976, pre-computational article 'Is logic empirical?'. Your work is a non-lexical contribution to genre studies, except insofar as strings of grammar words have lexical collocates that refuse to be ignored. It's subtextual genre studies.

Bill Louw 

--- On Mon, 5/11/12, Alexander Osherenko <osherenko at gmx.de> wrote:

From: Alexander Osherenko <osherenko at gmx.de>
Subject: [Corpora-List] Corpus heterogeneity
To: "Corpora at uib.no" <corpora at uib.no>
Date: Monday, 5 November, 2012, 14:32

Hi all,
I wonder if there is a measure to assess the heterogeneity of a particular corpus, for example from a semantic or structural point of view. The background of my question is the question raised in my LinkedIn contribution (see below). I would appreciate it if you would share your ideas. Thanks in advance!
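As a rough sketch of what such a measure could look like (one possible operationalisation, not an established standard), one could take the mean pairwise Jensen-Shannon divergence between the word distributions of the individual documents; the scikit-learn and SciPy calls below are stand-ins for whatever toolkit is actually used:

```python
# Sketch: corpus heterogeneity as the mean pairwise Jensen-Shannon divergence
# between per-document word distributions. Illustrative only; the tokenisation
# and the toy documents are assumptions.
from itertools import combinations
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import jensenshannon

def corpus_heterogeneity(texts):
    counts = CountVectorizer().fit_transform(texts).toarray().astype(float)
    probs = counts / counts.sum(axis=1, keepdims=True)   # per-document word distributions
    pairs = combinations(range(len(texts)), 2)
    return float(np.mean([jensenshannon(probs[i], probs[j]) for i, j in pairs]))

print(corpus_heterogeneity(["the cat sat on the mat",
                            "stock prices fell sharply today",
                            "the dog sat on the rug"]))   # higher value = more heterogeneous
```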

Regards,
Alexander
-------------------------------------------
Genre identification vs. opinion mining

Hi,


You may know that Marina Santini is working on automatic genre identification; I was exploring opinion mining in my thesis. Recently, I made a controversial statement: it does not matter much which algorithm of automatic identification you use -- the significant issue is feature extraction. In fact, in my experiments I found that NaiveBayes or SVM are good enough -- sometimes NaiveBayes is better, sometimes SVM, but these are always the two usual suspects that are frequently used for classification. What significantly changes the classification results is feature extraction.
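A minimal sketch of that comparison, with toy data standing in for the real corpora (scikit-learn's MultinomialNB and LinearSVC used here as analogues of WEKA's NaiveBayes and SMO):

```python
# Sketch: hold the feature extraction step constant and swap only the
# classifier. The toy texts and labels are placeholders for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = ["great product, fast delivery", "terrible service, never again",
         "loved it, highly recommend", "awful quality, waste of money"] * 10
labels = ["pos", "neg", "pos", "neg"] * 10

X = CountVectorizer().fit_transform(texts)        # the feature extraction step
for clf in (MultinomialNB(), LinearSVC()):
    scores = cross_val_score(clf, X, labels, cv=5)
    print(type(clf).__name__, round(scores.mean(), 3))
```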


Since genre identification and opinion mining are similar tasks (the data are texts and the result is obtained through statistical analysis), I asked Marina for her data (http://www.nltg.brighton.ac.uk/home/Marina.Santini/) to test whether I would get similar results on genre classification using "my" features, as was the case in opinion mining. For simplicity, I extracted only stopwords.
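For illustration, "extracting only stopwords" could look roughly like the sketch below, assuming scikit-learn's built-in English stopword list as a stand-in for the 526-item list actually used in WEKA:

```python
# Sketch: count only stopword occurrences and ignore all content words.
# ENGLISH_STOP_WORDS is an assumption standing in for the 526-word list.
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

stopword_vectorizer = CountVectorizer(vocabulary=sorted(ENGLISH_STOP_WORDS))
X = stopword_vectorizer.fit_transform(
    ["This is the page that I was looking for, and here it is."])
print(X.toarray().sum())   # number of stopword tokens counted; content words contribute nothing
```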


I performed a brief analysis of Marina's corpus. I used my InfoFramework to process the data, which contain 1400 HTML files corresponding to 7 genres -- BLOG, ESHOP, FAQS, FRONTPAGE, LISTING, PHP, SPAGE. I built my dataset in WEKA by automatically extracting features that correspond to 526 stopwords. I compared the obtained results with Marina's dataset, where the results using SMO were 89% recall and 89.07% precision, and using NaiveBayes 67.14% recall and 68.86% precision.
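The kind of evaluation behind those numbers can be sketched as follows; scikit-learn's cross_validate with weighted recall and precision stands in for the WEKA runs, and the random matrix is only a placeholder for the real 1400 x 526 stopword table:

```python
# Sketch: 10-fold cross-validated weighted recall/precision for an SVM
# (analogue of WEKA's SMO) and NaiveBayes. X and y are random placeholders,
# not Marina's data, so the printed numbers are meaningless by design.
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(140, 526))     # placeholder for docs x 526 stopword counts
y = np.repeat(np.arange(7), 20)             # placeholder for the 7 genre labels

for clf in (LinearSVC(), MultinomialNB()):
    scores = cross_validate(clf, X, y, cv=10,
                            scoring=("recall_weighted", "precision_weighted"))
    print(type(clf).__name__,
          round(scores["test_recall_weighted"].mean(), 3),
          round(scores["test_precision_weighted"].mean(), 3))
```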


The main news: the corpus is very unusual. I have already analyzed several corpora, and the result was always about triple the choice by chance. So for a corpus with 9 classes, the classification result was about 3 x 11.1% = 33.3%.
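A quick check of that arithmetic (assuming roughly uniform class distributions, which the message does not state explicitly):

```python
# Sketch: how far above the chance baseline a given score is, assuming
# uniform class frequencies.
def times_chance(score, n_classes):
    return score * n_classes          # score divided by the 1/n_classes chance baseline

print(times_chance(0.333, 9))         # ~3.0, i.e. roughly triple chance for a 9-class corpus
print(1 / 7)                          # ~0.143, the chance baseline for a 7-class corpus
```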


In the case of Marina's corpus, it is something different. The results using SMO were unexpectedly 65.27% recall and 71.27% precision, which is about five times the choice by chance. The results using NaiveBayes are almost the same: 55.79% recall and 59.67% precision. I optimized my dataset using FFS and obtained 56.57% (-8.72%) recall and 65.64% precision using SMO, and 59.64% (+3.85%) recall and 64.4% precision using NaiveBayes. Although I did not think that I could optimize Marina's dataset, I ran the FFS optimization over her dataset as well. For SMO, I got 87.79% (-1.21%) recall and 87.9% precision. Incredibly, I got a significant improvement for NaiveBayes -- 80.14% (+13%) recall and 79.77% precision -- and I even repeated the classification manually in the WEKA GUI to confirm it.
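For readers unfamiliar with the setup, the sketch below shows what the FFS step could look like, assuming FFS refers to forward feature selection; scikit-learn's SequentialFeatureSelector is used here as a stand-in for the WEKA attribute-selection configuration actually run:

```python
# Sketch: forward feature selection over stopword-count features, assumed
# analogue of the FFS optimization mentioned above. Data are placeholders.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(70, 20))    # placeholder: docs x stopword counts
y = np.repeat(np.arange(7), 10)          # placeholder: 7 genre labels

selector = SequentialFeatureSelector(MultinomialNB(), n_features_to_select=5,
                                      direction="forward",
                                      scoring="recall_weighted", cv=5)
selector.fit(X, y)
print(selector.get_support(indices=True))   # indices of the selected stopword features
```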


The classification results on my dataset are surprisingly high if we consider that my features are extracted in groups and Marina's are not. I assume it is even possible to improve the classification results further. In my opinion, such an improvement could be the result of the corpus composition; however, I would appreciate it if you tell me your opinion.


Alexander


