[Corpora-List] InfoFramework

Tue Apr 10 10:25:06 UTC 2012

Hi George,

In my thesis on opinion mining I considered features that are better known
from authorship attribution and assessed their impact on the classification
of opinions in texts. I call these features stylometric (the group of
stylometric features). They comprise word-length features, letter features,
the standard deviation of sentence lengths, the standard deviation of word
lengths, and digrams. I extracted these stylometric features for opinion
mining in 4 textual corpora.
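
As a rough illustration of the feature group listed above, here is a minimal
Python 3 sketch. It is not the thesis code: the function names, the
regex-based tokenization, the naive sentence splitter and the use of the
population standard deviation are all assumptions made for the example.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def split_sentences(text):
    # Naive splitter on ., ! and ? (an assumption, not the thesis method).
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def words_of(text):
    # Words are runs of letters and apostrophes, lower-cased.
    return re.findall(r"[a-z']+", text.lower())


def stylometric_features(text):
    # Assumes the text contains at least one word and one sentence.
    words = words_of(text)
    word_lengths = [len(w) for w in words]
    sentence_lengths = [len(words_of(s)) for s in split_sentences(text)]
    letters = Counter(c for w in words for c in w if c.isalpha())
    # Digrams here are pairs of adjacent characters inside a word.
    digrams = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    return {
        "mean_word_length": mean(word_lengths),
        "std_word_length": pstdev(word_lengths),
        "std_sentence_length": pstdev(sentence_lengths),
        "letter_counts": letters,
        "digram_counts": digrams,
    }
```

Calling `stylometric_features(review_text)` on one review returns a
dictionary with one entry per feature; the counters can be turned into
fixed-length vectors once a letter and digram inventory has been chosen.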

However, the framework works with any type of information, not only with
texts. In the generation step it uses instances of custom classes and a
custom generator function to generate datasets. These classes and the
function are, of course, modality-DEPENDENT in every sense and can be
added or removed as necessary. After generation, the processing is fully
modality-INDEPENDENT.

For example, in opinion mining the framework generates datasets by
combining an instance of a custom class for word lengths, an instance of a
custom class for sentence lengths, an instance of a custom class for the
standard deviation of lengths, etc. The generator function returns
sequences of the information to be analyzed (in my case, movie reviews).
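
The split between modality-dependent feature classes plus generator and a
modality-independent generation step could look roughly like the Python 3
sketch below. The interface name `Feature`, the method `extract`, the two
feature classes, `review_generator` and `build_dataset` are hypothetical
names chosen for illustration, and the two reviews are toy strings; the
actual interface of the framework may differ.

```python
from abc import ABC, abstractmethod
import re


class Feature(ABC):
    """Hypothetical interface that every custom feature class implements."""

    name = "feature"

    @abstractmethod
    def extract(self, segment):
        """Return a list of numeric values for one information segment."""


class MeanWordLength(Feature):
    name = "mean_word_length"

    def extract(self, segment):
        words = re.findall(r"[A-Za-z']+", segment)
        return [sum(len(w) for w in words) / len(words)] if words else [0.0]


class MeanSentenceLength(Feature):
    name = "mean_sentence_length"

    def extract(self, segment):
        sentences = [s for s in re.split(r"[.!?]+", segment) if s.strip()]
        counts = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
        return [sum(counts) / len(counts)] if counts else [0.0]


def review_generator():
    # Modality-dependent part: yields (segment, label) pairs (toy data).
    yield "Great film. I loved it.", "positive"
    yield "Dull plot and weak acting.", "negative"


def build_dataset(features, generator):
    # Modality-independent part: it only talks to the Feature interface.
    rows = []
    for segment, label in generator():
        vector = [value for f in features for value in f.extract(segment)]
        rows.append((vector, label))
    return rows
```

With these definitions, `build_dataset([MeanWordLength(),
MeanSentenceLength()], review_generator)` yields one feature vector per
review; swapping in other feature classes or another generator leaves
`build_dataset` unchanged.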

Hence, if I want to work with new data, I write custom feature classes
that implement a specific class interface and can be considered parts of a
data mining instance. The custom generator function supplies the
information to process. For instance, to work with neurobiological data I
implement neurobiological classes that represent features for brain regions
and a generator function that returns sequences of useful neurobiological
information segments. In opinion mining these classes and the generator
function were often very small, so pilot studies of new corpora required
very little time, because steps such as evaluating, fusing, and optimizing
the datasets are already implemented. In your case, you would write a
PERL-Jython wrapper class and let the framework do its work.
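
One possible shape for such a wrapper is sketched below: a feature class
that passes a text segment to an external script and reads numbers back.
This is only an assumption about how the wrapping could be done. The class
name `ExternalScriptFeature`, the `extract` method and the script name in
the comment are hypothetical, and the sketch is written for CPython 3.7+
(`subprocess.run` with `capture_output`), so it would need adapting to run
under Jython.

```python
import subprocess


class ExternalScriptFeature:
    """Hypothetical feature class that delegates counting to a script."""

    def __init__(self, name, command):
        self.name = name
        # Command as an argument list, e.g. ["perl", "count_ngrams.pl"]
        # (the script name is purely illustrative).
        self.command = list(command)

    def extract(self, segment):
        # Send the segment on stdin and expect whitespace-separated
        # numbers on stdout; a non-zero exit status raises an error.
        result = subprocess.run(
            self.command,
            input=segment,
            capture_output=True,
            text=True,
            check=True,
        )
        return [float(token) for token in result.stdout.split()]
```

The existing scripts stay untouched in this design; only the convention
"text in on stdin, numbers out on stdout" has to be agreed on.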

Next, what does the framework actually do with custom classes? The
framework instantiates them and composes all non-empty combinations of
class instances, which gives (2 to the power N) - 1 combinations, where N
is the number of features. In an exhaustive study, you create one dataset
for every combination of features. For instance, for 5 stylometric
features you generate 31 datasets, one per feature combination.
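
The count follows from summing the binomial coefficients C(N, k) for
k = 1..N, which gives 2^N - 1. A short Python 3 sketch of the enumeration,
using placeholder feature names taken from the list of stylometric
features (the function name `feature_combinations` is hypothetical):

```python
from itertools import combinations


def feature_combinations(features):
    """Yield every non-empty subset of `features` as a tuple."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            yield subset


names = [
    "word_lengths",
    "letters",
    "std_sentence_length",
    "std_word_length",
    "digrams",
]
subsets = list(feature_combinations(names))
print(len(subsets))  # 2**5 - 1 = 31
```

Each tuple in `subsets` would correspond to one generated dataset. The
number of subsets doubles with every added feature, so an exhaustive study
is only practical for a small N.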

You can normalize feature values. BTW, what sort of normalization do you
mean in your email? The framework relies on information segments, for
example, sequences of movie reviews. You can also derive useful
information, such as n-gram-related information, and store it globally.
For example, in opinion mining on movie reviews I keep the BNC frequency
list as a global variable. I considered normalizing feature values by the
length of a sentence in words or of a word in characters. However, such
normalization wasn't beneficial (see the thesis).
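
To make the two kinds of length normalization concrete, here is a small
Python 3 sketch. The two raw features (a count of long words per sentence
and a count of vowels per word) are invented for the example and are not
taken from the thesis; only the idea of dividing by sentence length in
words or by word length in characters comes from the text above.

```python
def normalize_by_length(value, length):
    """Divide a raw feature value by a length; return 0.0 for empty units."""
    return value / length if length else 0.0


def long_word_rate(sentence, min_chars=6):
    # Raw count of "long" words, normalized by sentence length in words.
    words = sentence.split()
    raw = sum(1 for w in words if len(w.strip(".,!?")) >= min_chars)
    return normalize_by_length(raw, len(words))


def vowel_rate(word):
    # Raw count of vowels, normalized by word length in characters.
    raw = sum(1 for c in word.lower() if c in "aeiou")
    return normalize_by_length(raw, len(word))
```

Normalizing for text length, as asked in the original question, would
follow the same pattern with the total number of tokens or characters of
the text as the divisor.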

I hope this answers your questions.

Best
Alexander

2012/4/9 Georgios Mikros <gmikros at isll.uoa.gr>

> Alexander Hi,
>
> Your framework seems very interesting. What kind of features can be
> counted? My research focuses in authorship attribution and I use many
> different scripts in PERL for counting many different feature sets for my
> experiments. I was wondering whether your framework can count and normalize
> for text length character and word ngrams.
>
> Best
>
> George Mikros
>
> ------------------------------------------------------------------------------
>
> Dr. George K. Mikros,
> Associate Professor of Computational Linguistics and Quantitative
> Linguistics
> Department of Italian Language and Literature
> School of Philosophy,
> National and Kapodistrian University of Athens
> Panepistimioupoli Zografou, GR 15784
> Athens
> Greece
> Tel/Fax: +30 210 6511344
> Email: gmikros at isll.uoa.gr
> Web: http://users.uoa.gr/~gmikros/