[Corpora-List] For students: "CL in Action"

Georgios Mikros gmikros at isll.uoa.gr
Thu Nov 4 23:00:57 UTC 2010


Dear Diana,
I recently tried to predict the gender of 14 authors (7 male and 7 female)
in a newspaper corpus of Modern Greek. I used a variety of stylometric
variables, including the 100 most frequent words, word length, letter
frequencies and lexical "richness" indices such as Yule's K, lexical density
and text entropy. The classifier was an artificial neural network
(multilayer perceptron) that took the above variables as input and the
author's gender as output. Under 10-fold cross-validation it reached a
precision of 0.85. The most "useful" category of variables for gender
discrimination was the most frequent words. Interestingly, the words that
predicted male gender included many coordinating conjunctions and
contracted forms of the definite article, whereas the "female" words
included many personal pronouns (e.g. us, we). I have just finished the
analysis and don't have anything written up yet. However, if you are
interested, I could send you a copy once I have written the full paper.
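
For anyone who wants to try something similar, a few of the variables mentioned above could be computed roughly as follows. This is only a minimal sketch, not the code used in the study: the function names are made up for illustration, tokenisation is assumed to have been done already, and Yule's K is taken in its usual form K = 10^4 * (sum_i i^2 * V_i - N) / N^2, where V_i is the number of word types occurring exactly i times and N is the number of tokens.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's K for one text, given as a list of word tokens."""
    n = len(tokens)
    if n == 0:
        raise ValueError("empty text")
    type_freqs = Counter(tokens)             # word -> number of occurrences
    spectrum = Counter(type_freqs.values())  # i -> number of types seen i times
    s2 = sum(i * i * v for i, v in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


def most_frequent_words(corpus, k=100):
    """The k most frequent words over a corpus (a list of token lists)."""
    totals = Counter()
    for doc in corpus:
        totals.update(doc)
    return [word for word, _ in totals.most_common(k)]


def mfw_features(doc_tokens, vocabulary):
    """Relative frequency of each vocabulary word in a single document."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    return [counts[word] / n for word in vocabulary]


def mean_word_length(tokens):
    """Average token length in characters."""
    return sum(len(t) for t in tokens) / len(tokens)
```

One feature vector per text (the relative frequencies plus the scalar indices) could then be passed to any multilayer perceptron implementation and evaluated with 10-fold cross-validation, for example scikit-learn's MLPClassifier together with cross_val_score. That library choice is an assumption made here for illustration only.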
Best
George Mikros
University of Athens
Greece

-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Diana Maynard
Sent: Thursday, November 04, 2010 12:16 PM
To: Adam Kilgarriff
Cc: corpora at uib.no
Subject: Re: [Corpora-List] For students: "CL in Action"

I wondered that as well.

On another note, I guess the success of it depends critically on at 
least two things:
(1) how good the gender guesser is (I didn't see any statistics on that, 
but I didn't search extensively).

(2) (which is related) - the proportion of American names in the Twitter 
corpus (since I think the guesser used is based solely on American first 
names) - and this could have some impact. Even the differences between 
first name gender in the US and Britain are not insignificant.

On a related note, has anyone done the reverse and used vocabulary 
selection to help identify the gender of the speaker, with any success?
I'm sure people must have played with this idea.

I'm interested in techniques to improve person gender recognition - in 
my experience, using pre-built lists of male and female names and simple 
frequency information is often not accurate enough. Again, I haven't 
searched extensively for this, but if anyone happens to know offhand 
about it I'd be interested.
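
To make the name-list approach concrete, here is a minimal sketch of the kind of lookup I have in mind. The function name, the threshold and the counts in TOY_COUNTS are all invented for illustration and are not taken from any real name list.

```python
def guess_gender(first_name, name_counts, min_ratio=0.8):
    """Guess gender from a table of (male_count, female_count) per first name."""
    counts = name_counts.get(first_name.strip().lower())
    if counts is None:
        return "unknown"      # name is not in the list at all
    male, female = counts
    total = male + female
    if total == 0:
        return "unknown"
    if male / total >= min_ratio:
        return "male"
    if female / total >= min_ratio:
        return "female"
    return "ambiguous"        # neither gender reaches the threshold


# Hypothetical counts, for illustration only.
TOY_COUNTS = {"diana": (1, 99), "adam": (99, 1), "robin": (50, 50)}
```

With this toy table, a name missing from the list comes back as "unknown" and an evenly split name as "ambiguous", which are the two cases where a list built for one country is likely to fall short.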
Diana

On 04/11/2010 09:51, Adam Kilgarriff wrote:
> Cool!
>
> So, what is it about 3?  (see
> http://labs.buradayiz.webfactional.com/gender/query/query?words=1+2+3+4+5+6+7+8+9)
>   You must have a theory
>
> adam


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora