forwarded query

Tue Mar 21 21:05:33 UTC 2000

Hi,

I received the following inquiry, which I'm forwarding to the list. Please
*do not reply to me or to the list*; reply directly to Moshe Koppel at:

koppel at netvision.net.il

Thanks,

Mary

-----------------------------------------------------------

My research group in the computer science department of Bar-Ilan
University has developed automated tools for finding differences in
writing styles between any two given corpora. We use only syntactic
and content-free features, and the whole process takes a couple of
minutes. We just ran tests on texts by male and female authors in the
British National Corpus, using a few hundred texts of each. Five-fold
cross-validation experiments yield categorization accuracy of over 80% on
unseen texts. Is this comparable with other studies? We are computer
scientists and are not quite up on the gender literature. Can you give
us some pointers to similar studies?
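
For readers unfamiliar with the evaluation setup, here is a minimal sketch
of five-fold cross-validation in Python. It is only an illustration: the
names k_fold_indices, cross_validate, train_fn and predict_fn are
hypothetical, and the classifier itself is left as a pluggable assumption
rather than the method actually used for the results above.

```python
import random

def k_fold_indices(n_items, k=5, seed=0):
    """Split range(n_items) into k shuffled, near-equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled list gives k disjoint folds.
    return [indices[i::k] for i in range(k)]

def cross_validate(items, labels, train_fn, predict_fn, k=5, seed=0):
    """Return mean accuracy over k held-out folds.

    train_fn(train_items, train_labels) -> model   (hypothetical hook)
    predict_fn(model, item) -> predicted label     (hypothetical hook)
    """
    folds = k_fold_indices(len(items), k, seed)
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(items)) if i not in held]
        # Train only on the texts outside the held-out fold.
        model = train_fn([items[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        # Score on the unseen fold.
        correct = sum(predict_fn(model, items[i]) == labels[i]
                      for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k
```

Each text is used for testing exactly once and is never seen by the model
that is tested on it, which is what "unseen texts" means here.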

The features we use are a list of several hundred function words
and several hundred common part-of-speech n-grams. We invoke no theory at
all about what features ought to be important. Rather we throw in
everything but the kitchen sink and rely on computational methods which use
the training data to rapidly winnow out irrelevant features. In one run on
over 1000 documents in the British National Corpus (using only function
words) we got the following lists of indicator words (importance indicated
in the left column):

WOMEN
17.749924  she's
14.829651  enough
11.595656  someone
10.584093  myself
9.253394   fact
8.505204   quite
8.324067   example
7.531402   get
7.501985   once
7.420944   still

MEN
-7.14599   while
-7.269493  few
-7.510454  possible
-7.865377  off
-8.227482  even
-8.726618  these
-8.910974  something
-10.653121 against
-10.9888   result
-12.280301 tell

I couldn't venture a guess as to why any of this should turn up.
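
For anyone who wants to experiment with signed word lists of this kind,
here is a rough Python sketch. The scoring rule (difference in mean
relative frequency between the two corpora) is an assumption chosen for
simplicity and is not the method that produced the numbers above; the
four-word FUNCTION_WORDS list and the function names are hypothetical
stand-ins for a list of several hundred function words.

```python
import re
from collections import Counter

# Tiny hypothetical subset; a real list would have several hundred words.
FUNCTION_WORDS = ["she's", "enough", "while", "these"]

def relative_frequencies(text, vocabulary=FUNCTION_WORDS):
    """Frequency of each vocabulary word, relative to text length."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in vocabulary]

def signed_scores(texts_a, texts_b, vocabulary=FUNCTION_WORDS):
    """Rank words by mean relative frequency in corpus A minus corpus B.

    Positive scores point to corpus A, negative scores to corpus B.
    This is an illustrative assumption, not the original scoring.
    """
    def mean_vector(texts):
        vectors = [relative_frequencies(t, vocabulary) for t in texts]
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    mean_a, mean_b = mean_vector(texts_a), mean_vector(texts_b)
    scores = {w: a - b for w, a, b in zip(vocabulary, mean_a, mean_b)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling signed_scores(corpus_a, corpus_b) returns (word, score) pairs
sorted from the strongest indicator of the first corpus to the strongest
indicator of the second.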

(If you're interested, we've got reams of data: Edward III was written by
Marlowe, the Brontes can be distinguished even at the level of a few
hundred words of text, the New York Times rarely uses the word
"yesterday", etc.)

We look forward to your comments.

Cheers,

Prof. Moshe Koppel
Dept. of CS
Bar-Ilan University
koppel at netvision.net.il