forwarded query
Mary Bucholtz
bucholtz at TAMU.EDU
Tue Mar 21 21:05:33 UTC 2000
Hi,
I received the following inquiry, which I'm forwarding to the list. Please
*do not reply to me or to the list*; reply directly to Moshe Koppel at:
koppel at netvision.net.il
Thanks,
Mary
-----------------------------------------------------------
My research group in the computer science department of Bar-Ilan
University has developed automated tools for finding differences in
writing styles between any two given corpora. We use only syntactical
and content-free features and the whole process takes a couple of
minutes. We just ran tests on male/female authors on the British
National Corpus using a few hundred texts of each. Five-fold
cross-validation experiments yield categorization accuracy of over 80% on
unseen texts. Is this comparable with other studies? We are computer
scientists and are not quite up on the gender literature. Can you give
us some pointers to similar studies?
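For readers unfamiliar with the evaluation mentioned above, five-fold cross-validation can be sketched roughly as follows. This is only an illustration: the function names (k_fold_splits, cross_validated_accuracy, train_fn, predict_fn) are hypothetical and are not taken from the tools described in this message, and a real experiment would normally shuffle the texts first.

```python
def k_fold_splits(n_items, k=5):
    """Yield (train_indices, test_indices) pairs for k roughly equal folds."""
    # Item i goes to fold i % k, so fold sizes differ by at most one.
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [j for idx, fold in enumerate(folds) if idx != held_out for j in fold]
        yield train, test


def cross_validated_accuracy(items, labels, train_fn, predict_fn, k=5):
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for train_idx, test_idx in k_fold_splits(len(items), k):
        # train_fn and predict_fn are placeholders for any learner.
        model = train_fn([items[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(predict_fn(model, items[i]) == labels[i] for i in test_idx)
    return correct / len(items)
```

Every text is used for testing exactly once and is never in the training set of the model that classifies it, which is what "unseen texts" means here.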
The features we use are a list of several hundred function words
and several hundred common part-of-speech n-grams. We invoke no theory at
all about what features ought to be important. Rather we throw in
everything but the kitchen sink and rely on computational methods which use
the training data to rapidly winnow out irrelevant features. In one run on
over 1000 documents in the British National Corpus (using only function
words) we got the following lists of indicator words (importance indicated
in the left column):
WOMEN
17.749924 she's
14.829651 enough
11.595656 someone
10.584093 myself
9.253394 fact
8.505204 quite
8.324067 example
7.531402 get
7.501985 once
7.420944 still
MEN
-7.14599 while
-7.269493 few
-7.510454 possible
-7.865377 off
-8.227482 even
-8.726618 these
-8.910974 something
-10.653121 against
-10.9888 result
-12.280301 tell
I couldn't venture a guess as to why any of this should turn up.
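As a rough illustration of how signed, ranked word lists like the ones above can arise from function-word frequencies, here is a minimal sketch. It is an assumption-laden stand-in, not the method that produced the numbers above: it scores each word by the difference in mean relative frequency between two corpora, and the short FUNCTION_WORDS list and all function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, tiny vocabulary; the message mentions several hundred words.
FUNCTION_WORDS = ["she's", "enough", "myself", "while", "few", "these", "against"]


def function_word_freqs(text, vocab=FUNCTION_WORDS):
    """Relative frequency of each vocabulary word in a single text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty texts
    return [counts[w] / total for w in vocab]


def signed_scores(texts_a, texts_b, vocab=FUNCTION_WORDS):
    """Rank words by mean frequency in corpus A minus mean frequency in corpus B."""
    def mean_vector(texts):
        vectors = [function_word_freqs(t, vocab) for t in texts]
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    mean_a, mean_b = mean_vector(texts_a), mean_vector(texts_b)
    scores = {w: a - b for w, a, b in zip(vocab, mean_a, mean_b)}
    # Positive scores lean toward corpus A, negative scores toward corpus B.
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

Calling signed_scores(texts_a, texts_b) returns (word, score) pairs with the strongest indicators of the first corpus at the top and of the second at the bottom.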
(If you're interested we've got reams of data: Edward III was written by
Marlowe, the Brontes can be distinguished even at the level of a few
hundred words of text, the New York Times rarely uses the word
"yesterday", etc.)
We look forward to your comments.
Cheers,
Prof. Moshe Koppel
Dept. of CS
Bar-Ilan University
koppel at netvision.net.il