[Corpora-List] For students: "CL in Action"

Fri Nov 5 08:31:10 UTC 2010

> I wondered that as well.

I think Matthew's guess is correct: our tokenizer is emoticon-aware to some degree, but it splits "<3" into two separate tokens, "<" and "3". It looks like the Twitter users whose names are guessed as female use "<3" much more often in our dataset.
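
As an illustration only (this is not the tokenizer used for the dataset), here is a minimal Python sketch of how a regex tokenizer can keep "<3" together by trying emoticon patterns before the generic ones. The pattern and the names TOKEN_RE and tokenize are assumptions made up for this example, and the emoticon list is deliberately tiny.

```python
import re

# Hypothetical pattern: alternatives are tried left to right, so the
# emoticon rules win over the generic word/punctuation rules.
TOKEN_RE = re.compile(
    r"""
    <3+                    # heart, possibly with repeated 3s
    | [:;=][-']?[()DPp]    # a few common Western emoticons
    | \w+                  # words and numbers
    | \S                   # any other single non-space character
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Return the list of tokens found in text, skipping whitespace."""
    return TOKEN_RE.findall(text)

def naive_tokenize(text):
    """Same tokenizer without the emoticon rules, for comparison."""
    return re.findall(r"\w+|\S", text)

print(tokenize("I <3 NLP :)"))
print(naive_tokenize("I <3 NLP"))
```

Without the emoticon rules, "<" and "3" come out as two tokens, which is the behavior described above.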

> (1) how good the gender guesser is (I didn't see any statistics on that,
> but I didn't search extensively).

We don't have any statistics so far because we don't have a gold standard. Twitter does not give away the sex of its users for understandable reasons. But simple sanity checks show that we are not totally wrong (http://bit.ly/9Aw0gs).

> (2) (which is related) - the proportion of American names in the twitter
> corpus (since I think the guesser used is based solely on American first
> names) - and this could have some impact. Even the differences between
> first name gender in the US and Britain are not insignificant.

I agree that this might be a problem. Another issue is that we use the names provided by the users themselves. There are lots of possible sources of bias.

> On a related note, has anyone done the reverse and used vocabulary
> selection to help identify the gender of the speaker, with any success?
> I'm sure people must have played with this idea.

In our case, a nearest-neighbor classifier run over vectors of the 30-50 most popular lemmas achieved an accuracy of 60% (for a million users). Note, however, that this is measured against our name-guessed labels, not against a gold standard.
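
For readers who want to try something similar, the following Python sketch shows the general shape of such a classifier. It is not the code behind the numbers above: the use of a single nearest neighbor, cosine similarity, raw lemma counts, and all names and toy data (VOCAB, TRAIN, vectorize, predict) are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical vocabulary; in practice this would be the 30-50 most
# popular lemmas in the corpus.
VOCAB = ["lemma_a", "lemma_b", "lemma_c"]

def vectorize(lemmas, vocabulary=VOCAB):
    """Map a user's lemmas to a count vector over the vocabulary."""
    counts = Counter(lemmas)
    return [counts[w] for w in vocabulary]

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict(vector, labeled):
    """Return the label of the most similar labeled vector (1-NN)."""
    best_vector, best_label = max(labeled, key=lambda item: cosine(vector, item[0]))
    return best_label

# Made-up training users with name-guessed (not gold-standard) labels.
TRAIN = [
    (vectorize(["lemma_a", "lemma_a", "lemma_b"]), "female"),
    (vectorize(["lemma_c", "lemma_c", "lemma_b"]), "male"),
]

print(predict(vectorize(["lemma_a", "lemma_b"]), TRAIN))
```

Any accuracy computed this way only says how well the classifier reproduces the name-based guesses, so it inherits whatever errors the name guesser makes.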

On Twitter especially, there may be very interesting signals that help with the identification. The mention of "Justin Bieber" or "Steve Jobs" alone can tell a lot about the gender of a user (http://bit.ly/bORvey).

Amaç

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora