[Corpora-List] For students: "CL in Action"
Alexander Yeh
asy at mitre.org
Fri Nov 5 23:38:26 UTC 2010
Amaç Herdağdelen wrote:
>> I wondered that as well.
>
> I think Matthew's guess is correct: Our tokenizer is emoticon-aware to some degree, but "<3" is tokenized as "<" and "3" separately. Looks like the Twitter users whose names are guessed as female use "<3" much more often in our dataset.
>
>> (1) how good the gender guesser is (I didn't see any statistics on that,
>> but I didn't search extensively).
>
> We don't have any statistics so far because we don't have a gold standard. Twitter does not give away the sex of its users for understandable reasons. But simple sanity checks show that we are not totally wrong (http://bit.ly/9Aw0gs).
I have seen some forum sites in German that give the gender of a user.
But these are probably self-reported and are about as (in)accurate as
other information.
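
On the "<3" tokenization point quoted above: a minimal sketch of how a regex tokenizer could keep "<3" together, assuming emoticon patterns are tried before the generic punctuation fallback. This is not the tokenizer used for the dataset discussed here; the names TOKEN_RE and tokenize and the small emoticon inventory are made up for illustration.

```python
import re

# Hypothetical token pattern. Alternatives are tried left to right, so the
# emoticon patterns must come before the single-symbol fallback; otherwise
# "<3" is split into "<" and "3".
TOKEN_RE = re.compile(
    r"""
    <3+                         # heart: <3, <33, ...
    | [:;=][-o^']?[)(DPp/\\|]   # a few Western emoticons: :) ;-) :P =D
    | \w+(?:'\w+)?              # words, optionally with an internal apostrophe
    | [^\w\s]                   # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the list of tokens found in text, skipping whitespace."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("I <3 NLP :)"))
    print(tokenize("3 < 4"))
```

One known limitation of this sketch: an unspaced comparison such as "x<3" would also be read as a heart, so a real tokenizer would likely need some context check.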