[Corpora-List] For students: "CL in Action"

Alexander Yeh asy at mitre.org
Fri Nov 5 23:38:26 UTC 2010


Amaç Herdağdelen wrote:
>> I wondered that as well.
>
> I think Matthew's guess is correct: Our tokenizer is emoticon-aware to some degree, but "<3" is tokenized as "<" and "3" separately. It looks like the Twitter users whose names are guessed as female use "<3" much more often in our dataset.
>
>> (1) how good the gender guesser is (I didn't see any statistics on that,
>> but I didn't search extensively).
>
> We don't have any statistics so far because we don't have a gold standard. Twitter does not give away the sex of its users for understandable reasons. But simple sanity checks show that we are not totally wrong (http://bit.ly/9Aw0gs).

I have seen some German-language forum sites that list a user's gender.
But that information is probably self-reported, and so about as
(in)accurate as other information users supply.
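
On the tokenizer point quoted above: here is a minimal sketch of how "<3" could be kept as a single token. It assumes a regex-based tokenizer, which is not necessarily how the tokenizer used for that dataset works. The pattern, the name TOKEN_RE, and the function tokenize() are all hypothetical and only cover a handful of emoticons.

```python
import re

# Hypothetical pattern, for illustration only. Alternatives are tried in
# order, so the emoticon branches win over the single-character fallback.
TOKEN_RE = re.compile(r"""
    <3+                    # heart emoticon: <3, <33, ...
  | [:;=][-']?[()DPp]      # a few Western emoticons such as :) ;-) :D
  | \w+(?:'\w+)?           # words, allowing one internal apostrophe
  | \S                     # fallback: any other single non-space character
""", re.VERBOSE)


def tokenize(text):
    """Split text into tokens, keeping the listed emoticons whole."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("I <3 you :)"))
```

Without the first branch, "<" would fall through to the single-character fallback and "3" would be matched as a word, which reproduces the split described in the quoted message.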



_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


