[Corpora-List] UGC Tokenizer
Gustavo Laboreiro
gustavo.laboreiro at gmail.com
Wed Jan 4 15:25:16 UTC 2012
We would like to announce the release of our tokenizer for User-Generated
Content (UGC).
The tokenizer is based on a text-classification approach, making it more robust
than simpler rule-based approaches. You can find it described in this article:
http://dl.acm.org/citation.cfm?id=1871853
The tokenizer itself can be downloaded from the following URL:
http://labs.sapo.pt/up/2011/11/12/sylvester-ugc-tokenizer/
It is written in Python, but we include a script that shows a simple way to
call it from Perl. Other languages can use similar approaches.
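The bundled Perl script is not reproduced here. As a rough sketch of one way
another language could drive the tokenizer, a small Python filter can read one
message per line on standard input and write one tokenized message per line on
standard output, so that the caller only needs to open a pipe. The function
name run_filter is hypothetical, and the return type of tokenize() is an
assumption: the code accepts either a string or a sequence of tokens.

```python
import sys


def run_filter(tokenize, infile=None, outfile=None):
    """Read one message per line and write one tokenized message per line.

    `tokenize` is any callable taking a message string. Its return type is
    an assumption here: a string is written as-is, anything else is treated
    as a sequence of tokens and joined with single spaces.
    """
    infile = sys.stdin if infile is None else infile
    outfile = sys.stdout if outfile is None else outfile
    for line in infile:
        message = line.rstrip("\n")
        result = tokenize(message)
        if not isinstance(result, str):
            result = " ".join(result)
        outfile.write(result + "\n")
        outfile.flush()  # keep a caller on the other end of the pipe in step


# Intended use inside a wrapper script (requires Sylvester to be installed):
#   from sylvester.tokenizer import Tokenizer
#   run_filter(Tokenizer().tokenize)
```

A calling program would start this script as a child process, write messages
to its input and read the tokenized lines back. Messages that contain
newlines would need to be escaped first.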
We expect that, with its simple interface and ready-made tools, it can be
easily integrated into your processing pipelines.
Here are two examples:
# Normal use
from sylvester.tokenizer import Tokenizer
t = Tokenizer()
tokenized_message = t.tokenize("original message")

# Processing many messages
from sylvester.tokenizer import Tokenizer
message_list = ["message 1", "message 2", "message 3"]
t = Tokenizer(workers=4)  # Quad-core machine
tokenized_message_list = t.tokenize_list(message_list)
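When the messages come from a file too large to load at once, one option is to
pass them to tokenize_list() in fixed-size batches. The helper below is only a
sketch: batches is a hypothetical name and is not part of the tokenizer, and
the batch size and file name in the usage comment are arbitrary.

```python
from itertools import islice


def batches(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Intended use (requires Sylvester to be installed):
#   from sylvester.tokenizer import Tokenizer
#   t = Tokenizer(workers=4)
#   with open("messages.txt") as f:
#       for chunk in batches((line.rstrip("\n") for line in f), 1000):
#           tokenized_chunk = t.tokenize_list(chunk)
```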
Our original focus was on Portuguese (the third most popular language on
Twitter). By providing your own examples, you can re-train it for different
languages or specific needs.
Comments, questions, suggestions or other feedback can be reported to
gustavo.laboreiro at gmail.com
--
Gustavo Laboreiro