[Corpora-List] UGC Tokenizer

Gustavo Laboreiro gustavo.laboreiro at gmail.com
Wed Jan 4 15:25:16 UTC 2012


We would like to announce that we have made available our tokenizer for
User-Generated Content (UGC).

The tokenizer is based on a text-classification approach, making it more robust 
than simpler rule-based approaches. It is described in this article:
http://dl.acm.org/citation.cfm?id=1871853

The tokenizer itself is available at the following URL:
http://labs.sapo.pt/up/2011/11/12/sylvester-ugc-tokenizer/

It is written in Python, but we include a script that shows a simple way to 
call it from Perl. Other languages can use similar approaches.
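
As an illustration of how another language could reach the tokenizer, here is a
minimal sketch of a line-oriented filter: it reads one message per line and
writes one tokenized message per line, so any language that can start a process
and use pipes could drive it. This is not necessarily how the included Perl
script works. The function name run_line_filter is hypothetical, and the sketch
assumes the tokenizing callable returns either a string or an iterable of token
strings.

```python
import sys


def run_line_filter(tokenize, in_stream=sys.stdin, out_stream=sys.stdout):
    """Read one message per line and write one tokenized message per line.

    `tokenize` is any callable taking a message string. Assumption: it
    returns either a string or an iterable of token strings.
    Returns the number of messages processed.
    """
    count = 0
    for line in in_stream:
        message = line.rstrip("\n")
        result = tokenize(message)
        if not isinstance(result, str):
            # Assumed case: an iterable of tokens, joined with single spaces.
            result = " ".join(result)
        out_stream.write(result + "\n")
        count += 1
    out_stream.flush()
    return count


# Hypothetical wiring, following the first example in this announcement:
#   from sylvester.tokenizer import Tokenizer
#   run_line_filter(Tokenizer().tokenize)
```

The calling program then only needs to write messages to the filter's standard
input and read the results back, one line per message.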

We expect that, with its simple interface and ready-made tools, it can be 
easily integrated into your processing pipelines.

Here are two examples:

# Normal use

from sylvester.tokenizer import Tokenizer
t = Tokenizer()
tokenized_message = t.tokenize( "original message" )


# Processing many messages

from sylvester.tokenizer import Tokenizer
message_list = [ "message 1" , "message 2" , "message 3" ]
t = Tokenizer( workers=4 )     # Quad-core machine
tokenized_message_list = t.tokenize_list( message_list )
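
For message collections too large to hold in memory, a pipeline stage could feed
tokenize_list in fixed-size batches. The sketch below assumes only that the
tokenizer object offers the tokenize_list method shown above and that it returns
one result per input message, in the same order. The helper name
tokenize_in_batches and the batch_size parameter are hypothetical and not part
of the package.

```python
def tokenize_in_batches(tokenizer, messages, batch_size=1000):
    """Yield tokenized messages, calling tokenizer.tokenize_list per batch.

    `messages` may be any iterable (a list, a file object, a generator).
    Assumption: tokenize_list returns one result per input, in input order.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for message in messages:
        batch.append(message)
        if len(batch) == batch_size:
            yield from tokenizer.tokenize_list(batch)
            batch = []
    if batch:
        # Flush the final, possibly shorter, batch.
        yield from tokenizer.tokenize_list(batch)


# Hypothetical usage with the Tokenizer from the examples above:
#   t = Tokenizer(workers=4)
#   for tokenized in tokenize_in_batches(t, open("messages.txt")):
#       ...
```

Because the helper is a generator, results become available as soon as each
batch is finished, and only one batch is kept in memory at a time.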


Our original focus was on Portuguese (the third most popular language on 
Twitter). By providing your own examples, you can re-train it for different 
languages or specific needs.

Comments, questions, suggestions or other feedback can be sent to
gustavo.laboreiro at gmail.com

-- 
Gustavo Laboreiro


