[Corpora-List] Invalid UTF8 character encountered! with Treetagger french parameter file

Samir Bilal samirbilal2 at yahoo.fr
Mon Dec 27 21:43:18 UTC 2010


Hi,

I am testing the current POS taggers for the French language. With TreeTagger I 
get an error in some cases.
For a sentence containing an accented character (for example: "l' étiqueteur se bloque"), I 
encounter this error:

Invalid UTF8 character encountered! 
The error is triggered by the accented character é.
If the sentence contains no accented characters, the tagger works fine.

I use the French parameter file at  
ftp://ftp.ims.uni-stuttgart.de/pub/corpora/french-par-linux-3.2-utf8.bin.gz .
My OS is Windows XP.

Can anybody help me?

Regards
Samir
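
One possible cause, offered only as an assumption: the input text may be saved in a Windows code page such as cp1252 rather than UTF-8. In cp1252 the character é is the single byte 0xE9, which is not valid UTF-8 on its own. The sketch below checks whether a byte string decodes as UTF-8 and, if not, re-encodes it from a fallback encoding. The function names, the file names, and the cp1252 default are hypothetical and not part of TreeTagger.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return raw unchanged if it is valid UTF-8, else re-encode it from fallback."""
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Assumption: the text was written in the fallback encoding.
        return raw.decode(fallback).encode("utf-8")


def convert_file(src: str, dst: str, fallback: str = "cp1252") -> None:
    """Copy src to dst, converting the content to UTF-8 if necessary."""
    with open(src, "rb") as f:
        raw = f.read()
    with open(dst, "wb") as f:
        f.write(to_utf8(raw, fallback))
```

For example, convert_file("input.txt", "input.utf8.txt") would write a UTF-8 copy that can then be passed to the tagger.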






________________________________
From: DJamé Seddah <djame.seddah at free.fr>
To: Samir Bilal <samirbilal2 at yahoo.fr>
Sent: Sun 26 December 2010, 00:47:27
Subject: Re: Re : [Corpora-List] Looking for free french POS tagger.

Hi, in that case I'd recommend using 
Morfette, as it provides Windows binaries and pretrained models.

Input format (Unix line separators):
one word per line,
one blank line to separate sentences,
and everything in UTF-8.
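
The format above can be produced from Python along these lines. This is a minimal sketch that assumes the text is already tokenized into lists of words; the function names are hypothetical, and whether a final newline after the last sentence matters to Morfette is not confirmed here.

```python
def format_sentences(sentences):
    """One word per line, one blank line between sentences."""
    return "\n\n".join("\n".join(tokens) for tokens in sentences) + "\n"


def write_tagger_input(sentences, path):
    # newline="\n" keeps Unix line separators even on Windows;
    # encoding="utf-8" matches the required input encoding.
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        out.write(format_sentences(sentences))
```

For example, write_tagger_input([["l'", "étiqueteur", "se", "bloque", "."]], "input") writes one sentence of five tokens.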

Use this command:
c:\wherever\morfette predict MODELNAME < input > output.tagged
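
To call the same command from a Python program, the shell redirections can be replaced by file handles passed to subprocess. This is only a sketch: the helper name is hypothetical, and the executable path and MODELNAME are placeholders to be replaced with real values.

```python
import subprocess


def run_with_redirects(command, input_path, output_path):
    """Run command with stdin read from input_path and stdout written to output_path."""
    with open(input_path, "rb") as fin, open(output_path, "wb") as fout:
        # check=True raises CalledProcessError if the command exits with an error.
        result = subprocess.run(command, stdin=fin, stdout=fout, check=True)
    return result.returncode
```

A call might look like run_with_redirects([r"c:\wherever\morfette", "predict", "MODELNAME"], "input", "output.tagged"), which is the equivalent of the command line above.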


Djamé



On 25 Dec 2010 at 23:43, Samir Bilal wrote:

> Hi,
> 
> Thank you very much. My operating system is Windows XP. I have not yet managed to run 
> MeLT on it. Could you please help me?
> It would be wonderful if I could also use it from a Python program.
> 
> 
> Many thanks
> Samir
> 
> 
> 
> 
> ________________________________
> From: DJamé Seddah <djame.seddah at free.fr>
> To: corpora at uib.no
> Sent: Sat 25 December 2010, 22:54:42
> Subject: Re: [Corpora-List] Looking for free french POS tagger.
> 
> Hi,
> There are also two state-of-the-art data-driven POS taggers available:
> 
> MeLT
> https://gforge.inria.fr/frs/download.php/27240/melt-0.6.tar.gz
> and
> Morfette (which also provides a data-driven lemmatizer)
> http://sites.google.com/site/morfetteweb/
> 
> Both provide models trained on the French Treebank: with the CC tagset, accuracy is 
> around 97.6-98% (this is the one to use for statistical parsing); with the richer 
> max tagset, it is around 92-94%.
> 
> 
> Best,
> 
> Djamé
> 
> 
> 
> On 25 Dec 2010 at 19:35, Samir Bilal wrote:
> 
>> Hi everybody,
>> 
>> I am looking for a free French POS tagger.
>> 
>> Thank you
>> Samir
>> 
>> 
>> _______________________________________________
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
> 
> 
> 
> 
> 


      