[Corpora-List] Invalid UTF8 character encountered! with Treetagger french parameter file

Alberto Simões albie at alfarrabio.di.uminho.pt
Mon Dec 27 21:53:12 UTC 2010


A first suggestion: double-check that your input file really is encoded 
in UTF-8.

Try opening the text file in an editor such as Notepad++ and see which 
encoding it detects.
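
If you prefer to check from a script, here is a minimal Python sketch of 
the same test. It assumes that a file which is not valid UTF-8 was saved 
in Windows-1252 (cp1252), a common default on Western European Windows 
systems; that is only a guess about your file. The function names are 
made up for this example and are not part of TreeTagger.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def reencode_to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode from source_encoding (an assumption) and re-encode as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def fix_file(src: str, dst: str, source_encoding: str = "cp1252") -> bool:
    """Write a UTF-8 copy of src to dst.

    Returns True if the file had to be converted, False if it was
    already valid UTF-8 and was copied unchanged.
    """
    data = Path(src).read_bytes()
    if is_valid_utf8(data):
        Path(dst).write_bytes(data)
        return False
    Path(dst).write_bytes(reencode_to_utf8(data, source_encoding))
    return True
```

For example, fix_file("input.txt", "input-utf8.txt") would write a copy 
that you can then pass to the tagger. If the accents look wrong in the 
result, the original encoding was probably something other than cp1252.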

cheers

On 27/12/2010 21:43, Samir Bilal wrote:
> Hi,
>
> I am testing the current POS taggers for the French language. With
> TreeTagger I get an error in some cases.
> For a sentence containing an accented character (for example:
> "l' étiqueteur se bloque."), I encounter this error:
>
> Invalid UTF8 character encountered!
> because of the accented character é.
> If the sentence contains no accented characters, the tagger works fine.
>
> I am using the French parameter file at
> ftp://ftp.ims.uni-stuttgart.de/pub/corpora/french-par-linux-3.2-utf8.bin.gz
> .
> My OS is Windows XP.
>
> Can anybody help me?
>
> Regards
> Samir
>
>
>
> ------------------------------------------------------------------------
> *From:* DJamé Seddah <djame.seddah at free.fr>
> *To:* Samir Bilal <samirbilal2 at yahoo.fr>
> *Sent:* Sun 26 December 2010, 0h 47min 27s
> *Subject:* Re: Re: [Corpora-List] Looking for free french POS tagger.
>
> Hi, in that case I'd recommend using
> Morfette, as it provides Windows binaries and pretrained models.
>
> Input format (Unix line separators):
> one word per line,
> one blank line to separate sentences,
> and all in UTF-8.
>
> Use this command:
> c:\wherever\morfette predict MODELNAME < input > output.tagged
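
The input format described above (one word per line, a blank line 
between sentences, Unix line separators, UTF-8) can be produced with a 
short Python sketch like the one below. It assumes the sentences are 
already split and tokenized with spaces between tokens; real French 
tokenization is more involved. The function names are hypothetical and 
not part of Morfette, and whether a trailing blank line is expected 
after the last sentence is not stated in the thread.

```python
def to_one_word_per_line(sentences):
    """Format sentences as one token per line, blank line between sentences.

    Tokens are obtained by splitting on whitespace (an assumption);
    empty sentences are skipped.
    """
    blocks = []
    for sentence in sentences:
        tokens = sentence.split()
        if tokens:
            blocks.append("\n".join(tokens))
    return "\n\n".join(blocks) + "\n"


def write_tagger_input(sentences, path):
    """Write the formatted text as UTF-8 with Unix line endings.

    newline="\n" stops Python from translating "\n" into "\r\n"
    when running on Windows.
    """
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(to_one_word_per_line(sentences))
```

For example, write_tagger_input(["l' étiqueteur se bloque ."], "input") 
would create a file named input that can be redirected into the command 
above.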
>
>
> Djamé
>
>
>
> On 25 Dec 2010 at 23:43, Samir Bilal wrote:
>
>  > Hi,
>  >
>  > Thank you very much. My operating system is Windows XP. I have not
>  > managed to run MeLT on it yet. Could you please help me?
>  > It would be wonderful if I could also use it from a Python program.
>  >
>  >
>  > Many thanks
>  > Samir
>  >
>  >
>  >
>  >
>  > ________________________________
>  > From: DJamé Seddah <djame.seddah at free.fr>
>  > To: corpora at uib.no
>  > Sent: Sat 25 December 2010, 22h 54min 42s
>  > Subject: Re: [Corpora-List] Looking for free french POS tagger.
>  >
>  > Hi,
>  > There are also two state-of-the-art data-driven POS taggers available:
>  >
>  > MeLT
>  > https://gforge.inria.fr/frs/download.php/27240/melt-0.6.tar.gz
>  > and
>  > Morfette (which also provides a data-driven lemmatizer)
>  > http://sites.google.com/site/morfetteweb/
>  >
>  > Both provide models trained on the French Treebank: one for the CC
>  > tagset (around 97.6-98% accuracy; the one to use for statistical
>  > parsing) and one for a richer tagset (tagset max, around 92-94%).
>  >
>  >
>  > Best,
>  >
>  > Djamé
>  >
>  >
>  >
>  > On 25 Dec 2010 at 19:35, Samir Bilal wrote:
>  >
>  >> Hi everybody,
>  >>
>  >> I am looking for a free French POS tagger.
>  >>
>  >> Thank you
>  >> Samir
>  >>
>  >>
>  >> _______________________________________________
>  >> Corpora mailing list
>  >> Corpora at uib.no
>  >> http://mailman.uib.no/listinfo/corpora

-- 
Alberto Simões
