[Corpora-List] Wordgram generator

Anil Singh anil.phdcl at gmail.com
Thu Mar 13 00:31:04 UTC 2008


If you don't need smoothing, we have a program for generating n-grams as
part of a system called Sanchay. It can work on UTF-8 text.
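
For anyone without Sanchay to hand, here is a rough Python sketch of
unsmoothed word n-gram counting over UTF-8 text. It is not Sanchay's code,
nor the algorithm text2wngram uses. The function names are hypothetical, and
it assumes whitespace-separated tokens and n-grams that do not cross line
boundaries.

```python
import io
from collections import Counter


def word_ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_ngrams(lines, n):
    """Count raw (unsmoothed) word n-grams, one line at a time.

    Assumption: tokens are separated by whitespace and n-grams do not
    span line breaks.
    """
    counts = Counter()
    for line in lines:
        counts.update(word_ngrams(line.split(), n))
    return counts


def count_ngrams_in_file(path, n):
    """Decode the file as UTF-8 explicitly, then count its n-grams."""
    with io.open(path, encoding="utf-8") as handle:
        return count_ngrams(handle, n)
```

Calling count_ngrams_in_file("corpus.txt", 3) (the file name is only a
placeholder) would return a Counter keyed by word triples. Decoding
explicitly as UTF-8 keeps the tokens as Unicode strings rather than raw
bytes.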

- Anil Kumar Singh

On Tue, Mar 11, 2008 at 4:34 PM, Paul Johnston <
paul.a.johnston at manchester.ac.uk> wrote:

> Can anyone recommend a wordgram generator similar to text2wngram in the
> CMU-Toolkit which can handle Unicode-encoded texts, preferably UTF-8 or
> UCS-2?
>
> I've been using the CMU-Toolkit successfully on English text files,
> especially from the BNC, but seem to have problems when using a UTF-8 file.
>
>
>
> Error reading temp file count /usr/tmp/text2wngram.tmp.hb-0021205.4217.1
>
>
>
> It seems to have problems reading the tmp files (see above); permissions
> are fine and it works with ASCII texts.
>
>
>
> I've tried this on a couple of Linux systems (Fedora and SUSE) with clean
> builds and in both cases text2wfreq works fine but text2wngram does not.
>
> Any suggestions?
>
>
>
> Cheers Paul
>
> Paul Johnston
> Humanities Development
> Room 2.12
> Bridgeford Building
> Manchester University
> 0161 275 1396
>
> Programmers are in a race with the Universe to create bigger and better
> idiot-proof programs, while the Universe is trying to create bigger and
> better idiots. So far the Universe is winning.
>
> Rich Cook
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>