[Corpora-List] Wordgram generator

Martin Reynaert reynaert at uvt.nl
Fri Mar 14 15:43:08 UTC 2008


Hi Paul,

I have run into the same problems with text2wngram. The fault lies not 
with the build of Linux, nor really with the tool.

In my case, the problem was with control characters and invisible 
characters in my text files. On at least one occasion, the text had null 
characters, i.e. characters with the actual code point value 0. These are 
'nothing', but still there. OCR software, for example, may produce them. 
A string containing one looks exactly the same as a string without it, 
but the two will not match in a regexp.
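
A quick way to check whether a file is affected is to count the 
offending bytes. What follows is only a sketch, assuming ASCII or UTF-8 
input and a POSIX shell with 'tr' and 'wc'; the function names count_nul 
and count_ctrl are made up for this illustration.

```shell
# Count NUL bytes in the file given as $1 (0 means none were found).
# tr -cd deletes everything EXCEPT the listed bytes, wc -c counts the rest.
count_nul() {
    tr -cd '\000' < "$1" | wc -c
}

# Count the other C0 control bytes and DEL, ignoring tab (\011),
# newline (\012) and carriage return (\015), which are legitimate in text.
count_ctrl() {
    tr -cd '\001-\010\013\014\016-\037\177' < "$1" | wc -c
}
```

Calling, say, count_nul mytext.txt (a hypothetical file name) prints the 
number of null bytes; some versions of 'wc' pad the number with leading 
spaces.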

The thing to do is first to normalize your text to remove such 
characters. You are better off without them anyway. Think of 
text2wngram's lack of robustness towards them as a useful diagnostic 
feature ;0)

On Linux, you can use the text utility 'tr' to remove these efficiently.
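
A minimal sketch of such a normalization with 'tr', assuming ASCII or 
UTF-8 input (the function name strip_ctrl is hypothetical). The bytes it 
deletes never occur inside a UTF-8 multi-byte sequence, so accented and 
non-Latin characters pass through untouched. UCS-2 text is different: 
zero bytes are part of its encoding, so it would need converting to UTF-8 
first, e.g. with 'iconv'. Invisible characters that are themselves 
multi-byte in UTF-8, such as a zero-width space, are not covered by this 
sketch.

```shell
# Delete NUL, the other C0 control bytes and DEL from standard input.
# Tab (\011), newline (\012) and carriage return (\015) are kept.
strip_ctrl() {
    tr -d '\000-\010\013\014\016-\037\177'
}

# Usage (file names are hypothetical):
#   strip_ctrl < raw.txt > clean.txt
```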

In an earlier post I described (for the benefit of Windows users) 
how to build word n-gram lists without a tool such as text2wngram. On 
Linux, you can quickly do the same using 'tr', 'cat', 'paste' and finally 
'sort' and 'uniq -c' ;0) This is great fun once you see how easy it 
is ;0) The post was: 010398 07/09/26 [Corpora-List] How to?: POS n-grams.
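
One common way to do this for bigrams is sketched below. It is not 
necessarily the recipe from the earlier post: it assumes 
whitespace-separated tokens in an already normalized file, uses 'tail' 
and 'sed' in addition to the tools named above, and the function name 
bigrams is made up for the illustration.

```shell
# Print bigram counts, most frequent first, for the text file given as $1.
bigrams() {
    words=$(mktemp) || return 1
    # One token per line: turn every run of whitespace into a single
    # newline, then drop any empty line left by leading whitespace.
    tr -s '[:space:]' '\n' < "$1" | sed '/^$/d' > "$words"
    # Pair each token with its successor: paste the token list next to a
    # copy of itself shifted up by one line. The last line has no
    # successor, so it is removed with sed '$d'.
    tail -n +2 "$words" | paste -d ' ' "$words" - | sed '$d' |
        sort | uniq -c | sort -rn
    rm -f "$words"
}
```

Trigrams follow the same pattern with a third column shifted by two 
lines. Tokenizing on whitespace rather than on 'A-Za-z' keeps UTF-8 words 
intact, at the price of leaving punctuation attached to the tokens.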

Hope this helps,

Martin Reynaert
Induction of Linguistic Knowledge
Tilburg University
The Netherlands

Paul Johnston wrote:
> Can anyone recommend a wordgram generator similar to text2wngram in the 
> CMU-Toolkit which can handle Unicode encoded texts, preferably utf-8 or 
> UCS-2.
> 
> I’ve been using the CMU-Toolkit successfully on English text files 
> especially from the BNC but seem to have problems when using a UTF-8 file.
> 
> Error reading temp file count /usr/tmp/text2wngram.tmp.hb-0021205.4217.1
> 
> It seems to have problems reading the tmp files (see above); permissions 
> are fine and it works with ASCII texts.
> 
> I’ve tried this on a couple of Linux systems (Fedora and SUSE) with 
> clean builds and in both cases text2wfreq works fine but text2wngram 
> does not.
> 
> Any suggestions?
> 
> Cheers Paul
> 
> Paul Johnston
> Humanities Development
> Room 2.12
> Bridgeford Building
> Manchester University
> 0161 275 1396
> 
> Programmers are in a race with the Universe to create bigger and better 
> idiot-proof programs, while the Universe is trying to create bigger and 
> better idiots. So far the Universe is winning.
> Rich Cook
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

