If Java is not required, the well-known SRI toolkit (SRILM) is well suited to this task and to further processing (n-gram LM estimation, back-off, interpolation, ...):
http://www.speech.sri.com/projects/srilm/

Regards,
--
Alexandre Allauzen
Univ Paris XI, LIMSI-CNRS
Tel : 01.69.85.80.64 (80.88)
Bur : 114 LIMSI Bat. 508
allauzen at limsi.fr