[Corpora-List] N-gram extraction: Found it!

andrius at ccl.bham.ac.uk andrius at ccl.bham.ac.uk
Wed Aug 28 14:21:16 UTC 2002


Dear list members,

Thank you for all your suggestions and useful advice. I've collected quite a
lot of useful information about n-gram extraction, and if I have time I
will try to summarize it.
However, I have to admit that all this noise was due to one crucial
mistake, which I had overlooked. Our corpus was special in yet another
way: I had removed the line breaks from it, which means the Perl script was
dealing with lines of enormous size.
People who know even a little Perl will understand why it would take ages
to process such a corpus, even with the best-written script.
I realized this when I tried Constantin Oras' simple script and could see
the rate at which the results were produced.
As I mentioned earlier, in such cases it would be useful to see some kind
of intermediate results, which I did not get with Ted Pedersen's script.
Sorry about all this confusion. I've greatly benefited from it though.
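
For anyone who runs into the same problem, the two points above (not
depending on line breaks, and printing intermediate results) can be sketched
roughly as follows. This is only a minimal illustration in Python, not a
reconstruction of either of the Perl scripts mentioned; the function names
iter_tokens and count_ngrams, the chunk size, the whitespace tokenisation and
the reporting interval are all assumptions made for the example.

```python
import sys
from collections import Counter, deque


def iter_tokens(stream, chunk_size=1 << 16):
    """Yield whitespace-separated tokens from a text stream.

    The stream is read in fixed-size chunks rather than line by line,
    so a corpus with no line breaks is never held as one enormous line.
    """
    leftover = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        chunk = leftover + chunk
        parts = chunk.split()
        # If the chunk ends in the middle of a token, hold that piece back
        # and prepend it to the next chunk.
        if parts and not chunk[-1].isspace():
            leftover = parts.pop()
        else:
            leftover = ""
        yield from parts
    if leftover:
        yield leftover


def count_ngrams(stream, n=2, report_every=1_000_000, log=sys.stderr):
    """Count n-grams over the token stream, reporting progress as it goes.

    report_every is the (hypothetical) number of tokens between progress
    messages; set it to 0 to switch reporting off.
    """
    window = deque(maxlen=n)  # sliding window over the last n tokens
    counts = Counter()
    for i, token in enumerate(iter_tokens(stream), start=1):
        window.append(token)
        if len(window) == n:
            counts[tuple(window)] += 1
        if report_every and i % report_every == 0:
            print(f"{i} tokens read, {len(counts)} distinct {n}-grams",
                  file=log)
    return counts
```

Called as count_ngrams(open("corpus.txt", encoding="utf-8"), n=3), where
corpus.txt is a placeholder file name, this would print a progress line to
standard error every million tokens, so the processing rate is visible long
before the final counts are ready.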

Sincerely,
Andrius Utka
Research Assistant
Birmingham University


