[Corpora-List] N-gram extraction: Found it!

Ted Pedersen ted_pedersen at hotmail.com
Wed Aug 28 15:08:41 UTC 2002


Hi Andrius,

I'm glad to hear you isolated the problem. I was just running
some of my own experiments with comparably sized data and
was a little perplexed (happily so perhaps) as to why mine was
running more quickly. But you're absolutely right about the
negative impact of long lines on Perl. Your suggestion of a
"progress meter" for NSP makes good sense, and we'll certainly
incorporate that. It also seems that an "overly long line
detector" would be a good safety feature.
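
To make those two ideas concrete, here is a minimal sketch, written in
Python rather than Perl and not taken from NSP. The function name, the
10,000-character threshold, the reporting interval, and the choice to count
bigrams only within a line are all assumptions made for illustration.

```python
import sys
from collections import Counter

MAX_LINE_CHARS = 10_000   # assumed threshold for an "overly long" line
PROGRESS_EVERY = 100_000  # assumed reporting interval, in lines


def count_bigrams(lines, max_line_chars=MAX_LINE_CHARS,
                  progress_every=PROGRESS_EVERY, log=sys.stderr):
    """Count adjacent word pairs within each line (hypothetical helper).

    Returns the bigram counts and the 1-based numbers of the lines that
    exceeded max_line_chars.
    """
    counts = Counter()
    long_lines = []
    for lineno, line in enumerate(lines, start=1):
        # Overly long line detector: warn, but keep processing.
        if len(line) > max_line_chars:
            long_lines.append(lineno)
            print(f"warning: line {lineno} has {len(line)} characters",
                  file=log)
        tokens = line.split()
        counts.update(zip(tokens, tokens[1:]))
        # Progress meter: report at a fixed interval of input lines.
        if lineno % progress_every == 0:
            print(f"processed {lineno} lines, "
                  f"{len(counts)} distinct bigrams", file=log)
    return counts, long_lines
```

Any iterable of strings works as input, for example an open file handle
such as open("corpus.txt", encoding="utf-8") (a hypothetical file name). A
corpus with its line breaks removed arrives as a single line, so the
warning would appear as soon as that line has been read.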

BTW, there are some rather nice tips from Ken Church about
n-gram counting of very large files to be found in the
archives of this list. Check out this thread on the
good/bad of frequency lists...

http://www.hit.uib.no/corpora/1995-4/0076.html

I'm sure the papers mentioned are more complete sources of
info, but it's sometimes rather fun to see the ebb and flow
of these previous discussions.

Best of luck,
Ted

>Dear list members,
>
>Thank you for all your suggestions and useful advice. I've collected quite
>a lot of useful information about n-gram extraction, and if I have time I
>will try to summarize it.
>However, I have to admit that all this noise was due to one crucial
>mistake, which I had overlooked. Our corpus was special in yet another
>way: I had removed the line breaks from it, which means the Perl script
>was dealing with lines of enormous size.
>Anyone who knows even a little Perl will understand why it would take
>ages to process such a corpus, even with the best-written script.
>I realized that when I tried Constantin Oras' simple script and could
>see the rate at which the results were produced.
>As I mentioned earlier, in such cases it would be useful to see some kind
>of intermediate results, which I did not have with Ted Pedersen's script.
>Sorry about all this confusion. I've greatly benefited from it though.
>
>Sincerely,
>Andrius Utka
>Research Assistant
>Birmingham University

--
Ted Pedersen
http://www.umn.edu/~tpederse

