[Corpora-List] N-gram string extraction

andrius at ccl.bham.ac.uk
Wed Aug 28 08:57:43 UTC 2002


Hello Ted,

Thank you for your reply. I really like your software; that's why I've
chosen it. It's very flexible, and I don't think there is anything wrong
with it, but I just thought there might be quicker ways. It's now in its
seventh day of running.

  PID TTY      STAT   TIME  MAJFL   TRS   DRS  RSS %MEM COMMAND
   7564 pts/0    SW     0:00    590   473  1762    0  0.0 [bash]
   7837 pts/0    R    10690:08 31556  658 10317 7624  2.9 [perl]
  22654 pts/4    SW     0:00    627   473  1750    0  0.0 [bash]
  22713 pts/4    SW     0:06   1609   491  2732    0  0.0 [mutt]
  23249 pts/4    SW     0:02   1048   755  2412    0  0.0 [editor]
  23771 pts/5    S      0:01    792   473  1758 1264  0.4 -bash
  23948 pts/5    R      0:00    358    55  2740  976  0.3 ps v

Well, that's the whole story. We want to extract statistically
significant n-gram strings of characters. We decided to ignore all
punctuation marks except full stops and spaces, so I stripped the rest
off. The corpus is 14 million words, which is 64,812,293 characters in
153 files.
Then, since your software is designed for words rather than characters,
we thought we would insert spaces between the letters, so the text is of
the form: c h a r a c t e r s a r e t r i c k y... As I rethought it
afterwards, that wasn't necessary, as you can define tokens in
token.txt as single characters with /\w/.
But for this long run I used /\w+/, which means one or more characters,
and which should still be valid for our corpus. Right? And it is hardly
a very complicated regexp, is it?
I didn't want full stops as tokens, but rather as separators, so I
didn't specify any regexp for full stops. I tried it on several files to
check that the globbing was working, and it was fine.
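(For concreteness, the preprocessing amounts to something like the
following Perl sketch - this is my own illustrative filter, not part of
NSP; it reads STDIN and writes STDOUT:)

    #!/usr/bin/perl -w
    # Illustrative preprocessing: keep only word characters, full stops
    # and spaces, then put a space after every character so that each
    # character looks like a "word" to count.pl.
    use strict;

    while (my $line = <STDIN>) {
        chomp $line;
        $line =~ s/[^\w. ]//g;   # strip punctuation except full stops/spaces
        $line =~ s/(.)/$1 /g;    # insert a space after every character
        print "$line\n";
    }

And the token.txt for single-character tokens would then just contain
the line:

    /\w/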
So, on the command line I'm running (this is an exact copy from the command line):
> perl ~/bin/nsp-v0.51/count.pl --token token.txt output.txt *.new
on one machine and:
> perl ~/bin/nsp-v0.51/count.pl --token token.txt --ngram 3 output.txt *.new
on the other.

As I said, it has not produced any results yet. In cases like this it
would be very helpful to have some sort of indication of "where we are",
as right now we're wondering whether the program is doing one character
per second or one per hour... There is surely a way to check, but not a
very straightforward one, I guess.
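Just to illustrate the kind of thing I mean, a counter printed to STDERR
every so often would already help, along the lines of the sketch below
(next_token() and process() are hypothetical stand-ins for whatever the
package does internally, not actual NSP code):

    # Illustrative progress hook, not actual NSP code.
    my $tokens = 0;
    while (defined(my $token = next_token())) {  # hypothetical reader
        process($token);                         # hypothetical handler
        print STDERR "processed $tokens tokens\n"
            unless ++$tokens % 1_000_000;
    }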
Well, it might be some mistake of mine after all, but then I would
really like to be shown where.


Thank you,
Andrius



> Hi Andrius,
>
> We are always happy to hear from users of BSP/NSP. In fact, we
> nearly beg folks to contact us in our READMEs, etc.
> Perhaps you could send me some additional details of what you
> are trying to do, and how you have done it thus far?
> I'm at : tpederse at umn.edu
>
> One newly added power of NSP is that it allows the user
> to define tokens using regular expressions. So you can say that
> tokens are 2-word sequences that start with the letter 'a'. Or
> they can be two-character sequences, or they can be single
> characters, etc. They can be whatever you want them to be, really.
> However, a poorly crafted or very complex regular expression
> can really lead to problems with performance. So the first thing
> I would look at is how you are defining your tokens - and I'd
> be happy to do this - you just need to contact me.
>
> For anyone on the list who doesn't know where to find NSP or
> what it is, here it is:
>
> http://www.d.umn.edu/~tpederse/nsp.html
>
> Cordially,
> Ted Pedersen
>
> >Dear list members,
> >
> >I am currently working on the extraction of statistically significant
> >n-gram (1<n<6) strings of alphanumeric characters from a corpus of 100
> >million characters, and I intend to apply different significance tests
> >(MI, t-score, log-likelihood, etc.) to these strings. I'm testing Ted
> >Pedersen's N-gram Statistics Package, which seems able to accomplish
> >these tasks; however, it hasn't produced any results after one week of
> >running.
> >I have a couple of queries regarding n-gram extraction:
> >1. I'd like to ask whether members of the list are aware of similar
> >software capable of accomplishing the above-mentioned tasks reliably
> >and efficiently.
> >2. And a statistical question. As I need to compute association scores
> >for trigrams, tetragrams, and pentagrams as well, I plan to split them
> >into bigrams consisting of a string of words plus one word, i.e.
> >[n-1]+[1], and compute association scores for these.
> >Does anyone know whether this is the right thing to do from a
> >statistical point of view?
> >
> >Thank you,
> >Andrius Utka
> >
> >Research Assistant
> >Centre for Corpus Linguistics
> >University of Birmingham
>
>
>
>
> --
> Ted Pedersen
> http://www.umn.edu/~tpederse
>
>

--
Andrius Utka			Centre for Corpus Linguistics
mailto:andrius at ccl.bham.ac.uk	Department of English
Tel:    +44 (0)121 414 8135	Birmingham University
Fax:    +44 (0)121 414 6053	Birmingham B15 2TT


