[Corpora-List] How to?: POS n-grams

Wed Sep 26 13:31:25 UTC 2007

Dear Tom,

I assume you do not use the Unix text utilities, so I will explain 'how 
to' with the tools you probably do have.

1/ Put every word (POS-tag, in your case ;0) ) on its own line, i.e. 
replace every space with a newline character.

Example: the corpus `The cat ran away.' becomes:

The
cat
ran
away
.

2/ Convert this column into an MS Word table.
3/ Copy the column side by side as many times as the n-gram size 
requires, so that n-grams of size n give n identical columns:

Example (for 3-grams):

The     The     The
cat     cat     cat
ran     ran     ran
away    away    away
.       .       .

4/ Insert empty cells at the top of each column except the last. The 
number of empty cells required is n-1 for the first column, n-2 for the 
second, etc., so the last column gets none.

Example: ('EC' represents an empty cell)

EC      EC      The
EC      The     cat
The     cat     ran
cat     ran     away
ran     away    .
away    .

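5/ Convert the table back to text. Done! Each row with all n cells 
filled is one n-gram; rows that still contain an empty cell can be 
dropped if you only want complete n-grams.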
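
For anyone who does have a scripting language to hand, the same 
shifted-column idea can be sketched in a few lines of Python. This is 
only a minimal sketch, not a replacement for the steps above: the names 
`shifted_columns` and `ngrams` are my own, `None` stands in for an empty 
cell ('EC'), and the input is assumed to be already split into one token 
(or POS-tag) per item, with punctuation as its own token.

```python
def shifted_columns(tokens, n):
    """Steps 3 and 4: n copies of the column, each pushed down by empty cells.

    Column 0 gets n-1 empty cells on top, column 1 gets n-2, and so on;
    the last column gets none. None plays the role of an empty cell.
    """
    return [[None] * (n - 1 - i) + list(tokens) for i in range(n)]


def ngrams(tokens, n):
    """Step 5: read the table row by row, keeping only fully filled rows."""
    columns = shifted_columns(tokens, n)
    # zip stops at the shortest column (the last one), which discards the
    # incomplete rows at the bottom; the filter discards those at the top.
    return [row for row in zip(*columns) if None not in row]


# Step 1: one token per item. The sentence is the example from this message.
tokens = "The cat ran away .".split()

for gram in ngrams(tokens, 3):
    print(" ".join(gram))
```

A sequence of POS-tags instead of words works in exactly the same way, 
since the code only looks at positions, not at what the tokens are.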

Hope this helps, and that you do not run into memory problems on the way.

Greetings,

Martin Reynaert
ILK

Tom Rankin wrote:
> Dear all,
>
> Can anyone give me any tips about how best to extract POS n-grams 
> from a corpus? I have removed the words from a tagged corpus and am 
> now using the cluster function in WordSmith to make n-grams of the 
> POS tags. Is there a better way?
>
> Thanks!
> Tom
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>   
