[Corpora-List] How to?: POS n-grams
Martin Reynaert
Reynaert at uvt.nl
Wed Sep 26 13:31:25 UTC 2007
Dear Tom,
I assume you do not use the Unix text utilities, so I will explain 'how
to' with the tools you probably do have.
1/ Put every word (POS tag, in your case ;0) ) on its own line, i.e.
replace every space by a newline character. Punctuation should already
be split off as a separate token.
Example: the corpus `The cat ran away.' becomes:
The
cat
ran
away
.
2/ Convert this column into an MS Word table
3/ Copy the table as many times as the n-gram required:
Example (for 3-grams):
The The The
cat cat cat
ran ran ran
away away away
. . .
4/ Insert empty cells at the top of every column except the last. The
number of empty cells required is n-1 for the first column, n-2 for the
second, etc.
Example: ('EC' represents an empty cell)
EC EC The
EC The cat
The cat ran
cat ran away
ran away .
away .
5/ Convert the table back to text. Every row with all n cells filled is
now one n-gram. Done!
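For readers who do have a scripting language at hand, the same
shifted-column idea can be sketched in Python. This is only an
illustration of the steps above, not part of the recipe itself: the
function names (shifted_columns, ngrams) and the sample tag string are
made up for the example, and it assumes the tags are separated by
whitespace.

```python
from collections import Counter


def shifted_columns(tokens, n):
    """Steps 3-4: copy the column n times and pad the tops.

    Column k (counting from 0) gets n-1-k empty cells (None) on top,
    so the first column gets n-1 and the last column gets none.
    """
    return [[None] * (n - 1 - k) + list(tokens) for k in range(n)]


def ngrams(tokens, n):
    """Step 5: read the table row by row, keeping only the full rows."""
    columns = shifted_columns(tokens, n)
    # zip() stops at the shortest column (the last one), which drops
    # the incomplete rows at the bottom of the table.
    rows = zip(*columns)
    # Rows that still contain an empty cell are the incomplete rows
    # at the top of the table.
    return [row for row in rows if None not in row]


# Step 1: one tag per "line"; here simply a list split on whitespace.
# The tag sequence below is a hypothetical example, not real corpus data.
tags = "DT NN VBD RB . DT NN VBD RB .".split()

counts = Counter(ngrams(tags, 3))
for trigram, freq in counts.most_common():
    print(freq, " ".join(trigram))
```

Counting the rows with Counter gives the frequency list that one would
otherwise get by sorting and counting the text produced in step 5.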
Hope this helps and you do not run into memory problems on the way.
Greetings,
Martin Reynaert
ILK
Tom Rankin wrote:
> Dear all,
>
> Can anyone give me any tips about how best to extract POS n-grams
> from a corpus? I have removed the words from a tagged corpus and am
> now using the cluster function in WordSmith to make n-grams of the
> POS tags. Is there a better way?
>
> Thanks!
> Tom
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>