Corpora: Tools needed to process British National Corpus

Thu Sep 14 09:35:10 UTC 2000

I attach a minimalist Perl program that does the job.  Or you can find
lists already generated on my website.

	   Adam

Kai Noponen wrote
> I need a tool that can make a frequency list out of the BNC. It must
> utilize the part-of-speech tags in order to separate the different cases.
> It also should read SGML.

--
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Senior Research Fellow                         tel: (44) 1273 642919
Information Technology Research Institute           (44) 1273 642900
University of Brighton                         fax: (44) 1273 642908
Lewes Road
Brighton BN2 4GJ         email:      Adam.Kilgarriff at itri.bton.ac.uk
UK                       http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

==============cut here===================

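$/="<w ";
# each record read by <> now ends at the start of the next <w ...> element
while (<>){
    next unless /^([^>]+)>([^<]+)/;
# skip records that don't match, otherwise $1 and $2 still hold the previous word
    $word=lc $2;
# all words normalised to lower case --delete 'lc' if you want to retain capitalisation
    $pos = $1;
    $word =~ s/\n/ /g;
    $word =~ s/ +$//;
    $word =~ s/ /_/g;
# multiword 'words' will have _ between items ("in_order_to") instead of spaces
# (the /g flags matter: without them only the first space would be replaced)
    $count{$word." ".$pos}++;
}
for (keys %count){print "$_ $count{$_}\n"}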

# words which, for some reason, weren't marked up with an SGML <w> tag will be missed
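
To try the idea without the corpus to hand, here is a rough, self-contained sketch of the same counting logic run on a made-up two-sentence sample. It also prints the list sorted by frequency. The sub name count_words, the sample text and the tags in it are illustrative assumptions, not real BNC material, and whitespace is collapsed slightly more aggressively than in the script above.

```perl
use strict;
use warnings;

# Count "word POS" pairs in a string of BNC-style SGML.
# (count_words is a hypothetical helper name used only for this sketch.)
sub count_words {
    my ($sgml) = @_;
    my %count;
    # Split at the start of each <w ...> element; the first chunk is
    # whatever precedes the first <w tag, so drop it.
    my @chunks = split /<w /, $sgml;
    shift @chunks;
    for my $chunk (@chunks) {
        # Expect "POS>word"; skip anything that doesn't fit.
        next unless $chunk =~ /^([^>]+)>([^<]+)/;
        my ($pos, $word) = ($1, lc $2);
        $word =~ s/\s+$//;     # strip trailing spaces and newlines
        $word =~ s/\s+/_/g;    # join multiword units with _
        next if $word eq '';
        $count{"$word $pos"}++;
    }
    return \%count;
}

# Made-up sample in the style of BNC word markup (not real corpus text).
my $sample = "<s n=1><w AT0>The <w NN1>cat <w VVD>sat <w PRP>on <w AT0>the <w NN1>mat"
           . "<c PUN>.\n<s n=2><w AT0>The <w NN1>dog <w VVD>sat <w AV0>in order\nto <w VVI>rest";

my $count = count_words($sample);

# Most frequent first; ties broken alphabetically so the order is stable.
for my $key (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    print "$key $count->{$key}\n";
}
```

On this sample the first line printed is "the AT0 3", and the multiword unit comes out as "in_order_to AV0 1". The punctuation inside the <c PUN> element is not counted, since only <w elements are looked at.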