Corpora: Tools needed to process British National Corpus
Adam Kilgarriff
Adam.Kilgarriff at itri.brighton.ac.uk
Thu Sep 14 09:35:10 UTC 2000
I attach a minimalist Perl prog that does the job. Or you can find
lists already generated on my website.
Adam
Kai Noponen wrote
> I need a tool that can make a frequency list out of the BNC. It must
> utilize the part-of-speech tags in order to separate the different cases.
> It also should read SGML.
--
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Senior Research Fellow tel: (44) 1273 642919
Information Technology Research Institute (44) 1273 642900
University of Brighton fax: (44) 1273 642908
Lewes Road
Brighton BN2 4GJ email: Adam.Kilgarriff at itri.bton.ac.uk
UK http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
==============cut here===================
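$/="<w ";
# each record now ends at the next "<w ", so it begins with "POS>word"
while (<>){
$header = ($. == 1);
# the first record of each file is the text before its first <w tag
close ARGV if eof;
# resets $. so that the next file's header is recognised too
next if $header;
next unless /^([^>]+)>([^<]+)/;
# skip records that don't match, rather than reusing the previous $1 and $2
$word=lc $2;
# all words normalised to lower case; delete 'lc' if you want to retain capitalisation
$pos = $1;
$word =~ s/\n/ /g;
$word =~ s/ +$//;
$word =~ s/ +/_/g;
# multiword 'words' will have _ between items ("in_order_to") instead of spaces
$count{$word." ".$pos}++;
}
for (keys %count){print "$_ $count{$_}\n"}
# words which, for some reason, weren't marked up with SGML w tag will be missed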
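
For anyone who wants the list ordered by frequency, here is a rough sketch of the same counting idea wrapped in a sub, with the output sorted so the most frequent pairs come first. It is only an illustration, not part of the original prog: the name count_tokens, the sample string and the tags in it are made up for the example, and it assumes the markup looks like "<w POS>word " throughout.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): count "word POS" pairs
# in a string of BNC-style SGML, where each token looks like "<w POS>word ".
sub count_tokens {
    my ($sgml) = @_;
    my %count;
    # Split at the start of every <w ...> tag. The first chunk is whatever
    # precedes the first tag (header text, or an empty string), so drop it.
    my @chunks = split /<w /, $sgml;
    shift @chunks;
    for my $chunk (@chunks) {
        # $1 = POS tag, $2 = text up to the next tag
        next unless $chunk =~ /^([^>]+)>([^<]+)/;
        my ($pos, $word) = ($1, lc $2);
        $word =~ s/\s+/ /g;     # collapse newlines and runs of spaces
        $word =~ s/^ | $//g;    # trim a leading or trailing space
        next if $word eq '';
        $word =~ s/ /_/g;       # "of course" becomes "of_course"
        $count{"$word $pos"}++;
    }
    return \%count;
}

# Made-up sample for illustration only; the tags are assumptions.
my $sample = "<s n=\"1\"><w AV0>Of course <w AT0>the <w NN1>cat <w VVD>sat "
           . "<w PRP>on <w AT0>the <w NN1>mat<c PUN>.";
my $count = count_tokens($sample);

# Most frequent first; ties broken alphabetically so the order is stable.
for my $key (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    print "$key $count->{$key}\n";
}
```

On the sample above this prints "the AT0 2" first, followed by the five pairs that occur once, in alphabetical order. Sorting needs the whole hash in memory, which the original prog already assumes.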