[Corpora-List] Keywords Generator
Ben Allison
ben at dcs.shef.ac.uk
Mon Feb 18 11:57:29 UTC 2008
Hi,
If you're happy with *nix, this can be done in a couple of lines:
first produce a frequency-counted word list for each of the two
corpora, then grep both lists for the word you're interested in.
Depending on how reliable you want word detection to be, the first part
might be a little more complicated (especially if you have odd
formatting/encoding issues), but the second is just a single line.
A very simple script for determining the count of a word SOME_WORD in
the corpus myfile.txt (assuming the script words.pl I include below)
would be:
./words.pl < myfile.txt | sort | uniq -c | grep -w 'SOME_WORD'
(replace myfile.txt and SOME_WORD with appropriate strings; give SOME_WORD
in lower case, since words.pl lowercases its input, and note that -w stops
'law' from also matching 'lawyer'). There are no doubt better ways. Also,
you may wish to consider normalised frequency (e.g. occurrences per
million words), since raw counts are a poor basis for comparison if the
corpora are of different lengths.
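As a rough sketch of the normalisation idea, the shell functions below
report how often a word occurs per million tokens in a file. The names
tokenise and per_million are made up for illustration, and tokenise is
only a tr-based stand-in for words.pl (it treats apostrophes at the edges
of words slightly differently), so take the figures as indicative.

```shell
#!/bin/sh
# Hypothetical helper: lowercase the input and print one token per line.
# Only an approximation of words.pl, using tr instead of a Perl regex.
tokenise() {
    tr 'A-Z' 'a-z' | tr -cs "a-z'" '\n' | grep -v '^$'
}

# per_million WORD FILE
# Print the number of occurrences of WORD per million tokens in FILE.
# WORD must be given in lower case, because the tokens are lowercased.
per_million() {
    tokenise < "$2" | awk -v w="$1" '
        $0 == w { hits++ }                 # count exact token matches
        END {
            if (NR > 0) printf "%.1f\n", hits * 1000000 / NR
            else        print "0.0"        # empty corpus: avoid dividing by zero
        }'
}
```

Calling per_million SOME_WORD myfile.txt once per corpus gives two
figures that can be compared directly even when the corpora differ a
lot in size.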
Ben
---------------------------------
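#!/usr/bin/perl
# words.pl: print each word of the input on its own line, lowercased.
use strict;
use warnings;

while (<>) {
    tr/A-Z/a-z/;                    # fold ASCII upper case to lower case
    while (/\b([a-z']+)\b/g) {      # runs of letters and apostrophes
        print "$1\n";
    }
}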
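
The same pipeline can be stretched to a whole keyword list. The sketch
below prints each keyword with its raw count in two corpora, tab
separated. It is only one possible arrangement: tokenise, freq_list and
parallel_counts are hypothetical names, tokenise is a tr-based
approximation of words.pl, and the word list is assumed to hold one
lower-case word per line.

```shell
#!/bin/sh
# Hypothetical helper: lowercase the input and print one token per line
# (an approximation of words.pl).
tokenise() {
    tr 'A-Z' 'a-z' | tr -cs "a-z'" '\n' | grep -v '^$'
}

# freq_list FILE: print "word count" lines for FILE.
freq_list() {
    tokenise < "$1" | sort | uniq -c | awk '{ print $2, $1 }'
}

# parallel_counts WORDLIST CORPUS_A CORPUS_B
# For each word in WORDLIST print: word <TAB> count in A <TAB> count in B.
parallel_counts() {
    {
        freq_list "$2" | sed 's/^/A /'   # tag lines from the first corpus
        freq_list "$3" | sed 's/^/B /'   # tag lines from the second corpus
        sed 's/^/W /' "$1"               # tag the keywords themselves
    } | awk '
        $1 == "A" { ca[$2] = $3; next }
        $1 == "B" { cb[$2] = $3; next }
        $1 == "W" && NF >= 2 {           # skip blank lines in the word list
            printf "%s\t%d\t%d\n", $2, ca[$2], cb[$2]
        }'
}
```

A call such as parallel_counts keywords.txt general.txt law.txt (file
names invented here) lists the keywords in their original order, with 0
for any word a corpus does not contain.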
True Friend wrote:
> Hi Folks
> I need a program/script (even for *nix) that can provide the frequency
> of a word list in two corpora. I made this list by comparing two word
> lists, one from general English (specifically of Pakistani origin) and
> one from law English (also of Pakistani origin). I now want to present
> these keywords with their frequencies in both corpora as proof that
> these words are more frequent in law. The keywords were generated by
> AntConc.
> Is there any script/tool that can generate a parallel list of
> frequencies of each word in both corpora?
> Regards
> M Shakir Aziz
> A Corpus Linguistics Student
> Pakistan
>
> --
> محمد شاکر عزیز
> ------------------------------------------------------------------------
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>