[Corpora-List] Keywords Generator
True Friend
true.friend2004 at gmail.com
Mon Feb 18 16:56:12 UTC 2008
Hi Sir
I tried your script, but it has some problems. The large size of the txt
files was probably the reason: corpus A was about 1.9 million and corpus B
was almost the same as A. It generated only "0"s for each word. Another
possible cause is the large size of the wordlist (1000 words). Here is a
glimpse of the result:
votes 0 0
whereas 0 0
whereby 0 0
wherein 0 0
without 0 0
witness 0 0
witnesses 0 0
wound 0 0
writ 0 0
written 0 0
zila 0 0
zina 0 0
court 0 0
When I tried it with a small wordlist, it produced counts for only one word
(the last one, *court*). Please see the result:
judge 0 0
judgment 0 0
land 0 0
law 0 0
learned 0 0
order 0 0
ordinance 0 0
person 0 0
petition 0 0
petitioner 0 0
police 0 0
record 0 0
respondent 0 0
section 0 0
suit 0 0
trial 0 0
court 718 11128
The procedure I had in mind was: take a word, find its frequency in corpus A
and then in corpus B, and then print it. I could not understand the code
(not a programmer yet :D), but anyhow something is wrong. Could you spare
some more time for it?
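
The procedure described above (take each word, count it in corpus A, count
it in corpus B, print the three columns) can be sketched in Python roughly
as follows. This is only an illustrative sketch, not Alexander's Perl
script: the function names (tokenize, read_wordlist, wordlist_frequencies,
print_report), the tokenisation rule and the UTF-8 encoding are all
assumptions. Each corpus is tokenised and counted once, so the size of the
wordlist should not matter much. Each wordlist line is also stripped of
surrounding whitespace, which guards against stray carriage returns from
Windows-style line endings; whether that is what actually went wrong here
is not known.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed normalisation: lower-case everything and keep only runs of
    # letters, digits and apostrophes, which drops punctuation.
    return re.findall(r"[a-z0-9']+", text.lower())


def read_wordlist(lines):
    # One word per line. strip() removes spaces, newlines and carriage
    # returns, so "court\r\n" and "court" are treated as the same word.
    words = []
    for line in lines:
        word = line.strip().lower()
        if word:
            words.append(word)
    return words


def wordlist_frequencies(words, text_a, text_b):
    # Count every token of each corpus once, then look up each listed word.
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    return [(word, counts_a[word], counts_b[word]) for word in words]


def print_report(wordlist_path, corpus_a_path, corpus_b_path):
    # Hypothetical helper: the files are assumed to be UTF-8 plain text.
    with open(wordlist_path, encoding="utf-8") as handle:
        words = read_wordlist(handle)
    with open(corpus_a_path, encoding="utf-8") as handle:
        text_a = handle.read()
    with open(corpus_b_path, encoding="utf-8") as handle:
        text_b = handle.read()
    for word, freq_a, freq_b in wordlist_frequencies(words, text_a, text_b):
        print(word, freq_a, freq_b)
```

Calling print_report("wordlist.txt", "corpus_a.txt", "corpus_b.txt") (the
file names are hypothetical) would print one line per word with its two
counts, in the same layout as the output above.
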
Thanks a lot for the effort you put into writing this script.
Regards
M Shakir
Pakistan
On Feb 18, 2008 5:34 PM, Alexander Schutz <goalscoringsuperstarhero at gmail.com> wrote:
> Hi Shakir,
>
> as part of a little exercise I wrote a tiny Perl script that does what
> you asked.
> It takes as parameters the wordlist, corpus_A and corpus_B (each as a
> text file) and produces as output the respective frequencies in each
> corpus:
> alesch at nbgal141:~$ perl wordlist_corpus_freq.pl wordlist.txt vbush.txt How2DoResearchMIT.txt
> color 1 0
> colour 0 0
> furiously 0 0
> green 0 0
> idea 7 22
> sleep 0 0
>
> It does some normalisation on the corpora, like conversion to lower case
> and punctuation removal.
>
> Please find it attached to this email, together with the sample
> wordlist.
>
> Hth,
> Alex
>
>
>
> On Feb 18, 2008 10:53 AM, True Friend <true.friend2004 at gmail.com> wrote:
>
> > Hi Folks
> > I need a program/script (even a *nix one) that can provide the
> > frequencies of a wordlist in two corpora. I made this list by comparing two
> > word lists, one from general English (specifically of Pakistani origin) and
> > one from law English (also of Pakistani origin). I now want to present these
> > keywords with their frequencies in both corpora as proof that these words
> > are more frequent in law. The keywords were generated with AntConc.
> > Is there any script/tool that can generate a parallel list of
> > frequencies of each word in both corpora?
> > Regards
> > M Shakir Aziz
> > A Corpus Linguistics Student
> > Pakistan
> >
> > --
> > محمد شاکر عزیز
> > _______________________________________________
> > Corpora mailing list
> > Corpora at uib.no
> > http://mailman.uib.no/listinfo/corpora
> >
> >
>
>
> --
> Alexander Schutz,
> Digital Enterprise Research Institute,
> Ollscoil na hÉireann, Gaillimh
> Galway, Ireland
--
محمد شاکر عزیز