Thanks, Mr. Schutz.<br>It is working fine now. The only thing I had to work around was the automatic wordlist generation: the script did not work with the wordlist generated by AntConc, so I typed the list manually, and now it works. A few words still show a frequency of 0; I'll correct them manually.<br>
Regards<br><br><div class="gmail_quote">On Tue, Feb 19, 2008 at 8:56 PM, Alexander Schutz <<a href="mailto:goalscoringsuperstarhero@gmail.com">goalscoringsuperstarhero@gmail.com</a>> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hi,<br>
<br>
I took the time to tidy up and document the Perl code a little; I hope<br>
it is a bit clearer now what it does. You can specify any number of corpus<br>
files on the command line, but the first file you specify must always<br>
be the wordlist file.<br>
Running the script on my machine yields the following:<br>
<br>
alesch@nbgal141:~/tmp$ perl wordlist_corpus_freq.pl wordlist.txt<br>
Charles_Dickens_-_David_Copperfield.txt<br>
James_Joyce_-_Ulysses_-_Text.txt Charles_Dickens_-_Oliver_Twist.txt<br>
reading wordlist : wordlist.txt<br>
processing corpus 0 : Charles_Dickens_-_David_Copperfield.txt<br>
processing corpus 1 : James_Joyce_-_Ulysses_-_Text.txt<br>
processing corpus 2 : Charles_Dickens_-_Oliver_Twist.txt<br>
color 0 0 0<br>
colour 25 23 5<br>
furious 2 5 7<br>
furiously 0 2 7<br>
green 39 55 24<br>
idea 93 55 14<br>
sleep 72 43 37<br>
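<br>
For readers who do not know Perl, the behaviour described in this thread (the wordlist file comes first, any number of corpus files follow, text is lower-cased and stripped of punctuation, one output row per word) can be sketched in Python. This is only an illustrative sketch, not the attached Perl script: the function names, the UTF-8 encoding, and the exact punctuation rule are assumptions.<br>

```python
import re
import sys
from collections import Counter


def normalise(text):
    # Assumed normalisation: lower-case, then replace every character that is
    # neither a word character nor whitespace with a space.
    return re.sub(r"[^\w\s]", " ", text.lower())


def frequency_table(wordlist, corpora):
    # Count every token once per corpus, then look up each wordlist entry.
    per_corpus = [Counter(normalise(text).split()) for text in corpora]
    return [(word, [counts[word] for counts in per_corpus]) for word in wordlist]


def read_wordlist(path):
    # One word per line; strip() removes surrounding whitespace, including a
    # trailing carriage return, and blank lines are skipped.
    with open(path, encoding="utf-8") as fh:
        return [line.strip().lower() for line in fh if line.strip()]


def main(paths):
    # paths[0] is the wordlist file, all remaining paths are corpus files.
    wordlist = read_wordlist(paths[0])
    corpora = []
    for path in paths[1:]:
        with open(path, encoding="utf-8", errors="replace") as fh:
            corpora.append(fh.read())
    for word, freqs in frequency_table(wordlist, corpora):
        print(word, *freqs)


if __name__ == "__main__" and len(sys.argv) > 2:
    main(sys.argv[1:])
```

Saved under a hypothetical name such as wordlist_freq.py, it would be called with the same argument order as the Perl script: the wordlist file first, then the corpus files.<br>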
<br>
<br>
If you have questions don't hesitate to get back to me,<br>
<br>
Hth,<br>
Alex<br>
<div><div></div><div class="Wj3C7c"><br>
<br>
On Feb 18, 2008 4:56 PM, True Friend <<a href="mailto:true.friend2004@gmail.com">true.friend2004@gmail.com</a>> wrote:<br>
> Hi Sir<br>
> I tried your script, but it has some problems. The large size of the text files was probably the reason: corpus A was about 1.9 million, and corpus B was almost as large. It generated only "0"s for every word. The size of the wordlist (1000 words) may be another factor. A glimpse of the result:<br>
> votes 0 0<br>
> whereas 0 0<br>
> whereby 0 0<br>
> wherein 0 0<br>
> without 0 0<br>
> witness 0 0<br>
> witnesses 0 0<br>
> wound 0 0<br>
> writ 0 0<br>
> written 0 0<br>
> zila 0 0<br>
> zina 0 0<br>
> court 0 0<br>
> When I tried it with a small wordlist, it produced counts for only one word (the last one, court). Please see the result:<br>
> judge 0 0<br>
> judgment 0 0<br>
> land 0 0<br>
> law 0 0<br>
> learned 0 0<br>
> order 0 0<br>
> ordinance 0 0<br>
> person 0 0<br>
> petition 0 0<br>
> petitioner 0 0<br>
> police 0 0<br>
> record 0 0<br>
> respondent 0 0<br>
> section 0 0<br>
> suit 0 0<br>
> trial 0 0<br>
> court 718 11128<br>
> The procedure I had in mind was: take a word, find its frequency in corpus A and then in corpus B, and then print it. I could not understand the code (not a programmer yet :D), but something is clearly wrong. Could you spare some more time for it?<br>
> Thanks a lot for your effort to write this script.<br>
> Regards<br>
> M Shakir<br>
> Pakistan<br>
><br>
><br>
><br>
> On Feb 18, 2008 5:34 PM, Alexander Schutz <<a href="mailto:goalscoringsuperstarhero@gmail.com">goalscoringsuperstarhero@gmail.com</a>> wrote:<br>
><br>
><br>
><br>
><br>
> > Hi Shakir,<br>
> ><br>
> > as part of a little exercise I wrote a tiny perl script performing what you asked.<br>
> > It takes as parameters the wordlist, the corpus_A and the corpus_B (each as text files)<br>
> > and produces as output the respective frequencies in each corpus:<br>
> > alesch@nbgal141:~$ perl wordlist_corpus_freq.pl wordlist.txt vbush.txt How2DoResearchMIT.txt<br>
> > color 1 0<br>
> > colour 0 0<br>
> > furiously 0 0<br>
> > green 0 0<br>
> > idea 7 22<br>
> > sleep 0 0<br>
> ><br>
> > It does some normalisation on the corpora, like conversion to lower case and<br>
> > punctuation removal.<br>
> ><br>
> > Please find it attached to this email, together with the sample wordlist.<br>
> ><br>
> > Hth,<br>
> > Alex<br>
> ><br>
> > On Feb 18, 2008 10:53 AM, True Friend <<a href="mailto:true.friend2004@gmail.com">true.friend2004@gmail.com</a>> wrote:<br>
> ><br>
> > ><br>
> > ><br>
> > ><br>
> > > Hi Folks<br>
> > > I need a program/script (even a *nix one) that can provide the frequencies of a wordlist in two corpora. I made this list by comparing two word lists, one from general English (specifically of Pakistani origin) and one from law English (also of Pakistani origin). I now want to present these keywords with their frequencies in both corpora as evidence that these words are more frequent in law. The keywords were generated by AntConc.<br>
> > > Is there any script/tool that can generate a parallel list of frequencies of each word in both corpora?<br>
> > > Regards<br>
> > > M Shakir Aziz<br>
> > > A Corpus Linguistics Student<br>
> > > Pakistan<br>
> > ><br>
> > > --<br>
> > > محمد شاکر عزیز<br>
> > ><br>
> > > _______________________________________________<br>
> > > Corpora mailing list<br>
> > > <a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>
> > > <a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
> > ><br>
> > ><br>
> ><br>
> ><br>
> ><br>
> > --<br>
> > Alexander Schutz,<br>
> > Digital Enterprise Research Institute,<br>
> > Ollscoil na hÉireann, Gaillimh<br>
> > Galway, Ireland<br>
><br>
><br>
><br>
> --<br>
> محمد شاکر عزیز<br>
<br>
<br>
<br>
--<br>
Alexander Schutz,<br>
Digital Enterprise Research Institute,<br>
Ollscoil na hÉireann, Gaillimh<br>
Galway, Ireland<br>
</div></div></blockquote></div><br><br clear="all"><br>-- <br>محمد شاکر عزیز