Thanks, Mr. Schutz.<br>Now it is working fine. The only thing I had to work around was the automatic generation of the wordlist. It didn't work with the wordlist generated by AntConc, so I typed it manually and now it works fine. I can see a few words with 0 frequency; I'll correct them manually.<br>

Regards<br><br><div class="gmail_quote">On Tue, Feb 19, 2008 at 8:56 PM, Alexander Schutz <<a href="mailto:goalscoringsuperstarhero@gmail.com">goalscoringsuperstarhero@gmail.com</a>> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Hi,<br>

<br>

I took the time to beautify and document the Perl code a little bit; I hope<br>

it is a bit clearer now what is done. You can specify any number of corpus<br>

files on the command line; however, the first file you specify must always<br>

be the wordlist file.<br>

Running the script on my machine yields the following:<br>

<br>

alesch@nbgal141:~/tmp$ perl wordlist_corpus_freq.pl wordlist.txt<br>

Charles_Dickens_-_David_Copperfield.txt<br>

James_Joyce_-_Ulysses_-_Text.txt Charles_Dickens_-_Oliver_Twist.txt<br>

reading wordlist : wordlist.txt<br>

processing corpus 0 : Charles_Dickens_-_David_Copperfield.txt<br>

processing corpus 1 : James_Joyce_-_Ulysses_-_Text.txt<br>

processing corpus 2 : Charles_Dickens_-_Oliver_Twist.txt<br>

               color     0     0     0<br>

              colour    25    23     5<br>

             furious     2     5     7<br>

           furiously     0     2     7<br>

               green    39    55    24<br>

                idea    93    55    14<br>

               sleep    72    43    37<br>
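
For readers without the attachment, the behaviour described in this thread (wordlist as the first argument, any number of corpus files after it, lower-casing and punctuation removal, one right-aligned row per word) can be approximated by the Python sketch below. It is not the attached Perl script: the function names are invented for illustration, and treating every ASCII punctuation character as a word separator is an assumption.<br>

```python
import string
import sys
from collections import Counter

# Assumption: every ASCII punctuation character acts as a word separator.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalise(text):
    """Lower-case the text and replace ASCII punctuation with spaces."""
    return text.lower().translate(_PUNCT_TO_SPACE)


def read_wordlist(path):
    """Return one lower-cased word per non-empty line.

    strip() also removes a stray carriage return left by Windows line endings.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.strip().lower() for line in handle if line.strip()]


def corpus_counts(path):
    """Count every whitespace-separated token of the normalised corpus."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return Counter(normalise(handle.read()).split())


def frequency_rows(words, corpus_paths):
    """Pair each word with its frequency in every corpus, in argument order."""
    counts = [corpus_counts(path) for path in corpus_paths]
    return [(word, [c[word] for c in counts]) for word in words]


def format_row(word, freqs):
    """Right-align the word in 20 columns and each count in 6 columns."""
    return f"{word:>20}" + "".join(f"{n:>6}" for n in freqs)


def main(wordlist_path, corpus_paths):
    print(f"reading wordlist : {wordlist_path}")
    for index, path in enumerate(corpus_paths):
        print(f"processing corpus {index} : {path}")
    for word, freqs in frequency_rows(read_wordlist(wordlist_path), corpus_paths):
        print(format_row(word, freqs))


# Runs only when a wordlist and at least one corpus file are given.
if __name__ == "__main__" and len(sys.argv) >= 3:
    main(sys.argv[1], sys.argv[2:])
```

Saved under a hypothetical name such as wordlist_corpus_freq.py, it would be called as: python wordlist_corpus_freq.py wordlist.txt corpus_a.txt corpus_b.txt. Each corpus is tokenised once and looked up per word, so the size of the wordlist has little effect on running time.<br>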

<br>

<br>

If you have questions, don't hesitate to get back to me.<br>

<br>

Hth,<br>

Alex<br>

<div><div></div><div class="Wj3C7c"><br>

<br>

On Feb 18, 2008 4:56 PM, True Friend <<a href="mailto:true.friend2004@gmail.com">true.friend2004@gmail.com</a>> wrote:<br>

> Hi Sir<br>

> I tried your script, but it has some problems. Probably the large size of the text files was the reason: corpus A was about 1.9 million and corpus B was almost the same size as A. It generated only "0"s for each word. Another factor was probably the large size of the wordlist (1000 words). A glimpse of the result:<br>


>                votes     0     0<br>

>              whereas     0     0<br>

>              whereby     0     0<br>

>              wherein     0     0<br>

>              without     0     0<br>

>              witness     0     0<br>

>            witnesses     0     0<br>

>                wound     0     0<br>

>                 writ     0     0<br>

>              written     0     0<br>

>                 zila     0     0<br>

>                 zina     0     0<br>

>                court     0     0<br>

> When I tried it with a small wordlist, it produced counts for only one word (the last one, court). Please see the result:<br>

>                judge     0     0<br>

>             judgment     0     0<br>

>                 land     0     0<br>

>                  law     0     0<br>

>              learned     0     0<br>

>                order     0     0<br>

>            ordinance     0     0<br>

>               person     0     0<br>

>             petition     0     0<br>

>           petitioner     0     0<br>

>               police     0     0<br>

>               record     0     0<br>

>           respondent     0     0<br>

>              section     0     0<br>

>                 suit     0     0<br>

>                trial     0     0<br>

>                court   718  11128<br>

> The procedure I had in mind was: grab a word, find its frequency in corpus A and then in corpus B, and then print it. I could not understand the code (not a programmer yet :D); anyhow, there is something wrong. So can you spare some more time for it?<br>


> Thanks a lot for your effort to write this script.<br>

> Regards<br>

> M Shakir<br>

> Pakistan<br>

><br>

><br>

><br>

> On Feb 18, 2008 5:34 PM, Alexander Schutz <<a href="mailto:goalscoringsuperstarhero@gmail.com">goalscoringsuperstarhero@gmail.com</a>> wrote:<br>

><br>


> > Hi Shakir,<br>

> ><br>

> > as part of a little exercise I wrote a tiny Perl script that does what you asked.<br>

> > It takes as parameters the wordlist, the corpus_A and the corpus_B (each as text files)<br>

> > and produces as output the respective frequencies in each corpus:<br>

> > alesch@nbgal141:~$ perl wordlist_corpus_freq.pl wordlist.txt vbush.txt How2DoResearchMIT.txt<br>

> >                color     1     0<br>

> >               colour     0     0<br>

> >            furiously     0     0<br>

> >                green     0     0<br>

> >                 idea     7    22<br>

> >                sleep     0     0<br>

> ><br>

> > It does some normalisation on the corpora, like conversion to lower case and<br>

> > punctuation removal.<br>

> ><br>

> > Please find it attached to this email, together with the sample wordlist.<br>

> ><br>

> > Hth,<br>

> > Alex<br>

> ><br>


> > On Feb 18, 2008 10:53 AM, True Friend <<a href="mailto:true.friend2004@gmail.com">true.friend2004@gmail.com</a>> wrote:<br>

> ><br>

> > ><br>

> > ><br>

> > ><br>

> > > Hi Folks<br>

> > > I need a program/script (even one for *nix) that can provide the frequencies of a wordlist in two corpora. I made this list by comparing two word lists, one from general English (specifically of Pakistani origin) and one from legal English (also of Pakistani origin). I now want to present these keywords with their frequencies in both corpora as proof that these words are more frequent in law. The keywords were generated by AntConc.<br>


> > > Is there any script/tool that can generate a parallel list of frequencies of each word in both corpora?<br>

> > > Regards<br>

> > > M Shakir Aziz<br>

> > > A Corpus Linguistics Student<br>

> > > Pakistan<br>

> > ><br>

> > > --<br>

> > > محمد شاکر عزیز<br>

> > ><br>

> > > _______________________________________________<br>

> > > Corpora mailing list<br>

> > > <a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>

> > > <a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>

> > ><br>

> > ><br>

> ><br>

> ><br>

> ><br>

> > --<br>

> > Alexander Schutz,<br>

> > Digital Enterprise Research Institute,<br>

> > Ollscoil na hÉireann, Gaillimh<br>

> > Galway, Ireland<br>

><br>

><br>

><br>

> --<br>

> محمد شاکر عزیز<br>

<br>

<br>

<br>

--<br>

Alexander Schutz,<br>

Digital Enterprise Research Institute,<br>

Ollscoil na hÉireann, Gaillimh<br>

Galway, Ireland<br>

</div></div></blockquote></div><br><br clear="all"><br>-- <br>محمد شاکر عزیز