[Corpora-List] Keywords Generator

Fri Feb 22 07:08:23 UTC 2008

Thanks, Mr. Schutz.
It is working fine now. The only thing I had to work around was the automatic
generation of the wordlist: the script did not work with the wordlist generated
by AntConc, so I typed the list manually, and now it works fine. I can see a few
words with 0 frequency; I'll correct them manually.
Regards
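
A hand-typed list can be avoided by cleaning the exported list first. The following is a minimal Python sketch, not part of the Perl script discussed in this thread. It assumes the export has optional header lines starting with "#" and rows of the form rank, frequency, word; that layout should be checked against the actual AntConc file. The function name extract_words is hypothetical. One possible explanation (an assumption, not confirmed in the thread) for a list in which only the last word matches is that every line except the last ends in a carriage return or other trailing whitespace, so the sketch strips that as well.

```python
def extract_words(lines):
    """Return one lowercase word per entry from an exported word list.

    Assumed layout (hypothetical, verify against the real export):
    optional '#' header lines, then rows of 'rank<TAB>frequency<TAB>word',
    or simply one word per line.
    """
    words = []
    seen = set()
    for raw in lines:
        line = raw.strip()  # drops trailing "\r\n", tabs and spaces
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        # Discard purely numeric columns (rank, frequency); keep the word.
        fields = [field for field in line.split() if not field.isdigit()]
        if not fields:
            continue
        word = fields[-1].lower()
        if word not in seen:  # keep the first occurrence only
            seen.add(word)
            words.append(word)
    return words


if __name__ == "__main__":
    sample = ["#Word Types: 3\r\n", "1\t718\tcourt\r\n",
              "2\t95\tJudge\r\n", "\r\n", "petition"]
    print("\n".join(extract_words(sample)))
```

The cleaned list can then be written out one word per line and passed as the first argument to the script.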

On Tue, Feb 19, 2008 at 8:56 PM, Alexander Schutz <
goalscoringsuperstarhero at gmail.com> wrote:

> Hi,
>
> I took the time to beautify and document the Perl code a little, so I hope
> it is now a bit clearer what it does. You can specify any number of corpus
> files on the command line, but the first file you specify must always be
> the wordlist file.
> Running the script on my machine yields the following:
>
> alesch at nbgal141:~/tmp$ perl wordlist_corpus_freq.pl wordlist.txt Charles_Dickens_-_David_Copperfield.txt James_Joyce_-_Ulysses_-_Text.txt Charles_Dickens_-_Oliver_Twist.txt
> reading wordlist : wordlist.txt
> processing corpus 0 : Charles_Dickens_-_David_Copperfield.txt
> processing corpus 1 : James_Joyce_-_Ulysses_-_Text.txt
> processing corpus 2 : Charles_Dickens_-_Oliver_Twist.txt
>               color     0     0     0
>              colour    25    23     5
>             furious     2     5     7
>           furiously     0     2     7
>               green    39    55    24
>                idea    93    55    14
>               sleep    72    43    37
>
>
> If you have questions, don't hesitate to get back to me.
>
> Hth,
> Alex
>
>
> On Feb 18, 2008 4:56 PM, True Friend <true.friend2004 at gmail.com> wrote:
> > Hi Sir,
> > I tried your script, but it has some problems. Probably the large size
> > of the text files was the reason: Corpus A is about 1.9 million and
> > Corpus B is almost the same size. It generated only "0"s for every word.
> > Another possible cause is the size of the wordlist (1000 words). A glimpse
> > of the result:
> >                votes     0     0
> >              whereas     0     0
> >              whereby     0     0
> >              wherein     0     0
> >              without     0     0
> >              witness     0     0
> >            witnesses     0     0
> >                wound     0     0
> >                 writ     0     0
> >              written     0     0
> >                 zila     0     0
> >                 zina     0     0
> >                court     0     0
> > When I tried a small wordlist, it produced counts for only one word (the
> > last one, "court"). Please see the result:
> >                judge     0     0
> >             judgment     0     0
> >                 land     0     0
> >                  law     0     0
> >              learned     0     0
> >                order     0     0
> >            ordinance     0     0
> >               person     0     0
> >             petition     0     0
> >           petitioner     0     0
> >               police     0     0
> >               record     0     0
> >           respondent     0     0
> >              section     0     0
> >                 suit     0     0
> >                trial     0     0
> >                court   718  11128
> > The procedure I had in mind was: take a word, find its frequency in
> > Corpus A and then in Corpus B, and print it. I could not understand the
> > code (not a programmer yet :D), but something is going wrong. Could you
> > spare some more time for it?
> > Thanks a lot for your effort to write this script.
> > Regards
> > M Shakir
> > Pakistan
> >
> >
> >
> > On Feb 18, 2008 5:34 PM, Alexander Schutz <goalscoringsuperstarhero at gmail.com> wrote:
> >
> > > Hi Shakir,
> > >
> > > As a little exercise I wrote a tiny Perl script that does what you
> > > asked. It takes as parameters the wordlist, corpus_A and corpus_B (each
> > > as a text file) and prints the respective frequencies in each corpus:
> > > alesch at nbgal141:~$ perl wordlist_corpus_freq.pl wordlist.txt vbush.txt How2DoResearchMIT.txt
> > >                color     1     0
> > >               colour     0     0
> > >            furiously     0     0
> > >                green     0     0
> > >                 idea     7    22
> > >                sleep     0     0
> > >
> > > It does some normalisation on the corpora, such as conversion to lower
> > > case and punctuation removal.
> > >
> > > Please find it attached to this email, together with the sample wordlist.
> > >
> > > Hth,
> > > Alex
> > >
> > > On Feb 18, 2008 10:53 AM, True Friend <true.friend2004 at gmail.com> wrote:
> > > >
> > > > Hi Folks,
> > > > I need a program/script (even a *nix one) that can give the frequency
> > > > of each word in a wordlist in two corpora. I made this list by
> > > > comparing two word lists, one from general English and one from law
> > > > English (both of Pakistani origin). I now want to present these
> > > > keywords with their frequencies in both corpora as proof that these
> > > > words are more frequent in law. The keywords were generated by AntConc.
> > > > Is there any script/tool that can generate a parallel list of the
> > > > frequencies of each word in both corpora?
> > > > Regards
> > > > M Shakir Aziz
> > > > A Corpus Linguistics Student
> > > > Pakistan
> > > >
> > > > --
> > > > محمد شاکر عزیز
> > > >
> > > > _______________________________________________
> > > > Corpora mailing list
> > > > Corpora at uib.no
> > > > http://mailman.uib.no/listinfo/corpora
> > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Alexander Schutz,
> > > Digital Enterprise Research Institute,
> > > Ollscoil na hÉireann, Gaillimh
> > > Galway, Ireland
> >
> >
> >
> > --
> > محمد شاکر عزیز
>
>
>
> --
> Alexander Schutz,
> Digital Enterprise Research Institute,
> Ollscoil na hÉireann, Gaillimh
> Galway, Ireland
>
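
The procedure described in the thread (read a wordlist, then count each word in every corpus after lowercasing and removing punctuation) can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the Perl script attached to the original email, which is not shown here. The names tokenize, read_wordlist, corpus_counts and frequency_table, the UTF-8 encoding, and the column widths are all assumptions made for the example.

```python
import re
import sys
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of word characters (drops punctuation)."""
    return re.findall(r"\w+", text.lower())


def read_wordlist(path):
    """Read one word per line, stripping whitespace (including '\\r') and blanks."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip().lower() for line in handle if line.strip()]


def corpus_counts(path):
    """Count every token in one corpus file (encoding assumed to be UTF-8)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return Counter(tokenize(handle.read()))


def frequency_table(words, counters):
    """Return one formatted row per word: the word, then one count per corpus."""
    rows = []
    for word in words:
        # A Counter returns 0 for words that never occur in the corpus.
        counts = "".join(f"{counter[word]:6d}" for counter in counters)
        rows.append(f"{word:>20}{counts}")
    return rows


# Usage (hypothetical file name): python wordlist_freq.py WORDLIST CORPUS [CORPUS ...]
if __name__ == "__main__" and len(sys.argv) >= 3:
    wordlist = read_wordlist(sys.argv[1])
    corpora = [corpus_counts(path) for path in sys.argv[2:]]
    print("\n".join(frequency_table(wordlist, corpora)))
```

Each corpus is tokenized once into a Counter, so looking up 1000 words costs little even for corpora of a couple of million tokens. As in the thread, the wordlist file comes first and any number of corpus files may follow.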

-- 
محمد شاکر عزیز