[Corpora-List] Keywords Generator

Mon Feb 18 16:56:12 UTC 2008

Hi Sir,
I tried your script, but it has some problems. The large size of the text
files was probably the reason: Corpus A was about 1.9 million and Corpus B was
almost the same size. It generated only "0"s for every word. Another possible
cause is the large size of the wordlist (1000 words). Here is a glimpse of the
result:
   votes     0     0
             whereas     0     0
             whereby     0     0
             wherein     0     0
             without     0     0
             witness     0     0
           witnesses     0     0
               wound     0     0
                writ     0     0
             written     0     0
                zila     0     0
                zina     0     0
            court     0     0
When I tried it with a small wordlist, it produced counts for only one word
(the last one, *court*). Please see the result:
        judge     0     0
            judgment     0     0
                land     0     0
                 law     0     0
             learned     0     0
               order     0     0
           ordinance     0     0
              person     0     0
            petition     0     0
          petitioner     0     0
              police     0     0
              record     0     0
          respondent     0     0
             section     0     0
                suit     0     0
               trial     0     0
               court   718  11128
The procedure I had in mind was: take each word, find its frequency in
Corpus A and then in Corpus B, and print it. I could not understand the code
(I am not a programmer yet :D), but something is clearly going wrong. Could
you spare some more time for it?
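
The procedure described above (take each word, count it in Corpus A, count it
in Corpus B, print the result) can be sketched as follows. This is a minimal
Python illustration, not the attached Perl script; the function names
(tokenize, load_wordlist, compare_frequencies, main) are hypothetical, and the
normalisation (lower-casing and dropping punctuation) only approximates what
the original reply describes. The wordlist loader strips surrounding
whitespace, because a stray carriage return at the end of each line is a
common reason for lookups to miss; whether that is what happened here is only
a guess.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and return its word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def load_wordlist(lines):
    """Return one lower-cased word per non-empty line.

    strip() removes surrounding whitespace, including a trailing "\r"
    left behind by Windows line endings.
    """
    return [line.strip().lower() for line in lines if line.strip()]


def compare_frequencies(wordlist, text_a, text_b):
    """Return (word, count_in_a, count_in_b) for every word in the wordlist."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # A Counter returns 0 for words it has never seen.
    return [(word, counts_a[word], counts_b[word]) for word in wordlist]


def main(paths):
    """paths: [wordlist file, corpus A file, corpus B file] (assumed UTF-8)."""
    wordlist_path, corpus_a_path, corpus_b_path = paths
    with open(wordlist_path, encoding="utf-8") as handle:
        wordlist = load_wordlist(handle)
    with open(corpus_a_path, encoding="utf-8", errors="replace") as handle:
        text_a = handle.read()
    with open(corpus_b_path, encoding="utf-8", errors="replace") as handle:
        text_b = handle.read()
    for word, in_a, in_b in compare_frequencies(wordlist, text_a, text_b):
        print(f"{word:>20}{in_a:>6}{in_b:>6}")
```

To run it on files, call main with the three paths, for example
main(["wordlist.txt", "corpus_A.txt", "corpus_B.txt"]); those file names are
placeholders. Each corpus is read and counted once, so the size of the
wordlist should have little effect on the running time.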
Thanks a lot for the effort you put into writing this script.
Regards
M Shakir
Pakistan

On Feb 18, 2008 5:34 PM, Alexander Schutz <
goalscoringsuperstarhero at gmail.com> wrote:

> Hi Shakir,
>
> as a little exercise I wrote a tiny Perl script that does what you asked.
> It takes as parameters the wordlist, corpus_A and corpus_B (each as a
> text file) and prints the frequency of each word in each corpus:
> alesch at nbgal141:~$ perl wordlist_corpus_freq.pl wordlist.txt vbush.txt
> How2DoResearchMIT.txt
>                color     1     0
>               colour     0     0
>            furiously     0     0
>                green     0     0
>                 idea     7    22
>                sleep     0     0
>
> It does some normalisation on the corpora, such as conversion to lower case
> and punctuation removal.
>
> Please find it attached to this email, together with the sample wordlist.
>
> Hth,
> Alex
>
>
>
> On Feb 18, 2008 10:53 AM, True Friend <true.friend2004 at gmail.com> wrote:
>
> > Hi Folks
> > I need a program/script (even a *nix one) that can report the frequency
> > of each word in a wordlist in two corpora. I made this list by comparing
> > two word lists, one from general English (of Pakistani origin) and one
> > from legal English (also of Pakistani origin). I now want to present these
> > keywords with their frequencies in both corpora, as evidence that these
> > words are more frequent in legal English. The keywords were generated with
> > AntConc.
> > Is there any script/tool that can generate a parallel list of the
> > frequencies of each word in both corpora?
> > Regards
> > M Shakir Aziz
> > A Corpus Linguistics Student
> > Pakistan
> >
> > --
> > محمد شاکر عزیز
> > _______________________________________________
> > Corpora mailing list
> > Corpora at uib.no
> > http://mailman.uib.no/listinfo/corpora
> >
> >
>
>
> --
> Alexander Schutz,
> Digital Enterprise Research Institute,
> Ollscoil na hÉireann, Gaillimh
> Galway, Ireland


-- 
محمد شاکر عزیز