[Corpora-List] Keywords Generator

Mon Feb 18 18:35:02 UTC 2008

Shakir,

I am pretty sure this is not the right forum for supporting a quickly
hacked Perl script, and any diagnosis of what went wrong would be too
speculative given the bug report.
However, maybe I should have been clearer in the usage instructions:

So basically what you need to do is check the input format. Your wordlist
seems to be OK (from what I can see in your output sample). The corpora need
to be plain text, one text file each (again, see the example input for
inspiration).

I tested the script with a wordlist of nouns extracted from the BNC
frequency list, with the Europarl corpus (en, with no tags) as corpusA and a
collection of Charles Dickens novels (from Gutenberg) as corpusB.

Again, both corpora must be plain text files (I was hoping the provided
example was sufficiently illustrative). Size should not be a problem: I was
able to process Europarl (28m tokens) -- *AND* Charles Dickens ;-) -- and it
only took a couple of seconds to produce the desired output.

Perhaps it would have been better to come up with a unix-shell pipe example,
so that you can see how to do "stuff" quickly yourself, and to provide
references so that you are not lost and can educate yourself when you reach
the limitations of the unix shell one-liner. The pointers given so far are
excellent resources to get your hands dirty quickly, really without having to
learn everything about programming.
More helpful resources include the Perl man pages ('man perl' or 'man
perlintro') in the unix shell; hopefully your system administrator has them
installed for you.
I can do a bit more documentation on the script, but I suggest we handle that
in private communication.
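
For readers who want to see the procedure discussed in this thread spelled
out (take each word of the wordlist, count it in corpus A and in corpus B
after lower-casing and punctuation removal, print one line per word), here is
a minimal sketch in Python. It is not the attached Perl script; all function
names, the demo sentences and the column widths are assumptions made for
illustration only.

```python
import re
from collections import Counter
from pathlib import Path


def tokenize(text):
    """Lower-case the text and keep runs of word characters only.

    Punctuation acts as a separator, so "court's" becomes "court" and "s".
    """
    return re.findall(r"\w+", text.lower())


def load_wordlist(path):
    """Read one word per line; blank lines and surrounding whitespace are dropped."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip().lower() for line in lines if line.strip()]


def frequency_rows(words, corpus_a_text, corpus_b_text):
    """Return (word, count in A, count in B) for every word in the list."""
    counts_a = Counter(tokenize(corpus_a_text))
    counts_b = Counter(tokenize(corpus_b_text))
    # A Counter returns 0 for words that never occur in the corpus.
    return [(w, counts_a[w], counts_b[w]) for w in words]


def format_rows(rows):
    """Right-align the word and both counts in fixed-width columns."""
    return "\n".join(f"{w:>20} {a:5d} {b:5d}" for w, a, b in rows)


# Tiny in-memory demo (made-up sentences, not the corpora from this thread).
demo_words = ["court", "law", "sleep"]
demo_a = "The Court adjourned. The court's order cited the law."
demo_b = "Law and order: the court sat late."
print(format_rows(frequency_rows(demo_words, demo_a, demo_b)))
```

With real data one would pass the result of load_wordlist("wordlist.txt")
together with the full text of each corpus file (for example read with
Path("corpusA.txt").read_text(encoding="utf-8")); those file names are
placeholders. Counting each corpus once and looking words up afterwards keeps
the run time independent of the wordlist length.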

Now, I hope there won't be much need to continue this thread.
Sorry, but vanity is my favourite sin ;-)

Kind regards,
Alex

On Feb 18, 2008 4:56 PM, True Friend <true.friend2004 at gmail.com> wrote:

> Hi Sir
> Tried your script but ........ it has some problems. Probably the large
> size of the txt files was the reason. Corpus A was about 1.9 million and
> corpus B was almost the same size as A. It generated only "0"s for each
> word. Another possible cause was the big size of the wordlist (1000 words).
> A glimpse of the result:
>                votes     0     0
>              whereas     0     0
>              whereby     0     0
>              wherein     0     0
>              without     0     0
>              witness     0     0
>            witnesses     0     0
>                wound     0     0
>                 writ     0     0
>              written     0     0
>                 zila     0     0
>                 zina     0     0
>                court     0     0
> When tried with a small wordlist it gave counts for only one word (the last
> one, *court*); please see the result.
>                judge     0     0
>             judgment     0     0
>                 land     0     0
>                  law     0     0
>              learned     0     0
>                order     0     0
>            ordinance     0     0
>               person     0     0
>             petition     0     0
>           petitioner     0     0
>               police     0     0
>               record     0     0
>           respondent     0     0
>              section     0     0
>                 suit     0     0
>                trial     0     0
>                court   718  11128
> The procedure I had in mind was: grab the word, find its frequency in
> Corpus A and then in Corpus B, and then print it. I could not understand
> the code (not a programmer yet :D); anyhow, something is wrong. So can you
> spare some more time for it?
> Thanks a lot for your effort to write this script.
> Regards
> M Shakir
> Pakistan
>
> On Feb 18, 2008 5:34 PM, Alexander Schutz <
> goalscoringsuperstarhero at gmail.com> wrote:
>
> > Hi Shakir,
> >
> > as part of a little exercise I wrote a tiny Perl script that does what you
> > asked.
> > It takes as parameters the wordlist, corpus_A and corpus_B (each as a text
> > file) and produces as output the respective frequencies in each corpus:
> > alesch@nbgal141:~$ perl wordlist_corpus_freq.pl wordlist.txt vbush.txt How2DoResearchMIT.txt
> >                color     1     0
> >               colour     0     0
> >            furiously     0     0
> >                green     0     0
> >                 idea     7    22
> >                sleep     0     0
> >
> > It does some normalisation on the corpora, like conversion to lower case
> > and punctuation removal.
> >
> > Please find it attached to this email, including the sample wordlist.
> >
> > Hth,
> > Alex
> >
> >
> >
> > On Feb 18, 2008 10:53 AM, True Friend <true.friend2004 at gmail.com> wrote:
> >
> > > Hi Folks
> > > I need a program/script (even for *nix) that can provide the frequency
> > > of each word in a wordlist in two corpora. I made this list by comparing
> > > two word lists, one from general English (specifically of Pakistani origin)
> > > and one from law English (also of Pakistani origin). I now want to present
> > > these keywords with their frequencies in both corpora as proof that these
> > > words are more frequent in law. The keywords were generated by AntConc.
> > > Is there any script/tool that can generate a parallel list of
> > > frequencies of each word in both corpora?
> > > Regards
> > > M Shakir Aziz
> > > A Corpus Linguistics Student
> > > Pakistan
> > >
> > > --
> > > محمد شاکر عزیز
> > > _______________________________________________
> > > Corpora mailing list
> > > Corpora at uib.no
> > > http://mailman.uib.no/listinfo/corpora
> > >
> > >
> >
> >
> > --
> > Alexander Schutz,
> > Digital Enterprise Research Institute,
> > Ollscoil na hÉireann, Gaillimh
> > Galway, Ireland
>
>
>
>
> --
> محمد شاکر عزیز

-- 
Alexander Schutz,
Digital Enterprise Research Institute,
Ollscoil na hÉireann, Gaillimh
Galway, Ireland