[Corpora-List] Cleaning text to take word frequency

Alexandre Rafalovitch arafalov at gmail.com
Sun Jun 1 13:30:03 UTC 2008


The way I would approach this is to find which words produce count
discrepancies, or which appear in one version of the result but not
the other. Then I would look for those words in the text and see what
context they appear in.
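To make that first step concrete, here is a minimal sketch (in Python rather than C#, so treat it as illustrative only) of diffing two word-frequency results to surface exactly the words whose counts disagree:

```python
from collections import Counter

def count_discrepancies(freqs_a, freqs_b):
    # Return every word whose count differs between the two results,
    # including words present in only one of them (count 0 in the other).
    all_words = set(freqs_a) | set(freqs_b)
    return {w: (freqs_a.get(w, 0), freqs_b.get(w, 0))
            for w in all_words
            if freqs_a.get(w, 0) != freqs_b.get(w, 0)}

diff = count_discrepancies(Counter("a b b c".split()),
                           Counter("a b c c".split()))
# diff → {'b': (2, 1), 'c': (1, 2)}
```

Each key in the result is a word to go hunting for in the original text.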

What I suspect you will find is that your partial reimplementation of
Perl's [:punct:] character class is causing problems. I would either do
a complete reimplementation of it (see:
http://en.wikipedia.org/wiki/Regular_expression ) or look into C#'s
regular expressions, which support Unicode character classes such as
\p{P} for punctuation.

Finally, if you are working with languages other than English, you
should certainly look into regular expression libraries. They take
Unicode's rules into account as well, which is something you really
don't want to have to duplicate in your own code.

Regards,
    Alex.

-- 
Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/


On Sun, Jun 1, 2008 at 7:07 AM, True Friend <true.friend2004 at gmail.com> wrote:
>
> Hi,
> I am a corpus linguistics student, and I am learning C# for this purpose as
> well. I've created a simple application to find the frequency of a given
> word in two files. This simple application is actually a C# practice
> version of a Perl script that a respected subscriber of this list
> (Alexander Schutz) wrote for me, at my request, on this list. I needed it
> then; now I am trying to program myself, so I tried to implement that idea
> in C#.

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
