[Corpora-List] Cleaning text to take word frequency

Trevor Jenkins trevor.jenkins at suneidesis.com
Sun Jun 1 12:57:13 UTC 2008


On Sun, 1 Jun 2008, True Friend <true.friend2004 at gmail.com> wrote:

> ... version in C# of a Perl script a respected subscriber of this list
> (Alexander Schutz) ... now I am trying to program myself so I tried to
> implement that idea in C#. I have done that all and it works also but it
> does not give me 100% frequency of the word as the Perl script does.

That is possible. In fact it doesn't surprise me at all.

> ... The resulting string array was cleaned from such characters but I
> couldn't get the 100% result. The frequency of most words are less than
> that of Perl script (which does the same thing). ...

I'm neither a Perl wizard nor a C# tune-smith (I still use Snobol4), but
I'd suspect a major difference in the way the two languages process text.
For my money I'd believe Perl is giving you the more accurate result
because the language itself was designed to process text. I'd further
believe that C# (as Microsoft's attempt to have their own Java) doesn't
deal with character and/or textual data in the same way. What Perl accepts
as text C# may well be ditching. You may be right in citing the
String.Split() method; check very carefully what that method is intended
to do and then compare it with how the similar, but not necessarily
identical, split function in Perl works. Assume absolutely nothing about
the functionality of either language or of functions with the same name.
If in doubt blame C# for the discrepancy.
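
To make the point about similar-looking split functions concrete, here is
a rough sketch in Python (neither Perl nor C#, so it shows the kind of
discrepancy to look for, not what either language actually does). The
sample text and the three function names are made up for illustration.
All three functions "split into words", yet they disagree on the counts.

```python
import re
from collections import Counter

# Hypothetical sample text with a double space, a tab, a newline and punctuation.
SAMPLE = "The cat sat.  The cat,\tthe dog\nand the cat!"


def count_split_on_space(text):
    # Split on a single literal space only: tabs and newlines stay inside
    # tokens, and a double space produces an empty token.
    return Counter(text.lower().split(" "))


def count_split_on_whitespace(text):
    # Split on any run of whitespace: no empty tokens, but punctuation
    # remains attached to the words ("cat," and "cat!" differ from "cat").
    return Counter(text.lower().split())


def count_word_pattern(text):
    # Extract runs of word characters, so punctuation is discarded.
    return Counter(re.findall(r"\w+", text.lower()))


if __name__ == "__main__":
    for counter in (count_split_on_space,
                    count_split_on_whitespace,
                    count_word_pattern):
        counts = counter(SAMPLE)
        print(counter.__name__, "the =", counts["the"], "cat =", counts["cat"])
```

Running each variant over the same small file and comparing the token
lists side by side should show quickly where two implementations part
company.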

Regards, Trevor

<>< Re: deemed!




_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


