[Corpora-List] AntConc 3.2.2 released for Windows and Mac OS X

Laurence Anthony anthony0122 at gmail.com
Wed Apr 13 14:32:45 UTC 2011


Dear Michal and all,

I'll reply about the two issues separately.

>I have two issues that haven't been either noticed or perhaps required in Antconc.
>The first is that Antconc does not read files with file names containing 2-byte characters (even after changing the encoding in Global Settings). Since you work in Japan, didn't you have any problems with that?

The problem with developing AntConc as a multi-language program is
that I have to deal with the horrible character encoding issues on
Windows systems. Basically, all (pre Win 7?) Windows systems had their
own legacy encodings, which varied from country to country. So, even
if you have a file saved as UTF-8, the file *name* is saved in the
legacy encoding. AntConc only offers one encoding setting, and assumes
that the file contents *and* the filename use the same encoding. But,
this will cause problems as you have noticed. The files will still
open, but the filename will just become jumbled in the display.
(Actually, I would recommend that everyone stick with ASCII filenames
regardless of the system they use.)
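
To illustrate the mismatch described above, here is a minimal sketch
in Python (not AntConc's actual Perl code). The function name
decode_entry, the sample filename, and the sample text are all
hypothetical; the sketch only assumes a filename stored in Shift-JIS
alongside file contents stored in UTF-8.

```python
def decode_entry(raw_name: bytes, raw_text: bytes,
                 name_encoding: str, text_encoding: str):
    """Decode a filename and file contents with separate encodings.

    Hypothetical helper: undecodable bytes are replaced with U+FFFD
    instead of raising, which mimics a "jumbled" display.
    """
    name = raw_name.decode(name_encoding, errors="replace")
    text = raw_text.decode(text_encoding, errors="replace")
    return name, text


# Assumed sample data: a Japanese filename stored in the legacy
# encoding, and file contents stored in UTF-8.
original_name = "コーパス.txt"
original_text = "こんにちは"
raw_name = original_name.encode("shift_jis")
raw_text = original_text.encode("utf-8")

# One encoding setting applied to both: the contents decode fine,
# but the filename comes out jumbled.
jumbled_name, text_a = decode_entry(raw_name, raw_text, "utf-8", "utf-8")

# Separate settings for the filename and the contents: both decode.
good_name, text_b = decode_entry(raw_name, raw_text, "shift_jis", "utf-8")

print(repr(jumbled_name))
print(good_name)
```

With a single shared setting, one of the two is always decoded with
the wrong codec; only the ASCII part of the name (".txt") survives
either way, which is one reason ASCII filenames are the safe choice.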

That said, I just tried to get AntConc 3.2.2 to display a Japanese
filename (in Shift-JIS) without success! It opened the file correctly
and displayed the internal UTF-8 text without problem, but when I
selected Shift-JIS, the filename appeared blank. It works properly in
AntConc 3.2.1, so perhaps Perl 5.10 (which I use for programming) is
doing something a little differently. (I'll check and release another
bug fix.)

>The second is calculating ranks of words. I noticed that words that have the same occurrence (hit-rate) have subsequent ranks (which probably comes from alphabetical sorting). This means that if there is 1000 words of only 1 occurrence per each, the word starting with "Aa" will have rank = 1, and word starting with "Zz" will have rank = 1000, although statistically they should be of the same rank.
>Do you consider the above as issues or is it irrelevant in your research?

As Mike Scott says, the Rank column is not a rank of the frequencies,
it's a rank of the word in the sort order. But, I can understand the
issue. Perhaps "Index" or "Sort Rank" would be better. (Thank you for
the kind comment, Mike!)

William Fletcher writes,

> One way to avoid the problem of assigning different ranks to
> items with the same frequency is to use "shared ranks"
> instead, so that all items with the same frequency have the
> same rank.
> Shared rank is the mean of the lowest index (=position in
> list) and the highest index of items with the same frequency.
> In Michal's example all items would have the rank 500.5
> (1 + 1000) / 2
>
> Bill Fletcher

Perhaps this could be added as a separate statistic. Let me think about it.
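
For reference, the shared-rank calculation Bill describes can be
sketched in a few lines of Python. This is only an illustration, not
AntConc code: the function name shared_ranks and the sample word
lists are hypothetical, and the sort order (frequency descending,
then alphabetical) is an assumption about how the word list is
ordered.

```python
from itertools import groupby


def shared_ranks(freqs):
    """Map each word to (sort_index, shared_rank).

    sort_index is the 1-based position in the sorted list (what the
    Rank column currently shows). shared_rank is the mean of the
    lowest and highest sort_index among words of equal frequency.
    """
    # Assumed ordering: most frequent first, ties broken alphabetically.
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    result = {}
    index = 1
    for _, group in groupby(ordered, key=lambda kv: kv[1]):
        words = [word for word, _ in group]
        lowest = index
        highest = index + len(words) - 1
        shared = (lowest + highest) / 2
        for offset, word in enumerate(words):
            result[word] = (lowest + offset, shared)
        index = highest + 1
    return result


# Small made-up example: "a" and "b" tie at positions 2 and 3.
small = shared_ranks({"the": 5, "a": 3, "b": 3, "z": 1})
print(small)

# Michal's case: 1000 words that each occur once.
hapaxes = shared_ranks({f"w{i:04d}": 1 for i in range(1000)})
print(hapaxes["w0000"], hapaxes["w0999"])
```

In the 1000-word case every word keeps its own sort index (1 to 1000)
but all of them share the rank (1 + 1000) / 2 = 500.5.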

Laurence.
