[Corpora-List] Query about nomenclature
Damon Allen Davison
allolex at gmail.com
Wed Mar 9 21:46:30 UTC 2005
Dear John,
Here are some rather unscientific results. My corpus was a single page
of Google results (limited to 100 hits) for the search term "n gram".
Searching for both "ngram" and "n gram" was slightly problematic
because there is a Perl CPAN module called Text::Ngram, which skews
the results for "ngram" quite a bit.
n-gram : 128 times
N-gram : 126 times
ngram : 57 times
N-Gram : 34 times
Ngram : 10 times
N-GRAM : 9 times
NGRAM : 8 times
n-Gram : 7 times
NGram : 5 times
I got these counts with the Perl script below, after running "links
--dump results.html > results.txt" on the results page I had saved.
#!/usr/bin/perl
# syntax: findword <filename>
use warnings;
use strict;
my %total;
my @matches;
while ( <> ) {
# case-insensitive, case-preserving matching, dash optional;
# /g collects every match on the line, not just the first
@matches = /(n-?gram)/gi;
$total{$_}++ foreach @matches;
}
print map { "$_ : $total{$_} times\n" } reverse sort { $total{$a} <=>
$total{$b} } keys %total;
Anyway, I hope that helps a little. You can use the same script to do
searches on other files. :)
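
To reuse the idea for other search terms, the counting loop can be
wrapped in a small subroutine that takes the pattern as an argument.
This is only a minimal sketch: the name count_variants and the sample
lines are made up for illustration and are not part of the script
above. It also assumes the pattern contains no capture groups of its
own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): count every
# case-preserved spelling matching $pattern in the given lines.
# Assumes $pattern has no capture groups of its own.
sub count_variants {
    my ($pattern, @lines) = @_;
    my %total;
    for my $line (@lines) {
        # In list context, /g returns every match on the line.
        $total{$_}++ for $line =~ /($pattern)/gi;
    }
    return %total;
}

# Made-up sample input, only to show the calling convention.
my @sample = (
    "An n-gram model and an N-gram model",
    "Text::Ngram builds ngram tables",
);
my %count = count_variants('n-?gram', @sample);

# Most frequent spelling first; ties are broken alphabetically.
print "$_ : $count{$_} times\n"
    for sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
```

To run it on a real file, read the lines first (for example with
"my @lines = <>;") and pass them in place of @sample.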
I like to use "n-gram".
Warm regards,
Damon
--
Damon Allen Davison
http://allolex.net