[Corpora-List] New Ngram package in Perl
Vlado Keselj
vlado at cs.dal.ca
Fri Jun 6 20:17:30 UTC 2003
Text::Ngrams, a new Perl package for n-gram analysis, is now
available at:
http://www.cs.dal.ca/~vlado/srcperl/Ngrams
and it will soon be indexed by CPAN (www.cpan.org).
It is a small and flexible piece of code that comes with a script
ngrams.pl for direct processing of files.
I am aware that this is `yet another' n-gram package, but it is novel in
some ways. References to other packages are included.
The man pages for the script and the module are included below.
Vlado
-------
SYNOPSIS
ngrams.pl [--version] [--help] [--n=3] [--type=character] [--orderbyfrequency] [input files]
DESCRIPTION
This script writes n-gram tables of the input files to
the standard output.
Options:
--version
Prints version.
--help
Prints help.
--n=NUMBER
N-gram size, produces 3-grams by default.
--type=character|byte|word
Type of n-grams produced. See the Text::Ngrams module.
--orderbyfrequency
By default, the n-grams are ordered lexicographically.
If this option is specified, then they are
ordered by frequency in descending order.
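The options above could be parsed with Getopt::Long (listed under PREREQUISITES) roughly as follows. This is a minimal sketch, not the script's actual source: the helper name parse_ngrams_options and the validation of --type are assumptions made for illustration.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper (not taken from ngrams.pl): parse the documented
# options out of an argument list. Whatever is left in @$args afterwards
# is treated as the list of input files.
sub parse_ngrams_options {
    my ($args) = @_;
    # Defaults as documented: 3-grams of type "character", lexicographic order.
    my %opt = (n => 3, type => 'character', orderbyfrequency => 0);
    GetOptionsFromArray(
        $args,
        'version'          => \$opt{version},
        'help'             => \$opt{help},
        'n=i'              => \$opt{n},
        'type=s'           => \$opt{type},
        'orderbyfrequency' => \$opt{orderbyfrequency},
    ) or die "Invalid options\n";
    # Assumption: reject types other than the three documented ones.
    die "Unknown type '$opt{type}'\n"
        unless $opt{type} =~ /^(?:character|byte|word)$/;
    return \%opt;
}

my @args = ('--n=2', '--type=word', '--orderbyfrequency', 'a.txt', 'b.txt');
my $opt  = parse_ngrams_options(\@args);
print "n=$opt->{n} type=$opt->{type} files=@args\n";
```

After the call, @args holds only the file names, which could then be handed to the module's process_files method.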
PREREQUISITES
Text::Ngrams, Getopt::Long
SCRIPT CATEGORIES
Text::Statistics
SEE ALSO
Text::Ngrams module.
COPYRIGHT
Copyright 2003 Vlado Keselj http://www.cs.dal.ca/~vlado
This module is provided "as is" without expressed or
implied warranty. This is free software; you can redistribute
it and/or modify it under the same terms as Perl
itself.
The latest version can be found at
http://www.cs.dal.ca/~vlado/srcperl/.
------------------------------------------------------------------------
NAME
Text::Ngrams - Flexible Ngram analysis (for characters,
words, and more)
SYNOPSIS
For default character n-gram analysis of a string:
use Text::Ngrams;
my $ng3 = Text::Ngrams->new;
$ng3->process_text('abcdefg1235678hijklmnop');
print $ng3->to_string;
One can also feed tokens manually:
use Text::Ngrams;
my $ng3 = Text::Ngrams->new;
$ng3->feed_tokens('a');
$ng3->feed_tokens('b');
$ng3->feed_tokens('c');
$ng3->feed_tokens('d');
$ng3->feed_tokens('e');
$ng3->feed_tokens('f');
$ng3->feed_tokens('g');
$ng3->feed_tokens('h');
We can choose n-grams of various sizes, e.g.:
my $ng = Text::Ngrams->new( windowsize => 6 );
or different types of n-grams, e.g.:
my $ng = Text::Ngrams->new( type => 'byte' );
my $ng = Text::Ngrams->new( type => 'word' );
DESCRIPTION
This module implements text n-gram analysis, supporting
several types of analysis, including character and word
n-grams.
The module Text::Ngrams is very flexible. For example, it
allows a user to manually feed a sequence of any tokens.
It handles several types of tokens (character, word), and
also allows a lot of flexibility in automatic recognition
and feeding of tokens and in the way they are combined into an
n-gram. It counts all n-gram frequencies up to the maximal
specified length. The output format is meant to be pretty
much human-readable, while also loadable by the module.
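To make "all n-gram frequencies up to the maximal specified length" concrete, here is a small standalone sketch in plain Perl. It does not use or reproduce the module's code; the function name count_ngrams and its argument order are assumptions chosen for this illustration.

```perl
use strict;
use warnings;

# Illustrative only: count every n-gram of size 1..$max in a token list.
# Returns an array ref whose element $n is a hash ref (n-gram => frequency).
# $sep is the string placed between tokens when an n-gram is formed.
sub count_ngrams {
    my ($max, $sep, @tokens) = @_;
    my @table = map { +{} } 0 .. $max;
    for my $i (0 .. $#tokens) {
        for my $n (1 .. $max) {
            last if $i + $n > @tokens;    # window would run past the end
            my $gram = join $sep, @tokens[$i .. $i + $n - 1];
            $table[$n]{$gram}++;
        }
    }
    return \@table;
}

# Eight single-letter tokens, window size 3, empty separator.
my $t = count_ngrams(3, '', 'a' .. 'h');
for my $n (1 .. 3) {
    my $total = 0;
    $total += $_ for values %{ $t->[$n] };
    print "$n-GRAMS (total count: $total)\n";
}
```

A sequence of k tokens contains k - n + 1 n-grams of size n, which is why the totals decrease by one as n grows.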
The module can be used from the command line through the
script ngrams.pl provided with the package.
OUTPUT FORMAT
The output looks like this:
BEGIN OUTPUT BY Text::Ngrams version 0.01
1-GRAMS (total count: 8)
------------------------
a 1
b 1
c 1
d 1
e 1
f 1
g 1
h 1
2-GRAMS (total count: 7)
------------------------
ab 1
bc 1
cd 1
de 1
ef 1
fg 1
gh 1
3-GRAMS (total count: 6)
------------------------
abc 1
bcd 1
cde 1
def 1
efg 1
fgh 1
END OUTPUT BY Text::Ngrams
N-grams are encoded using encode_S
(www.cs.dal.ca/~vlado/srcperl/snip/encode_S), so that they
can always be recognized as \S+. For example, for word
n-grams, space is replaced by underscore (_):
BEGIN OUTPUT BY Text::Ngrams version 0.01
1-GRAMS (total count: 8)
------------------------
The 1
brown 3
fox 3
quick 1
2-GRAMS (total count: 7)
------------------------
The_brown 1
brown_fox 2
brown_quick 1
fox_brown 2
quick_fox 1
END OUTPUT BY Text::Ngrams
Or, in case of byte type of processing:
BEGIN OUTPUT BY Text::Ngrams version 0.01
1-GRAMS (total count: 55)
-------------------------
\t 3
\n 3
_ 12
, 2
. 3
T 1
b 3
c 1
... etc
2-GRAMS (total count: 54)
-------------------------
\t_ 1
\tT 1
\tb 1
\n\t 2
__ 5
_. 1
_b 2
_f 3
_q 1
,\n 2
.\n 1
.. 2
Th 1
br 3
ck 1
e_ 1
... etc
END OUTPUT BY Text::Ngrams
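The encoding idea can be sketched as follows. This is not the actual encode_S routine: the name encode_ngram is hypothetical, and the treatment of literal underscores, backslashes and other whitespace is an assumption. Only the mappings of space to "_", tab to "\t" and newline to "\n" are taken from the sample output above.

```perl
use strict;
use warnings;

# Hypothetical encoder in the spirit of encode_S (NOT the real routine).
# Goal: the result contains no whitespace, so it always matches \S+.
my %ESC = (
    ' '  => '_',      # as in the sample output
    "\t" => '\\t',    # as in the sample output
    "\n" => '\\n',    # as in the sample output
    '_'  => '\\_',    # assumption: keep a literal underscore distinguishable
    '\\' => '\\\\',   # assumption: keep a literal backslash distinguishable
);

sub encode_ngram {
    my ($s) = @_;
    $s =~ s/([\\_ \t\n])/$ESC{$1}/g;
    # Assumption: any remaining whitespace becomes a hexadecimal escape.
    $s =~ s/(\s)/sprintf('\\x%02X', ord $1)/ge;
    return $s;
}

print encode_ngram("The brown"), "\n";
print encode_ngram(",\n"), "\n";
```

Escaping the characters that are used as replacements keeps the mapping unambiguous, so that an encoded table could in principle be decoded again.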
METHODS
new ( windowsize => POS_INTEGER, type => character|byte|word )
my $ng = Text::Ngrams->new;
my $ng = Text::Ngrams->new( windowsize=>10 );
my $ng = Text::Ngrams->new( type=>'word' );
and similar.
Creates a new "Text::Ngrams" object and returns it.
Parameters:
windowsize
n-gram size (i.e., `n' itself). Default is 3 if not
given. It is stored in $object->{windowsize}.
type
Specifies a predefined type of n-grams:
character (default)
Default character n-grams: Read letters; sequences
of all other characters are replaced by a space, and
letters are turned uppercase.
byte
Raw character n-grams: Don't ignore any bytes and
don't pre-process them.
word
Default word n-grams: One token is a word consisting
of letters; numbers (digits and decimal digits) are
replaced by <NUMBER>, and everything else is
ignored. A space is inserted when n-grams are
formed.
One can also modify a type, creating one's own type, by
fine-tuning several parameters (they can be undefined):
$o->{tokenseparator} - string inserted
between tokens in an n-gram (for characters it is empty,
and for words it is a space).
$o->{skiprex} - regular expression for ignoring stuff
between tokens.
$o->{tokenrex} - regular expression for recognizing a
token.
$o->{processtoken} - routine for token preprocessing.
Token is given and returned in $_.
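The interplay of these parameters can be pictured with a small standalone tokenizer. This is a sketch under stated assumptions, not the module's implementation: the function tokenize, the sample patterns and the sample text are all invented for illustration, and the patterns are assumed not to match the empty string.

```perl
use strict;
use warnings;

# Hypothetical tokenizer driven by skiprex, tokenrex and processtoken.
# $o is a plain hash ref here, not a Text::Ngrams object.
sub tokenize {
    my ($o, $text) = @_;
    my ($skip, $tok, $proc) = @{$o}{qw(skiprex tokenrex processtoken)};
    my @tokens;
    pos($text) = 0;
    while (pos($text) < length $text) {
        my $start = pos($text);
        # Drop ignorable material between tokens, if a skip pattern is set.
        $text =~ /\G$skip/gc if defined $skip;
        if ($text =~ /\G($tok)/gc && length $1) {
            local $_ = $1;                 # token is given and returned in $_
            $proc->() if defined $proc;
            push @tokens, $_;
        }
        elsif (pos($text) == $start) {
            pos($text) = $start + 1;       # nothing matched: skip one character
        }
    }
    return @tokens;
}

# Settings loosely modelled on the description of the "word" type.
my %word_like = (
    tokenrex     => qr/[A-Za-z]+|\d+(?:\.\d+)?/,
    skiprex      => qr/[^A-Za-z0-9]+/,
    processtoken => sub { $_ = '<NUMBER>' if /^\d/ },
);
my @tokens = tokenize(\%word_like, "The 3 quick foxes, 2.5 km away.");
print join(' ', @tokens), "\n";
```

With these sample settings the numbers 3 and 2.5 both come out as <NUMBER>, and punctuation is dropped.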
feed_tokens ( list of tokens )
This function manually supplies tokens.
process_text ( list of strings )
Process text, i.e., break each string into tokens and feed
them.
process_files ( file_names or file_handle_references)
Process files, similarly to text. The files are processed
line by line, so there should not be any multi-line
tokens.
to_string ( orderby => frequency )
Produce string representation of the n-gram tables. If
parameter 'orderby=>frequency' is specified, each table
is ordered by decreasing frequency.
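The two orderings can be illustrated on a plain hash of counts. The sketch below is independent of the module; the function ordered_ngrams is hypothetical, and breaking frequency ties lexicographically is an assumption, since the text above does not say how ties are resolved.

```perl
use strict;
use warnings;

# Illustrative only: return the n-grams of one table in the requested order.
sub ordered_ngrams {
    my ($counts, $orderby) = @_;
    if (defined $orderby && $orderby eq 'frequency') {
        # Decreasing frequency; ties broken lexicographically (assumption).
        return sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b }
               keys %$counts;
    }
    return sort keys %$counts;    # default: lexicographic order
}

my %bigrams = (
    The_brown   => 1,
    brown_fox   => 2,
    brown_quick => 1,
    fox_brown   => 2,
    quick_fox   => 1,
);
print "$_\t$bigrams{$_}\n" for ordered_ngrams(\%bigrams, 'frequency');
```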
HISTORY AND RELATED WORK
This code originated in my "monkeys and rhinos" project in
2000, and is related to an authorship attribution project.
Some of the similar projects are (URLs can be found at my
site):
Ngram Statistics Package in Perl, by T. Pedersen et al.
This is a package that includes a script for word n-
grams.
Text::Ngram Perl Package by Simon Cozens
This is a similar package for character n-grams. As
an XS-implementation it is supposed to be very efficient.
Perl script ngram.pl by Jarkko Hietaniemi
This is a script for analyzing character n-grams.
Waterloo Statistical N-Gram Language Modeling Toolkit, in
C++ by Fuchun Peng
An n-gram language modeling package written in C++.
BUGS AND LIMITATIONS
If a user customizes a type, it is possible that a resulting
n-gram will be ambiguous. In this way, two different
n-grams may be counted as one. With predefined types of
n-grams, this should not happen.
For example, if a user chooses that a token can contain a
space, and uses space as an n-gram separator, then a trigram
like this "x x x x" is ambiguous.
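The ambiguity is easy to reproduce in plain Perl, without the module. In this small illustration, three different token sequences collapse into the same string once they are joined with a space:

```perl
use strict;
use warnings;

# Three distinct trigrams whose tokens may themselves contain a space.
my $sep      = ' ';
my @trigrams = (
    [ 'x x', 'x',   'x'   ],
    [ 'x',   'x x', 'x'   ],
    [ 'x',   'x',   'x x' ],
);

# Joining with the same character that tokens may contain loses the
# token boundaries, so all three end up under one key.
my %seen;
$seen{ join $sep, @$_ }++ for @trigrams;
print "'$_' counted $seen{$_} times\n" for sort keys %seen;
```

Choosing a separator that cannot occur inside a token avoids the problem.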
AUTHOR
Copyright 2003 Vlado Keselj www.cs.dal.ca/~vlado
This module is provided "as is" without expressed or
implied warranty. This is free software; you can redistribute
it and/or modify it under the same terms as Perl
itself.
The latest version can be found at
http://www.cs.dal.ca/~vlado/srcperl/.
SEE ALSO
Ngram Statistics Package in Perl, by T. Pedersen et al.,
Waterloo Statistical N-Gram Language Modeling Toolkit in
C++ by Fuchun Peng, Perl script ngram.pl by Jarkko
Hietaniemi, Simon Cozens's Text::Ngram module in CPAN.
The links should be available at
http://www.cs.dal.ca/~vlado/nlp.