[Corpora-List] New Ngram package in Perl

Vlado Keselj vlado at cs.dal.ca
Fri Jun 6 20:17:30 UTC 2003


Text::Ngrams, a new Perl package for n-gram analysis, is now
available at:

  http://www.cs.dal.ca/~vlado/srcperl/Ngrams

and it will soon be indexed by CPAN (www.cpan.org).

It is a small and flexible piece of code that comes with a script
ngrams.pl for direct processing of files.

I am aware that this is `yet another' n-gram package, but it is novel in
some ways.  References to other packages are included.

The man pages for the script and the module are included below.

Vlado

-------

SYNOPSIS
         ngram [--version] [--help] [--n=3] [--type=character] [--orderbyfrequency] [input files]

DESCRIPTION
       This script writes n-gram tables of the input files to
       standard output.

       Options:

       --version
              Prints version.

       --help Prints help.

       --n=NUMBER
              N-gram size, produces 3-grams by default.

       --type=character|byte|word
              Type of n-grams produced. See Text::Ngrams module.

       --orderbyfrequency
              By default, the n-grams are ordered
              lexicographically.  If this option is specified, then
              they are ordered by frequency in descending order.
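
       The two orderings can be illustrated with a small sketch in plain
       Perl. The counts are made up for illustration, and breaking ties
       lexicographically is an assumption; the script's own tie-breaking
       behavior is not stated here.

```perl
use strict;
use warnings;

# Hypothetical n-gram counts; in practice these come from the input files.
my %count = (abc => 2, bcd => 5, cde => 2, abd => 1);

# Default ordering: lexicographic by n-gram.
my @lexicographic = sort keys %count;

# Ordering in the spirit of --orderbyfrequency: descending count,
# with ties broken lexicographically (the tie-break is an assumption).
my @by_frequency = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

print "lexicographic: @lexicographic\n";
print "by frequency:  @by_frequency\n";
```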

PREREQUISITES
       Text::Ngrams, Getopt::Long

SCRIPT CATEGORIES
       Text::Statistics

SEE ALSO
       Text::Ngrams module.

COPYRIGHT
       Copyright 2003 Vlado Keselj http://www.cs.dal.ca/~vlado

       This module is provided "as is" without expressed or
       implied warranty.  This is free software; you can
       redistribute it and/or modify it under the same terms as
       Perl itself.

       The latest version can be found at
       http://www.cs.dal.ca/~vlado/srcperl/.

------------------------------------------------------------------------

NAME
       Text::Ngrams - Flexible Ngram analysis (for characters,
       words, and more)

SYNOPSIS
       For default character n-gram analysis of a string:

         use Text::Ngrams;
         my $ng3 = Text::Ngrams->new;
         $ng3->process_text('abcdefg1235678hijklmnop');
         print $ng3->to_string;

       One can also feed tokens manually:

         use Text::Ngrams;
         my $ng3 = Text::Ngrams->new;
         $ng3->feed_tokens('a');
         $ng3->feed_tokens('b');
         $ng3->feed_tokens('c');
         $ng3->feed_tokens('d');
         $ng3->feed_tokens('e');
         $ng3->feed_tokens('f');
         $ng3->feed_tokens('g');
         $ng3->feed_tokens('h');

       We can choose n-grams of various sizes, e.g.:

         my $ng = Text::Ngrams->new( windowsize => 6 );

       or different types of n-grams, e.g.:

         my $ng = Text::Ngrams->new( type => 'byte' );
         my $ng = Text::Ngrams->new( type => 'word' );

DESCRIPTION
       This module implements text n-gram analysis, supporting
       several types of analysis, including character and word n-
       grams.

       The module Text::Ngrams is very flexible.  For example, it
       allows a user to manually feed a sequence of any tokens.
       It handles several types of tokens (character, word), and
       also allows a lot of flexibility in automatic recognition
       and feed of tokens and the way they are combined in an n-
       gram.  It counts all n-gram frequencies up to the maximal
       specified length.  The output format is meant to be
       human-readable while remaining loadable by the module.
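
       The counting of all n-grams up to the maximal length can be
       sketched with a sliding window over the token list. This is a
       minimal illustration in plain Perl; count_ngrams is a hypothetical
       helper and is not claimed to be how Text::Ngrams is implemented.

```perl
use strict;
use warnings;

# Count all n-grams of length 1..$n over a token list.
# count_ngrams is a hypothetical helper, not part of Text::Ngrams.
sub count_ngrams {
    my ($n, $sep, @tokens) = @_;
    my @tables;                          # $tables[$k] holds the k-gram counts
    for my $i (0 .. $#tokens) {
        for my $k (1 .. $n) {
            last if $i + $k > @tokens;   # window would run past the end
            my $gram = join $sep, @tokens[$i .. $i + $k - 1];
            $tables[$k]{$gram}++;
        }
    }
    return \@tables;
}

my $tables = count_ngrams(3, '', qw(a b c d e f g h));
for my $k (1 .. 3) {
    my $total = 0;
    $total += $_ for values %{ $tables->[$k] };
    print "$k-GRAMS (total count: $total)\n";
    print "$_\t$tables->[$k]{$_}\n" for sort keys %{ $tables->[$k] };
}
```

       For eight tokens and a maximal length of 3, the window yields 8
       unigrams, 7 bigrams and 6 trigrams.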

       The module can be used from the command line through the
       script ngrams.pl provided with the package.

OUTPUT FORMAT
       The output looks like this:

         BEGIN OUTPUT BY Text::Ngrams version 0.01

         1-GRAMS (total count: 8)
         ------------------------
         a     1
         b     1
         c     1
         d     1
         e     1
         f     1
         g     1
         h     1

         2-GRAMS (total count: 7)
         ------------------------
         ab    1
         bc    1
         cd    1
         de    1
         ef    1
         fg    1
         gh    1

         3-GRAMS (total count: 6)
         ------------------------
         abc   1
         bcd   1
         cde   1
         def   1
         efg   1
         fgh   1

         END OUTPUT BY Text::Ngrams

       N-grams are encoded using encode_S
       (www.cs.dal.ca/~vlado/srcperl/snip/encode_S), so that they
       can always be recognized as \S+.  For example, for word n-
       grams, space is replaced by underscore (_):

         BEGIN OUTPUT BY Text::Ngrams version 0.01

         1-GRAMS (total count: 8)
         ------------------------
         The   1
         brown 3
         fox   3
         quick 1

         2-GRAMS (total count: 7)
         ------------------------
         The_brown     1
         brown_fox     2
         brown_quick   1
         fox_brown     2
         quick_fox     1

         END OUTPUT BY Text::Ngrams

       Or, in case of byte type of processing:

         BEGIN OUTPUT BY Text::Ngrams version 0.01

         1-GRAMS (total count: 55)
         -------------------------
         \t    3
         \n    3
         _     12
         ,     2
         .     3
         T     1
         b     3
         c     1
         ... etc

         2-GRAMS (total count: 54)
         -------------------------
         \t_   1
         \tT   1
         \tb   1
         \n\t  2
         __    5
         _.    1
         _b    2
         _f    3
         _q    1
         ,\n   2
         .\n   1
         ..    2
         Th    1
         br    3
         ck    1
         e_    1
         ... etc

         END OUTPUT BY Text::Ngrams
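
       The idea of encoding an n-gram so that it always matches \S+ can
       be sketched as follows. This is not the actual encode_S routine,
       whose exact rules are not given here; encode_sketch is a
       hypothetical stand-in. Only the space, tab and newline mappings
       follow the examples above; the handling of backslash, underscore
       and other whitespace is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for encode_S; the real rules may differ.
sub encode_sketch {
    my ($s) = @_;
    $s =~ s/([\\_])/\\$1/g;   # assumption: protect literal backslash and underscore
    $s =~ s/ /_/g;            # space becomes underscore, as in the examples above
    $s =~ s/\t/\\t/g;         # tab becomes the two characters \t
    $s =~ s/\n/\\n/g;         # newline becomes the two characters \n
    # assumption: any remaining whitespace is written as a hex escape
    $s =~ s/(\s)/sprintf('\\x%02X', ord $1)/ge;
    return $s;
}

print encode_sketch("brown fox"), "\n";
print encode_sketch("\n\t"), "\n";
```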

METHODS

       new ( windowsize => POS_INTEGER,
             type => character|byte|word )

         my $ng = Text::Ngrams->new;
         my $ng = Text::Ngrams->new( windowsize=>10 );
         my $ng = Text::Ngrams->new( type=>'word' );
         and similar.

       Creates a new "Text::Ngrams" object and returns it.
       Parameters:

       windowsize
           n-gram size (i.e., `n' itself).  Default is 3 if not
           given.  It is stored in $object->{windowsize}.

       type
           Specifies a predefined type of n-grams:

           character (default)
               Default character n-grams: Letters are read and
               turned to uppercase; each sequence of any other
               characters is replaced by a space.

           byte
               Raw character n-grams: Don't ignore any bytes and
               don't pre-process them.

           word
               Default word n-grams: One token is a word
               consisting of letters; sequences of decimal digits
               are replaced by <NUMBER>, and everything else is
               ignored.  A space is inserted between tokens when
               n-grams are formed.

           One can also create a custom type by fine-tuning
           several parameters (any of them can be undefined):

           $o->{tokenseparator} - string inserted between tokens
           in an n-gram (for characters it is empty, and for
           words it is a space).

           $o->{skiprex} - regular expression matching the text
           to be ignored between tokens.

           $o->{tokenrex} - regular expression for recognizing a
           token.

           $o->{processtoken} - routine for token preprocessing.
           Token is given and returned in $_.
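
           How these four parameters might interact can be sketched in
           plain Perl. The regular expressions and the tokenize helper
           below are hypothetical and only imitate a word-like type;
           they are not the module's built-in settings or code.

```perl
use strict;
use warnings;

# Hypothetical word-like settings; the module's built-in values may differ.
my %type = (
    tokenseparator => ' ',
    skiprex        => qr/[^a-zA-Z0-9]+/,
    tokenrex       => qr/[a-zA-Z]+|\d+/,
    processtoken   => sub { s/^\d+$/<NUMBER>/ },   # token arrives and leaves in $_
);

# tokenize is a hypothetical helper, not a Text::Ngrams method.
sub tokenize {
    my ($text, $t) = @_;
    my @tokens;
    pos($text) = 0;
    while (pos($text) < length($text)) {
        next if $text =~ /\G$t->{skiprex}/gc;      # drop text between tokens
        if ($text =~ /\G($t->{tokenrex})/gc) {
            local $_ = $1;
            $t->{processtoken}->() if $t->{processtoken};
            push @tokens, $_;
        }
        else {
            pos($text) = pos($text) + 1;           # neither rule matched: skip one character
        }
    }
    return @tokens;
}

my @tokens = tokenize('The quick fox, 42 times!', \%type);
print join($type{tokenseparator}, @tokens), "\n";
```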

       feed_tokens ( list of tokens )

       This function manually supplies tokens.

       process_text ( list of strings )

       Process text, i.e., break each string into tokens and feed
       them.

       process_files ( file_names or file_handle_references)

       Process files, similarly to text.  The files are processed
       line by line, so there should not be any multi-line
       tokens.

       to_string ( orderby => frequency )

       Produce string representation of the n-gram tables.  If
       parameter 'orderby=>frequency' is specified, each table
       is ordered by decreasing frequency.

HISTORY AND RELATED WORK
       This code originated in my "monkeys and rhinos" project in
       2000, and is related to an authorship attribution project.
       Some of the similar projects are (URLs can be found at my
       site):

       Ngram Statistics Package in Perl, by T. Pedersen et al.
           This is a package that includes a script for word n-
           grams.

       Text::Ngram Perl Package by Simon Cozens
           This is a similar package for character n-grams.  As
           an XS-implementation it is supposed to be very
           efficient.

       Perl script ngram.pl by Jarkko Hietaniemi
           This is a script for analyzing character n-grams.

       Waterloo Statistical N-Gram Language Modeling Toolkit, in
           C++ by Fuchun Peng
           An n-gram language modeling package written in C++.

BUGS AND LIMITATIONS
       If a user customizes a type, it is possible that a
       resulting n-gram will be ambiguous.  In this way, two
       different n-grams may be counted as one.  With predefined
       types of n-grams, this should not happen.

       For example, if a user allows a token to contain a space
       and also uses a space as the n-gram separator, then a
       trigram such as "x x x x" is ambiguous.
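
       The collision can be reproduced with a few lines of plain Perl.
       The token lists are made up for illustration; any pair of token
       sequences that join to the same string would behave the same way.

```perl
use strict;
use warnings;

my $sep = ' ';   # n-gram separator chosen by the user

# Two different trigrams whose tokens may themselves contain a space.
my @first  = ('x x', 'x', 'x');
my @second = ('x', 'x', 'x x');

my $first_key  = join $sep, @first;
my $second_key = join $sep, @second;

print "first:  '$first_key'\n";
print "second: '$second_key'\n";
print "Both trigrams are counted under the same key.\n"
    if $first_key eq $second_key;
```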

AUTHOR
       Copyright 2003 Vlado Keselj www.cs.dal.ca/~vlado

       This module is provided "as is" without expressed or
       implied warranty.  This is free software; you can
       redistribute it and/or modify it under the same terms as
       Perl itself.

       The latest version can be found at
       http://www.cs.dal.ca/~vlado/srcperl/.

SEE ALSO
       Ngram Statistics Package in Perl, by T. Pedersen et al.,
       Waterloo Statistical N-Gram Language Modeling Toolkit in
       C++ by Fuchun Peng, Perl script ngram.pl by Jarkko
       Hietaniemi, Simon Cozens's Text::Ngram module in CPAN.

       The links should be available at
       http://www.cs.dal.ca/~vlado/nlp.
