[Corpora-List] summary: free sentencizers ; test different sentencizers with cgi script

Shlomo Yona shlomo at cs.haifa.ac.il
Mon Mar 10 09:27:13 UTC 2003


On Mon, 10 Mar 2003, Joerg Schuster wrote:

> I think one of the disadvantages of your program is that it stores
> all data in main memory. You have to say something like
>
>  my $sentences=get_sentences($in);
>
> Though this is very convenient when dealing with small files, I would
> rather say something like
>
> while(<>) {
> 	  print_sentences;
> }
>
> Then huge files could easily be sentencized, too.

The thing is that some of the decisions are made globally.
Of course, the program does not need more than a reasonable
window of text to make good decisions, but the size of that
window is something the user should decide (according
to the data available).

Given a huge file, you can first chop it into smaller chunks
(and you have the freedom to decide how to do that) and then
feed the chunks to the Lingua::EN::Sentence module one at a time.

Taking input one line at a time will, in most cases, defeat the
effort to determine the proper locations of sentence boundaries,
since a sentence can span several lines.
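
The chunking idea above might be sketched roughly as follows. This is
only an illustration, written in Python rather than Perl, and it does
not call Lingua::EN::Sentence. The names paragraph_chunks,
sentencize_stream and naive_split are hypothetical, and so is the
assumption that blank lines are safe places to cut (that is, that no
sentence continues across a paragraph break in your data).

```python
import io
from typing import Callable, Iterable, Iterator, List


def paragraph_chunks(lines: Iterable[str], max_chars: int = 100_000) -> Iterator[str]:
    """Yield chunks of text that end only at blank lines.

    Assumption: a blank line is a paragraph break, and no sentence
    crosses it. Pick a different cut point if your data differs.
    """
    buf: List[str] = []
    size = 0
    for line in lines:
        buf.append(line)
        size += len(line)
        # Cut only at a blank line, and only once the buffer is big enough.
        if size >= max_chars and not line.strip():
            yield "".join(buf)
            buf, size = [], 0
    if buf:
        yield "".join(buf)  # whatever remains at end of input


def sentencize_stream(
    lines: Iterable[str],
    split: Callable[[str], List[str]],
    max_chars: int = 100_000,
) -> Iterator[str]:
    """Apply a sentence splitter to one chunk at a time."""
    for chunk in paragraph_chunks(lines, max_chars):
        yield from split(chunk)


def naive_split(chunk: str) -> List[str]:
    """Toy stand-in for a real sentencizer: one 'sentence' per non-empty line."""
    return [ln.strip() for ln in chunk.splitlines() if ln.strip()]


if __name__ == "__main__":
    text = io.StringIO("First paragraph.\nStill first.\n\nSecond paragraph.\n")
    for sentence in sentencize_stream(text, naive_split, max_chars=10):
        print(sentence)
```

Only one chunk is held in memory at a time, while the splitter still
sees a window of several lines. The window size (max_chars here) is
left to the caller.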


--
Shlomo Yona
shlomo at cs.haifa.ac.il
http://cs.haifa.ac.il/~shlomo/


