[Corpora-List] Sentence boundary detection

Jason Baldridge jbaldrid at mail.utexas.edu
Fri Jul 20 16:21:50 UTC 2007


Hi,

One fairly easy-to-use sentence boundary detector and tokenizer is included
in the OpenNLP toolkit:

http://opennlp.sf.net

It is written in Java, and its sentence detector is basically the same as
Ratnaparkhi's maximum entropy detector. Lots of other tools, including ones
for parsing, tagging, and coreference, are in that package. There are already
trained models available for English. The tools themselves are not language
specific, so if you provide an appropriate training corpus in Spanish, you
can train new models easily enough. (The code is also open source, so you can
modify it to make it more sensitive to properties of another language, such
as its morphology, if you want.)

For other tools, many of which are geared for Spanish NLP, you might also
have a look at FreeLing:

http://garraf.epsevg.upc.es/freeling/

There are certainly many other tools available -- it is actually pretty
straightforward to whip up a detector from scratch. There are some recent
unsupervised approaches for sentence boundary detection too that could be
relevant for you. You might have a look at this article by Tibor Kiss and
Jan Strunk:

http://www.linguistics.ruhr-uni-bochum.de/~strunk/ks2005FINAL.pdf
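
To give a sense of what a from-scratch detector might look like, here is a
minimal rule-based sketch in Python. It is not the OpenNLP or Kiss and Strunk
method, just a baseline heuristic. The names ABBREVIATIONS, CANDIDATE, and
split_sentences are hypothetical, and the abbreviation list is a tiny
illustrative sample rather than a real resource.

```python
import re

# Assumption: a deliberately tiny, illustrative abbreviation list.
# A usable list for Spanish or English would be much longer.
ABBREVIATIONS = {"sr", "sra", "dr", "dra", "ud", "uds", "etc", "mr", "mrs"}

# A candidate boundary is a run of ., ! or ? (optionally followed by
# closing quotes or brackets) and then some whitespace.
CANDIDATE = re.compile(r"[.!?]+[\"')\]»]*\s+")


def split_sentences(text):
    """Split text into sentences using two simple veto rules."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        # Last whitespace-delimited token before the punctuation,
        # lowercased and stripped of leading opening punctuation.
        preceding = text[start:match.start()].split()
        last_token = preceding[-1].lower().lstrip("¿¡(\"'«") if preceding else ""
        next_char = text[match.end():match.end() + 1]

        # Rule 1: a period right after a known abbreviation is not a boundary.
        if match.group().startswith(".") and last_token in ABBREVIATIONS:
            continue
        # Rule 2: if the next word starts in lowercase, keep going.
        if next_char.islower():
            continue

        sentences.append(text[start:match.end()].strip())
        start = match.end()

    # Whatever is left after the last accepted boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    for s in split_sentences("El Sr. García llegó tarde. ¿Dónde estaba? Nadie lo sabe."):
        print(s)
```

Because a candidate boundary requires whitespace after the punctuation,
decimals such as "3.5" are never split. The obvious weakness is the
hand-written abbreviation list, which is exactly the part that trained or
unsupervised approaches try to learn from data.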

Hope that helps!

Jason


On 7/20/07, Kelly Vincent <kpvincent at hotmail.com> wrote:
>
> I am interested in what the current state-of-the-art is in sentence boundary
> detection and (to a lesser degree) tokenization. I have been able to locate
> several articles, but very few that are quite recent. I would appreciate any
> pointers to particularly important papers or to available tools, as well as
> the community's thoughts on the topic.
>
> We are building a Spanish corpus so I am particularly interested in these
> topics from the Spanish perspective, though not confined to that.
>
> Regards,
> Kelly Vincent
> Software Engineer
> MetaMetrics, Inc.
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>