[Corpora-List] Sentence ambiguator/splitter summary

Staffan Hermansson shend00 at student.vxu.se
Thu Jan 29 20:26:02 UTC 2004


Hello people. Here's a brief summary of the replies I've received. Some
people were nice enough to attach documents; I've located most of those
on the web for you.

Again, thank you for your support.

//Staffan

Applications:
A free CPAN Perl module for sentence splitting.
http://listserv.linguistlist.org/cgi-bin/wa?A2=ind0302&L=corpora&P=R5743

Shlomo Yona maintains another Perl-based sentence splitter.
http://cs.haifa.ac.il/~shlomo/

Earlier posts on this list (I might have missed some):
http://helmer.aksis.uib.no/corpora/1998-4/0026.html
http://helmer.aksis.uib.no/corpora/1999-3/0347.html
http://helmer.aksis.uib.no/corpora/2000-2/0225.html
http://helmer.aksis.uib.no/corpora/2003-1/0140.html

Reports:

Ghassan Mourad was kind enough to send me the following. Though I can't
read a word of French (thanks anyway), they might still be of interest.

Ghassan Mourad (1999)
La segmentation de textes par l'étude de la ponctuation
http://www.lalic.paris4.sorbonne.fr/articles/1998-1999/Mourad/CIDE99.pdf

Ghassan Mourad
La segmentation de textes par exploration contextuelle automatique,
présentation du module SegATex
Ghassan.Mourad at paris4.sorbonne.fr

Greg Grefenstette and Pasi Tapanainen. "What is a word, what is a
sentence? Problems of tokenization."
http://citeseer.nj.nec.com/grefenstette94what.html

Tibor Kiss and Jan Strunk
Scaled log likelihood ratios for the detection of abbreviations in text
corpora
http://www.linguistics.rub.de/~kiss/publications/abbrev.pdf

Tibor Kiss and Jan Strunk
Multilingual Least-Effort Sentence Boundary Disambiguation
http://www.linguistics.rub.de/~kiss/publications/publications.html#boundaries

Andrei Mikheev. "Text Segmentation." In R. Mitkov (ed.) Oxford Handbook
of Computational Linguistics, OUP, 2003.

Andrei Mikheev
Tagging Sentence Boundaries (2000)
http://citeseer.nj.nec.com/mikheev00tagging.html

Andrei Mikheev
Periods, Capitalized Words, etc (1999)
http://citeseer.nj.nec.com/mikheev99periods.html

David D. Palmer (2000)
Tokenisation and Sentence Segmentation
In Robert Dale, Hermann Moisl and Harold Somers (eds.)
Handbook of Natural Language Processing, Marcel Dekker

David D. Palmer and Marti A. Hearst,
Adaptive Multilingual Sentence Boundary Disambiguation
http://citeseer.nj.nec.com/palmer97adaptive.html

J. Reynar and A. Ratnaparkhi,
A Maximum Entropy Approach to Identifying Sentence Boundaries
http://citeseer.nj.nec.com/article/reynar97maximum.html
