[Corpora-List] Sentence ambiguator/splitter summary
Staffan Hermansson
shend00 at student.vxu.se
Thu Jan 29 20:26:02 UTC 2004
Hello people. Here's a brief summary of the replies I've received. Some
people were nice enough to attach documents. I've located most of those
on the web for you.
Again, thank you for your support.
//Staffan
Applications:
A free CPAN Perl module for sentence splitting.
http://listserv.linguistlist.org/cgi-bin/wa?A2=ind0302&L=corpora&P=R5743
Shlomo Yona maintains another Perl-based sentence splitter.
http://cs.haifa.ac.il/~shlomo/
Earlier posts on this list (I might have missed some):
http://helmer.aksis.uib.no/corpora/1998-4/0026.html
http://helmer.aksis.uib.no/corpora/1999-3/0347.html
http://helmer.aksis.uib.no/corpora/2000-2/0225.html
http://helmer.aksis.uib.no/corpora/2003-1/0140.html
Reports:
Ghassan Mourad kindly sent me the following. Though I can't
read a word of French (thanks anyway), it might still be of interest.
Ghassan Mourad (1999)
La segmentation de textes par l'étude de la ponctuation
http://www.lalic.paris4.sorbonne.fr/articles/1998-1999/Mourad/CIDE99.pdf
Ghassan Mourad
La segmentation de textes par exploration contextuelle automatique,
présentation du module SegATex
Ghassan.Mourad at paris4.sorbonne.fr
Greg Grefenstette and Pasi Tapanainen. "What is a word, what is a
sentence? Problems of tokenization."
http://citeseer.nj.nec.com/grefenstette94what.html
Tibor Kiss and Jan Strunk
Scaled log likelihood ratios for the detection of abbreviations in text
corpora
http://www.linguistics.rub.de/~kiss/publications/abbrev.pdf
Tibor Kiss and Jan Strunk
Multilingual Least-Effort Sentence Boundary Disambiguation
http://www.linguistics.rub.de/~kiss/publications/publications.html#boundaries
Andrei Mikheev. "Text Segmentation." In R. Mitkov (ed.) Oxford Handbook
of Computational Linguistics, OUP, 2003.
Andrei Mikheev
Tagging Sentence Boundaries (2000)
http://citeseer.nj.nec.com/mikheev00tagging.html
Andrei Mikheev
Periods, Capitalized Words, etc. (1999)
http://citeseer.nj.nec.com/mikheev99periods.html
David D. Palmer (2000)
Tokenisation and Sentence Segmentation,
in Robert Dale, Hermann Moisl and Harold Somers (Eds),
Handbook of Natural Language Processing, Marcel Dekker
David D. Palmer and Marti A. Hearst,
Adaptive Multilingual Sentence Boundary Disambiguation
citeseer.nj.nec.com/palmer97adaptive.html
J. Reynar and A. Ratnaparkhi,
A Maximum Entropy Approach to Identifying Sentence Boundaries
citeseer.nj.nec.com/article/reynar97maximum.html