[Corpora-List] Horizontal spaces and sentence segmentation

Sat Jan 30 09:34:01 UTC 2010

Dear corpora users,
I'm looking for hints or recipes on segmenting plain text into
sentences. I know the topic is really down-to-earth and has been
discussed too many times, yet having skimmed a number of NLP/CL/IR
textbooks and dedicated articles, I still can't find any hint on an
inescapable problem: where to split sentences when there is no explicit
'dot' character (in European languages).

All the resources I've seen concentrate on abbreviations and deciding
when a dot is indeed a full stop. The problem appears when dealing
with plain text (or text left after removal of mark-up, which is usually
unreliable) with no explicit paragraph boundaries. For instance, a
heading followed by a paragraph or a list of items ends up as one huge
sentence (as there is no dot character in the input). This is
especially painful when dealing with text extracted from PDF files or
ill-formatted Wikipedia distillate. Some textbooks mention the general
problem of splitting text into paragraphs, yet the methods employed
seem too sophisticated at this level (e.g. referring to lexical
semantics).

Some implicit splitting on newlines is sometimes done (e.g. the NLTK
implementation of Punkt assumes that each newline is a sentence break).
I know the 'correct' behaviour depends on the expected formatting
of the input; nevertheless, I wonder if anyone has tried a systematic
approach to this issue. This would help me avoid reinventing the
wheel.
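
To make the problem concrete, here is a minimal sketch of one
layout-based heuristic, offered only as an illustration and not as a
tested solution. It decides for each newline whether it is a "hard"
break (end of a heading, list item or paragraph) or a "soft" wrap that
should be joined, so that an ordinary sentence splitter can then be run
on each resulting block. The names is_hard_break and split_blocks, the
bullet pattern and the wrap_width threshold are all assumptions made
for this example.

```python
import re

# Illustrative assumption: lines starting with -, *, a bullet, or "1." / "1)"
# are treated as list items.
BULLET = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+")


def is_hard_break(line, next_line, wrap_width=60):
    """Guess whether the newline after `line` ends a text unit."""
    if not line.strip() or not next_line.strip():
        return True  # a blank line (or end of input) is always a boundary
    if BULLET.match(next_line):
        return True  # the next line starts a list item
    # A line that stops well short of the wrap width was probably not
    # soft-wrapped, so it is likely a heading or the end of a paragraph.
    return len(line.rstrip()) < wrap_width


def split_blocks(text, wrap_width=60):
    """Join soft-wrapped lines and return a list of text blocks."""
    lines = text.splitlines()
    blocks, current = [], []
    for i, line in enumerate(lines):
        if line.strip():
            current.append(line.strip())
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        if current and is_hard_break(line, next_line, wrap_width):
            blocks.append(" ".join(current))
            current = []
    return blocks


sample = (
    "Sentence segmentation\n"
    "All the resources I have seen concentrate on abbreviations and on\n"
    "deciding when a dot is a full stop. Layout is rarely used.\n"
    "- first item\n"
    "- second item\n"
)

for block in split_blocks(sample):
    print(repr(block))
```

In this sketch the heading and the two list items come out as separate
blocks even though none of them ends with a dot, while the two wrapped
lines of the paragraph are joined. The fixed wrap_width is the weakest
point: in practice it would probably have to be estimated per document,
for example from the distribution of line lengths.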

Any suggestions, references or links to software will be appreciated.
And sorry for spamming if you consider this problem already solved or
too obvious (perhaps there can't be a systematic way and that's it).

Best,
Adam Radziszewski

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora