[Corpora-List] Horizontal spaces and sentence segmentation

Marcin Miłkowski list-address at wp.pl
Sun Jan 31 17:15:56 UTC 2010


Dear Adam,

I think in most cases it is assumed that a paragraph ends with either a
single newline or a run of several newlines. It's hard to decide which
assumption is best without seeing the data.

However, you could use heuristics similar to those used for dots in many
languages. For example, if the next token after a newline is a lowercase
word, you probably haven't got a new paragraph. If it starts with an
uppercase letter (and is not a proper name), then you probably have. There
should be many regularities to discover. I suppose you can either try to
train a segmenter (if you have a manually segmented corpus at hand) or
try a rule-based approach and see how it works on your data.
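
A minimal sketch of the rule-based variant, in Python, might look like the
following. It is only an illustration under stated assumptions: the function
names and the known_names set (a stand-in for a real proper-name lexicon) are
hypothetical, and treating lines that start with a digit or a bullet as new
blocks is my own guess rather than something the heuristic above prescribes.

```python
from typing import FrozenSet, List

PUNCT = ".,;:!?()\"'"


def starts_new_block(next_line: str, known_names: FrozenSet[str]) -> bool:
    """Guess whether the newline before next_line is a real boundary."""
    tokens = next_line.split()
    if not tokens:
        return True  # an empty line always counts as a boundary
    first = tokens[0]
    if first[0].islower():
        return False  # lowercase continuation: probably the same paragraph
    if first[0].isupper():
        # Uppercase start: a boundary unless the word is a known proper name.
        return first.strip(PUNCT) not in known_names
    # Assumption: digits, bullets and other symbols open a new block.
    return True


def segment(text: str, known_names: FrozenSet[str] = frozenset()) -> List[str]:
    """Split plain text into blocks using the newline heuristic."""
    blocks: List[str] = []
    current: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current block unconditionally.
            if current:
                blocks.append(" ".join(current))
                current = []
            continue
        if current and starts_new_block(line, known_names):
            blocks.append(" ".join(current))
            current = []
        current.append(line)
    if current:
        blocks.append(" ".join(current))
    return blocks


if __name__ == "__main__":
    sample = (
        "Sentence segmentation\n"
        "Most resources focus on abbreviations\n"
        "and on deciding when a dot is a full stop."
    )
    for block in segment(sample):
        print(repr(block))
```

On real data the proper-name check is likely to be the weak spot, since a
line-initial name is indistinguishable from a paragraph start without a
lexicon, so it is worth inspecting the errors before adding more rules.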

Hope this helps,
Marcin Miłkowski

On 2010-01-30 10:34, Adam Radziszewski wrote:
> Dear corpora users,
> I'm looking for hints or recipes on segmentation of plain text into
> sentences. I know the topic is really down-to-earth and has been
> discussed too many times, yet having skimmed a number of NLP/CL/IR
> textbooks and dedicated articles I still can't find any hint on an
> inescapable problem: when to split sentences if there is no explicit
> 'dot' character (in European languages).
>
> All the resources I've seen concentrate on abbreviations and deciding
> when a dot is indeed a full stop. The problem appears when dealing
> with plain text (or after removal of mark-up which is usually
> unreliable) with no explicit paragraph boundaries. For instance a
> heading followed by a paragraph or a list of items ends up as a huge
> sentence (as there was no dot character in the input). This is
> especially painful when dealing with text extracted from PDF files or
> ill-formatted Wikipedia distillate. Some textbooks mention the general
> problem of splitting text into paragraphs, yet the methods employed
> seem too sophisticated at this level (e.g. referring to lexical
> semantics).
>
> Some implicit splitting on newlines is sometimes done (e.g. the NLTK
> implementation of Punkt assumes that each newline is a sentence break).
> I know the `correct' behaviour is dependent on the expected formatting
> of input, nevertheless I wonder if anyone has tried a systematic
> approach to this issue. This would help me avoid reinventing the
> wheel.
>
> Any suggestions, references or links to software will be appreciated.
> And sorry for spamming if you consider this problem already solved or
> too obvious (perhaps there can't be a systematic way and that's it).
>
> Best,
> Adam Radziszewski
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

